2026-03-07 · AI & Agents
Character Design for LLMs: What Screenwriters, Psychologists, and Roleplayers Know That Engineers Don't — March 2026
Most LLM character prompts are bad. Not "could be better" bad — structurally misconceived. They list traits like a database schema, dump backstory like a wiki article, and wonder why the character feels flat after three turns.
Meanwhile, screenwriters have spent a century figuring out what makes a character feel real. Psychologists have frameworks for modeling personality that actually predict behavior. And the roleplay community — people building character cards on SillyTavern and Character.AI — have run more A/B tests on character consistency than any research lab, even if they'd never call it that.
This is a snapshot from March 2026, written from the perspective of building persistent character systems on Orca (our homelab AI platform). We maintain character bibles for real people and AI personas, and we've iterated on what actually holds up across long conversations vs. what falls apart after the first context window rotation. This is what we've found.
The Landscape
Character representation for LLMs draws from five domains, and they barely talk to each other:
- Screenwriting craft — McKee, Truby, Snyder. A century of figuring out what makes fictional people feel real on screen. Their core technology is contradiction under pressure.
- Psychology frameworks — Big Five, Enneagram, attachment theory. Models of personality that predict behavior, not just describe it. Some transfer to LLMs surprisingly well; others are worse than useless.
- Roleplay community — Character.AI, SillyTavern, TavernAI users. Thousands of people iterating daily on "how do I make this character stay consistent across 200 messages." The most empirical group on this list.
- Real-person representation — Personal knowledge bases, AI companions, digital twins. The problem of making an LLM understand and represent an actual human being, not a fictional construct.
- Intimate/ERP design — Characters built for emotional and physical intimacy. The hardest consistency test there is, because it demands register continuity (the character in conversation must feel like the same person in a different mode, not a different person).
The interesting thing is convergence. Despite different vocabularies and zero cross-pollination, these domains independently arrived at the same core insight: a character is not a list of traits. A character is a pattern of contradictions under pressure, expressed through specific behavioral tendencies.
Screenwriters call these "dimensions." Psychologists call them "trait interactions." Roleplayers figured it out through trial and error. But they all point at the same thing.
What Works
Contradictions Are the Engine
McKee defines character "dimensions" as internal contradictions that surface under pressure, and counts them: Tony Soprano has 12 (violent but tender with family, strategic but impulsive when disrespected, contemptuous of therapy but desperately dependent on it); Walter White has 16.
For LLMs, this is gold. Give a model "cold but secretly desperate for approval" and it will produce more interesting, more consistent behavior than "cold" alone. The contradiction gives the model something to navigate, which produces the kind of emergent complexity that reads as depth.
Truby's desire/need split is the minimum viable version of this: what the character consciously wants vs. what they actually need. A character who wants control but needs vulnerability will behave in recognizably human ways — overreacting to small threats, dismissing genuine offers of help, occasionally cracking in ways that surprise even them.
The practical move: Every character should have 2-4 internal contradictions formatted as "X but Y" statements. These belong in Tier 2 of your character document, after voice and examples.
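As a sketch of how that might be encoded, here's one possible structure (the class and field names are ours, purely illustrative, not a standard card format):

```python
from dataclasses import dataclass

# A contradiction is a surface trait, the tension undercutting it, and the
# pressure that exposes it: the "X but Y" unit the screenwriting sources describe.
@dataclass
class Contradiction:
    surface: str  # what the character presents ("cold")
    tension: str  # what pushes against it ("secretly desperate for approval")
    trigger: str  # the pressure that surfaces the tension

def render_contradictions(contradictions: list[Contradiction]) -> str:
    """Format contradictions as 'X but Y' lines for Tier 2 of a character doc."""
    lines = [f"- {c.surface} but {c.tension} (surfaces when: {c.trigger})"
             for c in contradictions]
    return "Internal tensions:\n" + "\n".join(lines)

print(render_contradictions([
    Contradiction("cold", "secretly desperate for approval",
                  "someone withholds praise she expected"),
]))
```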
The First Message Is Everything
This is the single most replicated finding from the roleplay community: the opening message (greeting, first response) does more to establish character consistency than any amount of trait description.
The first message sets tone, sentence length, vocabulary level, emotional register, and interaction style. The model treats it as the strongest example of "how this character communicates." A 500-word character description with a generic first message will underperform a 100-word description with a carefully crafted opening.
This maps to Ali:Chat format's core insight: behavior demonstrated is stronger than behavior declared. Two or three example exchanges showing the character in different emotional states (calm, angry, vulnerable) anchor consistency more reliably than a paragraph of personality description.
The practical move: Write your opening message first. Then write 2-3 example exchanges of 5-8 lines each, covering different emotional registers. Then write the rest of the character document. Not the other way around.
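Here's a minimal lint pass for that heuristic, assuming a simple dict-based card with hypothetical field names (first_message, example_exchanges, description); the 4x threshold is an arbitrary illustration, not a measured constant:

```python
def lint_card(card: dict) -> list[str]:
    """Check a character card against the write-order heuristic: the opening
    message and example exchanges should carry the weight, not the prose."""
    warnings = []
    greeting = card.get("first_message", "")
    examples = card.get("example_exchanges", [])
    description = card.get("description", "")

    if not greeting:
        warnings.append("No first message: the model has no voice sample to pattern on.")
    if len(examples) < 2:
        warnings.append("Fewer than 2 example exchanges: add ones covering "
                        "different emotional registers (calm, angry, vulnerable).")
    # Arbitrary ratio: flag cards where declared traits dwarf demonstrated behavior.
    if len(description) > 4 * (len(greeting) + sum(len(e) for e in examples)):
        warnings.append("Description dwarfs the demonstrated behavior: "
                        "declared traits decay faster than demonstrated ones.")
    return warnings
```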
Psychology Frameworks (When Used Right)
Not all personality frameworks transfer equally to LLMs.
Big Five (high transfer): The dimensional model — Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism. Works well when expressed as behavioral descriptions, not labels. "High openness" is useless. "Fascinated by new ideas, rearranges her entire weekend when she discovers something interesting, has three half-finished projects for every completed one" gives the model something to work with.
The PersonaLLM paper (Jiang et al., 2024) showed that LLMs prompted with Big Five profiles produce measurably consistent personality — humans identified the intended personality with roughly 80% accuracy. [Confidence: high — this paper is widely cited and the methodology is straightforward.]
Enneagram (high transfer): The motivation-based system. Each type has a core motivation, a core fear, and distinct patterns under stress vs. growth. This maps almost directly to a character consistency engine. An Enneagram Four (core fear: having no identity or significance) under stress moves toward Two behavior (people-pleasing, losing themselves in others' needs). This kind of predictable-but-not-obvious behavioral shift is exactly what makes characters feel real over long conversations. (There's a code sketch of this movement after the framework rundown below.)
Attachment styles (high transfer for relational characters): Secure, anxious, avoidant, disorganized. If your character has close relationships — romantic, familial, or otherwise — attachment style is one of the highest-leverage single additions you can make. An avoidant character doesn't just "not like commitment" — they withdraw when things get close, rationalize emotional distance as independence, and feel relief mixed with loss when someone stops trying to reach them.
MBTI (low transfer): Triggers stereotypes. "INTJ" doesn't give the model behavioral predictions — it gives it a caricature. Skip it.
Love Languages (low transfer): Surface-level affection style. Not useless, but low leverage compared to the frameworks above.
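Two of the high-transfer frameworks are mechanical enough to sketch in code. The structure and behavior strings below are ours and purely illustrative; the stress arrows follow the Enneagram Institute's standard numbering:

```python
# Big Five: map a facet level to authored behavior, never to the label itself.
# The model only ever sees the right-hand side.
BIG_FIVE_BEHAVIORS = {
    ("openness", "high"):
        "Fascinated by new ideas; rearranges her entire weekend when she "
        "discovers something interesting; three half-finished projects for "
        "every completed one.",
    ("neuroticism", "low"):
        "Bad news lands and she starts listing options; people mistake her "
        "calm for not caring.",
}

# Enneagram: standard stress arrows. Under sustained stress, each type
# shifts toward the average behavior of the target type.
STRESS_MOVE = {1: 4, 2: 8, 3: 9, 4: 2, 5: 7, 6: 3, 7: 1, 8: 5, 9: 6}
STRESS_BEHAVIOR = {
    2: "people-pleasing, losing themselves in others' needs",
    # ...authored per type as needed
}

def behavior_section(profile, enneagram_type):
    """Render the behavioral-calibration chunk of a character doc."""
    lines = [BIG_FIVE_BEHAVIORS[key] for key in profile]
    target = STRESS_MOVE[enneagram_type]
    shift = STRESS_BEHAVIOR.get(target, f"average type-{target} behavior")
    lines.append(f"Under sustained stress, shifts toward {shift} "
                 f"(type {enneagram_type} -> {target} movement).")
    return "\n".join(lines)

# The Four-under-stress example from above:
print(behavior_section([("openness", "high"), ("neuroticism", "low")], 4))
```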
Values Over Demographics
For representing real people (not fictional characters), the key finding from our own work is that values beat demographics every time.
What actually matters:
- The 5-10 things they organize their life around — operational values, not aspirational ones. Not "I value honesty" but "he'll burn a relationship before he'll tolerate being lied to."
- Communication patterns — how they talk, what they avoid saying, how they handle conflict. Someone who goes quiet when angry is fundamentally different from someone who escalates.
- Current concerns — what's actively on their mind. This changes and needs updating.
- The gap between stated values and actual behavior — this is where the real person lives, and it's where the model needs the most help.
What does not matter: age, location, job title, zodiac sign, physical description (unless it drives behavior). We've seen character documents that are 80% demographic data and 20% personality, and they consistently produce flat, generic interactions.
Register Continuity in Intimate Design
The hardest test for character consistency is intimate scenarios — not because of the content, but because the character has to feel like the same person across radically different emotional registers.
The roleplay and ERP communities have converged on several findings here:
- Dynamic over checklist. A relational pattern (dominant/playful/possessive/tender) generates more consistent and interesting behavior than a list of preferences. "The dynamic has teeth — dominance, submission, play, tension. It shifts" outperforms a 50-item menu every time.
- Boundaries as character expression. What a character won't do reveals character. A character with specific, motivated boundaries feels more real than one without limits.
- Emotional texture around physical acts. How intimacy feels to the character — possessive? vulnerable? playful? reverent? — is the consistency lever, not mechanical description.
- The person in bed has to be recognizable as the person at dinner. Register continuity means the voice, the worldview, the contradictions all carry across contexts. The worst character cards are the ones where intimate mode activates a completely different personality.
What Doesn't Work
Trait Lists Without Behavior
"Personality: kind, brave, intelligent, stubborn." This is the default format for most LLM character prompts, and it's nearly worthless. These are labels, not behaviors. The model can't derive "how does a kind-but-stubborn person respond when asked to compromise on something they care about?" from four adjectives.
Every trait needs grounding: either a behavioral description ("she'll apologize first even when she's right, but she won't change her position") or a demonstrated example. Ungrounded traits decay to stereotypes within a few turns.
Negation and "Hates" Fields
LLMs are bad at negation. "Does NOT like small talk" often results in a character who mentions disliking small talk... while making small talk. "Never swears" produces a character who swears and then apologizes for it. Or one who talks about not swearing.
Reframe negatives as positives: "prefers to skip pleasantries and get to the point" works. "Does not like pleasantries" doesn't.
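A quick way to enforce that rule is a lint pass over the character text before it ships. A sketch, with illustrative and deliberately non-exhaustive patterns:

```python
import re

# Phrasings that tend to backfire in character prompts. Illustrative, not exhaustive.
NEGATION_PATTERNS = [
    r"\bdoes\s*n[o']t\b", r"\bnever\b", r"\bwon't\b", r"\bhates?\b",
    r"\bdislikes?\b", r"\bavoid(s)?\b", r"\bno\s+\w+ing\b",
]

def flag_negations(character_text: str) -> list[str]:
    """Return lines that declare behavior by negation; each flagged line
    is a candidate for reframing as a positive tendency."""
    flagged = []
    for line in character_text.splitlines():
        if any(re.search(p, line, re.IGNORECASE) for p in NEGATION_PATTERNS):
            flagged.append(line.strip())
    return flagged

card = "Witty and direct.\nDoes NOT like small talk.\nNever swears."
for line in flag_negations(card):
    print("reframe:", line)
```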
Timeline-Heavy Backstory
Backstory matters for meaning, not for chronology. "Grew up poor, which is why she hoards resources and panics about money even when she has enough" gives the model a behavioral engine. A 2,000-word timeline of where she lived and went to school gives it nothing actionable.
Backstory should be short and causal: this happened, and as a result, this pattern exists now. Everything else is noise.
Demographic Data as Personality
Age, height, hair color, blood type, zodiac sign — unless these drive behavior, they consume tokens without producing consistency. A surprising amount of character card real estate goes to physical description that never influences how the character speaks or acts.
Over-specified Characters
A character document can be too complete. When every possible response is pre-scripted, the model has no room to navigate contradictions — and it's the navigation that produces the feeling of a real personality. Leave gaps. Let the contradictions generate emergent behavior.
Key Patterns
The Tiered Character Document
Organize by impact on LLM consistency, not by narrative logic. Our synthesis across all five domains:
Tier 1 — Voice (highest impact on consistency)
- Speech patterns, vocabulary constraints, sentence structure
- Example exchanges showing different emotional states (2-3 examples, 5-8 lines each)
- Opening/greeting that establishes tone and length
Tier 2 — Character Engine
- Desire: what they consciously want
- Fear: what they're avoiding or running from
- Contradictions: 2-4 internal tensions as "X but Y"
- Edges: blind spots, worst tendencies, things they can't see about themselves
Tier 3 — Relational
- Key relationships as dynamics, not contact lists ("controls his mother, defers to his sister" not "mother: Carol, sister: Lisa")
- Attachment pattern (secure/anxious/avoidant/disorganized)
- Intimacy style as emotional texture, not a menu
Tier 4 — Behavioral Calibration
- Big Five as behavioral descriptions (not labels)
- Operational values (not aspirational ones)
- Thinking style (intuitive vs. analytical, verbal vs. spatial, etc.)
Tier 5 — Context (add as needed)
- Identity facts (minimal — name, role, situation)
- Backstory as meaning, not timeline
- Current emotional/situational state
- Environment and sensory details
The tiers are ordered by how much they contribute to the model maintaining consistent character across a long conversation. Voice is first because it's what the model actually patterns on. Identity facts are last because they contribute the least to behavioral consistency.
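The tier ordering also lends itself to a budget-aware compile step that drops material from the bottom when space runs out. A sketch; everything here (names, budget, the word-count stand-in for real tokenization) is illustrative:

```python
# Tiers in descending impact on consistency; compilation emits them in this
# order so the highest-leverage material (voice) leads and identity facts trail.
TIER_ORDER = ["voice", "engine", "relational", "calibration", "context"]

def compile_card(tiers: dict[str, list[str]], token_budget: int = 1500) -> str:
    """Assemble a character prompt tier by tier, dropping from the bottom
    (context first) when the budget runs out. Whitespace word count stands
    in for tokens here; swap in a real tokenizer in practice."""
    sections, used = [], 0
    for tier in TIER_ORDER:
        body = "\n".join(tiers.get(tier, []))
        cost = len(body.split())
        if used + cost > token_budget:
            break  # lower tiers are the first casualties of a tight budget
        if body:
            sections.append(body)
            used += cost
    return "\n\n".join(sections)
```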
Cast Polarization
This is McKee's first principle and it transfers directly to multi-character LLM scenarios: characters exist in contrast to each other, never in isolation. Each major character should represent a different answer to the same question.
If you're building a cast, define the central tension first ("what does it mean to be loyal?"), then make each character embody a different response. One character is loyal to people. Another is loyal to principles. Another is loyal to themselves. The interactions write themselves because the characters have genuine, motivated disagreements.
This is also practical advice for character card economy: if you have three characters, you don't need three complete descriptions. You need one detailed description and two descriptions that focus on how they differ from each other.
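A sketch of that economy, assuming a shared-question structure of our own invention (the names and answer strings are placeholders):

```python
cast = {
    "question": "What does it mean to be loyal?",
    "answers": {
        "Mira": "loyal to people: she'll cover for a friend even when they're wrong",
        "Dekker": "loyal to principles: reports the friend, loses sleep over it",
        "Ash": "loyal to themselves: loyalty is a transaction, kept while it pays",
    },
}

def contrast_block(cast: dict, name: str) -> str:
    """Build a secondary character's doc section as contrast against the cast,
    rather than as a second full description."""
    others = [f"Unlike {other}, who is {ans}."
              for other, ans in cast["answers"].items() if other != name]
    return f"{name} is {cast['answers'][name]}.\n" + "\n".join(others)

print(contrast_block(cast, "Dekker"))
```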
Hybrid Format: Facts as Structure, Personality as Prose
The roleplay community landed on this through iteration: structured formats (key-value pairs, lists) work for factual information. Prose works for personality. Mixing them outperforms either alone.
```
Name: Mira
Age: 34
Role: Ship engineer

Mira talks like she's explaining something to someone slightly less
intelligent than her, but she doesn't mean it unkindly — it's just the
only register she has. She'll fix your engine and insult your maintenance
schedule in the same breath. Gets quiet when she's scared, which
is the only way you'd know she's scared, because she won't say it.
She wanted to be an academic. She wound up here. She hasn't made
peace with that yet.
```
The structured facts anchor retrieval. The prose gives the model a voice to pattern-match on. Trait ordering matters — traits at the end of a list or description get more weight due to recency bias in attention mechanisms.
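If you assemble the structured half programmatically, the recency point suggests sorting traits so the highest-priority ones land last. A tiny sketch (the trait names and priority scheme are ours):

```python
def order_traits(traits: dict[str, int]) -> list[str]:
    """Sort traits ascending by priority so the ones that matter most
    appear last, where recency bias gives them the most weight."""
    return [t for t, _ in sorted(traits.items(), key=lambda kv: kv[1])]

traits = {"dry humor": 2, "competence": 1, "hidden fear of irrelevance": 3}
print(", ".join(order_traits(traits)))
# -> competence, dry humor, hidden fear of irrelevance
```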
Our Take
We run character systems on Orca, our homelab AI platform. We maintain character bibles for real people (the project sponsor, collaborators) and for AI personas. The real-person bibles have iterated through multiple structures over months of daily use. Here's where we landed:
Character bibles are organized as topic files (worldview, relationships, goals, history, work), not as monolithic documents. A compiled "static" version gets injected into context as needed. The full bibles are too large for a single context window — they're reference material, not prompts. The compilation step forces us to decide what actually matters for a given interaction, which is itself a useful discipline.
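A sketch of that compile step, with a hypothetical file layout (topic filenames mirror the list above; the word budget stands in for real token counting):

```python
from pathlib import Path

TOPICS = ["worldview", "relationships", "goals", "history", "work"]

def compile_static(bible_dir: Path, relevant: set[str], word_budget: int = 800) -> str:
    """Compile a 'static' context block from topic files, keeping only the
    topics relevant to this interaction and trimming to a rough budget."""
    parts, used = [], 0
    for topic in TOPICS:
        if topic not in relevant:
            continue
        path = bible_dir / f"{topic}.md"
        if not path.exists():
            continue
        words = path.read_text().split()
        take = words[: max(0, word_budget - used)]
        if take:
            parts.append(f"## {topic}\n" + " ".join(take))
            used += len(take)
    return "\n\n".join(parts)
```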
The biggest lesson from production use: update frequency matters more than initial quality. A character document that gets revised weekly based on real interactions will outperform a meticulously crafted document that's never updated. Current concerns, recent experiences, and evolving relationships are the difference between a character that feels alive and one that feels preserved in amber.
We'd change our approach if models got significantly larger context windows with better attention across the full window. Right now, the tiered structure exists partly because of attention decay — the model pays more attention to recent/proximate content. If that constraint relaxes, we'd flatten the tiers and include more raw material. But we're not holding our breath.
Summary Table
| Domain | Key Contribution | Highest-Leverage Technique | Common Mistake |
|---|---|---|---|
| Screenwriting | Contradiction as depth | "X but Y" internal tensions | Backstory without behavioral payoff |
| Psychology | Behavioral prediction | Big Five as descriptions, Enneagram as motivation engine | MBTI stereotypes |
| Roleplay community | Empirical consistency testing | First message + example exchanges | Trait lists without grounding |
| Real-person representation | Operational values | Communication patterns + current concerns | Demographics as personality |
| Intimate design | Register continuity | Dynamic over checklist | Separate personality for separate modes |
Sources
- Robert McKee, Story (1997) — Foundational text on character dimensions as contradictions. The "how many dimensions" counting method comes from McKee's masterclass materials.
- John Truby, The Anatomy of Story (2007) — Desire vs. need distinction, moral argument through cast design.
- Jiang et al., "PersonaLLM: Investigating the Ability of Large Language Models to Express Personality Traits" (2024) — Big Five personality expression in LLMs. [Confidence: high — peer-reviewed, widely cited, methodology is reproducible.]
- SillyTavern documentation and community wikis — Primary source for Ali:Chat format, PList effectiveness, first-message consistency findings. Community-sourced, not peer-reviewed, but empirically tested across thousands of users.
- Character.AI community guides — Secondary source for greeting/first-message impact on consistency.
- Enneagram Institute (enneagraminstitute.com) — Reference for core motivations, stress/growth patterns by type.
A note on confidence: The screenwriting principles (McKee, Truby) are well-established craft knowledge. The psychology frameworks (Big Five, attachment theory) are backed by decades of research, though their transfer to LLMs is less studied — the PersonaLLM paper is one of the few rigorous attempts. The roleplay community findings are empirical but not peer-reviewed; they represent collective trial-and-error across a large user base. Our own production findings are n=1 (one team, one platform, specific use cases). Take the specificity as a feature, not a bug — a dated opinion from a specific vantage point is more useful than a generic overview.
Last updated: March 7, 2026. Written from production experience building character systems on a homelab AI platform. The research was designed to inform our own character bible architecture; we're publishing it because we couldn't find anything like it when we went looking.