Sometimes I Hate VO

Abstract

Problem: Voice acting is universally praised by players, but what problems does it actually cause for game designers, programmers, and producers?

Approach: Tim Cain draws on decades of experience shipping VO-heavy RPGs (Fallout, Arcanum, The Outer Worlds) to catalog the concrete design and production costs of voice-over.

Findings: VO inflates disk space and budgets, locks dialogue earlier in production, and, most critically, kills reactive dialogue systems. Generated dialogue banks, constructed sentences, emotion-based line variants, and health-state modifiers all become prohibitively expensive or impossible once every line must be recorded by a human actor. AI VO could theoretically solve every one of these problems but currently falls short in quality and raises workforce displacement concerns.

Key insight: The real cost of VO isn't money or disk space; it's the reactivity you silently lose. Every voiced line is a line that can't cheaply branch, emote, or adapt to player state.

Source: https://www.youtube.com/watch?v=ETCLEt-PEPM

VO Elevated Fallout, But It's Complicated

Tim opens by acknowledging that voice-over can genuinely elevate games. He points to the Fallout intro ("War… war never changes") as a case where Ron Perlman took his writing and turned it into something iconic β€” still referenced 27+ years later, reused in the TV show, and echoed across the franchise. He doesn't think his prose alone was a "perfect diamond"; Perlman's performance elevated it beyond what text could achieve.

But that acknowledgment is the setup for the real thesis: VO causes enormous problems that nobody outside the industry talks about.

The Problems VO Causes

Disk Space (The Old Problem)

Back in the floppy era, VO simply couldn't fit: 1.4 MB per disk left no room. CDs (700 MB) unlocked VO, but even today voice files contribute meaningfully to bloated download sizes alongside high-res textures.

Time and Money (The Permanent Problem)

The VO pipeline is expensive and slow:

  • Hire voice actors
  • Record in a professional studio
  • Chop recordings into individually tagged lines
  • Insert into a sound database the game engine can reference
  • Run lip-sync processing (automated, manual cleanup, or often both)
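
As a rough illustration of the "tagged lines in a sound database" step, here is a minimal Python sketch. The class names, fields, line IDs, and file path are all hypothetical and are not taken from any of Tim's games; the point is only that the engine looks a line up by its tag and falls back to text when no recording exists.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class VoiceClip:
    """One chopped, tagged recording (all fields are illustrative)."""
    line_id: str      # tag the dialogue script uses to refer to this line
    speaker: str      # which NPC voice recorded it
    audio_path: str   # hypothetical location of the audio file


class SoundDatabase:
    """Maps dialogue line IDs to recorded clips."""

    def __init__(self) -> None:
        self._clips: Dict[str, VoiceClip] = {}

    def add(self, clip: VoiceClip) -> None:
        self._clips[clip.line_id] = clip

    def lookup(self, line_id: str) -> Optional[VoiceClip]:
        # None means the line has no recording and is shown as text only.
        return self._clips.get(line_id)


db = SoundDatabase()
db.add(VoiceClip("elder_001", "npc_elder", "vo/elder_001.wav"))
```

Every line that should be heard needs its own entry here, which is why full VO for every NPC multiplies the work of each step above.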

Full VO for every NPC multiplies these costs dramatically. Tim notes that many of his games gave VO only to important characters, with everyone else text-only, "primarily because of money."

Dialogue Lock (The Production Killer)

Text-only dialogue can be changed up until the day before shipping. Voiced dialogue cannot: changing a line means getting the actor back in the studio, which costs time and money you may not have.

Tim shares a specific Fallout story: a voiced NPC gave the player directions to a location, but the location changed late in development. They couldn't re-record the VO, so they had an unvoiced NPC run over afterward to say "we just got new evidence: go south instead of north." A hack born purely from VO inflexibility.

The Big One: Reactivity Dies

This is Tim's core grievance. He loves reactive dialogue (systems where NPCs respond differently based on game state), and VO makes most of his favorite techniques impossible or prohibitively expensive.

Generated Dialogue Banks

In Arcanum, Tim built a system of "op codes" pointing to banks of phrases (greetings, goodbyes, barter offers, spell purchases, healing requests). This let designers create reactive NPCs quickly. But if those NPCs are voiced, every voice actor must record every line in every bank. The combinatorial cost explodes.
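
A minimal Python sketch of the idea, not Arcanum's actual implementation: the op code names, the phrases, and both helper functions are invented for illustration. It shows an op code resolving to a phrase from a bank, and how the recording count scales once those banks must be voiced.

```python
import random

# Hypothetical phrase banks keyed by op code (illustrative data only).
BANKS = {
    "GREETING": ["Good day.", "Well met, traveler.", "Hello there."],
    "GOODBYE": ["Farewell.", "Safe travels."],
    "BARTER": ["Care to see my wares?", "I have goods for sale."],
}


def resolve(op_code, rng=random):
    """Return one phrase from the bank that the op code points to."""
    return rng.choice(BANKS[op_code])


def recordings_needed(banks, voice_count):
    """Each distinct voice must record every line in every bank."""
    lines_per_voice = sum(len(lines) for lines in banks.values())
    return lines_per_voice * voice_count
```

With these toy banks (7 lines in total), `recordings_needed(BANKS, 50)` gives 350 recordings for 50 distinct voices. As text, the same banks cost nothing extra per NPC.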

Constructed Sentences

Arcanum also assembled sentences dynamically by pulling fragments from different banks: inserting character names, combining clauses. You simply can't do this with VO. Games that try (pre-recording a set of player names) fail the moment you pick a name they didn't record, falling back to generic titles like "the Chosen One."
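
The contrast can be sketched in a few lines of Python. The template, the set of recorded names, and the function names are assumptions made up for this example; only the "the Chosen One" fallback comes from the text above.

```python
# Hypothetical set of player names an actor pre-recorded.
RECORDED_NAMES = {"Alex", "Morgan", "Sam"}
FALLBACK_TITLE = "the Chosen One"


def build_text_line(player_name, item):
    """Text can splice any fragment into the sentence at no extra cost."""
    return f"Welcome back, {player_name}. Still looking for {item}?"


def spoken_name(player_name):
    """Voiced lines can only use names that were actually recorded."""
    if player_name in RECORDED_NAMES:
        return player_name
    return FALLBACK_TITLE
```

Here `build_text_line("Xanthippe", "a sword")` produces a correct sentence for any name, while `spoken_name("Xanthippe")` has to fall back to the generic title.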

Emotion-Based Line Variants

Tim's dream: the same line delivered differently based on NPC emotional state. A shopkeeper's "Do you see anything you like?" sounds completely different when the NPC hates you versus is happy to see you. One text line becomes three (or more) VO recordings.

Health-State Modifiers

A wounded NPC should sound wounded: "Oh… it's good to see you again" versus a healthy delivery. And this is multiplicative with emotion: four reaction levels × three health levels = twelve VO recordings for a single line of text.
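
The multiplication can be made concrete with a short Python sketch. The text gives only the counts (four reaction levels, three health levels); the level labels, the line ID format, and the function names are assumptions for illustration.

```python
from itertools import product

# Assumed labels; the source specifies only how many levels there are.
REACTIONS = ["hostile", "cold", "neutral", "friendly"]
HEALTH_STATES = ["healthy", "wounded", "near_death"]


def variants_for(line_id):
    """List every recording one line of text would need."""
    return [f"{line_id}.{reaction}.{health}"
            for reaction, health in product(REACTIONS, HEALTH_STATES)]


def total_recordings(line_count):
    """Recordings needed if every line gets every variant."""
    return line_count * len(REACTIONS) * len(HEALTH_STATES)
```

`variants_for("shop_greeting")` returns twelve entries, and a script of 1,000 such lines would need `total_recordings(1000)`, which is 12,000 recordings, per voice.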

Player VO Constraints

When the player character is voiced, you're either locked to one voice (regardless of character appearance/personality) or limited to a small set of grunts, greetings, and hit reactions while actual dialogue stays unvoiced.

AI VO: The Theoretical Fix

Tim lists AI voice-over as a potential solution that addresses every problem he raised:

  • Generated on the fly: no disk storage needed
  • Emotion and health variants: trivially parameterized
  • Generated dialogue and constructed sentences: works naturally
  • Player names: just synthesize them

But Two Problems Remain

  1. Quality gap: AI VO is noticeably worse than human performance. Tim compares it to bad AI-narrated audiobooks: wrong word pronunciation, wrong intonation. He doesn't know how long this gap will persist.
  2. Workforce displacement: Choosing AI VO puts human actors out of work. Tim contextualizes this alongside faster compilers and better 3D art tools that already reduced team sizes (AI is in that same category of productivity tools), but he doesn't dismiss the concern.

Tim's Actual Preference

His preferred design choice: no VO at all. It removes the constraints on reactivity, which he values above all else.

But he acknowledges the market reality:

  • Many players prefer listening to reading
  • Players now expect VO and won't buy "silent" games
  • Not having VO also puts voice actors out of work

He offers no tidy solution. The video ends where it began: "Sometimes I hate VO, and that's why."
