8-frame coherence 2×2 matrix

Same orchestrator + image config, varying only scene-diversity and number of identity refs. All four tests use model: gpt-5.4-mini, reasoning.effort: high, tools: [{type: "image_generation", quality: "low", size: "1024x1024"}], max_tool_calls: 8, parallel_tool_calls: true.

Character A

character A reference female_asian

Character B

character B reference male_asian

Pinned variables

Endpoint
POST https://api.openai.com/v1/responses
Orchestrator
gpt-5.4-mini · reasoning.effort: "high"
Image tool
gpt-image-2 · quality: "low" · size: "1024x1024"
Tool budget
max_tool_calls: 8 · parallel_tool_calls: true
Est. cost / test
~$0.07 (8 × $0.006 image + ~$0.025 text)
Matrix total cost
~$0.30

The matrix

Single ref (A only)Multi ref (A + B)
Diverse scenes (8 locations, one day)Test ATest B
Same scene (one kitchen, 8 moments)Test CTest D

What each test isolates. A — single-character coherence across diverse locations (baseline). B — multi-character ensemble + diverse locations (hardest combo). Clocation coherence across 8 angles in one kitchen. D — joint identity + location lock with two people.

Test A — diverse scenes, single ref

A: diverse + single awaiting outputs…

One person (character A), eight chronological beats across one day in 8 different locations: dawn bedroom → coffee → train → cafe → park → reading window → cooking → balcony dusk.

Test B — diverse scenes, multi ref

B: diverse + multi awaiting outputs…

Two people (A + B as an ensemble) across eight chronological beats in eight different locations of one shared day. Tests whether both identities lock independently across diverse rooms.

Test C — same scene, single ref

C: same + single awaiting outputs…

One person, ALL EIGHT beats in the SAME single location (a small home kitchen) at different moments / camera angles. Tests location coherence — room layout, cabinet placement, window position, lighting register, palette — across the set.

Test D — same scene, multi ref

D: same + multi awaiting outputs…

Two people, ALL EIGHT beats in the same small home kitchen. The most demanding combination: two distinct identities AND a single location must both stay consistent across all 8 panels.

Test E — same scene, single ref + LOCATION BIBLE

E: same + single + bible 7231-char prompt · 992 reasoning tokens · 413 s · ~$0.08

Follow-up to Test C's location drift. Same model, same image-tool config, same 8 beats — only the prompt changes. Adds: a 400-word LOCATION BIBLE (terracotta floor, sage-green walls + lower cabinets, cream uppers, brass knobs, blue-and-white delft backsplash, butcher-block counter, six-pane window with countryside + leaning fence post, white farmhouse sink, cream gas stove, copper pot rack with 3 pots, oak table with yellow vase of white daisies + stack of red/navy/cream books + sage-green mug), a PRESERVE LIST naming 15 invariants, and "the SAME X" re-anchors in every beat.

Result: location locks. The kitchen is visibly the same room in all 8 panels — same floor, same cabinets, same delft backsplash, same window-and-countryside view, same yellow-vase-with-daisies on the table. Cost: ~$0.08 (same as Test C); only the prompt did the work.


Findings

All 4 tests succeeded — 32 images, ~$0.30 total spend. Wall-clock per test: ~5 min. Reasoning fired on every call (328 / 547 / 590 / 872 reasoning tokens respectively — complexity scales with prompt difficulty).

1. Identity lock holds in every cell of the matrix

Single ref (A, C) and multi ref (B, D) both produce visibly consistent faces across all 8 beats. With two photos attached, the model correctly anchors each character to its respective reference: the woman matches photo 1's face / hair / age throughout, the man matches photo 2's. They don't swap. They don't blend. The "FIRST photo / SECOND photo" naming convention in the prompt is honored. Multi-character ensembles are production-viable on this surface.

2. Location coherence is the weakness — same-scene drifts

Tests C and D were the location-lock stress test: "all 8 in the SAME single location (a small home kitchen)". The model rendered a small home kitchen in every panel — but not the SAME kitchen. Cabinet color, wall color, window placement, floor pattern, layout, and props all vary panel-to-panel. Beat 1's wooden-table-by-window doesn't match beat 4's white-cabinet kitchen with green door; beat 8 closes in on yet another room.

Why it fails. The thinking step decomposes the prompt into 8 separate image_generation calls, but each call appears to be issued with only its beat's text — no shared "location sheet" passed between calls. There's no equivalent of Pikumo's CONTINUITY ANCHOR reference image being threaded through the set. Identity holds because each call sees the attached identity photo(s); location doesn't hold because no location image is attached.

The likely fix is to render beat 1 first, then attach that rendered image as a location reference for beats 2–8. This is a multi-turn /v1/responses pattern (or two-stage orchestration) rather than the single-call thinking-mode capability. A separate follow-up experiment.

3. Diverse-scenes mode is the natural fit for the API as it stands today

Tests A and B (diverse locations) play to the model's strength: each beat names its own setting, identity locks via the attached photo, and the per-beat scene description carries the world. The result is the cleanest of the four: a believable day-in-the-life arc with faithful identity across the set. For Pikumo's "moments of a day / trip / event" story format, this is the production-ready path.

4. Cost lever confirmed at scale

4 tests × 8 images each = 32 image generations + 4 reasoning passes for an estimated ~$0.30 total. The previous single gpt-5.4 high-quality 8-panel run cost $1.78 by itself. This matrix run cost less than that single high-quality call AND produced 4× the output. The gpt-5.4-mini + quality:low + size:1024x1024 recipe is the right default — quality-sensitive surfaces can selectively escalate.

Bottom line

Updated recommendation (Test E result, 2026-05-26)

For Pikumo's existing storytelling pipeline — which already extracts a sceneBible with palette + setting + time period during Pass B — the integration looks like:

  1. Pre-render a LOCATION BIBLE text block from the sceneBible at thinking-mode call time. Named colors, named props, named layout. ~400 words is enough; the model rewards specificity.
  2. Add a PRESERVE LIST that enumerates the 10-15 invariants every panel must hold.
  3. In every per-beat description, re-anchor with "the SAME [bible-element]" verbatim language for at least 2-3 named elements.
  4. Use gpt-5.4-mini + quality: low + size: 1024x1024. max_tool_calls: N matched to the story's panel count. reasoning.effort: high.

For 4-panel stories (Pikumo's current shape), this reduces to a ~$0.04 per-story image budget (4 × $0.006 + ~$0.025 text) with both identity AND location coherence across the set — a step change over today's per-panel-isolated pipeline.

Generated 2026-05-26. All 4 tests cost < $0.40 total. Source headshots from eval/pipeline/cast/.