8-frame coherence 2×2 matrix
Same orchestrator + image config, varying only scene-diversity and number of identity refs.
All four tests use model: gpt-5.4-mini, reasoning.effort: high,
tools: [{type: "image_generation", quality: "low", size: "1024x1024"}],
max_tool_calls: 8, parallel_tool_calls: true.
Character A
female_asian
Character B
male_asian
Pinned variables
The matrix
| | Single ref (A only) | Multi ref (A + B) |
|---|---|---|
| Diverse scenes (8 locations, one day) | Test A | Test B |
| Same scene (one kitchen, 8 moments) | Test C | Test D |
What each test isolates. A — single-character coherence across diverse locations (baseline). B — multi-character ensemble + diverse locations (hardest combo). C — location coherence across 8 angles in one kitchen. D — joint identity + location lock with two people.
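The four cells follow mechanically from the two axes. A minimal sketch of that enumeration, assuming the pinned config listed above; the field names (`id`, `scenes`, `refs`, `request`) and the `buildMatrix` helper are illustrative and not taken from the actual test harness.

```javascript
// Pinned config shared by every cell (values from the pinned-variables list).
const PINNED = {
  model: "gpt-5.4-mini",
  reasoning: { effort: "high" },
  tools: [{ type: "image_generation", quality: "low", size: "1024x1024" }],
  max_tool_calls: 8,
  parallel_tool_calls: true,
};

// The two axes that vary. These names are hypothetical.
const SCENE_MODES = ["diverse", "same"];
const REF_MODES = [["A"], ["A", "B"]];

function buildMatrix() {
  const ids = ["A", "B", "C", "D"]; // row-major: diverse row first, single ref first
  const cells = [];
  for (const scenes of SCENE_MODES) {
    for (const refs of REF_MODES) {
      cells.push({ id: ids[cells.length], scenes, refs, request: { ...PINNED } });
    }
  }
  return cells;
}

console.log(buildMatrix().map((c) => `${c.id}: ${c.scenes} + ${c.refs.join("+")}`));
```

Keeping the pinned block in one constant makes it harder for a cell to drift from the shared config by accident.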
Test A — diverse scenes, single ref
A: diverse + single. One person (character A), eight chronological beats across one day in 8 different locations: dawn bedroom → coffee → train → cafe → park → reading window → cooking → balcony dusk.
Test B — diverse scenes, multi ref
B: diverse + multi. Two people (A + B as an ensemble) across eight chronological beats in eight different locations of one shared day. Tests whether both identities lock independently across diverse rooms.
Test C — same scene, single ref
C: same + single. One person, ALL EIGHT beats in the SAME single location (a small home kitchen) at different moments / camera angles. Tests location coherence (room layout, cabinet placement, window position, lighting register, palette) across the set.
Test D — same scene, multi ref
D: same + multi. Two people, ALL EIGHT beats in the same small home kitchen. The most demanding combination: two distinct identities AND a single location must both stay consistent across all 8 panels.
Tests M / O / P / Q — Pikumo production probes
Pikumo-specific. Four tests targeting things that matter for Pikumo's actual production pipeline. Each test uses the K-recipe (gpt-5.4-mini + photo + sheet refs + low-q + 1024² + reasoning:high) and asks: "does this hold when we hit a real Pikumo-shaped surface?"
Test M — pet inclusion (text-only pet ref)
Pikumo's PET REFERENCE role in the canonical JSON spec normally takes a photo of the
user's pet. Question: does the thinking-mode pipeline hold pet identity from text alone
(a pet bible naming species + coat color + markings)? 6 beats with female_asian + "Whiskers, a fluffy
long-haired ginger tabby with white socks, white chest, green eyes, long fluffy tail."
Result: yes. Whiskers reads as the same cat in every panel — same orange tabby coat, same white markings, same fluffy tail. Beat 4 (pet alone in frame) correctly renders ONLY the cat with the woman's shoes visible in the background — the model didn't invent a human. Pikumo can ship pet support without adding a pet-photo upload step.
6 panels · 219.6 s wall · 689 reasoning tokens · ~$0.06 total.
Test O — cross-style transfer (oil + sketch)
Question: does the recipe generalize across Pikumo's STYLE_GUIDE catalog, or does it overfit to
dreamscape? Same 8-beat day-in-life prompt, only the style block changes. Two new styles:
Oil — gallery oil-painting register (visible brushwork, impasto, linen canvas)
8 panels · 334.4 s wall · 726 reasoning tokens · ~$0.075 total.
Sketch — pencil sketchbook register (loose linework, cross-hatching, paper grain)
8 panels · 298.5 s wall · 529 reasoning tokens · ~$0.07 total.
Result: clean transfer in both. Oil shows the requested impasto + warm earthy palette + atmospheric depth; sketch shows the requested loose pencil + cross-hatching + visible paper grain + minimal washes. Identity locks in both. No per-style optimization required.
Test P — dialogue / pull-quote handling
Pikumo's extract pipeline pulls quotes per beat (e.g., "Mom said 'I didn't know there were that many people who still loved me'"). Production prompts have strong NO TYPOGRAPHY rules. Does the thinking-mode pipeline honor that boundary when beats include dialogue language? 4 beats, each with a verbatim "she said X" cue + explicit "depict via expression, never as text" instruction.
Result: 4/4 panels have zero rendered text. No speech bubbles, no captions, no letters of any language. Emotional tone (surprise + gratitude / reluctance / helpless laughter / quiet wonder) is conveyed entirely through facial expression, mouth shape, hand position, and body language. Beat 1 has the woman mid-sentence with a hand to her chest and the brother attentive across the table, which is exactly the production pattern. Direct migration from Pikumo's per-panel pipeline preserves this behavior.
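The beat shape used in this test (a verbatim "she said X" cue plus an explicit no-text instruction) can be sketched as a small helper. The `dialogueBeat` function and the exact instruction wording are assumptions for illustration; the production NO TYPOGRAPHY rules are not reproduced in this report.

```javascript
// Wrap a pull-quote so the beat carries both the dialogue cue and the no-text rule.
// Hypothetical helper; the wording approximates the description above.
function dialogueBeat(index, scene, speaker, quote) {
  return (
    `Beat ${index}: ${scene} ${speaker} said "${quote}". ` +
    "Depict via expression, never as text. NO TYPOGRAPHY: no speech bubbles, " +
    "no captions, no lettering of any kind."
  );
}

console.log(
  dialogueBeat(1, "At the dinner table.", "Mom",
    "I didn't know there were that many people who still loved me")
);
```

Repeating the rule inside every beat, rather than stating it once at the top, mirrors the per-beat instruction described for this test.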
4 panels · 157.1 s wall · 656 reasoning tokens · ~$0.05 total.
Test Q — panel-count scaling (N=2, N=4)
Pikumo's default story shape is 2-4 panels. We've measured 8-panel runs extensively; this test measures the actual 2-panel and 4-panel cost + latency curves Pikumo's pipeline would see.
| N panels | Total tokens | Reasoning tokens | Wall-clock | Cost (text+images) | First-paint with streaming |
|---|---|---|---|---|---|
| N=2 | 7,167 | 557 | 73.7 s | $0.025 | ~33 s |
| N=4 | 10,628 | 917 | 160.2 s | $0.045 | ~33 s |
| N=8 (from Test E) | 20,103 | 992 | 413 s | $0.08 | ~49 s |
N=2 sample
N=4 sample
The headline number for Pikumo: a 2-panel story renders in 74 s for $0.025 and a 4-panel story renders in 160 s for $0.045, both measured through to the last finished panel. With streaming, first paint is at ~33 s regardless of N. The 2-panel shape fits comfortably inside Pikumo's ~90 s wizard budget end-to-end. The 4-panel shape needs streaming to feel acceptable, but with streaming the first illustration lands while the user is still reading their wizard caption.
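The budget reasoning above can be checked directly against the table. A small sketch using the measured rows and the ~90 s wizard budget quoted in the text; the `classify` helper and its labels are illustrative.

```javascript
// Measurements copied from the Test Q table.
const RUNS = [
  { panels: 2, wallSec: 73.7, firstPaintSec: 33, costUsd: 0.025 },
  { panels: 4, wallSec: 160.2, firstPaintSec: 33, costUsd: 0.045 },
  { panels: 8, wallSec: 413, firstPaintSec: 49, costUsd: 0.08 },
];

const WIZARD_BUDGET_SEC = 90; // the ~90 s wizard budget quoted in the text

// Does the whole render fit the budget, or only the first streamed panel?
function classify(run, budgetSec = WIZARD_BUDGET_SEC) {
  if (run.wallSec <= budgetSec) return "fits end-to-end";
  if (run.firstPaintSec <= budgetSec) return "needs streaming";
  return "over budget";
}

for (const run of RUNS) {
  const perPanel = (run.costUsd / run.panels).toFixed(4);
  console.log(`N=${run.panels}: ${classify(run)}, $${perPanel} per panel`);
}
```

Only the 2-panel shape fits without streaming; the 4-panel and 8-panel shapes rely on first paint landing inside the budget.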
All four Pikumo probes — one-line synthesis
- M — pets need no separate upload; text bible holds identity. Ship Pikumo pet support directly on this pipeline.
- O — recipe generalizes across STYLE_GUIDE; no per-style tuning. Ship one pipeline for all styles.
- P — dialogue language in beats does not cause text-in-image violations. Pull-quote extraction stays usable.
- Q — typical 2-panel Pikumo story costs $0.025 in 74 s; 4-panel costs $0.045 in 160 s. With streaming, first paint at ~33 s regardless. The economics work.
Tests J / K / L — re-evaluating identity refs in the thinking-mode workflow
J vs K vs L
The original Azure /v1/images/edits cohort study concluded: ship "photo + character sheet"
(approach 3) for multi-panel sets — it won on set-wide consistency over photo-only (approach 2) and
sheet-only (approach 1). Does that finding still hold in the new gpt-5.4-mini + thinking-mode
pipeline?
Three runs against the locked-kitchen prompt from Test E. Identical model, tools, beats, location bible. The ONLY variable is what's attached as identity refs.
Photo (used in J & K)
Character sheet (used in K & L)
Token + cost ledger
| Test | Refs attached | Reasoning tokens | Total tokens | Text cost | + 8 images | Total |
|---|---|---|---|---|---|---|
| J — photo only | 1 (photo) | 1,070 | 21,109 | $0.033 | $0.048 | $0.081 |
| K — photo + sheet | 2 | 561 ← lowest | 18,950 | $0.025 | $0.048 | $0.073 ← cheapest |
| L — sheet only | 1 (sheet) | 892 | 19,264 | $0.027 | $0.048 | $0.075 |
Test J — photo only (today's production)
Test K — photo + character sheet (recommended)
Test L — character sheet only
Verdict
- All three configurations produce competitive identity hold across the 8 panels. Visually, K (both refs) has the most "settled" face proportions; J (photo only) is a close second; L (sheet only) shows mild drift toward generic-anime defaults but remains recognizably the same person.
- K is both the cheapest AND the strongest match. Adding the character sheet to the attached refs REDUCES reasoning-token consumption from 1,070 (photo only) to 561 — the sheet acts as a "hair + silhouette stabilizer" that the model would otherwise have to derive from scratch on each panel.
- The original "approach 3" recommendation holds — but for different reasons. In the Azure /edits single-shot world, photo+sheet won on cross-panel identity consistency. In the thinking-mode world, the reasoning step makes all three competitive on consistency; photo+sheet now wins on cost (cheaper reasoning) AND on face fidelity (more reference signal). Same conclusion, new mechanism.
- Pikumo's character-bible pipeline becomes more valuable, not less. The generated multi-view sheet (one-time cost per user) saves reasoning tokens on every subsequent story — payback period is essentially zero.
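The ledger totals and the reasoning-token saving quoted in the verdict can be recomputed from the table rows. A short sketch; the row objects simply transcribe the table, and the variable names are illustrative.

```javascript
// Rows copied from the token + cost ledger.
const LEDGER = [
  { test: "J", refs: "photo", reasoningTokens: 1070, textUsd: 0.033, imagesUsd: 0.048 },
  { test: "K", refs: "photo + sheet", reasoningTokens: 561, textUsd: 0.025, imagesUsd: 0.048 },
  { test: "L", refs: "sheet", reasoningTokens: 892, textUsd: 0.027, imagesUsd: 0.048 },
];

// Total = text cost + cost of the 8 images, rounded to tenths of a cent.
const totalUsd = (row) => Number((row.textUsd + row.imagesUsd).toFixed(3));

// Cheapest configuration by total cost.
const cheapest = LEDGER.reduce((best, row) => (totalUsd(row) < totalUsd(best) ? row : best));

// Relative reasoning-token saving of K against the photo-only baseline J.
const [j, k] = LEDGER;
const savingPct = Math.round((1 - k.reasoningTokens / j.reasoningTokens) * 100);

console.log(cheapest.test, totalUsd(cheapest), `${savingPct}% fewer reasoning tokens than J`);
```

The image cost is identical in all three rows, so the whole difference comes from the text side, which is where the reasoning tokens are billed.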
Updated production recipe
// Pikumo's thinking-mode pipeline, end of investigation:
POST https://api.openai.com/v1/responses
{
"model": "gpt-5.4-mini", // 22× cheaper than gpt-5.4 at scale
"input": [{ "role": "user", "content": [
{ "type": "input_text", "text": "<style + identity + LOCATION BIBLE + beats>" },
{ "type": "input_image", "image_url": "<photo>" }, // K-recipe: both refs
{ "type": "input_image", "image_url": "<character_sheet>" }
]}],
"tools": [{ "type": "image_generation",
"quality": "low", // 35× cheaper than high
"size": "1024x1024" }],
"tool_choice": "auto",
"parallel_tool_calls": true,
"max_tool_calls": N, // panel count
"reasoning": { "effort": "high" },
"stream": true // first paint at ~33-49s
}
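The recipe above uses `N` as a placeholder and carries inline comments, so it is not valid JSON as written. A minimal sketch of a builder that produces the same body as a plain object; `buildStoryRequest`, its parameter names, and the example URLs are hypothetical.

```javascript
// Build the /v1/responses request body for an N-panel story.
// Hypothetical helper: mirrors the recipe fields, nothing is sent here.
function buildStoryRequest({ promptText, photoUrl, sheetUrl, panelCount }) {
  if (!Number.isInteger(panelCount) || panelCount < 1) {
    throw new RangeError("panelCount must be a positive integer");
  }
  return {
    model: "gpt-5.4-mini",
    input: [{
      role: "user",
      content: [
        { type: "input_text", text: promptText },
        { type: "input_image", image_url: photoUrl }, // K-recipe: photo ...
        { type: "input_image", image_url: sheetUrl }, // ... plus character sheet
      ],
    }],
    tools: [{ type: "image_generation", quality: "low", size: "1024x1024" }],
    tool_choice: "auto",
    parallel_tool_calls: true,
    max_tool_calls: panelCount, // one tool call per panel
    reasoning: { effort: "high" },
    stream: true,
  };
}

const body = buildStoryRequest({
  promptText: "<style + identity + LOCATION BIBLE + beats>",
  photoUrl: "https://example.com/photo.png",           // placeholder URL
  sheetUrl: "https://example.com/character_sheet.png", // placeholder URL
  panelCount: 4,
});
console.log(JSON.stringify(body, null, 2));
```

Validating `panelCount` up front avoids sending a request whose `max_tool_calls` does not match the story shape.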
Test G — TWO interleaved locations, K K A K A K A A
G: 2 locations interleaved. Follow-up to Test E: does the LOCATION BIBLE technique scale to MULTIPLE locked locations in the same call? 8 beats alternating between a KITCHEN bible (beats 1, 2, 4, 6) and an ATTIC STUDY bible (beats 3, 5, 7, 8). Each beat is tagged. Identical model, tools, quality, size, max_tool_calls, reasoning effort as Test E.
Result: 8/8 panels correctly routed. Kitchen panels share the same kitchen (terracotta + sage cabinets + delft + 6-pane window + leaning fence post). Attic panels share the same attic (honey pine floor + ivory sloped walls + dark oak beam + diamond-pane dormer + navy window seat + kilim pillows + green-glass banker's lamp + Persian rug). Zero cross-contamination. Beat 8 even rendered the requested "dusk" lighting in the same attic — same room, evening sky.
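One way to assemble a prompt with two bibles and tagged beats is sketched below. The report only says each beat is tagged, so the `[KITCHEN]` / `[ATTIC]` tag syntax, the `buildInterleavedPrompt` helper, and the abbreviated bible text are all assumptions.

```javascript
// Abbreviated bibles, using elements named in the result above.
const BIBLES = {
  KITCHEN: "terracotta floor, sage cabinets, delft backsplash, 6-pane window",
  ATTIC: "honey pine floor, ivory sloped walls, dark oak beam, diamond-pane dormer",
};

// Test G routing: kitchen for beats 1, 2, 4, 6; attic for beats 3, 5, 7, 8.
const ROUTE = ["KITCHEN", "KITCHEN", "ATTIC", "KITCHEN", "ATTIC", "KITCHEN", "ATTIC", "ATTIC"];

// Hypothetical prompt assembly: one bible block per location, one tag per beat.
function buildInterleavedPrompt(beats, route = ROUTE, bibles = BIBLES) {
  if (beats.length !== route.length) throw new Error("one route tag per beat is required");
  const bibleBlock = Object.entries(bibles)
    .map(([tag, text]) => `LOCATION BIBLE [${tag}]: ${text}`)
    .join("\n");
  const beatBlock = beats
    .map((beat, i) => `Beat ${i + 1} [${route[i]}]: ${beat}`)
    .join("\n");
  return `${bibleBlock}\n\n${beatBlock}`;
}

const beats = Array.from({ length: 8 }, (_, i) => `moment ${i + 1}`);
console.log(buildInterleavedPrompt(beats));
```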
Tests H + I — Streaming: panels arrive incrementally
stream: true · SSE
All the prior tests waited for the full response before showing anything. The OpenAI
/v1/responses endpoint supports stream: true with Server-Sent Events.
Each image_generation_call result lands on the stream as soon as it's
generated, not at the end of the response. That moves time-to-first-paint from
~5 min to ~50 s for an 8-panel story.
8-panel streaming timeline (gpt-5.4-mini · low quality · 1024²)
T+0.0 s    request sent
T+4.1 s    image 1 starts
T+48.6 s   image 1 FINAL   ← first paint at ~49 s
T+85.5 s   image 2 FINAL   (+37 s)
T+118.2 s  image 3 FINAL   (+33 s)
T+156.4 s  image 4 FINAL   (+38 s)
T+190.7 s  image 5 FINAL   (+34 s)
T+223.2 s  image 6 FINAL   (+33 s)
T+263.2 s  image 7 FINAL   (+40 s)
T+324.0 s  image 8 FINAL   (+61 s)
T+327.1 s  response.completed
Cadence is ~39s per panel. Images stream sequentially even with
parallel_tool_calls: true — the model reasons, fires one image, waits, then reasons
and fires the next. Wall-clock total is unchanged from non-streamed (still ~5 min for 8 panels) —
but the user-experience shape is completely different.
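The cadence figure can be recomputed from the FINAL timestamps in the timeline. A short sketch; the array simply transcribes those timestamps.

```javascript
// FINAL timestamps (seconds after the request) copied from the timeline.
const FINAL_AT = [48.6, 85.5, 118.2, 156.4, 190.7, 223.2, 263.2, 324.0];

// Gap between consecutive finished panels.
const gaps = FINAL_AT.slice(1).map((t, i) => t - FINAL_AT[i]);

const mean = gaps.reduce((sum, g) => sum + g, 0) / gaps.length;
const slowest = Math.max(...gaps);

console.log(`first paint: ${FINAL_AT[0]} s`);
console.log(`mean gap: ${mean.toFixed(1)} s, slowest gap: ${slowest.toFixed(1)} s`);
```

The mean gap comes out at about 39.3 s, matching the ~39 s cadence, and the slowest gap is the 61 s wait before image 8.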
Wait-time UX shift
| | Non-streamed | Streamed |
|---|---|---|
| Time to first visible panel | ~327 s (5.4 min) | ~49 s |
| Time to second panel | same 327 s | ~86 s |
| Time to last panel (image 8) | ~327 s | ~324 s (essentially same) |
| Loading-state duration | 5.4 min blank | 49 s blank, then 33-60 s gaps |
| Cost added by streaming | — | $0 (same per-call cost) |
The 8 streamed panels (low quality, same prompt as Test A)
Integration recipe for Pikumo
The Worker already speaks SSE for the existing /generate endpoint
(per CLAUDE.md). The integration is two surfaces:
// Worker side — pass through OpenAI SSE to client SSE
const upstream = await fetch("https://api.openai.com/v1/responses", {
method: "POST",
headers: { Authorization: `Bearer ${env.OPENAI_API_KEY}`,
"Content-Type": "application/json" },
body: JSON.stringify({
model: "gpt-5.4-mini",
tools: [{ type: "image_generation",
quality: "low", size: "1024x1024" }],
parallel_tool_calls: true,
max_tool_calls: N, // panel count
reasoning: { effort: "high" },
stream: true, // ← the lever
input: [/* user message with attached identity photo + prompt */],
}),
});
// Parse SSE; forward each image_generation_call.output_item.done
// as a `panel` SSE event to the frontend with the panel index + b64.
// Frontend (main/app.js) — pop each panel into the album as it arrives
const es = new EventSource(`${API_BASE}/generate?sid=${storyId}`);
es.addEventListener("panel", (e) => {
const { index, b64 } = JSON.parse(e.data);
els.panels[index].src = `data:image/png;base64,${b64}`;
});
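The "Parse SSE; forward each image" step is left as a comment above. A minimal sketch of that step follows. It assumes the upstream stream emits `response.output_item.done` events whose `item.type` is `image_generation_call` and whose `item.result` holds the base64 image; that event name and field layout should be verified against the current API reference. `createSseParser`, `toPanel`, and `formatPanelEvent` are hypothetical helpers.

```javascript
// Incremental SSE parser: feed it decoded text chunks, get back complete events.
// SSE events are separated by a blank line; each carries `event:` and `data:` fields.
function createSseParser() {
  let buffer = "";
  return function feed(chunk) {
    buffer = (buffer + chunk).replace(/\r\n/g, "\n");
    const blocks = buffer.split("\n\n");
    buffer = blocks.pop(); // the last piece may be an incomplete event
    const events = [];
    for (const block of blocks) {
      let event = "message";
      const data = [];
      for (const line of block.split("\n")) {
        if (line.startsWith("event:")) event = line.slice(6).trim();
        else if (line.startsWith("data:")) data.push(line.slice(5).trimStart());
      }
      if (data.length) events.push({ event, data: data.join("\n") });
    }
    return events;
  };
}

// Map an upstream event to a `panel` payload, or null if it is not a finished image.
// ASSUMPTION: event name and field layout as stated in the lead-in.
function toPanel(evt, index) {
  if (evt.event !== "response.output_item.done") return null;
  const item = JSON.parse(evt.data).item;
  if (!item || item.type !== "image_generation_call" || !item.result) return null;
  return { index, b64: item.result };
}

// Format a payload as an outgoing SSE `panel` event for the frontend listener.
const formatPanelEvent = (panel) => `event: panel\ndata: ${JSON.stringify(panel)}\n\n`;
```

In the Worker, each decoded chunk from `upstream.body` would go through `feed`, with a running counter supplying `index`. Buffering across chunks matters because a base64 image payload will usually span many network reads.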
Notes
- partial_images: 2 didn't help much. A 4-panel test with partials enabled showed the "partial" preview landing only 1-3 s before the final, and it was essentially the same image. Skip the parameter; that saves ~$0.012 per 4-panel call.
- Cadence has occasional outliers: beat 8 took 61 s instead of ~33-40 s. Plan the UI to accept a panel at any time within a ~30-65 s window after the prior one.
- The "first paint" budget on Pikumo's wizard is ~90 s. With streaming, the first illustration lands inside that budget; without, the user stares at "Loading…" for the full 5 minutes.
Test E — same scene, single ref + LOCATION BIBLE
E: same + single + bible. Follow-up to Test C's location drift. Same model, same image-tool config, same 8 beats; only the prompt changes. Adds: a 400-word LOCATION BIBLE (terracotta floor, sage-green walls + lower cabinets, cream uppers, brass knobs, blue-and-white delft backsplash, butcher-block counter, six-pane window with countryside + leaning fence post, white farmhouse sink, cream gas stove, copper pot rack with 3 pots, oak table with yellow vase of white daisies + stack of red/navy/cream books + sage-green mug), a PRESERVE LIST naming 15 invariants, and "the SAME X" re-anchors in every beat.
Result: location locks. The kitchen is visibly the same room in all 8 panels — same floor, same cabinets, same delft backsplash, same window-and-countryside view, same yellow-vase-with-daisies on the table. Cost: ~$0.08 (same as Test C); only the prompt did the work.
Findings
All 4 tests succeeded — 32 images, ~$0.30 total spend. Wall-clock per test: ~5 min. Reasoning fired on every call (328 / 547 / 590 / 872 reasoning tokens respectively — complexity scales with prompt difficulty).
1. Identity lock holds in every cell of the matrix
Single ref (A, C) and multi ref (B, D) both produce visibly consistent faces across all 8 beats. With two photos attached, the model correctly anchors each character to its respective reference: the woman matches photo 1's face / hair / age throughout, the man matches photo 2's. They don't swap. They don't blend. The "FIRST photo / SECOND photo" naming convention in the prompt is honored. Multi-character ensembles are production-viable on this surface.
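The positional naming convention depends on the attached images arriving in the same order the prompt names them. A sketch of a helper that keeps the two in step; `buildEnsembleContent`, the sentence wording, and the example URLs are assumptions, since the production phrasing is not shown in this report.

```javascript
// Attach identity photos in a fixed order and name them by position in the prompt.
// Hypothetical helper; the exact "FIRST photo / SECOND photo" wording is assumed.
const ORDINALS = ["FIRST", "SECOND", "THIRD", "FOURTH"];

function buildEnsembleContent(characters, storyText) {
  if (characters.length > ORDINALS.length) throw new RangeError("too many identity refs");
  const identityLines = characters.map(
    (c, i) => `The ${ORDINALS[i]} photo is ${c.name}. Keep this face, hair and age in every panel.`
  );
  return [
    { type: "input_text", text: `${identityLines.join("\n")}\n\n${storyText}` },
    // Image order must match the ordinal used in the text above.
    ...characters.map((c) => ({ type: "input_image", image_url: c.photoUrl })),
  ];
}

const content = buildEnsembleContent(
  [
    { name: "the woman (character A)", photoUrl: "https://example.com/a.png" }, // placeholder
    { name: "the man (character B)", photoUrl: "https://example.com/b.png" },   // placeholder
  ],
  "Eight beats across one shared day."
);
console.log(content[0].text);
```

Deriving both the text and the image list from the same array removes the chance of the ordinals and the attachments falling out of order.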
2. Location coherence is the weakness — same-scene drifts
Tests C and D were the location-lock stress test: "all 8 in the SAME single location (a small home kitchen)". The model rendered a small home kitchen in every panel — but not the SAME kitchen. Cabinet color, wall color, window placement, floor pattern, layout, and props all vary panel-to-panel. Beat 1's wooden-table-by-window doesn't match beat 4's white-cabinet kitchen with green door; beat 8 closes in on yet another room.
Why it fails. The thinking step decomposes the prompt
into 8 separate image_generation calls, but each call
appears to be issued with only its beat's text — no shared "location
sheet" passed between calls. There's no equivalent of Pikumo's
CONTINUITY ANCHOR reference image being threaded through
the set. Identity holds because each call sees the attached identity
photo(s); location doesn't hold because no location image is attached.
The likely fix is to render beat 1 first, then attach that rendered image as a location reference for beats 2–8. This is a multi-turn /v1/responses pattern (or two-stage orchestration) rather than the single-call thinking-mode capability. A separate follow-up experiment.
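The two-stage idea can be sketched as two request bodies: one for beat 1 alone, and one for the remaining beats with the first rendered panel attached as a data URL. This is an untested illustration of the proposed follow-up, not something the matrix run exercised; the helper names and prompt wording are hypothetical, and only request bodies are built.

```javascript
// Shared settings, matching the pinned image config of the matrix.
const BASE = {
  model: "gpt-5.4-mini",
  tools: [{ type: "image_generation", quality: "low", size: "1024x1024" }],
  reasoning: { effort: "high" },
};

// Stage 1: render only the first beat.
function stageOneBody(identityUrl, beats) {
  return {
    ...BASE,
    max_tool_calls: 1,
    input: [{ role: "user", content: [
      { type: "input_text", text: `Render only this beat: ${beats[0]}` },
      { type: "input_image", image_url: identityUrl },
    ]}],
  };
}

// Stage 2: attach the rendered first panel as a location reference for the rest.
function stageTwoBody(identityUrl, beats, firstPanelB64) {
  const rest = beats.slice(1);
  return {
    ...BASE,
    max_tool_calls: rest.length,
    input: [{ role: "user", content: [
      { type: "input_text", text:
        "The FIRST image is the character. The SECOND image is the location; " +
        "keep the SAME room in every panel.\n" + rest.join("\n") },
      { type: "input_image", image_url: identityUrl },
      { type: "input_image", image_url: `data:image/png;base64,${firstPanelB64}` },
    ]}],
  };
}
```

The cost of this pattern is a second round trip before beats 2 to 8 can start, which is why a prompt-only fix is attractive if it works.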
3. Diverse-scenes mode is the natural fit for the API as it stands today
Tests A and B (diverse locations) play to the model's strength: each beat names its own setting, identity locks via the attached photo, and the per-beat scene description carries the world. The result is the cleanest of the four: a believable day-in-the-life arc with faithful identity across the set. For Pikumo's "moments of a day / trip / event" story format, this is the production-ready path.
4. Cost lever confirmed at scale
4 tests × 8 images each = 32 image generations + 4 reasoning passes
for an estimated ~$0.30 total. The previous single
gpt-5.4 high-quality 8-panel run cost $1.78 by itself. This matrix run
cost less than that single high-quality call AND produced 4× the
output. The gpt-5.4-mini + quality:low + size:1024x1024
recipe is the right default — quality-sensitive surfaces can selectively
escalate.
Bottom line
- Ship the diverse-scene + multi-ref ensemble (Test B) configuration as a "story preview" surface — works at ~$0.07 per 8-panel story.
- Same-location mode works IF the prompt does the work. The default "in the same kitchen" wording drifts (Test C/D). Adding a structured LOCATION BIBLE + PRESERVE LIST + "the SAME X" re-anchors locks the room across all 8 panels at the same ~$0.08 cost (Test E). The Pikumo sceneBible production block already maps cleanly to this shape.
- Identity is not the bottleneck. The bottleneck was passed-through world state, and the bible-style prompt is enough to inject it without a multi-turn chain.
Updated recommendation (Test E result, 2026-05-26)
For Pikumo's existing storytelling pipeline — which already extracts a sceneBible
with palette + setting + time period during Pass B — the integration looks like:
- Pre-render a LOCATION BIBLE text block from the sceneBible at thinking-mode call time. Named colors, named props, named layout. ~400 words is enough; the model rewards specificity.
- Add a PRESERVE LIST that enumerates the 10-15 invariants every panel must hold.
- In every per-beat description, re-anchor with "the SAME [bible-element]" verbatim language for at least 2-3 named elements.
- Use gpt-5.4-mini + quality: low + size: 1024x1024. max_tool_calls: N matched to the story's panel count. reasoning.effort: high.
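The steps above can be sketched as a single prompt builder. The `sceneBible` shape, the `buildLockedLocationPrompt` helper, and the exact block wording are assumptions for illustration; the real sceneBible fields from Pass B are not shown in this report, and a production bible would be closer to the ~400 words described.

```javascript
// Hypothetical shape for the extracted scene bible; field names are assumptions.
const sceneBible = {
  name: "kitchen",
  elements: ["terracotta floor", "sage-green lower cabinets", "delft backsplash", "six-pane window"],
};

// Assemble bible + preserve list + per-beat re-anchors into one prompt string.
function buildLockedLocationPrompt(bible, beats, anchorsPerBeat = 2) {
  const bibleBlock = `LOCATION BIBLE (${bible.name}): ${bible.elements.join("; ")}.`;
  const preserveBlock =
    "PRESERVE LIST (must hold in every panel):\n" +
    bible.elements.map((e, i) => `${i + 1}. ${e}`).join("\n");
  const beatBlock = beats.map((beat, i) => {
    // Rotate through the elements so each beat re-anchors a different subset.
    const anchors = Array.from({ length: anchorsPerBeat }, (_, k) =>
      `the SAME ${bible.elements[(i + k) % bible.elements.length]}`);
    return `Beat ${i + 1}: ${beat}, with ${anchors.join(" and ")}.`;
  }).join("\n");
  return [bibleBlock, preserveBlock, beatBlock].join("\n\n");
}

console.log(buildLockedLocationPrompt(sceneBible, ["she fills the kettle", "she sets the table"]));
```

Rotating the re-anchored elements is one possible design choice: every invariant gets repeated somewhere in the set without making each beat description excessively long.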
For 4-panel stories (Pikumo's current shape), this reduces to a ~$0.05 per-story budget (4 × $0.006 images + ~$0.025 text ≈ $0.049, in line with the $0.045 measured in Test Q), with both identity AND location coherence across the set. That is a step change over today's per-panel-isolated pipeline.
Generated 2026-05-26. All 4 tests cost < $0.40 total. Source headshots from eval/pipeline/cast/.