8-frame coherence 2×2 matrix

Same orchestrator + image config, varying only scene-diversity and number of identity refs. All four tests use model: gpt-5.4-mini, reasoning.effort: high, tools: [{type: "image_generation", quality: "low", size: "1024x1024"}], max_tool_calls: 8, parallel_tool_calls: true.

Character A

character A reference female_asian

Character B

character B reference male_asian

Pinned variables

Endpoint            POST https://api.openai.com/v1/responses
Orchestrator        gpt-5.4-mini · reasoning.effort: "high"
Image tool          gpt-image-2 · quality: "low" · size: "1024x1024"
Tool budget         max_tool_calls: 8 · parallel_tool_calls: true
Est. cost / test    ~$0.07 (8 × $0.006 image + ~$0.025 text)
Matrix total cost   ~$0.30
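
The per-test estimate is plain arithmetic. A minimal sketch, assuming the flat per-image and per-call text figures quoted above; the function name and its defaults are illustrative, not official pricing or part of any API:

```javascript
// Hypothetical helper: estimate the cost of one thinking-mode call.
// Defaults are the figures quoted in this report, not official pricing.
function estimateCallCost(panelCount, perImageUsd = 0.006, textUsd = 0.025) {
  return panelCount * perImageUsd + textUsd;
}

const perTest = estimateCallCost(8); // 8 × $0.006 + $0.025
const matrixTotal = 4 * perTest;     // four cells in the 2×2 matrix

console.log(perTest.toFixed(3));     // "0.073"
console.log(matrixTotal.toFixed(2)); // "0.29"
```

The unrounded total (~$0.29) is what the report rounds to ~$0.30.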

The matrix

                                        Single ref (A only)   Multi ref (A + B)
Diverse scenes (8 locations, one day)   Test A                Test B
Same scene (one kitchen, 8 moments)     Test C                Test D

What each test isolates. A: single-character coherence across diverse locations (baseline). B: multi-character ensemble + diverse locations (hardest combo). C: location coherence across 8 angles in one kitchen. D: joint identity + location lock with two people.

Test A — diverse scenes, single ref

One person (character A), eight chronological beats across one day in 8 different locations: dawn bedroom → coffee → train → cafe → park → reading window → cooking → balcony dusk.

Test B — diverse scenes, multi ref

Two people (A + B as an ensemble) across eight chronological beats in eight different locations of one shared day. Tests whether both identities lock independently across diverse rooms.

Test C — same scene, single ref

One person, ALL EIGHT beats in the SAME single location (a small home kitchen) at different moments / camera angles. Tests location coherence — room layout, cabinet placement, window position, lighting register, palette — across the set.

Test D — same scene, multi ref

Two people, ALL EIGHT beats in the same small home kitchen. The most demanding combination: two distinct identities AND a single location must both stay consistent across all 8 panels.

Tests M / O / P / Q — Pikumo production probes

Pikumo-specific 4 tests · ~$0.33 total · all under $0.50/test budget

Four tests targeting things that matter for Pikumo's actual production pipeline. Each test uses the K-recipe (gpt-5.4-mini + photo + sheet refs + low-q + 1024² + reasoning:high) and asks: "does this hold when we hit a real Pikumo-shaped surface?"

Test M — pet inclusion (text-only pet ref)

Pikumo's PET REFERENCE role in the canonical JSON spec normally takes a photo of the user's pet. Question: does the thinking-mode pipeline hold pet identity from text alone (a pet bible naming species + coat color + markings)? 6 beats with female_asian + "Whiskers, a fluffy long-haired ginger tabby with white socks, white chest, green eyes, long fluffy tail."

Result: yes. Whiskers reads as the same cat in every panel — same orange tabby coat, same white markings, same fluffy tail. Beat 4 (pet alone in frame) correctly renders ONLY the cat with the woman's shoes visible in the background — the model didn't invent a human. Pikumo can ship pet support without adding a pet-photo upload step.

6 panels · 219.6 s wall · 689 reasoning tokens · ~$0.06 total.

Test O — cross-style transfer (oil + sketch)

Question: does the recipe generalize across Pikumo's STYLE_GUIDE catalog, or does it overfit to dreamscape? Same 8-beat day-in-life prompt, only the style block changes. Two new styles:

Oil — gallery oil-painting register (visible brushwork, impasto, linen canvas)

8 panels · 334.4 s wall · 726 reasoning tokens · ~$0.075 total.

Sketch — pencil sketchbook register (loose linework, cross-hatching, paper grain)

8 panels · 298.5 s wall · 529 reasoning tokens · ~$0.07 total.

Result: clean transfer in both. Oil shows the requested impasto + warm earthy palette + atmospheric depth; sketch shows the requested loose pencil + cross-hatching + visible paper grain + minimal washes. Identity locks in both. No per-style optimization required.

Test P — dialogue / pull-quote handling

Pikumo's extract pipeline pulls quotes per beat (e.g., "Mom said 'I didn't know there were that many people who still loved me'"). Production prompts have strong NO TYPOGRAPHY rules. Does the thinking-mode pipeline honor that boundary when beats include dialogue language? 4 beats, each with a verbatim "she said X" cue + explicit "depict via expression, never as text" instruction.

Result: 4/4 panels have zero rendered text. No speech bubbles, no captions, no letters of any language. Emotional tone (surprise + gratitude / reluctance / helpless laughter / quiet wonder) is conveyed entirely through facial expression, mouth shape, hand position, and body language. Beat 1 has the woman mid-sentence with a hand to her chest and the brother attentive across the table, which is exactly the production pattern. Direct migration from Pikumo's per-panel pipeline preserves this behavior.

4 panels · 157.1 s wall · 656 reasoning tokens · ~$0.05 total.

Test Q — panel-count scaling (N=2, N=4)

Pikumo's default story shape is 2-4 panels. We've measured 8-panel runs extensively; this test measures the actual 2-panel and 4-panel cost + latency curves Pikumo's pipeline would see.

N panels            Total tokens   Reasoning tokens   Wall-clock   Cost (text+images)   First-paint with streaming
N=2                 7,167          557                73.7 s       $0.025               ~33 s
N=4                 10,628         917                160.2 s      $0.045               ~33 s
N=8 (from Test E)   20,103         992                413 s        $0.08                ~49 s

N=2 sample

N=4 sample

The headline number for Pikumo: a 2-panel story renders in 74 s for $0.025 and a 4-panel renders in 160 s for $0.045 — both ending in complete rendering. With streaming, first paint is at ~33 s regardless of N. The 2-panel shape fits comfortably inside Pikumo's ~90 s wizard budget end-to-end. The 4-panel shape needs streaming to feel acceptable, but with streaming the first illustration lands while the user is still reading their wizard caption.
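
The budget reasoning above can be written down as a small decision rule. This is a sketch only: the ~90 s budget and the measurements come from this section, while `deliveryMode` and its return labels are hypothetical names chosen for illustration.

```javascript
// Measured figures from the Test Q table (seconds).
const measured = [
  { panels: 2, wallClockS: 73.7, firstPaintS: 33 },
  { panels: 4, wallClockS: 160.2, firstPaintS: 33 },
  { panels: 8, wallClockS: 413, firstPaintS: 49 },
];

// Hypothetical policy: a blocking call is fine when the whole render fits
// the budget; otherwise stream, provided first paint itself fits the budget.
function deliveryMode(run, budgetS = 90) {
  if (run.wallClockS <= budgetS) return "blocking-ok";
  return run.firstPaintS <= budgetS ? "stream" : "over-budget";
}

for (const run of measured) {
  console.log(`N=${run.panels}: ${deliveryMode(run)}`);
}
// N=2: blocking-ok
// N=4: stream
// N=8: stream
```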

All four Pikumo probes — one-line synthesis

Tests J / K / L — re-evaluating identity refs in the thinking-mode workflow

J vs K vs L ~$0.24 total · 3 × 8 panels

The original Azure /v1/images/edits cohort study concluded: ship "photo + character sheet" (approach 3) for multi-panel sets — it won on set-wide consistency over photo-only (approach 2) and sheet-only (approach 1). Does that finding still hold in the new gpt-5.4-mini + thinking-mode pipeline?

Three runs against the locked-kitchen prompt from Test E. Identical model, tools, beats, location bible. The ONLY variable is what's attached as identity refs.

Photo (used in J & K)

source photo

Character sheet (used in K & L)

character sheet

Token + cost ledger

Test                Refs attached   Reasoning tokens   Total tokens   Text cost   + 8 images   Total
J (photo only)      1 (photo)       1,070              21,109         $0.033      $0.048       $0.081
K (photo + sheet)   2               561 (lowest)       18,950         $0.025      $0.048       $0.073 (cheapest)
L (sheet only)      1 (sheet)       892                19,264         $0.027      $0.048       $0.075

Test J — photo only (today's production)

Test K — photo + character sheet (recommended)

Test L — character sheet only

Verdict

Updated production recipe

// Pikumo's thinking-mode pipeline, end of investigation:

POST https://api.openai.com/v1/responses
{
  "model": "gpt-5.4-mini",                              // 22× cheaper than gpt-5.4 at scale
  "input": [{ "role": "user", "content": [
    { "type": "input_text",  "text": "<style + identity + LOCATION BIBLE + beats>" },
    { "type": "input_image", "image_url": "<photo>" },            // K-recipe: both refs
    { "type": "input_image", "image_url": "<character_sheet>" }
  ]}],
  "tools": [{ "type": "image_generation",
              "quality": "low",                          // 35× cheaper than high
              "size":    "1024x1024" }],
  "tool_choice":         "auto",
  "parallel_tool_calls": true,
  "max_tool_calls":      N,                              // panel count
  "reasoning":           { "effort": "high" },
  "stream":              true                            // first paint at ~33-49s
}
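
For callers that assemble this request in code, a minimal sketch of a builder follows. The field values are copied from the recipe above; the function name `buildStoryRequest` and its argument shape are hypothetical.

```javascript
// Hypothetical builder for the request body in the recipe above.
function buildStoryRequest({ promptText, photoUrl, sheetUrl, panelCount }) {
  if (!Number.isInteger(panelCount) || panelCount < 1) {
    throw new RangeError("panelCount must be a positive integer");
  }
  return {
    model: "gpt-5.4-mini",
    input: [{
      role: "user",
      content: [
        { type: "input_text", text: promptText },
        { type: "input_image", image_url: photoUrl }, // K-recipe: photo ref
        { type: "input_image", image_url: sheetUrl }, // K-recipe: sheet ref
      ],
    }],
    tools: [{ type: "image_generation", quality: "low", size: "1024x1024" }],
    tool_choice: "auto",
    parallel_tool_calls: true,
    max_tool_calls: panelCount, // one tool call per panel
    reasoning: { effort: "high" },
    stream: true,
  };
}
```

Validating `panelCount` up front keeps a malformed story shape from reaching the API as an invalid `max_tool_calls`.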

Test G — TWO interleaved locations, K K A K A K A A

G: 2 locations interleaved 894 reasoning tokens · 296 s · ~$0.076

Follow-up to Test E: does the LOCATION BIBLE technique scale to MULTIPLE locked locations in the same call? 8 beats alternating between a KITCHEN bible (beats 1, 2, 4, 6) and an ATTIC STUDY bible (beats 3, 5, 7, 8). Each beat is tagged. Identical model, tools, quality, size, max_tool_calls, reasoning effort as Test E.

Result: 8/8 panels correctly routed. Kitchen panels share the same kitchen (terracotta + sage cabinets + delft + 6-pane window + leaning fence post). Attic panels share the same attic (honey pine floor + ivory sloped walls + dark oak beam + diamond-pane dormer + navy window seat + kilim pillows + green-glass banker's lamp + Persian rug). Zero cross-contamination. Beat 8 even rendered the requested "dusk" lighting in the same attic — same room, evening sky.
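
A sketch of how a two-bible prompt with tagged beats might be assembled. The report only says each beat is tagged, so the `[KITCHEN]` / `[ATTIC]` bracket format, the function name, and the placeholder bible text are all assumptions made for illustration.

```javascript
// Hypothetical prompt assembly: one bible per location, each beat tagged.
function buildInterleavedPrompt(bibles, beats) {
  const bibleBlocks = Object.entries(bibles).map(
    ([tag, text]) => `LOCATION BIBLE [${tag}]\n${text}`
  );
  const beatLines = beats.map((beat, i) => {
    if (!(beat.location in bibles)) {
      throw new Error(`Beat ${i + 1} uses unknown location "${beat.location}"`);
    }
    return `Beat ${i + 1} [${beat.location}]: ${beat.text}`;
  });
  return [...bibleBlocks, "BEATS", ...beatLines].join("\n\n");
}

// Same interleaving as Test G: K K A K A K A A.
const order = ["KITCHEN", "KITCHEN", "ATTIC", "KITCHEN", "ATTIC", "KITCHEN", "ATTIC", "ATTIC"];
const prompt = buildInterleavedPrompt(
  { KITCHEN: "<kitchen bible text>", ATTIC: "<attic study bible text>" },
  order.map((location, i) => ({ location, text: `<beat ${i + 1} description>` }))
);
console.log(prompt.split("\n\n").length); // 11 blocks: 2 bibles + header + 8 beats
```

Rejecting a beat whose tag has no matching bible catches routing mistakes before any image is generated.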

Tests H + I — Streaming: panels arrive incrementally

stream: true · SSE 8-panel: first paint T+49s · last paint T+324s · 39s/panel

All the prior tests waited for the full response before showing anything. The OpenAI /v1/responses endpoint supports stream: true with Server-Sent Events. Each image_generation_call result lands on the stream as soon as it's generated, not at the end of the response. That moves time-to-first-paint from ~5 min to ~50 s for an 8-panel story.

8-panel streaming timeline (gpt-5.4-mini · low quality · 1024²)

T+0.0 s     request sent
T+4.1 s     image 1 starts
T+48.6 s    image 1 FINAL  ← first paint at ~49s
T+85.5 s    image 2 FINAL  (+37 s)
T+118.2 s   image 3 FINAL  (+33 s)
T+156.4 s   image 4 FINAL  (+38 s)
T+190.7 s   image 5 FINAL  (+34 s)
T+223.2 s   image 6 FINAL  (+33 s)
T+263.2 s   image 7 FINAL  (+40 s)
T+324.0 s   image 8 FINAL  (+61 s)
T+327.1 s   response.completed

Cadence is ~39s per panel. Images stream sequentially even with parallel_tool_calls: true — the model reasons, fires one image, waits, then reasons and fires the next. Wall-clock total is unchanged from non-streamed (still ~5 min for 8 panels) — but the user-experience shape is completely different.
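
The cadence figure can be reproduced from the timeline. A small sketch using the eight FINAL timestamps listed above:

```javascript
// FINAL timestamps for images 1..8, in seconds after the request was sent.
const finalTimes = [48.6, 85.5, 118.2, 156.4, 190.7, 223.2, 263.2, 324.0];

// Gap between each consecutive pair of finished images.
const gaps = finalTimes.slice(1).map((t, i) => t - finalTimes[i]);
const meanGap = gaps.reduce((sum, g) => sum + g, 0) / gaps.length;

console.log(gaps.length);        // 7
console.log(meanGap.toFixed(1)); // "39.3"
```

The mean is taken over the seven inter-panel gaps, so it excludes the ~49 s wait before the first panel.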

Wait-time UX shift

                                Non-streamed       Streamed
Time to first visible panel     ~327 s (5.4 min)   ~49 s
Time to second panel            same 327 s         ~86 s
Time to last panel (image 8)    ~327 s             ~324 s (essentially same)
Loading-state duration          5.4 min blank      49 s blank, then 33-60 s gaps
Cost added by streaming         $0 (same per-call cost)

The 8 streamed panels (low quality, same prompt as Test A)

Integration recipe for Pikumo

The Worker already speaks SSE for the existing /generate endpoint (per CLAUDE.md). The integration is two surfaces:

// Worker side — pass through OpenAI SSE to client SSE
const upstream = await fetch("https://api.openai.com/v1/responses", {
  method: "POST",
  headers: { Authorization: `Bearer ${env.OPENAI_API_KEY}`,
             "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "gpt-5.4-mini",
    tools: [{ type: "image_generation",
              quality: "low", size: "1024x1024" }],
    parallel_tool_calls: true,
    max_tool_calls: N,           // panel count
    reasoning: { effort: "high" },
    stream: true,                // ← the lever
    input: [/* user message with attached identity photo + prompt */],
  }),
});

// Parse SSE; forward each image_generation_call.output_item.done
// as a `panel` SSE event to the frontend with the panel index + b64.
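
The parse-and-forward step described in the comment above could look roughly like the sketch below. Both helper names are hypothetical. The upstream event shape is an ASSUMPTION: that a finished image arrives as a `response.output_item.done` event whose `item.type` is `"image_generation_call"` and whose `item.result` holds the base64 image. Check this against the live stream before relying on it.

```javascript
// Split buffered text into complete SSE data payloads plus the unfinished tail.
function splitSseEvents(buffer) {
  const blocks = buffer.split(/\r?\n\r?\n/);
  const rest = blocks.pop(); // last block may still be arriving
  const payloads = blocks
    .map((block) =>
      block
        .split(/\r?\n/)
        .filter((line) => line.startsWith("data:"))
        .map((line) => line.slice(5).trimStart())
        .join("\n")
    )
    .filter((data) => data.length > 0);
  return { payloads, rest };
}

// Convert one upstream payload into a `panel` SSE frame, or null to skip it.
// ASSUMPTION: event type and item fields as described in the lead-in.
function toPanelFrame(payload, index) {
  let evt;
  try {
    evt = JSON.parse(payload);
  } catch {
    return null; // not JSON, ignore
  }
  if (evt.type !== "response.output_item.done") return null;
  if (!evt.item || evt.item.type !== "image_generation_call") return null;
  return `event: panel\ndata: ${JSON.stringify({ index, b64: evt.item.result })}\n\n`;
}
```

In a Worker, the unfinished `rest` string would be carried over and prepended to the next decoded chunk, and `index` incremented each time `toPanelFrame` returns a frame.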

// Frontend (main/app.js) — pop each panel into the album as it arrives
const es = new EventSource(`${API_BASE}/generate?sid=${storyId}`);
es.addEventListener("panel", (e) => {
  const { index, b64 } = JSON.parse(e.data);
  els.panels[index].src = `data:image/png;base64,${b64}`;
});

Notes

Test E — same scene, single ref + LOCATION BIBLE

E: same + single + bible 7231-char prompt · 992 reasoning tokens · 413 s · ~$0.08

Follow-up to Test C's location drift. Same model, same image-tool config, same 8 beats — only the prompt changes. Adds: a 400-word LOCATION BIBLE (terracotta floor, sage-green walls + lower cabinets, cream uppers, brass knobs, blue-and-white delft backsplash, butcher-block counter, six-pane window with countryside + leaning fence post, white farmhouse sink, cream gas stove, copper pot rack with 3 pots, oak table with yellow vase of white daisies + stack of red/navy/cream books + sage-green mug), a PRESERVE LIST naming 15 invariants, and "the SAME X" re-anchors in every beat.

Result: location locks. The kitchen is visibly the same room in all 8 panels — same floor, same cabinets, same delft backsplash, same window-and-countryside view, same yellow-vase-with-daisies on the table. Cost: ~$0.08 (same as Test C); only the prompt did the work.


Findings

All 4 tests succeeded: 32 images, ~$0.30 total spend. Wall-clock per test: ~5 min. Reasoning fired on every call (328 / 547 / 590 / 872 reasoning tokens respectively; the token count rises with prompt difficulty).

1. Identity lock holds in every cell of the matrix

Single ref (A, C) and multi ref (B, D) both produce visibly consistent faces across all 8 beats. With two photos attached, the model correctly anchors each character to its respective reference: the woman matches photo 1's face / hair / age throughout, the man matches photo 2's. They don't swap. They don't blend. The "FIRST photo / SECOND photo" naming convention in the prompt is honored. Multi-character ensembles are production-viable on this surface.

2. Location coherence is the weakness — same-scene drifts

Tests C and D were the location-lock stress test: "all 8 in the SAME single location (a small home kitchen)". The model rendered a small home kitchen in every panel — but not the SAME kitchen. Cabinet color, wall color, window placement, floor pattern, layout, and props all vary panel-to-panel. Beat 1's wooden-table-by-window doesn't match beat 4's white-cabinet kitchen with green door; beat 8 closes in on yet another room.

Why it fails. The thinking step decomposes the prompt into 8 separate image_generation calls, but each call appears to be issued with only its beat's text — no shared "location sheet" passed between calls. There's no equivalent of Pikumo's CONTINUITY ANCHOR reference image being threaded through the set. Identity holds because each call sees the attached identity photo(s); location doesn't hold because no location image is attached.

The likely fix is to render beat 1 first, then attach that rendered image as a location reference for beats 2–8. This is a multi-turn /v1/responses pattern (or two-stage orchestration) rather than the single-call thinking-mode capability. A separate follow-up experiment.
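
A sketch of what that two-stage orchestration might look like at the request-building level. This is untested against the API and not something the experiments here ran. All function names are hypothetical, and passing the stage-1 image back as a base64 data URL in `image_url` is an assumption.

```javascript
const imageTool = { type: "image_generation", quality: "low", size: "1024x1024" };

// Build a user message with one text part followed by image references.
function userMessage(text, imageUrls) {
  return {
    role: "user",
    content: [
      { type: "input_text", text },
      ...imageUrls.map((url) => ({ type: "input_image", image_url: url })),
    ],
  };
}

// Stage 1: render beat 1 alone so it can serve as the location anchor.
function buildStageOne(beats, identityUrl) {
  return {
    model: "gpt-5.4-mini",
    tools: [imageTool],
    max_tool_calls: 1,
    reasoning: { effort: "high" },
    input: [userMessage(`Beat 1: ${beats[0]}`, [identityUrl])],
  };
}

// Stage 2: attach the stage-1 render as a location reference for beats 2..N.
function buildStageTwo(beats, identityUrl, beatOneB64) {
  const locationRef = `data:image/png;base64,${beatOneB64}`; // ASSUMPTION: data URLs accepted
  const remaining = beats.slice(1);
  const text =
    "The FIRST image is the identity photo. The SECOND image shows the location.\n" +
    "Keep the SAME room in every panel.\n" +
    remaining.map((beat, i) => `Beat ${i + 2}: ${beat}`).join("\n");
  return {
    model: "gpt-5.4-mini",
    tools: [imageTool],
    parallel_tool_calls: true,
    max_tool_calls: remaining.length,
    reasoning: { effort: "high" },
    input: [userMessage(text, [identityUrl, locationRef])],
  };
}
```

The trade-off is one extra round trip: stage 2 cannot start until beat 1 has finished rendering.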

3. Diverse-scenes mode is the natural fit for the API as it stands today

Tests A and B (diverse locations) play to the model's strength: each beat names its own setting, identity locks via the attached photo, and the per-beat scene description carries the world. The result is the cleanest of the four: a believable day-in-the-life arc with faithful identity across the set. For Pikumo's "moments of a day / trip / event" story format, this is the production-ready path.

4. Cost lever confirmed at scale

4 tests × 8 images each = 32 image generations + 4 reasoning passes for an estimated ~$0.30 total. The previous single gpt-5.4 high-quality 8-panel run cost $1.78 by itself. This matrix run cost less than that single high-quality call AND produced 4× the output. The gpt-5.4-mini + quality:low + size:1024x1024 recipe is the right default — quality-sensitive surfaces can selectively escalate.
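
Put on a per-image basis, the comparison is easier to read. A quick check using only the two totals quoted above (both are estimates, so treat the ratio as approximate):

```javascript
const highQualityRun = { usd: 1.78, images: 8 }; // single gpt-5.4 high-quality run
const matrixRun = { usd: 0.30, images: 32 };     // this 2×2 matrix, estimated

const perImage = (run) => run.usd / run.images;
const ratio = perImage(highQualityRun) / perImage(matrixRun);

console.log(perImage(highQualityRun).toFixed(4)); // "0.2225"
console.log(perImage(matrixRun).toFixed(4));      // "0.0094"
console.log(ratio.toFixed(1));                    // "23.7"
```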

Bottom line

Updated recommendation (Test E result, 2026-05-26)

For Pikumo's existing storytelling pipeline — which already extracts a sceneBible with palette + setting + time period during Pass B — the integration looks like:

  1. Pre-render a LOCATION BIBLE text block from the sceneBible at thinking-mode call time. Named colors, named props, named layout. ~400 words is enough; the model rewards specificity.
  2. Add a PRESERVE LIST that enumerates the 10-15 invariants every panel must hold.
  3. In every per-beat description, re-anchor with "the SAME [bible-element]" verbatim language for at least 2-3 named elements.
  4. Use gpt-5.4-mini + quality: low + size: 1024x1024. max_tool_calls: N matched to the story's panel count. reasoning.effort: high.
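
Steps 1 to 3 can be sketched as a single prompt builder. The `sceneBible` shape used here (`setting`, `palette`, `elements`) is a guess at what Pass B produces, and rotating through the elements to pick each beat's re-anchors is just one possible policy. Both are assumptions, as is the function name.

```javascript
// Hypothetical sceneBible shape:
//   { setting: string, palette: string[], elements: [{ name, description }] }
function buildLocationPrompt(sceneBible, beats, anchorsPerBeat = 3) {
  const { setting, palette, elements } = sceneBible;

  // Step 1: LOCATION BIBLE with named colors, props, and layout.
  const bible = [
    "LOCATION BIBLE",
    `Setting: ${setting}`,
    `Palette: ${palette.join(", ")}`,
    ...elements.map((el) => `- ${el.name}: ${el.description}`),
  ].join("\n");

  // Step 2: PRESERVE LIST enumerating the invariants.
  const preserve = [
    "PRESERVE LIST (must hold in every panel)",
    ...elements.map((el, i) => `${i + 1}. ${el.name}`),
  ].join("\n");

  // Step 3: re-anchor each beat with "the SAME X" for a few named elements.
  const count = Math.min(anchorsPerBeat, elements.length);
  const beatLines = beats.map((beat, i) => {
    const anchors = [];
    for (let k = 0; k < count; k++) {
      anchors.push(`the SAME ${elements[(i + k) % elements.length].name}`);
    }
    return `Beat ${i + 1}: ${beat} Setting anchors: ${anchors.join(", ")}.`;
  });

  return [bible, preserve, ...beatLines].join("\n\n");
}
```

Rotating the anchors means every invariant is named in some beat even when there are more elements than anchors per beat.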

For 4-panel stories (Pikumo's current shape), this reduces to a ~$0.05 per-story budget (4 × $0.006 images + ~$0.025 text ≈ $0.049; Test Q measured $0.045 at N=4) with both identity AND location coherence across the set, a step change over today's per-panel-isolated pipeline.

Generated 2026-05-26. All 4 tests cost < $0.40 total. Source headshots from eval/pipeline/cast/.