Inside GoClaw — v3.11.0 Release

Native image generation with GPT Image 2 via Codex OAuth

ChatGPT subscriptions (Plus / Pro / Team / Enterprise) already include gpt-image-2 in their quota — Plus has a generous limit, Pro the highest limit plus priority processing.

PR #1002 wires GoClaw's create_image tool into a native pipeline through Codex OAuth. Codex here is the OAuth flow GoClaw uses to call the OpenAI Responses API without needing an API key.

The gpt-image-2-pro-max skill by Richard Ng — a GoClaw contributor — is open-source, MIT-licensed, packaged to the Claude Code Skill spec, and compatible with GoClaw's Skill system. Zip the skill directory, upload through the Dashboard, and agents can use it. The 3,000+ prompt template corpus belongs to Twitter/X creators; the skill only indexes them for faster search. Combine PR #1002 with this skill and the agent can read the skill, brainstorm prompts on its own, then call create_image to render poster-quality images via the ChatGPT subscription quota — no API Usage cost.

01 The problem before PR #1002

The old create_image only worked with an API key

Before PR #1002, GoClaw's create_image tool looked up providers through the credentialProvider interface — providers had to expose two methods, APIKey() string and APIBase() string. CodexProvider uses an OAuth flow and intentionally implements neither: tokens rotate via the refresh_token grant (see internal/oauth/token.go), and the default backend URL https://chatgpt.com/backend-api (constant DefaultProviderAPIBase) is internal only — not part of the credential surface that the tool layer is allowed to touch.

The consequence: ChatGPT Plus/Pro subscriptions have gpt-image-2 in their quota — usable on the web at chatgpt.com/images, usable in the Codex CLI — but a GoClaw agent could not call it because create_image was blocked at the APIKey()/APIBase() check. Users had to fall back to a parallel API-key path. The most popular was Gemini Nano Banana 2 via OpenRouter because:

GoClaw's default imageGenModelDefaults chain maps each provider to one model: openai → gpt-image-1.5, openrouter → google/gemini-2.5-flash-image (Nano Banana 2), gemini → gemini-2.5-flash-image, minimax → image-01, dashscope → wan2.6-image, byteplus → seedream-5-0-260128. Provider priority order is openrouter → gemini → openai → minimax → dashscope → byteplus. All of them require their respective API key.
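
The mapping and priority order above can be sketched as a small lookup. The map and slice values come straight from the text; pickDefault and the hasKey argument are hypothetical names for illustration, not GoClaw's actual resolver.

```go
package main

import "fmt"

// Default image model per provider, as listed in the paragraph above.
var imageGenModelDefaults = map[string]string{
	"openai":     "gpt-image-1.5",
	"openrouter": "google/gemini-2.5-flash-image",
	"gemini":     "gemini-2.5-flash-image",
	"minimax":    "image-01",
	"dashscope":  "wan2.6-image",
	"byteplus":   "seedream-5-0-260128",
}

// Provider priority order, as listed in the paragraph above.
var providerPriority = []string{"openrouter", "gemini", "openai", "minimax", "dashscope", "byteplus"}

// pickDefault is a hypothetical helper: it returns the first provider in
// priority order that has an API key configured, plus its default model.
func pickDefault(hasKey map[string]bool) (provider, model string, ok bool) {
	for _, p := range providerPriority {
		if hasKey[p] {
			return p, imageGenModelDefaults[p], true
		}
	}
	return "", "", false
}

func main() {
	p, m, ok := pickDefault(map[string]bool{"openai": true, "gemini": true})
	fmt.Println(p, m, ok) // gemini gemini-2.5-flash-image true
}
```

With only API-key providers in the chain, a user without any key gets ok == false, which is the gap the native OAuth path closes.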

The problem: users already pay for ChatGPT Plus/Pro every month for their coding workflow, and gpt-image-2 is one of the highest-quality image models — yet that quota sat idle, unusable from the agent.

Before PR #1002 | After PR #1002
Single path: API key (required APIKey() + APIBase()) | Native path via OAuth token added (API-key path still available for openai, gemini, openrouter, …)
gpt-image-1.5 via openai provider + OPENAI_API_KEY (calling api.openai.com/v1/chat/completions); Nano Banana 2 via OpenRouter | gpt-image-2 (default) + gpt-image-1.5 (legacy) via Codex OAuth, calling chatgpt.com/backend-api/codex/responses; consumes subscription quota
Pay-per-image cost | Already paid via subscription
Timeout 120s × 2 retries | Timeout 600s × 1 retry
No prompt provenance | PNG tEXt chunk embedded
1-tier gate (does the provider have credentials?) | 2 tiers (capability + agent config)

Prompt engineering scattered everywhere

The gpt-image-2 community shares prompt templates on X/Twitter in a fairly scattered way. EvoLinkAI maintains the awesome-gpt-image-2-prompts repo (~370 cases), but the data lives as a README plus tweet links — no search engine. An agent wanting to use them has to read and filter by hand.

02 Native image architecture for OAuth

PR #1002 introduces a new interface in internal/providers/native_image.go. The end-to-end flow is shown in the diagram below:

[Diagram: Native image path for OAuth providers. In the agent loop, buildFilteredTools() appends imageGenToolDef only when Loop.allowImageGeneration is true AND provider.Capabilities().ImageGeneration is true, which makes the tool available to the model. A create_image tool call then runs MediaProviderChain.ExecuteWithChain, which loops over the chain entries and resolves each provider. If the provider implements NativeImageProvider (OAuth path), params["_native_provider"] carries the raw provider, CodexProvider.GenerateImage(ctx, req) POSTs to chatgpt.com/backend-api/codex/responses with stream:true and a forced tool_choice, and parseNativeImageSSE turns the SSE stream into base64 and then PNG bytes. Otherwise (API-key path), the cp != nil guard and the credentialProvider check require APIKey() + APIBase() (openai, gemini, openrouter, …), and Provider.Chat(messages, tools) calls api.openai.com/v1/... with the image_generation tool and parses the JSON / tool result into PNG bytes. Both paths end at Result.Media → bus → UI.]

The actual code is a short interface:

type NativeImageProvider interface {
    GenerateImage(ctx context.Context, req NativeImageRequest) (*NativeImageResult, error)
}

create_image looks up providers through MediaProviderChain. When the chain entry is Codex OAuth, the raw provider object is passed into params["_native_provider"]. Inside callProvider:

if rawProvider, ok := params["_native_provider"]; ok {
    if np, ok := rawProvider.(providers.NativeImageProvider); ok {
        // Native path — no APIKey/APIBase needed
        return np.GenerateImage(ctx, ...)
    }
}
if cp == nil {
    return nil, errors.New("provider does not expose API credentials")
}
// Credential path for openai, gemini, minimax, dashscope, byteplus, openrouter

The native check runs before the cp == nil guard. The OAuth provider deliberately doesn't expose credentials; if you flip the order, the request fails with a misleading error — the user sees "missing credential" and goes off configuring credentials, while the real cause is the wrong native path.
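
To make the ordering concrete, here is a minimal, self-contained sketch of the dispatch. The request and result structs, fakeOAuth, tryCall and the simplified callProvider signature are assumptions for illustration; only the interface shape, the "_native_provider" key and the error text come from the article.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Simplified stand-ins for the real request/result types (assumed fields).
type NativeImageRequest struct{ Prompt string }
type NativeImageResult struct{ PNG []byte }

type NativeImageProvider interface {
	GenerateImage(ctx context.Context, req NativeImageRequest) (*NativeImageResult, error)
}

// credentialProvider mirrors the two-method interface described in Section 01.
type credentialProvider interface {
	APIKey() string
	APIBase() string
}

// fakeOAuth stands in for an OAuth provider: native image support, no credentials.
type fakeOAuth struct{}

func (fakeOAuth) GenerateImage(ctx context.Context, req NativeImageRequest) (*NativeImageResult, error) {
	return &NativeImageResult{PNG: []byte("fake-png")}, nil
}

var errNoCredentials = errors.New("provider does not expose API credentials")

// callProvider checks the native path first, then falls through to the credential guard.
func callProvider(ctx context.Context, params map[string]any, cp credentialProvider, prompt string) (*NativeImageResult, error) {
	if raw, ok := params["_native_provider"]; ok {
		if np, ok := raw.(NativeImageProvider); ok {
			return np.GenerateImage(ctx, NativeImageRequest{Prompt: prompt})
		}
	}
	if cp == nil {
		return nil, errNoCredentials
	}
	return nil, errors.New("credential path is out of scope for this sketch")
}

// tryCall is a small demo wrapper that calls callProvider without credentials.
func tryCall(params map[string]any) (string, error) {
	res, err := callProvider(context.Background(), params, nil, "a red fox")
	if err != nil {
		return "", err
	}
	return string(res.PNG), nil
}

func main() {
	out, err := tryCall(map[string]any{"_native_provider": fakeOAuth{}})
	fmt.Println(out, err) // fake-png <nil>
	_, err = tryCall(map[string]any{})
	fmt.Println(err) // provider does not expose API credentials
}
```

If the cp == nil guard ran first, the first call would fail with the credentials error even though the provider can render natively.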

2-tier gate

The image_generation tool is only attached to a request when both conditions hold:

  1. Provider capability: the provider implements CapabilitiesAware and Capabilities().ImageGeneration == true. The Codex provider hardcodes true because the OpenAI Responses API supports the image_generation tool.
  2. Agent config: the AllowImageGeneration field on AgentConfig, default true. An admin can set other_config.allow_image_generation: false to forbid image generation.

Logic in loop_tool_filter.go::buildFilteredTools:

if l.allowImageGeneration {
    if aware, ok := l.provider.(providers.CapabilitiesAware); ok {
        if aware.Capabilities().ImageGeneration {
            toolDefs = append(toolDefs, imageGenToolDef)
        }
    }
}

imageGenToolDef is just the flag {Type: "image_generation"} — no name, no parameters schema. With ordinary function calling the client must define everything (e.g. {type: "function", function: {name, description, parameters}}) so the LLM knows how to pass arguments. Here image_generation is a built-in tool of the Codex Responses API — the server attaches the schema, runs the handler, and the client only needs to enable it via the sentinel.
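
The difference between the two tool shapes is easiest to see on the wire. The sketch below marshals both; toolDef and functionSpec are simplified, hypothetical types (not GoClaw's real structs), and get_weather is a made-up function tool.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// toolDef is a hypothetical, simplified tool definition.
type toolDef struct {
	Type     string        `json:"type"`
	Function *functionSpec `json:"function,omitempty"`
}

// functionSpec is what the client must spell out for ordinary function calling.
type functionSpec struct {
	Name        string         `json:"name"`
	Description string         `json:"description"`
	Parameters  map[string]any `json:"parameters"`
}

// mustJSON marshals v or panics; acceptable for a demo.
func mustJSON(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// Built-in tool: only the type is sent, the server owns schema and handler.
	builtin := toolDef{Type: "image_generation"}
	// Ordinary function tool: the client defines name, description and schema.
	custom := toolDef{Type: "function", Function: &functionSpec{
		Name:        "get_weather",
		Description: "Look up the weather for a city",
		Parameters:  map[string]any{"type": "object"},
	}}
	fmt.Println(mustJSON(builtin)) // {"type":"image_generation"}
	fmt.Println(mustJSON(custom))
}
```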

03 Codex Responses API wire format

Endpoint: POST {apiBase}/codex/responses

map[string]any{
    "model":        "gpt-5.4",  // parent LLM, not the image model
    "stream":       true,        // mandatory — false is rejected with HTTP 400
    "store":        false,
    "instructions": "Generate an image matching the user's description using the image_generation tool. Return only the image; do not describe it in text.",
    "input":        []any{...user message...},
    "tools": []map[string]any{{
        "type":          "image_generation",
        "action":        "generate",
        "model":         "gpt-image-2",   // image model goes here
        "output_format": "png",
        "size":          "1024x1024",
    }},
    "tool_choice": map[string]any{"type": "image_generation"},  // force tool call
}

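Things to watch, as annotated in the request body above: the top-level model is the parent LLM, while the image model goes in tools[0].model; stream must be true (false is rejected with HTTP 400); and tool_choice forces the image_generation tool call.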

Parsing the response

The primary path is the SSE stream, since stream:true is mandatory. End-to-end flow:

[Diagram: Codex Responses API request and SSE parsing. The agent loop calls CodexProvider.GenerateImage(ctx, req), which builds the request body (stream:true, forced tool_choice) and POSTs to /codex/responses. The server streams data: lines for response.created, response.in_progress, output_item.done (item.result holds the base64 PNG) and response.completed, followed by [DONE]. parseNativeImageSSE base64-decodes the result into PNG bytes and returns a NativeImageResult. A defensive non-streaming JSON parse covers the case where the server unexpectedly returns a raw object.]

The code still has a non-streaming JSON parse fallback. If the server unexpectedly returns a raw JSON object (not data: lines), parseNativeImageResponse walks output[], finds type == "image_generation_call", and base64-decodes result. You cannot trigger this path by setting stream:false — the request would be rejected with HTTP 400 first.
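
A minimal sketch of that fallback, assuming the raw object has an output array whose items carry type and result fields as described above. parseImageJSON and responseBody are illustrative names; the real parseNativeImageResponse may handle more cases.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
)

// responseBody models only the fields this sketch needs.
type responseBody struct {
	Output []struct {
		Type   string `json:"type"`
		Result string `json:"result"`
	} `json:"output"`
}

// parseImageJSON walks output[], finds the first image_generation_call item
// and base64-decodes its result.
func parseImageJSON(data []byte) ([]byte, error) {
	var body responseBody
	if err := json.Unmarshal(data, &body); err != nil {
		return nil, fmt.Errorf("decode response: %w", err)
	}
	for _, item := range body.Output {
		if item.Type == "image_generation_call" && item.Result != "" {
			return base64.StdEncoding.DecodeString(item.Result)
		}
	}
	return nil, errors.New("no image_generation_call item in output")
}

func main() {
	// "aGVsbG8=" is base64 for "hello"; a real result would be PNG bytes.
	raw := `{"output":[{"type":"message"},{"type":"image_generation_call","result":"aGVsbG8="}]}`
	img, err := parseImageJSON([]byte(raw))
	fmt.Println(string(img), err) // hello <nil>
}
```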

SSE path: scan data: lines, prioritising two events:

for line := range bytes.SplitSeq(data, []byte("\n")) {
    if !bytes.HasPrefix(line, []byte("data: ")) { continue }
    payload := line[len("data: "):]
    if bytes.Equal(payload, []byte("[DONE]")) { break }
    // ... unmarshal event, switch event.Type
}
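
Filling in the elided part, a self-contained version of the scan could look like the sketch below. It assumes the image arrives in an event of type response.output_item.done whose item has type image_generation_call and a base64 result field; the diagram abbreviates the event as output_item.done, so treat the exact event name as an assumption. It uses bytes.Split rather than bytes.SplitSeq so it also builds on older Go toolchains.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
)

// sseEvent models only the fields this sketch reads.
type sseEvent struct {
	Type string `json:"type"`
	Item struct {
		Type   string `json:"type"`
		Result string `json:"result"`
	} `json:"item"`
}

// parseImageSSE scans "data: " lines and returns the first decoded image result.
func parseImageSSE(data []byte) ([]byte, error) {
	for _, line := range bytes.Split(data, []byte("\n")) {
		line = bytes.TrimRight(line, "\r")
		if !bytes.HasPrefix(line, []byte("data: ")) {
			continue
		}
		payload := line[len("data: "):]
		if bytes.Equal(payload, []byte("[DONE]")) {
			break
		}
		var ev sseEvent
		if err := json.Unmarshal(payload, &ev); err != nil {
			continue // skip malformed events instead of failing the whole stream
		}
		if ev.Type == "response.output_item.done" &&
			ev.Item.Type == "image_generation_call" && ev.Item.Result != "" {
			return base64.StdEncoding.DecodeString(ev.Item.Result)
		}
	}
	return nil, errors.New("stream ended without an image result")
}

const sampleStream = `data: {"type":"response.created"}
data: {"type":"response.output_item.done","item":{"type":"image_generation_call","result":"aGVsbG8="}}
data: [DONE]
`

func main() {
	img, err := parseImageSSE([]byte(sampleStream))
	fmt.Println(string(img), err) // hello <nil>
}
```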

04 Model whitelist + chain config

Server-side whitelist

var allowedImageModels = map[string]bool{
    "gpt-image-2":   true, // default — latest quality
    "gpt-image-1.5": true, // legacy fallback
}

ValidateImageModel rejects any other value (e.g. dall-e-3) right at the provider — don't let upstream reject silently. Users pick a model via the chain entry param image_model; the value flows into NativeImageRequest.ImageModel, gets validated, then attaches to tools[0].model on the Responses API request.
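
A minimal sketch of such a validator, reusing the whitelist above. The lowercase name, the return shape and the rule that an empty value falls back to the default are assumptions; the article only states that non-whitelisted values are rejected and that gpt-image-2 is the default.

```go
package main

import "fmt"

var allowedImageModels = map[string]bool{
	"gpt-image-2":   true, // default
	"gpt-image-1.5": true, // legacy fallback
}

const defaultImageModel = "gpt-image-2"

// validateImageModel returns the model to send in tools[0].model.
// Assumption: an empty chain param means "use the default".
func validateImageModel(model string) (string, error) {
	if model == "" {
		return defaultImageModel, nil
	}
	if !allowedImageModels[model] {
		return "", fmt.Errorf("unsupported image model %q", model)
	}
	return model, nil
}

func main() {
	fmt.Println(validateImageModel(""))              // gpt-image-2 <nil>
	fmt.Println(validateImageModel("gpt-image-1.5")) // gpt-image-1.5 <nil>
	_, err := validateImageModel("dall-e-3")
	fmt.Println(err) // unsupported image model "dall-e-3"
}
```

Failing here, before any HTTP call, gives the operator a clear local error instead of an opaque upstream rejection.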

Note on gpt-image-1.5: Before PR #1002 it was the default model for the openai provider (imageGenModelDefaults["openai"] = "gpt-image-1.5") — running on the API-key path, calling api.openai.com/v1/chat/completions with OPENAI_API_KEY. After PR #1002 it stays in the native-path whitelist as a legacy fallback — agents can still pick it via Codex OAuth. The PR adds a new path; it doesn't remove the old model. Both models run native; the default is gpt-image-2.

Chain config

The default MediaProviderChainEntry:

Timeout    int  // seconds, default 600 (10 min — image/video gen is slow)
MaxRetries int  // default 1 (image gen rarely succeeds on retry)

The timeout was bumped from 120s to 600s because gpt-image-2 in practice takes 30–180s. If a timeout fires mid-flight, the upstream request can keep running anyway, so retries usually don't help. MaxRetries dropped from 2 to 1 for the same reason — surface failures faster instead of hiding them in a retry loop.
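
As a rough illustration of how these defaults might be applied, the sketch below fills unset values and derives a per-attempt deadline. chainEntry, withDefaults and attemptTimeout are hypothetical names, and treating zero as "unset" is an assumption about the real struct.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// chainEntry keeps only the two fields discussed above.
type chainEntry struct {
	Timeout    int // seconds
	MaxRetries int
}

const (
	defaultTimeoutSec = 600
	defaultMaxRetries = 1
)

// withDefaults fills in unset values (assumption: zero or negative means unset).
func withDefaults(e chainEntry) chainEntry {
	if e.Timeout <= 0 {
		e.Timeout = defaultTimeoutSec
	}
	if e.MaxRetries <= 0 {
		e.MaxRetries = defaultMaxRetries
	}
	return e
}

// attemptTimeout converts the configured seconds into a time.Duration.
func attemptTimeout(e chainEntry) time.Duration {
	return time.Duration(withDefaults(e).Timeout) * time.Second
}

func main() {
	e := withDefaults(chainEntry{})
	// Each attempt would run under its own deadline derived from the entry.
	ctx, cancel := context.WithTimeout(context.Background(), attemptTimeout(e))
	defer cancel()
	_, hasDeadline := ctx.Deadline()
	fmt.Println(e.Timeout, e.MaxRetries, hasDeadline) // 600 1 true
}
```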

The UI renders the configuration form from the schema in ui/web/src/pages/builtin-tools/media-provider-params-schema.ts, so admins can tweak it per tenant and per tool.

05 Provenance: PNG tEXt chunk + caption UI

When create_image saves a file, pngEmbedPrompt rewrites the PNG byte stream and inserts a tEXt chunk Description just before IEND. The function lives in tools/png_embed.go to avoid a tools→agent import cycle:

Description = <prompt sent by the user>

The standard PNG tEXt chunk format and its insertion point in the byte stream:

[Diagram: PNG tEXt chunk insertion before IEND. Input PNG: 8-byte signature, IHDR (13 bytes), one or more IDAT chunks, IEND (12 bytes). The function scans for IEND and splices a tEXt chunk in just before it: 4-byte big-endian uint32 length, 4-byte ASCII type "tEXt", data "Description\0<prompt>" (keyword, NUL, text), 4-byte CRC32. Output PNG: signature, IHDR, IDAT..., tEXt (new, holds the prompt), IEND. Failure-safe: if the input lacks the PNG signature or no IEND is found, the function returns the bytes unchanged (no error).]
Note: The module agent.EmbedPNGPrompt additionally writes a Software = goclaw chunk, but the current create_image path doesn't call into it — only Description is embedded.

The function silently skips if the input lacks a PNG signature (8-byte magic) or has no IEND chunk. Failure mode: return the original bytes, no error.
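
The splice itself is small enough to sketch in full. This is not the code from png_embed.go: embedDescription, buildTextChunk and minimalPNG are illustrative names, and the sketch walks chunk headers to find IEND rather than scanning for the byte pattern. The chunk layout (length, type, data, CRC32 over type + data) follows the PNG specification.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

var pngSignature = []byte{0x89, 'P', 'N', 'G', '\r', '\n', 0x1a, '\n'}

// buildTextChunk returns a complete tEXt chunk: length, type, data, CRC32.
func buildTextChunk(keyword, text string) []byte {
	data := append([]byte(keyword), 0) // keyword + NUL separator
	data = append(data, text...)
	body := append([]byte("tEXt"), data...) // the CRC covers type + data
	chunk := make([]byte, 4, 4+len(body)+4)
	binary.BigEndian.PutUint32(chunk, uint32(len(data)))
	chunk = append(chunk, body...)
	return binary.BigEndian.AppendUint32(chunk, crc32.ChecksumIEEE(body))
}

// embedDescription inserts a "Description" tEXt chunk just before IEND.
// On any structural problem it returns the input unchanged, with no error.
func embedDescription(png []byte, prompt string) []byte {
	if !bytes.HasPrefix(png, pngSignature) {
		return png
	}
	off := len(pngSignature)
	for off+12 <= len(png) { // each chunk: 4 length + 4 type + data + 4 CRC
		length := int(binary.BigEndian.Uint32(png[off : off+4]))
		if string(png[off+4:off+8]) == "IEND" {
			chunk := buildTextChunk("Description", prompt)
			out := make([]byte, 0, len(png)+len(chunk))
			out = append(out, png[:off]...)
			out = append(out, chunk...)
			return append(out, png[off:]...)
		}
		next := off + 12 + length
		if length < 0 || next > len(png) {
			break // truncated or corrupt chunk
		}
		off = next
	}
	return png
}

// minimalPNG is a test fixture: signature + empty IEND chunk (not a viewable image).
func minimalPNG() []byte {
	iend := []byte{0, 0, 0, 0, 'I', 'E', 'N', 'D', 0xAE, 0x42, 0x60, 0x82}
	return append(append([]byte{}, pngSignature...), iend...)
}

func main() {
	out := embedDescription(minimalPNG(), "a red fox")
	fmt.Println(bytes.Contains(out, []byte("tEXtDescription\x00a red fox"))) // true
	fmt.Println(string(embedDescription([]byte("not a png"), "x")))          // not a png
}
```

One caveat from the PNG specification: tEXt text is defined as Latin-1, so prompts with other characters would strictly call for an iTXt chunk; the article does not say how GoClaw handles that case.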

Consequence: the original prompt travels with the downloaded file. Read it back via exiftool image.png (the Description field) or any PNG metadata viewer. Auditing, regenerating similar images, and debugging prompts all get easier.

The UI renders a caption beneath the image inside MediaGallery: muted italic, two-line clamp, hover tooltip showing the full prompt. Files are saved at: {workspace}/generated/{YYYY-MM-DD}/image_{hint}_{timestamp}.png.

06 Skill gpt-image-2-pro-max: prompt registry via the Skill system

The gpt-image-2-pro-max skill, by GoClaw contributor Richard Ng, is open-source under the MIT license. The 3,000+ prompt templates in the corpus belong to Twitter/X creators; the skill only indexes them for faster search, every returned prompt cites the original author, and the corpus credits the upstream EvoLinkAI/awesome-gpt-image-2-prompts repo. The skill is packaged per Claude Code skill convention; GoClaw ships a compatible Skill system, so the same skill directory works on both.

Anatomy

[Diagram: Skill resolution (5-tier hierarchy) and skill anatomy.]

Resolution order, highest priority first (Loader.LoadForContext walks the tiers; first match wins):

  1. workspace/skills/: per-project, highest priority
  2. workspace/.agents/skills/: project agent skills
  3. ~/.agents/skills/: personal agent skills
  4. ~/.goclaw/skills/: global / managed (DB-versioned)
  5. builtin: bundled with the binary

Skill directory anatomy: gpt-image-2-pro-max/ contains SKILL.md (frontmatter with name and description, plus the workflow body that is injected into the agent's system prompt) and scripts/search.py (a Python HTTP client that the agent calls via the existing bash tool).

Watcher hot-reload: internal/skills/watcher.go detects SKILL.md changes and bumps the version; the next agent turn picks up the new skill with no restart.
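
The "first match wins" rule across the on-disk tiers can be sketched as a simple directory walk. resolveSkill, defaultRoots and demo are hypothetical helpers; the real Loader.LoadForContext also covers the builtin tier and DB-managed versions, which this sketch leaves out.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// defaultRoots lists the four on-disk tiers in priority order.
func defaultRoots(workspace, home string) []string {
	return []string{
		filepath.Join(workspace, "skills"),
		filepath.Join(workspace, ".agents", "skills"),
		filepath.Join(home, ".agents", "skills"),
		filepath.Join(home, ".goclaw", "skills"),
	}
}

// resolveSkill returns the first root that contains <name>/SKILL.md.
func resolveSkill(roots []string, name string) (string, bool) {
	for _, root := range roots {
		dir := filepath.Join(root, name)
		if info, err := os.Stat(filepath.Join(dir, "SKILL.md")); err == nil && !info.IsDir() {
			return dir, true
		}
	}
	return "", false
}

// demo installs the same skill in tiers 2 and 4 of a temp tree and reports
// which copy wins, as a path relative to the workspace.
func demo() (string, error) {
	tmp, err := os.MkdirTemp("", "skills-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(tmp)
	workspace, home := filepath.Join(tmp, "ws"), filepath.Join(tmp, "home")
	roots := defaultRoots(workspace, home)
	for _, root := range []string{roots[1], roots[3]} {
		dir := filepath.Join(root, "gpt-image-2-pro-max")
		if err := os.MkdirAll(dir, 0o755); err != nil {
			return "", err
		}
		if err := os.WriteFile(filepath.Join(dir, "SKILL.md"), []byte("---\nname: demo\n---\n"), 0o644); err != nil {
			return "", err
		}
	}
	dir, ok := resolveSkill(roots, "gpt-image-2-pro-max")
	if !ok {
		return "", fmt.Errorf("skill not found")
	}
	rel, err := filepath.Rel(workspace, dir)
	return filepath.ToSlash(rel), err
}

func main() {
	fmt.Println(demo()) // .agents/skills/gpt-image-2-pro-max <nil>
}
```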

Corpus: 3,238 community-vetted prompts hosted at https://gpt-image-2-prompts.goclawoffice.com. Tags are inferred along 10 facets: subjects, styles, lighting, cameras, moods, palettes, compositions, mediums, techniques, usecases. Each record carries the prompt body, Twitter/X attribution, and a reference image. The endpoint is IP rate-limited and fair-use friendly.

CLI surface the agent calls:

python scripts/search.py "luxury shoe ecommerce ad cream pastel" -n 5
python scripts/search.py "perfume bottle" --shape ecommerce -n 3
python scripts/search.py "neon ui" --persist plans/neon-refs.md

Sample output for the first query (3 hits shown, prompt bodies trimmed):

#1  bm25=-10.76  shape=ecommerce  source=None
  id    : 64dp29km
  title : E-commerce Main Image - Luxury Perfume Ad on Marble Vanity
  author: @MiguelMaestroIA
  tweet : https://x.com/MiguelMaestroIA/status/2047555836252151831
  image : https://gpt-image-2-prompts.goclawoffice.com/img/64dp29km_0
  imgid : 64dp29km_0
  tags  : cameras=1-1 | compositions=negative-space | moods=dreamy,edgy,elegant,intense,luxurious,minimal | palettes=crimson-burgundy,duotone,monochrome | styles=cinematic,editorial | subjects=product | techniques=parameterised-template,text-overlay-explicit | usecases=ecommerce-main-image,poster-flyer
  prompt:
    A luxury cosmetics advertisement poster featuring a single upright {argument name="product type" default="lipstick"} centered on a glossy black cube pedestal against a rich monochrome {argument name="background color" default="deep crimson red"} studio background. The product is a bold satin-finish {argument name="product shade" default="true red"} lipstick with the bullet fully extended, dramatic...

#2  bm25=-9.85  shape=  source=None
  id    : z9q36mnc
  title : Futuristic Bionic Super Shoe
  author: @Ericool 🇲🇾
  tweet : https://x.com/EricoolWong/status/2048353098897453286
  image : https://gpt-image-2-prompts.goclawoffice.com/img/z9q36mnc_0
  imgid : z9q36mnc_0
  tags  : cameras=low-angle | moods=energetic,futuristic,intense,luxurious | palettes=gold-black | styles=cinematic | subjects=fashion-item,product | techniques=parameterised-template
  prompt:
    Extreme futuristic {argument name="subject" default="cheetah bionic super shoe"}, hybrid of supercar and running sneaker, aggressive mechanical structure, layered carbon fiber, glowing energy core, dynamic speed trails, {argument name="colors" default="black gold"} luxury finish, dramatic low angle, cinematic lighting, high-end sports ad, ultra detailed, 8k

#3  bm25=-9.82  shape=ecommerce  source=None
  id    : 0zjvnji7
  title : Vitamin C Skincare Ad
  author: @Gem Alpha
  tweet : https://x.com/Gemalpha_88/status/2046796479562678589
  image : https://gpt-image-2-prompts.goclawoffice.com/img/0zjvnji7_0
  imgid : 0zjvnji7_0
  tags  : cameras=1-1 | moods=dreamy,edgy,elegant,luxurious,minimal,warm-emotional | palettes=earth-tones | styles=photorealistic | subjects=abstract,product | techniques=aspect-explicit,parameterised-template
  prompt:
    Create a clean luxury skincare product advertisement in a square 1:1 layout with a warm beige studio background and strong natural sunlight casting soft palm-leaf shadows across the wall. Place 1 amber glass dropper bottle on the left-center, standing upright on a round cream stone pedestal. The bottle has a glossy gold collar and a matte white rubber dropper top. Add a white rectangular label wit...

(3 matched, showing 3)

Output is ranked with BM25 and includes prompt_body, tags.{moods,palettes,subjects,...}, author, tweet, image. The main filters: --shape, --has-image, -n, --full, --persist.

07 Setup via the Web Dashboard

GoClaw ships a Web Dashboard with a sidebar grouped into Core / Capabilities / System. This setup touches four sections: Agents (Core), Skills (Capabilities), Providers + Builtin Tools (System). Minimum setup needs only one agent — every agent gets create_image through the global Provider Chain from Step 2a (Step 2b for read_image is optional; models with native vision can skip it).

Step 1 — Connect a Codex provider

Sidebar → System / Providers → Add Provider. In the Add Provider modal pick ChatGPT Subscription (OAuth), keep or edit the Account Alias (e.g. openai-codex), enter a Display Name if needed, then click Connect OpenAI Account. The system opens a new OAuth tab for signing in to ChatGPT (Plus, Pro, Team and Enterprise all work) and granting access.

After sign-in, the browser is redirected to a callback like http://localhost:1455/auth/callback?code=...&state=.... If you're on a remote/VPS host and the browser can't reach localhost, copy the full address bar URL, paste it into the callback field in the modal, then click Submit. When the Providers list shows the new provider with status Connected, you're done.
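
For reference, the pasted callback URL carries everything the server needs in its query string. The sketch below extracts code and checks state, which is standard OAuth practice; parseCallback is a hypothetical helper, and whether GoClaw validates state exactly this way is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// parseCallback extracts the authorization code from a pasted callback URL
// and verifies that the state matches the one issued at the start of the flow.
func parseCallback(raw, wantState string) (string, error) {
	u, err := url.Parse(strings.TrimSpace(raw))
	if err != nil {
		return "", fmt.Errorf("parse callback URL: %w", err)
	}
	q := u.Query()
	if q.Get("state") != wantState {
		return "", errors.New("state mismatch: restart the OAuth flow")
	}
	code := q.Get("code")
	if code == "" {
		return "", errors.New("callback URL has no code parameter")
	}
	return code, nil
}

func main() {
	code, err := parseCallback(" http://localhost:1455/auth/callback?code=abc123&state=xyz ", "xyz")
	fmt.Println(code, err) // abc123 <nil>
}
```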

[Screenshot] Step 1 · Add Provider modal in the OAuth flow: ChatGPT Subscription OAuth, account alias openai-codex, status 'waiting for authentication', and a callback URL paste field for remote/VPS hosts.

Step 2a — Enable + configure create_image

Sidebar → System / Builtin Tools (route /builtin-tools) → tab Media → Create Image. The Enabled flag defaults to OFF (see seed data at cmd/gateway_builtin_tools.go:55); toggle it ON first, since it is the master switch for the whole tenant. Then click Configure to open the Provider Chain modal.

The Create Image — Provider Chain modal lets you order fallback providers; the first enabled entry is tried first. Minimum setup: add one Codex Plus entry, model GPT-5.4, Timeout 600s (image gen can run long for complex prompts), Retries 1, and most importantly — pick Default · gpt-image-2 (recommended) for Image model. Hit Save. Click Add Provider if you want a fallback (e.g. OpenAI API key) for graceful degradation when OAuth quota runs out.

[Screenshot] Step 2a · Provider Chain for create_image (Codex Plus + GPT-5.4 + gpt-image-2): entry #1 Codex Plus enabled, Provider dropdown set to Codex Plus, Model GPT-5.4, Timeout 600s, Retries 1, Image model 'Default · gpt-image-2 (recommended)', Add Provider button below and Cancel/Save.

Step 2b — Enable + configure read_image (optional)

Skip this if the agent's main model already has native vision (Qwen3.6-Plus, etc.) — the runtime falls back to inline mode, attaching the image bytes to the message for the LLM to read. Configure this if you use a text-only model or want to separate the vision provider from the reasoning model to optimise cost/latency.

Same page /builtin-tools → tab Media → Read Image. Toggle Enabled ON (default OFF). Note the req badge next to the tool name: it is short for "requires" and marks the tool's dependency. Hover to see the vision_provider requirement: you need at least one vision-capable provider registered in Step 1 (e.g. Gemini, OpenAI, Anthropic, OpenRouter, dashscope qwen-vl) for the toggle to make sense; otherwise the tool throws "No vision provider configured" at runtime.

Once enabled, click Configure to open the Provider Chain modal for Read Image. Example setup from the screenshot: one OpenRouter entry, model google/gemini-2.5-flash-image, Timeout 120s (vision calls are usually quick, so there is no need for 600s like image gen), Retries 3 (vision calls are cheap; retrying is safe if the provider is flaky). This is a SEPARATE chain for read_image, not shared with create_image.

Routing: if the read_image chain is configured, every read_image call goes through it — even when the agent's main model has native vision. If no chain is configured, the image is attached inline to the message for the LLM to read directly (only works if the model supports vision). Code: internal/agent/media_tool_routing.go.
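
The routing rule reduces to two booleans. The sketch below is a simplified model of that decision, not the code in media_tool_routing.go; routeReadImage and canRead are hypothetical names.

```go
package main

import "fmt"

type readImageRoute string

const (
	routeChain  readImageRoute = "chain"  // dedicated read_image provider chain
	routeInline readImageRoute = "inline" // image bytes attached to the message
)

// routeReadImage applies the rule above: a configured chain always wins,
// even when the main model has native vision.
func routeReadImage(chainConfigured bool) readImageRoute {
	if chainConfigured {
		return routeChain
	}
	return routeInline
}

// canRead reports whether the chosen route can actually interpret the image:
// inline mode only works when the main model supports vision.
func canRead(chainConfigured, modelHasVision bool) bool {
	return chainConfigured || modelHasVision
}

func main() {
	fmt.Println(routeReadImage(true), canRead(true, true))    // chain true
	fmt.Println(routeReadImage(false), canRead(false, true))  // inline true
	fmt.Println(routeReadImage(false), canRead(false, false)) // inline false
}
```

The last case is the one to avoid: a text-only main model with no read_image chain receives image bytes it cannot interpret.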

[Screenshot] Step 2b · Provider Chain for read_image (OpenRouter + google/gemini-2.5-flash-image, Timeout 120s, Retries 3): entry #1 OpenRouter enabled, Provider dropdown set to OpenRouter, Add Provider button below and Cancel/Save.

Why two chains? create_image needs an image-gen model (gpt-image-2, DALL-E 3, ...). read_image needs a vision model (Gemini 2.5 Flash, GPT-4o-mini, ...). Different model classes, different providers, different billing, different latency/retry profiles (image gen takes 4–8 minutes — retries are expensive; vision calls take seconds — retries are cheap) — so GoClaw stores them as two separate chains in builtin_tools.settings (see internal/tools/media_provider_chain.go:64-100). Enable and configure each tool independently.

Step 3 — Create the Agent

Sidebar → Core / Agents → New Agent. Minimum fields: Name, Provider, Model (e.g. Tiểu Hổ + qwen3.6-plus, or any model smart enough to read images and run the skill). Save → the agent appears in the list. This agent's job is to do reasoning: analyse the brief, read images, search the corpus, refactor the prompt, then call create_image.

[Screenshot] Step 3 · Agent tieu-ho after creation (provider qwen, model qwen3.6-plus): agent card 'Tiểu Hổ' (handle tieu-ho) with an active badge, description 'A versatile personal assistant for fast, accurate task handling, work management, and health & habit reminders', Full + Evolving badges and a 200K ctx context window.

Step 4 — Upload the gpt-image-2-pro-max skill

Clone and zip the skill from upstream:

macOS / Linux · bash

git clone https://github.com/therichardngai-code/gpt-image-2-pro-max /tmp/g2pm
cd /tmp/g2pm/.claude/skills/gpt-image-2-pro-max
zip -r ~/Desktop/gpt-image-2-pro-max.zip .

Windows · PowerShell

git clone https://github.com/therichardngai-code/gpt-image-2-pro-max $env:TEMP\g2pm
Set-Location $env:TEMP\g2pm\.claude\skills\gpt-image-2-pro-max
Compress-Archive -Path * -DestinationPath $HOME\Desktop\gpt-image-2-pro-max.zip -Force

Sidebar → Capabilities / Skills → Upload Skill → drop the zip → the Dashboard parses the SKILL.md frontmatter, pulls name + description, then saves the skill record to the DB (version is an integer the DB auto-assigns and increments on each upload; it is not taken from frontmatter). The skill shows up in the list with an Enabled toggle.

Finally, grant the skill to the agent you just created: open Agent detail → Skills → toggle gpt-image-2-pro-max to granted. The loader injects SKILL.md into the agent's system prompt from the next turn, with no restart needed. This skill is a prompt-engineering pipeline: it teaches the agent how to diagnose the brief, search the 3,238-prompt corpus, pick a mood-appropriate template, refactor and resolve slots, and only then call create_image with a polished prompt.
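
In the skill the agent resolves slots through reasoning, but the mechanical part is easy to picture. The sketch below fills the {argument name="..." default="..."} slots seen in the sample output of Section 06, falling back to the template default when no value is supplied. resolveSlots and slotRe are hypothetical names, not part of the skill.

```go
package main

import (
	"fmt"
	"regexp"
)

// slotRe matches slots of the form {argument name="subject" default="..."}.
var slotRe = regexp.MustCompile(`\{argument name="([^"]+)" default="([^"]*)"\}`)

// resolveSlots replaces each slot with the supplied value, or with the
// template's own default when the value is missing or empty.
func resolveSlots(template string, values map[string]string) string {
	return slotRe.ReplaceAllStringFunc(template, func(m string) string {
		sub := slotRe.FindStringSubmatch(m)
		if v, ok := values[sub[1]]; ok && v != "" {
			return v
		}
		return sub[2]
	})
}

func main() {
	tpl := `Extreme futuristic {argument name="subject" default="cheetah bionic super shoe"}, {argument name="colors" default="black gold"} luxury finish`
	fmt.Println(resolveSlots(tpl, map[string]string{"subject": "red fox"}))
	// Extreme futuristic red fox, black gold luxury finish
}
```

Falling back to the default rather than inventing a value mirrors the skill's stated rule for ambiguous slots.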

[Screenshot] Step 4a · Upload Skill modal (drop the ZIP, preview the valid entry): instructions to upload a ZIP containing SKILL.md with YAML frontmatter (name, description, slug); drop zone; file gpt-image-2-pro-max.zip, 7.3 KB, with a NEW badge; status '1 of 1 valid'; Upload (1) button.
[Screenshot] Step 4b · Agent detail tieu-ho (qwen3.6-plus) with the grant toggle for gpt-image-2-pro-max ON: badges Full / V3 / Evolving; Dreaming Memory Consolidation section (Enabled, threshold 5, debounce 600000ms); Heartbeat (not set up); Hooks (not configured); Skills (1/1) listing gpt-image-2-pro-max with an internal badge; description 'Production prompt-engineering pipeline for GPT-Image-2 / OpenAI image generation. Pairs a media-designer agent with a hosted searc...'.
Advanced pattern (optional) — split into two agents via Agent Team: You can bundle an orchestrator (lead, runs the skill, dispatches tasks) and an image worker (only calls create_image) into one Agent Team: the orchestrator holds the long context, the worker stays lightweight for rendering, and audit/trace are cleanly separated through the team task board. Create the team at Sidebar → Core / Teams → add both agents as members; the runtime switches the agent to ModeTeam (internal/agent/orchestration_mode.go) — full team tasks + delegate + spawn tools become available. But it is NOT required by the runtime — a single agent is enough to run the full workflow.

Tracing one turn (post-setup)

User brief: "Tet peak-season poster with a Vietnamese red fox, infographic style"

[Diagram: tracing one turn, from the user brief to a reply with prompt citation, in 7 steps.]

  1. Loader.LoadForContext matches the SKILL.md description against the brief and injects the SKILL.md body into the system prompt. Precondition: the skill is granted to the agent (Step 4b).
  2. The agent analyses the brief: it reads the workflow in SKILL.md and extracts facets (subject=fox · mood=festive · shape=infographic · palette=red-gold).
  3. The agent calls scripts/search.py via the bash tool: python search.py "lunar new year fox infographic poster" --shape infographic -n 5 --has-image
  4. search.py queries the corpus server over HTTP (https://gpt-image-2-prompts.goclawoffice.com, BM25 + tag boost ranking), which returns the top-5 records (prompt body + tags + author + twitter_link + reference image URL).
  5. The agent picks a record and refactors the prompt: filter tags moods=festive ∧ palettes=red-gold, substitute {argument name="subject"} = "red fox", and inject detail: "Vietnamese tet decorations, lì xì, mai blossom".
  6. create_image(prompt=<refactored>, aspect_ratio="3:4") runs the native path from Section 02 (Codex Responses API, image_generation tool, 2-tier gate).
  7. The agent replies with the image attached plus the original prompt's author/twitter_link (provenance).

Detailed visual trace from a real run: PR #1002 · UX trace.

Multi-tenant: in tenant mode the sidebar gains a System / Tenants section. Each tenant has its own Codex provider + agents + skill grants — upload the skill once to shared scope and grant it to tenants as needed, no skill cloning required.

08 Workflow: generating an image in GoClaw

"GoClaw users — generate images with GPT Image 2 in a single prompt. Upload a base image → the agent refines the prompt → image is generated right inside GoClaw.

P.S. I set up a main model on Qwen3.6-Plus (+ read_image) and GPT_Image_2 (create_image)." — the author (Richard Ng)

This is the workflow shared by the author himself: just upload the source image + type a few keywords (e.g. "make an ecommerce ad poster"). The agent does the rest — one upload is all it takes.

Minimum setup in GoClaw

Agent: tieu-ho (Tiểu Hổ) · qwen3.6-plus · with the gpt-image-2-pro-max skill granted. This is the only agent the user talks to. It runs reasoning (analyse the brief, search the corpus, refactor the prompt) and calls create_image directly. Qwen3.6-Plus has native vision, so it reads uploaded images directly, with no need to configure the read_image tool.
Tool create_image: A builtin tool. The runtime attaches it to the tool list whenever the tenant has enabled the toggle at /builtin-tools (Step 2a, Section 07) and the agent has AllowImageGeneration=true (default). Provider Chain → Codex Plus + gpt-image-2 renders the PNG. The agent's main model (Qwen3.6-Plus) only handles reasoning and is unrelated to the image-gen model; the runtime routes media tools through their own chain. The read_image tool (Step 2b) only needs to be configured if the agent's main model is text-only (no vision capability).

Workflow: brief → finished image

[Diagram: Workflow from brief to image, 1 turn vs 2 turns, branching on user intent ("generate now" vs "review first").]

User input (turn 1): upload poster.jpg plus brief text, e.g. "make an ecommerce ad poster".

Agent reasoning, running the gpt-image-2-pro-max skill in 6 steps (SKILL.md gets injected into the system prompt when granted to the agent):

  1. Read image: read poster.jpg for subject, palette, composition.
  2. Search corpus: scripts/search.py, BM25 → top-5 prompts.
  3. Pick template: mood-mismatch filter, pick a base.
  4. Refactor: parameterise the template with {argument} slots.
  5. Resolve: fill slots; default-fallback if ambiguous, never invent.
  6. Output 4-block: Base · Parameterised · Resolved · Rationale.

Branch A, 1 turn (autonomous): the agent does NOT end_turn and chains the next tool call. Step ⑦ calls create_image(prompt=Resolved) → Provider Chain → Codex / gpt-image-2 → PNG. Reply: PNG + 4-block + reference. One user action, done.

Branch B, 2 turns (review-first): the agent end_turns after step ⑥ and waits for user approval. Reply in turn 1: 4-block + reference image. The user reads the Resolved prompt and OKs it or tweaks slots. User input (turn 2): "OK, generate it" → triggers ⑦ create_image, and the PNG is returned. The turn-2 user input is the only trigger for branch B to continue, because the agent already end_turn'd in turn 1.

GoClaw runtime: NO hard gate on create_image. The branch is decided 100% by the agent's reasoning based on the brief's wording: "generate now / make the image" → A · "suggest a prompt / show me a draft" → B. Tool loop: internal/agent/loop_run.go · max 30 LLM iterations / user turn (DefaultMaxIterations).

1 turn or 2 turns — depends on the instruction: A "generate the image" brief (like the demo above — branch A in the diagram) → the agent chains all 7 steps in a single user turn, autonomously, until the PNG is back. A "suggest a prompt for me to review" brief (branch B) → the agent stops after step ⑥ (returns the 4-block + reference) and end_turns; user approval / slot tweak in turn 2 then triggers the create_image call. The GoClaw runtime supports both — no hard gate on create_image, every decision lives in the agent's reasoning.

[Image] Workflow output: ad poster for HARBORIIS shoes, generated in one turn from a base image plus the keyword "make an ecommerce ad poster". The poster showcases Mary Jane shoes, with three models in cream silk outfits seated around an oversized floral-embroidery shoe on a pastel cream background with pearl spheres and crystal formations; tagline 'Where Softness Becomes Form. Floral mesh, weightless step, quiet confidence.' along with four feature icons (Sheer Mesh, Floral Embroidery, Soft Cream Tone, Hand-Finished); the top corner reads 'Designed with ChatGPT'.

09 Best practices

The two most important tips when running this combo:

  1. The more specific the brief, the more accurate the search. When the user types only "red fox", the agent searches for "red fox" and gets back all kinds of unrelated templates (forest fox, cartoon fox, realistic fox, etc.). Add keywords for shape (poster / infographic / portrait / ad…), mood (festive / moody / minimal…), and palette (red-gold / pastel / neon…) and the search snaps to your intent. Example: "Tet poster, Vietnamese red fox, infographic style, festive red-gold palette" → the agent's search query carries infographic festive red-gold, enough to filter to the right templates.
  2. Don't drop the timeout below 600s. Generating complex images (e.g. an infographic with lots of in-image text) typically takes 4–8 minutes server-side — that's not an error, that's realistic. The old default of 120s × 2 retries often got cut off mid-flight (context deadline exceeded) and ruined the run. PR #1002 changes it to 600s × 1 retry: wait longer, give the server enough time; one retry only, because retrying after a timeout doubles your cost and rarely succeeds. Operators can override if they need to, but don't go below 600s — you'll almost certainly hit the old failure mode again.

10 Technical summary

File references

Component | File | Role
Native image interface | internal/providers/native_image.go | The NativeImageProvider interface, ValidateImageModel, SizeFromAspect
Codex implementation | internal/providers/codex_native_image.go | Build the body, parse JSON / SSE responses
Tool entry | internal/tools/create_image.go | Tool dispatch, chain resolution, native path
Provider chain | internal/tools/media_provider_chain.go | Chain timeout 600s, max_retries 1 default
PNG embed (runtime) | internal/tools/png_embed.go | pngEmbedPrompt: inserts a "Description" tEXt chunk before IEND
PNG embed (2-chunk) | internal/agent/png_metadata.go | EmbedPNGPrompt writes 2 chunks (Description + Software); not yet called by create_image
Tool filter gate | internal/agent/loop_tool_filter.go | 2-tier gate: capability AND allowImageGeneration
Vision routing | internal/agent/media_tool_routing.go | hasReadImageProvider: file-ref vs inline mode for uploaded images
Orchestration mode | internal/agent/orchestration_mode.go | ModeTeam / ModeDelegate / ModeSpawn resolved from team + agent links
Builtin tool seed | cmd/gateway_builtin_tools.go | Default Enabled: false + Requires dependencies (vision, image_gen, ...)

Skill backend

The skill is uploaded via the Dashboard (see Section 07 · Step 4). scripts/search.py calls an external corpus host; BM25 + tag-boost ranking lives on the server, not in GoClaw core.