Deep-dive into the system design, component responsibilities, data flows, and key design decisions.
- High-level Overview
- Storytelling Modes
- Repository Layout
- Subsystem A β Dynamic Interaction (Frontend)
- Subsystem B β Multi-agent Story Engine (Backend)
- Subsystem C β Character Workshop
- User Journey & Core Workflows
- Data Flows
- Service Topology & Ports
- Deployment
- Key Design Decisions
- Tech Stack Summary
Gemini Tales is an integrated AI storytelling system built on the Google Agent Development Kit (ADK). It allows users to generate interactive stories through two distinct pathways: Live and Agent-driven, acting as a Creative Storyteller βοΈ that breaks the traditional "text box" paradigm.
| Component | Responsibility | Primary Technology |
|---|---|---|
| Frontend | "Magic Mirror" UI β Real-time interaction & media | React 19 / Vite / Tailwind |
| Main Agent (Puck) | Live Narrator β Handles voice, vision, and interleaved media | Python / FastAPI / gemini-2.5-flash-native-audio-preview-12-2025 (via MODEL_ID in .env) |
| Supporting Brain | Background agents for research, safety, and writing | Google ADK / A2A Protocol |
| Media Factory | Generates cinematic animations, illustrations, and music | Veo 3.1 / Gemini 3.1 Flash-Image (Nano Banana 2) / Lyria 3 |
graph TD
User([User]) <--> Browser["Browser (Magic Mirror UI)"]
subgraph "Main Interaction Agent (Puck)"
Browser <-->|WebSocket| Puck["Puck (Live Narrator)"]
subgraph "Puck's Toolbox"
Illustrator["Illustration Engine"]
Awards["Achievement Manager"]
end
Puck <--> Illustrator
Puck <--> Awards
end
subgraph "Google AI Infrastructure"
Puck <-->|Multimodal Flow| GeminiLive["Gemini Live (via MODEL_ID)"]
Illustrator -->|Video Generation| Veo[Veo 3.1]
Illustrator -->|Dynamic Rendering| FlashImage["Gemini 3.1 Flash-Image (Nano Banana 2)"]
Illustrator -->|Audio Composition| Lyria3[Lyria 3]
end
subgraph "Supporting Brain (Agent Mode)"
Puck -->|Request Pipeline| Orchestrator[Orchestrator]
Orchestrator <-->|A2A| Researcher[Researcher]
Orchestrator <-->|A2A| Judge[Judge]
Orchestrator <-->|A2A| Storysmith[Storysmith]
end
style Browser fill:#f9f,stroke:#333,stroke-width:2px
style Puck fill:#f9f,stroke:#333,stroke-width:2px
style Orchestrator fill:#ccf,stroke:#333,stroke-width:2px
style Illustrator fill:#fff4dd,stroke:#d4a017,stroke-width:2px
Gemini Tales now supports two core experiences, toggled via the UI.
In Live Mode, the system uses the native Gemini Live capabilities for an unscripted, highly interactive session:
- Initiation: The user clicks Connect API, which opens a WebSocket connection using the query parameter
?mode=live(e.g.ws://localhost:8000/ws/puck_live/.../?mode=live). - Backend Behavior: The backend immediately activates the voice agent, which is ready for spontaneous, free-form dialog (the "Improviser").
- Latency: Near-zero.
- Visuals: Triggered by tool-calls during the live conversation.
In Agent Mode, the multi-agent backend pre-generates a story foundation before the live session begins:
- Story Generation: The frontend makes a request to the background ADK agent network (Researcher -> Judge -> Storysmith) to draft the story.
- Transition: Once the scenario/script is ready, the action button transitions to "Wake Puck!" (Π Π°Π·Π±ΡΠ΄ΠΈΡΡ ΠΠ°ΠΊΠ°).
- Initiation: Clicking this button opens a WebSocket connection with the query parameter
?mode=agent(e.g.ws://localhost:8000/ws/puck_live/.../?mode=agent). The frontend then automatically hands over the drafted story blueprint to the Live session. - Backend Behavior: Puck reads the story blueprint and strictly follows the prepared plot, narrating this exact pre-planned story to the child step-by-step.
gemini-tales/
βββ frontend/ # "Magic Mirror" React 19 Frontend
β βββ src/ # TSX components (Gemini Live integration)
β βββ package.json # Node.js dependencies
β
βββ backend/
β βββ app/ # Main Agent (Puck) & Media Factory
β β βββ main.py # FastAPI WebSocket Entry Point
β β βββ avatar_generator.py # Composite Image + Audio generator
β β βββ services/ # Granular Media Logic
β β β βββ music_generator.py # Lyria 3 Music Composition
β β β βββ ...
β β βββ routers/ # Puck Live & Agent Story Endpoints
β β
β βββ agents/ # Supporting ADK Brain
β βββ researcher/ # Adventure Seeker (Search)
β βββ judge/ # Guardian of Balance (Safety/Movement)
β βββ content_builder/ # Storysmith (Narrative)
β βββ orchestrator/ # Pipeline coordination
β βββ shared/ # Shared Safety & Config
β βββ run_local.ps1 # Local start for Brain microservices
β βββ deploy.ps1 # Cloud Run deployment script
Note: backend/agents/orchestrator/_last_trace.json is a generated debug trace of the last pipeline run, used by the trace modal. It is excluded from version control via .gitignore.
The frontend is a high-performance web interface migrated to TypeScript for enhanced stability. It orchestrates a unified multimodal stream.
Unlike traditional chatbots, Gemini Tales uses a synchronized stream:
- Unified Session: A single WebSocket session handles both PCM Audio (captured via
AudioWorklet) and Video Frames (1 FPS JPEG/Base64). - Spatial/Visual Context: Gemini processes video frames in real-time, allowing it to comment on physical actions or surroundings during the audio story.
To minimize friction, the application implements an automatic story trigger:
- Handshake: Browser establishes
ws://connection via the FastAPI proxy. - Agent Sync: Upon
SETUP_COMPLETE, the frontend calls the/api/chat_streamendpoint. - Pre-story Research: The Orchestrator triggers the agent network (Researcher -> Judge -> Storysmith) to generate a structured story context based on search and safety rules.
- Context Injection: The resulting story is injected into the Gemini Live session as a background "memory" trigger.
- Immersive Entry: The AI begins the narrative based on the agent-generated plot, greeting the user with voice and an initial illustration.
The application implements a unique "Stop-and-Watch" mechanism:
- Challenge Trigger: The system instructions guide Gemini to ask for a physical action.
- Immediate Silence: The model is instructed to stop speaking and wait after the request.
- Multimodal Verification: Using the 1 FPS video feed and audio transcription, the "Live" model detects when the child has completed the action and said the magic word, then resumes the story with praise.
The UI includes a robust device initialization flow (fetchDevices) that handles permissions and allows users to swap microphones/cameras on-the-fly without breaking the live session.
| Agent | Model (Configured in .env) |
Key tools / output | ADK type |
|---|---|---|---|
| Adventure Seeker | gemini-3.1-flash-lite-preview (MODEL_NAME_FLASH) |
google_search |
Agent |
| Guardian of Balance | gemini-3.1-flash-lite-preview (MODEL_NAME_FLASH) |
Safety/Quality Evaluation | Agent |
| Storysmith | gemini-3.1-pro-preview (MODEL_NAME) |
High-fidelity narrative | Agent |
| Orchestrator | β | A2A Coordination | SequentialAgent |
stateDiagram-v2
[*] --> ResearchLoop
state ResearchLoop {
[*] --> AdventureSeeker
AdventureSeeker --> GuardianOfBalance: findings
GuardianOfBalance --> EscalationChecker: feedback
state EscalationChecker <<choice>>
EscalationChecker --> [*]: status == "pass" (Escalate)
EscalationChecker --> AdventureSeeker: status == "fail" (Loop)
}
ResearchLoop --> Storysmith
Storysmith --> [*]
EscalationChecker is a custom BaseAgent subclass. It reads session.state["judge_feedback"] and yields an Event(escalate=True) to break the LoopAgent, or an empty event to continue.
Each of the three leaf agents (Researcher, Judge, Content Builder) runs as a standalone A2A server (served by adk_app.py). The Orchestrator connects to them via RemoteA2aAgent, which:
- Reads the agent card from
<agent_url>/.well-known/agent-card.json - Posts tasks over HTTP using the A2A protocol
- Uses an authenticated HTTPX client (
authenticated_httpx.py) to attach Google OAuth2 bearer tokens automatically β required when deployed on Cloud Run
Orchestrator
βββ RemoteA2aAgent("researcher") β HTTP POST http://localhost:8001/a2a/... (Adventure Seeker)
βββ RemoteA2aAgent("judge") β HTTP POST http://localhost:8002/a2a/... (Guardian of Balance)
βββ RemoteA2aAgent("content_builder") β HTTP POST http://localhost:8003/a2a/... (Storysmith)
app/main.py serves three critical functions:
- Static File Hosting: Serves the compiled React frontend from the
dist/directory. - Gemini Live WebSocket Proxy: Exposes a
/ws/proxyendpoint that handles the complex handshake and authentication with the Google Cloud Vertex AI endpoint. - Puck Live WebSocket Endpoint: Mounts
/ws/puck_live/{user_id}/{session_id}which establishes the ADK session-local runner for the narrator/improviser live session.
Proxy Workflow:
- Browser connects to
ws://localhost:8000/ws/proxy?project=...&model=.... - FastAPI backend generates a fresh Google OAuth2 bearer token.
- It establishes a secure WebSocket connection to the LlmBidiService in
us-central1. - It bi-directionally pipes messages between the browser and Google, handling binary audio data and JSON tool calls transparently.
The Media Factory provides a seamless, context-aware visual layer that makes the story feel alive.
- Technology: Veo 3.1 (Google's latest video generation model).
- Function: Transforms the static character description into a 4-second magical video preview.
- Trigger: Activated by the user via the "Animate" button in the Character Workshop.
- Technical Detail: The input reference image is loaded as raw binary data and passed to
generate_videosas atypes.Imagewrapper containingimage_bytesandmime_type(which serializes tobytesBase64EncodedandmimeTypeon the wire, as required by the Veo API).
- Technology: Gemini 3.1 Flash-Image.
- Function: Automatically generates high-quality watercolor illustrations for every new scene.
- Mechanism: The Main Agent (Puck) triggers a
generateIllustrationtool call, which is processed by the backend. TheStoryAvatarGeneratorthen orchestrates a multi-model pipeline: first generating a watercolor image, then passing that image to Lyria to compose a matching background track. - Movement Loop: Puck also uses the
recordMovementtool to update the Heroic Energy state when visual confirmation of an exercise is received.
- Technology: Lyria 3 (Google's latest music generation family).
- Function: Generates 30-second, 48kHz high-fidelity stereo background music.
- Multimodal Link: Uses the generated scene illustration as a visual prompt to ensure the music perfectly matches the current story's mood and setting.
The system supports a multimodal "likeness transfer" flow:
- Input: A real photo (JPEG/PNG) and a fairytale style prompt.
- Process: Gemini analyzes facial features from the upload and "repaints" them in the whimsical watercolor aesthetic, ensuring the child sees themselves as the hero.
Our current system workflow integrates character onboarding and interactive movement tracking:
πΌοΈ Generate img by photo (Nano Banana 2) -> π¬ Animate Puck (Veo 3.1) -> Puck's Music Theme (Lyria 3) -> Main Screen -> Choose Story Mode -> Choose Exercise Focus -> Connect API -> Settings (Mic, Video) = Connected
- Style Selection: The user selects one of 4 character styles: Elf (woodland), Wizard (sorcerer), Royal (prince/princess), or Critter (animal).
- Avatar Generation: The user uploads a photo to the Character Workshop, and Gemini 3.1 Flash-Image (Nano Banana 2) generates a customized watercolor fairytale character portrait keeping their facial details recognizable.
- Animation: Clicking the Animate button prompts Veo 3.1 to generate a 4-second animated video loop of the character.
- Soundtrack: The user selects one of 4 music genres (Forest, Sorcerer, Harp, March), and Lyria 3 composes a custom background theme matching the avatar's style and selected mood.
- Transition: The user enters the main screen to begin the interactive story.
When the child is asked to perform a physical task, the following flow triggers:
- Step A: Pass Focus to ADK Agent: The chosen Exercise Focus (Sky Magic, Earth Magic, or Solar Power) is passed to the backend websocket via the
exercise_modequery parameter, adjusting Puck's instruction context. - Step B: Camera Analysis: The child's movements are sent as a live video stream (1 FPS) to the backend, where the model analyzes the movements.
- Step C: Heroic Energy Award: Once the backend registers movement, it issues a
recordMovementtool call, awarding a set amount of Heroic Energy (e.g.,+15 Energy). - Step D: Visual Effects: The frontend renders a glowing golden particle effect and plays audio/visual feedback to confirm the completion.
sequenceDiagram
participant B as Browser (React UI)
participant P as FastAPI Proxy
participant G as Gemini Live API
B->>P: 1. WebSocket Connection
P->>G: 2. Handshake & Auth (OAuth2)
G-->>P: 3. Setup Complete
P-->>B: 4. Setup Complete
rect rgb(240, 240, 240)
Note over B, G: Real-time Interaction
B->>+P: Audio/Video Stream
P->>+G: Forward Binary
G-->>-P: AI Audio & Transcript
P-->>-B: Forward Response
end
Note over G, B: Tool Calling
G->>P: TOOL_CALL: awardBadge
P->>B: forward awardBadge
The ADK agents are still utilized by the content_builder during specific story transitions or for pre-generating lore, following the same A2A orchestration described in Subsystem B.
| Service | Port | Technology | Start command |
|---|---|---|---|
| App (Frontend + Proxy) | 8000 |
FastAPI + React (dist) | uv run python app/main.py |
| Adventure Seeker | 8001 |
ADK A2A server | uv run python adk_app.py researcher --host 0.0.0.0 --port 8001 --a2a |
| Guardian of Balance | 8002 |
ADK A2A server | uv run python adk_app.py judge --host 0.0.0.0 --port 8002 --a2a |
| Storysmith | 8003 |
ADK A2A server | uv run python adk_app.py content_builder --host 0.0.0.0 --port 8003 --a2a |
| Orchestrator | 8004 |
ADK server | uv run python adk_app.py orchestrator --host 0.0.0.0 --port 8004 --a2a |
All services are started in the correct order by run_local.ps1. A 5-second sleep ensures leaf agents are ready before the orchestrator tries to resolve their agent cards.
The system is designed for a split deployment strategy using two specialized automation scripts. This ensures that the "Supporting Brain" (internal agents) and the "Interaction Head" (Puck + Frontend) are correctly configured and secured.
-
Supporting Agents (
backend/agents/deploy.ps1):- Deploys Researcher, Judge, Content Builder, and Orchestrator.
- Enforces
--no-allow-unauthenticatedfor internal safety. - Orchestrates the capture of service URLs to build the agentic network graph.
-
Main Application (
deploy_app.ps1):- Deploys the unified Gemini Tales App (Puck + Frontend).
- Handles the dual-stage build: compiles the React 19 frontend and wraps it with the FastAPI server.
- Automatically injects
GOOGLE_CLOUD_PROJECTand other metadata as environment variables.
To allow for runtime updates to AI models and parameters without re-compiling the frontend, we use a Dynamic Config Endpoint:
- Backend:
GET /api/configreads secrets (like API Keys and Model IDs) from the Cloud Run environment. - Frontend: During initialization, the React app fetches this data to self-configure.
- Benefit: Judges can swap models or keys via the Google Cloud console, and the "Magic Mirror" will adapt instantly on the next page refresh.
graph TD
subgraph "Public Internet"
UI[Public App URL]
end
subgraph "Google Cloud Run (VPC-Secured)"
App["Gemini Tales App (Frontend + Puck)"]
Orchestrator["Orchestrator Agent"]
Brain["Researcher/Judge/Storysmith Agents"]
end
UI -->|Unauthenticated| App
App -->|OAuth2 Token| Orchestrator
Orchestrator -->|A2A + OAuth2| Brain
The Gemini Tales App is the only public entry point. All inter-agent communication is secured with Google OAuth2 bearer tokens, managed by authenticated_httpx.py.
Observability: The FastAPI app instruments traces with OpenTelemetry and exports them to Google Cloud Trace via CloudTraceSpanExporter.
Instead of calling the Gemini Live API directly from the browser, we use a FastAPI WebSocket proxy. This ensures that the Vertex AI credentials and Project ID remain secure on the server, while still providing a low-latency pipe for audio and video data.
- ADK Messaging Protocol Alignment: When running in ADK proxy mode (
useADK = true), the client wraps text messages in a simplified JSON object ({ text: ... }) expected by the ADKLiveRequestQueueendpoint (which forwards it to the ADKLiveRequestQueue), rather than the raw Gemini Live API JSON structure ({ client_content: ... }).
The front end was migrated from Vanilla JS to a React SPA. This allows for more robust state management of the complex real-time media streams and tool calls, as well as a more responsive and premium UI.
To provide high-quality and safe content, we separated story generation from story delivery. The ADK agent network performs research and safety checks asynchronously before the live conversation starts, ensuring the "Live" AI has a solid, well-researched narrative foundation.
Rather than using a fixed number of research passes, the Judge's output_schema produces a structured { status: "pass"|"fail" } verdict. The EscalationChecker reads this from session state and escalates the loop early when quality is sufficient (up to a safety cap of 3 iterations).
Using the A2A protocol means each agent is independently deployable and scalable. The Orchestrator only needs to know the agent card URL β not the implementation. This also enables mixing agents written in different languages or frameworks in the future.
The Orchestrator saves agent outputs (research_findings, judge_feedback) into ADK session state. Sub-agents read from this state in their prompts via the {state[key]} template syntax. This avoids passing large payloads through function arguments and keeps the inter-agent contract simple.
authenticated_httpx.py wraps google.auth.transport.requests to inject an OAuth2 bearer token into every outgoing request. The same helper is used both by the Orchestrator (to call leaf agents) and by the FastAPI app (to call the Orchestrator). In local development, tokens are sourced from the gcloud CLI.
- Windows Host Support: To support local execution on Windows developer workstations, the wrapper automatically detects the OS and executes
gcloud.cmdinstead ofgcloudwhen invoking the CLI shell command.
Scene generation (draw_story_scene) and character animation are CPU/GPU-heavy operations taking up to 10 seconds. If executed synchronously, they would block the real-time BIDI audio channel, causing stream disconnection. To prevent this, the backend tools immediately return a "started in background" acknowledgment to the LLM model, freeing it to continue talking. The heavy image/music generation runs in a separate background thread (asyncio.to_thread) and pushes the finished assets asynchronously to the client over the active WebSocket using custom event envelopes (ILLUSTRATION).
To ensure the character (Puck) maintains visual consistency during the Character Workshop (such as portrait transformation and rotating through different 360Β° poses), the StoryAvatarGenerator leverages persistent multi-turn chat sessions (chat_avatar and chat_scene) instead of isolated, independent API calls. This preserves the visual design traits in the model's chat history, enabling consistent generation of new poses and scene illustrations.
| Layer | Technology | Version / Specifics |
|---|---|---|
| Main LLM Brain | gemini-3.1-pro-preview |
Configured as MODEL_NAME in .env |
| Fast Reasoning | gemini-3.1-flash-lite-preview |
Configured as MODEL_NAME_FLASH in .env |
| Live Interaction | gemini-2.5-flash-native-audio-preview-12-2025 |
Configured as MODEL_ID in .env |
| Video Production | veo-3.1-generate-preview |
Configured as VIDEO_MODEL_ID in .env |
| Audio Production | lyria-3-clip-preview |
Configured as LYRIA_MODEL_ID in .env |
| Image Production | gemini-3.1-flash-image (Nano Banana 2) |
Configured as VITE_MODEL_ID_IMAGE in .env |
| Agent Framework | Google Agent Development Kit (ADK) | 1.22.0 |
| Frontend | React 19 + Vite | "Magic Mirror" Dashboard |
| Backend | FastAPI (Python 3.12) | Main Agent bridge |
| Deployment | Google Cloud Run | Serverless microservices |