Pokemon Mystery Dungeon Red - Autonomous Agent

Multi-model Qwen3-VL agent with hierarchical RAG system, dynamic temporal resolution, and live dashboard for autonomous Pokemon Mystery Dungeon Red gameplay.

🎮 Project Overview

Goal: Build an autonomous agent that can play Pokemon Mystery Dungeon Red using:

  • Multi-scale visual reasoning (Qwen3-VL 2B/4B/8B)
  • Hierarchical RAG with 7 temporal resolution silos
  • Dynamic FPS adjustment (30fps → 1fps) and frame multipliers
  • Live searchable dashboard (GitHub Pages + You.com Content API)
  • Cost-aware model routing and vision optimization

Tech Stack:

  • Emulator: mgba + mgba-http (960x640 @ 30fps)
  • Vision Models: Qwen3-VL-2B/4B/8B (Thinking + Instruct variants)
  • Vector DB: ChromaDB or FAISS (multi-scale temporal embeddings)
  • Dashboard: GitHub Pages (static) + You.com Content API (retrieval)
  • Control: Python + mgba-http API

🎬 Demo Video

Watch the 3-minute agent demo (MP4) — 180 seconds of autonomous gameplay with Kokoro TTS narration, automatically generated from agent trajectory and You.com knowledge retrieval.

Submission snapshot:

  • Branch: deadline-2025-10-30-2355-PT (frozen @ 23:55 UTC-7)
  • Tag: deadline-2025-10-30-2359-PT (final submission timestamp)

⚡ Quick Start (2 minutes)

  1. Activate environment:

    mamba activate agent-hackathon
  2. Run demo (50 steps + 3-min video):

    cd pokemon-md-agent
    python scripts/final_demo_runner.py
  3. View results:

    • Video: agent_demo.mp4
    • Logs: runs/demo_*/trajectory_*.jsonl

⚡ Quick Start (5 minutes)

Prerequisites

  • mGBA emulator (version 0.10.5+, matching the setup instructions below) with Lua socket server running on port 8888
  • Pokemon Mystery Dungeon - Red Rescue Team (USA, Australia) ROM file
  • Python 3.11+ with conda/mamba environment
  • Save file with game in Tiny Woods or similar dungeon (starter floor)

Setup

  1. Place ROM & Save in rom/ directory:

    # From project root (pokemon-md-agent/)
    cp "Pokemon Mystery Dungeon - Red Rescue Team.gba" ./rom/
    cp "Pokemon Mystery Dungeon - Red Rescue Team.sav" ./rom/
  2. Create & activate conda environment (installs Kokoro TTS + MoviePy):

    mamba create -n agent-hackathon python=3.11 -y
    mamba activate agent-hackathon
    pip install -r requirements.txt
  3. Configure You.com Content API (optional but recommended):

    # Persist YOUR key in PowerShell profile or current session
    $Env:YOU_API_KEY = "<your-you-api-key>"
    
    # Smoke test (live mode) - replace URL with a domain you expect the agent to use
    python -m scripts.check_you_api --url https://www.serebii.net/dungeon/redblue/d001.shtml --live
    # macOS/Linux example
    export YOU_API_KEY="<your-you-api-key>"
    python -m scripts.check_you_api --url https://www.serebii.net/dungeon/redblue/d001.shtml --live
    • On success, each checked URL prints a line like • https://... -> OK | ...
    • If you skip this step (or the key is invalid), the agent falls back to placeholder content.
  4. Start mGBA with Lua socket server (Windows PowerShell example):

    # Ensure mGBA-http is loaded (Lua console > File > Load script > mGBASocketServer.lua)
    # Server defaults to port 8888
    # Verify with: python .temp_check_ram.py
  5. Run final demo (50-step agent + 3-min video + Kokoro voiceover):

    mamba activate agent-hackathon
    cd pokemon-md-agent
    python scripts/final_demo_runner.py

    Output:

    • runs/demo_*/trajectory_*.jsonl - Full trajectory data
    • agent_demo.mp4 - 3-minute montage video (key frames + Kokoro TTS narration)
    • Console logs show real-time progress

Expected Timeline

  • Initialization: ~5s
  • Agent execution (50 steps): ~30-60s
  • Video generation: ~10-20s
  • Total: ~1-2 minutes

Troubleshooting

  • Failed to connect to mGBA → verify mGBA is running, the socket server is active, and port 8888 is open
  • No ROM files found → check that the ROM + SAV files are in the ./rom/ directory
  • Perception failed: unpack requires... → stale mGBA connection; restart the emulator
  • Video generation failed → ensure opencv-python is installed: pip install opencv-python

📊 Dashboard & Monitoring

The agent includes a comprehensive dashboard system for monitoring gameplay and retrieving external knowledge.

HF_HOME Path Sanitization

Critical Fix Applied: All HF_HOME environment variable usages have been sanitized to handle quoted paths, normalize separators, and support user path expansion. This resolves model loading failures on Windows systems where HF_HOME may contain quotes or need path expansion.

Applied sanitization template:

  • Strip surrounding quotes (", ')
  • Expand user paths (~ → actual home directory)
  • Normalize path separators for cross-platform compatibility
  • Added comprehensive test coverage in test_path_sanitization.py

Verification: Model loading tested with real Qwen3-VL models from HuggingFace Hub, confirming proper cache directory resolution and tokenizer loading.
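
A minimal sketch of the sanitization steps above; the function name and default path are illustrative, not the project's actual implementation:

import os

def sanitize_hf_home(raw: str) -> str:
    # Strip surrounding quotes (single or double)
    cleaned = raw.strip().strip('"').strip("'")
    # Expand ~ to the user's home directory
    expanded = os.path.expanduser(cleaned)
    # Normalize path separators for the current platform
    return os.path.normpath(expanded)

cache_dir = sanitize_hf_home(os.environ.get("HF_HOME", "~/.cache/huggingface"))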

Dashboard Features

  • Live Updates: Real-time trajectory logging and meta-view generation
  • Searchable Content: Client-side FAISS indexes for fast similarity search
  • Rate Limiting: Token bucket rate limiting (30 files/min, 300/hour) with exponential backoff (see the sketch after this list)
  • Build Budget: Coalesces commits to ≤10/hour to avoid GitHub Actions limits
  • LFS Avoidance: Keeps artifacts under 8MB; no Git LFS unless required
  • Resolution Modes: 2× (480×320) default for dashboard, 1× (240×160) for Qwen-VL benchmarking
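
A minimal token-bucket sketch matching the 30 files/min limit above; the class and parameter names are illustrative:

import time

class TokenBucket:
    def __init__(self, rate_per_min: float, capacity: int):
        self.rate = rate_per_min / 60.0  # tokens refilled per second
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

uploader_bucket = TokenBucket(rate_per_min=30, capacity=30)  # 30 files/min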

👁️ Vision Prompts & Game State Schema

Phase 1 Implementation (Complete): Structured JSON schema for vision model outputs.

GameState Schema

The agent uses a Pydantic-validated GameState schema for consistent vision model outputs:

from src.models.game_state_schema import GameState, Entity, GameStateEnum

# Models must return JSON matching this schema
state = GameState(
    player_pos=(12, 8),
    player_hp=45,
    floor=3,
    state=GameStateEnum.EXPLORING,
    enemies=[Entity(x=14, y=8, type="enemy", species="Geodude")],
    items=[Entity(x=10, y=6, type="item", name="Apple")],
    confidence=0.95,
    threats=["Geodude approaching"],
    opportunities=["Move up to dodge"]
)

Key Features:

  • 51 unit tests (1.28s runtime) covering validation, serialization, edge cases
  • Type-safe coordinates (0-indexed bounds checking, negative rejection)
  • Confidence scoring (0-1 range, quality metrics)
  • JSON roundtrip (validation + serialization)
  • Few-shot examples (3-5 predefined examples for in-context learning)

Quick Validation

# Activate environment
mamba activate agent-hackathon

# Run schema tests (51 tests)
python -m pytest tests/test_game_state_schema.py tests/test_game_state_utils.py -v

# Quick validation
python scripts/test_vision_schema.py

# (Windows) PowerShell validation script
.\scripts\validate_vision_schema.ps1 -RunTests

Phase 2: Vision System Prompts

Phase 2 Implementation (Complete): Structured system prompts for Qwen3-VL vision models.

Key Features

  • 58 unit + integration tests (1.34s runtime) covering prompt variants and message integration
  • Instruct variant — Direct JSON output for 2B/4B models
  • Thinking variant — Chain-of-thought reasoning for reasoning-enabled models
  • PromptBuilder class — Type-safe prompt assembly with few-shot examples
  • Message packager integration — Seamless integration with three-message protocol
  • Model-specific optimization — 2B/4B use instruct, 8B uses thinking variant

System Prompts

Located in src/models/vision_prompts.py:

from src.models.vision_prompts import (
    VISION_SYSTEM_PROMPT_INSTRUCT,
    VISION_SYSTEM_PROMPT_THINKING,
    PromptBuilder,
    format_vision_prompt_with_examples
)

# Build complete prompt with context and examples
builder = PromptBuilder("instruct")
builder.add_few_shot_examples(3)
builder.add_context(policy_hint="explore", model_size="4B")

prompt = builder.build_complete_prompt()
# Returns: {"system": "...", "user": "..."}

# Or use high-level function
complete = format_vision_prompt_with_examples(
    policy_hint="battle",
    model_variant="thinking",
    num_examples=3,
    model_size="8B"
)

Message Packager Integration

Located in src/orchestrator/message_packager.py:

from src.orchestrator.message_packager import pack_with_vision_prompts

# Pack game state with vision prompts
step_state = {...}  # From Copilot or agent state
system_prompt, messages = pack_with_vision_prompts(
    step_state,
    policy_hint="explore",
    model_size="4B",
    num_examples=3
)

# Returns: (system_prompt_str, [msg1, msg2, msg3])
# Ready to send to Qwen3-VL with three-message protocol

Quick Validation

# Activate environment
mamba activate agent-hackathon

# Run all vision tests (Phase 1 + Phase 2)
python -m pytest tests/test_game_state_*.py tests/test_vision_prompts.py tests/test_message_packager_vision.py -v

# Quick validation (Phase 2 only)
python scripts/test_vision_prompts.py

# (Windows) PowerShell validation script
.\scripts\validate_vision_prompts.ps1 -RunTests

Prompt Characteristics

Instruct Variant (2B/4B models)

  • ~2,234 characters
  • Direct JSON output format
  • Explicit requirements and rules
  • Optimized for smaller models with less reasoning capability
  • Focus on clear instructions and schema compliance

Thinking Variant (8B+ reasoning models)

  • ~2,521 characters
  • 6-step chain-of-thought reasoning (OBSERVATION → CLASSIFICATION → STATE → THREATS → CONFIDENCE → JSON)
  • Encourages explicit reasoning about visual input
  • Better for models that benefit from intermediate reasoning steps
  • Chain of thought helps with complex multi-entity scenes

Utility Functions

Located in src/models/game_state_utils.py:

# Parse model output with validation
state = parse_model_output(json_str, partial_ok=True, confidence_threshold=0.7)

# Validate state quality
report = validate_game_state(state)
print(f"Quality: {report['quality_score']:.2f}")
print(f"Warnings: {report['warnings']}")

# Generate few-shot examples
examples = generate_few_shot_examples(num_examples=3)

# Format for agent decisions
text = format_state_for_decision(state)

🛡️ Agent Gatekeeper

The Agent Gatekeeper provides safety filtering for agent actions, ensuring only valid and vetted actions are executed.

Features

  • Safety Filtering: Rejects explicitly invalid actions (e.g., self-destruct, quit, exit, die, end_game)
  • ANN Validation: Requires ≥3 shallow ANN hits to permit actions via vector similarity search
  • Fallback Behavior: On ANN failure, conservatively rejects all actions
  • Async Operation: Supports async ANN search for non-blocking validation

Usage

from src.agent.gatekeeper import Gatekeeper
from src.retrieval.ann_search import VectorSearch

# Initialize with ANN search dependency
ann_search = VectorSearch(index_path="path/to/ann/index")
gatekeeper = Gatekeeper(ann_search=ann_search, min_hits=3)

# Filter actions
valid_actions = ["move", "attack", "use_item"]
state = {"ascii": "dungeon_grid", "player_x": 10, "player_y": 10}
filtered_actions = await gatekeeper.filter(valid_actions, state)
# Returns: ["move", "attack", "use_item"] if ANN validation passes

Integration

The gatekeeper is automatically integrated into the agent reasoning pipeline:

  • Actions are extracted from LLM responses
  • Passed through gatekeeper filtering before execution
  • Invalid actions trigger fallback to safe defaults

🧪 Testing & Profiling

Important: Always cd to the repo root (absolute path) before running tests; the scripts enforce this.

Test Scripts

Fast Lane (≤3 minutes):

# Windows PowerShell
mamba info --envs; python --version; mamba activate agent-hackathon;
if (-not (Test-Path 'C:\Homework\agent_hackathon\pokemon-md-agent\pyproject.toml')) { Write-Error 'Not at repo root'; exit 2 }
Set-Location -Path 'C:\Homework\agent_hackathon\pokemon-md-agent';
$env:PYTHONPATH='C:\Homework\agent_hackathon\pokemon-md-agent\src';
python -m pytest -q --maxfail=1 -m "not slow and not network and not bench and not longctx"
# Git Bash
mamba info --envs && python --version && mamba activate agent-hackathon && \
[ -f /c/Homework/agent_hackathon/pokemon-md-agent/pyproject.toml ] || { echo "Not at repo root"; exit 2; } && \
cd /c/Homework/agent_hackathon/pokemon-md-agent && pwd && ls -la && \
export PYTHONPATH=/c/Homework/agent_hackathon/pokemon-md-agent/src && \
python -m pytest -q --maxfail=1 -m "not slow and not network and not bench and not longctx"

Full Suite (10-15 minutes):

# Windows PowerShell
mamba info --envs; python --version; mamba activate agent-hackathon;
if (-not (Test-Path 'C:\Homework\agent_hackathon\pokemon-md-agent\pyproject.toml')) { Write-Error 'Not at repo root'; exit 2 }
Set-Location -Path 'C:\Homework\agent_hackathon\pokemon-md-agent';
$env:PYTHONPATH='C:\Homework\agent_hackathon\pokemon-md-agent\src';
python -m pytest -q
# Git Bash
mamba info --envs && python --version && mamba activate agent-hackathon && \
[ -f /c/Homework/agent_hackathon/pokemon-md-agent/pyproject.toml ] || { echo "Not at repo root"; exit 2; } && \
cd /c/Homework/agent_hackathon/pokemon-md-agent && pwd && ls -la && \
export PYTHONPATH=/c/Homework/agent_hackathon/pokemon-md-agent/src && \
python -m pytest -q

Profiling & Benchmarking

Bench Sweep (5-10 minutes):

# Windows PowerShell
mamba info --envs; python --version; mamba activate agent-hackathon;
Set-Location -Path 'C:\Homework\agent_hackathon\pokemon-md-agent';
$env:PYTHONPATH='C:\Homework\agent_hackathon\pokemon-md-agent\src';
python profiling/bench_qwen_vl.py --models all --time-budget-s 180 --full --plot

Sync Profiling Data:

# Windows PowerShell
mamba info --envs; python --version; mamba activate agent-hackathon;
Set-Location -Path 'C:\Homework\agent_hackathon\pokemon-md-agent';
Copy-Item "..\profiling\*" ".\profiling\" -Recurse -Force -Exclude "__pycache__"

Test Markers

  • @pytest.mark.slow: Long-running tests
  • @pytest.mark.network: Network-dependent tests
  • @pytest.mark.bench: Performance benchmarks
  • @pytest.mark.longctx: Long context tests
  • @pytest.mark.real_model: Real model inference tests

Outputs

  • Test results: Console output with session summary and top slow tests
  • Bench results: profiling/results/<UTC_ISO>/ (CSV, JSONL, plots)
  • Profiling data: Consolidated in profiling/ directory


📁 Project Structure

pokemon-md-agent/
├── README.md                          # This file
├── AGENTS.md                          # Instructions for code agents (Copilot/Claude Code)
├── requirements.txt                   # Python dependencies
├── .gitignore                        # Git ignore patterns
│
├── docs/                             # Architecture & design documents
│   ├── pokemon-md-rag-system.md     # RAG system architecture
│   ├── pokemon-md-dashboard.md      # Dashboard design
│   ├── pokemon-md-agent-scaffold.md # Agent scaffold & environment
│   └── embedding-types.md           # Detailed embedding strategy
│
├── src/                              # Source code
│   ├── agent/                       # Agent core
│   │   ├── __init__.py
│   │   ├── qwen_controller.py       # Multi-model Qwen3-VL orchestration
│   │   ├── model_router.py          # 2B/4B/8B routing logic
│   │   └── memory_manager.py        # Scratchpad & persistent memory
│   │
│   ├── orchestrator/                # Message orchestration
│   │   ├── __init__.py
│   │   └── message_packager.py      # Three-message protocol with model presets
│   │
│   ├── embeddings/                  # Embedding generation & storage
│   │   ├── __init__.py
│   │   ├── extractor.py             # Extract embeddings from Qwen3-VL
│   │   ├── temporal_silo.py         # 7 temporal resolution managers
│   │   └── vector_store.py          # ChromaDB wrapper
│   │
│   ├── vision/                      # Screenshot processing
│   │   ├── __init__.py
│   │   ├── sprite_detector.py       # Qwen3-VL sprite detection
│   │   ├── grid_parser.py           # Convert to tile grid for pathfinding
│   │   └── ascii_renderer.py        # ASCII state for blind LLMs
│   │
│   ├── environment/                 # mgba integration
│   │   ├── __init__.py
│   │   ├── mgba_controller.py       # mgba-http API wrapper
│   │   ├── fps_adjuster.py          # Dynamic FPS & frame multiplier
│   │   └── action_executor.py       # Button press execution
│   │
│   ├── retrieval/                   # RAG system
│   │   ├── __init__.py
│   │   ├── auto_retrieve.py         # Automatic trajectory retrieval
│   │   ├── circular_buffer.py       # On-device circular buffer (60-min window)
│   │   ├── cross_silo_search.py     # Multi-scale search
│   │   ├── deduplicator.py          # pHash/sprite-hash deduplication
│   │   ├── embedding_generator.py   # Text/image embedding generation
│   │   ├── keyframe_policy.py       # Keyframe selection (SSIM/floor/combat triggers)
│   │   ├── local_ann_index.py       # SQLite ANN index for KNN search
│   │   ├── meta_view_writer.py      # 2×2 meta-view generation
│   │   ├── on_device_buffer.py      # Orchestrates all buffer components
│   │   └── stuckness_detector.py    # Loop detection
│   │
│   └── dashboard/                   # Live dashboard
│       ├── __init__.py
│       ├── uploader.py              # Batch upload to GitHub Pages
│       ├── content_api.py           # You.com Content API wrapper
│       └── similarity_precompute.py # Pre-compute comparison pages
│
├── tests/                           # Unit tests
│   ├── test_mgba_connection.py
│   └── test_on_device_buffer.py
│
├── demos/                           # Visual demonstrations
│   └── embedding_visualization.py
│
├── examples/                        # Example usage
│   └── quickstart.py
│
├── research/                        # Related papers & inspirations
│   └── qwen3-vl-summary.md
│
└── config/                          # Configuration files
    ├── agent_config.yaml            # Agent behavior settings
    ├── embedding_config.yaml        # Embedding strategy config
    └── mgba_config.ini              # mgba settings

🚀 Quick Start (Post-Fix)

  1. Start mGBA with ROM + Lua script:

    C:\Homework\agent_hackathon\rom\Pokemon Mystery Dungeon - Red Rescue Team (USA, Australia).gba
    C:\Homework\agent_hackathon\pokemon-md-agent\config\save_files\game_start_save.ss0
    C:\Homework\agent_hackathon\pokemon-md-agent\src\mgba-harness\mgba-http\mGBASocketServer.lua
    
  2. Run demo:

    cd pokemon-md-agent
    python demo_agent.py --max-steps 50
  3. View results:

    ls -lt runs/  # Latest run folder

Troubleshooting

  • Screenshot locked: Fixed in v1.1 (auto-retry with exponential backoff)
  • Socket error: Fixed in v1.1 (proper cleanup on disconnect)
  • WRAM defaults: Check config/addresses/pmd_red_us_v1.json offsets

Prerequisites (Original Setup)

  • Python 3.11+ (with CUDA support for GPU acceleration)
  • mgba with mgba-http enabled (Lua-only setup)
  • Pokemon Mystery Dungeon Red ROM (you provide)
  • GPU: NVIDIA GPU with CUDA support recommended (RTX 30-series or newer)

Installation (Original)

# Clone or extract this repo
cd pokemon-md-agent

# Install PyTorch with CUDA support first (required for GPU acceleration)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

# Install as editable package
pip install -e .

ℹ️ New dependency: The benchmark now uses nano-graphrag for retrieval-augmented prompt scaffolding. It is included in requirements.txt and is installed automatically by the editable package command above.

Note: The installation automatically detects your CUDA version and GPU architecture to install the correct PyTorch and Unsloth versions. If you encounter CUDA detection issues, you can manually run Unsloth's auto-install script first:

# Optional: Run Unsloth's auto-detection script
python -c "import urllib.request; exec(urllib.request.urlopen('https://raw.githubusercontent.com/unslothai/unsloth/main/unsloth/_auto_install.py').read())"

Verified: This installation method has been tested and confirmed to work with:

  • PyTorch 2.9.0+cu128 (CUDA 12.8, compatible with CUDA 12.9)
  • Unsloth v2025.10.10 with Qwen3-VL support
  • RTX 4090 GPU (Ada Lovelace architecture)

Configure mgba (Lua-Only Setup)

Important: This project uses mgba-http with Lua socket server. No Python socket server needed.

  1. Download mgba v0.10.5+ from mgba.io
  2. Place your Pokemon Mystery Dungeon Red ROM in the rom/ directory
  3. Start mgba and load the game:
    • Load the ROM: File → Load ROM → select rom/Pokemon Mystery Dungeon - Red Rescue Team (USA, Australia).gba
    • Load the save file: File → Load State File → select config/game_start.sav
    • Load the Lua script: Tools → Scripting → Load script → select src/mgba-harness/mgba-http/mGBASocketServer.lua
  4. The Lua script will start the HTTP server automatically on port 8888

Save Slot Advice:

  • Slot 0: Title screen (for reset)
  • Slot 1: Floor ready (for benchmark loops) - agent loads this automatically
  • Slot 2: Last autosave
  • Slots 3-98: Manual saves
  • Slot 99: Final save on agent shutdown

The agent will automatically load slot 1 on startup for consistent benchmarking.

Run Agent (Original)

python examples/quickstart.py

📊 Benchmarking

Comprehensive 3D Performance Analysis

The project includes a comprehensive benchmark harness for measuring Qwen3-VL model performance across context lengths, batch sizes, and task types:

# Run comprehensive benchmark with 3D analysis
python profiling/bench_qwen_vl.py --models all --tasks all --num-runs 3

# Dry run for testing (no actual model inference)
python profiling/bench_qwen_vl.py --dry-run --models "unsloth/Qwen3-VL-2B-Instruct-unsloth-bnb-4bit" --tasks "text_only"

# Custom configuration
python profiling/bench_qwen_vl.py --models "unsloth/Qwen3-VL-4B-Instruct-unsloth-bnb-4bit,unsloth/Qwen3-VL-8B-Instruct-unsloth-bnb-4bit" --max-new-tokens 256

Benchmark Features

Context Length Scaling: Tests from 1024 to 256k tokens (262k max for Qwen3-VL) on log2 scale

  • 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144 tokens

Batch Size Optimization: Tests batch sizes 1, 2, 4, 8 with automatic model-aware limits

  • 2B models: up to batch size 8
  • 4B models: up to batch size 4
  • 8B models: up to batch size 2
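
A short sketch of how the model-aware limits above constrain the sweep; the dictionary and helper names are illustrative:

MODEL_BATCH_LIMITS = {"2B": 8, "4B": 4, "8B": 2}  # limits listed above

def batch_sizes_for(model_size: str) -> list[int]:
    # Keep only the benchmark batch sizes this model can handle
    limit = MODEL_BATCH_LIMITS[model_size]
    return [b for b in (1, 2, 4, 8) if b <= limit]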

Task Performance Analysis: Four micro-benchmark tasks

  • text_only: Text summarization
  • vision_simple: Basic image description
  • vision_complex: Tactical situation analysis
  • mixed_reasoning: Strategic decision making

3D Visualizations: Interactive performance landscapes

  • Throughput surfaces (context × batch size × tokens/sec)
  • Performance contour maps
  • Optimal batch size curves
  • Log-scale context length plots

Expected Outputs

  • CSV Data: profiling/data/comprehensive_benchmark_results.csv with all measurements
  • 3D Surface Plots: profiling/plots/3d_throughput_surfaces.png
  • Performance Landscapes: profiling/plots/performance_landscapes.png
  • Optimization Curves: profiling/plots/batch_optimization.png
  • Context Scaling: profiling/plots/log_context_throughput.png

Interpreting Results

Throughput Analysis:

  • Higher values indicate faster inference
  • Look for inflection points where performance degrades
  • Compare batching vs non-batching efficiency

Performance Scores:

  • 0.0-1.0 scale based on response quality heuristics
  • Task-specific scoring (conciseness, descriptiveness, strategy)

Optimal Configurations:

  • Batch size curves show sweet spots for each context length
  • 3D surfaces reveal performance saddle points
  • Contour maps highlight efficient operating regions

Model-Specific Limits

Model        Max Context  Max Batch  Typical Throughput
Qwen3-VL-2B  32,768       8          60-80 tokens/sec
Qwen3-VL-4B  65,536       4          40-60 tokens/sec
Qwen3-VL-8B  131,072      2          20-40 tokens/sec

The benchmark automatically respects these limits and provides consistent comparison across all supported Qwen3-VL variants.


Text-Speed Guarantee

The agent implements a text-speed guarantee feature to ensure OCR capture of dialogue frames:

  • Menu Profile: src/mgba-harness/profiles/set_text_speed_slow.json navigates Options → Text Speed → Slow on boot
  • RAM Fallback: Direct memory poke to text-speed setting when allow_memory_write enabled and ROM hash safe
  • Input Pacing: A button taps throttled to ≥1 second intervals during textboxes for reliable OCR capture
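
A minimal pacing sketch for the ≥1 second A-tap throttle; the controller.press interface is an assumption, not the project's actual API:

import time

MIN_TAP_INTERVAL_S = 1.0  # text-speed guarantee from the list above

class PacedInput:
    def __init__(self, controller):
        self.controller = controller  # hypothetical button-press interface
        self._last_tap = 0.0

    def tap_a(self) -> None:
        # Sleep just long enough to keep taps >= 1 second apart for OCR
        wait = MIN_TAP_INTERVAL_S - (time.monotonic() - self._last_tap)
        if wait > 0:
            time.sleep(wait)
        self.controller.press("A")  # assumed press(button) method
        self._last_tap = time.monotonic()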

Multi-Scale Temporal Embeddings

7 temporal resolution silos with dynamic FPS adjustment:

Silo              Base Sample Rate  Agent-Adjustable FPS  Context Span
temporal_1frame   Every frame       30→10→5→3→1 fps       0-4 sec
temporal_2frame   Every 2nd frame   -                     0-8 sec
temporal_4frame   Every 4th frame   -                     0-16 sec
temporal_8frame   Every 8th frame   -                     0-32 sec
temporal_16frame  Every 16th frame  -                     0-64 sec
temporal_32frame  Every 32nd frame  -                     0-128 sec
temporal_64frame  Every 64th frame  -                     2+ min

Agent can dynamically:

  • Adjust base FPS (30→1fps) to "zoom out" temporally
  • Change frame multipliers (4x→8x→16x) for finer resolution
  • Allocate memory budget across silos (e.g., 3/4 for last 5 min)
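
To make the silo layout concrete, a short sketch of which silos sample a given frame index; the stride list mirrors the table above and the helper name is illustrative:

SILO_STRIDES = [1, 2, 4, 8, 16, 32, 64]  # temporal_1frame ... temporal_64frame

def silos_for_frame(frame_idx: int) -> list[int]:
    # A frame is sampled by every silo whose stride divides its index
    return [stride for stride in SILO_STRIDES if frame_idx % stride == 0]

assert silos_for_frame(64) == [1, 2, 4, 8, 16, 32, 64]
assert silos_for_frame(6) == [1, 2]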

Embedding Types (Corrected)

Input embeddings:

  • input: Hidden states of what was sent to the model

Thinking models (reasoning-aware):

  • think_input: Hidden state at/before </think> + input
  • think_full: Hidden state before </s> (full input+output)
  • think_only: Embedding of only <think>...</think> block
  • think_image_input: Like think_input but image-only input
  • think_image_full: Like think_full but image-only input
  • think_image_only: Image-only reasoning (experimental)

Instruct models (fast, no reasoning overhead):

  • instruct_eos: Hidden state at </s> token
  • instruct_image_only: Image tokens only
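
A minimal sketch of extracting an instruct_eos-style embedding, assuming a standard Hugging Face transformers model with output_hidden_states (this is not the project's extractor.py):

import torch

def instruct_eos_embedding(model, inputs) -> torch.Tensor:
    # Hidden state of the final (EOS) token from the last layer
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]  # batch 0, last token position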

Model Routing

Qwen3-VL-2B-Instruct → Fast compression, simple navigation
         ↓
Qwen3-VL-4B-Thinking → Routing, retrieval, stuck detection
         ↓
Qwen3-VL-8B-Thinking-FP8 → Strategic decisions, dashboard queries

Escalation triggers:

  • Confidence < 0.8 → 2B→4B
  • Confidence < 0.6 OR stuck > 5 → 4B→8B
  • 8B can call You.com Content API (cooldown: 5 min, budget: 100 calls)
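
A minimal sketch of these escalation rules, collapsing the staged 2B→4B→8B path into a single decision; the function name is illustrative:

def route_model(confidence: float, stuck_steps: int) -> str:
    # Thresholds from the escalation triggers above
    if confidence < 0.6 or stuck_steps > 5:
        return "Qwen3-VL-8B-Thinking-FP8"
    if confidence < 0.8:
        return "Qwen3-VL-4B-Thinking"
    return "Qwen3-VL-2B-Instruct"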

Inference Batching & KV Caching

The agent implements micro-batching for improved throughput:

  • Batch sizes: 8 for 2B, 4 for 4B, 2 for 8B models
  • Timeout: 50ms default for batch accumulation
  • KV cache: On-disk memmap for long prefixes (HF_HOME/pmd_kv_cache)
  • Async processing: asyncio.gather for parallel inference
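
A minimal batch-accumulation sketch matching the 50ms timeout above; the queue protocol is an assumption:

import asyncio

async def accumulate_batch(queue: asyncio.Queue, max_size: int = 8,
                           timeout_s: float = 0.05) -> list:
    # Block for the first request, then fill the batch until max_size
    # is reached or the accumulation window closes
    batch = [await queue.get()]
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_s
    while len(batch) < max_size:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), remaining))
        except asyncio.TimeoutError:
            break
    return batch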

🎯 Key Features

1. Dynamic Temporal Resolution

Agent can adjust how it perceives time:

# Zoom out (see longer time span with less detail)
agent.adjust_fps(target_fps=5)  # 30fps → 5fps
agent.adjust_frame_multiplier(multiplier=16)  # 4x → 16x

# Zoom in (see recent moments with more detail)
agent.adjust_fps(target_fps=30)  # Back to 30fps
agent.adjust_frame_multiplier(multiplier=2)  # 16x → 2x

2. Memory Split Control

Agent can allocate context budget across temporal ranges:

# Example: 3/4 for last 5 min, 1/4 for storyline/missions
agent.allocate_memory({
    "last_5_minutes": 0.75,
    "storyline": 0.15,
    "active_missions": 0.10
})

3. Persistent Scratchpad

Agent has a "sticky note" that persists across environment interactions:

agent.scratchpad.write("Floor 7: stairs are usually in NE corner")
# This will be visible to agent in next inference

4. Stuckness Detection

Cross-temporal divergence metric:

  • High short-term similarity (repeating micro-actions)
  • Low long-term similarity (no macro progress) → Triggers escalation to 8B + dashboard fetch
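
A minimal divergence sketch as described above, assuming frame embeddings drawn from a short-range and a long-range silo; all names are illustrative:

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def stuckness(short_embs: list, long_embs: list) -> float:
    # High similarity between consecutive recent frames (micro-repetition)
    short_sim = np.mean([cosine(short_embs[i], short_embs[i + 1])
                         for i in range(len(short_embs) - 1)])
    # Low similarity across the long-range silo (per the metric above)
    long_sim = cosine(long_embs[0], long_embs[-1])
    return float(short_sim - long_sim)  # large value triggers escalation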

5. Live Searchable Dashboard

  • GitHub Pages hosted (updated every 5 minutes)
  • Pre-computed similarity comparisons
  • Accessible via You.com Content API (agent-only secret URLs)
  • Judge message wall for hackathon feedback

📋 Usage Examples

Grid Parser

The grid parser produces a uniform tile grid and screen mapping from game screen data, enabling pathfinding and spatial reasoning for the agent.

from src.vision.grid_parser import GridParser
from src.environment.ram_decoders import RAMSnapshot

# Initialize parser
parser = GridParser()

# Parse RAM data into grid
grid_frame = parser.parse_ram_snapshot(ram_snapshot)

# Access grid properties
print(f"Grid size: {grid_frame.width}x{grid_frame.height}")
print(f"Tile size: {grid_frame.tile_size_px}px")

# Get tile at position
tile = grid_frame.tiles[y][x]
print(f"Tile type: {tile.tile_type}")

# Compute pathfinding distances
bfs_result = parser.compute_bfs_distances(grid_frame, start=(x, y))
distance_to_target = bfs_result.distances[target_y][target_x]

🛠️ Development Workflow

For Code Agents (Copilot/Claude Code/Roo-Coder)

See AGENTS.md for detailed instructions on:

  • How to structure code changes
  • Testing procedures
  • Integration patterns
  • Prompt templates

Manual Development

  1. Make changes in src/ directory
  2. Test fast lane with .\scripts\test_fast.ps1 (Windows) or bash scripts/test_fast.sh (Linux/Mac)
  3. Test full suite with .\scripts\test_full.ps1 (Windows) or bash scripts/test_full.sh (Linux/Mac)
  4. Run demos in demos/ to visualize changes
  5. Commit with descriptive messages

Test Markers & Scripts

Fast Lane (scripts/test_fast.ps1):

  • Command: mamba info --envs; python --version; mamba activate agent-hackathon; pwd; ls; $env:FAST="1"; $env:PYTEST_FDUMP_S="45"; $env:PYTHONPATH="$(pwd)\src"; python -m pytest -q --maxfail=1 -m "not slow and not network and not bench and not longctx"
  • Expected Runtime: <3 minutes
  • Purpose: Quick validation excluding slow/network/bench/longctx tests

Full Lane (scripts/test_full.ps1):

  • Command: mamba info --envs; python --version; mamba activate agent-hackathon; pwd; ls; Remove-Item Env:FAST -ErrorAction SilentlyContinue; $env:PYTEST_FDUMP_S="90"; $env:PYTHONPATH="$(pwd)\src"; python -m pytest -q
  • Expected Runtime: 10-15 minutes
  • Purpose: Complete test suite with all markers

CI Lane (scripts/test_ci.ps1):

  • Command: Calls scripts/test_fast.ps1
  • Expected Runtime: <3 minutes
  • Purpose: Minimal CI validation

Bench Sweep (scripts/bench_sweep.ps1):

  • Command: mamba info --envs; python --version; mamba activate agent-hackathon; pwd; ls; $env:PYTHONPATH="$(pwd)\src"; python profiling/bench_qwen_vl.py --models all --csv bench_results.csv --time-budget-s 180 --full --plot bench_results.csv
  • Expected Runtime: 5-10 minutes per configuration
  • Purpose: Performance benchmarking with parameter sweeps, saves CSV + JSONL + PNG plots to profiling/results/<UTC_ISO>/

Sync Profiling (scripts/sync_profiling.ps1):

  • Command: mamba info --envs; python --version; mamba activate agent-hackathon; pwd; ls; Copy-Item "..\profiling\*" ".\profiling\" -Recurse -Force -Exclude "__pycache__"
  • Expected Runtime: <1 minute
  • Purpose: Consolidate profiling data from root directory

Markers:

  • @pytest.mark.slow: Long-running tests (model training, heavy parametrization)
  • @pytest.mark.network: Tests requiring emulator/web connections
  • @pytest.mark.bench: Performance benchmarking and plotting
  • @pytest.mark.longctx: Tests with ≥64k context

Environment Variables:

  • FAST=1: Reduces test parameters for faster execution
  • PYTEST_FDUMP_S=45: Session timeout for deadlock detection (default 60s)

Flags:

  • --maxfail=1: Stop after first failure
  • --timeout=30 --timeout-method=thread: 30s timeout per test with thread method
  • -m "not slow and not network and not bench and not longctx": Exclude marked tests
  • filterwarnings = ["ignore::DeprecationWarning"]: Suppress deprecation warnings

Troubleshooting

Test Failures:

  • Timeout errors: Increase PYTEST_FDUMP_S environment variable or check for infinite loops
  • Import errors: Ensure PYTHONPATH includes src/ directory
  • mGBA connection failures: Verify emulator is running with Lua script on port 8888
  • CUDA out of memory: Reduce batch sizes or use smaller models for testing

Benchmark Issues:

  • Long runtimes: Use --time-budget-s to limit entire benchmark duration (default 180s)
  • Time budget exceeded: Benchmark suite ran longer than --time-budget-s limit - check summary.json
  • OOM during bench: Reduce --batches or --contexts parameters, or use smaller models
  • No plots generated: Ensure matplotlib is installed and CSV file exists
  • Output directory errors: Check write permissions for profiling/results/<UTC_TIMESTAMP>/
  • Fast lane limitations: Use --full flag to run comprehensive benchmarks

Common Runtime Issues:

  • SyntaxError in qwen_controller.py: See agent_mailbox/copilot2codex.md for core team fix
  • faulthandler timeout: Tests hanging - check for blocking I/O operations
  • Top slow tests: Review session output for slowest tests to optimize

Expected Runtimes:

  • Fast lane: 2-3 minutes
  • Full lane: 10-15 minutes
  • Bench sweep: 5-10 minutes per config
  • CI lane: <3 minutes

Profiling Consolidation

Run .\scripts\sync_profiling.ps1 to consolidate profiling data from legacy root profiling/ directory into pokemon-md-agent/profiling/.

Current Test Status

⚠️ Tests currently blocked by runtime bug: SyntaxError in src/agent/qwen_controller.py (await outside async function). See agent_mailbox/copilot2codex.md for details. Core team fix required before test suite can run.

Benchmarking & Profiling

Run performance benchmarks with .\scripts\bench_sweep.ps1 (Windows) or equivalent bash script.

Bench Flags:

  • --time-budget-s: Time budget for entire benchmark suite (seconds, default: 180)
  • --full: Run full benchmark suite (longer, more comprehensive)
  • --contexts: Exact context lengths to test (comma-separated, overrides --min-ctx/--ctx-mult)
  • --image-text-ratios: Image-to-text content ratios to test (comma-separated floats, default: '0.5')
  • --models: Models to benchmark ('all' or comma-separated list)
  • --min-ctx: Minimum context length (default: 1024)
  • --ctx-mult: Context length multiplier (default: 1.5)
  • --max-wall: Maximum wall clock time per benchmark (seconds, default: 60)
  • --batches: Batch sizes to test (comma-separated, default: '1,2,4,8')
  • --best-of: Best-of values to test (comma-separated, default: '1,2,4,8')
  • --csv: Output CSV path (required for benchmarking)
  • --plot: CSV file to plot from (generates plots in profiling/plots/)
  • --dry-run: Use synthetic timings instead of real inference

Example Commands:

# Fast lane benchmark (default)
python profiling/bench_qwen_vl.py --csv results.csv --dry-run

# Full benchmark with time budget
python profiling/bench_qwen_vl.py --full --time-budget-s 300 --csv results.csv

# Custom contexts and image-text ratios
python profiling/bench_qwen_vl.py --contexts 1024,2048,4096,8192 --image-text-ratios 0.3,0.5,0.7 --csv results.csv

# Plot existing results
python profiling/bench_qwen_vl.py --plot results.csv

Results saved to profiling/results/<UTC_TIMESTAMP>/ with CSV metrics, JSON summary, and interactive plots.


🧪 Test Execution

Test Runner Commands

Fast Lane (under 3 minutes):

# Windows PowerShell
.\scripts\test_fast.ps1

# Linux/Mac bash
bash scripts/test_fast.sh

Full Suite (10-15 minutes):

# Windows PowerShell  
.\scripts\test_full.ps1

# Linux/Mac bash
bash scripts/test_full.sh

CI Validation:

# Windows PowerShell
.\scripts\test_ci.ps1

# Linux/Mac bash  
bash scripts/test_ci.sh

Expected Runtimes

  • Fast lane: ≤3 minutes
  • Full suite: 10-15 minutes
  • Bench sweep: 5-10 minutes per config
  • CI: ≤3 minutes

Troubleshooting Common Test Failures

  • faulthandler timeout → tests hanging; check for blocking I/O operations, increase PYTEST_FDUMP_S
  • Top slow tests → review session output for the slowest tests to optimize
  • SyntaxError in qwen_controller.py → see agent_mailbox/copilot2codex.md for the core team fix
  • mGBA connection failures → verify the emulator is running with the Lua script on port 8888
  • CUDA out of memory → reduce batch sizes or use smaller models for testing
  • Import errors → ensure PYTHONPATH includes the src/ directory
  • Timeout errors → increase the PYTEST_FDUMP_S environment variable or check for infinite loops
  • Benchmark time budget exceeded → the suite ran longer than the --time-budget-s limit; check summary.json
  • No plots generated → ensure matplotlib is installed and the CSV file exists
  • Output directory errors → check write permissions for profiling/results/<UTC_TIMESTAMP>/

Test Markers

  • @pytest.mark.slow: Long-running tests (model training, heavy parametrization)
  • @pytest.mark.network: Tests requiring emulator/web connections
  • @pytest.mark.bench: Performance benchmarking and plotting
  • @pytest.mark.longctx: Tests with ≥64k context

Profiling & Benchmarking

Bench Sweep (5-10 minutes):

# Windows PowerShell
.\scripts\bench_sweep.ps1 -time_budget_s 180 -full -create_plots

# Linux/Mac bash
bash scripts/bench_sweep.sh

Sync Profiling Data:

# Windows PowerShell
.\scripts\sync_profiling.ps1

# Linux/Mac bash
bash scripts/sync_profiling.sh

Outputs

  • Test results: Console output
  • Bench results: profiling/results/<UTC_ISO>/ (CSV, JSONL, plots)
  • Profiling data: Consolidated in profiling/ directory
