Multi-model Qwen3-VL agent with hierarchical RAG system, dynamic temporal resolution, and live dashboard for autonomous Pokemon Mystery Dungeon Red gameplay.
Goal: Build an autonomous agent that can play Pokemon Mystery Dungeon Red using:
- Multi-scale visual reasoning (Qwen3-VL 2B/4B/8B)
- Hierarchical RAG with 7 temporal resolution silos
- Dynamic FPS adjustment (30fps → 1fps) and frame multipliers
- Live searchable dashboard (GitHub Pages + You.com Content API)
- Cost-aware model routing and vision optimization
Tech Stack:
- Emulator: mgba + mgba-http (960x640 @ 30fps)
- Vision Models: Qwen3-VL-2B/4B/8B (Thinking + Instruct variants)
- Vector DB: ChromaDB or FAISS (multi-scale temporal embeddings)
- Dashboard: GitHub Pages (static) + You.com Content API (retrieval)
- Control: Python + mgba-http API
Watch the 3-minute agent demo (MP4) — 180 seconds of autonomous gameplay with Kokoro TTS narration, automatically generated from agent trajectory and You.com knowledge retrieval.
Submission snapshot:
- Branch: `deadline-2025-10-30-2355-PT` (frozen @ 23:55 UTC-7)
- Tag: `deadline-2025-10-30-2359-PT` (final submission timestamp)
- Activate environment: `mamba activate agent-hackathon`
- Run demo (50 steps + 3-min video): `cd pokemon-md-agent && python scripts/final_demo_runner.py`
- View results:
  - Video: `agent_demo.mp4`
  - Logs: `runs/demo_*/trajectory_*.jsonl`
- mGBA emulator (version 0.8.0+) with Lua socket server running on port 8888
- Pokemon Mystery Dungeon - Red Rescue Team (USA, Australia) ROM file
- Python 3.11+ with conda/mamba environment
- Save file with game in Tiny Woods or similar dungeon (starter floor)
Place ROM & save file in the `rom/` directory:

```powershell
# From project root (pokemon-md-agent/)
cp "Pokemon Mystery Dungeon - Red Rescue Team.gba" ./rom/
cp "Pokemon Mystery Dungeon - Red Rescue Team.sav" ./rom/
```
Create & activate conda environment (installs Kokoro TTS + MoviePy):

```bash
mamba create -n agent-hackathon python=3.11 -y
mamba activate agent-hackathon
pip install -r requirements.txt
```
Configure You.com Content API (optional but recommended):

```powershell
# Persist your key in PowerShell profile or current session
$Env:YOU_API_KEY = "<your-you-api-key>"
# Smoke test (live mode) - replace URL with a domain you expect the agent to use
python -m scripts.check_you_api --url https://www.serebii.net/dungeon/redblue/d001.shtml --live
```

```bash
# macOS/Linux example
export YOU_API_KEY="<your-you-api-key>"
python -m scripts.check_you_api --url https://www.serebii.net/dungeon/redblue/d001.shtml --live
```
- Success prints `• https://... -> OK | ...`
- If you skip this step (or the key is invalid), the agent falls back to placeholder content.
Start mGBA with Lua socket server (Windows PowerShell example):

```powershell
# Ensure mGBA-http is loaded (Lua console > File > Load script > mGBASocketServer.lua)
# Server defaults to port 8888
# Verify with: python .temp_check_ram.py
```
Run final demo (50-step agent + 3-min video + Kokoro voiceover):

```bash
mamba activate agent-hackathon
cd pokemon-md-agent
python scripts/final_demo_runner.py
```

Output:
- `runs/demo_*/trajectory_*.jsonl` - Full trajectory data
- `agent_demo.mp4` - 3-minute montage video (key frames + Kokoro TTS narration)
- Console logs show real-time progress
- Initialization: ~5s
- Agent execution (50 steps): ~30-60s
- Video generation: ~10-20s
- Total: ~1-2 minutes
| Issue | Solution |
|---|---|
| Failed to connect to mGBA | Verify mGBA is running, socket server active, port 8888 |
| No ROM files found | Check ROM + SAV files are in `./rom/` directory |
| `Perception failed: unpack requires...` | Stale mGBA connection; restart emulator |
| Video generation failed | Ensure opencv-python is installed: `pip install opencv-python` |
The agent includes a comprehensive dashboard system for monitoring gameplay and retrieving external knowledge.
Critical Fix Applied: All HF_HOME environment variable usages have been sanitized to handle quoted paths, normalize separators, and support user path expansion. This resolves model loading failures on Windows systems where HF_HOME may contain quotes or need path expansion.
Applied sanitization template:
- Strip surrounding quotes (`"`, `'`)
- Expand user paths (`~` → actual home directory)
- Normalize path separators for cross-platform compatibility
- Added comprehensive test coverage in `test_path_sanitization.py`
Verification: Model loading tested with real Qwen3-VL models from HuggingFace Hub, confirming proper cache directory resolution and tokenizer loading.
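The sanitization template above can be sketched as follows; the function name and placement are illustrative, not the project's actual implementation:

```python
import os
from pathlib import Path

def sanitize_hf_home(raw: str) -> str:
    """Strip surrounding quotes, expand ~, and normalize path separators."""
    cleaned = raw.strip().strip('"').strip("'")  # remove stray quotes from shell profiles
    return os.path.normpath(str(Path(cleaned).expanduser()))

# Example: a quoted, user-relative value as it might appear in HF_HOME
cache_dir = sanitize_hf_home('"~/hf_cache"')
```

The key point is ordering: strip quotes first, then expand `~`, then normalize, so a value like `"~/hf_cache"` resolves to a real absolute-style path on both Windows and POSIX systems.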
- Live Updates: Real-time trajectory logging and meta-view generation
- Searchable Content: Client-side FAISS indexes for fast similarity search
- Rate Limiting: Token bucket rate limiting (30 files/min, 300/hour) with exponential backoff
- Build Budget: Coalesces commits to ≤10/hour to avoid GitHub Actions limits
- LFS Avoidance: Keeps artifacts under 8MB; no Git LFS unless required
- Resolution Modes: 2× (480×320) default for dashboard, 1× (240×160) for Qwen-VL benchmarking
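The token-bucket rate limiting described above (e.g., 30 files/min) can be sketched like this; the class and its use are illustrative, not the project's actual uploader code:

```python
import time

class TokenBucket:
    """Simple token bucket: holds up to `capacity` tokens, refilled continuously."""
    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def try_acquire(self, n: int = 1) -> bool:
        # Refill proportionally to elapsed time, capped at capacity
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False  # caller should back off (e.g., exponentially) and retry

# 30 files/min -> capacity 30, refill 0.5 tokens/sec
bucket = TokenBucket(capacity=30, refill_per_sec=0.5)
```

On a failed `try_acquire`, an uploader would typically sleep with exponential backoff before retrying, matching the behavior described above.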
Phase 1 Implementation (Complete): Structured JSON schema for vision model outputs.
The agent uses a Pydantic-validated GameState schema for consistent vision model outputs:
```python
from src.models.game_state_schema import GameState, Entity, GameStateEnum

# Models must return JSON matching this schema
state = GameState(
    player_pos=(12, 8),
    player_hp=45,
    floor=3,
    state=GameStateEnum.EXPLORING,
    enemies=[Entity(x=14, y=8, type="enemy", species="Geodude")],
    items=[Entity(x=10, y=6, type="item", name="Apple")],
    confidence=0.95,
    threats=["Geodude approaching"],
    opportunities=["Move up to dodge"]
)
```

Key Features:
- ✅ 51 unit tests (1.28s runtime) covering validation, serialization, edge cases
- ✅ Type-safe coordinates (0-indexed bounds checking, negative rejection)
- ✅ Confidence scoring (0-1 range, quality metrics)
- ✅ JSON roundtrip (validation + serialization)
- ✅ Few-shot examples (3-5 predefined examples for in-context learning)
```bash
# Activate environment
mamba activate agent-hackathon
# Run schema tests (51 tests)
python -m pytest tests/test_game_state_schema.py tests/test_game_state_utils.py -v
# Quick validation
python scripts/test_vision_schema.py
```

```powershell
# (Windows) PowerShell validation script
.\scripts\validate_vision_schema.ps1 -RunTests
```

Phase 2 Implementation (Complete): Structured system prompts for Qwen3-VL vision models.
- ✅ 58 unit + integration tests (1.34s runtime) covering prompt variants and message integration
- ✅ Instruct variant — Direct JSON output for 2B/4B models
- ✅ Thinking variant — Chain-of-thought reasoning for reasoning-enabled models
- ✅ PromptBuilder class — Type-safe prompt assembly with few-shot examples
- ✅ Message packager integration — Seamless integration with three-message protocol
- ✅ Model-specific optimization — 2B/4B use instruct, 8B uses thinking variant
Located in `src/models/vision_prompts.py`:

```python
from src.models.vision_prompts import (
    VISION_SYSTEM_PROMPT_INSTRUCT,
    VISION_SYSTEM_PROMPT_THINKING,
    PromptBuilder,
    format_vision_prompt_with_examples
)

# Build complete prompt with context and examples
builder = PromptBuilder("instruct")
builder.add_few_shot_examples(3)
builder.add_context(policy_hint="explore", model_size="4B")
prompt = builder.build_complete_prompt()
# Returns: {"system": "...", "user": "..."}

# Or use high-level function
complete = format_vision_prompt_with_examples(
    policy_hint="battle",
    model_variant="thinking",
    num_examples=3,
    model_size="8B"
)
```

Located in `src/orchestrator/message_packager.py`:
```python
from src.orchestrator.message_packager import pack_with_vision_prompts

# Pack game state with vision prompts
step_state = {...}  # From Copilot or agent state
system_prompt, messages = pack_with_vision_prompts(
    step_state,
    policy_hint="explore",
    model_size="4B",
    num_examples=3
)
# Returns: (system_prompt_str, [msg1, msg2, msg3])
# Ready to send to Qwen3-VL with three-message protocol
```

```bash
# Activate environment
mamba activate agent-hackathon
# Run all vision tests (Phase 1 + Phase 2)
python -m pytest tests/test_game_state_*.py tests/test_vision_prompts.py tests/test_message_packager_vision.py -v
# Quick validation (Phase 2 only)
python scripts/test_vision_prompts.py
```

```powershell
# (Windows) PowerShell validation script
.\scripts\validate_vision_prompts.ps1 -RunTests
```

Instruct Variant (2B/4B models)
- ~2,234 characters
- Direct JSON output format
- Explicit requirements and rules
- Optimized for smaller models with less reasoning capability
- Focus on clear instructions and schema compliance
Thinking Variant (8B+ reasoning models)
- ~2,521 characters
- 6-step chain-of-thought reasoning (OBSERVATION → CLASSIFICATION → STATE → THREATS → CONFIDENCE → JSON)
- Encourages explicit reasoning about visual input
- Better for models that benefit from intermediate reasoning steps
- Chain of thought helps with complex multi-entity scenes
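A hedged sketch of how a thinking-variant response might be post-processed, assuming the model emits a `<think>...</think>` block followed by a JSON object (the helper name is illustrative, not the project's parser):

```python
import json
import re

def split_thinking_output(raw: str) -> tuple[str, dict]:
    """Separate the <think> reasoning block from the trailing JSON payload."""
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    json_part = raw[match.end():] if match else raw
    # Take the outermost braces of whatever follows the reasoning
    start, end = json_part.find("{"), json_part.rfind("}")
    payload = json.loads(json_part[start:end + 1])
    return reasoning, payload

raw = '<think>Player at (12, 8); one enemy east.</think>\n{"state": "EXPLORING", "confidence": 0.9}'
reasoning, state = split_thinking_output(raw)
```

Keeping the reasoning text around (rather than discarding it) is useful for logging and for the trajectory JSONL records mentioned elsewhere in this README.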
Located in `src/models/game_state_utils.py`:

```python
# Parse model output with validation
state = parse_model_output(json_str, partial_ok=True, confidence_threshold=0.7)

# Validate state quality
report = validate_game_state(state)
print(f"Quality: {report['quality_score']:.2f}")
print(f"Warnings: {report['warnings']}")

# Generate few-shot examples
examples = generate_few_shot_examples(num_examples=3)

# Format for agent decisions
text = format_state_for_decision(state)
```

The Agent Gatekeeper provides safety filtering for agent actions, ensuring only valid and vetted actions are executed.
- Safety Filtering: Rejects explicitly invalid actions (e.g., `self-destruct`, `quit`, `exit`, `die`, `end_game`)
- ANN Validation: Requires ≥3 shallow ANN hits to permit actions via vector similarity search
- Fallback Behavior: On ANN failure, conservatively rejects all actions
- Async Operation: Supports async ANN search for non-blocking validation
```python
from src.agent.gatekeeper import Gatekeeper
from src.retrieval.ann_search import VectorSearch

# Initialize with ANN search dependency
ann_search = VectorSearch(index_path="path/to/ann/index")
gatekeeper = Gatekeeper(ann_search=ann_search, min_hits=3)

# Filter actions
valid_actions = ["move", "attack", "use_item"]
state = {"ascii": "dungeon_grid", "player_x": 10, "player_y": 10}
filtered_actions = await gatekeeper.filter(valid_actions, state)
# Returns: ["move", "attack", "use_item"] if ANN validation passes
```

The gatekeeper is automatically integrated into the agent reasoning pipeline:
- Actions are extracted from LLM responses
- Passed through gatekeeper filtering before execution
- Invalid actions trigger fallback to safe defaults
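The filtering rule described above (blocklist plus a ≥3-hit ANN requirement, with a conservative fallback) can be sketched roughly as follows; the blocklist and search interface are simplified stand-ins for the real `Gatekeeper`:

```python
BLOCKED = {"self-destruct", "quit", "exit", "die", "end_game"}

async def filter_actions(actions, state, ann_search, min_hits=3):
    """Reject blocklisted actions; require ANN support for the rest."""
    candidates = [a for a in actions if a not in BLOCKED]
    try:
        hits = await ann_search.search(state)  # assumed: returns a list of neighbors
    except Exception:
        return []  # conservative fallback: reject everything on ANN failure
    return candidates if len(hits) >= min_hits else []
```

The design choice worth noting is fail-closed behavior: when the vector index is unavailable, no action passes, which forces the agent onto its safe defaults rather than executing unvetted actions.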
Important: Always cd to REPO ROOT (absolute) before running tests; scripts enforce this.
Fast Lane (≤3 minutes):

```powershell
# Windows PowerShell
mamba info --envs; python --version; mamba activate agent-hackathon;
if (-not (Test-Path 'C:\Homework\agent_hackathon\pokemon-md-agent\pyproject.toml')) { Write-Error 'Not at repo root'; exit 2 }
Set-Location -Path 'C:\Homework\agent_hackathon\pokemon-md-agent';
$env:PYTHONPATH='C:\Homework\agent_hackathon\pokemon-md-agent\src';
python -m pytest -q --maxfail=1 -m "not slow and not network and not bench and not longctx"
```

```bash
# Git Bash
mamba info --envs && python --version && mamba activate agent-hackathon && \
[ -f /c/Homework/agent_hackathon/pokemon-md-agent/pyproject.toml ] || { echo "Not at repo root"; exit 2; } && \
cd /c/Homework/agent_hackathon/pokemon-md-agent && pwd && ls -la && \
export PYTHONPATH=/c/Homework/agent_hackathon/pokemon-md-agent/src && \
python -m pytest -q --maxfail=1 -m "not slow and not network and not bench and not longctx"
```

Full Suite (10-15 minutes):
```powershell
# Windows PowerShell
mamba info --envs; python --version; mamba activate agent-hackathon;
if (-not (Test-Path 'C:\Homework\agent_hackathon\pokemon-md-agent\pyproject.toml')) { Write-Error 'Not at repo root'; exit 2 }
Set-Location -Path 'C:\Homework\agent_hackathon\pokemon-md-agent';
$env:PYTHONPATH='C:\Homework\agent_hackathon\pokemon-md-agent\src';
python -m pytest -q
```

```bash
# Git Bash
mamba info --envs && python --version && mamba activate agent-hackathon && \
[ -f /c/Homework/agent_hackathon/pokemon-md-agent/pyproject.toml ] || { echo "Not at repo root"; exit 2; } && \
cd /c/Homework/agent_hackathon/pokemon-md-agent && pwd && ls -la && \
export PYTHONPATH=/c/Homework/agent_hackathon/pokemon-md-agent/src && \
python -m pytest -q
```

Bench Sweep (5-10 minutes):
```powershell
# Windows PowerShell
mamba info --envs; python --version; mamba activate agent-hackathon;
Set-Location -Path 'C:\Homework\agent_hackathon\pokemon-md-agent';
$env:PYTHONPATH='C:\Homework\agent_hackathon\pokemon-md-agent\src';
python profiling/bench_qwen_vl.py --models all --time-budget-s 180 --full --plot
```

Sync Profiling Data:
```powershell
# Windows PowerShell
mamba info --envs; python --version; mamba activate agent-hackathon;
Set-Location -Path 'C:\Homework\agent_hackathon\pokemon-md-agent';
Copy-Item "..\profiling\*" ".\profiling\" -Recurse -Force -Exclude "__pycache__"
```

- `@pytest.mark.slow`: Long-running tests
- `@pytest.mark.network`: Network-dependent tests
- `@pytest.mark.bench`: Performance benchmarks
- `@pytest.mark.longctx`: Long context tests
- `@pytest.mark.real_model`: Real model inference tests
- Test results: Console output with session summary and top slow tests
- Bench results: `profiling/results/<UTC_ISO>/` (CSV, JSONL, plots)
- Profiling data: Consolidated in `profiling/` directory
pokemon-md-agent/
├── README.md # This file
├── AGENTS.md # Instructions for code agents (Copilot/Claude Code)
├── requirements.txt # Python dependencies
├── .gitignore # Git ignore patterns
│
├── docs/ # Architecture & design documents
│ ├── pokemon-md-rag-system.md # RAG system architecture
│ ├── pokemon-md-dashboard.md # Dashboard design
│ ├── pokemon-md-agent-scaffold.md # Agent scaffold & environment
│ └── embedding-types.md # Detailed embedding strategy
│
├── src/ # Source code
│ ├── agent/ # Agent core
│ │ ├── __init__.py
│ │ ├── qwen_controller.py # Multi-model Qwen3-VL orchestration
│ │ ├── model_router.py # 2B/4B/8B routing logic
│ │ └── memory_manager.py # Scratchpad & persistent memory
│ │
│ ├── orchestrator/ # Message orchestration
│ │ ├── __init__.py
│ │ └── message_packager.py # Three-message protocol with model presets
│ │
│ ├── embeddings/ # Embedding generation & storage
│ │ ├── __init__.py
│ │ ├── extractor.py # Extract embeddings from Qwen3-VL
│ │ ├── temporal_silo.py # 7 temporal resolution managers
│ │ └── vector_store.py # ChromaDB wrapper
│ │
│ ├── vision/ # Screenshot processing
│ │ ├── __init__.py
│ │ ├── sprite_detector.py # Qwen3-VL sprite detection
│ │ ├── grid_parser.py # Convert to tile grid for pathfinding
│ │ └── ascii_renderer.py # ASCII state for blind LLMs
│ │
│ ├── environment/ # mgba integration
│ │ ├── __init__.py
│ │ ├── mgba_controller.py # mgba-http API wrapper
│ │ ├── fps_adjuster.py # Dynamic FPS & frame multiplier
│ │ └── action_executor.py # Button press execution
│ │
│ ├── retrieval/ # RAG system
│ │ ├── __init__.py
│ │ ├── auto_retrieve.py # Automatic trajectory retrieval
│ │ ├── circular_buffer.py # On-device circular buffer (60-min window)
│ │ ├── cross_silo_search.py # Multi-scale search
│ │ ├── deduplicator.py # pHash/sprite-hash deduplication
│ │ ├── embedding_generator.py # Text/image embedding generation
│ │ ├── keyframe_policy.py # Keyframe selection (SSIM/floor/combat triggers)
│ │ ├── local_ann_index.py # SQLite ANN index for KNN search
│ │ ├── meta_view_writer.py # 2×2 meta-view generation
│ │ ├── on_device_buffer.py # Orchestrates all buffer components
│ │ └── stuckness_detector.py # Loop detection
│ │
│ └── dashboard/ # Live dashboard
│ ├── __init__.py
│ ├── uploader.py # Batch upload to GitHub Pages
│ ├── content_api.py # You.com Content API wrapper
│ └── similarity_precompute.py # Pre-compute comparison pages
│
├── tests/ # Unit tests
│ ├── test_mgba_connection.py
│ └── test_on_device_buffer.py
│
├── demos/ # Visual demonstrations
│ └── embedding_visualization.py
│
├── examples/ # Example usage
│ └── quickstart.py
│
├── research/ # Related papers & inspirations
│ └── qwen3-vl-summary.md
│
└── config/ # Configuration files
├── agent_config.yaml # Agent behavior settings
├── embedding_config.yaml # Embedding strategy config
└── mgba_config.ini # mgba settings
Start mGBA with ROM + Lua script:

```text
C:\Homework\agent_hackathon\rom\Pokemon Mystery Dungeon - Red Rescue Team (USA, Australia).gba
C:\Homework\agent_hackathon\pokemon-md-agent\config\save_files\game_start_save.ss0
C:\Homework\agent_hackathon\pokemon-md-agent\src\mgba-harness\mgba-http\mGBASocketServer.lua
```

Run demo:

```bash
cd pokemon-md-agent
python demo_agent.py --max-steps 50
```

View results:

```bash
ls -lt runs/  # Latest run folder
```
- Screenshot locked: Fixed in v1.1 (auto-retry with exponential backoff)
- Socket error: Fixed in v1.1 (proper cleanup on disconnect)
- WRAM defaults: Check `config/addresses/pmd_red_us_v1.json` offsets
- Python 3.11+ (with CUDA support for GPU acceleration)
- mgba with mgba-http enabled (Lua-only setup)
- Pokemon Mystery Dungeon Red ROM (you provide)
- GPU: NVIDIA GPU with CUDA support recommended (RTX 30-series or newer)
```bash
# Clone or extract this repo
cd pokemon-md-agent
# Install PyTorch with CUDA support first (required for GPU acceleration)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
# Install as editable package
pip install -e .
```

ℹ️ New dependency: the benchmark now leverages `nano-graphrag` for retrieval-augmented prompt scaffolding. It is included in `requirements.txt` and will be installed automatically with the editable package command above.
Note: The installation automatically detects your CUDA version and GPU architecture to install the correct PyTorch and Unsloth versions. If you encounter CUDA detection issues, you can manually run Unsloth's auto-install script first:
```bash
# Optional: Run Unsloth's auto-detection script
python -c "import urllib.request; exec(urllib.request.urlopen('https://raw.githubusercontent.com/unslothai/unsloth/main/unsloth/_auto_install.py').read())"
```

Verified: This installation method has been tested and confirmed to work with:
- PyTorch 2.9.0+cu128 (CUDA 12.8, compatible with CUDA 12.9)
- Unsloth v2025.10.10 with Qwen3-VL support
- RTX 4090 GPU (Ada Lovelace architecture)
Important: This project uses mgba-http with Lua socket server. No Python socket server needed.
1. Download mgba v0.10.5+ from mgba.io
2. Place your Pokemon Mystery Dungeon Red ROM in the `rom/` directory
3. Start mgba and load the game:
   - Load the ROM: `File → Load ROM` → select `rom/Pokemon Mystery Dungeon - Red Rescue Team (USA, Australia).gba`
   - Load the save file: `File → Load State File` → select `config/game_start.sav`
   - Load the Lua script: `Tools → Scripting` → `Load script` → select `src/mgba-harness/mgba-http/mGBASocketServer.lua`
4. The Lua script will start the HTTP server automatically on port 8888
Save Slot Advice:
- Slot 0: Title screen (for reset)
- Slot 1: Floor ready (for benchmark loops) - agent loads this automatically
- Slot 2: Last autosave
- Slots 3-98: Manual saves
- Slot 99: Final save on agent shutdown
The agent will automatically load slot 1 on startup for consistent benchmarking.
```bash
python examples/quickstart.py
```

The project includes a comprehensive benchmark harness for measuring Qwen3-VL model performance across context lengths, batch sizes, and task types:
```bash
# Run comprehensive benchmark with 3D analysis
python profiling/bench_qwen_vl.py --models all --tasks all --num-runs 3
# Dry run for testing (no actual model inference)
python profiling/bench_qwen_vl.py --dry-run --models "unsloth/Qwen3-VL-2B-Instruct-unsloth-bnb-4bit" --tasks "text_only"
# Custom configuration
python profiling/bench_qwen_vl.py --models "unsloth/Qwen3-VL-4B-Instruct-unsloth-bnb-4bit,unsloth/Qwen3-VL-8B-Instruct-unsloth-bnb-4bit" --max-new-tokens 256
```

Context Length Scaling: Tests from 1024 to 256k tokens (262k max for Qwen3-VL) on a log2 scale
- 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144 tokens
Batch Size Optimization: Tests batch sizes 1, 2, 4, 8 with automatic model-aware limits
- 2B models: up to batch size 8
- 4B models: up to batch size 4
- 8B models: up to batch size 2
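The model-aware batch limits above can be expressed as a small lookup; this is a sketch, and the real harness may derive its limits differently:

```python
# Maximum batch size per model scale, per the limits listed above
MAX_BATCH = {"2B": 8, "4B": 4, "8B": 2}

def clamp_batch_sizes(model_size: str, requested: list[int]) -> list[int]:
    """Drop requested batch sizes that exceed the model's limit."""
    limit = MAX_BATCH.get(model_size, 1)  # unknown sizes default to batch 1
    return [b for b in requested if b <= limit]

# e.g. for an 8B model, only batch sizes 1 and 2 survive from the default sweep
sizes = clamp_batch_sizes("8B", [1, 2, 4, 8])
```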
Task Performance Analysis: Four micro-benchmark tasks
- `text_only`: Text summarization
- `vision_simple`: Basic image description
- `vision_complex`: Tactical situation analysis
- `mixed_reasoning`: Strategic decision making
3D Visualizations: Interactive performance landscapes
- Throughput surfaces (context × batch size × tokens/sec)
- Performance contour maps
- Optimal batch size curves
- Log-scale context length plots
- CSV Data: `profiling/data/comprehensive_benchmark_results.csv` with all measurements
- 3D Surface Plots: `profiling/plots/3d_throughput_surfaces.png`
- Performance Landscapes: `profiling/plots/performance_landscapes.png`
- Optimization Curves: `profiling/plots/batch_optimization.png`
- Context Scaling: `profiling/plots/log_context_throughput.png`
Throughput Analysis:
- Higher values indicate faster inference
- Look for inflection points where performance degrades
- Compare batching vs non-batching efficiency
Performance Scores:
- 0.0-1.0 scale based on response quality heuristics
- Task-specific scoring (conciseness, descriptiveness, strategy)
Optimal Configurations:
- Batch size curves show sweet spots for each context length
- 3D surfaces reveal performance saddle points
- Contour maps highlight efficient operating regions
| Model | Max Context | Max Batch | Typical Throughput |
|---|---|---|---|
| Qwen3-VL-2B | 32,768 | 8 | 60-80 tokens/sec |
| Qwen3-VL-4B | 65,536 | 4 | 40-60 tokens/sec |
| Qwen3-VL-8B | 131,072 | 2 | 20-40 tokens/sec |
The benchmark automatically respects these limits and provides consistent comparison across all supported Qwen3-VL variants.
The agent implements a text-speed guarantee feature to ensure OCR capture of dialogue frames:
- Menu Profile: `src/mgba-harness/profiles/set_text_speed_slow.json` navigates Options → Text Speed → Slow on boot
- RAM Fallback: Direct memory poke to the text-speed setting when `allow_memory_write` is enabled and the ROM hash is safe
- Input Pacing: A-button taps throttled to ≥1 second intervals during textboxes for reliable OCR capture
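The ≥1-second input pacing can be sketched as a simple time gate; this is illustrative — the real harness drives button presses through mgba-http:

```python
import time

class PacedButton:
    """Throttle repeated presses to at most one per min_interval seconds."""
    def __init__(self, press_fn, min_interval: float = 1.0):
        self.press_fn = press_fn          # callable that performs the actual press
        self.min_interval = min_interval
        self.last_press = 0.0

    def press(self) -> bool:
        now = time.monotonic()
        if now - self.last_press < self.min_interval:
            return False  # too soon; skip so OCR can read the current textbox
        self.last_press = now
        self.press_fn()
        return True
```

Using a monotonic clock (rather than wall-clock time) keeps the pacing correct even if the system clock is adjusted mid-run.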
7 temporal resolution silos with dynamic FPS adjustment:
| Silo | Base Sample Rate | Agent-Adjustable FPS | Context Span |
|---|---|---|---|
| temporal_1frame | Every frame | 30→10→5→3→1 fps | 0-4 sec |
| temporal_2frame | Every 2nd | - | 0-8 sec |
| temporal_4frame | Every 4th | - | 0-16 sec |
| temporal_8frame | Every 8th | - | 0-32 sec |
| temporal_16frame | Every 16th | - | 0-64 sec |
| temporal_32frame | Every 32nd | - | 0-128 sec |
| temporal_64frame | Every 64th | - | 2+ min |
Agent can dynamically:
- Adjust base FPS (30→1fps) to "zoom out" temporally
- Change frame multipliers (4x→8x→16x) for finer resolution
- Allocate memory budget across silos (e.g., 3/4 for last 5 min)
Input embeddings:
- `input`: Hidden states of what was sent to the model

Thinking models (reasoning-aware):
- `think_input`: Hidden state at/before `</think>` + input
- `think_full`: Hidden state before `</s>` (full input+output)
- `think_only`: Embedding of only the `<think>...</think>` block
- `think_image_input`: Like `think_input` but image-only input
- `think_image_full`: Like `think_full` but image-only input
- `think_image_only`: Image-only reasoning (experimental)

Instruct models (fast, no reasoning overhead):
- `instruct_eos`: Hidden state at the `</s>` token
- `instruct_image_only`: Image tokens only
Qwen3-VL-2B-Instruct → Fast compression, simple navigation
↓
Qwen3-VL-4B-Thinking → Routing, retrieval, stuck detection
↓
Qwen3-VL-8B-Thinking-FP8 → Strategic decisions, dashboard queries
Escalation triggers:
- Confidence < 0.8 → 2B→4B
- Confidence < 0.6 OR stuck > 5 → 4B→8B
- 8B can call You.com Content API (cooldown: 5 min, budget: 100 calls)
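The escalation triggers can be sketched as a routing function using the thresholds above; this flattens the stepwise 2B→4B→8B escalation of `model_router.py` into a single decision for illustration:

```python
def route_model(confidence: float, stuck_steps: int, current: str = "2B") -> str:
    """Pick a model tier from confidence and stuckness signals."""
    if confidence < 0.6 or stuck_steps > 5:
        return "8B"   # strategic reasoning; may also query the dashboard
    if confidence < 0.8:
        return "4B"   # routing, retrieval, stuck detection
    return current    # stay on the fast path

tier = route_model(confidence=0.7, stuck_steps=2)
```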
The agent implements micro-batching for improved throughput:
- Batch sizes: 8 for 2B, 4 for 4B, 2 for 8B models
- Timeout: 50ms default for batch accumulation
- KV cache: On-disk memmap for long prefixes (HF_HOME/pmd_kv_cache)
- Async processing: asyncio.gather for parallel inference
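A minimal asyncio micro-batcher in the spirit described (accumulate for up to 50 ms or until the batch is full); the names and structure are illustrative, not the project's implementation:

```python
import asyncio

class MicroBatcher:
    """Accumulate requests for up to `timeout` seconds or `max_batch` items."""
    def __init__(self, infer_batch, max_batch: int = 8, timeout: float = 0.05):
        self.infer_batch = infer_batch  # callable: list[request] -> list[result]
        self.max_batch = max_batch
        self.timeout = timeout
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, request):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((request, fut))
        return await fut

    async def run(self):
        while True:
            # Block for the first request, then collect more until the deadline
            batch = [await self.queue.get()]
            deadline = asyncio.get_running_loop().time() + self.timeout
            while len(batch) < self.max_batch:
                remaining = deadline - asyncio.get_running_loop().time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            results = self.infer_batch([req for req, _ in batch])
            for (_, fut), res in zip(batch, results):
                fut.set_result(res)
```

Callers simply `await batcher.submit(request)`; the background `run()` task groups concurrent requests into one inference call, which is where the throughput win comes from.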
Agent can adjust how it perceives time:
```python
# Zoom out (see longer time span with less detail)
agent.adjust_fps(target_fps=5)                # 30fps → 5fps
agent.adjust_frame_multiplier(multiplier=16)  # 4x → 16x

# Zoom in (see recent moments with more detail)
agent.adjust_fps(target_fps=30)               # Back to 30fps
agent.adjust_frame_multiplier(multiplier=2)   # 16x → 2x
```

Agent can allocate context budget across temporal ranges:
```python
# Example: 3/4 for last 5 min, 1/4 for storyline/missions
agent.allocate_memory({
    "last_5_minutes": 0.75,
    "storyline": 0.15,
    "active_missions": 0.10
})
```

Agent has a "sticky note" that persists across environment interactions:
```python
agent.scratchpad.write("Floor 7: stairs are usually in NE corner")
# This will be visible to agent in next inference
```

Cross-temporal divergence metric:
- High short-term similarity (repeating micro-actions)
- Low long-term similarity (no macro progress) → Triggers escalation to 8B + dashboard fetch
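The divergence test can be sketched with cosine similarities over short- and long-horizon embeddings; the thresholds and function shape are illustrative, mirroring the rule stated above:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def is_stuck(recent_embs, old_emb, short_thresh=0.95, long_thresh=0.5):
    """High short-term similarity + low long-term similarity => looping.

    `recent_embs` needs at least two consecutive frame embeddings.
    """
    short_sim = cosine(recent_embs[-1], recent_embs[-2])  # consecutive frames alike
    long_sim = cosine(recent_embs[-1], old_emb)           # vs. a minutes-old state
    return short_sim > short_thresh and long_sim < long_thresh
```

When `is_stuck` fires, the agent would escalate to the 8B model and trigger a dashboard fetch, as described above.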
- GitHub Pages hosted (updated every 5 minutes)
- Pre-computed similarity comparisons
- Accessible via You.com Content API (agent-only secret URLs)
- Judge message wall for hackathon feedback
The grid parser produces a uniform tile grid and screen mapping from game screen data, enabling pathfinding and spatial reasoning for the agent.
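Independent of the project's `GridParser` API, the kind of 4-connected BFS it uses for pathfinding distances can be sketched on a plain walkability grid (encoding and names here are simplified stand-ins):

```python
from collections import deque

def bfs_distances(walkable, start):
    """4-connected BFS distances on a grid; -1 marks unreachable tiles."""
    h, w = len(walkable), len(walkable[0])
    dist = [[-1] * w for _ in range(h)]
    sx, sy = start
    dist[sy][sx] = 0
    queue = deque([(sx, sy)])
    while queue:
        x, y = queue.popleft()
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < w and 0 <= ny < h and walkable[ny][nx] and dist[ny][nx] < 0:
                dist[ny][nx] = dist[y][x] + 1
                queue.append((nx, ny))
    return dist

grid = [[1, 1, 0],
        [0, 1, 1]]
d = bfs_distances(grid, start=(0, 0))  # d[y][x] gives steps from the start tile
```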
```python
from src.vision.grid_parser import GridParser
from src.environment.ram_decoders import RAMSnapshot

# Initialize parser
parser = GridParser()

# Parse RAM data into grid
grid_frame = parser.parse_ram_snapshot(ram_snapshot)

# Access grid properties
print(f"Grid size: {grid_frame.width}x{grid_frame.height}")
print(f"Tile size: {grid_frame.tile_size_px}px")

# Get tile at position
tile = grid_frame.tiles[y][x]
print(f"Tile type: {tile.tile_type}")

# Compute pathfinding distances
bfs_result = parser.compute_bfs_distances(grid_frame, start=(x, y))
distance_to_target = bfs_result.distances[target_y][target_x]
```

See AGENTS.md for detailed instructions on:
- How to structure code changes
- Testing procedures
- Integration patterns
- Prompt templates
1. Make changes in the `src/` directory
2. Test fast lane with `.\scripts\test_fast.ps1` (Windows) or `bash scripts/test_fast.sh` (Linux/Mac)
3. Test full suite with `.\scripts\test_full.ps1` (Windows) or `bash scripts/test_full.sh` (Linux/Mac)
4. Run demos in `demos/` to visualize changes
5. Commit with descriptive messages
Fast Lane (scripts/test_fast.ps1):
- Command: `mamba info --envs; python --version; mamba activate agent-hackathon; pwd; ls; $env:FAST="1"; $env:PYTEST_FDUMP_S="45"; $env:PYTHONPATH="$(pwd)\src"; python -m pytest -q --maxfail=1 -m "not slow and not network and not bench and not longctx"`
- Expected Runtime: <3 minutes
- Purpose: Quick validation excluding slow/network/bench/longctx tests
Full Lane (scripts/test_full.ps1):
- Command: `mamba info --envs; python --version; mamba activate agent-hackathon; pwd; ls; Remove-Item Env:FAST -ErrorAction SilentlyContinue; $env:PYTEST_FDUMP_S="90"; $env:PYTHONPATH="$(pwd)\src"; python -m pytest -q`
- Expected Runtime: 10-15 minutes
- Purpose: Complete test suite with all markers
CI Lane (scripts/test_ci.ps1):
- Command: Calls `scripts/test_fast.ps1`
- Expected Runtime: <3 minutes
- Purpose: Minimal CI validation
Bench Sweep (scripts/bench_sweep.ps1):
- Command: `mamba info --envs; python --version; mamba activate agent-hackathon; pwd; ls; $env:PYTHONPATH="$(pwd)\src"; python profiling/bench_qwen_vl.py --models all --csv bench_results.csv --time-budget-s 180 --full --plot bench_results.csv`
- Expected Runtime: 5-10 minutes per configuration
- Purpose: Performance benchmarking with parameter sweeps; saves CSV + JSONL + PNG plots to `profiling/results/<UTC_ISO>/`
Sync Profiling (scripts/sync_profiling.ps1):
- Command: `mamba info --envs; python --version; mamba activate agent-hackathon; pwd; ls; Copy-Item "..\profiling\*" ".\profiling\" -Recurse -Force -Exclude "__pycache__"`
- Expected Runtime: <1 minute
- Purpose: Consolidate profiling data from root directory
Markers:
- `@pytest.mark.slow`: Long-running tests (model training, heavy parametrization)
- `@pytest.mark.network`: Tests requiring emulator/web connections
- `@pytest.mark.bench`: Performance benchmarking and plotting
- `@pytest.mark.longctx`: Tests with ≥64k context
Environment Variables:
- `FAST=1`: Reduces test parameters for faster execution
- `PYTEST_FDUMP_S=45`: Session timeout for deadlock detection (default 60s)
Flags:
- `--maxfail=1`: Stop after first failure
- `--timeout=30 --timeout-method=thread`: 30s timeout per test with thread method
- `-m "not slow and not network and not bench and not longctx"`: Exclude marked tests
- `filterwarnings = ["ignore::DeprecationWarning"]`: Suppress deprecation warnings
Test Failures:
- Timeout errors: Increase the `PYTEST_FDUMP_S` environment variable or check for infinite loops
- Import errors: Ensure `PYTHONPATH` includes the `src/` directory
- mGBA connection failures: Verify the emulator is running with the Lua script on port 8888
- CUDA out of memory: Reduce batch sizes or use smaller models for testing
Benchmark Issues:
- Long runtimes: Use `--time-budget-s` to limit entire benchmark duration (default 180s)
- Time budget exceeded: Benchmark suite ran longer than the `--time-budget-s` limit - check summary.json
- OOM during bench: Reduce `--batches` or `--contexts` parameters, or use smaller models
- No plots generated: Ensure matplotlib is installed and the CSV file exists
- Output directory errors: Check write permissions for `profiling/results/<UTC_TIMESTAMP>/`
- Fast lane limitations: Use the `--full` flag to run comprehensive benchmarks
Common Runtime Issues:
- SyntaxError in qwen_controller.py: See `agent_mailbox/copilot2codex.md` for core team fix
- faulthandler timeout: Tests hanging - check for blocking I/O operations
- Top slow tests: Review session output for slowest tests to optimize
Expected Runtimes:
- Fast lane: 2-3 minutes
- Full lane: 10-15 minutes
- Bench sweep: 5-10 minutes per config
- CI lane: <3 minutes
Run `.\scripts\sync_profiling.ps1` to consolidate profiling data from the legacy root `profiling/` directory into `pokemon-md-agent/profiling/`.
Known issue: SyntaxError in `src/agent/qwen_controller.py` (await outside async function). See `agent_mailbox/copilot2codex.md` for details. A core team fix is required before the test suite can run.
Run performance benchmarks with `.\scripts\bench_sweep.ps1` (Windows) or the equivalent bash script.
Bench Flags:
- `--time-budget-s`: Time budget for entire benchmark suite (seconds, default: 180)
- `--full`: Run full benchmark suite (longer, more comprehensive)
- `--contexts`: Exact context lengths to test (comma-separated, overrides `--min-ctx`/`--ctx-mult`)
- `--image-text-ratios`: Image-to-text content ratios to test (comma-separated floats, default: '0.5')
- `--models`: Models to benchmark ('all' or comma-separated list)
- `--min-ctx`: Minimum context length (default: 1024)
- `--ctx-mult`: Context length multiplier (default: 1.5)
- `--max-wall`: Maximum wall clock time per benchmark (seconds, default: 60)
- `--batches`: Batch sizes to test (comma-separated, default: '1,2,4,8')
- `--best-of`: Best-of values to test (comma-separated, default: '1,2,4,8')
- `--csv`: Output CSV path (required for benchmarking)
- `--plot`: CSV file to plot from (generates plots in `profiling/plots/`)
- `--dry-run`: Use synthetic timings instead of real inference
Example Commands:
```bash
# Fast lane benchmark (default)
python profiling/bench_qwen_vl.py --csv results.csv --dry-run

# Full benchmark with time budget
python profiling/bench_qwen_vl.py --full --time-budget-s 300 --csv results.csv

# Custom contexts and image-text ratios
python profiling/bench_qwen_vl.py --contexts 1024,2048,4096,8192 --image-text-ratios 0.3,0.5,0.7 --csv results.csv

# Plot existing results
python profiling/bench_qwen_vl.py --plot results.csv
```

Results are saved to `profiling/results/<UTC_TIMESTAMP>/` with CSV metrics, JSON summary, and interactive plots.
Fast Lane (under 3 minutes):
```powershell
# Windows PowerShell
.\scripts\test_fast.ps1
```

```bash
# Linux/Mac bash
bash scripts/test_fast.sh
```

Full Suite (10-15 minutes):
```powershell
# Windows PowerShell
.\scripts\test_full.ps1
```

```bash
# Linux/Mac bash
bash scripts/test_full.sh
```

CI Validation:
```powershell
# Windows PowerShell
.\scripts\test_ci.ps1
```

```bash
# Linux/Mac bash
bash scripts/test_ci.sh
```

- Fast lane: ≤3 minutes
- Full suite: 10-15 minutes
- Bench sweep: 5-10 minutes per config
- CI: ≤3 minutes
| Issue | Solution |
|---|---|
| faulthandler timeout | Tests hanging - check for blocking I/O operations, increase `PYTEST_FDUMP_S` |
| Top slow tests | Review session output for slowest tests to optimize |
| SyntaxError in qwen_controller.py | See `agent_mailbox/copilot2codex.md` for core team fix |
| mGBA connection failures | Verify emulator is running with Lua script on port 8888 |
| CUDA out of memory | Reduce batch sizes or use smaller models for testing |
| Import errors | Ensure `PYTHONPATH` includes `src/` directory |
| Timeout errors | Increase `PYTEST_FDUMP_S` environment variable or check for infinite loops |
| Benchmark time budget exceeded | Suite ran longer than the `--time-budget-s` limit - check summary.json |
| No plots generated | Ensure matplotlib is installed and CSV file exists |
| Output directory errors | Check write permissions for `profiling/results/<UTC_TIMESTAMP>/` |
- `@pytest.mark.slow`: Long-running tests (model training, heavy parametrization)
- `@pytest.mark.network`: Tests requiring emulator/web connections
- `@pytest.mark.bench`: Performance benchmarking and plotting
- `@pytest.mark.longctx`: Tests with ≥64k context
Fast Lane (under 3 minutes):
```powershell
# Windows PowerShell
.\scripts\test_fast.ps1
```

```bash
# Linux/Mac bash
bash scripts/test_fast.sh
```

Full Suite (10-15 minutes):
```powershell
# Windows PowerShell
.\scripts\test_full.ps1
```

```bash
# Linux/Mac bash
bash scripts/test_full.sh
```

CI Validation:
```powershell
# Windows PowerShell
.\scripts\test_ci.ps1
```

```bash
# Linux/Mac bash
bash scripts/test_ci.sh
```

Bench Sweep (5-10 minutes):
```powershell
# Windows PowerShell
.\scripts\bench_sweep.ps1 -time_budget_s 180 -full -create_plots
```

```bash
# Linux/Mac bash
bash scripts/bench_sweep.sh
```

Sync Profiling Data:
```powershell
# Windows PowerShell
.\scripts\sync_profiling.ps1
```

```bash
# Linux/Mac bash
bash scripts/sync_profiling.sh
```

- `@pytest.mark.slow`: Long-running tests
- `@pytest.mark.network`: Network-dependent tests
- `@pytest.mark.bench`: Performance benchmarks
- `@pytest.mark.longctx`: Long context tests
- Fast lane: ≤3 minutes
- Full suite: 10-15 minutes
- Bench sweep: 5-10 minutes per config
- CI: ≤3 minutes
- Test results: Console output
- Bench results: `profiling/results/<UTC_ISO>/` (CSV, JSONL, plots)
- Profiling data: Consolidated in `profiling/` directory