Codex × Hermes BareMetal Doom Experiment, formerly Hermes OS-Doom Harness, is a Google I/O-inspired multi-agent software development experiment: use Codex and Hermes-style agent workflows to build a tiny QEMU-bootable OS, add just enough kernel/runtime surface to host doomgeneric/Freedoom, and prove the result with executable gates instead of screenshots or claims.
This GIF is generated from QEMU monitor screendump frames captured from the live VNC/display surface during ./evaluate.sh live-display; it is not hand-authored or AI-generated. A still gameplay frame from ./evaluate.sh playable is also kept at docs/assets/readme-doom.png.
The point is not that an AI agent magically wrote an operating system in one shot. The point is that complex agent work needs contracts, small scopes, failure logs, repeatable evaluation, and durable handoffs. Doom/Freedoom is used as the final integration pressure test because it forces display, input, timing, memory, and game-data loading to work together.
Korean summary (translated): This repository is a multi-agent development experiment that uses Hermes Agent as the root orchestrator and runs a Doom-compatible demo on a toy OS through narrow milestones and a QEMU-based evaluator.
Status: complete through M15f evaluator-backed Google-demo parity threshold.
The full regression gate passes:
./evaluate.sh all
Expected final marker:
ALL_PASS
Generated logs under artifacts/logs/ are intentionally not committed. The committed parity JSON/report are evidence snapshots; reproduce them from a clean checkout with:
./evaluate.sh all > artifacts/logs/all.log 2>&1
./evaluate.sh google-parity-audit
- The M11b IRQ-backed Doom smoke is a dedicated gate available as `./evaluate.sh irq-doom` and now included in `./evaluate.sh all`.
- M12a adds `./evaluate.sh memory-map`, a boot-memory discovery gate that parses QEMU's Multiboot memory map and reports usable/reserved regions over serial.
- M12b adds `./evaluate.sh memory-heap`, a narrow heap smoke that selects a 4KiB-aligned heap region after the kernel image from that Multiboot map and initializes the existing bump allocator there.
- M12c adds `./evaluate.sh file-layer`, a sidecar kernel gate that serves an embedded config file and a volatile save-file test double through the kernel compatibility stdio/stat/access APIs.
- M13 adds `./evaluate.sh swarm-evidence`, an actual-records-only evidence gate for durable handoffs, generated run logs, and explicit unavailable metrics.
- M14 adds `./evaluate.sh public-demo`, a one-command public demo package that regenerates playable/live-display evidence and a conservative public report.
- M15 adds `./evaluate.sh google-parity-audit`, `./evaluate.sh paging`, `./evaluate.sh physical-allocator`, `./evaluate.sh persistent-save`, `./evaluate.sh video-capture-demo`, and `./evaluate.sh real-agent-swarm-run`: the project now reaches its evaluator-backed Google-demo parity threshold at 94/100 while preserving conservative claim boundaries.
It still does not claim Google-scale 93 agents, production OS status, real-hardware video proof, or polished live stage recording.
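The M12a/M12b description above (parse the Multiboot memory map, then pick a 4KiB-aligned heap region after the kernel image) can be sketched on the host as follows. This is only an illustration, not the kernel's actual code: `MmapEntry`, `select_heap`, and the sample addresses are hypothetical names and values, and the only fact assumed from the Multiboot v1 specification is that memory-map type 1 means available RAM.

```python
from typing import List, NamedTuple, Optional, Tuple

PAGE_SIZE = 4096          # 4 KiB alignment, as described for M12b
MULTIBOOT_AVAILABLE = 1   # Multiboot v1 memory-map type for usable RAM


class MmapEntry(NamedTuple):
    """Hypothetical host-side view of one memory-map entry."""
    base: int
    length: int
    type: int


def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


def select_heap(entries: List[MmapEntry], kernel_end: int,
                min_size: int = PAGE_SIZE) -> Optional[Tuple[int, int]]:
    """Return (start, size) of the first usable, page-aligned span above kernel_end."""
    for entry in entries:
        if entry.type != MULTIBOOT_AVAILABLE:
            continue  # skip reserved regions
        start = align_up(max(entry.base, kernel_end), PAGE_SIZE)
        end = (entry.base + entry.length) & ~(PAGE_SIZE - 1)  # align end down
        if start + min_size <= end:
            return start, end - start
    return None


# Illustrative map: low memory, a reserved hole, then RAM above 1 MiB.
sample = [
    MmapEntry(0x00000000, 0x0009FC00, 1),
    MmapEntry(0x0009FC00, 0x00000400, 2),
    MmapEntry(0x00100000, 0x07EE0000, 1),
]
heap = select_heap(sample, kernel_end=0x00123456)
```

With the sample values, the low-memory region is rejected because it ends below the kernel image, and the heap starts at the first page boundary above the assumed `kernel_end`.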
This repository is best described as an evaluator-backed, small-scale public reproduction of the Google I/O Antigravity-style OS + Doom demo. It reaches the repo-defined parity threshold because it boots a custom QEMU OS, runs Doom/Freedoom, proves OS-depth sidecars, emits public artifacts, and records actual Codex-native subagent evidence.
It is not a full Google-scale recreation. The project does not claim the reported 93-agent scale, exact per-agent token/cost export, production OS status, real hardware execution, sound/networking, or a polished live-stage recording.
| Dimension | Current project | Compared with the Google I/O demo |
|---|---|---|
| Bootable OS | QEMU Multiboot toy OS with serial-verifiable gates | Near demo-level for the toy OS reproduction goal |
| Doom/Freedoom execution | Sustained QEMU Doom/Freedoom loop with input and frame artifacts | Near demo-level for reproducible Doom execution |
| OS depth | Multiboot memory map, heap selection, paging sidecar, physical allocator sidecar, raw-disk save sidecar | Strong for a toy OS, but still not a production OS |
| Reproducibility | ./evaluate.sh all, reports, screenshots, GIF, JSON score | More transparent and repeatable than a stage-only claim |
| Agent evidence | Three real Codex-native subagent audits plus durable metrics artifact | Much smaller than Google-scale multi-agent orchestration |
| Demo polish | QEMU/VNC GIF, static screenshots, and HTML previews | Below a polished live stage demo |
Safe public wording: "This project reaches its evaluator-backed Google-demo parity threshold for a QEMU/Freedoom OS-Doom reproduction." Avoid saying it is a Google-scale 93-agent recreation, a production OS, or a real-hardware stage demo.
The final playable gate boots the kernel in QEMU, loads embedded Freedoom data through the kernel-side WAD provider, initializes upstream doomgeneric, advances a sustained loop, injects movement/action keys through the QEMU monitor, verifies frame/key progress over serial, and generates a full 320x200 preview from the serial frame dump.
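As a rough illustration of the last step above (turning a dumped frame into a viewable preview), the sketch below wraps raw RGB bytes in a binary PPM container, the format used by `artifacts/screenshots/demo.ppm`. The serial frame-dump encoding itself is not described here, so the sketch assumes the pixels have already been decoded into a flat RGB byte string; `encode_ppm` is a hypothetical helper, not the project's preview renderer.

```python
def encode_ppm(width: int, height: int, rgb: bytes) -> bytes:
    """Wrap flat 8-bit RGB pixel data in a binary (P6) PPM container."""
    expected = width * height * 3
    if len(rgb) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(rgb)}")
    header = f"P6\n{width} {height}\n255\n".encode("ascii")
    return header + rgb


# Assumption: a decoded 320x200 frame, here filled with black pixels.
frame = bytes(320 * 200 * 3)
ppm = encode_ppm(320, 200, frame)
```

Writing `ppm` to a file with `open(path, "wb")` yields an image that common viewers and converters can read.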
The M9 live-display gate proves a separate visible-surface path: DG_DrawFrame() copies real Doom/Freedoom frames to a QEMU display surface and captures headless VNC-backed screendump frames. In this QEMU -kernel environment Multiboot framebuffer metadata is not provided, so the automated gate programs a VGA 320x200x256 graphics fallback and blits Doom frames into the QEMU-visible VGA surface while preserving the framebuffer blit path for bootloaders that do provide one.
The M10 interactive-input gate proves the current polling PS/2 path can deliver deterministic Doom key press/release events for arrows, space, control, enter, and escape. M11 adds a separate interrupt-smoke kernel that loads an IDT, remaps the PIC, receives PIT IRQ0 ticks, and receives QEMU-injected PS/2 keyboard scancodes through IRQ1. M11b adds a dedicated compile-time Doom smoke path where DG_GetTicksMs/DG_SleepMs use the IRQ tick counter and DG_GetKey consumes the IRQ keyboard queue. The broader playable/live-display gates remain on the established polling path until a later migration gate expands the claim.
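To make the press/release ordering concrete, here is a minimal host-side sketch of decoding PS/2 Scancode Set 1 bytes for the keys named above. In Set 1 a release code is the press code with bit 7 set, and the arrow keys are sent with an `0xE0` prefix. The function and table names are hypothetical, and this is not the kernel driver's code.

```python
from typing import Iterable, Iterator, Tuple

# Scancode Set 1 make codes for the keys exercised by the gate.
PLAIN_KEYS = {0x01: "escape", 0x1C: "enter", 0x1D: "control", 0x39: "space"}
# Keys that arrive after an 0xE0 prefix (arrows and right control).
EXTENDED_KEYS = {0x48: "up", 0x4B: "left", 0x4D: "right", 0x50: "down",
                 0x1D: "control"}


def decode_set1(stream: Iterable[int]) -> Iterator[Tuple[str, bool]]:
    """Yield (key_name, pressed) events from raw Set 1 scancode bytes."""
    extended = False
    for byte in stream:
        if byte == 0xE0:
            extended = True          # next byte belongs to the extended table
            continue
        pressed = (byte & 0x80) == 0  # bit 7 set means release
        make = byte & 0x7F
        table = EXTENDED_KEYS if extended else PLAIN_KEYS
        extended = False
        name = table.get(make)
        if name is not None:          # ignore keys outside this sketch
            yield name, pressed


events = list(decode_set1([0xE0, 0x48, 0xE0, 0xC8, 0x39, 0xB9]))
```

Here `events` holds an up-arrow press and release followed by a space press and release, in the order the bytes arrived.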
Primary proof artifacts after running the gate:
- `artifacts/logs/all.log` or `artifacts/logs/all-post-commit.log`
- `artifacts/logs/playable.log`
- `artifacts/logs/memory-map.log`
- `artifacts/logs/memory-heap.log`
- `artifacts/logs/paging.log`
- `artifacts/logs/physical-allocator.log`
- `artifacts/logs/persistent-save.log`
- `artifacts/logs/file-layer.log`
- `artifacts/logs/swarm-evidence.log`
- `artifacts/logs/public-demo.log`
- `artifacts/logs/video-capture-demo.log`
- `artifacts/logs/real-agent-swarm-run.log`
- `artifacts/agent_run_log.md`
- `artifacts/agent_metrics.json`
- `artifacts/real_agent_swarm_run.md`
- `artifacts/real_agent_swarm_metrics.json`
- `artifacts/public_demo_report.md`
- `artifacts/video_capture_report.md`
- `artifacts/google_parity_report.md`
- `artifacts/google_parity_score.json`
- `artifacts/logs/interactive-input.log`
- `artifacts/logs/irq.log`
- `artifacts/logs/irq-doom.log`
- `artifacts/logs/live-display.log`
- `artifacts/screenshots/demo.ppm`
- `artifacts/screenshots/demo.png`
- `artifacts/screenshots/demo.html`
- `artifacts/screenshots/live-display.ppm`
- `artifacts/screenshots/live-display.png`
- `artifacts/screenshots/live-display.html`
- `artifacts/videos/demo.gif`
- `artifacts/videos/live-display-frames/live-display-*.ppm`
- `artifacts/final_report.md`
- `artifacts/test_report.md`
- `artifacts/demo_commands.md`
Generated screenshots/logs are intentionally not treated as source; regenerate them locally with the commands below.
As of the latest local Codex accounting pass on 2026-05-24, the project has the following measurable implementation footprint.
Codex session-log accounting for F:\codex_hermes_os_doom:
- Time window: 2026-05-23 13:55:25 KST to 2026-05-24 13:41:36 KST.
- Wall-clock span: about 23 hours, 46 minutes, 11 seconds.
- Sessions: 24.
- Model calls: 1,212.
- Logged total tokens: 153,428,346.
- Input tokens: 152,655,141.
- Cached input tokens: 147,149,312.
- Output tokens: 773,205.
- Reasoning output tokens reported separately: 288,826.
- Non-cached input plus output tokens: about 6,279,034.
Codex Goal accounting is tracked separately and should not be added directly to the session-log total because it can overlap with the same underlying work:
- M12-M14 goal: 430,769 tokens over 1 hour, 19 minutes, 37 seconds.
- M15 goal: 622,814 tokens over 1 hour, 15 minutes, 27 seconds.
- Combined explicit Goal accounting for those two goals: 1,053,583 tokens over 2 hours, 35 minutes, 4 seconds.
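The derived figures above can be re-checked with a few lines of arithmetic. The sketch assumes, as the "non-cached input plus output" figure implies, that cached input tokens are a subset of input tokens; the variable names are illustrative only.

```python
from datetime import datetime, timedelta

# Session-log figures as reported above.
input_tokens = 152_655_141
cached_input_tokens = 147_149_312
output_tokens = 773_205

logged_total = input_tokens + output_tokens
non_cached_plus_output = (input_tokens - cached_input_tokens) + output_tokens

# Wall-clock span between the first and last logged timestamps (both KST).
span = datetime(2026, 5, 24, 13, 41, 36) - datetime(2026, 5, 23, 13, 55, 25)

# Goal accounting, kept separate from the session-log total.
goal_tokens = 430_769 + 622_814
goal_time = (timedelta(hours=1, minutes=19, seconds=37)
             + timedelta(hours=1, minutes=15, seconds=27))

print(logged_total)            # 153428346
print(non_cached_plus_output)  # 6279034
print(span)                    # 23:46:11
print(goal_tokens, goal_time)  # 1053583 2:35:04
```

The Goal totals are deliberately not added to `logged_total`, since they can overlap with the same underlying sessions.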
Hermes/agent caveat: Codex-native subagents that wrote Codex session logs under this project cwd are included in the session-log total. The confirmed M15 real subagent slice accounts for 1,459,434 logged tokens across three Codex-native subagents, with an overlapping wall-clock window of about 2 minutes, 29 seconds. External Hermes-only token and cost telemetry is not exported into the repository artifacts; artifacts/agent_metrics.json and artifacts/real_agent_swarm_metrics.json explicitly record token/cost metrics as unavailable. Therefore the honest project total is the Codex session-log total above, plus an explicit caveat that any external Hermes usage outside those logs is currently unmeasured.
From Linux or WSL:
./scripts/check_deps.sh
./evaluate.sh all
For the visible preview only:
make visible-demo
Then open:
artifacts/screenshots/demo.html
The preview is derived from QEMU serial output; it is not a hand-authored image.
For the M9 live-display artifact:
./evaluate.sh live-display
Then open:
artifacts/screenshots/live-display.html
For the M10 interactive-input gate:
./evaluate.sh interactive-input
For the M11 IRQ groundwork gate:
./evaluate.sh irq
For the M11b IRQ-backed Doom input/timer smoke:
./evaluate.sh irq-doom
For the M12a Multiboot memory-map smoke:
./evaluate.sh memory-map
For the M12b memory-map-backed heap smoke:
./evaluate.sh memory-heap
For the M15b paging sidecar smoke:
./evaluate.sh paging
For the M15c physical allocator sidecar smoke:
./evaluate.sh physical-allocator
For the M15d persistent save sidecar smoke:
./evaluate.sh persistent-save
For the M12c embedded file/config test-double smoke:
./evaluate.sh file-layer
For the M13 actual agent-run evidence gate:
./evaluate.sh swarm-evidence
For the M14 public demo package:
./evaluate.sh public-demo
For the M15e animated QEMU/VNC capture package:
./evaluate.sh video-capture-demo
For the M15f real Codex-native subagent evidence package:
./evaluate.sh real-agent-swarm-run
For the M15a Google-demo parity audit:
./evaluate.sh google-parity-audit
For a manual windowed smoke session after the gate has built the dedicated kernel:
qemu-system-x86_64 \
  -kernel build/kernel_doom_interactive_input.elf \
  -serial stdio \
  -display gtk \
  -no-reboot \
  -no-shutdown
Focus the QEMU window and use arrows, space, control, enter, and escape. The serial output should show DOOM_KEY_EVENT_FULL lines. On systems without GTK display support, use an available QEMU display backend such as SDL or VNC.
Every milestone is executable through evaluate.sh:
| Gate | Purpose | Success marker |
|---|---|---|
| `sanity` | Repository/control-file structure | `SANITY_OK` |
| `build` | Freestanding kernel ELF/bin build | `BUILD_OK` |
| `boot` | QEMU Multiboot kernel boot and serial marker | `BOOT_OK` |
| `display` | Deterministic VGA text-mode smoke | `DISPLAY_OK` |
| `input` | QEMU monitor key injection into PS/2 path | `INPUT_OK` |
| `timer` | PIT-backed tick/sleep smoke | `TIMER_OK` |
| `runtime` | Freestanding string/memory/heap smoke | `RUNTIME_OK` |
| `memory-map` | Multiboot memory-map inspection in QEMU | `MEMORY_MAP_OK` |
| `memory-heap` | Memory-map-backed heap initialization in QEMU | `MEMORY_HEAP_OK` |
| `paging` | 4MiB identity-mapped paging sidecar smoke | `PAGING_OK` |
| `physical-allocator` | Memory-map-backed physical page free stack smoke | `PHYS_ALLOC_OK` |
| `persistent-save` | QEMU raw-disk save record persists across two boots | `PERSIST_SAVE_OK` |
| `file-layer` | Embedded config and volatile save-file test double in QEMU | `FILE_LAYER_OK` |
| `swarm-evidence` | Actual-records-only agent handoff/metrics evidence | `SWARM_EVIDENCE_OK` |
| `public-demo` | One-command playable/live-display public demo package | `PUBLIC_DEMO_OK` |
| `video-capture-demo` | Animated GIF capture from QEMU/VNC screendump frames | `VIDEO_CAPTURE_DEMO_OK` |
| `real-agent-swarm-run` | Actual Codex-native subagent verification evidence | `REAL_AGENT_SWARM_RUN_OK` |
| `google-parity-audit` | Google-demo parity score/report audit | `GOOGLE_PARITY_AUDIT_OK` |
| `doom-init` | Host compile/init of local doomgeneric bridge | `DOOM_INIT_OK` |
| `doom-frame` | Deterministic host `DG_DrawFrame` bridge | `DOOM_FRAME_1_OK` |
| `doom-os-bridge` | Kernel-linked local Doom platform bridge | `DOOM_OS_BRIDGE_OK` |
| `freedoom-host` | Full upstream doomgeneric + Freedoom host frame | `FREEDOOM_HOST_OK` |
| `freedoom-qemu` | Embedded Freedoom WAD metadata/lump provider in QEMU | `FREEDOOM_QEMU_WAD_OK` |
| `doom-kernel-compile` | Upstream doomgeneric freestanding compile | `DOOM_KERNEL_COMPILE_OK` |
| `doom-kernel-link` | Upstream doomgeneric objects linked into kernel | `DOOM_KERNEL_LINK_OK` |
| `doom-kernel-demo` | Real upstream Doom/Freedoom frame inside QEMU | `DOOM_KERNEL_DEMO_OK` |
| `doom-kernel-input` | QEMU keyboard input reaches Doom key events | `DOOM_KERNEL_INPUT_OK` |
| `interactive-input` | Multi-key Doom input press/release ordering through QEMU injection | `INTERACTIVE_INPUT_OK` |
| `irq` | IDT/PIC/PIT IRQ0 and keyboard IRQ1 smoke | `IRQ_OK` |
| `irq-doom` | Dedicated Doom smoke path using IRQ-backed timer ticks and keyboard queue | `IRQ_DOOM_OK` |
| `playable` | Sustained QEMU Doom/Freedoom loop with injected input | `PLAYABLE_OK` |
| `live-display` | Doom/Freedoom frames copied to a QEMU-visible live display surface | `LIVE_DISPLAY_OK` |
| `hardening` | Host-side WAD bounds and libc shim error-handling regression checks | `HARDENING_OK` |
| `assert-panic` | Freestanding assert diagnostics and `NDEBUG` no-evaluation behavior | `ASSERT_PANIC_OK` |
| `all` | Full regression | `ALL_PASS` |
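The `physical-allocator` row describes a memory-map-backed physical page free stack. A minimal host-side model of that idea, assuming 4KiB frames and a fixed capacity, might look like the sketch below. `PageFreeStack` and its methods are hypothetical names; the sidecar kernel's real data structure may differ.

```python
from typing import List, Optional

PAGE_SIZE = 4096


class PageFreeStack:
    """Bounded LIFO stack of free 4 KiB physical frame addresses."""

    def __init__(self, capacity: int) -> None:
        self._capacity = capacity
        self._frames: List[int] = []

    def add_region(self, base: int, length: int) -> int:
        """Push every whole page inside [base, base + length); return the count."""
        first = (base + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1)   # align start up
        last = (base + length) & ~(PAGE_SIZE - 1)           # align end down
        pushed = 0
        for frame in range(first, last, PAGE_SIZE):
            if len(self._frames) >= self._capacity:
                break                                        # stack is bounded
            self._frames.append(frame)
            pushed += 1
        return pushed

    def alloc(self) -> Optional[int]:
        """Pop the most recently freed frame, or None when exhausted."""
        return self._frames.pop() if self._frames else None

    def free(self, frame: int) -> None:
        """Return a page-aligned frame to the stack."""
        if frame % PAGE_SIZE != 0:
            raise ValueError("frame address must be page-aligned")
        if len(self._frames) >= self._capacity:
            raise ValueError("free stack is full")
        self._frames.append(frame)


stack = PageFreeStack(capacity=4)
pushed = stack.add_region(0x100800, 0x3000)  # unaligned on purpose
first_alloc = stack.alloc()
```

Because the region is unaligned at both ends, only the two pages that lie wholly inside it are pushed, and allocation returns the highest one first.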
PRD.md Product requirements and experiment framing
architecture.md Current technical architecture
contracts/ Module/subagent contracts and pass criteria
progress.md Milestone completion log
failure_log.md Reproducible failures and fixes
evaluate.sh Single validation entrypoint
Makefile Build and convenience targets
kernel/ Multiboot entry, kernel main, serial
linker.ld 32-bit freestanding kernel layout
drivers/ VGA display, PS/2 keyboard, PIT timer
runtime/ Minimal freestanding libc/heap primitives
ports/doomgeneric/ Local Doom bridge, Freedoom WAD provider, upstream vendor
tests/ Shell-based evaluator gates
scripts/ Dependency check, Freedoom fetch, preview renderer
docs/agent-runs/ Hermes-style run records and handoff logs
artifacts/ Generated reports plus ignored runtime logs/assets/screenshots
The kernel is intentionally small:
- QEMU loads a Multiboot v1 ELF through `-kernel`.
- `kernel/entry.S` enters 32-bit C code.
- `kernel/kernel.c` emits serial proof markers and runs smoke gates.
- Drivers provide the minimum display/input/timer surface.
- `runtime/` supplies enough freestanding C behavior for the kernel and Doom bridge.
- `ports/doomgeneric/` adapts upstream `doomgeneric` to host and kernel smoke environments.
- The final demo links verified Freedoom data into the kernel as an ELF binary object and serves it through a kernel-side WAD provider.
The evaluator is serial-log-first: a gate passes only when the expected marker appears in QEMU/host output.
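The marker rule can be sketched as a tiny host-side check. This assumes each marker is printed on its own line of the captured output, which is not confirmed here; the real gates are shell scripts under `tests/`, and `gate_passed` and `EXPECTED_MARKERS` are hypothetical names. The three marker values are taken from the gate table.

```python
# A few gate -> marker pairs copied from the gate table.
EXPECTED_MARKERS = {
    "boot": "BOOT_OK",
    "playable": "PLAYABLE_OK",
    "all": "ALL_PASS",
}


def gate_passed(output: str, marker: str) -> bool:
    """Return True only if the marker appears as a whole line of the output."""
    # strip() also removes the carriage returns that serial consoles emit.
    return any(line.strip() == marker for line in output.splitlines())


serial_log = "kernel: serial ready\r\nBOOT_OK\r\n"
boot_ok = gate_passed(serial_log, EXPECTED_MARKERS["boot"])
```

Matching whole lines rather than substrings avoids a false pass when a marker name is merely mentioned inside a longer diagnostic message.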
- This repository must not contain commercial Doom WADs.
- The harness downloads and verifies Freedoom for legal/free content testing.
- `artifacts/assets/freedoom/` is local generated state and is ignored by git.
- Vendored `doomgeneric` sources are kept under `ports/doomgeneric/upstream/` with upstream metadata and license files.
- This is a toy OS and demo harness, not a production operating system.
- There is no POSIX environment, multitasking, networking, sound, or production filesystem.
- M12a parses and reports the Multiboot memory map.
- M12b initializes the existing bump allocator from a selected Multiboot usable region in a dedicated smoke gate.
- M12c adds an embedded config and volatile save-file test double, not persistent disk storage.
- M13 verifies actual evidence records only; it does not fabricate a new swarm run.
- M14 packages a reproducible public demo.
- M15b proves bounded identity-mapped paging in a sidecar kernel, not virtual memory isolation or Doom-path paging migration.
- M15c proves a bounded physical page free stack in a sidecar kernel, not Doom-path memory migration or a production memory manager.
- M15d proves one raw-disk save record across two QEMU boots, not Doom save migration or a filesystem.
- M15e/M15f add GIF capture and actual Codex-native subagent evidence while explicitly avoiding Google-scale 93-agent and production claims.
- Sound is intentionally no-op in the current smoke/demo path.
- The automated final proof is still serial-log/checksum anchored; M9 adds QEMU display-surface screendumps and a VNC-derived GIF, using a VGA 320x200x256 graphics fallback because QEMU `-kernel` does not provide Multiboot framebuffer metadata here.
- M10 proves polling PS/2 Set 1 press/release ordering through QEMU input. M11 proves IRQ delivery in a dedicated sidecar kernel. M11b proves a dedicated Doom smoke path can consume IRQ-backed timer ticks and keyboard events, but the broader playable/live-display loops have not yet migrated to IRQ-backed input/timer.
- QEMU is required for meaningful OS validation; `make build` alone is not enough proof.
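For readers unfamiliar with what the M15b "bounded identity-mapped paging" claim covers, the sketch below models a 32-bit x86 identity map of the first 4MiB on the host. It assumes the range is covered by one page table of 1024 4KiB entries rather than a single large page, which the description above does not specify. All function names and the `0x200000` table address are hypothetical.

```python
from typing import Dict, List, Optional

PAGE_SIZE = 4096
ENTRIES = 1024            # entries per page directory or page table
PRESENT = 0x1             # x86 paging flag: entry is valid
WRITABLE = 0x2            # x86 paging flag: read/write


def build_identity_table() -> List[int]:
    """One page table mapping virtual 0..4 MiB onto the same physical addresses."""
    return [(i * PAGE_SIZE) | PRESENT | WRITABLE for i in range(ENTRIES)]


def build_directory(table_phys: int) -> List[int]:
    """Page directory whose first entry points at the identity table."""
    if table_phys % PAGE_SIZE != 0:
        raise ValueError("page table must be page-aligned")
    directory = [0] * ENTRIES
    directory[0] = table_phys | PRESENT | WRITABLE
    return directory


def translate(directory: List[int], tables: Dict[int, List[int]],
              vaddr: int) -> Optional[int]:
    """Walk the two-level structure the way the MMU would; None means a fault."""
    pde = directory[vaddr >> 22]                 # top 10 bits select the PDE
    if not pde & PRESENT:
        return None
    pte = tables[pde & ~0xFFF][(vaddr >> 12) & 0x3FF]  # next 10 bits select the PTE
    if not pte & PRESENT:
        return None
    return (pte & ~0xFFF) | (vaddr & 0xFFF)      # frame base plus page offset


table = build_identity_table()
directory = build_directory(0x200000)
tables = {0x200000: table}
```

In this model, `translate(directory, tables, 0x123456)` returns the same address, while any address at or above `0x400000` returns `None`, which mirrors the bounded claim: nothing outside the first 4MiB is mapped.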
- Agent-run evidence and handoffs: `docs/agent-runs/`
- Final report: `artifacts/final_report.md`
- Demo command reference: `artifacts/demo_commands.md`
- Test report: `artifacts/test_report.md`
- Detailed milestone and failure history: `progress.md`, `failure_log.md`
The current project is a serial-verified reproduction, not a full Google I/O
demo equivalent. The scale-up plan is documented in
docs/plans/2026-05-23-google-demo-level-roadmap.md.
See docs/setup.md for package installation notes. In short, Ubuntu/WSL needs build tools, 32-bit GCC support, binutils, file, curl, and QEMU:
sudo apt-get update
sudo apt-get install -y build-essential gcc-multilib binutils file curl qemu-system-x86
Then:
./scripts/check_deps.sh
./evaluate.sh all