t4a

A lightweight terminal daemon designed to be observed by LLMs, not humans.

Single static Rust binary. ~2,000 lines. Manages PTY sessions, maintains virtual terminal state, renders screenshots on demand, and emits events when interesting things happen. Exposes everything over a newline-delimited JSON protocol on a Unix socket. Ships with a thin CLI client.

Motivation

Current coding agents interact with terminals by capturing stdout/stderr as text. This is wasteful — a cargo build produces thousands of lines the agent doesn't need, and structured terminal output (tables, diffs, TUI programs) becomes a mess of escape codes consuming context tokens.

Humans don't work this way. They glance at the terminal, see the overall state, and focus on specific lines when needed. t4a gives LLMs the same experience: a screenshot to glance at, line-level text extraction for precision, and event notifications so the agent doesn't waste cycles polling.

The screenshot-as-primary-observation pattern exploits the fact that vision tokens are cheap (a full terminal screenshot is ~1,334 tokens in a single tile) and that LLMs process images with bidirectional attention. For high-contrast monospace text on a fixed grid, visual comprehension is near-perfect.

Each terminal is a PTY paired with a VT100 state machine. The state machine absorbs all output (including escape codes) and maintains the current screen contents as a character grid. Screenshots are rendered from this grid on demand. The event watcher monitors the raw byte stream for patterns (silence, prompts, bell, process exit) and emits notifications.

Protocol

Newline-delimited JSON over a Unix socket at /tmp/t4a.sock (configurable via T4A_SOCKET).

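Each CLI invocation connects, sends one JSON line, reads one JSON response line, and disconnects. All requests include a "cmd" field. All responses include "ok": true on success or "ok": false, "error": "..." on failure. Two commands extend the single-line response: screenshot follows its JSON line with raw PNG bytes, and events keeps streaming JSON lines until the client disconnects.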
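The request/response envelope can be sketched in a few lines of Python. This is only an illustration of the framing described above; the helper names `encode_request`, `decode_response`, and `T4aError` are hypothetical and not part of t4a.

```python
import json


class T4aError(Exception):
    """Raised when the daemon answers with "ok": false (hypothetical helper)."""


def encode_request(cmd, **fields):
    """Serialize one request as a single newline-terminated JSON line."""
    return json.dumps({"cmd": cmd, **fields}).encode() + b"\n"


def decode_response(line):
    """Parse one response line and turn "ok": false into an exception."""
    resp = json.loads(line)
    if not resp.get("ok"):
        raise T4aError(resp.get("error", "unknown error"))
    return resp
```

Keeping the encoding separate from the socket handling makes the envelope easy to test without a running daemon.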

Terminals

create — Create a new terminal.

// Request
{"cmd": "create", "cols": 80, "rows": 24, "cmd_args": ["bash"], "cwd": "/home/x", "env": {}}

// Response
{"ok": true, "id": "t1", "cols": 80, "rows": 24, "pid": 12345}

All fields except cmd are optional. Defaults: 80x24, $SHELL or bash.

list — List all terminals.

// Request
{"cmd": "list"}

// Response
{"ok": true, "terminals": [{"id": "t1", "cols": 80, "rows": 24, "pid": 12345, "alive": true, "title": "bash"}]}

kill — Kill terminal and clean up. Sends SIGHUP to the shell process group.

{"cmd": "kill", "id": "t1"}

Input

send — Write to the terminal.

// Request
{"cmd": "send", "id": "t1", "input": "cargo build --release\n"}

// Response
{"ok": true}

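Input is a string. Use \n for Enter, \x03 for Ctrl+C, \x1b[A for arrow up, and so on. JSON itself has no \x escape, so on the wire these control characters are written as \u0003 and \u001b[A. Raw bytes can be sent instead through "input_base64".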
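A minimal sketch of building send requests in Python, where `json.dumps` emits the escapes for control characters automatically. The helper names are hypothetical, and treating "input_base64" as standard base64 is an assumption based on the field name.

```python
import base64
import json


def send_text(tid, text):
    """Build a send request; json.dumps escapes control characters for us."""
    return json.dumps({"cmd": "send", "id": tid, "input": text})


def send_bytes(tid, raw):
    """Build a send request carrying arbitrary bytes (assumes standard base64)."""
    encoded = base64.b64encode(raw).decode("ascii")
    return json.dumps({"cmd": "send", "id": tid, "input_base64": encoded})


ctrl_c = send_text("t1", "\x03")       # Ctrl+C
arrow_up = send_text("t1", "\x1b[A")   # ESC [ A
binary = send_bytes("t1", b"\xff\x00")  # bytes that are not valid UTF-8
```

The base64 form is mainly useful for bytes that cannot be represented in a JSON string at all.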

Observation

screenshot — Render the current viewport as PNG.

// Request
{"cmd": "screenshot", "id": "t1", "cursor": true, "pad": 1, "scale": 66}

// Response (two parts)
{"ok": true, "len": 12345}
<12345 raw PNG bytes follow immediately after the JSON line>

The response is a JSON header line with the byte length, followed by exactly that many raw PNG bytes. All parameters are optional.
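The two-part framing can be read from any buffered byte stream. The sketch below assumes a file-like object (for a socket, `sock.makefile("rb")` provides one); the function name `read_screenshot` is hypothetical. The signature check uses the standard 8-byte PNG magic number.

```python
import io
import json

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # fixed first 8 bytes of every PNG file


def read_screenshot(stream):
    """Read the JSON header line, then exactly header["len"] payload bytes."""
    header = json.loads(stream.readline())
    if not header.get("ok"):
        raise RuntimeError(header.get("error", "screenshot failed"))
    want = header["len"]
    png = b""
    while len(png) < want:
        chunk = stream.read(want - len(png))
        if not chunk:
            raise EOFError("stream ended before the PNG was complete")
        png += chunk
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("payload does not look like a PNG")
    return png
```

Reading the header with `readline()` on a buffered stream avoids consuming PNG bytes as part of the header, which is the main pitfall with this kind of mixed text/binary response.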

text — Read text from the terminal buffer.

// Request — last 5 lines (indexed from bottom, 0 = last line)
{"cmd": "text", "id": "t1", "start": 0, "end": 5, "trim": true}

// Request — all lines (omit start/end)
{"cmd": "text", "id": "t1", "trim": true}

// Response
{"ok": true, "lines": ["$ cargo build", "   Compiling..."], "region": "viewport", "start": 0, "end": 5, "total_lines": 24}

cursor — Cursor position and state.

// Request
{"cmd": "cursor", "id": "t1"}

// Response
{"ok": true, "row": 5, "col": 32, "visible": true}

Resize

resize — Resize terminal. Sends SIGWINCH to the child process.

{"cmd": "resize", "id": "t1", "cols": 120, "rows": 40}

Events

events — Stream of terminal events as newline-delimited JSON.

// Request
{"cmd": "events", "terminal": "t1"}

// Response (streaming, one JSON line per event until disconnect)
{"event": "idle", "terminal": "t1", "after_ms": 2000}
{"event": "bell", "terminal": "t1"}
{"event": "command_done", "terminal": "t1", "code": 0}
{"event": "exit", "terminal": "t1", "code": 0}
{"event": "title", "terminal": "t1", "title": "vim src/main.rs"}
{"event": "activity", "terminal": "t1"}

The terminal filter is optional — omit it to receive events from all terminals.

Event types:

| Event | Trigger | Use |
|---|---|---|
| idle | No output for N ms after activity (configurable, default 2000ms) | Command probably finished |
| bell | BEL character (\x07) received | Program wants attention |
| command_done | Shell integration OSC received after command completes | Shell is ready for input, includes exit code |
| exit | Child process exited | Terminal is dead |
| title | OSC title sequence received | Window title changed |
| activity | Output resumed after idle period | Something is happening again |
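A harness typically consumes this stream to decide when a command has settled. The sketch below works on any iterable of JSON lines, such as the output of the events command; the function name `wait_until_settled` and the choice of stop events are assumptions made for illustration.

```python
import json


def wait_until_settled(lines, terminal_id,
                       stop_on=("idle", "command_done", "exit")):
    """Return the first event for terminal_id whose type is in stop_on.

    `lines` is any iterable of newline-delimited JSON event lines.
    Returns None if the stream ends without a matching event.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between events
        event = json.loads(line)
        if event.get("terminal") != terminal_id:
            continue  # event belongs to another terminal
        if event.get("event") in stop_on:
            return event
    return None
```

Filtering on the terminal ID keeps the function usable when the stream was opened without a terminal filter.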

Configuration

config — Update daemon configuration.

{
  "cmd": "config",
  "idle_timeout_ms": 2000
}

CLI

The CLI is a thin client for the Unix socket. Every subcommand maps to one protocol message. The daemon auto-starts on first command.

t4a daemon [timeout_ms]
t4a create [-- cmd...]
t4a list
t4a send <id> <input>
t4a screenshot <id> [-o file.png]
t4a text <id> [start:end]
t4a cursor <id>
t4a resize <id> <cols> <rows>
t4a events [id]
t4a kill <id>

The send command reads from stdin if no input argument is given. The screenshot command writes PNG to stdout by default. The events command streams newline-delimited JSON to stdout.

LLM Integration Pattern

The intended usage pattern — this is not part of t4a itself but shows how an agent harness uses it:

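import json
import os
import socket

# The socket path is configurable via T4A_SOCKET (see Protocol).
SOCKET_PATH = os.environ.get("T4A_SOCKET", "/tmp/t4a.sock")

def _connect():
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(SOCKET_PATH)
    return s

def _read_line(s):
    # Read one byte at a time so we never consume bytes past the newline;
    # the screenshot response is followed immediately by raw PNG data.
    line = b""
    while not line.endswith(b"\n"):
        chunk = s.recv(1)
        if not chunk:
            raise ConnectionError("socket closed before end of line")
        line += chunk
    return line

def t4a_request(req):
    with _connect() as s:
        s.sendall(json.dumps(req).encode() + b"\n")
        return json.loads(_read_line(s))

def t4a_screen(tid):
    with _connect() as s:
        s.sendall(json.dumps({"cmd": "screenshot", "id": tid}).encode() + b"\n")
        header = json.loads(_read_line(s))
        if not header.get("ok"):
            raise RuntimeError(header.get("error", "screenshot failed"))
        png = b""
        while len(png) < header["len"]:
            chunk = s.recv(header["len"] - len(png))
            if not chunk:
                raise ConnectionError("socket closed before PNG was complete")
            png += chunk
        return png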

# Create terminal
t = t4a_request({"cmd": "create", "cols": 80, "rows": 24})
tid = t["id"]

# Define tools for the LLM
tools = [
    {
        "name": "terminal_send",
        "description": "Send input to the terminal. Use \\n for enter, \\x03 for Ctrl+C.",
        "parameters": {"input": "string"}
    },
    {
        "name": "terminal_screen",
        "description": "Get a screenshot of the terminal viewport. Returns PNG.",
        "parameters": {}
    },
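    {
        "name": "terminal_read",
        "description": "Read exact text from specific terminal lines.",
        "parameters": {"start": "int", "end": "int"}
    },
    {
        "name": "terminal_wait",
        "description": "Block until the terminal goes idle, then return a screenshot.",
        "parameters": {}
    },
]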

A typical agent turn:

1. LLM calls terminal_send("cargo build --release\n")
2. LLM calls terminal_wait()
3. ← Agent harness blocks until idle event, returns screenshot
4. LLM sees screenshot: "build failed, error near bottom of screen"
5. LLM calls terminal_read(0, 5)  # read the last five lines; the text command indexes from the bottom
6. ← Returns exact text: "error[E0308]: mismatched types..."
7. LLM reasons about the fix, calls terminal_send("vim src/main.rs\n")
8. LLM calls terminal_screen()
9. ← Screenshot of vim, LLM navigates visually

Token budget per turn: ~1,334 (screenshot) + ~100 (text read) = ~1,434 tokens of observation. Compare to dumping 50KB of build output as text (~12,000+ tokens).
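The comparison can be reproduced with a few lines of arithmetic. The screenshot and text-read figures are the ones quoted above; the 4 bytes per token ratio is an assumption (a common rule of thumb for plain text), not a figure from this document.

```python
SCREENSHOT_TOKENS = 1334  # figure quoted above
TEXT_READ_TOKENS = 100    # figure quoted above
BYTES_PER_TOKEN = 4       # assumption: rough heuristic for plain text


def text_dump_tokens(n_bytes, bytes_per_token=BYTES_PER_TOKEN):
    """Estimate the tokens needed to pass raw output through as text."""
    return n_bytes // bytes_per_token


observation = SCREENSHOT_TOKENS + TEXT_READ_TOKENS  # tokens per observed turn
dump = text_dump_tokens(50 * 1000)                  # 50KB of build output
ratio = dump / observation                          # how much larger the dump is
```

Output dense with escape codes tends to tokenize less efficiently than prose, so the real gap is likely wider than this estimate.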

Implementation Notes

Crate Dependencies

[dependencies]
portable-pty = "0.9"       # PTY creation and management
vt100 = "0.16"             # VT100/xterm state machine
png = "0.18"               # PNG encoding
image = "0.25"             # Image scaling
tokio = "1"                # async runtime (socket accept + PTY reading)
serde = "1"                # JSON serialization
serde_json = "1"
nix = "0.31"               # Signal handling
noto-sans-mono-bitmap = "0.3" # Embedded monospace font

Font Rendering

Embed a monospace bitmap font directly in the binary. With glyphs 20px high and 10px wide, an 80×24 terminal renders to 800×480 pixels. The image is then downscaled to ~528×332 at 66% for vision token efficiency (~257 tokens per screenshot).

The vt100 crate maintains a cell grid with character + attributes (bold, color, inverse, etc.). The renderer walks this grid and blits each character from the embedded font, applying foreground/background colors. No text shaping, no kerning, no ligatures. It's a fixed grid of glyphs.
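The grid walk can be illustrated with a toy renderer. This is a sketch in Python, not the Rust renderer: the 3×5 `FONT` table is a hypothetical stand-in for the embedded font, the output is a grayscale pixel array instead of a PNG, and per-cell colors and attributes are left out.

```python
# Hypothetical 3x5 one-bit glyphs; the real renderer uses the embedded font.
FONT = {
    "H": ["101", "101", "111", "101", "101"],
    "I": ["111", "010", "010", "010", "111"],
    " ": ["000", "000", "000", "000", "000"],
}
GLYPH_W, GLYPH_H = 3, 5


def render(grid, fg=255, bg=0):
    """Blit each cell's glyph into a grayscale pixel array (list of rows).

    `grid` is a list of equal-length strings, one per terminal row.
    """
    rows, cols = len(grid), len(grid[0])
    pixels = [[bg] * (cols * GLYPH_W) for _ in range(rows * GLYPH_H)]
    for r, line in enumerate(grid):
        for c, ch in enumerate(line):
            glyph = FONT.get(ch, FONT[" "])  # unknown characters render blank
            for gy, bits in enumerate(glyph):
                for gx, bit in enumerate(bits):
                    if bit == "1":
                        pixels[r * GLYPH_H + gy][c * GLYPH_W + gx] = fg
    return pixels


image = render(["HI"])
```

Because every cell has the same size, the pixel position of a glyph follows directly from its row and column, which is why no text shaping is needed.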

Event Detection

A background task per terminal reads from the PTY master fd and:

  1. Feeds bytes into the vt100::Parser to update screen state
  2. Watches for event triggers:
    • Idle: track timestamp of last byte received. A timer fires if no bytes for idle_timeout_ms. Reset on new bytes.
    • Bell: watch for \x07 in the byte stream (before VT100 parsing)
    • Command done: shell integration OSC \033]7777;done;<code>\007 via precmd/PROMPT_COMMAND hook
    • Exit: waitpid on the child PID
    • Title: the vt100 crate exposes the window title set by OSC sequences
    • Activity: transition from idle state to receiving bytes

Events are broadcast to all listeners via a tokio::sync::broadcast channel.
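The idle, activity, bell, and command-done triggers above can be sketched as a small state machine. This Python version is illustrative only: the class name `EventWatcher` is hypothetical, timestamps are passed in by the caller instead of coming from a timer, and it assumes each escape sequence arrives whole within one chunk. It strips the command-done OSC before looking for BEL, since that sequence is itself terminated by \x07.

```python
import re

# Shell-integration marker described above: ESC ] 7777 ; done ; <code> BEL
OSC_DONE = re.compile(rb"\x1b\]7777;done;(-?\d+)\x07")


class EventWatcher:
    def __init__(self, terminal, idle_timeout_ms=2000):
        self.terminal = terminal
        self.idle_timeout_ms = idle_timeout_ms
        self.last_byte_ms = None  # None until the first output arrives
        self.idle = False

    def _event(self, name, **extra):
        return {"event": name, "terminal": self.terminal, **extra}

    def feed(self, now_ms, data):
        """Process a chunk of PTY output and return the events it triggers."""
        events = []
        if not data:
            return events
        if self.idle:  # output resumed after an idle period
            events.append(self._event("activity"))
            self.idle = False
        self.last_byte_ms = now_ms
        for match in OSC_DONE.finditer(data):
            events.append(self._event("command_done", code=int(match.group(1))))
        # Remove the marker so its terminating BEL is not reported as a bell.
        if b"\x07" in OSC_DONE.sub(b"", data):
            events.append(self._event("bell"))
        return events

    def tick(self, now_ms):
        """Call periodically; emits idle once per quiet period."""
        if self.idle or self.last_byte_ms is None:
            return []
        if now_ms - self.last_byte_ms >= self.idle_timeout_ms:
            self.idle = True
            return [self._event("idle", after_ms=self.idle_timeout_ms)]
        return []
```

A production watcher would also need to handle sequences split across reads and other BEL-terminated OSC sequences such as title changes.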

Concurrency

The daemon is single-threaded async (tokio). Each terminal has:

  • A task reading from the PTY master fd and updating the VT100 state
  • A task running the idle timer

Requests are handled concurrently. The VT100 screen state is behind a Mutex — reads (screenshot, text) and writes (PTY output processing) are serialized. Contention is minimal since writes are fast (just feeding bytes to the parser).

Scope / Non-Goals

In scope:

  • PTY lifecycle management
  • VT100 terminal emulation (via vt100 crate)
  • Screenshot rendering with embedded font
  • Text extraction from viewport and scrollback
  • Event detection and streaming notification
  • NDJSON protocol over Unix socket
  • CLI client
  • Multi-terminal support

Out of scope (for v1):

  • Mouse input support
  • Sixel/image protocol rendering
  • Recording/replay of terminal sessions
  • Authentication/authorization on the socket
  • Windows support
  • Remote access (TCP/TLS) — use SSH tunneling if needed
  • Built-in MCP server — build this as a separate thin adapter

File Structure

src/
  main.rs          # CLI parsing, daemon entry point
  daemon.rs        # Unix socket server, JSON dispatch
  terminal.rs      # Terminal struct: PTY + VT100 + scrollback
  pool.rs          # Terminal pool management, ID generation
  renderer.rs      # VT100 screen → PNG rendering
  font.rs          # Embedded bitmap font data and glyph lookup
  events.rs        # Event detection, broadcast channel
  cli.rs           # CLI client (JSON over Unix socket)

Cargo.toml
README
spec.md