TokenSieve — Architecture

Problem

AI agents running in agentic CLI loops call system tools — aws, kubectl, docker, gh — and consume their output as context. Most CLI JSON responses are massively bloated. A single aws ec2 describe-instances call can return 50,000 tokens when the agent needs fewer than 2,000.

TokenSieve intercepts that output before it reaches the agent, compresses it, and proves how much it saved.


How It's Installed: PATH Shadowing

TokenSieve uses $PATH precedence to sit transparently between the agent and the real binary.

~/.tokensieve/bin/aws     →  symlink to tokensieve binary
~/.tokensieve/bin/kubectl →  symlink to tokensieve binary
...

$PATH = ~/.tokensieve/bin:/usr/local/bin:/usr/bin:/bin

When the agent runs aws ec2 describe-instances, the shell resolves aws to the symlink first. TokenSieve reads argv[0] to learn it's masquerading as aws, locates the real aws binary further down the $PATH, and delegates to it.

The agent never knows the interception happened. From its perspective it ran aws and got a response.

Agent shell
    │
    │  $ aws ec2 describe-instances
    ▼
~/.tokensieve/bin/aws      ← symlink → tokensieve binary
    │
    │  exec /usr/local/bin/aws ec2 describe-instances
    ▼
Real AWS CLI               ← runs normally, stdout piped back
    │
    ▼
TokenSieve pipeline        ← compression happens here
    │
    ├── stdout → compressed payload  ← agent reads this
    └── stderr → token savings receipt
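
A minimal sketch of the resolution and delegation step, in Rust. The find_real_binary helper and the HOME-based shim path are illustrative, not the actual main.rs, which also routes the captured stdout into the pipeline rather than printing it:

use std::env;
use std::path::{Path, PathBuf};
use std::process::Command;

// Illustrative sketch only: walk $PATH, skip the shim directory, and take the
// first matching binary further down the search order.
fn find_real_binary(tool: &str, shim_dir: &Path) -> Option<PathBuf> {
    env::split_paths(&env::var_os("PATH")?)
        .filter(|dir| dir.as_path() != shim_dir)   // skip ~/.tokensieve/bin itself
        .map(|dir| dir.join(tool))
        .find(|candidate| candidate.is_file())
}

fn main() {
    let args: Vec<String> = env::args().collect();
    // argv[0] tells the shim which tool it is masquerading as ("aws", "kubectl", ...).
    let tool = Path::new(&args[0])
        .file_name()
        .map(|n| n.to_string_lossy().into_owned())
        .unwrap_or_default();

    let shim_dir = PathBuf::from(env::var("HOME").unwrap_or_default()).join(".tokensieve/bin");
    let real = find_real_binary(&tool, &shim_dir).expect("real binary not found on PATH");

    // Delegate to the real CLI and capture stdout for the compression pipeline.
    let output = Command::new(real)
        .args(&args[1..])
        .output()
        .expect("failed to run the real tool");
    // The captured stdout would now flow into the 6-stage pipeline described below.
    print!("{}", String::from_utf8_lossy(&output.stdout));
}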

The 6-Stage Pipeline

Every byte of captured stdout passes through these stages in sequence.

Stage 1 — Scrubber (src/scrubber.rs)

Strips ANSI terminal escape sequences (color codes, cursor movement, formatting) using a compiled regex. Some CLIs embed ANSI codes inside their JSON output when they detect a pseudo-terminal; this breaks serde_json parsing. The scrubber runs unconditionally before any JSON parse attempt.

"\x1B[32m{\x1B[0m\"key\": 1}" → "{\"key\": 1}"

The regex is compiled once per process via once_cell::Lazy on first use; subsequent scrubs pay no initialization cost.
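
A sketch of that pattern, assuming a CSI-only expression (the actual pattern in src/scrubber.rs may cover additional escape types):

use once_cell::sync::Lazy;
use regex::Regex;

// Compiled on first use and reused for the rest of the process.
static ANSI_RE: Lazy<Regex> = Lazy::new(|| {
    // CSI sequences: ESC '[', parameters, intermediates, final byte.
    Regex::new(r"\x1B\[[0-9;?]*[ -/]*[@-~]").unwrap()
});

pub fn scrub(input: &str) -> String {
    ANSI_RE.replace_all(input, "").into_owned()
}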

Stage 2 — JSON Gate (src/main.rs)

Attempts to parse the scrubbed output as JSON. If parsing fails, the raw original output is forwarded to stdout unchanged and the pipeline stops. This is the zero-cost fallback for non-JSON tools.

Non-JSON tools are effectively transparent — TokenSieve adds nothing beyond its own sub-5 ms process startup and the subprocess execution itself.
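
A sketch of the gate logic, with compress standing in for stages 3 through 6 (illustrative names, not the real main.rs):

use serde_json::Value;

fn gate(scrubbed: &str, original: &str) -> String {
    match serde_json::from_str::<Value>(scrubbed) {
        // Valid JSON: continue down the pipeline.
        Ok(json) => compress(json),
        // Anything else: forward the original bytes untouched.
        Err(_) => original.to_string(),
    }
}

// Placeholder for sieve -> deduper -> router -> handoff.
fn compress(json: Value) -> String {
    json.to_string()
}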

Stage 3 — Sieve (src/sieve.rs)

Recursively prunes the JSON tree, removing or replacing values that carry no useful information for an LLM:

Value               Action                         Reason
null                Remove                         Explicitly absent — no information
"" (empty string)   Remove                         Semantically identical to null in API responses
[] (empty array)    Remove                         No elements — no information
{} (empty object)   Remove                         All children were pruned — collapsed subtree
Opaque base64 blob  Replace with <base64 N chars>  Unreadable to an LLM; largest single token sink in cloud API responses

Pruning is bottom-up: children are pruned before parents, so entire subtrees collapse when all their fields were empty — a common AWS pattern (e.g. "Monitoring": {"State": null}).

Base64 blob detection uses a single content-only gate — no field name inspection required:

  1. String length ≥ 200 characters
  2. Every character is in the base64 alphabet (A–Z a–z 0–9 + / = - _ \n \r)
  3. ≥ 92% of characters are alphanumeric

All three conditions must hold. This is specific enough to avoid false positives on JWTs, API keys, long UUIDs, or human-readable descriptions while reliably catching all certificate / kubeconfig / TLS blobs regardless of what key they appear under. An LLM cannot use raw base64 regardless of context, so the key name adds no signal.

Before:
  {"certificateAuthority": {"data": "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0t..."}}
  → ~800 tokens for the blob

After:
  {"certificateAuthority": {"data": "<base64 1476 chars>"}}
  → 4 tokens
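
A sketch of the gate as a predicate, with the thresholds taken from the list above (the real src/sieve.rs may structure this differently):

fn looks_like_base64_blob(s: &str) -> bool {
    // 1. Long enough to be a blob rather than an ID or token.
    if s.len() < 200 {
        return false;
    }
    // 2. Every character is in the (URL-safe) base64 alphabet, padding, or newlines.
    let in_alphabet = |c: char| {
        c.is_ascii_alphanumeric() || matches!(c, '+' | '/' | '=' | '-' | '_' | '\n' | '\r')
    };
    if !s.chars().all(in_alphabet) {
        return false;
    }
    // 3. At least 92% of the characters are alphanumeric.
    let alnum = s.chars().filter(|c| c.is_ascii_alphanumeric()).count();
    alnum as f64 / s.len() as f64 >= 0.92
}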

Stage 4 — Deduper (src/deduper.rs)

Two passes to eliminate structural and temporal redundancy.

Pass 1 — Epoch timestamp stripping

Any integer value > 10¹² is treated as a Unix millisecond timestamp (10¹² ms corresponds to roughly the year 2001, so anything larger is a plausible epoch). These are unreadable to an LLM and consume tokens, so they are stripped unconditionally.

"last_restarted_time": 1772633534144   → stripped
"num_workers":         4               → kept

Pass 2 — First-seen-wins deduplication

Cloud APIs routinely embed the same scalar value in multiple places — a resource ID at the top level and again inside tags, metadata, or default_tags. The deduper traverses depth-first and drops any field whose scalar value has already appeared elsewhere in the same document.

{"cluster_id": "abc123",          ← kept (first occurrence)
 "default_tags": {
   "ClusterId": "abc123",         ← stripped (duplicate)
   "Region": "us-east-1"          ← kept (first occurrence)
 }}

Scoping rules — critical for correctness across resource types:

  • Root array: each element gets an independent seen-set. A list of clusters that all share a region should each display their region.
  • Nested array (e.g. NetworkInterfaces, SecurityGroups): each element gets a snapshot copy of the parent's seen-set. Sibling NIs on the same subnet must both show their SubnetId, but a VpcId already seen at the instance level is still filtered.
  • Root object: a single shared seen-set. One document, one pass.

Two-pass object traversal ensures correctness regardless of key order. Within each object, scalar fields and nested objects are processed first (building the seen-set), then arrays are processed using a snapshot of that complete seen-set. This prevents a NetworkInterfaces array (which sorts under N) from being snapshotted before vpc_id (which sorts under v) has been registered.
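
A sketch of the first-seen-wins pass with that scoping, assuming values that already survived the sieve (epoch stripping and root-array handling omitted; the real src/deduper.rs differs in detail):

use serde_json::Value;
use std::collections::HashSet;

fn dedupe(value: &mut Value, seen: &mut HashSet<String>) {
    let Value::Object(map) = value else { return };

    // Pass 1a: scalar fields. Drop any whose value has already been seen.
    let duplicate_keys: Vec<String> = map
        .iter()
        .filter(|(_, v)| !v.is_object() && !v.is_array() && !v.is_null())
        .filter(|(_, v)| !seen.insert(v.to_string()))
        .map(|(k, _)| (*k).clone())
        .collect();
    for k in &duplicate_keys {
        map.remove(k);
    }

    // Pass 1b: recurse into nested objects with the same shared seen-set.
    for (_, v) in map.iter_mut() {
        if v.is_object() {
            dedupe(v, seen);
        }
    }

    // Pass 2: arrays last, each element with a snapshot of the now-complete
    // seen-set. Siblings may repeat values among themselves, but anything
    // already seen at the parent level is still filtered.
    for (_, v) in map.iter_mut() {
        if let Value::Array(items) = v {
            for item in items.iter_mut() {
                let mut snapshot = seen.clone();
                dedupe(item, &mut snapshot);
            }
        }
    }
}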

Stage 5 — Router (src/router.rs + src/pvfn.rs)

Examines the pruned+deduped value's structure and selects the format that produces the fewest tokens for that shape.

Is root a non-empty Array where every element is an Object?
│
├── YES → Compute fill ratio (non-'-' cells / total cells across union schema)
│         │
│         ├── fill ≥ 55% → Schema-YAML  ("SchemaYAML")
│         │                Keys emitted once in a schema: block.
│         │                Values as compact flow sequences under data:.
│         │
│         └── fill < 55% → PVFN  (sparsity guard — too many '-' placeholders)
│
├── NO, but root is a single-key object wrapping an array?
│         │
│         └── Unwrap, test inner array → same fill-ratio branch above
│
└── NO  → PVFN  ("PVFN")

Schema-YAML (when fill ≥ 55%):

Keys printed once, values as indexed rows — an LLM reconstructs row[i].field = data[i][schema.index(field)].

schema:
- cluster_id
- aws_attributes.availability
- spark_version
data:
- [abc123, ON_DEMAND, 16.2.x-scala2.12]
- [def456, SPOT,      18.0.x-scala2.13]

Nested objects are flattened to dot-notation paths (aws_attributes.availability). Arrays of scalars are joined as comma-separated strings.

Sparsity guard: when responses with different schemas are merged in fetch mode, the union schema can have hundreds of columns and most cells are -. Below 55% fill, Schema-YAML generates more tokens from filler than it saves on key repetition — the router falls through to PVFN.
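
A sketch of the fill-ratio check behind the 55% threshold, counting top-level keys only (the real src/router.rs flattens nested objects to dot-notation paths before counting):

use serde_json::Value;
use std::collections::BTreeSet;

fn fill_ratio(rows: &[Value]) -> f64 {
    // Union schema: every key that appears in any row.
    let schema: BTreeSet<String> = rows
        .iter()
        .filter_map(Value::as_object)
        .flat_map(|obj| obj.keys().cloned())
        .collect();

    let total_cells = schema.len() * rows.len();
    if total_cells == 0 {
        return 0.0;
    }
    // A cell counts as filled when the row actually carries that key;
    // a missing key would become a '-' placeholder in Schema-YAML.
    let filled: usize = rows
        .iter()
        .filter_map(Value::as_object)
        .map(|obj| schema.iter().filter(|k| obj.contains_key(k.as_str())).count())
        .sum();

    filled as f64 / total_cells as f64
}

// Route to Schema-YAML only for a dense, homogeneous array of objects.
fn use_schema_yaml(rows: &[Value]) -> bool {
    !rows.is_empty() && rows.iter().all(Value::is_object) && fill_ratio(rows) >= 0.55
}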

PVFN — Path-Value Flattened Notation (src/pvfn.rs):

The catch-all fallback for deeply nested objects, heterogeneous arrays, and sparse structures.

Theoretical basis — the Dremel insight (Google, 2010)

Google's Dremel paper (Melnik et al., VLDB 2010) proved that any arbitrarily nested, repeated record structure can be losslessly encoded as a flat sequence of (path, value) pairs with two small integers per value — a repetition level (which repeated field in the path started a new list) and a definition level (how deep into the schema the value is defined). This encoding became the foundation for Apache Parquet and every columnar data warehouse built since.

The core insight: structural nesting tokens exist to help a parser reconstruct the tree. They carry no information for a reader that already understands the schema. A reader that can parse instance.NetworkInterfaces.0.SubnetId needs no surrounding braces, brackets, or commas to locate the value — the path is self-describing.

PVFN applies this same insight to LLM context windows instead of disk storage:

Dremel / Parquet                                       PVFN
Columnar storage for query engines                     Flat path=value lines for LLM context
Repetition + definition levels encode nesting depth    Numeric indices and dot-notation encode nesting depth
Strips structural overhead for I/O efficiency          Strips structural tokens for context-window efficiency
Lossless reconstruction from flat encoding             LLM reads a.b.c=v and infers {"a":{"b":{"c":"v"}}}

Where PVFN diverges: Dremel targets column-wise aggregation (scan all values of one field across millions of rows). LLMs read sequentially across all fields of one record. This is why PVFN keeps path as a prefix on every line rather than grouping by column — it preserves the record-local reading order an LLM expects.

The @map header (key abbreviations) and hybrid Schema-YAML blocks for dense sub-arrays are PVFN's extensions beyond the base Dremel encoding, targeting the additional overhead of long repeated key names that Parquet handles via separate column metadata.

Three components:

  1. @map header — assigns camelCase-initialism abbreviations to any key appearing ≥ 2 times and ≥ 7 characters long. Collision resolution adds a digit suffix (SG → SG2); a sketch of this rule follows the list.

  2. Path=value lines — one line per leaf value using dot-notation paths. Null/empty values produce no line. Arrays become numeric indices.

  3. Hybrid inline Schema-YAML — when a nested array is a dense homogeneous list of objects (all elements are objects, fill ≥ 55%, ≥ 2 elements), PVFN inlines a compact Schema-YAML block at that path rather than emitting one path.N.key=value line per cell.
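
A sketch of the abbreviation rule from component 1, with a hypothetical abbreviate helper; the caller would apply the frequency (≥ 2) and length (≥ 7) thresholds first, and the real pvfn.rs may generate initialisms differently:

use std::collections::HashSet;

fn abbreviate(key: &str, taken: &mut HashSet<String>) -> String {
    // camelCase initialism: "NetworkInterfaces" -> "NI", "SecurityGroups" -> "SG".
    let mut abbr: String = key
        .chars()
        .enumerate()
        .filter(|(i, c)| *i == 0 || c.is_ascii_uppercase())
        .map(|(_, c)| c.to_ascii_uppercase())
        .collect();
    // Collision resolution: if another key already claimed "SG",
    // this one becomes "SG2", then "SG3", and so on.
    if taken.contains(&abbr) {
        let mut n = 2;
        while taken.contains(&format!("{abbr}{n}")) {
            n += 1;
        }
        abbr = format!("{abbr}{n}");
    }
    taken.insert(abbr.clone());
    abbr
}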

Example output:

@map
NI=NetworkInterfaces
SG=SecurityGroups

instance.InstanceId=i-abc123
instance.InstanceType=m5.xlarge
instance.NI.0.SubnetId=subnet-xyz
instance.SG:
  schema:[GroupId, GroupName]
  data:
  - [sg-abc, web-sg]
  - [sg-def, db-sg]
instance.status=running

Stage 6 — Handoff (src/handoff.rs)

Writes to two streams:

  • stdout — the compressed payload. This is what the agent's tool-call result contains.
  • stderr — a single-line token savings receipt. Ignored by agents; visible to humans and log aggregators.

Receipt format:

[TokenSieve] Original: 4821 tok | Compressed: 612 tok | Saved: 4209 (87.3%) | Shape: PVFN

stdout is flushed before the stderr write to prevent interleaving.
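
A sketch of the split write, assuming the receipt string is already formatted:

use std::io::Write;

fn handoff(payload: &str, receipt: &str) -> std::io::Result<()> {
    let mut out = std::io::stdout().lock();
    out.write_all(payload.as_bytes())?;
    out.write_all(b"\n")?;
    // Flush stdout before touching stderr so the two streams never interleave.
    out.flush()?;
    eprintln!("{receipt}");
    Ok(())
}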


Token Math

The auditor (src/auditor.rs) uses tiktoken-rs with the cl100k_base BPE vocabulary — the same tokenizer used by GPT-4 and close to Claude's. It runs fully offline.
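
A sketch of the receipt math using tiktoken-rs; the receipt wording mirrors the format shown in Stage 6, and the function name is illustrative:

use tiktoken_rs::cl100k_base;

fn receipt(original: &str, compressed: &str, shape: &str) -> String {
    // cl100k_base ships with the crate, so counting runs fully offline.
    let bpe = cl100k_base().expect("embedded BPE vocabulary should load");
    let before = bpe.encode_ordinary(original).len();
    let after = bpe.encode_ordinary(compressed).len();
    let saved = before.saturating_sub(after);
    let pct = 100.0 * saved as f64 / before.max(1) as f64;
    format!("[TokenSieve] Original: {before} tok | Compressed: {after} tok | Saved: {saved} ({pct:.1}%) | Shape: {shape}")
}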

Results from the stress test suite run against a live AWS account:

Scenario                               Typical savings   Primary driver
EKS describe-cluster                   ~66%              Base64 cert blob redaction (~800 tok removed)
EC2 describe-security-groups (5)       ~55%              Abbreviation map + deduper
EC2 describe-subnets (8)               ~52%              Abbreviation map + hybrid block
EC2 describe-vpcs                      ~49%              SchemaYAML key deduplication
EC2 describe-instances (6)             ~42%              PVFN structural removal + abbreviations
CloudWatch describe-log-groups (10)    ~30%              Hybrid block + abbreviation map
IAM list-roles (5)                     ~27%              PVFN structural removal
KMS list-keys (10)                     ~15%              SchemaYAML (2-column dense schema)
Simple lists (addons, policies)        5–32%             Format overhead amortization

Overall average across 17 stress tests: 46.9% token savings (40,483 tokens in → 21,487 tokens out)

The receipt makes savings verifiable — not estimated. The exact token delta is computed on the actual input/output pair for every call.


File Map

tokensieve/
├── Cargo.toml          — dependencies: serde_json, serde_yaml, regex, tiktoken-rs, once_cell, tokio
├── docs/
│   ├── ARCHITECTURE.md — this document
│   └── stress-tests.md — measured compression results across 17 real AWS API calls
└── src/
    ├── main.rs         — entry point: PATH proxy mode + fetch mode, pipeline orchestration
    ├── scrubber.rs     — ANSI escape sequence removal
    ├── sieve.rs        — recursive JSON pruner + content-only base64 blob redactor
    ├── deduper.rs      — epoch timestamp stripping + first-seen-wins deduplication
    ├── router.rs       — shape detection, Schema-YAML serialization, sparsity guard
    ├── pvfn.rs         — PVFN formatter: @map header, path=value lines, hybrid blocks
    ├── auditor.rs      — BPE token counting (cl100k_base), receipt formatting
    └── handoff.rs      — stdout/stderr split emitter

Installation

# 1. Build
cargo build --release

# 2. Create the sieve bin directory
mkdir -p ~/.tokensieve/bin

# 3. Symlink any CLI tools you want intercepted
ln -sf $(pwd)/target/release/tokensieve ~/.tokensieve/bin/aws
ln -sf $(pwd)/target/release/tokensieve ~/.tokensieve/bin/databricks
ln -sf $(pwd)/target/release/tokensieve ~/.tokensieve/bin/kubectl
ln -sf $(pwd)/target/release/tokensieve ~/.tokensieve/bin/docker
ln -sf $(pwd)/target/release/tokensieve ~/.tokensieve/bin/gh

# 4. Permanently prepend to PATH
#    bash users:
echo 'export PATH="$HOME/.tokensieve/bin:$PATH"' >> ~/.bash_profile
#    zsh users:
echo 'export PATH="$HOME/.tokensieve/bin:$PATH"' >> ~/.zshrc

# 5. Apply to the current terminal immediately
source ~/.bash_profile   # or: source ~/.zshrc

After step 4 every new terminal session will automatically resolve aws, kubectl, etc. to the tokensieve symlinks — no manual export needed.

To verify the intercept is active:

which aws
# → /Users/<you>/.tokensieve/bin/aws

To add a new tool, add another symlink — no code changes required. To remove a tool, delete its symlink.

⚠ Do not symlink general-purpose shell utilities such as cat, jq, or curl. TokenSieve intercepts every invocation of a symlinked command — including ones you run in your own terminal for unrelated purposes. Symlinking cat for pipeline testing, for example, means that any script or tool that pipes output through cat will have its JSON silently rewritten, and any cat > file that writes non-JSON content will behave unexpectedly. Only symlink purpose-built CLI tools whose JSON output you explicitly want compressed (aws, databricks, kubectl, gh used for infra queries, etc.). If you need to test the pipeline manually, pipe directly into the binary:

echo '{"key": "value"}' | /path/to/target/release/tokensieve

Testing the Pipeline

# Pipe any JSON through the tokensieve binary directly
echo '{"UserId":"AIDA...","Account":"123456789","Arn":"arn:aws:iam::123456789:user/alice",
  "ResponseMetadata":{"RequestId":"abc","HTTPStatusCode":200,"HTTPHeaders":{},"RetryAttempts":0}}' \
  | /path/to/tokensieve

# Expected stdout (PVFN, HTTPHeaders and empty fields pruned):
# Account=123456789
# Arn=arn:aws:iam::123456789:user/alice
# ResponseMetadata.HTTPStatusCode=200
# ResponseMetadata.RequestId=abc
# UserId=AIDA...

# Expected stderr:
# [TokenSieve] Original: 38 tok | Compressed: 22 tok | Saved: 16 (42.1%) | Shape: PVFN

Design Decisions

Why content-only base64 detection (no key hints)? The earlier implementation required either a matching key hint (certificate, kubeconfig, private_key, etc.) OR the string to be ≥ 500 characters before scrubbing a generic key like data. This missed the common EKS pattern where certificateAuthority.data has a generic terminal key "data" at 232 characters. The three-part content gate (length + character set + alphanumeric ratio) is specific enough on its own — an LLM cannot use raw base64 regardless of the key name, so the key name adds no useful signal.

Why PVFN instead of YAML as the fallback? YAML still repeats key names for every object in a nested structure. PVFN's design is grounded in the Dremel paper's central finding (Google, VLDB 2010): structural nesting tokens exist to help a parser reconstruct the tree — a reader that understands the schema doesn't need them. An LLM reading a.b.c=value extracts the same information as reading {"a": {"b": {"c": "value"}}} with zero braces, brackets, or repeated quotes. On top of that, PVFN's @map header amortizes long repeated key names across the whole document (not just one nesting level), and hybrid inline Schema-YAML blocks handle dense sub-arrays. For real AWS responses, PVFN consistently produces 5–15% fewer tokens than equivalent YAML for the same nested structures.

Why nested array scoping in the deduper? A flat global seen-set produced incorrect results for resources with repeated parallel sub-structures (e.g. NetworkInterfaces[0] and NetworkInterfaces[1] on the same subnet). With global scoping, NI[1].SubnetId would be stripped as a duplicate of NI[0].SubnetId, hiding the fact that the second interface sits on that subnet at all. Nested array scoping — each array element gets a snapshot of the parent's seen-set rather than a shared mutable reference — fixes this for all resource types without requiring per-resource configuration.

Why a two-pass object traversal in the deduper? When building the seen-set snapshot for a nested array, all parent scalar fields must already be registered — regardless of their alphabetical position relative to the array key. A single alphabetical pass would snapshot too early if the array key sorts before some parent scalars (e.g. NetworkInterfaces (N) before vpc_id (v)). Two passes (scalars+objects first, arrays second) guarantees the snapshot is always complete.

Why a sparsity guard in the router? Schema-YAML pays for itself only when the union schema is dense — each column appears in most rows. When two CLI responses with different shapes are merged by fetch mode, the union can have hundreds of columns and most cells are - placeholders. Below 55% fill, the - tokens cost more than the saved key repetitions and the router falls through to PVFN.

Why stderr for the receipt? Agents read tool stdout as structured data. Mixing observability output into stdout would corrupt the payload. Stderr is the Unix convention for diagnostic output and is captured separately by most logging systems.

Why a static Rust binary? A statically linked binary installs by copying a single file — no runtime dependencies, no version conflicts with the tools being proxied. The process startup overhead is under 5 ms, negligible against the network I/O of any real CLI call.