AI agents running in agentic CLI loops call system tools — aws, kubectl,
docker, gh — and consume their output as context. Most CLI JSON responses
are massively bloated. A single aws ec2 describe-instances call can return
50,000 tokens when the agent needs fewer than 2,000.
TokenSieve intercepts that output before it reaches the agent, compresses it, and proves how much it saved.
TokenSieve uses $PATH precedence to sit transparently between the agent and
the real binary.
~/.tokensieve/bin/aws → symlink to tokensieve binary
~/.tokensieve/bin/kubectl → symlink to tokensieve binary
...
$PATH = ~/.tokensieve/bin:/usr/local/bin:/usr/bin:/bin
When the agent runs aws ec2 describe-instances, the shell resolves aws to
the symlink first. TokenSieve reads argv[0] to learn it's masquerading as
aws, locates the real aws binary further down the $PATH, and delegates
to it.
The agent never knows the interception happened. From its perspective it ran
aws and got a response.
Agent shell
│
│ $ aws ec2 describe-instances
▼
~/.tokensieve/bin/aws ← symlink → tokensieve binary
│
│ exec /usr/local/bin/aws ec2 describe-instances
▼
Real AWS CLI ← runs normally, stdout piped back
│
▼
TokenSieve pipeline ← compression happens here
│
├── stdout → compressed payload ← agent reads this
└── stderr → token savings receipt
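The delegation step above can be sketched in a few lines of std-only Rust. The helper name `find_real_binary` and the hard-coded sieve directory are illustrative, not the actual `src/main.rs` API:

```rust
use std::path::{Path, PathBuf};

// Illustrative sketch: find the first executable named `tool` on `path_var`,
// skipping the directory that holds the TokenSieve symlinks.
fn find_real_binary(tool: &str, path_var: &str, sieve_dir: &str) -> Option<PathBuf> {
    path_var
        .split(':')
        .filter(|dir| Path::new(dir) != Path::new(sieve_dir))
        .map(|dir| Path::new(dir).join(tool))
        .find(|candidate| candidate.is_file())
}

fn main() {
    // argv[0] tells the proxy which tool it is masquerading as.
    let argv0 = std::env::args().next().unwrap_or_default();
    let name = Path::new(&argv0)
        .file_name()
        .map(|n| n.to_string_lossy().into_owned())
        .unwrap_or_default();
    let path_var = std::env::var("PATH").unwrap_or_default();
    // The real proxy would exec the resolved binary and capture its stdout.
    match find_real_binary(&name, &path_var, "/home/user/.tokensieve/bin") {
        Some(real) => println!("would delegate to {}", real.display()),
        None => println!("no real {} further down PATH", name),
    }
}
```

Because the sieve directory is excluded from the scan, the proxy can never resolve back to its own symlink and recurse.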
Every byte of captured stdout passes through these stages in sequence.
Strips ANSI terminal escape sequences (color codes, cursor movement, formatting)
using a compiled regex. Some CLIs embed ANSI codes inside their JSON output when
they detect a pseudo-terminal; this breaks serde_json parsing. The scrubber
runs unconditionally before any JSON parse attempt.
"\x1B[32m{\x1B[0m\"key\": 1}" → "{\"key\": 1}"
The regex is compiled once on first use via once_cell::Lazy — subsequent
invocations pay no recompilation cost.
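The scrub's behavior can be sketched without the regex crate as a small state machine that drops CSI sequences (ESC, `[`, parameter bytes, final byte). This is a behavioral sketch only; the real scrubber uses a compiled regex:

```rust
// Strip ANSI CSI escape sequences: ESC '[' <params> <final byte in 0x40-0x7E>.
// Behavioral sketch; the real scrubber uses a compiled regex.
fn scrub_ansi(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\u{1B}' && chars.peek() == Some(&'[') {
            chars.next(); // consume '['
            // Skip parameter/intermediate bytes until the final byte.
            while let Some(&n) = chars.peek() {
                chars.next();
                if ('\u{40}'..='\u{7E}').contains(&n) {
                    break;
                }
            }
        } else {
            out.push(c);
        }
    }
    out
}

fn main() {
    let dirty = "\x1B[32m{\x1B[0m\"key\": 1}";
    assert_eq!(scrub_ansi(dirty), "{\"key\": 1}");
}
```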
Attempts to parse the scrubbed output as JSON. If parsing fails, the raw original output is forwarded to stdout unchanged and the pipeline stops. This is the zero-cost fallback for non-JSON tools.
Non-JSON tools are completely transparent — TokenSieve adds no latency beyond the subprocess execution itself.
Recursively prunes the JSON tree, removing or replacing values that carry no useful information for an LLM:
| Value | Action | Reason |
|---|---|---|
| null | Remove | Explicitly absent — no information |
| "" (empty string) | Remove | Semantically identical to null in API responses |
| [] (empty array) | Remove | No elements — no information |
| {} (empty object) | Remove | All children were pruned — collapsed subtree |
| Opaque base64 blob | Replace with <base64 N chars> | Unreadable to an LLM; largest single token sink in cloud API responses |
Pruning is bottom-up: children are pruned before parents, so entire
subtrees collapse when all their fields were empty — a common AWS pattern
(e.g. "Monitoring": {"State": null}).
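The bottom-up collapse can be illustrated on a hand-rolled value type. The real pruner operates on serde_json::Value; this sketch only shows the recursion order:

```rust
// Minimal JSON-like value type for illustration only;
// the real code prunes serde_json::Value.
#[derive(Debug, PartialEq)]
enum Val {
    Null,
    Str(String),
    Arr(Vec<Val>),
    Obj(Vec<(String, Val)>),
}

// Bottom-up prune: children are pruned first, then the parent
// survives only if at least one child survived.
fn prune(v: Val) -> Option<Val> {
    match v {
        Val::Null => None,
        Val::Str(s) if s.is_empty() => None,
        Val::Str(s) => Some(Val::Str(s)),
        Val::Arr(items) => {
            let kept: Vec<Val> = items.into_iter().filter_map(prune).collect();
            if kept.is_empty() { None } else { Some(Val::Arr(kept)) }
        }
        Val::Obj(fields) => {
            let kept: Vec<(String, Val)> = fields
                .into_iter()
                .filter_map(|(k, v)| prune(v).map(|v| (k, v)))
                .collect();
            if kept.is_empty() { None } else { Some(Val::Obj(kept)) }
        }
    }
}

fn main() {
    // "Monitoring": {"State": null} collapses entirely; the sibling survives.
    let doc = Val::Obj(vec![
        ("Monitoring".into(), Val::Obj(vec![("State".into(), Val::Null)])),
        ("InstanceId".into(), Val::Str("i-abc123".into())),
    ]);
    let pruned = prune(doc).unwrap();
    assert_eq!(
        pruned,
        Val::Obj(vec![("InstanceId".into(), Val::Str("i-abc123".into()))])
    );
}
```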
Base64 blob detection uses a single content-only gate — no field name inspection required:
- String length ≥ 200 characters
- Every character is in the base64 alphabet (A–Z a–z 0–9 + / = - _ \n \r)
- ≥ 92% of characters are alphanumeric
All three conditions must hold. This is specific enough to avoid false-positives on JWTs, API keys, long UUIDs, or human-readable descriptions while reliably catching all certificate / kubeconfig / TLS blobs regardless of what key they appear under. An LLM cannot use raw base64 regardless of context, so the key name adds no signal.
Before:
{"certificateAuthority": {"data": "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0t..."}}
→ ~800 tokens for the blob
After:
{"certificateAuthority": {"data": "<base64 1476 chars>"}}
→ 4 tokens
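The three-condition gate can be sketched directly from the thresholds above (a sketch under the stated thresholds; src/sieve.rs is authoritative):

```rust
// Content-only base64 blob gate: length, alphabet, and alphanumeric ratio.
// All three conditions must hold.
fn looks_like_base64_blob(s: &str) -> bool {
    if s.len() < 200 {
        return false;
    }
    let in_alphabet = |c: char| {
        c.is_ascii_alphanumeric() || matches!(c, '+' | '/' | '=' | '-' | '_' | '\n' | '\r')
    };
    if !s.chars().all(in_alphabet) {
        return false;
    }
    let alnum = s.chars().filter(|c| c.is_ascii_alphanumeric()).count();
    alnum as f64 / s.len() as f64 >= 0.92
}

fn main() {
    // A long certificate-style blob passes all three gates.
    let blob = "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0t".repeat(10);
    assert!(looks_like_base64_blob(&blob));
    // A JWT contains '.' separators, which fail the alphabet check.
    assert!(!looks_like_base64_blob(&"eyJhbGciOiJIUzI1NiJ9.payload.sig".repeat(10)));
    // Short strings (IDs, UUIDs) never reach the length threshold.
    assert!(!looks_like_base64_blob("subnet-xyz"));
}
```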
Two passes to eliminate structural and temporal redundancy.
Pass 1 — Epoch timestamp stripping
Any integer value greater than 10¹² is treated as a Unix millisecond timestamp (dates from ~2001 onward). These are unreadable to an LLM and consume tokens. Stripped unconditionally.
"last_restarted_time": 1772633534144 → stripped
"num_workers": 4 → kept
Pass 2 — First-seen-wins deduplication
Cloud APIs routinely embed the same scalar value in multiple places — a resource
ID at the top level and again inside tags, metadata, or default_tags. The
deduper traverses depth-first and drops any field whose scalar value has already
appeared elsewhere in the same document.
{"cluster_id": "abc123", ← kept (first occurrence)
"default_tags": {
"ClusterId": "abc123", ← stripped (duplicate)
"Region": "us-east-1" ← kept (first occurrence)
}}
Scoping rules — critical for correctness across resource types:
| Context | Scope | Why |
|---|---|---|
| Root array | Each element gets an independent seen-set | A list of clusters that all share a region should each display their region |
| Nested array (e.g. NetworkInterfaces, SecurityGroups) | Each element gets a snapshot copy of the parent's seen-set | Sibling NIs on the same subnet must both show their SubnetId; but a VpcId already seen at the instance level is still filtered |
| Root object | Single shared seen-set | One document, one pass |
Two-pass object traversal ensures correctness regardless of key order.
Within each object, scalar fields and nested objects are processed first (building
the seen-set), then arrays are processed using a snapshot of that complete
seen-set. This prevents a NetworkInterfaces array (N) from being snapshotted
before vpc_id (v) has been registered.
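The two-pass traversal and the snapshot scoping can be sketched together on a minimal value type (a simplified model; the real deduper in src/deduper.rs walks serde_json::Value and handles more scalar types):

```rust
use std::collections::HashSet;

// Minimal value type for illustration only.
#[derive(Debug, PartialEq)]
enum Val {
    Str(String),
    Arr(Vec<Val>),
    Obj(Vec<(String, Val)>),
}

// First-seen-wins dedup. Pass 1 registers scalars and nested objects;
// pass 2 processes arrays, each element against a snapshot of the seen-set.
fn dedup(v: Val, seen: &mut HashSet<String>) -> Option<Val> {
    match v {
        Val::Str(s) => {
            if seen.insert(s.clone()) { Some(Val::Str(s)) } else { None }
        }
        Val::Obj(fields) => {
            let (arrays, others): (Vec<_>, Vec<_>) =
                fields.into_iter().partition(|(_, v)| matches!(v, Val::Arr(_)));
            // Pass 1: scalars and nested objects build the seen-set.
            let mut kept: Vec<(String, Val)> = others
                .into_iter()
                .filter_map(|(k, v)| dedup(v, seen).map(|v| (k, v)))
                .collect();
            // Pass 2: arrays see the now-complete seen-set.
            for (k, v) in arrays {
                if let Some(v) = dedup(v, seen) {
                    kept.push((k, v));
                }
            }
            if kept.is_empty() { None } else { Some(Val::Obj(kept)) }
        }
        Val::Arr(items) => {
            // Each element dedups against a snapshot copy, so sibling
            // elements do not shadow each other's values.
            let kept: Vec<Val> = items
                .into_iter()
                .filter_map(|item| dedup(item, &mut seen.clone()))
                .collect();
            if kept.is_empty() { None } else { Some(Val::Arr(kept)) }
        }
    }
}

fn main() {
    let doc = Val::Obj(vec![
        ("vpc_id".into(), Val::Str("vpc-1".into())),
        ("NetworkInterfaces".into(), Val::Arr(vec![
            Val::Obj(vec![
                ("SubnetId".into(), Val::Str("subnet-a".into())),
                ("VpcId".into(), Val::Str("vpc-1".into())), // duplicate of parent
            ]),
            Val::Obj(vec![
                ("SubnetId".into(), Val::Str("subnet-a".into())), // sibling, kept
            ]),
        ])),
    ]);
    let out = dedup(doc, &mut HashSet::new()).unwrap();
    // Both sibling NIs keep their SubnetId; the duplicated VpcId is stripped.
    if let Val::Obj(fields) = &out {
        let (_, Val::Arr(nis)) = fields
            .iter()
            .find(|(k, _)| k.as_str() == "NetworkInterfaces")
            .unwrap()
        else { panic!("expected array") };
        assert_eq!(nis.len(), 2);
        assert_eq!(nis[0], Val::Obj(vec![("SubnetId".into(), Val::Str("subnet-a".into()))]));
        assert_eq!(nis[1], Val::Obj(vec![("SubnetId".into(), Val::Str("subnet-a".into()))]));
    }
}
```

Note that the partition also reproduces the ordering guarantee: `vpc_id` is registered before the `NetworkInterfaces` snapshot is taken, regardless of key order.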
Examines the pruned+deduped value's structure and selects the format that produces the fewest tokens for that shape.
Is root a non-empty Array where every element is an Object?
│
├── YES → Compute fill ratio (non-'-' cells / total cells across union schema)
│ │
│ ├── fill ≥ 55% → Schema-YAML ("SchemaYAML")
│ │ Keys emitted once in a schema: block.
│ │ Values as compact flow sequences under data:.
│ │
│ └── fill < 55% → PVFN (sparsity guard — too many '-' placeholders)
│
├── NO, but root is a single-key object wrapping an array?
│ │
│ └── Unwrap, test inner array → same fill-ratio branch above
│
└── NO → PVFN ("PVFN")
Schema-YAML (when fill ≥ 55%):
Keys printed once, values as indexed rows — an LLM reconstructs
row[i].field = data[i][schema.index(field)].
schema:
- cluster_id
- aws_attributes.availability
- spark_version
data:
- [abc123, ON_DEMAND, 16.2.x-scala2.12]
  - [def456, SPOT, 18.0.x-scala2.13]

Nested objects are flattened to dot-notation paths
(aws_attributes.availability). Arrays of scalars are joined as
comma-separated strings.
Sparsity guard: when responses with different schemas are merged in fetch
mode, the union schema can have hundreds of columns and most cells are -.
Below 55% fill, Schema-YAML generates more tokens from filler than it saves on
key repetition — the router falls through to PVFN.
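The fill-ratio computation over the union schema can be sketched as follows (row shapes are illustrative; the real router in src/router.rs works on serde_json objects):

```rust
use std::collections::{BTreeMap, BTreeSet};

// Fill ratio = non-missing cells / (rows x union-schema columns).
// Each row is a map of column name -> cell value; absent keys are '-' cells.
fn fill_ratio(rows: &[BTreeMap<String, String>]) -> f64 {
    let schema: BTreeSet<&String> = rows.iter().flat_map(|r| r.keys()).collect();
    let total = rows.len() * schema.len();
    if total == 0 {
        return 0.0;
    }
    let filled: usize = rows.iter().map(|r| r.len()).sum();
    filled as f64 / total as f64
}

fn main() {
    let mut a = BTreeMap::new();
    a.insert("cluster_id".to_string(), "abc123".to_string());
    a.insert("spark_version".to_string(), "16.2.x-scala2.12".to_string());
    let mut b = BTreeMap::new();
    b.insert("cluster_id".to_string(), "def456".to_string());
    // Union schema: 2 columns, 4 cells, 3 filled -> 0.75.
    // Dense enough (>= 0.55) for Schema-YAML.
    let ratio = fill_ratio(&[a, b]);
    assert!((ratio - 0.75).abs() < 1e-9);
    assert!(ratio >= 0.55);
}
```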
PVFN — Path-Value Flattened Notation (src/pvfn.rs):
The catch-all fallback for deeply nested objects, heterogeneous arrays, and sparse structures.
Theoretical basis — the Dremel insight (Google, 2010)
Google's Dremel paper
(Melnik et al., VLDB 2010) proved that any arbitrarily nested, repeated record
structure can be losslessly encoded as a flat sequence of (path, value) pairs
with two small integers per value — a repetition level (which repeated field
in the path started a new list) and a definition level (how deep into the
schema the value is defined). This encoding became the foundation for Apache
Parquet and every columnar data warehouse built since.
The core insight: structural nesting tokens exist to help a parser reconstruct
the tree. They carry no information for a reader that already understands the
schema. A reader that can parse instance.NetworkInterfaces.0.SubnetId needs
no surrounding braces, brackets, or commas to locate the value — the path is
self-describing.
PVFN applies this same insight to LLM context windows instead of disk storage:
| Dremel / Parquet | PVFN |
|---|---|
| Columnar storage for query engines | Flat path=value lines for LLM context |
| Repetition + definition levels encode nesting depth | Numeric indices and dot-notation encode nesting depth |
| Strips structural overhead for I/O efficiency | Strips structural tokens for context-window efficiency |
| Lossless reconstruction from flat encoding | LLM reads a.b.c=v and infers {"a":{"b":{"c":"v"}}} |
Where PVFN diverges: Dremel targets column-wise aggregation (scan all values of one field across millions of rows). LLMs read sequentially across all fields of one record. This is why PVFN keeps path as a prefix on every line rather than grouping by column — it preserves the record-local reading order an LLM expects.
The @map header (key abbreviations) and hybrid Schema-YAML blocks for dense
sub-arrays are PVFN's extensions beyond the base Dremel encoding, targeting the
additional overhead of long repeated key names that Parquet handles via separate
column metadata.
Three components:
- @map header — assigns camelCase-initialism abbreviations to any key appearing ≥ 2 times and ≥ 7 characters long. Collision resolution adds a digit suffix (SG → SG2).
- Path=value lines — one line per leaf value using dot-notation paths. Null/empty values produce no line. Arrays become numeric indices.
- Hybrid inline Schema-YAML — when a nested array is a dense homogeneous list of objects (all elements are objects, fill ≥ 55%, ≥ 2 elements), PVFN inlines a compact Schema-YAML block at that path rather than emitting one path.N.key=value line per cell.
Example output:
@map
NI=NetworkInterfaces
SG=SecurityGroups
instance.InstanceId=i-abc123
instance.InstanceType=m5.xlarge
instance.NI.0.SubnetId=subnet-xyz
instance.SG:
schema:[GroupId, GroupName]
data:
- [sg-abc, web-sg]
- [sg-def, db-sg]
instance.status=running
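The abbreviation assignment can be sketched as taking the uppercase initials of a camelCase key and adding a digit suffix on collision. This is a simplified guess at the rule; src/pvfn.rs is authoritative:

```rust
use std::collections::{HashMap, HashSet};

// Build camelCase-initialism abbreviations with digit-suffix
// collision resolution (SG -> SG2 -> SG3 ...). Sketch only.
fn build_map(keys: &[&str]) -> HashMap<String, String> {
    let mut used: HashSet<String> = HashSet::new();
    let mut map = HashMap::new();
    for key in keys {
        let initials: String = key.chars().filter(|c| c.is_ascii_uppercase()).collect();
        let base = if initials.is_empty() { key.to_string() } else { initials };
        let mut abbrev = base.clone();
        let mut n = 2;
        // insert() returns false on collision; retry with a digit suffix.
        while !used.insert(abbrev.clone()) {
            abbrev = format!("{}{}", base, n);
            n += 1;
        }
        map.insert(key.to_string(), abbrev);
    }
    map
}

fn main() {
    let m = build_map(&["NetworkInterfaces", "SecurityGroups", "SubnetGroups"]);
    assert_eq!(m["NetworkInterfaces"], "NI");
    assert_eq!(m["SecurityGroups"], "SG");
    assert_eq!(m["SubnetGroups"], "SG2"); // collision resolved with suffix
}
```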
Writes to two streams:
- stdout — the compressed payload. This is what the agent's tool-call result contains.
- stderr — a single-line token savings receipt. Ignored by agents; visible to humans and log aggregators.
Receipt format:
[TokenSieve] Original: 4821 tok | Compressed: 612 tok | Saved: 4209 (87.3%) | Shape: PVFN
stdout is flushed before the stderr write to prevent interleaving.
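Formatting the receipt is straightforward arithmetic. A sketch of the stderr line (the real emitter lives in src/handoff.rs):

```rust
// Build the single-line token savings receipt written to stderr.
fn receipt(original: usize, compressed: usize, shape: &str) -> String {
    let saved = original.saturating_sub(compressed);
    let pct = if original == 0 {
        0.0
    } else {
        saved as f64 / original as f64 * 100.0
    };
    format!(
        "[TokenSieve] Original: {} tok | Compressed: {} tok | Saved: {} ({:.1}%) | Shape: {}",
        original, compressed, saved, pct, shape
    )
}

fn main() {
    let line = receipt(4821, 612, "PVFN");
    assert_eq!(
        line,
        "[TokenSieve] Original: 4821 tok | Compressed: 612 tok | Saved: 4209 (87.3%) | Shape: PVFN"
    );
    eprintln!("{}", line); // stderr only, after stdout has been flushed
}
```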
The auditor (src/auditor.rs) uses tiktoken-rs with the cl100k_base BPE
vocabulary — the same tokenizer used by GPT-4 and close to Claude's. It runs
fully offline.
Results from the stress test suite run against a live AWS account:
| Scenario | Typical savings | Primary driver |
|---|---|---|
| EKS describe-cluster | ~66% | Base64 cert blob redaction (~800 tok removed) |
| EC2 describe-security-groups (5) | ~55% | Abbreviation map + deduper |
| EC2 describe-subnets (8) | ~52% | Abbreviation map + hybrid block |
| EC2 describe-vpcs | ~49% | SchemaYAML key deduplication |
| EC2 describe-instances (6) | ~42% | PVFN structural removal + abbreviations |
| CloudWatch describe-log-groups (10) | ~30% | Hybrid block + abbreviation map |
| IAM list-roles (5) | ~27% | PVFN structural removal |
| KMS list-keys (10) | ~15% | SchemaYAML (2-column dense schema) |
| Simple lists (addons, policies) | 5–32% | Format overhead amortization |
Overall average across 17 stress tests: 46.9% token savings (40,483 tokens in → 21,487 tokens out)
The receipt makes savings verifiable — not estimated. The exact token delta is computed on the actual input/output pair for every call.
tokensieve/
├── Cargo.toml — dependencies: serde_json, serde_yaml, regex, tiktoken-rs, once_cell, tokio
├── docs/
│ ├── ARCHITECTURE.md — this document
│ └── stress-tests.md — measured compression results across 17 real AWS API calls
└── src/
├── main.rs — entry point: PATH proxy mode + fetch mode, pipeline orchestration
├── scrubber.rs — ANSI escape sequence removal
├── sieve.rs — recursive JSON pruner + content-only base64 blob redactor
├── deduper.rs — epoch timestamp stripping + first-seen-wins deduplication
├── router.rs — shape detection, Schema-YAML serialization, sparsity guard
├── pvfn.rs — PVFN formatter: @map header, path=value lines, hybrid blocks
├── auditor.rs — BPE token counting (cl100k_base), receipt formatting
└── handoff.rs — stdout/stderr split emitter
# 1. Build
cargo build --release
# 2. Create the sieve bin directory
mkdir -p ~/.tokensieve/bin
# 3. Symlink any CLI tools you want intercepted
ln -sf $(pwd)/target/release/tokensieve ~/.tokensieve/bin/aws
ln -sf $(pwd)/target/release/tokensieve ~/.tokensieve/bin/databricks
ln -sf $(pwd)/target/release/tokensieve ~/.tokensieve/bin/kubectl
ln -sf $(pwd)/target/release/tokensieve ~/.tokensieve/bin/docker
ln -sf $(pwd)/target/release/tokensieve ~/.tokensieve/bin/gh
# 4. Permanently prepend to PATH
# bash users:
echo 'export PATH="$HOME/.tokensieve/bin:$PATH"' >> ~/.bash_profile
# zsh users:
echo 'export PATH="$HOME/.tokensieve/bin:$PATH"' >> ~/.zshrc
# 5. Apply to the current terminal immediately
source ~/.bash_profile   # or: source ~/.zshrc

After step 4 every new terminal session will automatically resolve aws,
kubectl, etc. to the tokensieve symlinks — no manual export needed.
To verify the intercept is active:
which aws
# → /Users/<you>/.tokensieve/bin/aws

To add a new tool, add another symlink — no code changes required. To remove a tool, delete its symlink.
⚠ Do not symlink general-purpose shell utilities such as cat, jq, or curl.
TokenSieve intercepts every invocation of a symlinked command — including ones
you run in your own terminal for unrelated purposes. Symlinking cat for
pipeline testing, for example, means that any script or tool that pipes output
through cat will have its JSON silently rewritten, and any cat > file that
writes non-JSON content will behave unexpectedly. Only symlink purpose-built
CLI tools whose JSON output you explicitly want compressed (aws, databricks,
kubectl, the gh CLI used for infra queries, etc.). If you need to test the
pipeline manually, pipe directly into the binary:

echo '{"key": "value"}' | /path/to/target/release/tokensieve
# Pipe any JSON through the tokensieve binary directly
echo '{"UserId":"AIDA...","Account":"123456789","Arn":"arn:aws:iam::123456789:user/alice",
"ResponseMetadata":{"RequestId":"abc","HTTPStatusCode":200,"HTTPHeaders":{},"RetryAttempts":0}}' \
| /path/to/tokensieve
# Expected stdout (PVFN, HTTPHeaders and empty fields pruned):
# Account=123456789
# Arn=arn:aws:iam::123456789:user/alice
# ResponseMetadata.HTTPStatusCode=200
# ResponseMetadata.RequestId=abc
# UserId=AIDA...
# Expected stderr:
# [TokenSieve] Original: 38 tok | Compressed: 22 tok | Saved: 16 (42.1%) | Shape: PVFN

Why content-only base64 detection (no key hints)?
The earlier implementation required either a matching key hint (certificate,
kubeconfig, private_key, etc.) OR the string to be ≥ 500 characters before
scrubbing a generic key like data. This missed the common EKS pattern where
certificateAuthority.data has a generic terminal key "data" at 232 characters.
The three-part content gate (length + character set + alphanumeric ratio) is
specific enough on its own — an LLM cannot use raw base64 regardless of the key
name, so the key name adds no useful signal.
Why PVFN instead of YAML as the fallback?
YAML still repeats key names for every object in a nested structure. PVFN's
design is grounded in the Dremel paper's central finding (Google, VLDB 2010):
structural nesting tokens exist to help a parser reconstruct the tree — a
reader that understands the schema doesn't need them. An LLM reading
a.b.c=value extracts the same information as reading {"a": {"b": {"c": "value"}}} with zero braces, brackets, or repeated quotes. On top of that,
PVFN's @map header amortizes long repeated key names across the whole document
(not just one nesting level), and hybrid inline Schema-YAML blocks handle dense
sub-arrays. For real AWS responses, PVFN consistently produces 5–15% fewer
tokens than equivalent YAML for the same nested structures.
Why nested array scoping in the deduper?
A flat global seen-set produced incorrect results for resources with repeated
parallel sub-structures (e.g. NetworkInterfaces[0] and NetworkInterfaces[1]
on the same subnet). With global scoping, NI[1].SubnetId would be stripped as
a duplicate of NI[0].SubnetId even though both are semantically distinct.
Nested array scoping — each array element gets a snapshot of the parent's
seen-set rather than a shared mutable reference — fixes this for all resource
types without requiring per-resource configuration.
Why a two-pass object traversal in the deduper?
When building the seen-set snapshot for a nested array, all parent scalar fields
must already be registered — regardless of their alphabetical position relative to
the array key. A single alphabetical pass would snapshot too early if the array
key sorts before some parent scalars (e.g. NetworkInterfaces (N) before
vpc_id (v)). Two passes (scalars+objects first, arrays second) guarantees the
snapshot is always complete.
Why a sparsity guard in the router?
Schema-YAML pays for itself only when the union schema is dense — each column
appears in most rows. When two CLI responses with different shapes are merged by
fetch mode, the union can have hundreds of columns and most cells are -
placeholders. Below 55% fill, the - tokens cost more than the saved key
repetitions and the router falls through to PVFN.
Why stderr for the receipt? Agents read tool stdout as structured data. Mixing observability output into stdout would corrupt the payload. Stderr is the Unix convention for diagnostic output and is captured separately by most logging systems.
Why a static Rust binary? A statically linked binary installs by copying a single file — no runtime dependencies, no version conflicts with the tools being proxied. The process startup overhead is under 5 ms, negligible against the network I/O of any real CLI call.