feat: publish versioned schemas for JSONL audit/telemetry artifacts #2378

@lpcox

Problem

AWF components emit several JSONL files as runtime artifacts:

  1. token-usage.jsonl (api-proxy token-tracker.js) — per-API-call token usage records
  2. audit.jsonl (Squid proxy) — L7 HTTP/HTTPS traffic decisions (allow/deny per request)
  3. events.jsonl (Copilot CLI session state) — agent session events

These files are consumed by:

  • The awf logs stats / awf logs summary commands (log-aggregator, log-parser)
  • GitHub Actions workflows that upload firewall-audit-logs artifacts
  • External tooling and compliance auditors analyzing workflow security posture
  • Research scripts (e.g., scripts/paper/collect-token-data.ts)

The problem: there is no published schema for these files. The format is defined implicitly by the code that writes them, which makes it fragile for consumers:

  • If a field is added or renamed, or its type changes, consumers can break silently
  • External auditors cannot validate that logs conform to an expected structure
  • gh-aw and other upstream tools cannot programmatically discover what fields are available
  • There is no internal compliance mechanism ensuring writers conform to a contract

Proposal

1. Define schemas (JSON Schema or TypeScript interfaces)

For each JSONL file, publish a versioned schema describing the record structure:

token-usage.jsonl (current implicit schema from token-tracker.js):

{
  "timestamp": "string (ISO 8601)",
  "request_id": "string (UUID)",
  "provider": "string (anthropic|openai|copilot|gemini)",
  "model": "string",
  "path": "string (API endpoint path)",
  "status": "number (HTTP status code)",
  "streaming": "boolean",
  "input_tokens": "number",
  "output_tokens": "number",
  "cache_read_tokens": "number",
  "cache_write_tokens": "number",
  "duration_ms": "number"
}
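
As a rough illustration of the TypeScript-interface option, the record above could be expressed as an interface paired with a runtime type guard. The names `TokenUsageRecord` and `isTokenUsageRecord` are hypothetical and do not come from token-tracker.js. The sketch also assumes all twelve fields are required, which the implicit schema does not state.

```typescript
// Hypothetical names; the fields mirror the implicit schema listed above.
type Provider = "anthropic" | "openai" | "copilot" | "gemini";

interface TokenUsageRecord {
  timestamp: string; // ISO 8601
  request_id: string; // UUID
  provider: Provider;
  model: string;
  path: string; // API endpoint path
  status: number; // HTTP status code
  streaming: boolean;
  input_tokens: number;
  output_tokens: number;
  cache_read_tokens: number;
  cache_write_tokens: number;
  duration_ms: number;
}

const PROVIDERS: readonly string[] = ["anthropic", "openai", "copilot", "gemini"];
const STRING_FIELDS = ["timestamp", "request_id", "model", "path"] as const;
const NUMBER_FIELDS = [
  "status",
  "input_tokens",
  "output_tokens",
  "cache_read_tokens",
  "cache_write_tokens",
  "duration_ms",
] as const;

// Runtime check that a parsed JSONL line has the expected field types.
// Unknown extra fields are tolerated (assumption: additive changes are allowed).
function isTokenUsageRecord(value: unknown): value is TokenUsageRecord {
  if (typeof value !== "object" || value === null) return false;
  const r = value as Record<string, unknown>;
  return (
    STRING_FIELDS.every((k) => typeof r[k] === "string") &&
    NUMBER_FIELDS.every((k) => typeof r[k] === "number") &&
    typeof r.streaming === "boolean" &&
    typeof r.provider === "string" &&
    PROVIDERS.includes(r.provider)
  );
}
```

The guard checks only primitive types and the provider enum. It does not check that `timestamp` is valid ISO 8601 or that `request_id` is a UUID, which a JSON Schema with `format` keywords could cover.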

audit.jsonl (current Squid logformat):

{
  "ts": "number (Unix timestamp with ms)",
  "client": "string (IP)",
  "host": "string (domain)",
  "dest": "string (IP:port)",
  "method": "string (CONNECT|GET|POST|...)",
  "status": "number (HTTP status)",
  "decision": "string (TCP_TUNNEL|TCP_DENIED|...)",
  "url": "string"
}
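
For consumers, a typed view of these records might look like the sketch below, which parses JSONL text and tallies records per `decision` value. `AuditRecord`, `parseAuditJsonl`, and `countByDecision` are hypothetical names, not part of the existing log-parser. `decision` is kept as a plain string because the full set of Squid result codes is not enumerated here.

```typescript
// Hypothetical names; the fields mirror the Squid logformat listed above.
interface AuditRecord {
  ts: number; // Unix timestamp with ms
  client: string; // IP
  host: string; // domain
  dest: string; // IP:port
  method: string;
  status: number; // HTTP status
  decision: string; // left open: the full code list is not given above
  url: string;
}

// Parse JSONL text into records, skipping blank lines.
// Note: the cast does not validate; a real consumer would check each record.
function parseAuditJsonl(text: string): AuditRecord[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as AuditRecord);
}

// Tally how many records carry each decision code.
function countByDecision(records: AuditRecord[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const r of records) {
    counts.set(r.decision, (counts.get(r.decision) ?? 0) + 1);
  }
  return counts;
}
```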

2. Internal schema compliance

  • Writers (token-tracker.js, squid-config.ts) should validate records against the schema before emitting
  • Tests should assert that emitted records conform to the schema
  • Schema changes require a version bump so consumers can handle migrations
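
Validate-before-emit could be sketched roughly as follows. It uses a hand-written field map as a stand-in for a real JSON Schema validator. `FieldSpec`, `TOKEN_USAGE_V1`, `validateRecord`, and `toJsonlLine` are all hypothetical names, and whether a writer should throw or merely log on a non-conforming record is a design decision this proposal leaves open.

```typescript
// Hypothetical stand-in for a JSON Schema validator: field name -> primitive type.
type FieldType = "string" | "number" | "boolean";
type FieldSpec = Readonly<Record<string, FieldType>>;

// Assumed required fields for token-usage records, taken from the implicit schema.
const TOKEN_USAGE_V1: FieldSpec = {
  timestamp: "string",
  request_id: "string",
  provider: "string",
  model: "string",
  path: "string",
  status: "number",
  streaming: "boolean",
  input_tokens: "number",
  output_tokens: "number",
  cache_read_tokens: "number",
  cache_write_tokens: "number",
  duration_ms: "number",
};

// Return a list of human-readable problems; an empty list means the record conforms.
function validateRecord(record: Record<string, unknown>, spec: FieldSpec): string[] {
  const errors: string[] = [];
  for (const [field, expected] of Object.entries(spec)) {
    if (!(field in record)) {
      errors.push(`missing field: ${field}`);
      continue;
    }
    const actual = typeof record[field];
    if (actual !== expected) {
      errors.push(`field ${field}: expected ${expected}, got ${actual}`);
    }
  }
  return errors;
}

// Serialize one record as a JSONL line, refusing to emit non-conforming records.
function toJsonlLine(record: Record<string, unknown>, spec: FieldSpec): string {
  const errors = validateRecord(record, spec);
  if (errors.length > 0) {
    throw new Error(`non-conforming record: ${errors.join("; ")}`);
  }
  return JSON.stringify(record) + "\n";
}
```

The same `validateRecord` helper could back the test-side assertion, so that writers and tests share one definition of conformance.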

3. Publish schemas as artifacts

  • Include schema files in releases (e.g., schemas/token-usage.v1.schema.json)
  • Embed schema version in each JSONL record (e.g., "_schema": "token-usage/v1") or in a companion .schema.json file alongside the JSONL
  • Document schema evolution policy (additive-only changes within a major version; breaking changes require a new major version)
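
On the consumer side, handling an embedded version tag might look like the sketch below. It assumes the `"<name>/v<major>"` tag shape from the `"_schema": "token-usage/v1"` example. `parseSchemaTag` and `acceptsRecord` are hypothetical helpers. Because only a major version appears in the example tag, the sketch compares majors only and leaves unknown extra fields alone.

```typescript
interface SchemaTag {
  name: string;
  major: number;
}

// Parse a tag of the assumed form "<name>/v<major>", e.g. "token-usage/v1".
// Returns null for anything that does not match that shape.
function parseSchemaTag(tag: unknown): SchemaTag | null {
  if (typeof tag !== "string") return null;
  const idx = tag.lastIndexOf("/v");
  if (idx <= 0) return null; // separator missing, or empty schema name
  const name = tag.slice(0, idx);
  const majorText = tag.slice(idx + 2);
  if (!/^\d+$/.test(majorText)) return null;
  return { name, major: Number(majorText) };
}

// A consumer accepts a record only if it carries the expected schema name and
// a major version it understands. Extra fields are ignored, in line with an
// additive-only policy within a major version.
function acceptsRecord(
  record: Record<string, unknown>,
  name: string,
  supportedMajor: number,
): boolean {
  const tag = parseSchemaTag(record["_schema"]);
  return tag !== null && tag.name === name && tag.major === supportedMajor;
}
```

With a companion .schema.json file instead of a per-record tag, the same check would run once per file rather than once per line.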

Benefits

  • Auditability: compliance tools can validate that AWF logs contain expected fields
  • Extensibility: new fields (e.g., resolved_model for aliasing, rate_limit_applied) can be added with confidence that consumers handle unknowns gracefully
  • Interoperability: gh-aw, external SIEM tools, and research scripts can rely on a stable contract
  • Regression detection: CI tests validate writers never emit non-conforming records

Current Writers

File               Writer                   Location
token-usage.jsonl  api-proxy token tracker  containers/api-proxy/token-tracker.js:34
audit.jsonl        Squid logformat          src/squid-config.ts:603-609
events.jsonl       Copilot CLI (external)   Consumed via --session-state-dir

Labels: enhancement (New feature or request)