feat: support token-based limits in addition to line counts #54

@jakekaplan

Description

loq currently enforces file size via line counts. For LLM-assisted workflows, token count is often the more meaningful metric — a 300-line file of dense code can burn far more context than a 500-line file of simple declarations.

Proposal

Add token-based limits alongside line-based limits. Metrics can be set per-rule — some files are human-facing (lines make sense), others are agent-facing (tokens matter more).

Config surface

Three mutually exclusive limit fields — self-describing, no ambiguity:

# Top-level defaults (pick one)
default_max_lines = 500       # today's config, still works
# default_max_tokens = 4000   # or this
# default_max = 4000          # or this (requires default_metric)
# default_metric = "tokens"

[[rules]]
path = "**/*.rs"
max_lines = 500               # implies lines, no metric field needed

[[rules]]
path = "prompts/**/*.md"
max_tokens = 8000             # implies tokens, no metric field needed

[[rules]]
path = "scripts/**"
max = 4000                    # generic form, inherits default_metric

Rules:

  • max_lines and max_tokens are self-describing — no separate metric field needed
  • max is the generic form, inherits metric from default_metric (errors if none set)
  • Using more than one of max / max_lines / max_tokens on the same rule or at the top level is a validation error
  • metric field on a rule is only valid alongside max, not alongside max_lines / max_tokens
  • Existing configs with max_lines / default_max_lines work unchanged — fully backwards compatible
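The resolution rules above can be sketched as a single validation step during config parsing. This is a minimal sketch, not loq's actual code — `RawRule`, `resolve_limit`, and the error strings are hypothetical names:

```rust
// Hypothetical deserialized rule: at most one limit field may be set.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Metric { Lines, Tokens }

struct RawRule {
    max: Option<usize>,
    max_lines: Option<usize>,
    max_tokens: Option<usize>,
    metric: Option<Metric>,
}

// Resolve a rule to (limit, metric), enforcing mutual exclusivity.
fn resolve_limit(rule: &RawRule, default_metric: Option<Metric>) -> Result<(usize, Metric), String> {
    let set = [rule.max, rule.max_lines, rule.max_tokens]
        .iter()
        .filter(|f| f.is_some())
        .count();
    if set > 1 {
        return Err("use only one of max / max_lines / max_tokens".to_string());
    }
    // `metric` only makes sense with the generic `max` form.
    if rule.metric.is_some() && rule.max.is_none() {
        return Err("`metric` is only valid alongside `max`".to_string());
    }
    match (rule.max_lines, rule.max_tokens, rule.max) {
        (Some(n), _, _) => Ok((n, Metric::Lines)),   // self-describing
        (_, Some(n), _) => Ok((n, Metric::Tokens)),  // self-describing
        (_, _, Some(n)) => {
            // Generic form: rule-level metric wins, else default_metric, else error.
            let metric = rule
                .metric
                .or(default_metric)
                .ok_or_else(|| "`max` requires `metric` or `default_metric`".to_string())?;
            Ok((n, metric))
        }
        _ => Err("no limit set".to_string()),
    }
}
```

Note that an unadorned `max_lines` rule resolves without consulting `default_metric` at all, which is what keeps existing configs working unchanged.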

What needs to change

loq_core

  • Add a Metric enum (Lines | Tokens)
  • Add max and max_tokens as limit fields alongside existing max_lines
  • Validate mutual exclusivity during config parsing
  • Carry Metric through report types so output formatters know the unit
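Carrying the `Metric` through report types lets formatters pick the right unit without re-deriving it. A rough sketch, with hypothetical `Violation` and `format_violation` names standing in for loq's real report types:

```rust
#[derive(Clone, Copy)]
enum Metric { Lines, Tokens }

impl Metric {
    // Unit string for output formatters.
    fn unit(self) -> &'static str {
        match self {
            Metric::Lines => "lines",
            Metric::Tokens => "tokens",
        }
    }
}

// Hypothetical report entry: the metric travels with the counts.
struct Violation {
    path: String,
    count: usize,
    limit: usize,
    metric: Metric,
}

fn format_violation(v: &Violation) -> String {
    format!("{}: {} {} (limit {})", v.path, v.count, v.metric.unit(), v.limit)
}
```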

loq_fs

  • New count_tokens.rs module — approximate token counting via bytes / 4
  • Branch on metric in check_file to call line counting or token counting
  • Include metric in cache config hash so cache invalidates on metric change
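The proposed `count_tokens.rs` could be as small as this. The bytes / 4 ratio is the issue's approximation; rounding up (so tiny non-empty files never report zero tokens) is an added assumption:

```rust
// Approximate token count: ~4 bytes per token for English-heavy code.
// Ceiling division is an assumption; plain truncation would also work.
fn count_tokens(contents: &[u8]) -> usize {
    (contents.len() + 3) / 4
}
```

Because this is a pure function of the byte length, it is effectively free compared to line counting and needs no special cache treatment.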

loq_cli

  • Dynamic unit in output messages ("lines" vs "tokens")
  • Optional --metric CLI flag override (stomps everything — default and all rules)

decide.rs is already metric-agnostic — it just returns a limit: usize. It will need to carry the Metric alongside the limit so the measurement layer knows what to count.
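One way to thread the metric through: return a small struct instead of a bare `usize`. The `ResolvedLimit` and `decide` names below are illustrative, and the rule matching is elided to a toy path check:

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Metric { Lines, Tokens }

// Hypothetical return type: the limit now travels with its unit.
#[derive(Debug, PartialEq)]
struct ResolvedLimit {
    limit: usize,
    metric: Metric,
}

// Toy stand-in for decide.rs: real rule matching (globs, precedence) elided.
fn decide(path: &str) -> ResolvedLimit {
    if path.ends_with(".md") {
        ResolvedLimit { limit: 8000, metric: Metric::Tokens }
    } else {
        ResolvedLimit { limit: 500, metric: Metric::Lines }
    }
}
```

The measurement layer then branches on `metric` to pick line counting or token counting, without decide.rs knowing how either is computed.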

Token counting strategy

Start with a fast approximation: bytes / 4. This is roughly correct for English-heavy code, fast enough to not need special caching treatment, and avoids pulling in a tokenizer dependency. A future iteration can add real tokenizer support (tiktoken-rs with cl100k_base, o200k_base, etc.) behind a config flag.

Other considerations

  • Output phrasing: token counts from bytes / 4 are approximate — decide whether to mark them with ~ or just say "tokens" without hedging
  • Baseline files: may need to store the metric so baselines created under one metric don't silently apply to another
  • Binary files: existing null-byte detection still applies before token counting; no change needed

Decisions

  • Naming: three mutually exclusive fields (max / max_lines / max_tokens) — self-describing, backwards compatible
  • Per-rule metrics: supported — rules can independently use lines or tokens
  • Tokenizer: deferred. bytes / 4 approximation first, real tokenizer later
  • Default metric: lines — existing configs unchanged
  • CLI override: --metric stomps everything (default + all rules)
