Scan a project, count LLM tokens, and estimate cost — before you ship the call.
```bash
npx token-tally . --model gpt-4o
```

```text
┌─────────────────────────┬─────────────────┐
│ Metric                  │ Value           │
├─────────────────────────┼─────────────────┤
│ Model                   │ gpt-4o (openai) │
│ Files scanned           │ 42              │
│ Total input tokens      │ 18,204          │
│ Estimated output tokens │ 0               │
│ Input price / 1M tok    │ $2.5000         │
│ Output price / 1M tok   │ $10.0000        │
│ Input cost              │ $0.0455         │
│ Output cost             │ $0.000000       │
│ Total cost              │ $0.0455         │
└─────────────────────────┴─────────────────┘
```
- Live pricing: fetches the latest model prices from LiteLLM and caches them locally for 24 hours.
- Provider-aware tokenizers: exact counts for OpenAI (`js-tiktoken`) and optionally for Anthropic and Gemini via their APIs. DeepSeek uses `cl100k_base` as a close approximation.
- Interactive model picker: run without arguments to launch a wizard with ↑/↓ navigation, live fuzzy search across all available models, and prompts for every option.
- `.gitignore`-aware scanner: skips ignored files by default; supports `--include` / `--exclude` globs across 40+ file extensions.
- CI-friendly: `--json` output, `--budget` exit code 2, `--offline` for hermetic builds.
- Context window guard: `--warn-context` flags when total tokens exceed the model's limit.
Run on demand (no install required):

```bash
npx token-tally . --model gpt-4o
```

Install globally:

```bash
npm i -g @emstack/token-tally
```

Usage:

```bash
token-tally [path] --model <model> [options]
```

Run without a model to launch the interactive picker:

```bash
token-tally
# or during development:
bun run src/cli.ts
```

The wizard lists all available models with ↑/↓ navigation and live search, then prompts for every option. Press Enter to accept the shown default.
| Flag | Default | Description |
|---|---|---|
| `[path]` | `.` | Directory to scan. Defaults to the current working directory. |
| `-m, --model <name>` | none | LLM model ID used for tokenization and pricing, e.g. `gpt-4o`, `claude-3-5-sonnet-20241022`. |
| `-i, --include <glob...>` | all code files | Glob patterns for files to include. Multiple patterns are space-separated. |
| `-e, --exclude <glob...>` | none | Glob patterns for files to skip. |
| `--no-gitignore` | gitignore respected | Disables `.gitignore` filtering. |
| `--max-files <n>` | unlimited | Caps the total number of files scanned. |
| `--output-tokens <n>` | 20% of input | Estimated output tokens to include in the total cost calculation. See the discussion of output tokens below. |
| `--budget <usd>` | none | Exit with code 2 if total cost exceeds this USD amount. |
| `--warn-context` | `false` | Warn when total tokens exceed the model's `max_input_tokens`. |
| `--json` | `false` | Emit machine-readable JSON instead of a table. |
| `-v, --verbose` | `false` | Show a per-file token and cost breakdown. |
| `--refresh` | `false` | Force re-fetch of the remote price table, bypassing the 24-hour cache. |
| `--offline` | `false` | Use only the local cache or bundled static prices; never hit the network. |
| `--concurrency <n>` | `min(8, cpus)` | Number of parallel file workers. |
| `--anthropic-api-key <key>` | `$ANTHROPIC_API_KEY` | Use the Anthropic `messages.count_tokens` API for exact Claude 3+ counts. |
| `--gemini-api-key <key>` | `$GOOGLE_API_KEY` | Use the Google `countTokens` API for exact Gemini counts. |
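
The `--budget` and `--warn-context` rows describe two simple threshold checks. The sketch below illustrates that logic only; the type and function names are hypothetical, and returning exit code 0 when the budget is not exceeded is an assumption (the table only documents code 2).

```typescript
// Hypothetical shape for the totals a run produces; not token-tally's real types.
interface TallyTotals {
  totalTokens: number;  // tokens counted across all scanned files
  totalCostUsd: number; // estimated total cost in USD
}

// --budget <usd>: exit code 2 when the total cost exceeds the budget.
// Exit code 0 otherwise is an assumption (the conventional success code).
function budgetExitCode(totals: TallyTotals, budgetUsd?: number): number {
  if (budgetUsd === undefined) return 0; // no budget set, nothing to enforce
  return totals.totalCostUsd > budgetUsd ? 2 : 0;
}

// --warn-context: warn when total tokens exceed the model's max_input_tokens.
function contextWarning(totals: TallyTotals, maxInputTokens: number): string | null {
  return totals.totalTokens > maxInputTokens
    ? `total tokens (${totals.totalTokens}) exceed max_input_tokens (${maxInputTokens})`
    : null;
}

// Totals taken from the sample table; 128000 is an illustrative context limit.
const sample: TallyTotals = { totalTokens: 18204, totalCostUsd: 0.0455 };
console.log(budgetExitCode(sample, 0.05));
console.log(contextWarning(sample, 128000));
```

With the sample totals, the run stays under a $0.05 budget and fits the illustrative limit, so neither guard fires.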
LLM APIs charge for both the tokens you send (input) and the tokens the model returns (output). Because token-tally scans your source files statically, it cannot know how long the model's response will be.
When --output-tokens is not set, token-tally uses 20% of the total input token count as a default estimate.
This heuristic is based on the observation that typical LLM responses are 10–30% of the size of the input context.
Override it whenever you know your expected response length:
| Scenario | Suggested value |
|---|---|
| Quick summary or classification | 500–1 000 |
| Moderate answer with explanation | 2 000–4 000 |
| Long code generation / detailed analysis | 8 000–16 000 |
| Full context-window response | up to max_output_tokens of the model |
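
The default and the override described above can be sketched as a single function. This is an illustration rather than token-tally's actual code: the function name is hypothetical, and rounding the 20% estimate up to a whole token is an assumption.

```typescript
// Percentage of input tokens used when --output-tokens is not set.
const DEFAULT_OUTPUT_PERCENT = 20;

// Assumption: the estimate is rounded up to a whole token; the rounding
// rule token-tally actually applies is not stated in this document.
function estimateOutputTokens(inputTokens: number, override?: number): number {
  // An explicit --output-tokens value always wins, including 0.
  if (override !== undefined) return override;
  // Integer arithmetic first, so the division is the only inexact step.
  return Math.ceil((inputTokens * DEFAULT_OUTPUT_PERCENT) / 100);
}

console.log(estimateOutputTokens(18204));       // default: 20% of 18,204, rounded up
console.log(estimateOutputTokens(18204, 4000)); // fixed override
console.log(estimateOutputTokens(18204, 0));    // estimate disabled
```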

```bash
# use a fixed output token count
token-tally . --model claude-opus-4 --output-tokens 4000
# disable the output cost estimate entirely
token-tally . --model claude-opus-4 --output-tokens 0
```

Per-file breakdown:

```bash
token-tally src --model gpt-4o -v
```

CI cost gate (fail if total exceeds $0.05):

```bash
token-tally . --model gpt-4o --budget 0.05 --json
```

Warn if the project won't fit in a single context window:

```bash
token-tally . --model claude-3-5-sonnet-20241022 --warn-context
```

Force-refresh prices and stay offline after:

```bash
token-tally . --model gpt-4o --refresh
token-tally . --model gpt-4o --offline
```

Exact token counts for Claude 3+ via API:

```bash
ANTHROPIC_API_KEY=sk-... token-tally . --model claude-3-5-sonnet-20241022
```

As a CI workflow step:

```yaml
- name: Check token cost
  run: npx token-tally . --model gpt-4o --budget 1.00 --json > tally.json
```

Each provider uses a different tokenization strategy. The tool picks the right one automatically based on the model name.

OpenAI: exact via `js-tiktoken`

`js-tiktoken` is a pure JavaScript port of tiktoken, OpenAI's BPE tokenizer. The encoder is selected per model family:

| Model family | Encoder |
|---|---|
| `gpt-4o`, `o1`, `o3`, `o4`, `gpt-4.1`, `gpt-5` | `o200k_base` |
| `gpt-4`, `gpt-3.5`, older | `cl100k_base` |

Counts match the API's token counter exactly.
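
The family-to-encoder mapping in the table can be pictured as a prefix lookup. The sketch below assumes families are recognised by model-name prefix, which this document does not confirm; the function name is hypothetical.

```typescript
type Encoder = "o200k_base" | "cl100k_base";

// Model-family prefixes taken from the table above.
const O200K_PREFIXES = ["gpt-4o", "o1", "o3", "o4", "gpt-4.1", "gpt-5"];

// Assumption: a model belongs to a family when its name starts with the
// family prefix. token-tally's real matching rules may differ.
function pickOpenAIEncoder(model: string): Encoder {
  const name = model.toLowerCase();
  // Anything not in the o200k list falls back to cl100k_base ("older" row).
  return O200K_PREFIXES.some((prefix) => name.startsWith(prefix))
    ? "o200k_base"
    : "cl100k_base";
}

console.log(pickOpenAIEncoder("gpt-4o-mini"));   // o200k_base family
console.log(pickOpenAIEncoder("gpt-3.5-turbo")); // cl100k_base family
```

The returned name could then be handed to `js-tiktoken`'s `getEncoding` to build the encoder itself.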

Anthropic: approximate by default, exact with API key

Without a key, the legacy Claude 2 BPE tokenizer (`@anthropic-ai/tokenizer`) is used offline.
It was accurate for Claude 2, but drifts ~5–10% on Claude 3+ because Anthropic updated their tokenizer.

```bash
# enable exact counting via the official API
ANTHROPIC_API_KEY=sk-... token-tally . --model claude-3-5-sonnet-20241022
```

Gemini: rough approximation by default, exact with API key

Google does not publish an offline tokenizer. The fallback formula is:
tokens ≈ ceil(characters / 4)
This holds reasonably for average English text (~4 chars/token) but can diverge by ±20–40% on code, non-Latin scripts, or very short strings.
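
As a minimal sketch, the fallback formula is a one-liner. The function name is hypothetical, and counting "characters" as Unicode code points (rather than UTF-16 code units) is an assumption.

```typescript
// Fallback estimate: tokens ≈ ceil(characters / 4).
// Assumption: "characters" means Unicode code points; Array.from splits a
// string by code point, so an emoji counts once rather than twice.
function approxGeminiTokens(text: string): number {
  const characters = Array.from(text).length;
  return Math.ceil(characters / 4);
}

console.log(approxGeminiTokens("hello world")); // 11 characters
```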

```bash
# enable exact counting via the Generative Language API
GOOGLE_API_KEY=... token-tally . --model gemini-1.5-pro
```

DeepSeek: close approximation

Uses `cl100k_base` (GPT-4 family BPE). DeepSeek's tokenizer is derived from the same family and produces near-identical results in practice; expect ~2–5% drift.
No API-based exact mode is available for DeepSeek.
A warning is printed in the output whenever counts are approximate.
token-tally fetches LiteLLM's community-maintained price table and caches it at `~/.cache/token-tally/prices.json` for 24 hours; runs within that window reuse the cache.

- `--refresh` forces a re-fetch.
- `--offline` skips the network entirely, using the cache or the bundled static fallback.
- If the network fetch fails, the stale cache is used with a warning.
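
The rules above amount to a small decision procedure. The sketch below is one reading of them, not the tool's actual implementation: every name is hypothetical, and two behaviours are assumptions not stated here, namely that `--offline` takes precedence over `--refresh` and that a failed fetch with no cache falls back to the bundled prices.

```typescript
type PriceSource = "network" | "cache" | "cache-with-warning" | "bundled";

interface PriceContext {
  cacheAgeMs: number | null; // null when no cache file exists
  refresh: boolean;          // --refresh was passed
  offline: boolean;          // --offline was passed
  networkOk: boolean;        // whether a fetch would succeed (input for the sketch)
}

const CACHE_TTL_MS = 24 * 60 * 60 * 1000; // 24-hour cache lifetime

function choosePriceSource(ctx: PriceContext): PriceSource {
  const hasCache = ctx.cacheAgeMs !== null;
  // --offline: never touch the network; use the cache or the bundled fallback.
  if (ctx.offline) return hasCache ? "cache" : "bundled";
  // A cache younger than 24 hours is reused unless --refresh forces a fetch.
  const fresh = ctx.cacheAgeMs !== null && ctx.cacheAgeMs < CACHE_TTL_MS;
  if (fresh && !ctx.refresh) return "cache";
  if (ctx.networkOk) return "network";
  // Fetch failed: fall back to whatever cache exists, with a warning.
  return hasCache ? "cache-with-warning" : "bundled";
}

console.log(choosePriceSource({ cacheAgeMs: 1000, refresh: false, offline: false, networkOk: true }));
```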
Prices are taken directly from the input_cost_per_token and output_cost_per_token fields in the LiteLLM table — no rounding or transformation is applied.
total = (input_tokens × input_cost_per_token)
+ (output_tokens × output_cost_per_token)
When `--output-tokens` is not set, `output_tokens` uses the default estimate described earlier (20% of the input token count). Pass `--output-tokens <n>` to set an expected response length explicitly.
Note: The formula does not account for system prompts billed separately, API call overhead, caching discounts, or streaming surcharges. Use it as a planning estimate, not a billing guarantee.
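
As a worked instance of the formula, the sketch below reproduces the sample run's numbers. The interface and function names are hypothetical; the two field names come from the LiteLLM table, and the per-token prices are the sample's $2.50 and $10.00 per 1M tokens.

```typescript
interface ModelPrice {
  input_cost_per_token: number;  // USD per input token (LiteLLM field name)
  output_cost_per_token: number; // USD per output token (LiteLLM field name)
}

// total = (input_tokens × input_cost_per_token) + (output_tokens × output_cost_per_token)
function totalCost(inputTokens: number, outputTokens: number, price: ModelPrice): number {
  return inputTokens * price.input_cost_per_token + outputTokens * price.output_cost_per_token;
}

// $2.50 and $10.00 per 1M tokens, expressed per token.
const gpt4o: ModelPrice = { input_cost_per_token: 2.5e-6, output_cost_per_token: 10e-6 };

// 18,204 input tokens and 0 output tokens, as in the sample table.
console.log(totalCost(18204, 0, gpt4o).toFixed(4)); // "0.0455"
```

Adding 4,000 output tokens to the same run raises the total by 4,000 × $0.00001 = $0.04.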
| Provider | Token accuracy | How to get exact counts |
|---|---|---|
| OpenAI | 100% — exact | Built-in, no key needed |
| Anthropic | ~90–95% without key | Pass --anthropic-api-key |
| Gemini | ~60–80% without key | Pass --gemini-api-key |
| DeepSeek | ~95–98% | No exact mode available |
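
One way to read the accuracy column is as an error band around the reported count. The sketch below converts the worst-case drift figures quoted for each provider into lower and upper token bounds. It assumes the drift is symmetric and relative to the reported count, which is only stated explicitly for Gemini; the names are hypothetical.

```typescript
// Worst-case drift (percent) without an API key, from the provider notes:
// Anthropic ~5–10%, Gemini ±20–40%, DeepSeek ~2–5%, OpenAI exact.
const MAX_DRIFT_PCT = { openai: 0, anthropic: 10, gemini: 40, deepseek: 5 };

type Provider = keyof typeof MAX_DRIFT_PCT;

// Assumption: drift is symmetric around the reported count.
function tokenBounds(counted: number, provider: Provider): [number, number] {
  const pct = MAX_DRIFT_PCT[provider];
  // Multiply before dividing so whole-number results stay exact.
  const low = Math.floor((counted * (100 - pct)) / 100);
  const high = Math.ceil((counted * (100 + pct)) / 100);
  return [low, high];
}

console.log(tokenBounds(18204, "anthropic"));
```

Feeding the upper bound into a cost estimate gives a more cautious figure for budgeting when exact counting is not enabled.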
Pricing accuracy depends on LiteLLM's community table being up to date. Major models are typically current; niche or very new models may lag by a few days.

```bash
bun install        # install dependencies
bun run dev        # run CLI locally
bun test           # run tests
bun run typecheck  # TypeScript check
bun run lint       # ESLint
bun run build      # build dist/cli.js
```

- CONTRIBUTING.md: how to contribute
- CODE_OF_CONDUCT.md: community standards
- SECURITY.md: reporting vulnerabilities
- CHANGELOG.md: release history

MIT