Commit d5e2041

feat: library overhaul for any LLM
1 parent fe62239 commit d5e2041

12 files changed: +1982, −3167 lines


.github/workflows/ci.yml

Lines changed: 3 additions & 3 deletions
@@ -17,7 +17,7 @@ jobs:
       - uses: pnpm/action-setup@v3
       - uses: actions/setup-node@v4
         with:
-          node-version: 20
+          node-version: 22
           cache: pnpm
       - run: pnpm install
       - run: pnpm run lint
@@ -29,7 +29,7 @@ jobs:
       - uses: pnpm/action-setup@v3
       - uses: actions/setup-node@v4
         with:
-          node-version: 20
+          node-version: 22
           cache: pnpm
       - run: pnpm install
       - run: pnpm run test:types
@@ -41,7 +41,7 @@ jobs:
       - uses: pnpm/action-setup@v3
       - uses: actions/setup-node@v4
         with:
-          node-version: 20
+          node-version: 22
           cache: pnpm
       - run: pnpm install
       - run: pnpm run test

.github/workflows/release.yml

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ jobs:
       - uses: pnpm/action-setup@v3
       - uses: actions/setup-node@v4
         with:
-          node-version: 20
+          node-version: 22
           registry-url: https://registry.npmjs.org/
           cache: pnpm

README.md

Lines changed: 63 additions & 57 deletions
@@ -1,6 +1,8 @@
 # tokenx
 
-GPT token count and context size utilities when approximations are good enough. For advanced use cases, please use a full tokenizer like [`gpt-tokenizer`](https://github.com/niieani/gpt-tokenizer). This library is intended to be used for quick estimations and to avoid the overhead of a full tokenizer, e.g. when you want to limit your bundle size.
+Fast and lightweight token count estimation for any LLM without requiring a full tokenizer. This library provides quick approximations that are good enough for most use cases while keeping your bundle size minimal.
+
+For advanced use cases requiring precise token counts, please use a full tokenizer like [`gpt-tokenizer`](https://github.com/niieani/gpt-tokenizer).
 
 ## Benchmarks
 
@@ -11,18 +13,20 @@ The following table shows the accuracy of the token count approximation for diff
 | --- | --- | --- | --- |
 | Short English text | 10 | 11 | 10.00% |
 | German text with umlauts | 56 | 49 | 12.50% |
-| Metamorphosis by Franz Kafka (English) | 31892 | 33930 | 6.39% |
-| Die Verwandlung by Franz Kafka (German) | 40621 | 34908 | 14.06% |
-| 道德經 by Laozi (Chinese) | 14387 | 11919 | 17.15% |
-| TypeScript ES5 Type Declarations (~ 4000 loc) | 48408 | 51688 | 6.78% |
+| Metamorphosis by Franz Kafka (English) | 31892 | 35705 | 11.96% |
+| Die Verwandlung by Franz Kafka (German) | 40621 | 35069 | 13.67% |
+| 道德經 by Laozi (Chinese) | 14387 | 12059 | 16.18% |
+| TypeScript ES5 Type Declarations (~ 4000 loc) | 48553 | 52434 | 7.99% |
 <!-- END GENERATED TOKEN COUNT TABLE -->
 
 ## Features
 
-- 🌁 Estimate token count without a full tokenizer
-- 📐 Supports multiple model context sizes
-- 🗣️ Supports accented characters, like German umlauts or French accents
+- ⚡ Fast token estimation without a full tokenizer
+- 🌍 Multi-language support with configurable language rules
+- 🗣️ Built-in support for accented characters (German, French, Spanish, etc.)
+- 🔧 Configurable and extensible
 - 🪽 Zero dependencies
+- 📦 Tiny bundle size
 
 ## Installation
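The deviation column in the benchmark table above is the relative error of the estimate against the reference tokenizer count in the first numeric column, e.g. for Metamorphosis: |35705 − 31892| / 31892 ≈ 11.96%. As an illustration (the helper name below is hypothetical, not part of the library):

```typescript
// Relative deviation between a reference token count and an estimate,
// formatted like the benchmark table. Helper name is illustrative only.
function deviationPercent(reference: number, estimated: number): string {
  return `${(Math.abs(estimated - reference) / reference * 100).toFixed(2)}%`
}

console.log(deviationPercent(31892, 35705)) // "11.96%"
console.log(deviationPercent(10, 11)) // "10.00%"
```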

@@ -42,76 +46,72 @@ yarn add tokenx
 ## Usage
 
 ```ts
-import {
-  approximateMaxTokenSize,
-  approximateTokenSize,
-  isWithinTokenLimit
-} from 'tokenx'
+import { estimateTokenCount, isWithinTokenLimit } from 'tokenx'
 
-const prompt = 'Your prompt goes here.'
-const inputText = 'Your text goes here.'
+const text = 'Your text goes here.'
 
-// Estimate the number of tokens in the input text
-const estimatedTokens = approximateTokenSize(inputText)
+// Estimate the number of tokens in the text
+const estimatedTokens = estimateTokenCount(text)
 console.log(`Estimated token count: ${estimatedTokens}`)
 
-// Calculate the maximum number of tokens allowed for a given model
-const modelName = 'gpt-3.5-turbo'
-const maxResponseTokens = 1000
-const availableTokens = approximateMaxTokenSize({
-  prompt,
-  modelName,
-  maxTokensInResponse: maxResponseTokens
-})
-console.log(`Available tokens for model ${modelName}: ${availableTokens}`)
-
-// Check if the input text is within a specific token limit
+// Check if text is within a specific token limit
 const tokenLimit = 1024
-const withinLimit = isWithinTokenLimit(inputText, tokenLimit)
+const withinLimit = isWithinTokenLimit(text, tokenLimit)
 console.log(`Is within token limit: ${withinLimit}`)
-```
 
-## API
-
-### `approximateTokenSize`
+// Use custom options for different languages or models
+const customOptions = {
+  defaultCharsPerToken: 4, // More conservative estimation
+  languageConfigs: [
+    { pattern: /[你我他]/g, averageCharsPerToken: 1.5 }, // Custom Chinese rule
+  ]
+}
 
-Estimates the number of tokens in a given input string based on common English patterns and tokenization heuristics. Work well for other languages too, like German.
-
-**Usage:**
-
-```ts
-const estimatedTokens = approximateTokenSize('Hello, world!')
+const customEstimate = estimateTokenCount(text, customOptions)
+console.log(`Custom estimate: ${customEstimate}`)
 ```
 
-**Type Declaration:**
-
-```ts
-function approximateTokenSize(input: string): number
-```
+## API
 
-### `approximateMaxTokenSize`
+### `estimateTokenCount`
 
-Calculates the maximum number of tokens that can be included in a response given the prompt length and model's maximum context size.
+Estimates the number of tokens in a given input string using heuristic rules that work across multiple languages and text types.
 
 **Usage:**
 
 ```ts
-const maxTokens = approximateMaxTokenSize({
-  prompt: 'Sample prompt',
-  modelName: 'text-davinci-003',
-  maxTokensInResponse: 500
+const estimatedTokens = estimateTokenCount('Hello, world!')
+
+// With custom options
+const customEstimate = estimateTokenCount('Bonjour le monde!', {
+  defaultCharsPerToken: 4,
+  languageConfigs: [
+    { pattern: /[éèêëàâîï]/i, averageCharsPerToken: 3 }
+  ]
 })
 ```
 
 **Type Declaration:**
 
 ```ts
-function approximateMaxTokenSize({ prompt, modelName, maxTokensInResponse }: {
-  prompt: string
-  modelName: ModelName
-  /** The maximum number of tokens to generate in the reply. 1000 tokens are roughly 750 English words. */
-  maxTokensInResponse?: number
-}): number
+function estimateTokenCount(
+  text?: string,
+  options?: TokenEstimationOptions
+): number
+
+interface TokenEstimationOptions {
+  /** Default average characters per token when no language-specific rule applies */
+  defaultCharsPerToken?: number
+  /** Custom language configurations to override defaults */
+  languageConfigs?: LanguageConfig[]
+}
+
+interface LanguageConfig {
+  /** Regular expression to detect the language */
+  pattern: RegExp
+  /** Average number of characters per token for this language */
+  averageCharsPerToken: number
+}
 ```
 
 ### `isWithinTokenLimit`
@@ -122,12 +122,18 @@ Checks if the estimated token count of the input is within a specified token lim
 
 ```ts
 const withinLimit = isWithinTokenLimit('Check this text against a limit', 100)
+// With custom options
+const customCheck = isWithinTokenLimit('Text', 50, { defaultCharsPerToken: 3 })
 ```
 
 **Type Declaration:**
 
 ```ts
-function isWithinTokenLimit(input: string, tokenLimit: number): boolean
+function isWithinTokenLimit(
+  text: string,
+  tokenLimit: number,
+  options?: TokenEstimationOptions
+): boolean
 ```
 
 ## License
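The README's new API boils down to a characters-per-token heuristic. The following is a minimal sketch of that idea under assumed defaults (4 characters per token when no language rule matches); it is an illustration, not the actual implementation shipped in this commit:

```typescript
// Illustrative re-implementation of the documented API surface.
// The library's real heuristics in this commit may be more involved.
interface LanguageConfig {
  /** Regular expression to detect the language */
  pattern: RegExp
  /** Average number of characters per token for this language */
  averageCharsPerToken: number
}

interface TokenEstimationOptions {
  defaultCharsPerToken?: number
  languageConfigs?: LanguageConfig[]
}

function estimateTokenCount(text = '', options: TokenEstimationOptions = {}): number {
  // Assumed default: roughly 4 characters per token, a common rule of thumb.
  const { defaultCharsPerToken = 4, languageConfigs = [] } = options
  if (text.length === 0)
    return 0
  // First matching language rule wins. Note that a /g-flagged RegExp is
  // stateful with .test(), so fresh patterns per call are safest.
  const rule = languageConfigs.find(config => config.pattern.test(text))
  const charsPerToken = rule?.averageCharsPerToken ?? defaultCharsPerToken
  return Math.ceil(text.length / charsPerToken)
}

function isWithinTokenLimit(text: string, tokenLimit: number, options?: TokenEstimationOptions): boolean {
  return estimateTokenCount(text, options) <= tokenLimit
}

console.log(estimateTokenCount('Hello, world!')) // 13 chars / 4 → 4
console.log(isWithinTokenLimit('Hello, world!', 3)) // false
```

A denser language (e.g. Chinese at ~1.5 characters per token, as in the README example) simply lowers the divisor, which is why the `languageConfigs` escape hatch matters for accuracy.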

build.config.ts

Lines changed: 0 additions & 10 deletions
This file was deleted.

package.json

Lines changed: 20 additions & 27 deletions
@@ -2,8 +2,8 @@
   "name": "tokenx",
   "type": "module",
   "version": "0.4.1",
-  "packageManager": "pnpm@9.14.4",
-  "description": "GPT token estimation and context size utilities without a full tokenizer",
+  "packageManager": "pnpm@10.11.0",
+  "description": "Fast and lightweight token estimation for any LLM without requiring a full tokenizer",
   "author": "Johann Schopplich <[email protected]>",
   "license": "MIT",
   "homepage": "https://github.com/johannschopplich/tokenx#readme",
@@ -16,50 +16,43 @@
   },
   "keywords": [
     "ai",
-    "gpt",
+    "llm",
     "token",
-    "tiktoken"
+    "tokenizer",
+    "estimation",
+    "tiktoken",
+    "anthropic",
+    "openai"
   ],
   "sideEffects": false,
   "exports": {
     ".": {
-      "types": "./dist/index.d.mts",
-      "import": {
-        "types": "./dist/index.d.mts",
-        "default": "./dist/index.mjs"
-      },
-      "require": {
-        "types": "./dist/index.d.cts",
-        "default": "./dist/index.cjs"
-      },
-      "default": "./dist/index.mjs"
+      "types": "./dist/index.d.ts",
+      "default": "./dist/index.js"
     }
   },
-  "main": "./dist/index.cjs",
-  "module": "./dist/index.mjs",
   "types": "./dist/index.d.ts",
   "files": [
     "dist"
   ],
   "scripts": {
-    "build": "unbuild",
+    "build": "tsdown",
     "docs:generate": "tsx scripts/generateTable.ts",
-    "dev": "unbuild --stub",
     "lint": "eslint .",
     "lint:fix": "eslint . --fix",
     "test": "vitest",
     "test:types": "tsc --noEmit",
     "release": "bumpp"
   },
   "devDependencies": {
-    "@antfu/eslint-config": "^3.11.2",
-    "@types/node": "^22.10.1",
-    "bumpp": "^9.8.1",
-    "eslint": "^9.15.0",
-    "gpt-tokenizer": "^2.7.0",
-    "tsx": "^4.19.2",
-    "typescript": "^5.7.2",
-    "unbuild": "^3.0.0-rc.11",
-    "vitest": "^2.1.6"
+    "@antfu/eslint-config": "^4.13.2",
+    "@types/node": "^22.15.29",
+    "bumpp": "^10.1.1",
+    "eslint": "^9.28.0",
+    "gpt-tokenizer": "^2.9.0",
+    "tsdown": "^0.12.6",
+    "tsx": "^4.19.4",
+    "typescript": "^5.8.3",
+    "vitest": "^3.2.0"
   }
 }
