Commit 3472081

docs: clarify CSV vs TOON use cases
1 parent cdb9058 commit 3472081

3 files changed: 59 additions and 49 deletions

README.md

Lines changed: 29 additions & 22 deletions
@@ -8,10 +8,12 @@
 [![npm downloads (total)](https://img.shields.io/npm/dt/@toon-format/toon.svg)](https://www.npmjs.com/package/@toon-format/toon)
 [![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](./LICENSE)

-**Token-Oriented Object Notation** is a compact, human-readable serialization format designed for passing structured data to Large Language Models with significantly reduced token usage. It's intended for LLM input, not output.
+**Token-Oriented Object Notation** is a compact, human-readable serialization format designed for passing structured data to Large Language Models with significantly reduced token usage. It's intended for *LLM input* as a lossless, drop-in representation of JSON data.

 TOON's sweet spot is **uniform arrays of objects** – multiple fields per row, same structure across items. It borrows YAML's indentation-based structure for nested objects and CSV's tabular format for uniform data rows, then optimizes both for token efficiency in LLM contexts. For deeply nested or non-uniform data, JSON may be more efficient.

+TOON achieves CSV-like compactness while adding explicit structure that helps LLMs parse and validate data reliably.
+
 > [!TIP]
 > Think of TOON as a translation layer: use JSON programmatically, convert to TOON for LLM input.

@@ -71,41 +73,48 @@ For small payloads, JSON/CSV/YAML work fine. TOON's value emerges at scale: when
 > [!TIP]
 > Try the interactive [Format Tokenization Playground](https://www.curiouslychase.com/playground/format-tokenization-exploration) to compare token usage across CSV, JSON, YAML, and TOON with your own data.

-The benchmarks test datasets that favor TOON's strengths (uniform tabular data). Real-world performance depends heavily on your data structure.
+### Token Efficiency
+
+Token counts are measured using the GPT-5 `o200k_base` tokenizer via [`gpt-tokenizer`](https://github.com/niieani/gpt-tokenizer). Savings are calculated against formatted JSON (2-space indentation) as the primary baseline, with additional comparisons to compact JSON (minified), YAML, and XML. Actual savings vary by model and tokenizer.
+
+The benchmarks use datasets optimized for TOON's strengths (uniform tabular data). Real-world performance depends on your data structure.
+
+> [!NOTE]
+> CSV/TSV isn't shown in the token-efficiency chart because it doesn't encode nesting without flattening. For flat datasets, see CSV token counts in the [Retrieval Accuracy](#retrieval-accuracy) tables.

 <!-- automd:file src="./benchmarks/results/token-efficiency.md" -->

 ### Token Efficiency

 ```
 ⭐ GitHub Repositories ██████████████░░░░░░░░░░░ 8,745 tokens
-vs JSON (-42.3%) 15,145
-vs JSON compact (-23.7%) 11,455
-vs YAML (-33.4%) 13,129
-vs XML (-48.8%) 17,095
+vs JSON (42.3%) 15,145
+vs JSON compact (23.7%) 11,455
+vs YAML (33.4%) 13,129
+vs XML (48.8%) 17,095

 📈 Daily Analytics ██████████░░░░░░░░░░░░░░░ 4,507 tokens
-vs JSON (-58.9%) 10,977
-vs JSON compact (-35.7%) 7,013
-vs YAML (-48.8%) 8,810
-vs XML (-65.7%) 13,128
+vs JSON (58.9%) 10,977
+vs JSON compact (35.7%) 7,013
+vs YAML (48.8%) 8,810
+vs XML (65.7%) 13,128

 🛒 E-Commerce Order ████████████████░░░░░░░░░ 166 tokens
-vs JSON (-35.4%) 257
-vs JSON compact (-2.9%) 171
-vs YAML (-15.7%) 197
-vs XML (-38.7%) 271
+vs JSON (35.4%) 257
+vs JSON compact (2.9%) 171
+vs YAML (15.7%) 197
+vs XML (38.7%) 271

 ─────────────────────────────────────────────────────────────────────
 Total ██████████████░░░░░░░░░░░ 13,418 tokens
-vs JSON (-49.1%) 26,379
-vs JSON compact (-28.0%) 18,639
-vs YAML (-39.4%) 22,136
-vs XML (-56.0%) 30,494
+vs JSON (49.1%) 26,379
+vs JSON compact (28.0%) 18,639
+vs YAML (39.4%) 22,136
+vs XML (56.0%) 30,494
 ```

 <details>
-<summary><strong>Show detailed examples</strong></summary>
+<summary><strong>View detailed examples</strong></summary>

 #### ⭐ GitHub Repositories

@@ -242,9 +251,6 @@ metrics[5]{date,views,clicks,conversions,revenue,bounceRate}:

 <!-- /automd -->

-> [!NOTE]
-> Token savings are measured against formatted JSON (2-space indentation) as the primary baseline. Additional comparisons include compact JSON (minified), YAML, and XML to provide a comprehensive view across common data formats. Measured with [`gpt-tokenizer`](https://github.com/niieani/gpt-tokenizer) using `o200k_base` encoding (GPT-5 tokenizer). Actual savings vary by model and tokenizer.
-
 <!-- automd:file src="./benchmarks/results/retrieval-accuracy.md" -->

 ### Retrieval Accuracy
@@ -909,6 +915,7 @@ By default, the decoder validates input strictly:
 - Format familiarity and structure matter as much as token count. TOON's tabular format requires arrays of objects with identical keys and primitive values only. When this doesn't hold (due to mixed types, non-uniform objects, or nested structures), TOON switches to list format where JSON can be more efficient at scale.
 - **TOON excels at:** Uniform arrays of objects (same fields, primitive values), especially large datasets with consistent structure.
 - **JSON is better for:** Non-uniform data, deeply nested structures, and objects with varying field sets.
+- **CSV is more compact for:** Flat, uniform tables without nesting. TOON adds minimal overhead (`[N]` length markers, delimiter scoping, deterministic quoting) to improve LLM reliability while staying close to CSV's token efficiency.
 - **Token counts vary by tokenizer and model.** Benchmarks use a GPT-style tokenizer (cl100k/o200k); actual savings will differ with other models (e.g., [SentencePiece](https://github.com/google/sentencepiece)).
 - **TOON is designed for LLM input** where human readability and token efficiency matter. It's **not** a drop-in replacement for JSON in APIs or storage.

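The CSV comparison added to the README can be illustrated with a small sketch. It renders the same flat rows as CSV and as a TOON-style tabular block, modeled on the `metrics[5]{date,views,...}:` header visible in the hunk context above. Both helpers (`toCsv`, `toToonTable`) and the sample rows are hypothetical. They skip quoting and escaping entirely and are not the library's encoder.

```typescript
type Primitive = string | number | boolean | null
type Row = Record<string, Primitive>

// Hypothetical helper: plain CSV with a header row (no quoting or escaping).
function toCsv(rows: Row[]): string {
  const fields = Object.keys(rows[0] ?? {})
  const lines = rows.map(row => fields.map(f => String(row[f])).join(','))
  return [fields.join(','), ...lines].join('\n')
}

// Hypothetical helper: TOON-style tabular block. The header carries the key,
// an explicit [N] row count, and the field list; rows are indented beneath it.
function toToonTable(key: string, rows: Row[]): string {
  const fields = Object.keys(rows[0] ?? {})
  const header = `${key}[${rows.length}]{${fields.join(',')}}:`
  const lines = rows.map(row => `  ${fields.map(f => String(row[f])).join(',')}`)
  return [header, ...lines].join('\n')
}

// Made-up sample data, for illustration only.
const sampleRows: Row[] = [
  { date: '2025-01-01', views: 120, clicks: 8 },
  { date: '2025-01-02', views: 95, clicks: 5 },
]

console.log(toCsv(sampleRows))
console.log(toToonTable('metrics', sampleRows))
```

In this sketch the only differences from CSV are the header decoration and the row indentation, which is the "minimal overhead" the new bullet refers to. The explicit row count gives a reader (or a model) something to check the data against.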
benchmarks/results/token-efficiency.md

Lines changed: 16 additions & 16 deletions
@@ -2,29 +2,29 @@

 ```
 ⭐ GitHub Repositories ██████████████░░░░░░░░░░░ 8,745 tokens
-vs JSON (-42.3%) 15,145
-vs JSON compact (-23.7%) 11,455
-vs YAML (-33.4%) 13,129
-vs XML (-48.8%) 17,095
+vs JSON (42.3%) 15,145
+vs JSON compact (23.7%) 11,455
+vs YAML (33.4%) 13,129
+vs XML (48.8%) 17,095

 📈 Daily Analytics ██████████░░░░░░░░░░░░░░░ 4,507 tokens
-vs JSON (-58.9%) 10,977
-vs JSON compact (-35.7%) 7,013
-vs YAML (-48.8%) 8,810
-vs XML (-65.7%) 13,128
+vs JSON (58.9%) 10,977
+vs JSON compact (35.7%) 7,013
+vs YAML (48.8%) 8,810
+vs XML (65.7%) 13,128

 🛒 E-Commerce Order ████████████████░░░░░░░░░ 166 tokens
-vs JSON (-35.4%) 257
-vs JSON compact (-2.9%) 171
-vs YAML (-15.7%) 197
-vs XML (-38.7%) 271
+vs JSON (35.4%) 257
+vs JSON compact (2.9%) 171
+vs YAML (15.7%) 197
+vs XML (38.7%) 271

 ─────────────────────────────────────────────────────────────────────
 Total ██████████████░░░░░░░░░░░ 13,418 tokens
-vs JSON (-49.1%) 26,379
-vs JSON compact (-28.0%) 18,639
-vs YAML (-39.4%) 22,136
-vs XML (-56.0%) 30,494
+vs JSON (49.1%) 26,379
+vs JSON compact (28.0%) 18,639
+vs YAML (39.4%) 22,136
+vs XML (56.0%) 30,494
 ```

 <details>

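The percentages in the chart above follow directly from the token counts. A minimal sketch of the arithmetic, using the same `(savings / tokens) * 100` formula the benchmark script applies, is shown here. The `progressBar` helper is hypothetical. Its 25-character width and its rounding are assumptions and may not match the script's own `createProgressBar`.

```typescript
// Percentage of tokens saved by TOON relative to a baseline format.
function savingsPercent(baselineTokens: number, toonTokens: number): number {
  return ((baselineTokens - toonTokens) / baselineTokens) * 100
}

// Hypothetical bar renderer (width and rounding are assumptions).
// The filled part represents value / max, clamped to the bar width.
function progressBar(value: number, max: number, width = 25): string {
  const filled = Math.min(width, Math.max(0, Math.round((value / max) * width)))
  return '█'.repeat(filled) + '░'.repeat(width - filled)
}

// GitHub Repositories: 8,745 TOON tokens vs 15,145 formatted-JSON tokens.
const githubVsJson = savingsPercent(15145, 8745)
console.log(`${githubVsJson.toFixed(1)}% saved vs JSON`)

// The bar is inverted: it shows TOON's share of the baseline, not the savings.
console.log(progressBar(100 - githubVsJson, 100))
```

Keeping the percentage as a number until the last moment, as the commit does, means `toFixed(1)` is applied only where a string is actually rendered.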
benchmarks/scripts/token-efficiency-benchmark.ts

Lines changed: 14 additions & 11 deletions
@@ -12,7 +12,7 @@ interface FormatMetrics {
   name: string
   tokens: number
   savings: number
-  savingsPercent: string
+  savingsPercent: number
 }

 interface BenchmarkResult {
@@ -75,7 +75,7 @@ for (const example of BENCHMARK_EXAMPLES) {
       name: formatName,
       tokens,
       savings,
-      savingsPercent: formatName === 'toon' ? '0.0' : ((savings / tokens) * 100).toFixed(1),
+      savingsPercent: formatName === 'toon' ? 0 : (savings / tokens) * 100,
     })
   }

@@ -91,14 +91,14 @@ for (const example of BENCHMARK_EXAMPLES) {

 // Calculate total savings percentages
 const totalToonTokens = totalTokensByFormat.toon!
-const totalSavingsPercent: Record<string, string> = {}
+const totalSavingsPercent: Record<string, number> = {}
 for (const [formatName, totalTokens] of Object.entries(totalTokensByFormat)) {
   if (formatName === 'toon') {
-    totalSavingsPercent[formatName] = '0.0'
+    totalSavingsPercent[formatName] = 0
   }
   else {
     const savings = totalTokens - totalToonTokens
-    totalSavingsPercent[formatName] = ((savings / totalTokens) * 100).toFixed(1)
+    totalSavingsPercent[formatName] = (savings / totalTokens) * 100
   }
 }

@@ -107,7 +107,7 @@ const formatOrder = ['json-pretty', 'json-compact', 'yaml', 'xml']
 const datasetRows = results
   .map((result) => {
     const toon = result.formats.find(f => f.name === 'toon')!
-    const percentage = Number.parseFloat(result.formats.find(f => f.name === 'json-pretty')!.savingsPercent)
+    const percentage = result.formats.find(f => f.name === 'json-pretty')!.savingsPercent
     const bar = createProgressBar(100 - percentage, 100) // Invert to show TOON tokens
     const toonStr = toon.tokens.toLocaleString('en-US')

@@ -116,7 +116,10 @@ const datasetRows = results
     const comparisonLines = formatOrder.map((formatName) => {
       const format = result.formats.find(f => f.name === formatName)!
       const label = FORMATTER_DISPLAY_NAMES[formatName] || formatName.toUpperCase()
-      const labelWithSavings = `vs ${label} (-${format.savingsPercent}%)`.padEnd(27)
+      const signedPercent = format.savingsPercent >= 0
+        ? `−${format.savingsPercent.toFixed(1)}%`
+        : `+${Math.abs(format.savingsPercent).toFixed(1)}%`
+      const labelWithSavings = `vs ${label} (${signedPercent})`.padEnd(27)
       const tokenStr = format.tokens.toLocaleString('en-US').padStart(6)
       return ` ${labelWithSavings}${tokenStr}`
     })
@@ -140,7 +143,8 @@ const totalComparisonLines = formatOrder.map((formatName) => {
   const label = FORMATTER_DISPLAY_NAMES[formatName] || formatName.toUpperCase()
   const tokens = totalTokensByFormat[formatName]!
   const percent = totalSavingsPercent[formatName]!
-  const labelWithSavings = `vs ${label} (-${percent}%)`.padEnd(27)
+  const signedPercent = percent >= 0 ? `−${percent.toFixed(1)}%` : `+${Math.abs(percent).toFixed(1)}%`
+  const labelWithSavings = `vs ${label} (${signedPercent})`.padEnd(27)
   const tokenStr = tokens.toLocaleString('en-US').padStart(6)
   return ` ${labelWithSavings}${tokenStr}`
 })
@@ -176,7 +180,7 @@ const detailedExamples = results

 **Configuration:** ${result.description}

-**Savings:** ${json.savings.toLocaleString('en-US')} tokens (${json.savingsPercent}% reduction vs JSON)
+**Savings:** ${json.savings.toLocaleString('en-US')} tokens (${json.savingsPercent.toFixed(1)}% reduction vs JSON)

 **JSON** (${json.tokens.toLocaleString('en-US')} tokens):

@@ -192,8 +196,7 @@ ${encode(displayData)}
   })
   .join('\n\n')

-const markdown = `### Token Efficiency
-
+const markdown = `
 \`\`\`
 ${barChartSection}
 \`\`\`

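The signed-percent expression appears twice in this diff, once for per-dataset rows and once for totals. One way to factor it out is sketched below. `formatSignedPercent` and `comparisonLabel` are hypothetical names that do not appear in the commit. The behavior mirrors the inline expressions: non-negative savings render with a Unicode minus sign (U+2212, because TOON uses fewer tokens) and negative savings render with a plus sign.

```typescript
// Hypothetical helper mirroring the inline expressions in the diff above.
// savingsPercent >= 0 means TOON used fewer tokens, shown as a reduction.
function formatSignedPercent(savingsPercent: number): string {
  const magnitude = Math.abs(savingsPercent).toFixed(1)
  return savingsPercent >= 0 ? `\u2212${magnitude}%` : `+${magnitude}%`
}

// Hypothetical helper: builds the left-hand label column, padded to the
// 27-character width used by the script.
function comparisonLabel(label: string, savingsPercent: number): string {
  return `vs ${label} (${formatSignedPercent(savingsPercent)})`.padEnd(27)
}

console.log(comparisonLabel('JSON', 42.258) + '15,145')
console.log(comparisonLabel('CSV', -3.04) + '160')
```

The second call uses made-up numbers to show the case the old `(-${percent}%)` template could not express: a format that is smaller than TOON would previously have printed a double negative.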