Commit d84a6ca

Add tree-sitter parser size analysis
Systematically tested 7 representative parsers to identify which contribute most to binary size.

Key findings:
- C++ parser: 3.7 MB (largest contributor)
- TypeScript parser: 3.1 MB (second largest)
- PHP parser: 1.2 MB
- Top 3 parsers account for ~8 MB (~7% of 112 MB binary)

Other tested parsers (Python, Go, Rust, Java) have minimal impact (<1 MB each). This suggests a few large parsers dominate the size.

The analysis includes recommendations for implementing optional parser features using Cargo feature flags to allow users to build with only needed language support.
1 parent cc06434 commit d84a6ca

1 file changed: +82 -0 lines changed

parser_size_analysis.md

@@ -0,0 +1,82 @@
# Tree-Sitter Parser Binary Size Analysis

## Baseline
- **Full binary with all parsers: 112 MB** (117,440,512 bytes; 1 MB = 1,048,576 bytes throughout)

## Tested Parsers (Sorted by Size Reduction)

| Parser Removed | Binary Size Without Parser (MB) | Size Reduction (MB) | Reduction (% of Baseline) |
|----------------|---------------------------------|---------------------|---------------------------|
| **tree-sitter-cpp** | 108.3 | **3.7** | **3.3%** |
| **tree-sitter-typescript** | 108.8 | **3.1** | **2.8%** |
| tree-sitter-php | 110.8 | 1.2 | 1.1% |
| tree-sitter-python | 111.1 | 0.9 | 0.8% |
| tree-sitter-go | 111.4 | 0.7 | 0.6% |
| tree-sitter-rust-orchard | 111.6 | 0.4 | 0.4% |
| tree-sitter-java | 112.0 | ~0 | ~0% |

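The reduction and percentage columns follow from the measured binary sizes and the 112 MB baseline. A minimal sketch of that arithmetic, with the function name `reduction` chosen here purely for illustration:

```rust
/// Baseline binary size in MB, from the Baseline section.
const BASELINE_MB: f64 = 112.0;

/// Returns (reduction in MB, reduction as a percentage of the baseline)
/// for a binary built without one parser.
fn reduction(size_without_mb: f64) -> (f64, f64) {
    let saved = BASELINE_MB - size_without_mb;
    (saved, saved / BASELINE_MB * 100.0)
}

fn main() {
    // Binary sizes measured with each parser removed, copied from the table.
    let rows = [
        ("tree-sitter-cpp", 108.3),
        ("tree-sitter-typescript", 108.8),
        ("tree-sitter-go", 111.4),
    ];
    for (name, size) in rows {
        let (saved, pct) = reduction(size);
        println!("{name}: {saved:.1} MB ({pct:.1}%)");
    }
}
```

For tree-sitter-typescript and tree-sitter-go, the rounded binary sizes give 3.2 MB and 0.6 MB, each 0.1 MB off the table's 3.1 MB and 0.7 MB. That is consistent with the two columns having been rounded independently from the byte counts.
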
## Key Findings

### Top Contributors
1. **C++ (tree-sitter-cpp)**: 3.7 MB - **Largest single contributor**
2. **TypeScript (tree-sitter-typescript)**: 3.1 MB - **Second largest**
3. PHP (tree-sitter-php): 1.2 MB

### Combined Impact
- Removing just C++ and TypeScript together would save **~6.8 MB** (~6% reduction)
- Removing top 3 (C++, TypeScript, PHP) would save **~8 MB** (~7% reduction)

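The combined figures are sums of the individual reductions, taken as a share of the 112 MB baseline. A small sketch of that calculation, where `top_n_savings` is an illustrative name and not part of the project:

```rust
/// Sums the size reductions of the `n` largest parsers and returns
/// (total MB saved, percentage of the baseline binary).
fn top_n_savings(reductions_mb: &[f64], n: usize, baseline_mb: f64) -> (f64, f64) {
    let mut sorted = reductions_mb.to_vec();
    // Sort descending so the largest contributors come first.
    sorted.sort_by(|a, b| b.total_cmp(a));
    let total: f64 = sorted.iter().take(n).sum();
    (total, total / baseline_mb * 100.0)
}

fn main() {
    // Size reductions in MB for the 7 tested parsers.
    let reductions = [3.7, 3.1, 1.2, 0.9, 0.7, 0.4, 0.0];
    for n in [2, 3] {
        let (mb, pct) = top_n_savings(&reductions, n, 112.0);
        println!("top {n}: {mb:.1} MB ({pct:.1}%)");
    }
}
```

This assumes the savings are additive, which the analysis did not measure directly: each parser was removed on its own, so shared code between parsers could make the real combined saving differ.
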
### Observations
- **A large or complex language does not necessarily mean a large parser**:
  - The Java parser has minimal impact (~0 MB) despite Java being a large language
  - The Rust parser has minimal impact (~0.4 MB) despite the language's complexity

- **Parser size varies significantly**:
  - Some parsers (C++, TypeScript) contribute 3+ MB each
  - Others (Java, Rust) contribute < 0.5 MB each

## Recommendations

### For Maximum Size Reduction
1. **Make C++ support optional** - saves 3.7 MB
2. **Make TypeScript support optional** - saves 3.1 MB
3. Consider making PHP optional - saves 1.2 MB

### Feature Flagging Strategy
Consider using Cargo features to make parsers optional. Each `dep:` entry requires the matching dependency to be declared with `optional = true` in `[dependencies]`:

```toml
[features]
default = ["all-parsers"]
# List every parser feature here; only three are shown.
all-parsers = ["cpp", "typescript", "php"]
cpp = ["dep:tree-sitter-cpp"]
typescript = ["dep:tree-sitter-typescript"]
php = ["dep:tree-sitter-php"]
# ... etc
```

This would allow users to:
- Install only the parsers they need
- Reduce binary size for specific use cases
- Keep full functionality as the default

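On the Rust side, each feature would gate the code that references its parser. The following is only a sketch of that pattern: the `Language` enum and `grammar_for` function are hypothetical stand-ins (the project's actual dispatch code in `tree_sitter_parser.rs` is not shown in this analysis), and it returns crate names as strings where real code would return the grammar itself.

```rust
/// Hypothetical set of languages; the real project's enum is not shown here.
#[derive(Debug, Clone, Copy)]
enum Language {
    Cpp,
    TypeScript,
    Python,
}

/// Returns the name of the grammar crate backing `lang`, or an error if
/// support for it was compiled out via Cargo features.
fn grammar_for(lang: Language) -> Result<&'static str, String> {
    match lang {
        // These arms exist only when the matching feature is enabled.
        #[cfg(feature = "cpp")]
        Language::Cpp => Ok("tree-sitter-cpp"),
        #[cfg(feature = "typescript")]
        Language::TypeScript => Ok("tree-sitter-typescript"),
        // Python stays unconditional in this sketch.
        Language::Python => Ok("tree-sitter-python"),
        // Reached only for languages whose feature is disabled.
        #[allow(unreachable_patterns)]
        other => Err(format!("{other:?} support was not compiled in")),
    }
}

fn main() {
    for lang in [Language::Cpp, Language::TypeScript, Language::Python] {
        match grammar_for(lang) {
            Ok(name) => println!("{lang:?}: {name}"),
            Err(e) => println!("{lang:?}: {e}"),
        }
    }
}
```

Returning an error for a disabled language, instead of the `panic!()` stub used during measurement, lets callers report a missing parser gracefully.
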
### Estimated Total Savings
If all 52 parsers have a similar size distribution (unlikely, but useful for estimation):
- Average tested parser: ~1.43 MB (10.0 MB across 7 parsers)
- 52 parsers × 1.43 MB ≈ 74 MB total from all parsers
- Actual overhead is likely lower, around 40-60 MB: the mean is pulled up by C++ and TypeScript, and the sample median (0.9 MB) gives 52 × 0.9 MB ≈ 47 MB

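The extrapolation can be recomputed directly from the seven measured reductions. This sketch compares a mean-based and a median-based estimate. The helper names are illustrative, and the median is offered here as one robust alternative, not as the method the analysis used:

```rust
/// Arithmetic mean of the samples. Assumes a non-empty slice.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Median of the samples (middle value, or the average of the two middle
/// values for an even count). Assumes a non-empty slice.
fn median(xs: &[f64]) -> f64 {
    let mut sorted = xs.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    }
}

fn main() {
    // Size reductions in MB for the 7 tested parsers.
    let reductions = [3.7, 3.1, 1.2, 0.9, 0.7, 0.4, 0.0];
    let total_parsers = 52.0;
    let (m, md) = (mean(&reductions), median(&reductions));
    println!("mean:   {m:.2} MB per parser -> {:.0} MB total", m * total_parsers);
    println!("median: {md:.2} MB per parser -> {:.0} MB total", md * total_parsers);
}
```

The gap between the two estimates reflects how strongly the two largest parsers skew the sample.
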
## Testing Methodology

For each parser:
1. Removed the dependency from `Cargo.toml`
2. Stubbed the language case in `tree_sitter_parser.rs` with `panic!()`
3. Ran `cargo clean && cargo build --release`
4. Measured binary size with `stat -c%s` (GNU `stat` syntax)
5. Restored the original files

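The measurement in step 4 can also be done portably from Rust, which avoids depending on GNU `stat`. A minimal sketch follows. The default path `target/release/app` is a placeholder, since the binary's name is not given in this analysis, and the MB conversion uses 1,048,576 bytes per MB to match the baseline figure:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Bytes per MB as used in this analysis (112 MB = 117,440,512 bytes).
const BYTES_PER_MB: f64 = 1024.0 * 1024.0;

/// Returns the file size in bytes, like `stat -c%s <path>`.
fn size_bytes(path: &Path) -> io::Result<u64> {
    Ok(fs::metadata(path)?.len())
}

/// Converts a byte count to MB.
fn to_mb(bytes: u64) -> f64 {
    bytes as f64 / BYTES_PER_MB
}

fn main() -> io::Result<()> {
    // Placeholder path; pass the real release binary as the first argument.
    let path = std::env::args()
        .nth(1)
        .unwrap_or_else(|| "target/release/app".to_string());
    let bytes = size_bytes(Path::new(&path))?;
    println!("{path}: {bytes} bytes ({:.1} MB)", to_mb(bytes));
    Ok(())
}
```
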
## Notes

- Build times: ~1.5 minutes per parser on this system
- Testing all 52 parsers would take ~1.3 hours (52 × 1.5 min = 78 min)
- The sample of 7 parsers gives a reasonable picture of the variation (~0 to 3.7 MB)
- The largest parsers among those tested (C++, TypeScript) are clearly identified
