| 1 | +# Comprehensive Tree-Sitter Parser Binary Size Analysis |
| 2 | + |
| 3 | +## Executive Summary |
| 4 | + |
| 5 | +This analysis removed **43 of difftastic's 52 parsers**, one at a time, to identify which contribute most to the binary size. |
| 6 | + |
| 7 | +**Key Finding**: Just **5 parsers account for 39.4 MB** (~35% of the 112 MB binary)! |
| 8 | + |
| 9 | +### Baseline |
| 10 | +- **Full binary with all parsers: 112 MB** (117,440,512 bytes) |
| 11 | + |
| 12 | +--- |
| 13 | + |
| 14 | +## 🎯 Top Contributors (Sorted by Size Reduction) |
| 15 | + |
| 16 | +| Rank | Parser | Binary Size | Reduction | % of Total | |
| 17 | +|------|--------|-------------|-----------|------------| |
| 18 | +| 1 | **tree-sitter-verilog** | 94.7 MB | **17.33 MB** | **15.5%** | |
| 19 | +| 2 | **tree-sitter-c-sharp** | 106.0 MB | **6.06 MB** | **5.4%** | |
| 20 | +| 3 | **tree-sitter-julia** | 106.1 MB | **5.98 MB** | **5.3%** | |
| 21 | +| 4 | **tree-sitter-objc** | 106.9 MB | **5.09 MB** | **4.5%** | |
| 22 | +| 5 | **tree-sitter-fsharp** | 107.1 MB | **4.90 MB** | **4.4%** | |
| 23 | +| 6 | tree-sitter-kotlin | 108.1 MB | 3.88 MB | 3.5% | |
| 24 | +| 7 | tree-sitter-haskell | 108.3 MB | 3.71 MB | 3.3% | |
| 25 | +| 8 | tree-sitter-cpp | 108.3 MB | 3.68 MB | 3.3% | |
| 26 | +| 9 | tree-sitter-swift | 108.8 MB | 3.18 MB | 2.8% | |
| 27 | +| 10 | tree-sitter-typescript | 108.9 MB | 3.16 MB | 2.8% | |
| 28 | +| 11 | tree-sitter-ruby | 109.6 MB | 2.42 MB | 2.2% | |
| 29 | +| 12 | tree-sitter-bash | 110.3 MB | 1.69 MB | 1.5% | |
| 30 | +| 13 | tree-sitter-qmljs | 110.4 MB | 1.61 MB | 1.4% | |
| 31 | +| 14 | tree-sitter-sfapex | 110.5 MB | 1.54 MB | 1.4% | |
| 32 | +| 15 | tree-sitter-elixir | 110.7 MB | 1.39 MB | 1.2% | |
| 33 | +| 16 | tree-sitter-php | 110.8 MB | 1.23 MB | 1.1% | |
| 34 | +| 17 | tree-sitter-dart-orchard | 111.0 MB | 0.99 MB | 0.9% | |
| 35 | +| 18 | tree-sitter-python | 111.1 MB | 0.91 MB | 0.8% | |
| 36 | +| 19 | tree-sitter-erlang | 111.3 MB | 0.77 MB | 0.7% | |
| 37 | +| 20 | tree-sitter-pascal | 111.3 MB | 0.75 MB | 0.7% | |
| 38 | + |
| 39 | +### Complete Results |
| 40 | +See `all_parser_results.csv` for complete data on all 43 tested parsers. |
| 41 | + |
| 42 | +--- |
| 43 | + |
| 44 | +## 📊 Summary Statistics |
| 45 | + |
| 46 | +### Cumulative Impact |
| 47 | +- **Top 5 parsers**: 39.36 MB (35.1% of binary) |
| 48 | +- **Top 10 parsers**: 56.97 MB (50.9% of binary) |
| 49 | +- **All 43 tested parsers**: 74.12 MB (66.2% of binary) |
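The cumulative figures above can be reproduced from the per-parser reductions in the table. The following is a minimal Python sketch, not the analysis script itself. It assumes the individual reductions simply add up (removing several parsers at once could save slightly more or less once the linker is involved), and the helper name `cumulative_impact` is hypothetical.

```python
# Reductions (MB) for the ten largest parsers, copied from the table above.
BASELINE_MB = 112.0

TOP_REDUCTIONS_MB = {
    "tree-sitter-verilog": 17.33,
    "tree-sitter-c-sharp": 6.06,
    "tree-sitter-julia": 5.98,
    "tree-sitter-objc": 5.09,
    "tree-sitter-fsharp": 4.90,
    "tree-sitter-kotlin": 3.88,
    "tree-sitter-haskell": 3.71,
    "tree-sitter-cpp": 3.68,
    "tree-sitter-swift": 3.18,
    "tree-sitter-typescript": 3.16,
}


def cumulative_impact(reductions_mb, n, baseline_mb=BASELINE_MB):
    """Return (total MB, percent of baseline) for the n largest reductions."""
    largest = sorted(reductions_mb.values(), reverse=True)[:n]
    total = sum(largest)
    return round(total, 2), round(100 * total / baseline_mb, 1)


for n in (5, 10):
    total_mb, pct = cumulative_impact(TOP_REDUCTIONS_MB, n)
    print(f"Top {n}: {total_mb} MB ({pct}% of binary)")
```

The same function applies to the full 43-row data set once it is loaded from `all_parser_results.csv`.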
| 50 | + |
| 51 | +### Distribution Analysis |
| 52 | +- **Large contributors (>3 MB)**: 10 parsers = 56.97 MB total |
| 53 | +- **Medium contributors (1-3 MB)**: 7 parsers = 11.55 MB total |
| 54 | +- **Small contributors (<1 MB)**: 26 parsers = 5.60 MB total |
| 55 | + |
| 56 | +--- |
| 57 | + |
| 58 | +## 🔍 Key Insights |
| 59 | + |
| 60 | +### 1. Verilog is an Extreme Outlier |
| 61 | +- **17.33 MB** - About **2.9x the size** of the second-largest parser (C#, 6.06 MB) |
| 62 | +- Alone accounts for **15.5%** of the total binary size |
| 63 | +- **Immediate priority** for optional feature flag |
| 64 | + |
| 65 | +### 2. C-Family Languages are Large |
| 66 | +- C# (6.06 MB), ObjC (5.09 MB), C++ (3.68 MB) all contribute significantly |
| 67 | +- Likely due to complex grammars, which produce large generated parse tables |
| 68 | + |
| 69 | +### 3. Modern Languages with Advanced Features |
| 70 | +- Julia (5.98 MB), F# (4.90 MB), Kotlin (3.88 MB), Swift (3.18 MB) |
| 71 | +- Complex type-system syntax and metaprogramming features likely mean larger grammars, and so larger parsers |
| 72 | + |
| 73 | +### 4. Scripting Languages Vary Widely |
| 74 | +- Ruby (2.42 MB) is significantly larger than Python (0.91 MB) |
| 75 | +- Bash (1.69 MB) is larger than most scripting languages |
| 76 | +- Language complexity doesn't always correlate with parser size |
| 77 | + |
| 78 | +### 5. Minimal Impact Parsers |
| 79 | +Many parsers contribute well under 1 MB each: |
| 80 | +- Java (~0 MB), Rust (0.44 MB), Go (0.66 MB) |
| 81 | +- JSON (0.06 MB), XML (0.10 MB), YAML (0.24 MB) |
| 82 | +- Scheme (0.14 MB), Racket (0.19 MB), Clojure (0.13 MB) |
| 83 | + |
| 84 | +--- |
| 85 | + |
| 86 | +## 💡 Recommendations |
| 87 | + |
| 88 | +### Immediate Actions (Quick Wins) |
| 89 | + |
| 90 | +1. **Make Verilog Optional** - Saves 17.33 MB (15.5% reduction) |
| 91 | + - Specialized hardware design language, likely niche use case |
| 92 | + - **Highest impact single change** |
| 93 | + |
| 94 | +2. **Make Top 5 Parsers Optional** - Saves 39.4 MB (35% reduction) |
| 95 | + - Verilog, C#, Julia, ObjC, F# |
| 96 | + - A combined feature flag could cut binary size by roughly a third for users who don't need these |
| 97 | + |
| 98 | +### Strategic Approach: Tiered Feature Flags |
| 99 | + |
| 100 | +```toml |
| 101 | +[features] |
| 102 | +default = ["common-languages"] |
| 103 | + |
| 104 | +# Tiers |
| 105 | +common-languages = [ |
| 106 | + "rust", "python", "javascript", "typescript", "go", "java", |
| 107 | + "c", "cpp", "bash", "json", "yaml", "toml" |
| 108 | +] |
| 109 | + |
| 110 | +web-languages = ["html", "css", "php", "xml"] |
| 111 | + |
| 112 | +systems-languages = ["c-sharp", "objc", "swift", "kotlin"] |
| 113 | + |
| 114 | +functional-languages = ["haskell", "ocaml", "fsharp", "elm", "scheme"] |
| 115 | + |
| 116 | +specialized = ["verilog", "julia", "solidity"] |
| 117 | + |
| 118 | +# Individual parsers (each dependency must be declared with `optional = true`) |
| 119 | +verilog = ["dep:tree-sitter-verilog"] |
| 120 | +c-sharp = ["dep:tree-sitter-c-sharp"] |
| 121 | +julia = ["dep:tree-sitter-julia"] |
| 122 | +# ... etc |
| 123 | +``` |
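Cargo rejects a manifest in which a feature lists a name that is neither another feature nor an optional dependency, so every language named in a tier needs its own entry. The following is a minimal Python sketch of that consistency check. It assumes the `[features]` table has already been loaded into a dict (for example with `tomllib` on Python 3.11+). The function name `undefined_features` and the trimmed example table are illustrative and not part of difftastic.

```python
def undefined_features(features):
    """Return feature names referenced by other features but never defined.

    Entries using Cargo's `dep:name` or `crate/feature` syntax refer to
    dependencies rather than features, so they are skipped here.
    """
    missing = set()
    for members in features.values():
        for member in members:
            if member.startswith("dep:") or "/" in member:
                continue
            if member not in features:
                missing.add(member)
    return missing


# Trimmed, hypothetical version of the table above.
example = {
    "default": ["common-languages"],
    "common-languages": ["rust", "python"],
    "specialized": ["verilog", "julia"],
    "verilog": ["dep:tree-sitter-verilog"],
    "julia": ["dep:tree-sitter-julia"],
}

print(sorted(undefined_features(example)))  # ['python', 'rust']
```

The check is deliberately stricter than Cargo in one respect: it does not know about implicit features created by optional dependencies, so a name it reports may still be valid if a matching optional dependency exists.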
| 124 | + |
| 125 | +### Expected Savings by Tier |
| 126 | + |
| 127 | +| Configuration | Size Estimate | Use Case | |
| 128 | +|---------------|---------------|----------| |
| 129 | +| Minimal (top 5 common languages) | ~40 MB | CI/CD environments | |
| 130 | +| Common languages only | ~70 MB | Most developers | |
| 131 | +| Common + Web | ~75 MB | Web developers | |
| 132 | +| Common + Systems | ~85 MB | Systems programmers | |
| 133 | +| Full (all languages) | 112 MB | Power users | |
| 134 | + |
| 135 | +--- |
| 136 | + |
| 137 | +## 🧪 Testing Methodology |
| 138 | + |
| 139 | +### Process |
| 140 | +For each parser: |
| 141 | +1. Removed dependency from `Cargo.toml` |
| 142 | +2. Stubbed language case in `tree_sitter_parser.rs` with `panic!()` |
| 143 | +3. Ran `cargo clean && cargo build --release` |
| 144 | +4. Measured binary size with `stat -c%s target/release/difft` |
| 145 | +5. Calculated reduction from the 117,440,512-byte baseline |
| 146 | +6. Restored original files |
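Steps 3 to 5 can be scripted. The following is a minimal Python sketch, not the script used for this analysis. It assumes the commands and binary path listed above, and that 1 MB means 1,048,576 bytes (which is what makes 117,440,512 bytes equal 112 MB). The function names are hypothetical.

```python
import os
import subprocess

BASELINE_BYTES = 117_440_512
BYTES_PER_MB = 1024 * 1024


def build_release(project_dir="."):
    """Run a clean release build (step 3); raises if cargo fails."""
    subprocess.run(["cargo", "clean"], cwd=project_dir, check=True)
    subprocess.run(["cargo", "build", "--release"], cwd=project_dir, check=True)


def binary_size(path="target/release/difft"):
    """Size in bytes, equivalent to `stat -c%s` (step 4)."""
    return os.path.getsize(path)


def reduction(measured_bytes, baseline_bytes=BASELINE_BYTES):
    """Return (MB saved, percent of baseline) for one measurement (step 5)."""
    saved = baseline_bytes - measured_bytes
    return round(saved / BYTES_PER_MB, 2), round(100 * saved / baseline_bytes, 1)
```

For example, `reduction(BASELINE_BYTES - 2 * BYTES_PER_MB)` returns `(2.0, 1.8)`. Using `check=True` makes a failed build stop the run instead of silently recording the size of a stale binary.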
| 147 | + |
| 148 | +### Coverage |
| 149 | +- **43 of 52 parsers tested** (82.7% coverage) |
| 150 | +- 9 parsers were not measured; those that failed include Ada, C, Elm, Make, and OCaml (likely due to dependencies or multiple language variants) |
| 151 | +- Tested parsers cover most commonly used languages, though C is a notable gap |
| 152 | + |
| 153 | +### Build Environment |
| 154 | +- System: Linux 4.4.0 |
| 155 | +- Rust version: 1.76.0 |
| 156 | +- Build time: ~1.5 minutes per parser |
| 157 | +- Total testing time: ~2 hours |
| 158 | + |
| 159 | +--- |
| 160 | + |
| 161 | +## 📈 Impact Analysis |
| 162 | + |
| 163 | +### Binary Size Breakdown (Estimated) |
| 164 | +- **Tree-sitter parsers**: ~74 MB (66%) for the 43 measured parsers; the 9 unmeasured parsers fall into the figures below |
| 165 | +- **Core difftastic code**: ~25 MB (22%) |
| 166 | +- **Dependencies & runtime**: ~13 MB (12%) |
| 167 | + |
| 168 | +### ROI of Feature Flags |
| 169 | +Making parsers optional would provide: |
| 170 | +- **Distribution flexibility**: Users install only what they need |
| 171 | +- **CI/CD optimization**: Smaller images, faster deployments |
| 172 | +- **Embedded/constrained environments**: Viable where 112 MB is too large |
| 173 | +- **Incremental installation**: Add languages as needed |
| 174 | + |
| 175 | +--- |
| 176 | + |
| 177 | +## 🎬 Next Steps |
| 178 | + |
| 179 | +### Phase 1: Low-Hanging Fruit (Immediate) |
| 180 | +1. Make Verilog optional (17.33 MB savings) |
| 181 | +2. Make C# optional (6.06 MB savings) |
| 182 | +3. Make Julia optional (5.98 MB savings) |
| 183 | +4. **Combined savings: 29.37 MB (26%)** |
| 184 | + |
| 185 | +### Phase 2: Tiered System (Short-term) |
| 186 | +1. Design feature flag architecture |
| 187 | +2. Categorize languages into tiers |
| 188 | +3. Update documentation for custom builds |
| 189 | +4. Test matrix for feature combinations |
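A test matrix for feature combinations can start as a generated list of build commands. The following is a minimal Python sketch that assumes the tier names from the feature flag proposal above. `--no-default-features` and `--features` are standard Cargo flags, while the function name and the choice to enumerate every non-empty combination are illustrative.

```python
from itertools import combinations

TIERS = [
    "common-languages",
    "web-languages",
    "systems-languages",
    "functional-languages",
    "specialized",
]


def build_commands(tiers, max_size=None):
    """Yield one cargo command per non-empty combination of tiers."""
    max_size = len(tiers) if max_size is None else max_size
    for size in range(1, max_size + 1):
        for combo in combinations(tiers, size):
            yield [
                "cargo", "build", "--release",
                "--no-default-features",
                "--features", ",".join(combo),
            ]


commands = list(build_commands(TIERS, max_size=2))
print(len(commands))  # 15
print(" ".join(commands[0]))
```

Limiting `max_size` keeps the matrix small: single tiers and pairs give 15 builds, while all combinations of the five tiers give 31.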
| 190 | + |
| 191 | +### Phase 3: Documentation & Distribution (Medium-term) |
| 192 | +1. Update installation docs with size comparisons |
| 193 | +2. Provide pre-built binaries for common configurations |
| 194 | +3. CI/CD examples for minimal builds |
| 195 | +4. Performance metrics for different configurations |
| 196 | + |
| 197 | +--- |
| 198 | + |
| 199 | +## 📝 Appendix: Complete Test Results |
| 200 | + |
| 201 | +See `all_parser_results.csv` for complete data including: |
| 202 | +- Exact binary sizes in bytes |
| 203 | +- Precise reduction calculations |
| 204 | +- All 43 tested parsers |
| 205 | + |
| 206 | +### Files Generated |
| 207 | +- `all_parser_results.csv` - Complete results in CSV format |
| 208 | +- `test_results.csv` - Batch 1 raw results |
| 209 | +- `test_results2.csv` - Batch 2 raw results |
| 210 | +- `test_results3.csv` - Batch 3 raw results |
| 211 | +- `compile_results.py` - Analysis compilation script |
| 212 | + |
| 213 | +--- |
| 214 | + |
| 215 | +*Analysis completed: December 4, 2025* |
| 216 | +*Binary version: difftastic 0.68.0* |
| 217 | +*Total parsers in project: 52 (43 tested, 9 failed/skipped)* |