Where Python Broke at a Trillion Tokens: Rewriting Our Tokenizer in Rust

### Talk title

Where Python Broke at a Trillion Tokens: Rewriting Our Tokenizer in Rust

### Short talk description

Tokenization is the most-used, least-examined line of code in any LLM pipeline. At BharatGen, building a trillion-token pretraining corpus across 22 Indian languages, our Python tokenizer quietly became the bottleneck — and not where we expected it to be.

This talk is a field report from rewriting that tokenizer in Rust. We'll start with something most Python developers never inspect: what your tokenizer does under the hood, and why Indic scripts make it expensive. Then the real story: profiling to find where the time goes (rarely the obvious place), deciding when crossing into Rust is worth it and when it isn't, and what a ~3× speedup cost in engineering effort.

You'll leave able to tell when Python is genuinely your bottleneck — and when it isn't.

### Long talk description

Every LLM pipeline runs on tokenization, yet almost no one profiles it. It's treated as a solved, invisible step — until you're processing a trillion tokens and it isn't.

This is a field report from BharatGen, where we build pretraining data for sovereign Indic-language models. While scaling our data pipeline across 22 Indian languages, our Python tokenizer went from "fast enough" to a measurable bottleneck. This talk walks through what happened next — diagnosis, decision, and rewrite — and the general lessons that apply to any Python codebase with a hot path.

The talk has three movements, all serving one question: when is Python actually your bottleneck, and what do you do about it?

First, what a tokenizer actually does — the byte-level and subword mechanics most developers use daily but never look inside, and the concrete reason Indic scripts (Devanagari, Tamil, Bengali, and others) are more expensive to tokenize than English. This is the "unseen" part: visual, concrete, and useful whether or not you work on LLMs.
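A rough illustration of the byte-level cost, offered as a sketch rather than the tokenizer discussed in the talk: every Devanagari code point takes three bytes in UTF-8, so a byte-level tokenizer starts from about three times as many units per character as it does for ASCII English. The helper name `byte_cost` and the two sample words are assumptions chosen only for illustration.

```python
def byte_cost(text: str) -> tuple[int, int]:
    """Return (number of code points, number of UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))


# "namaste" in Devanagari, written with escapes so the code points are explicit:
# NA, MA, SA, VIRAMA, TA, VOWEL SIGN E
NAMASTE = "\u0928\u092e\u0938\u094d\u0924\u0947"

for word in ["hello", NAMASTE]:
    chars, nbytes = byte_cost(word)
    print(f"{word!r}: {chars} code points, {nbytes} UTF-8 bytes")
```

Here `hello` is 5 code points and 5 bytes, while the Devanagari word is 6 code points and 18 bytes. How many tokens each word becomes after subword merges depends on the trained vocabulary, which this sketch does not model.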

Second, finding the bottleneck — how we profiled the pipeline, why the slow part wasn't where we assumed, and the difference between code that feels slow and code that measurably is. The honest version: Python is often not the problem people think it is.
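One way to separate "feels slow" from "is slow" is to profile the whole pipeline and read the cumulative times per function. The sketch below uses the standard-library `cProfile` and `pstats` modules; the stage functions `normalize`, `tokenize`, and `run_pipeline` are hypothetical stand-ins, not the BharatGen pipeline.

```python
import cProfile
import io
import pstats
import unicodedata


def normalize(text: str) -> str:
    # Hypothetical preprocessing stage: Unicode NFC normalization.
    return unicodedata.normalize("NFC", text)


def tokenize(text: str) -> list[str]:
    # Hypothetical tokenizer stand-in: plain whitespace split.
    return text.split()


def run_pipeline(docs: list[str]) -> int:
    # Run every document through both stages and count the tokens produced.
    total = 0
    for doc in docs:
        total += len(tokenize(normalize(doc)))
    return total


def profile_pipeline(docs: list[str]) -> str:
    # Profile one pipeline run and return the report as text,
    # sorted by cumulative time so whole stages can be compared.
    profiler = cProfile.Profile()
    profiler.enable()
    run_pipeline(docs)
    profiler.disable()
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
    return out.getvalue()


report = profile_pipeline(["hello world from the pipeline"] * 10_000)
print(report)
```

The report lists each function with its call count and cumulative time, which shows whether the tokenizer or a neighbouring stage dominates. A sampling profiler such as py-spy may be a better fit for long-running jobs, since it can attach to a process that is already running.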

Third, the Rust rewrite: why Rust, where the language boundary goes (Python orchestration, Rust hot loop), what PyO3/maturin integration looks like in practice, how we kept it SLURM-native, and, most importantly, what the rewrite cost in engineering effort against the speedup it bought. This includes the cases where crossing into Rust would have been the wrong call.
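The language boundary can be sketched from the Python side. In the sketch below, Python keeps the orchestration (iteration and batching) and hands a whole batch to one function per call, so the per-call overhead of crossing into compiled code is paid once per batch rather than once per document. Everything here is an assumption for illustration: the module name `indic_tokenizer_rs`, the `encode_batch` signature, and the pure-Python fallback are hypothetical and are not the project's API.

```python
from typing import Callable, Iterable, Iterator

EncodeBatch = Callable[[list[str]], list[list[int]]]


def encode_batch_py(texts: list[str]) -> list[list[int]]:
    # Pure-Python stand-in for the hot loop: UTF-8 byte values as token IDs.
    return [list(text.encode("utf-8")) for text in texts]


try:
    # Hypothetical compiled extension (for example, built with maturin).
    # The module and function names are assumptions, not a real package.
    from indic_tokenizer_rs import encode_batch as default_encode_batch
except ImportError:
    default_encode_batch = encode_batch_py


def tokenize_corpus(
    docs: Iterable[str],
    encode_batch: EncodeBatch = default_encode_batch,
    batch_size: int = 1024,
) -> Iterator[list[int]]:
    # Orchestration stays in Python: group documents into batches and
    # cross the language boundary once per batch.
    batch: list[str] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield from encode_batch(batch)
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        yield from encode_batch(batch)
```

Passing `encode_batch` as a parameter keeps the orchestration testable without the compiled module, for example `list(tokenize_corpus(["ab", "c"], encode_batch=encode_batch_py, batch_size=1))`.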

Attendees leave with a transferable mental model for performance decisions, a clearer view of a component they use every day, and an honest account of a real migration — payoffs and regrets included.

### What format do you have in mind?

Talk (20-25 minutes + Q&A)

### Talk outline / Agenda

0:00 – 0:03  The setup: a trillion tokens, 22 languages, and a tokenizer that
             used to be "fast enough." Why tokenization is the most-used,
             least-profiled code in any LLM pipeline.

0:03 – 0:09  What a tokenizer actually does: byte-level / subword mechanics
             under the hood, and a concrete look at why Indic scripts cost
             more tokens — and more compute — than English.

0:09 – 0:14  Finding the real bottleneck: how we profiled the pipeline, where
             the time actually went vs. where we assumed, and separating
             "feels slow" from "is slow."

0:14 – 0:20  The Rust rewrite: choosing the language boundary (Python
             orchestration, Rust hot path), PyO3/maturin in practice, staying
             SLURM-native, and what the ~3× actually cost to build.

0:20 – 0:30  Q&A.

### Key takeaways

- A working mental model for deciding when Python is genuinely your bottleneck — and when "rewrite it in Rust" is the wrong instinct.
- A clear, practical understanding of what a tokenizer does under the hood, and why multilingual / Indic text changes the cost equation.
- How to profile a data pipeline to find the real hot path, instead of optimizing where you assume the problem is.
- A realistic picture of a Python↔Rust migration: PyO3/maturin mechanics, where to put the language boundary, and the engineering cost behind the speedup.
- An honest account of the trade-offs — including what we would do differently.

### What domain would you say your talk falls under?

Data Science and Machine Learning

### Duration (including Q&A)

30 minutes (20-22 minutes talk + 8-10 minutes Q&A)

### Prerequisites and preparation

- Working knowledge of Python and comfort reading code.
- No prior experience with tokenizers, LLMs, or Rust required — every concept is introduced from first principles.
- Helpful but not required: having once tried to make a Python program faster.
- No setup or installation needed for attendees.

### Resources and references

- Hugging Face `tokenizers` library (Rust-backed reference implementation)
- PyO3 and maturin documentation (Python–Rust bindings)
- Recent research on tokenizer fertility for Indic languages (e.g., IndicSuperTokenizer, MUTANT — arXiv, 2025)
- Python profiling tools: cProfile, py-spy, scalene

### Link to slides/demos (if available)

_No response_

### Twitter/X handle (optional)

_No response_

### LinkedIn profile (optional)

https://www.linkedin.com/in/nauman-data-llm/

### Profile picture URL (optional)

_No response_

### Speaker bio

I'm a Data Platform Engineer at BharatGen, India's sovereign AI initiative (an IIT Bombay-led consortium), where I lead a small team building the data infrastructure behind Indic-language foundation models. My work centers on multilingual pretraining data at scale — corpora spanning 22 Indian languages — and the systems that process it: distributed data pipelines, large-scale deduplication and filtering, and the unglamorous performance engineering that decides whether a pipeline finishes in days or weeks.

I work primarily in Python and Rust, and I'm drawn to the parts of an ML data stack that rarely get talked about. I also co-organize the Kafka Mumbai community. This talk comes straight out of a real problem my team hit — and what we learned fixing it.

### Availability

23/05/2026

### Accessibility & special requirements

_No response_

### Speaker checklist

- [x] I have read and understood the [PyDelhi guidelines](https://github.com/pydelhi/talks/blob/main/guidelines/speaking.md) for submitting proposals and giving talks
- [x] I have read and acknowledged the [PyDelhi accessibility guidelines](https://github.com/pydelhi/talks/blob/main/guidelines/accessibility.md) and will ensure my presentation materials (slides, videos, demos) follow these recommendations
- [x] I will make my talk accessible to all attendees and will proactively ask for any accommodations or special requirements I might need
- [x] I agree to share slides, code snippets, and other materials used during the talk with the community
- [x] I will follow PyDelhi's Code of Conduct and maintain a welcoming, inclusive environment throughout my participation
- [x] I understand that PyDelhi meetups are community-centric events focused on learning, knowledge sharing, and networking, and I will respect this ethos by not using this platform for self-promotion or hiring pitches during my presentation, unless explicitly invited to do so by means of a sponsorship or similar arrangement
- [ ] If the talk is recorded by the PyDelhi team, I grant permission to release the video on PyDelhi's YouTube channel under the CC-BY-4.0 license, or a different license of my choosing if I am specifying it in my proposal or with the materials I share

### Additional comments

This talk is a direct fit for the "Unseen in AI & Python" theme: a component everyone uses and almost no one inspects, told as a real project story rather than a polished demo. I'm happy to adjust scope or depth for the slot available.
