Validate and clean Two-Line Element (TLE) satellite corpora — correctness-first.
lintle audits TLE files exported from space-track.org
against the standardized TLE spec, repairs the systematic export defects, and emits a
uniform, de-defected corpus that any SGP4 / orbital-mechanics library can ingest
directly. Records it cannot safely repair are quarantined — never silently mangled —
into a per-file sidecar detailed enough to file a defect report with space-track.
- Correctness over recovery — every emitted record is re-validated and valid by construction; on any doubt a record is quarantined, never guessed.
- Constant memory — streams a 3 GB file line-by-line; the whole ~30 GB corpus never loads into RAM.
- Byte-deterministic output — same input → identical bytes every run (diff-able, CI-friendly).
On the bundled 29-file corpus (~232 M records): 99.96 % cleaned, 0.044 % quarantined — every quarantined record fell into an anticipated defect category.
A TLE record is two fixed-width lines, each exactly 69 ASCII columns, with a mod-10 checksum in column 69. Bulk historical exports from space-track carry two systematic, era-specific defects:
- Trailing \ artifact: almost every Line 1 has an extra \ byte appended before the newline.
- Missing checksum digit: many records were exported without their column-69 checksum, leaving 68-column lines.
These appear independently and in combination, and a small fraction of records are
genuinely corrupt (garbled columns, orphaned lines, wrong lengths). lintle distinguishes
the safely-repairable from the genuinely-corrupt and treats each correctly.
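
The column-69 rule described above can be sketched in a few lines of Python. This is a minimal illustration of the conventional TLE checksum (digits count at face value, a minus sign counts as 1, every other character counts as 0), not lintle's own validator; the names `tle_checksum` and `has_valid_shape` are hypothetical, and the sample is a widely published ISS line used only for illustration.

```python
def tle_checksum(body: str) -> int:
    """Mod-10 checksum over columns 1-68 of one TLE line."""
    total = 0
    for ch in body[:68]:
        if "0" <= ch <= "9":
            total += int(ch)   # digits count at face value
        elif ch == "-":
            total += 1         # a minus sign counts as 1
    return total % 10          # letters, spaces, '.', '+' count as 0


def has_valid_shape(line: str) -> bool:
    """True if the line is 69 ASCII columns with a matching column-69 digit."""
    return (
        len(line) == 69
        and line.isascii()
        and line[68].isdigit()
        and int(line[68]) == tle_checksum(line)
    )


# Widely published ISS example (line 1); the final digit is the checksum.
ISS_LINE1 = "1 25544U 98067A   08264.51782528 -.00002182  00000-0 -11606-4 0  2927"

if __name__ == "__main__":
    print(has_valid_shape(ISS_LINE1))         # a well-formed line
    print(has_valid_shape(ISS_LINE1[:68]))    # the missing-checksum defect
    print(has_valid_shape(ISS_LINE1 + "\\"))  # the trailing-\ artifact
```

Both export defects change the line length, which is why a length check alone already separates them from well-formed lines in this sketch.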
Requires Python 3.14+ and uv. The only runtime
dependency is rich (>=15,<16, terminal rendering for the clean progress UI);
everything else is standard library. (sgp4 is a dev-only test oracle.)
uv sync

No build step is needed to run the tool.
The console script is lintle (python -m lintle … is equivalent):
# Produce cleaned output + quarantine sidecars
uv run lintle clean [path]
# Re-render a prior clean run's aggregate summary from its report.json
uv run lintle report [out-dir]
# Explain a rule ID or fix tag — definition, examples, source citation
uv run lintle explain <TAG>
# Compare two clean runs' findings (per-rule deltas)
uv run lintle diff <run-a> <run-b>

Arguments and options:
| Option | Default | Meaning |
|---|---|---|
| `path` | `data/source` | A single file or directory. A directory is globbed for `tle*.txt` (tool output `*.cleaned.txt` / `*.broken.txt` is excluded). |
| `--out-dir DIR` | `data/output` | Where clean writes its output. Created if absent. |
| `--jobs N` | CPU count − 1 | Files processed in parallel. Lower it if a slow disk causes I/O contention. |
| `--report text\|json` | `text` | Summary format. |
| `--max-quarantined N[%]` | `0` | Exit non-zero only if more than N records were quarantined or, with a trailing %, more than N% of routed records (clean + quarantined). The default 0 means "any quarantine fails". |
| `--resume` / `--no-resume` | (none) | (clean only) Resume an interrupted run without prompting / ignore any checkpoint and start fresh. See Cancelling and resuming. |
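
The `--max-quarantined` rule in the table can be read as the following decision. This is a hedged sketch of the documented semantics only; `parse_max_quarantined` and `exceeds_limit` are hypothetical names, and treating an empty run as passing is an assumption rather than documented behaviour.

```python
def parse_max_quarantined(spec: str) -> tuple[float, bool]:
    """Split a spec such as "100" or "1%" into (limit, is_percent)."""
    spec = spec.strip()
    if spec.endswith("%"):
        return float(spec[:-1]), True
    return float(int(spec)), False


def exceeds_limit(spec: str, clean: int, quarantined: int) -> bool:
    """True when the gate should fail, i.e. strictly MORE than the limit."""
    limit, is_percent = parse_max_quarantined(spec)
    if not is_percent:
        return quarantined > limit
    routed = clean + quarantined   # the percentage is taken over routed records
    if routed == 0:
        return False               # assumption: nothing routed means nothing failed
    return quarantined * 100 > limit * routed


if __name__ == "__main__":
    print(exceeds_limit("0", clean=10, quarantined=1))    # default: any quarantine fails
    print(exceeds_limit("100", clean=0, quarantined=100)) # exactly at the limit
    print(exceeds_limit("1%", clean=98, quarantined=2))   # 2% of routed records
```

The comparison is strict, so a run that lands exactly on the limit still passes.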
Examples:
# Clean the whole corpus
uv run lintle clean data/source
# Clean one file to a custom location
uv run lintle clean data/source/tle2022.txt --out-dir data/output
# Clean the corpus, capture a machine-readable summary
uv run lintle clean data/source --report json > run-summary.json
# CI gate: fail only if more than 100 records (or 1% of routed records) are quarantined
uv run lintle clean data/source --max-quarantined 100 --report json > run-summary.json
uv run lintle clean data/source --max-quarantined 1% --report json > run-summary.json
# Look up what a rule ID or fix tag means, with a verified example
uv run lintle explain TLE-CHK-001
uv run lintle explain reconstructed-checksum

Exit codes:
| Code | Meaning |
|---|---|
| 0 | Quarantine count (or rate) is at or below `--max-quarantined` (default 0). |
| 1 | Quarantine count (or rate) exceeded `--max-quarantined`. |
| 2 | Operational error: no input files, disk shortfall, lock held, stale/corrupt/declined resume, or a file that failed to process. |
| 129 / 130 / 143 | Killed by SIGHUP / Ctrl-C (SIGINT) / SIGTERM. |
Repairable defects (including the near-universal trailing \) never raise the exit
code above 0, because almost every raw file contains them. Gating with --max-quarantined
preserves the meaningful 2 (operational error) and 130 (Ctrl-C) signals that a
lintle … || true fallback would swallow.
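
The three signal codes follow the usual shell convention of 128 plus the signal number. A small sketch of how a CI wrapper might interpret the codes from the table; `describe_exit` is a hypothetical helper, not part of lintle, and the sketch assumes a POSIX platform because SIGHUP does not exist on Windows.

```python
import signal

# 128 + signal number: SIGHUP=1, SIGINT=2, SIGTERM=15 on POSIX systems.
SIGNAL_EXITS = {
    128 + int(sig): sig.name
    for sig in (signal.SIGHUP, signal.SIGINT, signal.SIGTERM)
}


def describe_exit(code: int) -> str:
    """Map a lintle exit code to the meaning given in the table above."""
    if code == 0:
        return "pass: quarantine within --max-quarantined"
    if code == 1:
        return "gate failed: quarantine exceeded --max-quarantined"
    if code == 2:
        return "operational error"
    if code in SIGNAL_EXITS:
        return f"killed by {SIGNAL_EXITS[code]}"
    return f"unexpected exit code {code}"


if __name__ == "__main__":
    for code in (0, 1, 2, 129, 130, 143):
        print(code, describe_exit(code))
```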
This is the heart of the tool. The cleaner never applies a fix and hopes: it applies a
candidate fix, re-runs the full validator, and commits only if the result passes, so
the output cannot contain a wrong-but-valid-looking record. A single validator (tle.py)
defines what "perfect" means, which makes correctness structural rather than assumed.
lintle never invents data. The single sanctioned reconstruction is the column-69 checksum. That is safe only because the checksum is a deterministic mod-10 function of columns 1–68, so recomputing a missing one asserts nothing the record didn't already say (the redundancy paradox: the only field safe to rebuild is the one that was redundant to begin with). A mod-10 checksum accepts a wrong line about 1 time in 10 by luck, so guessing an orbital-data character risks a record that looks valid but is silently wrong, the one outcome worse than dropping it. Anything that would require such a guess (bad checksum, wrong length, orphan line, garbled columns) is therefore quarantined, not repaired.
Fixes fall into five classes in decreasing order of safety — content-preserving (trailing
\, CRLF, trailing whitespace), reconstructed-checksum, content-shifting (leading trim),
structural (drop blanks), and corrupt (quarantine).
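
The apply, re-validate, commit-or-quarantine loop can be sketched as below. This is a simplified illustration, not the code in tle.py: `is_valid` is a stand-in that checks only length, ASCII, and checksum (a real validator would also check every field), `repair` and the `trailing-backslash` tag are hypothetical names, and only two of the fix classes are modelled.

```python
from typing import Optional


def checksum(body: str) -> int:
    """Mod-10 checksum: digits at face value, '-' as 1, everything else 0."""
    total = 0
    for ch in body[:68]:
        if "0" <= ch <= "9":
            total += int(ch)
        elif ch == "-":
            total += 1
    return total % 10


def is_valid(line: str) -> bool:
    """Stand-in validator: 69 ASCII columns plus a matching checksum."""
    return (
        len(line) == 69
        and line.isascii()
        and line[68].isdigit()
        and int(line[68]) == checksum(line)
    )


def repair(raw: str) -> tuple[Optional[str], list[str]]:
    """Return (clean_line, fix_tags), or (None, []) to quarantine."""
    candidate = raw.rstrip("\r\n")
    fixes: list[str] = []
    if candidate.endswith("\\"):            # content-preserving: strip the artifact
        candidate = candidate[:-1]
        fixes.append("trailing-backslash")  # hypothetical tag name
    if len(candidate) == 68:                # rebuild only the redundant field
        candidate += str(checksum(candidate))
        fixes.append("reconstructed-checksum")
    # Commit only if the candidate passes the same validator as pristine input.
    if is_valid(candidate):
        return candidate, fixes
    return None, []                         # any doubt: quarantine, never guess


GOOD = "1 25544U 98067A   08264.51782528 -.00002182  00000-0 -11606-4 0  2927"

if __name__ == "__main__":
    print(repair(GOOD[:68] + "\\\n"))  # both export defects combined
    print(repair(GOOD[:68] + "0\n"))   # wrong checksum digit
```

Note that a line with a present but wrong checksum is never "corrected" here: overwriting the digit would hide whichever column is actually damaged.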
→ Full fix-class table, repair tiers, and the stable rule registry:
ARCHITECTURE.md §1 and
§4.
A clean run lays --out-dir out like this:
<out-dir>/
├── cleaned/ tleYYYY.cleaned.txt — one per input file
├── broken/ tleYYYY.broken.txt — one per input file
├── broken-noradids.ndjson — corpus-wide list of quarantined NORAD IDs
├── report.jsonl — corpus-wide structured findings stream
└── report.md — corpus-wide run report
- cleaned/tleYYYY.cleaned.txt: standard 2-line TLE text, every record verified valid and ready for downstream ingestion.
- broken/tleYYYY.broken.txt: the quarantine sidecar, holding source line number(s), a human-readable reason, and the offending line(s) copied byte-faithfully, with a header formatted to paste into a space-track defect report.
- broken-noradids.ndjson: one {"noradId":N} per line, the deduplicated, sorted set of NORAD catalog numbers quarantined anywhere in the run (for programmatic consumers).
- report.md: human-readable run report with corpus totals, % cleaned/quarantined, fix counts, the per-rule defect breakdown, a per-file table, and a per-NORAD breakdown.
- report.json: the machine-readable run envelope, byte-identical to the --report json stdout output. Persisted on every clean run so lintle report can re-render the summary later without re-processing the corpus.
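
A programmatic consumer might read broken-noradids.ndjson along these lines. The sketch relies only on the documented one-object-per-line shape `{"noradId":N}`; the function name `parse_noradids` is hypothetical and the demo IDs are made up for illustration.

```python
import json
from typing import Iterable


def parse_noradids(lines: Iterable[str]) -> list[int]:
    """Collect NORAD catalog numbers from NDJSON lines, skipping blank lines."""
    ids: list[int] = []
    for line in lines:
        line = line.strip()
        if line:
            ids.append(int(json.loads(line)["noradId"]))
    return ids


if __name__ == "__main__":
    # Illustrative in-memory input; in practice pass an open file handle.
    demo = ['{"noradId":5}\n', '{"noradId":25544}\n']
    print(parse_noradids(demo))
```

Because the function accepts any iterable of lines, passing a file opened with `open(path, encoding="utf-8")` keeps memory use flat however long the list is.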
At the end of a clean run an aggregate summary panel is rendered to stderr —
corpus totals, % cleaned/quarantined, and the top fix / quarantine rules — sized to the
terminal width (with an ASCII-bar fallback off a TTY). Text-mode stdout stays empty; the
full machine summary is report.json (or --report json on stdout). records counts
paired 2-line entries; clean are those that passed and were written; quarantined is
everything routed to broken/ (failed records and every orphan line). The invariant is
records + orphan == clean + quarantined. Defects key by the stable RuleID registry
(TLE-CHK-001, TLE-PAIR-001, …) so one identifier names a defect across every artifact.
lintle report [out-dir] re-renders that panel to stdout from a prior run's
report.json (or echoes the JSON verbatim with --report json); a missing or unreadable
report.json exits 2.
Live progress during a long run is also written to stderr (so it never pollutes the
stdout --report json pipe): a size roster up front, per-file byte/record progress with
throughput and ETA, and an [k/N] line as each file finishes.
→ Machine-readable contracts (--report json envelope, report.jsonl, the .broken.txt
format, the checkpoint): ARCHITECTURE.md §6.
A full run over the 29-file corpus (tle2004–tle2025, ~232 million records):
- 99.96 % cleaned: 187.9 M trailing-\ artifacts stripped, 71.3 M missing checksums reconstructed.
- 0.044 % quarantined (103,228 records) as genuinely corrupt: every quarantined record fell into an anticipated category; no unknown defect type surfaced.
A long clean can be interrupted (Ctrl-C, a closed laptop, SIGTERM/SIGHUP). Re-run
the same command (same --out-dir, unchanged inputs) to resume; on a TTY it prompts,
in CI it auto-resumes. Resume granularity is a whole file: completed files are skipped
and the file in flight at the interruption restarts from the beginning — so a multi-file
corpus run benefits, but a single-file run gains nothing. --no-resume discards the
checkpoint and starts fresh (clearing prior outputs).
→ Checkpoint shape and the resume-decision matrix:
ARCHITECTURE.md §5.
Every record is routed to exactly one of cleaned/ or broken/ — never duplicated — so
the output is roughly the input's size plus tiny metadata. As a guard, lintle requires
~2× the total input size free on the --out-dir volume before starting, aborting with
exit 2 if short (and warning on stderr in the 2×–2.5× borderline band). Rule of thumb for
the ~30 GB corpus: keep ~60 GB free to clear the abort floor, ~75 GB to clear the
warning. (The 12 GB TLEs.zip is not an input and is never read.)
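
The free-space guard amounts to two thresholds on the ratio of free space to input size. A minimal sketch under stated assumptions: `classify_free_space` and `free_bytes_on` are hypothetical names, and the exact behaviour at the 2x and 2.5x boundaries is assumed, not taken from lintle.

```python
import shutil


def classify_free_space(input_bytes: int, free_bytes: int) -> str:
    """Return "abort", "warn", or "ok" for the documented 2x / 2.5x bands."""
    if free_bytes < 2 * input_bytes:
        return "abort"                    # below the abort floor: exit 2
    if 2 * free_bytes < 5 * input_bytes:  # same test as free < 2.5 * input
        return "warn"                     # borderline band: warn on stderr
    return "ok"


def free_bytes_on(path: str) -> int:
    """Free bytes on the volume holding `path` (for example the --out-dir)."""
    return shutil.disk_usage(path).free


if __name__ == "__main__":
    GB = 10**9
    for free in (50, 60, 75):
        print(free, classify_free_space(30 * GB, free * GB))
```

With a 30 GB input this reproduces the rule of thumb above: about 60 GB clears the abort floor and about 75 GB clears the warning.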
uv sync # install dev dependencies
uv run pytest # run the test suite
uv run pytest --cov=lintle # with a coverage report
uv run ruff check # lint
uv run ruff format # auto-format

The suite includes per-module unit tests, an asymmetric cross-check against the trusted
sgp4 parser (a known-good TLE must be accepted by both), and end-to-end integration tests
(golden output, idempotence, re-validation). See
CONTRIBUTING.md for setup, testing, and the git workflow.
ARCHITECTURE.md is the living design reference — the validator
definition, the module map and data flow, the repair model, streaming/durability/resume, the
machine-readable output-format contracts, and the runtime-dependency policy. Dated design
specs, plans, and corpus-run summaries are kept for historical rationale under
docs/superpowers/archive/.