Performance: parse-dispatch optimization
Speeds up Parse::parse on its hot path — the failed parse attempts that dominate qsv stats --infer-dates on non-date columns, where every value previously ran the full regex is_match chain before failing. Four behavior-preserving changes:
- Structural byte pre-filter (
cannot_be_date, backed by a 256-entryDATE_BYTElookup table). Any input containing a byte that cannot appear in an accepted format (_,#, non-ASCII, etc.) returnsErrimmediately, skipping the regex chain. - Dispatch reorder — the cheap regex-gated families run first; the two parsers without a family regex gate (
unix_timestamp,rfc2822) move last. Result-preserving: floats match no family gate (so still reachunix_timestamp), and rfc2822 inputs always carry a timezone that the$-anchoredmonth_dmy_*regexes reject — whilemonth_dmy_*only succeeds without a timezone, which makes rfc2822 fail. The two are mutually exclusive. unix_timestampfirst-byte gate — bail beforefast_float2::parseunless the first byte is one ofdigit + - . i I n N(the only leadsfast_float2accepts; it rejects leading whitespace).rfc2822colon gate — every RFC 2822 datetime has a time-of-day, so colon-free inputs can't be rfc2822; skipparse_from_rfc2822for them.
Benchmarks (M4 Max, release, 1000 values/iter)
| path | 0.15.0 | 0.15.1 | change |
|---|---|---|---|
non-date strings, parse_failures (category_value_N) |
125 µs | 60 µs | −52% |
genuine ISO datetimes, parse_throughput |
398 µs | 370 µs | −7% |
non-date words, parse_word_failures |
133 µs | 127 µs | −5% |
accepted-formats mix, parse_all |
8.2 µs | 7.9 µs | −4% |
Correctness
No public API change and byte-identical parse results — verified three ways:
- All 26 unit tests + 6 doctests pass (doctests cover the full accepted-formats and DMY lists). New
prefilter_rejects_non_date_stringsregression test pins the pre-filter (rejects junk; still accepts every separator a real date uses, plus the bareinf/nanthe unix-timestamp path accepts). - Integration: qsv
statsoutput is byte-identical before/after on a mixed-format dataset, and qsv's fullcargo test stats -F all_featuressuite passes (752 passed, 0 failed) against this release.
Why it matters for qsv
qsv stats --infer-dates (and, via the stats cache, frequency, schema, tojsonl, sqlp, joinp, pivotp, describegpt) parse dates through this crate. Genuine date-typed columns infer ~3% faster at the qsv level; non-date columns are unaffected. (Integration profiling showed the larger remaining --infer-dates overhead on non-date-heavy data lives in qsv's own sniff step, not this crate — tracked for follow-up on the qsv side.)
Dev-only
- Added
parse_failuresandparse_word_failuresbenches for the failure hot paths. - Updated parse-order docs in
CLAUDE.md. - A
RegexSetsingle-pass classification idea was prototyped and rejected — measured ~1.6–2.1× slower on the common early-gate date path (no early exit; per-is_matchsetup overhead, not scanning, dominates).
Full diff
0.15.0...0.15.1 — PR #9.