Skip to content

0.15.1

Latest

Choose a tag to compare

@jqnatividad jqnatividad released this 30 May 14:45

Performance: parse-dispatch optimization

Speeds up Parse::parse on its hot path — the failed parse attempts that dominate qsv stats --infer-dates on non-date columns, where every value previously ran the full regex is_match chain before failing. Four behavior-preserving changes:

  1. Structural byte pre-filter (cannot_be_date, backed by a 256-entry DATE_BYTE lookup table). Any input containing a byte that cannot appear in an accepted format (_, #, non-ASCII, etc.) returns Err immediately, skipping the regex chain.
  2. Dispatch reorder — the cheap regex-gated families run first; the two parsers without a family regex gate (unix_timestamp, rfc2822) move last. Result-preserving: floats match no family gate (so still reach unix_timestamp), and rfc2822 inputs always carry a timezone that the $-anchored month_dmy_* regexes reject — while month_dmy_* only succeeds without a timezone, which makes rfc2822 fail. The two are mutually exclusive.
  3. unix_timestamp first-byte gate — bail before fast_float2::parse unless the first byte is one of digit + - . i I n N (the only leads fast_float2 accepts; it rejects leading whitespace).
  4. rfc2822 colon gate — every RFC 2822 datetime has a time-of-day, so colon-free inputs can't be rfc2822; skip parse_from_rfc2822 for them.

Benchmarks (M4 Max, release, 1000 values/iter)

path 0.15.0 0.15.1 change
non-date strings, parse_failures (category_value_N) 125 µs 60 µs −52%
genuine ISO datetimes, parse_throughput 398 µs 370 µs −7%
non-date words, parse_word_failures 133 µs 127 µs −5%
accepted-formats mix, parse_all 8.2 µs 7.9 µs −4%

Correctness

No public API change and byte-identical parse results — verified three ways:

  • All 26 unit tests + 6 doctests pass (doctests cover the full accepted-formats and DMY lists). New prefilter_rejects_non_date_strings regression test pins the pre-filter (rejects junk; still accepts every separator a real date uses, plus the bare inf/nan the unix-timestamp path accepts).
  • Integration: qsv stats output is byte-identical before/after on a mixed-format dataset, and qsv's full cargo test stats -F all_features suite passes (752 passed, 0 failed) against this release.

Why it matters for qsv

qsv stats --infer-dates (and, via the stats cache, frequency, schema, tojsonl, sqlp, joinp, pivotp, describegpt) parse dates through this crate. Genuine date-typed columns infer ~3% faster at the qsv level; non-date columns are unaffected. (Integration profiling showed the larger remaining --infer-dates overhead on non-date-heavy data lives in qsv's own sniff step, not this crate — tracked for follow-up on the qsv side.)

Dev-only

  • Added parse_failures and parse_word_failures benches for the failure hot paths.
  • Updated parse-order docs in CLAUDE.md.
  • A RegexSet single-pass classification idea was prototyped and rejected — measured ~1.6–2.1× slower on the common early-gate date path (no early exit; per-is_match setup overhead, not scanning, dominates).

Full diff

0.15.0...0.15.1 — PR #9.