perf: short-circuit limited dataset JSON previews #936
Conversation
Code Review
This pull request introduces an incremental JSON decoding mechanism to the dataset registry catalog to optimize row reading when a limit is specified. By using json.JSONDecoder().raw_decode, the system can now extract a specific number of rows from large JSON files without parsing the entire payload, which significantly improves performance for data previews. The changes include the implementation of _limited_rows_from_json_text and _json_text_first_array_start, along with comprehensive unit tests and updated performance probe configurations. The reviewer suggested refactoring the repeated whitespace-skipping logic into a dedicated helper function to improve maintainability.
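The core idea can be sketched with a small standalone function. This is a hedged illustration of the technique, not the PR's actual `_limited_rows_from_json_text`: the name `limited_rows_from_array` and its signature are hypothetical, but the mechanism — stepping through an array with `json.JSONDecoder().raw_decode` and stopping once `limit` rows are decoded — is the one described above.

```python
import json


def limited_rows_from_array(json_text: str, start: int, limit: int) -> list:
    """Decode at most `limit` elements of a JSON array that begins at
    `start` (the index of the opening '['), without parsing the rest of
    the payload. Illustrative sketch only."""
    decoder = json.JSONDecoder()
    rows: list = []
    cursor = start + 1  # step past '['
    length = len(json_text)
    while len(rows) < limit and cursor < length:
        # Skip whitespace and the ',' separators between elements.
        while cursor < length and json_text[cursor] in " \t\r\n,":
            cursor += 1
        if cursor >= length or json_text[cursor] == "]":
            break  # end of array reached before hitting the limit
        # raw_decode parses exactly one value and reports where it ended.
        row, cursor = decoder.raw_decode(json_text, cursor)
        rows.append(row)
    return rows


payload = '{"rows": [{"a": 1}, {"a": 2}, {"a": 3}]}'
print(limited_rows_from_array(payload, payload.index("["), 2))
# → [{'a': 1}, {'a': 2}]
```

Because decoding stops after `limit` elements, the cost is proportional to the preview size rather than the file size, which is where the reported speedup on large files comes from.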
```python
def _json_text_first_array_start(json_text: str) -> int | None:
    cursor = 0
    text_length = len(json_text)
    while cursor < text_length and json_text[cursor].isspace():
        cursor += 1
    if cursor >= text_length:
        return None
    if json_text[cursor] == "[":
        return cursor
    if json_text[cursor] != "{":
        return None

    decoder = json.JSONDecoder()
    cursor += 1
    while cursor < text_length:
        while cursor < text_length and json_text[cursor].isspace():
            cursor += 1
        if cursor >= text_length or json_text[cursor] == "}":
            return None
        try:
            key, cursor = decoder.raw_decode(json_text, cursor)
        except json.JSONDecodeError:
            return None
        if not isinstance(key, str):
            return None
        while cursor < text_length and json_text[cursor].isspace():
            cursor += 1
        if cursor >= text_length or json_text[cursor] != ":":
            return None
        cursor += 1
        while cursor < text_length and json_text[cursor].isspace():
            cursor += 1
        if key in {"rows", "data"} and cursor < text_length and json_text[cursor] == "[":
            return cursor
        try:
            _, cursor = decoder.raw_decode(json_text, cursor)
        except json.JSONDecodeError:
            return None
        while cursor < text_length and json_text[cursor].isspace():
            cursor += 1
        if cursor < text_length and json_text[cursor] == ",":
            cursor += 1
            continue
        if cursor < text_length and json_text[cursor] == "}":
            return None
        return None
    return None
```
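The scan above is built on `json.JSONDecoder.raw_decode`, which parses exactly one JSON value starting at a given index and returns that value together with the index just past it. Unlike `json.loads`, it tolerates trailing text, but it raises `json.JSONDecodeError` if the index points at whitespace — which is why the function skips whitespace by hand before every call. A minimal standalone illustration:

```python
import json

decoder = json.JSONDecoder()
text = '  "rows" : [1, 2]'

# Parse one value (the key string) starting at index 2, the opening quote.
key, end = decoder.raw_decode(text, 2)
print(key, end)          # rows 8

# The returned index lets a scanner resume right after the decoded value.
print(repr(text[end:]))  # ' : [1, 2]'

# raw_decode does not skip leading whitespace; pointing it at a space fails.
try:
    decoder.raw_decode(text, 0)
except json.JSONDecodeError:
    print("no value at index 0")
```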
The whitespace-skipping logic, `while cursor < text_length and json_text[cursor].isspace(): cursor += 1`, is repeated multiple times in this function and also in `_limited_rows_from_json_text`. To improve maintainability and reduce code duplication, consider extracting it into a small helper function. For example:

```python
def _skip_whitespace(text: str, cursor: int, text_length: int) -> int:
    while cursor < text_length and text[cursor].isspace():
        cursor += 1
    return cursor
```

You could then replace the repeated `while` loops with a call to this helper, like `cursor = _skip_whitespace(json_text, cursor, text_length)`.
Melix PR Scoped Performance Report
Changed Files
Dataset registry split match string stem
Hub catalog tag normalization single pass
Hub catalog next cursor fast parse
Dataset registry snapshot inference single pass
Multimodal fast-path signature top-level key cache
Multimodal preprocessing local URI parse elision
Runtime utils kwarg signature cache
Runtime utils package version cache
Runtime utils top-level weight streaming
MLX text stop kwarg signature cache
MLX text stop-filter prefix cache
Text family config copy elision
MLX audio speech signature cache
Stream assembler parser-mode cache
Deterministic image edit digest reuse
Deterministic image output byte accounting
Deterministic embedding duplicate input cache
Stream assembler structural prefix cache
Stream assembler token-byte fast decode
Benchmark evaluation report running aggregates
Benchmark export run-scan single pass
Benchmark queue decoded-record cache
Benchmark store matrix streaming
Closure audit probe-source short circuit
Phase8 metrics closure-audit reuse
Evaluation job-id high-water mark
Evaluation sample probe aggregation
Evaluation answer normalization fast path
Evaluation latency percentile vector reuse
Evaluation dialogue diagnostics top-k
Evaluation compare target lookup early stop
Evaluation store compare summary CSV streaming
Evaluation store samples CSV streaming
Evaluation compare target lookup short-circuit
Evaluation final-result cache-hit materialization
Evaluation final-result JSON typed-score running aggregate
Training config target-module cache
LoRA experiment run-dir name scan
Training dataset quality/token summary
Training dataset validation split partial selection
Training dataset validation sample-limit loading
Maintenance bench report readback
Maintenance percentile vector reuse
Maintenance prompt shape vector repeat
Maintenance benchmark parameter normalization single convert
Upload receipt published-files scandir
Download pipeline snapshot manifest base reuse
Worker registry resident-bytes accumulator
Job registry derived-model single pass
MLX-LM structured result tail parse
MLX-VLM family-config cache
MLX-VLM Gemma4 weight-presence single-pass scan
Job registry restore sort elision
Deterministic rerank query-context reuse
Rerank core bounded top-k selection
PR-scoped performance registry cache
PR-scoped performance scope changed-files JSON read bytes
PR-scoped performance scope matcher
PR-scoped performance report results scandir
Package macOS fallback build product scandir
Dev-up MLX Metal dist-info scandir
Quantization gate manifest event streaming
Model ops bundle artifact byte accounting
Model registry plain-local generation-config stat elision
Real model support HF cache latest snapshot
Swift CLI JSON envelope encoding
Code evaluation code-block last-match streaming
Code evaluation payload JSON byte loading
Code evaluation stdio tail single stat
Code evaluation runner script cache
Code evaluation fallback test count line scan
Code evaluation nonblank test count streaming
Changed-scope coverage empty-path short circuit
Changed-scope coverage diff parser
LoRA reward summary candidate min/max reuse
Deterministic embedding projection allocation
Deterministic OCR token count scan
Deterministic VLM completion token scan
Statistical evidence bootstrap single sort
Statistical evidence category breakdown single pass
Video preprocessing URI byte length reuse
Startup signals lazy worker log excerpts
Startup version comparison single pass
Release gates M9 failure count single pass
Event extraction alignment accepted-edge cache
Event extraction semantic value-group cache
Event extraction group actor alias cache
Event extraction fenced JSON trim
MLX audio WAV PCM streaming
MLX audio local URI zero-copy preprocess
Training dataset chunker top-level base copy
MLX audio generate signature cache
Dataset registry preview limit short-circuit
Hub catalog size hint regex precompile
Multimodal preprocessing image URI single parse
Quantization indexed shard min single pass
Quantization QAT source scan scandir
Engine generate usage token elision
Vision family prompt token count scan
Integration Swift binary resolution scandir fallback
Summary
Short-circuits `.json` dataset previews by incrementally decoding canonical top-level `rows`/`data` arrays instead of fully materializing the payload before applying `limit`. Extends the `dataset-registry-preview-limit-short-circuit` focused commands to include the new regression tests.

Plan or Spec
docs/plans/2026-05-08-dataset-json-preview-limit.md
`dataset-registry-preview-limit-short-circuit` in `infra/perf/pr_scoped_probes.json`

Commands Run
Coverage and Metrics
`TOTAL 111 0 100%` from the registered `coverage_command`.
Probe (`dataset-registry-preview-limit-short-circuit`): `old_mean=132.835093 ms`, `new_mean=2.523591 ms`, `delta_ms=-130.311502`, `speedup=52.64x`, `elapsed_reduction=98.10%`.
`old_peak_bytes_mean=17588055.286`, `new_peak_bytes_mean=5361759.571`, `peak_delta_bytes=-12226295.715`, `peak_reduction=69.51%`.

Known Gaps
The fast path covers canonical top-level `rows` and `data` array payloads; non-canonical JSON shapes keep the existing full-decode fallback.

Evidence Checklist
`N/A` is stated explicitly with the reason.