perf: parallel file processing and faster BibTeX classification #1093

Merged
florath merged 2 commits into main from perf/parallel-mass-eval on Mar 14, 2026

@coding-ai-assistant
Contributor

Summary

  • Parallel file processing: replace sequential for loop in mass_eval with asyncio.Semaphore + asyncio.gather, controlled by new --max-parallel-files option (default 8). Each file's parse is offloaded to a ProcessPoolExecutor (spawn) so the event loop is never blocked.
  • Faster BibTeX classification: pre-compile all preprint and venue-type regex patterns at module level; _is_preprint_entry now does a single regex search instead of 7 sub-calls that each re-read 7 fields. This eliminates millions of redundant field reads on large files (e.g. the 80K-entry ACL Anthology file).
  • State safety: add files_lock (asyncio.Lock) to serialise cross-file mutations on MassEvalState; periodic checkpointing moved to a background task.
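
The concurrency pattern in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code from mass_eval.py: process_files, make_spawn_pool, and the parse callable are hypothetical names, and the executor is passed in so the same coroutine works with any executor.

```python
import asyncio
import multiprocessing
from concurrent.futures import Executor, ProcessPoolExecutor
from typing import Callable, Optional, Sequence


def make_spawn_pool(max_workers: int) -> ProcessPoolExecutor:
    """Build a process pool whose workers use the 'spawn' start method."""
    ctx = multiprocessing.get_context("spawn")
    return ProcessPoolExecutor(max_workers=max_workers, mp_context=ctx)


async def process_files(
    paths: Sequence[str],
    parse: Callable[[str], object],
    executor: Optional[Executor],
    max_parallel_files: int = 8,
) -> list:
    """Run `parse` on every path, with at most `max_parallel_files` in flight."""
    loop = asyncio.get_running_loop()
    semaphore = asyncio.Semaphore(max_parallel_files)

    async def process_one(path: str) -> object:
        # The semaphore caps concurrency; the executor keeps the blocking
        # parse off the event loop thread.
        async with semaphore:
            return await loop.run_in_executor(executor, parse, path)

    # gather() returns results in the same order as `paths`.
    return await asyncio.gather(*(process_one(p) for p in paths))
```

With a spawn-based pool, the parse function has to be defined at module top level so it can be pickled, and the pool should be created under an `if __name__ == "__main__":` guard. Passing `None` as the executor falls back to the event loop's default thread pool, which is convenient for quick experiments.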

Test plan

  • pytest tests/ passes (unit + integration)
  • aletheia-probe mass-eval --help shows --max-parallel-files
  • Collect run processes multiple files concurrently (visible in log: several [N/total] Processing lines at same timestamp)
  • No "connection pool exhausted" regression (requires openalex/opencitations platform pool PRs deployed)
  • Resume from checkpoint still works after interruption mid-run

florath added 2 commits March 14, 2026 16:06
bibtex_parser.py:
- Pre-compile all preprint patterns into a single combined regex
  (_ALL_PREPRINT_RE) so _is_preprint_entry does one search instead of
  iterating 26+ patterns across 7 sub-checkers
- _is_preprint_entry now computes _get_preprint_check_content once
  instead of once per sub-checker (was 7× redundant field reads)
- Remove duplicate _is_preprint_entry call in _detect_venue_type;
  entries reaching that path are already confirmed non-preprint
- Pre-compile _detect_venue_type pattern groups (_SYMPOSIUM_RE,
  _WORKSHOP_RE, _CONFERENCE_RE, _JOURNAL_RE) at module level
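
A rough sketch of the combined-regex idea described above. Only the name _ALL_PREPRINT_RE comes from the commit message; the pattern list, the field tuple, and both function bodies are illustrative assumptions, since the actual contents of bibtex_parser.py are not shown here.

```python
import re

# Hypothetical pattern list; the real module reportedly has 26+ patterns.
_PREPRINT_PATTERNS = [r"arxiv", r"biorxiv", r"\bpreprint\b", r"\bssrn\b"]

# Compiled once at import time: one alternation instead of many searches.
_ALL_PREPRINT_RE = re.compile(
    "|".join(f"(?:{p})" for p in _PREPRINT_PATTERNS),
    re.IGNORECASE,
)

# Hypothetical subset of fields; the PR mentions 7 without listing them.
_CHECKED_FIELDS = ("journal", "booktitle", "publisher", "note", "url")


def get_preprint_check_content(entry: dict) -> str:
    """Read the relevant fields once and join them into one string."""
    return "\n".join(entry.get(name, "") for name in _CHECKED_FIELDS)


def is_preprint_entry(entry: dict) -> bool:
    """Single regex search over the pre-joined field content."""
    return _ALL_PREPRINT_RE.search(get_preprint_check_content(entry)) is not None
```

Wrapping each pattern in a non-capturing group keeps alternations inside individual patterns from leaking into their neighbours when they are joined with `|`.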

mass_eval.py:
- Process multiple .bib files concurrently using asyncio.Semaphore
  and asyncio.gather instead of a sequential for-loop
- Offload blocking pybtex parse_file_all to a ProcessPoolExecutor
  (spawn mode) so the event loop is never frozen during parsing;
  raw_entry field is stripped before pickling back to the main process
- Add files_lock (asyncio.Lock) to serialise cross-file mutations on
  MassEvalState (completed_files, failed_files, checkpoint writes)
- Periodic checkpointing moved to a background asyncio task
- New --max-parallel-files CLI option (default 8) controls semaphore
  width and ProcessPoolExecutor size
- Add _null_context helper for backward-compatible single-file paths
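
The locking and background-checkpoint items above might look something like this in outline. State, mark_done, checkpoint_loop, and run are hypothetical stand-ins for MassEvalState and its callers, and the "checkpoint" here is only a counter rather than a real file write.

```python
import asyncio
import contextlib
from dataclasses import dataclass, field


@dataclass
class State:
    """Hypothetical stand-in for the shared MassEvalState."""
    completed_files: set = field(default_factory=set)
    failed_files: set = field(default_factory=set)
    checkpoints_written: int = 0


async def mark_done(state: State, lock: asyncio.Lock, path: str, ok: bool) -> None:
    await asyncio.sleep(0)  # placeholder for the per-file work
    async with lock:  # serialise cross-file mutations
        (state.completed_files if ok else state.failed_files).add(path)


async def checkpoint_loop(state: State, lock: asyncio.Lock, interval: float) -> None:
    while True:
        await asyncio.sleep(interval)
        async with lock:  # snapshot under the same lock as the writers
            state.checkpoints_written += 1


async def run(paths: list, interval: float = 0.01) -> State:
    state, lock = State(), asyncio.Lock()
    checkpointer = asyncio.create_task(checkpoint_loop(state, lock, interval))
    try:
        await asyncio.gather(
            *(mark_done(state, lock, p, not p.startswith("bad")) for p in paths)
        )
    finally:
        # Stop the periodic task, then take one final checkpoint.
        checkpointer.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await checkpointer
        async with lock:
            state.checkpoints_written += 1
    return state
```

Taking the same lock in the checkpoint task as in the per-file updates means a snapshot never observes a half-applied update, at the cost of briefly pausing writers while it runs.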

assessment_commands.py / context.py:
- Wire --max-parallel-files through Click option and AsyncMassEvalMain
  protocol

florath merged commit 4efe57e into main on Mar 14, 2026
21 checks passed