Migrate to orjson #209

Open
itayfoT wants to merge 29 commits into EC-DIGIT-CSIRC:main from envoidshield:main

Conversation

itayfoT commented Nov 4, 2025

Hey,

To improve performance, we switched from the standard json module to orjson.

itayfoLY and others added 13 commits October 14, 2025 13:59
- Replace json module with orjson for improved performance
- Update all json.loads() calls to orjson.loads()
- Update all json.dumps() calls to orjson.dumps()
- Change file I/O from text mode to binary mode (required for orjson)
- Update exception handling from json.decoder.JSONDecodeError to orjson.JSONDecodeError
- Update test file to use orjson for consistency
- All existing tests pass successfully
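
The changes listed above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names `load_json_file` and `dump_json_file` are hypothetical, and the fallback to the standard `json` module is only there so the snippet runs where orjson is not installed.

```python
import json

try:
    import orjson  # third-party package
except ImportError:  # fallback so the sketch still runs without orjson
    orjson = None


def load_json_file(path):
    """Read a JSON file in binary mode; return None if it is not valid JSON."""
    with open(path, "rb") as f:  # binary mode: orjson.loads accepts bytes directly
        raw = f.read()
    if orjson is not None:
        try:
            return orjson.loads(raw)
        except orjson.JSONDecodeError:  # subclass of json.JSONDecodeError
            return None
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None


def dump_json_file(path, obj):
    """Write obj as JSON in binary mode."""
    if orjson is not None:
        data = orjson.dumps(obj)  # returns bytes, not str
    else:
        data = json.dumps(obj).encode("utf-8")
    with open(path, "wb") as f:
        f.write(data)
```

The main behavioural difference to keep in mind is that `orjson.dumps()` returns `bytes`, which is why the file handles have to be opened in binary mode.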
feat: add disks.txt parser with df output support
Migrate logarchive parser from json to orjson
- Add UID field to all ps_everywhere output entries
- Extract UID from ps.txt, psthread.txt, spindump-nosymbols.txt, and logarchive (euid)
- Add _sanitize_uid() helper to filter invalid placeholder UIDs (0xAAAAAAAA, 0xFFFFFFFF)
- Update deduplication logic to consider UID as part of uniqueness
- Same process with different UIDs now tracked as separate entries
- Use None for missing/invalid UIDs (not 0)
- All tests pass successfully
- Extract PID and PPID from sources that provide them (ps.txt, psthread.txt, spindump-nosymbols.txt)
- Extract PID only from sources without PPID (logarchive, shutdownlogs, taskinfo)
- Set both to None for sources without process ID information
- Maintain consistent data structure across all sources with pid and ppid fields
- Build PID-to-name mapping from ps.txt, psthread.txt, spindump, and taskinfo
- Resolve PPID to parent process name using the mapping
- Use direct 'parent' field from spindump when available
- Add ppname field to all output entries (None when not resolvable)
- Enriches ~22% of entries with parent process names in typical datasets
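
The parent-name resolution described above could be sketched like this. The field names `pid`, `ppid`, `ppname`, and `parent` follow the commit message; `process`, `build_pid_map`, and `add_ppname` are hypothetical names used only for illustration.

```python
def build_pid_map(entries):
    """Map PID -> process name from entries that carry both."""
    pid_to_name = {}
    for e in entries:
        pid = e.get("pid")
        if pid is not None and e.get("process"):
            pid_to_name.setdefault(pid, e["process"])  # first name seen wins
    return pid_to_name


def add_ppname(entries):
    """Add a ppname field to every entry (None when the parent cannot be resolved)."""
    pid_to_name = build_pid_map(entries)
    for e in entries:
        # Prefer an explicit parent name (e.g. spindump's 'parent' field) over the lookup.
        e["ppname"] = e.get("parent") or pid_to_name.get(e.get("ppid"))
    return entries
```

This sketch ignores PID reuse: if a PID was recycled during the capture window, the first name seen is used, which may not be the actual parent.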
@dario-br
Contributor

Hi @itayfoT, could you please also remove the changes to ps_everywhere.py? Could you also share the performance tests you ran? @cvandeplas ran some as well, and in those initial tests orjson did not perform that well. The more data you can share, the better we can assess the change.
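
A comparison of the kind requested here could be set up roughly as follows. This is only a sketch: `make_payload` and `benchmark` are hypothetical names, the payload is synthetic, and meaningful numbers would need to come from representative parser output. orjson is included only if it is importable.

```python
import json
import timeit

try:
    import orjson  # third-party package
except ImportError:
    orjson = None


def make_payload(n=1000):
    """Synthetic records; real measurements should use representative data."""
    return [{"timestamp": i, "process": f"proc{i}", "message": "x" * 50} for i in range(n)]


def benchmark(payload, repeat=5, number=10):
    """Return best-of-`repeat` seconds per call for dumps and loads, per library."""
    candidates = {"json": (lambda o: json.dumps(o).encode("utf-8"), json.loads)}
    if orjson is not None:
        candidates["orjson"] = (orjson.dumps, orjson.loads)
    results = {}
    for name, (dumps, loads) in candidates.items():
        blob = dumps(payload)
        t_dumps = min(timeit.repeat(lambda: dumps(payload), repeat=repeat, number=number)) / number
        t_loads = min(timeit.repeat(lambda: loads(blob), repeat=repeat, number=number)) / number
        results[name] = {"dumps_s": t_dumps, "loads_s": t_loads}
    return results
```

Calling `benchmark(make_payload())` returns a dictionary keyed by library name, which makes it straightforward to report serialization and parsing times separately.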

itayfoLY and others added 15 commits November 27, 2025 18:51
- Increase chunk size from 64KB to 1MB for 10-15x faster processing
- Increase subprocess buffer to 2MB for better pipe utilization
- Fix duplicate message field by extracting message before passing to Event data
- Switch to binary mode with buffered reading for reduced overhead
- Update log_stderr to handle binary mode properly
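
The buffered binary reading described above might look something like the sketch below. The two sizes come from the commit message; the function name `iter_lines` and the line-splitting strategy are assumptions, not the PR's actual implementation.

```python
import subprocess
import sys

CHUNK_SIZE = 1024 * 1024       # 1 MiB reads
PIPE_BUFFER = 2 * 1024 * 1024  # 2 MiB subprocess buffer


def iter_lines(cmd):
    """Yield complete lines (bytes, without the newline) from cmd's stdout."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, bufsize=PIPE_BUFFER) as proc:
        pending = b""
        while True:
            chunk = proc.stdout.read(CHUNK_SIZE)  # binary mode: no decoding overhead
            if not chunk:
                break
            pending += chunk
            # Everything before the last newline is complete; keep the remainder.
            *lines, pending = pending.split(b"\n")
            yield from lines
        if pending:  # final line without a trailing newline
            yield pending
```

Note that `read(CHUNK_SIZE)` on a buffered pipe waits until the chunk is full or the stream ends, so this favours throughput over latency.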
…2.4x speedup

- Always use unifiedlog_iterator instead of native macOS log parser
- Use --output-format event to get Event format directly from Rust
- Let Rust write file directly (--output) for zero Python overhead
- Use 10 threads for maximum performance (320K+ lines/sec)
- Remove Python-side format conversion (no longer needed)
- Simplify generator to just parse JSON (already in Event format)

Performance improvement:
- Before: 38,774 lines/sec (Python conversion)
- After: 94,220 lines/sec (Rust direct)
- Speedup: 2.4x faster, saves 68 seconds on 4.4M entries
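
Once the Rust tool writes Event-format output directly, the Python generator only has to parse it. A minimal sketch, assuming the output file holds one JSON object per line (the commit message does not state the exact layout) and using a hypothetical function name:

```python
import json


def iter_events(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, "rb") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                # Assumption: a malformed line is skipped rather than aborting the run.
                continue
```

The standard `json` module is used here to keep the sketch dependency-free; `orjson.loads` accepts the same `bytes` input.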
- Add BasebandMetrics_TelephonyRegistration_1_2 query for iOS 15-18
- Add BasebandMetrics_TelephonyActivity_1_2 query for iOS 15-18
- iOS 18 uses new table naming scheme (BasebandMetrics_*_1_2 vs PLBBAGENT_*)
- Enables cell tower registration and RAT activity parsing on modern iOS
feat(apollo): add iOS 15-18 support for telephony tables
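
The commits above add queries for the new table names. As a separate illustration, and not how this PR selects queries, one way to cope with differing table names is to probe the database for whichever table exists. The function name `pick_table` is hypothetical, and the older `PLBBAGENT_*` name is deliberately left to the caller because the commit message does not spell it out.

```python
import sqlite3

# Table name taken from the commit message.
NEW_REGISTRATION_TABLE = "BasebandMetrics_TelephonyRegistration_1_2"


def pick_table(conn, candidates):
    """Return the first candidate table that exists in the database, or None."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    existing = {row[0] for row in rows}
    for name in candidates:
        if name in existing:
            return name
    return None
```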
…ents

- Enrich __extract_plist_mdm_data with ServerURL, CheckInURL, Topic,
  ServerCapabilities, IdentityCertificateUUID, UDID, and other MDM fields
- Add __extract_plist_profile_events for MCProfileEvents.plist timeline
  (install/remove operations with process and timestamp per profile)
- Add _build_profile_hash_map to resolve profile stub SHA-256 hashes
  from PayloadIdentifier/PayloadUUID, included in both MDM and profile
  event data for cross-referencing

Co-authored-by: Cursor <cursoragent@cursor.com>
JetsamEvent crash logs contain a full snapshot of all running processes
(~750 per event). This adds crashlogs as a new source, extracting process
names and PIDs from JetsamEvent entries, and procPath/procName with
userID/parentPid/parentProc from non-JetsamEvent crash reports.

Co-authored-by: Cursor <cursoragent@cursor.com>
The parse_ips_file method never stored the basename of the .ips file,
causing ghost crash detection to miss files with collision suffixes
like .000.ips that iOS appends for same-second crashes.

Made-with: Cursor
Crash reports from paired devices are stored under
ProxiedDevice-<hash>/ directories and should not be parsed as
iPhone crashes.

Made-with: Cursor
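
Skipping paired-device reports could be done with a path check like the one below. It is a sketch: the function name and the example paths are hypothetical, and only the `ProxiedDevice-` prefix comes from the commit message.

```python
from pathlib import PurePosixPath


def is_proxied_device_path(path):
    """True if any parent directory of path starts with 'ProxiedDevice-'."""
    parents = PurePosixPath(path).parent.parts
    return any(part.startswith("ProxiedDevice-") for part in parents)
```

A parser would then skip any `.ips` file for which this returns `True`, for example `logs/ProxiedDevice-abc123/report.ips`.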
The filter 'id' not in k was stripping bundle_id from parsed messages
because it contains the substring 'id'. Changed to 'table id' not in k
to only filter out table primary key fields while preserving
forensically critical fields like bundle_id.

Made-with: Cursor
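
The substring problem described above is easy to reproduce. The row below is invented for illustration; only the two filter expressions come from the commit message.

```python
row = {"table id": 42, "bundle_id": "com.example.app", "message": "launched"}

# Old filter: drops every key containing the substring 'id', including bundle_id.
old = {k: v for k, v in row.items() if "id" not in k}

# New filter: drops only the table primary-key field.
new = {k: v for k, v in row.items() if "table id" not in k}
```

With the old filter only `message` survives; with the new one `bundle_id` is preserved as well.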
… scanner

- LogarchiveParser.get_log_files() now globs both system_logs.logarchive and
  collect_system_logs.logarchive; existing multi-folder merge logic handles dedup
- yarascan ignore_folders includes collect_system_logs.logarchive alongside
  system_logs.logarchive; fixed prior .pop() that would crash if folder absent

Made-with: Cursor
Log which logarchive folders are found, parse output sizes,
and handle RuntimeError from unifiedlog_iterator failures
to diagnose whether collect_system_logs.logarchive is being
parsed or silently skipped.

Made-with: Cursor
4 participants