⚡ Optimize printable string detection #30

Merged
r0ny123 merged 2 commits into master from codex/optimize-string-printable-check on May 7, 2026

r0ny123 (Owner) commented May 6, 2026

💡 What

Precomputes printable ASCII character membership once at module import time and uses an integer-indexed lookup table inside detect_ascii_len() and detect_unicode_len().

🎯 Why

The previous hot-loop condition evaluated chr(char) in string.printable for every byte checked. That cost a chr() call plus a linear scan of string.printable on each iteration of string detection. The new lookup keeps the same printable-byte semantics while making the inner-loop check a constant-time index operation with no per-iteration chr() call.
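
As a rough illustration of the change described above (not the actual code in StringExtractor.py), the sketch below compares the two checks. The names PRINTABLE_TABLE, is_printable_slow, and is_printable_fast are hypothetical, and the 127-entry table with an explicit range guard is an assumption about how the first version of the lookup might have looked.

```python
import string

# Hypothetical names; the real table in StringExtractor.py may be built differently.
_PRINTABLE_ORDS = frozenset(ord(c) for c in string.printable)

# Built once at import time: index i is True when byte value i is printable.
# Assumption: the table covers 0..126, since '~' (126) is the highest printable code.
PRINTABLE_TABLE = tuple(i in _PRINTABLE_ORDS for i in range(127))


def is_printable_slow(byte: int) -> bool:
    """Old-style check: calls chr() and scans string.printable for every byte."""
    return chr(byte) in string.printable


def is_printable_fast(byte: int) -> bool:
    """Table check: one range comparison plus one tuple index."""
    return byte < 127 and PRINTABLE_TABLE[byte]


# Both checks agree on every possible byte value.
assert all(is_printable_slow(b) == is_printable_fast(b) for b in range(256))
```

The final assertion is the point of the sketch: the table is derived from string.printable itself, so the semantics cannot drift from the original membership test.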

📊 Measured Improvement

Measured with a focused in-process benchmark using dummy SMDA reports and an input of 4,096 printable ~ characters, with warmup, GC disabled during samples, and best-of-seven timing on this Windows checkout. The machine was noisy, so the best sample is the most stable point of comparison.

| Path | Baseline best | Optimized best | Change |
| --- | --- | --- | --- |
| detect_ascii_len | 4.668365 s total / 1867.35 µs per call | 4.232579 s total / 1693.03 µs per call | 9.3% faster |
| detect_unicode_len | 3.421829 s total / 2737.46 µs per call | 2.690248 s total / 2152.20 µs per call | 21.4% faster |
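
The benchmark script itself is not part of this PR, so the following is only a sketch of how a best-of-N measurement with warmup and GC disabled could be set up. The helper best_of and its parameters are hypothetical, and len() stands in for the real detection functions, whose signatures are not shown here.

```python
import gc
import time


def best_of(func, *args, calls=1000, samples=7, warmup=1):
    """Return (best_total_seconds, best_seconds_per_call) over several samples."""
    # Warm up so one-time costs (imports, caches) stay out of the timed samples.
    for _ in range(warmup):
        func(*args)

    gc_was_enabled = gc.isenabled()
    gc.disable()  # keep collector pauses out of the measurement
    try:
        best = float("inf")
        for _ in range(samples):
            start = time.perf_counter()
            for _ in range(calls):
                func(*args)
            best = min(best, time.perf_counter() - start)
    finally:
        # Restore the collector only if it was on before the benchmark.
        if gc_was_enabled:
            gc.enable()
    return best, best / calls


# Placeholder workload: 4,096 '~' bytes, timed against len() as a stand-in.
payload = b"~" * 4096
total, per_call = best_of(len, payload, calls=100)
```

Taking the minimum rather than the mean is a common choice on a noisy machine, because interference from other processes can only make a sample slower, never faster.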

✅ Verification

  • python -m ruff format --check .
  • python -m ruff check .
  • python -m pytest -q (42 passed, 7 subtests passed)

gemini-code-assist (Bot) left a comment

Code Review

This pull request optimizes string extraction by replacing repeated character checks with a pre-computed lookup table. Feedback suggests expanding the table to 256 elements to eliminate redundant range checks in the detect_ascii_len and detect_unicode_len loops, further improving performance.
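
The review's suggestion can be sketched as follows. This is an illustration under assumptions, not the merged code: PRINTABLE_TABLE_256 and printable_run_len are hypothetical names, and the real detect_ascii_len and detect_unicode_len loops may be structured differently.

```python
import string

_PRINTABLE_ORDS = frozenset(ord(c) for c in string.printable)

# 256 entries, so every possible byte value is a valid index and
# values 127..255 simply map to False.
PRINTABLE_TABLE_256 = tuple(i in _PRINTABLE_ORDS for i in range(256))


def printable_run_len(data: bytes, offset: int = 0) -> int:
    """Length of the run of printable bytes starting at offset (hypothetical helper)."""
    table = PRINTABLE_TABLE_256  # local alias avoids a global lookup per iteration
    length = 0
    for byte in data[offset:]:  # iterating bytes yields ints in 0..255
        if not table[byte]:  # no separate range guard needed
            break
        length += 1
    return length


print(printable_run_len(b"hello\x00world"))  # 5
```

Because iterating a bytes object can only yield values from 0 to 255, the index can never go out of range, which is what makes the explicit upper-bound comparison unnecessary.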

3 comment threads on smda/utility/StringExtractor.py (Outdated)

r0ny123 (Owner, Author) commented May 7, 2026

@claude

claude (Bot) commented May 7, 2026

Claude encountered an error.

I'll analyze this and get back to you.

r0ny123 marked this pull request as ready for review May 7, 2026 11:47

Addresses the gemini-code-assist review on PR #30. With a 256-entry table,
every byte value is a valid index, and bytes >= 127 (DEL and all non-ASCII
values) naturally return False, so the explicit char < 127 guard in the
detect_ascii_len and detect_unicode_len hot loops becomes redundant.
Removing it saves one comparison per loop iteration on the
string-detection hot path.

https://claude.ai/code/session_01PHLmsRuiwBQJ3n7gvR7Aa5
r0ny123 merged commit f33265e into master May 7, 2026
7 checks passed
r0ny123 deleted the codex/optimize-string-printable-check branch May 7, 2026 12:00