feat: add cdylib target, harden C FFI, fix capture offsets, and add timestamp event boundaries #210

Open
jackluo923 wants to merge 4 commits into y-scope:log-mechanic from jackluo923:log-mechanic

@jackluo923
Member

Description

Add several improvements to the Rust lexer's build targets, FFI safety, correctness,
and functionality:

  • cdylib target: Add cdylib to crate-type so the library can be loaded via dlopen
    (e.g. Python cffi).

  • Runtime-configurable debug output: Replace unconditional println! in DFA/NFA
    construction with a debug_println! macro gated behind an AtomicBool, controllable at
    runtime via debug::set_debug(). This lets upper-layer consumers (e.g. Python bindings)
    suppress noisy internal output by default and enable it only when needed.

  • Fix capture boundary offsets: The DFA used inclusive byte offsets (input[i..=j],
    input[..=consumed]), causing incorrect capture slices. For example, matching
    (?<key>[a-z]+)=(?<val>[0-9]+) against foo=123 would return 12 instead of 123 for
    the val capture because the end offset pointed to the start of the last character rather
    than past it. Fix by tracking consumed_end as an exclusive byte offset and using
    input[i..j] / input[..consumed_end].

  • Fix delimiter hex encoding: pattern_for_delimiters formatted codepoints with {:x},
    producing \u{a} for newline, which the regex parser rejected. Fix by zero-padding with
    {:02x} to produce \u{0a}.

  • Support underscores in capture group names: Named capture groups like (?<my_var>...)
    are common convention but were rejected by the regex parser. Add support by switching from
    alphanumeric1 to take_while1(|c| c.is_alphanumeric() || c == '_').

  • Fix FFI panic on invalid input: FFI functions like schema_add_rule and lexer_new
    would panic (abort the calling process) on invalid UTF-8 or bad regex patterns. Now they
    return bool / null pointer so callers can handle errors gracefully.

  • Harden FFI lifecycle: lexer_new returns *mut Lexer (null on failure) instead of
    Box<Lexer>. lexer_delete takes *mut Lexer with explicit null check. Add set_debug,
    schema_rule_count, and schema_rule_name introspection functions.

  • Timestamp rules + event boundary detection: Add is_timestamp field to Schema::Rule
    and add_timestamp_rule() method. Add is_event_start field to Fragment, set when a
    timestamp rule matches at byte offset 0 or immediately after a newline, enabling downstream
    consumers to split multi-line log events.
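
For reviewers, the event boundary rule in the last item can be sketched as below. This is a minimal illustration of the stated condition (timestamp rule, match at byte offset 0 or right after a newline), not the code in this branch; the function name and signature are hypothetical.

```rust
/// Hypothetical helper: decides whether a fragment starts a new log event.
///
/// `input` is the full buffer, `match_start` is the byte offset where the
/// fragment's rule matched, and `is_timestamp` says whether that rule was
/// registered as a timestamp rule.
pub fn is_event_start(input: &[u8], match_start: usize, is_timestamp: bool) -> bool {
    // Only timestamp rules can open an event.
    is_timestamp
        // A match at the very start of the input opens an event...
        && (match_start == 0
            // ...and so does a match immediately after a newline byte.
            // (`match_start - 1` is safe: the `== 0` case short-circuits.)
            || input.get(match_start - 1) == Some(&b'\n'))
}
```

A consumer could then start a new multi-line event each time a fragment reports `true` and append every other fragment to the current event.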

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • cargo test — all unit tests pass, including new tests for capture boundaries and event
    boundary detection
  • End-to-end validation via
    log-surgeon-ffi 0.1.0b10 — the
    published Python package uses the Rust cdylib from this branch as its cffi backend,
    exercising the FFI, capture offsets, and timestamp event boundary features against real
    log data across Python 3.9–3.13 on Linux and macOS

Jack Luo added 4 commits February 7, 2026 13:59
Add "cdylib" to crate-type so the library can be loaded via dlopen
(e.g. Python cffi). Replace unconditional println! calls in DFA/NFA
construction with a debug_println! macro gated behind an AtomicBool
flag, controllable at runtime via debug::set_debug().
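
A minimal sketch of the gating pattern this commit describes, assuming a single global flag. Only `set_debug` and `debug_println!` are names from the commit message; `DEBUG_ENABLED`, `is_debug_enabled`, and `build_automaton` are hypothetical and may not match the branch.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Global switch for debug output; off by default (assumed name).
static DEBUG_ENABLED: AtomicBool = AtomicBool::new(false);

/// Enables or disables debug output at runtime.
pub fn set_debug(enabled: bool) {
    DEBUG_ENABLED.store(enabled, Ordering::Relaxed);
}

/// Returns whether debug output is currently enabled (hypothetical helper).
pub fn is_debug_enabled() -> bool {
    DEBUG_ENABLED.load(Ordering::Relaxed)
}

/// Prints like `println!`, but only while the debug flag is set.
macro_rules! debug_println {
    ($($arg:tt)*) => {
        if is_debug_enabled() {
            println!($($arg)*);
        }
    };
}

/// Hypothetical stand-in for a DFA/NFA construction step.
pub fn build_automaton(num_states: usize) -> usize {
    // Silent by default; visible only after `set_debug(true)`.
    debug_println!("building automaton with {} states", num_states);
    num_states
}
```

`Ordering::Relaxed` should be enough here because the flag only gates logging and does not guard any other shared data.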
Track consumed_end (exclusive byte offset past last consumed char)
instead of consumed (start of last consumed char). Use input[i..j]
(exclusive) for capture slices and input[..consumed_end] for the
lexeme. Pass consumed_end to final_operations for correct capture
end offsets.
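
To make the offset change concrete, here is a small standalone sketch (not the DFA code from this branch) that scans a run of digits and tracks both the start of the last consumed character and the exclusive end past it. `scan_digits` is a hypothetical helper written for illustration only.

```rust
/// Scans ASCII digits beginning at byte offset `start`.
///
/// Returns `(last_start, consumed_end)`: the byte offset where the last
/// consumed character starts, and the exclusive byte offset just past it.
/// Returns `None` if no digit is found at `start`.
pub fn scan_digits(input: &str, start: usize) -> Option<(usize, usize)> {
    let mut last_start = None;
    let mut consumed_end = start;
    for (offset, ch) in input[start..].char_indices() {
        if !ch.is_ascii_digit() {
            break;
        }
        last_start = Some(start + offset);
        // Exclusive end: one full character past the last consumed one.
        consumed_end = start + offset + ch.len_utf8();
    }
    last_start.map(|s| (s, consumed_end))
}
```

For `foo=123` with `start = 4`, this yields `last_start = 6` and `consumed_end = 7`. Slicing with `input[4..consumed_end]` gives the full `123`, while using `last_start` as an exclusive bound gives `12`, the truncation described in the PR.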

Also zero-pad hex codes in pattern_for_delimiters ({:02x}) so the
regex parser accepts single-digit codepoints like \u{0a}.
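
The formatting difference can be shown in isolation. The two helpers below are hypothetical and only demonstrate the `{:x}` versus `{:02x}` behaviour; they are not the body of `pattern_for_delimiters`.

```rust
/// Formats a codepoint without padding, as the old code did with `{:x}`.
pub fn escape_unpadded(c: char) -> String {
    format!("\\u{{{:x}}}", c as u32)
}

/// Formats a codepoint zero-padded to at least two hex digits (`{:02x}`).
pub fn escape_padded(c: char) -> String {
    format!("\\u{{{:02x}}}", c as u32)
}
```

With a newline as input, the unpadded form yields `\u{a}` and the padded form yields `\u{0a}`. Codepoints that already need two or more hex digits are unchanged by the padding.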
…spection

- schema_set_delimiters and schema_add_rule now return bool instead
  of panicking on invalid UTF-8 or regex parse failures
- lexer_new returns *mut Lexer (null on failure) instead of Box<Lexer>
- lexer_delete takes *mut Lexer with explicit null check
- Add set_debug FFI to toggle debug output at runtime
- Add schema_rule_count and schema_rule_name FFI for rule introspection
- Accept underscores in regex capture group names (?<my_var>...)
- Fix doc comment typos: "interal" -> "internal", "nolonger" -> "no longer"
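
A hedged sketch of the null-on-failure lifecycle described in this commit. The `demo_*` names, the `Lexer` fields, and the single-pattern signature are assumptions for illustration; the real `lexer_new` and `lexer_delete` signatures in this branch may differ. A real export would also carry a `no_mangle` attribute, omitted here to keep the sketch edition-agnostic.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

/// Hypothetical stand-in for the real lexer type.
pub struct Lexer {
    pattern: String,
}

/// Creates a lexer, returning null instead of panicking on bad input.
///
/// # Safety
/// `pattern` must be null or point to a valid NUL-terminated string.
pub unsafe extern "C" fn demo_lexer_new(pattern: *const c_char) -> *mut Lexer {
    if pattern.is_null() {
        return std::ptr::null_mut();
    }
    // Invalid UTF-8 becomes a null return rather than a panic.
    let pattern = match unsafe { CStr::from_ptr(pattern) }.to_str() {
        Ok(s) => s,
        Err(_) => return std::ptr::null_mut(),
    };
    // Ownership moves to the caller, who must call `demo_lexer_delete`.
    Box::into_raw(Box::new(Lexer { pattern: pattern.to_owned() }))
}

/// Returns the stored pattern length in bytes, or 0 for a null lexer.
///
/// # Safety
/// `lexer` must be null or a pointer returned by `demo_lexer_new`.
pub unsafe extern "C" fn demo_lexer_pattern_len(lexer: *const Lexer) -> usize {
    match unsafe { lexer.as_ref() } {
        Some(l) => l.pattern.len(),
        None => 0,
    }
}

/// Frees a lexer; a null pointer is a no-op.
///
/// # Safety
/// `lexer` must be null or an undeleted pointer from `demo_lexer_new`.
pub unsafe extern "C" fn demo_lexer_delete(lexer: *mut Lexer) {
    if lexer.is_null() {
        return;
    }
    drop(unsafe { Box::from_raw(lexer) });
}
```

Returning a raw pointer rather than `Box<Lexer>` gives the C side an explicit failure value to check, which is the motivation stated in the commit message.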
Add is_timestamp field to Schema::Rule and add_timestamp_rule() method.
Add is_event_start field to Fragment, set to true when a timestamp rule
matches at byte offset 0 or immediately after a newline. This enables
downstream consumers to split multi-line log events.

Also add schema_add_timestamp_rule FFI and is_event_start to CLogFragment.
@jackluo923 jackluo923 requested a review from a team as a code owner February 8, 2026 01:13
@coderabbitai
coderabbitai bot commented Feb 8, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

