
Callback perf #89

Draft
tmcgilchrist wants to merge 15 commits into tarides:main from tmcgilchrist:callback_perf

Conversation

@tmcgilchrist

Collaborator

On top of #88

Trying out a rewrite to avoid GC allocations in the callback paths. The issues we've seen around dropped events, unterminated Fuchsia traces (#20), and olly crashes could all be linked back to a GC being triggered during one of the runtime events callbacks.

Still merges the user's existing OCAMLRUNPARAM value.
Honours the user's intent if OCAML_RUNTIME_EVENTS_PRESERVE is set and leaves
the ring buffer file in place.
Fixes some EINTR edge cases for system calls (sleepf and waitpid).
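
As a rough illustration of the EINTR handling mentioned above, here is a minimal POSIX C sketch of the usual retry pattern. The names `sleep_retry` and `waitpid_retry` are hypothetical and are not taken from this PR, which fixes the equivalent calls on the OCaml side; the sketch also assumes a non-negative sleep duration.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>

/* Hypothetical helper: sleep for `seconds` (assumed >= 0). If a signal
   interrupts the sleep, resume with the time that was still remaining. */
int sleep_retry(double seconds) {
    struct timespec req, rem;
    req.tv_sec = (time_t)seconds;
    req.tv_nsec = (long)((seconds - (double)req.tv_sec) * 1e9);
    while (nanosleep(&req, &rem) == -1) {
        if (errno != EINTR) return -1;  /* a real error: give up */
        req = rem;                      /* interrupted: sleep the remainder */
    }
    return 0;
}

/* Hypothetical helper: call waitpid again whenever it fails with EINTR. */
pid_t waitpid_retry(pid_t pid, int *status, int options) {
    pid_t r;
    do {
        r = waitpid(pid, status, options);
    } while (r == -1 && errno == EINTR);
    return r;
}
```

Without the retry, a signal delivered at the wrong moment can look like a failed wait or a sleep that ended early.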
Adds a C stub olly_is_process_alive that uses kill(pid, 0) on Unix and
OpenProcess + GetExitCodeProcess on Windows.
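
For readers unfamiliar with the `kill(pid, 0)` idiom, the Unix half of such a check can be sketched as follows. This is only an assumption about the shape of the logic, not the body of `olly_is_process_alive`; the function name `is_process_alive` is hypothetical and the Windows branch is omitted.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/* Hypothetical helper: returns 1 if a process with this pid exists, else 0.
   Signal 0 delivers nothing; the kernel only performs the existence and
   permission checks. */
int is_process_alive(pid_t pid) {
    if (kill(pid, 0) == 0) return 1;
    /* EPERM: the process exists but belongs to another user.
       ESRCH: no such process. */
    return errno == EPERM;
}
```

Note that a zombie process that has not yet been reaped still counts as existing for this check.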
Changed emit from trace -> Event.t -> unit to trace -> ~ring_id:int ->
~ts:int64 -> ~name:string -> ~kind:Event.kind -> unit, passing fields
directly instead of boxing them into a record on every event. This
should avoid one Event.t allocation per event. The Event.t type is
preserved since kind is still needed as an extensible type.
This should avoid allocations in Printf (1-2 allocations), Some(fun oc
-> ...) closure + option wrapper for counters, and %t function
argument indirection.
Provides a direct path for counter events that takes ~value:int
instead of ~kind:(Counter value).
Thread_ref.ref only supports values 1–255 (indexed Fuchsia thread-table refs),
which broke when we bumped max_doms to 4096. Thread_ref.inline
supports arbitrary pid/tid values; the trade-off is 2 extra words per
event record in the trace file, which is negligible.
Key design choices:
  - C stubs (fxt_put_event_header, fxt_put_arg_header_i32/i64) pack multi-field headers into int64
  words using C bit manipulation — zero Int64 boxing
  - String interning (32K table) — repeated phase names like "major", "minor" are emitted once as
  string records, then referenced by index. Makes traces ~50% smaller and avoids redundant string
  writes
  - Thread interning (256 table) — domain thread refs registered once, referenced by index
  - Single Bytes.t buffer (64KB) flushed to out_channel — no Buf_chain, no locks, no pool (olly is
  single-threaded)
  - %caml_bytes_set64u compiler intrinsic for timestamp writes — single instruction, no allocation
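
To make the header-packing bullet concrete, here is a minimal sketch of packing an event record header into one 64-bit word. The field layout follows my reading of the Fuchsia trace format (record type in bits 0-3, size in 64-bit words in bits 4-15, event type in 16-19, argument count in 20-23, thread ref in 24-31, category ref in 32-47, name ref in 48-63). The function name `pack_event_header` and its parameters are hypothetical; the real `fxt_put_event_header` stub may take different arguments and writes into the buffer rather than returning a value.

```c
#include <stdint.h>

#define FXT_RECORD_TYPE_EVENT 4u  /* record type tag for event records */

/* Hypothetical helper: pack all header fields into a single 64-bit word.
   Each field is masked to its width so an oversized value cannot spill
   into a neighbouring field. */
uint64_t pack_event_header(uint32_t size_words, uint32_t event_type,
                           uint32_t n_args, uint32_t thread_ref,
                           uint32_t category_ref, uint32_t name_ref) {
    return (uint64_t)FXT_RECORD_TYPE_EVENT
         | ((uint64_t)(size_words   & 0xFFFu)  << 4)
         | ((uint64_t)(event_type   & 0xFu)    << 16)
         | ((uint64_t)(n_args       & 0xFu)    << 20)
         | ((uint64_t)(thread_ref   & 0xFFu)   << 24)
         | ((uint64_t)(category_ref & 0xFFFFu) << 32)
         | ((uint64_t)(name_ref     & 0xFFFFu) << 48);
}
```

Doing the shifts in C means the OCaml side only passes small integers and never has to build an intermediate boxed Int64 per field.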
With a self-contained Fuchsia implementation there is no reason to
keep this package.
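
The string interning described in the design choices can be sketched as a small open-addressing table. This is an illustrative C version under stated assumptions, not the PR's implementation (which lives on the OCaml side): the names `intern_table` and `intern` are hypothetical, indices run from 1 to 32767 (index 0 is left unused here), and the stored pointers are assumed to outlive the table, as static phase names like "major" and "minor" would.

```c
#include <stdint.h>
#include <string.h>

#define INTERN_CAP 32768u  /* power of two; at most INTERN_CAP - 1 entries */

typedef struct {
    const char *keys[INTERN_CAP];  /* NULL marks a free slot */
    uint16_t ids[INTERN_CAP];      /* index assigned to keys[slot] */
    uint32_t count;                /* number of interned strings so far */
} intern_table;                    /* must start zero-initialised */

/* FNV-1a hash over the bytes of the string. */
uint32_t hash_str(const char *s) {
    uint32_t h = 2166136261u;
    for (; *s; s++) { h ^= (unsigned char)*s; h *= 16777619u; }
    return h;
}

/* Returns the index of `s` (1..32767), or -1 if the table is full.
   *is_new is set to 1 when the string was just added, meaning the caller
   must emit a string record before referencing the index. */
int intern(intern_table *t, const char *s, int *is_new) {
    uint32_t slot = hash_str(s) & (INTERN_CAP - 1);
    for (;;) {  /* terminates: at least one slot always stays free */
        if (t->keys[slot] == NULL) {
            if (t->count >= INTERN_CAP - 1) return -1;
            t->keys[slot] = s;
            t->ids[slot] = (uint16_t)++t->count;
            *is_new = 1;
            return t->ids[slot];
        }
        if (strcmp(t->keys[slot], s) == 0) {
            *is_new = 0;
            return t->ids[slot];
        }
        slot = (slot + 1) & (INTERN_CAP - 1);  /* linear probing */
    }
}
```

With a table like this, a repeated name costs one hash and one comparison per event, and the string bytes are written to the trace only the first time.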
