Version: 1.0
Status: Production-Ready
Core Features: Cross-Language, Zero-Copy, Lock-Free, Cache-Line Friendly
cTP (coroTracer Protocol) is not a traditional TCP/UDP-based network communication protocol, but rather a physical memory mapping (mmap) contract strictly based on byte alignment.
Modern M:N coroutine schedulers have extreme performance demands, and traditional RPC or socket-based log collection adds intolerable serialization and context-switching overhead. By strictly dictating the binary layout and memory-barrier rules of the shared memory file (/tmp/corotracer.shm), cTP lets the target program under test (C++, Rust, Zig, etc.) record timing data at speeds approaching L1 cache writes, while the Go engine harvests it without blocking from a completely independent process.
The entire shared memory file is strictly divided into fixed-size memory blocks. The first 1KB is dedicated to global state negotiation, followed by N consecutive 1KB coroutine observation stations (Station).
[ Shared Memory File: corotracer.shm ]
=======================================================================
| Offset (Hex) | Size (Bytes) | Block Name |
=======================================================================
| 0x00000000 | 1024 (1KB) | GlobalHeader |
| 0x00000400 | 1024 (1KB) | StationData #0 |
| 0x00000800 | 1024 (1KB) | StationData #1 |
| ... | ... | ... |
| N * 0x400 | 1024 (1KB) | StationData #(N-1) |
=======================================================================
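The offsets in the layout above follow from a single rule: block 0 is the header and station `i` starts one block later per index. A minimal sketch of that arithmetic, where the names `BLOCK_SIZE`, `station_offset`, and `shm_file_size` are illustrative and not part of the protocol:

```rust
/// Size of every block in the file: the GlobalHeader and each StationData.
pub const BLOCK_SIZE: usize = 1024;

/// Byte offset of StationData #index. Block 0 is reserved for the header,
/// so station 0 starts at 0x400, station 1 at 0x800, and so on.
pub const fn station_offset(index: usize) -> usize {
    BLOCK_SIZE + index * BLOCK_SIZE
}

/// Total file size needed for the header plus `max_stations` stations.
pub const fn shm_file_size(max_stations: usize) -> usize {
    BLOCK_SIZE * (1 + max_stations)
}
```

For example, a file pre-allocated for 2 stations would be 3072 bytes long under this rule.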
Mandatory Constraint: When implementing this protocol in any language, GlobalHeader and StationData must each be guaranteed to occupy exactly 1024 bytes, with every padding byte declared explicitly rather than left to the compiler's implicit padding, to ensure cross-language ABI consistency.
GlobalHeader Alignment Requirement: 1024 Bytes ( `alignas(1024)` )
Responsibility: Stores cross-process handshake information and the global cursor for the lock-free allocator.
| Offset | Field | Type | Bytes | Description |
|---|---|---|---|---|
| 0x00 | `magic_number` | `uint64` | 8 | Magic number, fixed at 0x434F524F54524352 (ASCII: COROTRCR) |
| 0x08 | `version` | `uint32` | 4 | Protocol version number, currently 1 |
| 0x0C | `max_stations` | `uint32` | 4 | Maximum total number of Stations pre-allocated in the SHM file |
| 0x10 | `allocated_count` | `atomic<uint32>` | 4 | [Lock-Free Allocator Cursor] The target program obtains an available Station via atomic increment |
| 0x14 | `tracer_sleeping` | `atomic<uint32>` | 4 | Engine sleep flag: 0 = Active, 1 = Sleeping awaiting wakeup |
| 0x18 | `_reserved` | `char[1000]` | 1000 | Hard Padding Zone: Pad to a full 1024 bytes |
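The header table can be mirrored as a Rust struct with compile-time layout checks, which is one way to enforce the exact-size constraint. This is a sketch under assumptions: the struct and field names follow the table, while `CTP_MAGIC` and `header_is_compatible` are hypothetical helpers introduced here for illustration.

```rust
use std::mem::{align_of, size_of};
use std::sync::atomic::AtomicU32;

/// The magic value from the table: the bytes of "COROTRCR" read as a
/// big-endian u64. How those bytes sit in the file depends on host endianness.
pub const CTP_MAGIC: u64 = u64::from_be_bytes(*b"COROTRCR");

#[repr(C, align(1024))]
pub struct GlobalHeader {
    pub magic_number: u64,          // 0x00
    pub version: u32,               // 0x08
    pub max_stations: u32,          // 0x0C
    pub allocated_count: AtomicU32, // 0x10
    pub tracer_sleeping: AtomicU32, // 0x14
    pub _reserved: [u8; 1000],      // 0x18, explicit padding up to 1024 bytes
}

// Compile-time checks: the build fails if the layout drifts from the table.
const _: () = assert!(size_of::<GlobalHeader>() == 1024);
const _: () = assert!(align_of::<GlobalHeader>() == 1024);
const _: () = assert!(CTP_MAGIC == 0x434F_524F_5452_4352);

/// Hypothetical handshake check: accept only the documented magic and version 1.
pub fn header_is_compatible(header: &GlobalHeader) -> bool {
    header.magic_number == CTP_MAGIC && header.version == 1
}
```

Placing the assertions in `const` items means a mismatched layout is caught at build time rather than surfacing as corrupted reads at run time.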
Epoch Alignment Requirement: 64 Bytes ( `alignas(64)` )
Responsibility: Records a snapshot of a single coroutine state transition.
Design Philosophy: 64 bytes matches the cache line size of most modern CPUs. Each Epoch therefore occupies its own cache line, so threads that concurrently write to different Epochs never contend for the same line, which eliminates the drastic performance drops caused by False Sharing.
| Offset | Field | Type | Bytes | Description |
|---|---|---|---|---|
| 0x00 | `timestamp` | `uint64` | 8 | Nanosecond-level timestamp (e.g., `clock_gettime(CLOCK_MONOTONIC)`) |
| 0x08 | `tid` | `uint64` | 8 | Real OS thread ID (not a high-level language level ID) |
| 0x10 | `addr` | `uint64` | 8 | Instruction address or coroutine heap frame pointer upon suspension/resumption |
| 0x18 | `seq` | `atomic<uint64>` | 8 | [Core Concurrency Barrier] Monotonically increasing sequence number. Used for read/write barriers |
| 0x20 | `reserved` | `char[31]` | 31 | Reserved space (can be used to store a small amount of business Payload) |
| 0x3F | `is_active` | `bool (uint8)` | 1 | State machine flag: 1 = Active (Running), 0 = Suspend (Suspended) |
StationData Alignment Requirement: 1024 Bytes ( `alignas(1024)` )
Responsibility: Each coroutine instance exclusively occupies one Station throughout its entire lifecycle.
| Offset | Zone | Bytes | Description |
|---|---|---|---|
| 0x000 | `Header.probe_id` | 8 | Probe globally unique ID (recommended to use the memory address at coroutine creation) |
| 0x008 | `Header.birth_ts` | 8 | Nanosecond timestamp of coroutine birth |
| 0x010 | `Header.is_dead` | 1 | Whether the coroutine has finished destruction (1 = Dead) |
| 0x011 | `Header._pad` | 47 | Pad to 64-byte alignment |
| 0x040 | `Slots[8]` | 512 | Event Polling Buffer (RingBuffer): 8 Epochs, totaling 512 Bytes |
| 0x240 | `Flexible` | 448 | Hard Padding Zone: Pad to a full 1024 bytes |
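Given the table above, the position of an Epoch inside its Station depends only on the sequence number. A small sketch of that calculation, assuming the 64-byte station header and 64-byte Epochs listed in the tables (the constant and function names are illustrative):

```rust
/// Bytes occupied by the Station header fields plus `_pad` (0x000..0x040).
pub const STATION_HEADER_SIZE: usize = 64;
/// Size of one Epoch slot.
pub const EPOCH_SIZE: usize = 64;
/// Number of Epoch slots in the ring buffer.
pub const SLOTS_PER_STATION: u64 = 8;

/// Offset, relative to the start of a Station, of the slot that holds `seq`.
/// The ring wraps every 8 events, so seq 8 lands on the same slot as seq 0.
pub const fn slot_offset(seq: u64) -> usize {
    STATION_HEADER_SIZE + (seq % SLOTS_PER_STATION) as usize * EPOCH_SIZE
}
```

The last slot (index 7) starts at 0x200 and ends at 0x240, exactly where the `Flexible` padding zone begins.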
cTP completely abandons Mutex and SpinLock, relying solely on hardware-level memory barriers. Implementing this protocol must comply with the following read/write contract:
- O(1) Lock-Free Allocation: When a new coroutine is born, execute `index = fetch_add(&GlobalHeader.allocated_count, 1, std::memory_order_relaxed)`. If `index < max_stations`, exclusively occupy `StationData[index]`.
- Circular Write (Ring Buffer): Upon context switch, obtain the auto-incremented sequence number `seq`. Locate the slot: `slot = Station.Slots[seq % 8]`.
- Memory Barrier [Fatal Constraint]: The probe must first write ordinary data such as `timestamp`, `tid`, `addr`, `is_active`. As the final step, it must update `seq` using Release semantics: `slot.seq.store(current_seq, std::memory_order_release);`. This ensures that when the Go engine observes the updated `seq` through an Acquire load, every preceding write to the slot is already visible to it, so it never pairs a new `seq` with stale field values.
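The write-side rules above can be sketched in Rust as follows. This is an illustration under assumptions, not the SDK's code: `claim_station`, `publish`, and `Epoch::empty` are hypothetical names, the slots live in ordinary process memory rather than an mmap'd file (a real probe would write through pointers into the mapping), and sequence numbers are assumed to start at 1 so that a zeroed slot reads as "never written".

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};

/// Same 64-byte layout as the Epoch table in this document.
#[repr(C, align(64))]
pub struct Epoch {
    pub timestamp: u64,
    pub tid: u64,
    pub addr: u64,
    pub seq: AtomicU64,
    pub reserved: [u8; 31],
    pub is_active: bool,
}

impl Epoch {
    /// Hypothetical helper: an all-zero slot, as in a freshly created file.
    pub fn empty() -> Self {
        Epoch { timestamp: 0, tid: 0, addr: 0, seq: AtomicU64::new(0), reserved: [0; 31], is_active: false }
    }
}

/// Lock-free allocation: one relaxed fetch_add hands out station indices.
/// Returns None once all pre-allocated stations are taken.
pub fn claim_station(allocated_count: &AtomicU32, max_stations: u32) -> Option<u32> {
    let index = allocated_count.fetch_add(1, Ordering::Relaxed);
    if index < max_stations { Some(index) } else { None }
}

/// Circular write: fill the plain fields first, then publish `seq` last
/// with Release ordering so a reader's Acquire load sees the fields too.
pub fn publish(slots: &mut [Epoch; 8], seq: u64, timestamp: u64, tid: u64, addr: u64, is_active: bool) {
    let slot = &mut slots[(seq % 8) as usize];
    slot.timestamp = timestamp;
    slot.tid = tid;
    slot.addr = addr;
    slot.is_active = is_active;
    slot.seq.store(seq, Ordering::Release); // must be the final write
}
```

Allocation can stay `Relaxed` because it only hands out a unique index; the ordering that matters for readers is carried by the final Release store on `seq`.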
- Local Snapshot: The Go engine maintains a `last_seen_seqs[MAX_STATIONS][8]` array locally.
- Safe Read (Acquire): When polling `seq`, atomic loading must be used: `currentSeq := atomic.LoadUint64(&slot.Seq) // Inherently carries an Acquire barrier by default`
- Data Extraction: If `currentSeq > last_seen_seqs`, extract the data of the current slot, and upon completion, update the local `last_seen_seqs`.
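The read-side rules can be illustrated with a single-station harvest pass. The engine itself is written in Go; the sketch below uses Rust only to stay consistent with the other examples, and `Event`, `harvest`, and `Epoch::empty` are hypothetical names. It assumes `last_seen` starts at zero and that sequence numbers start at 1.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Same 64-byte layout as the Epoch table in this document.
#[repr(C, align(64))]
pub struct Epoch {
    pub timestamp: u64,
    pub tid: u64,
    pub addr: u64,
    pub seq: AtomicU64,
    pub reserved: [u8; 31],
    pub is_active: bool,
}

impl Epoch {
    /// Hypothetical helper: an all-zero slot, as in a freshly created file.
    pub fn empty() -> Self {
        Epoch { timestamp: 0, tid: 0, addr: 0, seq: AtomicU64::new(0), reserved: [0; 31], is_active: false }
    }
}

/// Copy of one harvested slot (illustrative type, not part of the protocol).
#[derive(Debug, PartialEq)]
pub struct Event {
    pub seq: u64,
    pub timestamp: u64,
    pub tid: u64,
    pub addr: u64,
    pub is_active: bool,
}

/// One polling pass over a station. `last_seen` is the per-slot local snapshot.
/// Events are returned in slot order, not in sequence order.
pub fn harvest(slots: &[Epoch; 8], last_seen: &mut [u64; 8]) -> Vec<Event> {
    let mut events = Vec::new();
    for (i, slot) in slots.iter().enumerate() {
        // Acquire pairs with the probe's Release store on `seq`.
        let current = slot.seq.load(Ordering::Acquire);
        if current > last_seen[i] {
            events.push(Event {
                seq: current,
                timestamp: slot.timestamp,
                tid: slot.tid,
                addr: slot.addr,
                is_active: slot.is_active,
            });
            last_seen[i] = current; // update the snapshot after extraction
        }
    }
    events
}
```

A second pass over unchanged slots returns nothing, because every `seq` now equals its entry in `last_seen`.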
To prevent the Go engine from busy-waiting and burning CPU while the target program is idle, a UDS wakeup mechanism is introduced:
- After N consecutive harvests with no data, the Go engine sets `GlobalHeader.tracer_sleeping` to `1`, and subsequently blocks reading the UDS (Unix Domain Socket).
- After writing data, if the C++ probe detects `tracer_sleeping == 1`, it sends a single-byte signal `'1'` to the UDS (using non-blocking `O_NONBLOCK` write; failures are directly ignored, absolutely never blocking the target program).
- Upon receiving the signal, the Go engine is instantly awakened by the kernel, resets `tracer_sleeping` to `0`, and enters the next round of frantic harvesting.
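The wakeup handshake can be sketched with the Rust standard library on a Unix target. Everything here is an assumption for illustration: the source does not say whether the UDS is a stream or datagram socket, so a `UnixStream` is used as a stand-in, and `notify_if_sleeping` and `sleep_until_woken` are hypothetical names. The probe's socket is expected to have been put in non-blocking mode by the caller.

```rust
use std::io::{self, Read, Write};
use std::os::unix::net::UnixStream;
use std::sync::atomic::{AtomicU32, Ordering};

/// Probe side: after writing an event, poke the engine only if it sleeps.
/// `sock` should already be non-blocking; any write error is ignored so
/// the target program is never stalled by the tracer.
pub fn notify_if_sleeping(tracer_sleeping: &AtomicU32, sock: &mut UnixStream) {
    if tracer_sleeping.load(Ordering::Acquire) == 1 {
        let _ = sock.write(b"1"); // best effort, failures deliberately dropped
    }
}

/// Engine side: announce sleep, block until one byte arrives, then mark
/// the engine active again before the next harvest round.
pub fn sleep_until_woken(tracer_sleeping: &AtomicU32, sock: &mut UnixStream) -> io::Result<()> {
    tracer_sleeping.store(1, Ordering::Release);
    let mut byte = [0u8; 1];
    sock.read_exact(&mut byte)?; // blocks in the kernel until the probe writes
    tracer_sleeping.store(0, Ordering::Release);
    Ok(())
}
```

In a single process the pair can be exercised with `UnixStream::pair()`, calling `set_nonblocking(true)` on the probe end before the first notification.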
Note: The repository now ships a framework-free Rust poll-model SDK under `SDK/rust`, aiming to keep the integration as close as possible to the small change surface of the C++ SDK. Other languages are still pending (e.g., Zig is currently unstable).
In Rust, `#[repr(C)]` and `#[repr(align(X))]` must be strictly used.
use std::sync::atomic::{AtomicU64, AtomicU32};

// One 64-byte event slot; field order matches the Epoch table.
#[repr(C, align(64))]
pub struct Epoch {
    pub timestamp: u64,
    pub tid: u64,
    pub addr: u64,
    pub seq: AtomicU64,
    pub reserved: [u8; 31],
    pub is_active: bool,
}

// One 1024-byte station: 64-byte header, 8 Epoch slots, explicit padding.
#[repr(C, align(1024))]
pub struct StationData {
    pub probe_id: u64,
    pub birth_ts: u64,
    pub is_dead: bool,
    pub _pad: [u8; 47],
    pub slots: [Epoch; 8],
    pub flexible: [u8; 448],
}