Commit d40777e

Antithesis Vector end-to-end ack scenario
This PR introduces a new Antithesis scenario confirming that a single Vector instance without disk buffers maintains conservation and liveness when end-to-end (e2e) acks are enabled. It is a simplification of `vector_to_vector_e2e_disk`, intended to demonstrate that e2e acks behave as expected.
1 parent 1dbdb4e commit d40777e

7 files changed

Lines changed: 338 additions & 10 deletions

tests/antithesis/AGENTS.md

Lines changed: 11 additions & 10 deletions
@@ -38,8 +38,9 @@ The fault profile is the single source of truth: change a shot's faults by editi
 ```sh
 cd tests/antithesis/scenarios
 ./launch.sh vector_to_vector_e2e_disk # 30-minute run with the pinned profile
-DURATION=60 ./launch.sh vector_to_vector_e2e_disk # override duration (minutes)
-DRY_RUN=1 ./launch.sh vector_to_vector_e2e_disk # print the exact command, submit nothing
+./launch.sh vector_e2e # the no-disk, single-node counterpart
+DURATION=60 ./launch.sh vector_e2e # override duration (minutes)
+DRY_RUN=1 ./launch.sh vector_e2e # print the exact command, submit nothing
 ```

 The launcher reads tenant and registry from the environment (snouty's variables):
@@ -48,15 +49,15 @@ The launcher reads tenant and registry from the environment (snouty's variables)
 - `ANTITHESIS_API_KEY` (or `ANTITHESIS_USERNAME` + `ANTITHESIS_PASSWORD`)
 - `ANTITHESIS_REPOSITORY`

-`DESCRIPTION`, `TEST_NAME`, `FAULT_NODES`, and `WEBHOOK` are overridable; the
+`DESCRIPTION`, `TEST_NAME`, `FAULT_NODES`, and `WEBHOOK` are overridable. The
 running git commit is stamped into the description automatically so a shot records
 the code it tested. Extra snouty flags pass straight through, e.g.
-`./launch.sh vector_to_vector_e2e_disk --recipients you@example.com`.
+`./launch.sh vector_e2e --recipients you@example.com`.

 The pinned profile submits to the `persistent_storage` webhook and faults the
-scenario's SUT nodes (`head` and `tail` for the disk scenario) with node
-termination, hang, and throttle, plus `cpu_mod` and `clock_jitter`. The `oracle`
-is left out of termination and hang **only**; its obligation ledger lives in
-memory, so killing or freezing it would erase the run's source of truth. It is
-deliberately still subject to network faults so the egress delivery path is
-exercised.
+scenario's SUT nodes (`head` and `tail` for the disk scenario, `vector` for
+`vector_e2e`) with node termination, hang, and throttle, plus `cpu_mod` and
+`clock_jitter`. The `oracle` is never faulted with termination or hang; its
+obligation ledger lives in memory, so killing or freezing it would erase the run's
+source of truth. It is deliberately still subject to network faults so the egress
+delivery path is exercised.
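
The environment overrides described in this hunk follow a common shell pattern: read a variable with a default, then pass any remaining arguments through untouched. The sketch below illustrates that pattern only. `plan_run` is a hypothetical stand-in, not the real `launch.sh` (which this diff does not show), and it deliberately invents no snouty flags; only the names `DURATION` and `DRY_RUN` and the 30-minute default come from the text above.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for the launcher's argument handling. It only prints
# what it would do; the real launch.sh is not part of this diff.
plan_run() {
  local scenario="$1"
  shift
  local duration="${DURATION:-30}" # minutes, default taken from the doc comment
  local mode="submit"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    mode="dry-run"
  fi
  # Remaining arguments stand for extra flags passed straight through.
  echo "mode=$mode scenario=$scenario duration=$duration extra=$*"
}

plan_run vector_e2e
DURATION=60 DRY_RUN=1 plan_run vector_e2e --recipients you@example.com
```

Prefixing the call with `DURATION=60` scopes the override to that one invocation, which is the same shape as the documented `DURATION=60 ./launch.sh vector_e2e`.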
tests/antithesis/scenarios/vector_e2e/Dockerfile

Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
+# syntax=docker/dockerfile:1
+
+ARG SANCOV_RUSTFLAGS='["-Cpasses=sancov-module","-Cllvm-args=-sanitizer-coverage-level=3","-Cllvm-args=-sanitizer-coverage-trace-pc-guard","-Clink-args=-Wl,--build-id"]'
+
+############################
+# Vector SUT - build stage #
+############################
+FROM rust:1.92-bookworm AS vector-build
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    protobuf-compiler cmake perl build-essential pkg-config libssl-dev clang binutils \
+    && rm -rf /var/lib/apt/lists/*
+WORKDIR /src
+COPY Cargo.toml Cargo.lock rust-toolchain.toml build.rs ./
+COPY lib ./lib
+COPY src ./src
+COPY proto ./proto
+COPY benches ./benches
+COPY tests ./tests
+COPY vdev ./vdev
+COPY scripts ./scripts
+ARG SANCOV_RUSTFLAGS
+# debug=true keeps DWARF for /symbols. lto=false keeps sancov instrumentation
+# predictable and stops the optimizer from dropping the force-linked runtime.
+RUN --mount=type=cache,target=/usr/local/cargo/registry \
+    --mount=type=cache,target=/src/target \
+    cargo build --release --no-default-features \
+    --features "sources-http_server,sinks-http,sources-internal_metrics,sinks-prometheus,antithesis-scenario-memory" \
+    --bin vector \
+    --config 'build.target = "x86_64-unknown-linux-gnu"' \
+    --config 'profile.release.debug = true' \
+    --config 'profile.release.lto = false' \
+    --config "target.x86_64-unknown-linux-gnu.rustflags = ${SANCOV_RUSTFLAGS}" \
+    && cp target/x86_64-unknown-linux-gnu/release/vector /usr/local/bin/vector \
+    && echo "validating instrumentation symbols..." \
+    && nm /usr/local/bin/vector | grep -q __sanitizer_cov_trace_pc_guard \
+    && nm /usr/local/bin/vector | grep -q antithesis_load_libvoidstar \
+    && echo "instrumentation OK"
+
+##################################
+# Harness (workload) - build stage
+##################################
+# The workload binaries live in the shared `harness` crate, a member of Vector's
+# workspace, so the build needs the workspace root and member manifests. `-p`
+# compiles only the harness bins.
+FROM rust:1.92-bookworm AS workload-build
+RUN apt-get update && apt-get install -y --no-install-recommends binutils \
+    && rm -rf /var/lib/apt/lists/*
+WORKDIR /src
+COPY Cargo.toml Cargo.lock rust-toolchain.toml build.rs ./
+COPY lib ./lib
+COPY src ./src
+COPY proto ./proto
+COPY benches ./benches
+COPY tests ./tests
+COPY vdev ./vdev
+COPY scripts ./scripts
+ARG SANCOV_RUSTFLAGS
+# debug=true keeps DWARF for /symbols.
+RUN --mount=type=cache,target=/usr/local/cargo/registry \
+    --mount=type=cache,target=/src/target \
+    cargo build --release -p harness \
+    --config 'build.target = "x86_64-unknown-linux-gnu"' \
+    --config 'profile.release.debug = true' \
+    --config "target.x86_64-unknown-linux-gnu.rustflags = ${SANCOV_RUSTFLAGS}" \
+    && D=target/x86_64-unknown-linux-gnu/release \
+    && cp "$D/oracle" "$D/parallel_driver_produce" "$D/eventually_conservation" /usr/local/bin/ \
+    && echo "validating instrumentation symbols..." \
+    && nm /usr/local/bin/oracle | grep -q __sanitizer_cov_trace_pc_guard \
+    && nm /usr/local/bin/oracle | grep -q antithesis_load_libvoidstar \
+    && echo "instrumentation OK"
+
+#####################################
+# Runtime: Vector SUT (the one node) #
+#####################################
+FROM debian:stable-slim AS vector
+RUN apt-get update && apt-get install -y --no-install-recommends curl ca-certificates \
+    && rm -rf /var/lib/apt/lists/*
+COPY --from=vector-build /usr/local/bin/vector /usr/bin/vector
+# Bake the node config plus its benign alternate, which the reload fault swaps in
+# to force a sink rebuild.
+COPY tests/antithesis/scenarios/vector_e2e/vector.yaml /etc/vector/vector.yaml
+COPY tests/antithesis/scenarios/vector_e2e/vector.b.yaml /etc/vector/vector.b.yaml
+# The reload fault is an anytime_ test command that runs IN the node container.
+# The node stays running because its entrypoint is Vector, not a test command.
+COPY --chmod=755 tests/antithesis/scenarios/vector_e2e/anytime_reload.sh /opt/antithesis/test/v1/ve2e/anytime_reload
+RUN mkdir -p /symbols && ln -s /usr/bin/vector /symbols/vector
+ENV NO_COLOR=1
+EXPOSE 8080 9598
+ENTRYPOINT ["/usr/bin/vector"]
+
+###################################
+# Runtime: workload (oracle + cmds) #
+###################################
+FROM debian:stable-slim AS workload
+RUN apt-get update && apt-get install -y --no-install-recommends curl ca-certificates \
+    && rm -rf /var/lib/apt/lists/*
+# The oracle is the entrypoint. The other two binaries are the test commands:
+# drop them straight into the test template, named by their Antithesis prefix.
+COPY --from=workload-build /usr/local/bin/oracle /usr/bin/oracle
+COPY --from=workload-build /usr/local/bin/parallel_driver_produce /opt/antithesis/test/v1/ve2e/parallel_driver_produce
+COPY --from=workload-build /usr/local/bin/eventually_conservation /opt/antithesis/test/v1/ve2e/eventually_conservation
+# Symbolize all three instrumented binaries: the oracle entrypoint and both
+# test-command bins. A crash in any of them must resolve against /symbols.
+RUN mkdir -p /symbols \
+    && ln -s /usr/bin/oracle /symbols/oracle \
+    && ln -s /opt/antithesis/test/v1/ve2e/parallel_driver_produce /symbols/parallel_driver_produce \
+    && ln -s /opt/antithesis/test/v1/ve2e/eventually_conservation /symbols/eventually_conservation
+ENV NO_COLOR=1
+# The oracle waits for the SUT, emits setup_complete via the SDK, then serves so
+# Antithesis runs the test commands in this container.
+ENTRYPOINT ["/usr/bin/oracle"]
tests/antithesis/scenarios/vector_e2e/README.md

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
+# vector_e2e
+
+The no-disk counterpart of `vector_to_vector_e2e_disk`. Same two properties, one
+Vector process, memory buffer instead of `disk_v2`.
+
+**Conservation**: every event the oracle acked must eventually come back, across
+arbitrary Antithesis faults. Duplicates are allowed because the contract is
+at-least-once. A missing acked id is the bug.
+
+**Liveness**: once faults stop, the node must still accept fresh writes.
+
+## Why a single process
+
+Vector's end-to-end acknowledgements are in-process: a source holds the client's
+ack until every sink that received the event has finished. So a single node is the
+honest place to test what an e2e ack promises. The producer POSTs to the node's
+`http_server` source; the 200 comes back only once the `http` sink has delivered
+the event to the oracle and the oracle returned 2xx. That means an acked id has
+**already** reached the oracle, which is why conservation can hold even though the
+node has no disk buffer and a crash drops whatever is still in memory: those
+in-flight events were never acked, so they were never an obligation.
+
+## How it works
+
+One Vector node and one oracle container.
+
+- **vector** exposes an `http_server` source (`:8080`) and delivers over `http` to
+  the oracle through an in-memory buffer with `when_full: block` and e2e acks. It
+  also exposes Prometheus metrics (`:9598`) for the health gate, and runs the
+  reload fault: an `anytime_` command swaps `vector.yaml`/`vector.b.yaml` and sends
+  `SIGHUP`, forcing the sink to rebuild mid-run.
+- **oracle** (`:8686`) is one container that injects unique event ids at the node
+  and runs the HTTP endpoint the node's sink delivers back to.
+
+The oracle keeps its id sets in memory and Antithesis never terminates it, so the
+faults under test cannot corrupt the judge. The workload binaries (`oracle`,
+`parallel_driver_produce`, `eventually_conservation`) are the shared,
+buffer-agnostic bins from `tests/antithesis/harness`, pointed at this topology by
+the environment in `docker-compose.yaml`.
+
+## Run
+
+Validate the config locally:
+
+```bash
+cd tests/antithesis
+docker compose -f scenarios/vector_e2e/docker-compose.yaml build
+snouty validate scenarios/vector_e2e
+```
+
+Submit a run through the shared launcher, which pins the fault profile (see
+`tests/antithesis/AGENTS.md`):
+
+```bash
+cd tests/antithesis/scenarios
+./launch.sh vector_e2e
+```
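
The conservation property in this README reduces to a set check: every acked id must appear among the received ids, and duplicates are ignored. The following is a minimal sketch of that check, assuming the two id sets are available as plain text files with one id per line. The function name `missing_acked`, the file layout, and the sample ids are illustrative assumptions; the real check lives in the harness's `eventually_conservation` binary, which this diff does not show.

```bash
#!/usr/bin/env bash
set -euo pipefail

# missing_acked ACKED_FILE RECEIVED_FILE
# Prints every id in ACKED_FILE that never appears in RECEIVED_FILE.
# Both inputs are de-duplicated first, because delivery is at-least-once and a
# repeated id is not a violation.
missing_acked() {
  comm -23 <(sort -u "$1") <(sort -u "$2")
}

# Sample data in a throwaway directory (hypothetical ids).
dir="$(mktemp -d)"
printf '%s\n' id-1 id-2 id-3 >"$dir/acked"
# id-2 arrives twice (allowed); id-9 arrived but was never acked, so it was
# never an obligation.
printf '%s\n' id-2 id-1 id-2 id-9 >"$dir/received"

missing="$(missing_acked "$dir/acked" "$dir/received")"
if [ -z "$missing" ]; then
  echo "conservation holds"
else
  echo "conservation violated, missing: $missing"
fi
rm -rf "$dir"
```

With the sample data, `id-3` was acked but never received, so the sketch reports it as missing. The extra `id-9` does not matter, because only acked ids are obligations.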
tests/antithesis/scenarios/vector_e2e/anytime_reload.sh

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+#!/usr/bin/env bash
+
+set -euo pipefail
+[ -n "${VECTOR_CONFIG_ALT:-}" ] || exit 0
+cfg="${VECTOR_CONFIG:?}"
+alt="${VECTOR_CONFIG_ALT:?}"
+
+# Vector only ever reads $cfg, so reload alternates $cfg between two immutable
+# sources rather than swapping two live files. The alternate $alt is never
+# written, and the baseline (the original $cfg) is snapshotted once, so the only
+# mutable file is $cfg and the only writes to it are a single rename of a fully
+# written temp. The node-termination fault can therefore interrupt this script at
+# any point and leave $cfg as one complete config or the other, never half-written
+# and never collapsed so both sources hold the same content. Alternation always
+# resumes on the next invocation.
+base="$cfg.orig"
+if [ ! -f "$base" ]; then
+  cp "$cfg" "$base.tmp"
+  mv "$base.tmp" "$base"
+fi
+
+# Pick whichever source is not currently live. cksum reads from stdin so its
+# output is the checksum alone, with no filename to differ on.
+if [ "$(cksum <"$cfg")" = "$(cksum <"$alt")" ]; then
+  next="$base"
+else
+  next="$alt"
+fi
+cp "$next" "$cfg.tmp"
+mv "$cfg.tmp" "$cfg"
+
+# SIGHUP to PID 1 (Vector, or the init that forwards to it) triggers reload-from-disk.
+kill -HUP 1
+sleep 5
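
The alternation logic in this script can be exercised without a running Vector. The sketch below lifts the snapshot-and-swap steps into a hypothetical function, `swap_config`, and runs it four times against placeholder files in a temp directory. It omits the `kill -HUP 1` and `sleep` steps, and the file contents are stand-ins rather than real Vector configs.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper mirroring the script's swap step: copy the non-live
# source to a temp file, then rename it over the live config.
swap_config() {
  local cfg="$1" alt="$2" base="$1.orig" next
  # Snapshot the baseline once, through a temp file, so it is never half-written.
  if [ ! -f "$base" ]; then
    cp "$cfg" "$base.tmp"
    mv "$base.tmp" "$base"
  fi
  # If the live config matches the alternate, go back to the baseline.
  if [ "$(cksum <"$cfg")" = "$(cksum <"$alt")" ]; then
    next="$base"
  else
    next="$alt"
  fi
  cp "$next" "$cfg.tmp"
  mv "$cfg.tmp" "$cfg"
}

# Demo in a throwaway directory; names and contents are placeholders.
dir="$(mktemp -d)"
printf 'config: a\n' >"$dir/vector.yaml"
printf 'config: b\n' >"$dir/vector.b.yaml"

seen=""
for round in 1 2 3 4; do
  swap_config "$dir/vector.yaml" "$dir/vector.b.yaml"
  seen+="$(cat "$dir/vector.yaml");"
done
echo "rounds=$round live-config history: $seen"
rm -rf "$dir"
```

The live file flips b, a, b, a across the four rounds, while the alternate and the `.orig` snapshot are only ever read. Because the temp file sits next to the live config, the final `mv` is a rename within one directory, which is the property the script's comment relies on for crash safety.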
tests/antithesis/scenarios/vector_e2e/docker-compose.yaml

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
+name: vector-e2e
+
+x-vector-build: &vector-build
+  context: ../../../..
+  dockerfile: tests/antithesis/scenarios/vector_e2e/Dockerfile
+  target: vector
+
+x-node-health: &node-health
+  test: ["CMD", "curl", "-fsS", "http://localhost:9598/metrics"]
+  interval: 5s
+  timeout: 3s
+  retries: 30
+  start_period: 10s
+
+services:
+  vector:
+    container_name: vector
+    hostname: vector
+    platform: linux/amd64
+    init: true
+    build: *vector-build
+    image: ve2e-vector:${ANTITHESIS_IMAGE_TAG:-dev}
+    entrypoint: ["/usr/bin/vector", "--config", "/etc/vector/vector.yaml"]
+    # vector runs the reload fault: VECTOR_CONFIG_ALT lets anytime_reload swap
+    # configs and SIGHUP, forcing the sink to rebuild. No disk buffer, so no volume.
+    environment:
+      NO_COLOR: "1"
+      VECTOR_CONFIG: "/etc/vector/vector.yaml"
+      VECTOR_CONFIG_ALT: "/etc/vector/vector.b.yaml"
+    healthcheck: *node-health
+
+  oracle:
+    container_name: oracle
+    hostname: oracle
+    platform: linux/amd64
+    init: true
+    build:
+      context: ../../../..
+      dockerfile: tests/antithesis/scenarios/vector_e2e/Dockerfile
+      target: workload
+    image: ve2e-oracle:${ANTITHESIS_IMAGE_TAG:-dev}
+    environment:
+      NO_COLOR: "1"
+      SCENARIO_NAME: "vector_e2e"
+      VECTOR_SOURCE_URL: "http://vector:8080/"
+      VECTOR_METRICS_URL: "http://vector:9598/metrics"
+      VECTOR_METRICS_URLS: "http://vector:9598/metrics"
+      # Test commands run in this container, so they reach the oracle locally.
+      ORACLE_URL: "http://127.0.0.1:8686"
+    depends_on:
+      vector: { condition: service_healthy }
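
The `x-node-health` block gates the oracle on the node's metrics endpoint: probe, wait, and retry up to a limit. A simplified sketch of that probe-and-retry loop follows. `wait_until` and `fake_probe` are hypothetical names, and the loop is a plain bounded retry, not a reimplementation of Compose's health state machine (which also tracks `timeout` and `start_period`).

```bash
#!/usr/bin/env bash
set -euo pipefail

# wait_until RETRIES INTERVAL_SECS COMMAND...
# Runs COMMAND (output discarded) until it succeeds, at most RETRIES times,
# sleeping INTERVAL_SECS between tries. Prints the number of the successful
# attempt, or returns 1 if every attempt fails.
wait_until() {
  local retries="$1" interval="$2" attempt
  shift 2
  for ((attempt = 1; attempt <= retries; attempt++)); do
    if "$@" >/dev/null 2>&1; then
      echo "$attempt"
      return 0
    fi
    sleep "$interval"
  done
  return 1
}

# Stand-in probe that fails twice and then succeeds, so the sketch runs offline.
# Against a live node the probe would be the compose file's own check:
#   wait_until 30 5 curl -fsS http://localhost:9598/metrics
state="$(mktemp)"
echo 0 >"$state"
fake_probe() {
  local n
  n=$(($(cat "$state") + 1))
  echo "$n" >"$state"
  [ "$n" -ge 3 ]
}

attempts="$(wait_until 30 0 fake_probe)"
echo "healthy after $attempts attempts"
rm -f "$state"
```

Passing the probe as trailing arguments keeps the retry loop independent of what is being checked, so the same function works for the offline stand-in and for a real `curl` call.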
tests/antithesis/scenarios/vector_e2e/vector.b.yaml

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+sources:
+  in:
+    type: http_server
+    address: 0.0.0.0:8080
+    decoding:
+      codec: json
+    acknowledgements:
+      enabled: true
+
+  metrics:
+    type: internal_metrics
+    scrape_interval_secs: 1
+
+sinks:
+  out:
+    type: http
+    inputs: [in]
+    uri: http://oracle:8686/ingest
+    method: post
+    encoding:
+      codec: json
+    # Benign alternate the reload fault swaps in. It differs from vector.yaml only
+    # by an explicit request timeout, enough to make the reload rebuild the sink.
+    request:
+      timeout_secs: 45
+    buffer:
+      type: memory
+      max_events: 500
+      when_full: block
+    acknowledgements:
+      enabled: true
+
+  prom:
+    type: prometheus_exporter
+    inputs: [metrics]
+    address: 0.0.0.0:9598
tests/antithesis/scenarios/vector_e2e/vector.yaml

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
+sources:
+  in:
+    type: http_server
+    address: 0.0.0.0:8080
+    decoding:
+      codec: json
+    acknowledgements:
+      enabled: true
+
+  metrics:
+    type: internal_metrics
+    scrape_interval_secs: 1
+
+sinks:
+  out:
+    type: http
+    inputs: [in]
+    uri: http://oracle:8686/ingest
+    method: post
+    encoding:
+      codec: json
+    # Memory buffer, no disk: this is the no-disk counterpart of the disk scenario.
+    # when_full: block keeps the same backpressure so the source applies it to the
+    # client instead of dropping. With end-to-end acks the 200 to the producer
+    # fires only once this sink has delivered to the oracle, so an acked event has
+    # already arrived, which is why conservation can hold even though a crash loses
+    # whatever is still in this in-memory buffer.
+    buffer:
+      type: memory
+      max_events: 500
+      when_full: block
+    acknowledgements:
+      enabled: true
+
+  prom:
+    type: prometheus_exporter
+    inputs: [metrics]
+    address: 0.0.0.0:9598
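
The comment in this config implies a rule for the producer side: an id becomes an obligation only when its POST returns 200. A timeout, a dropped connection, or a 5xx means the event may be lost without violating conservation. The sketch below applies that rule to a list of `id status` pairs. `acked_ids` and the sample data are hypothetical; the real producer is the harness's `parallel_driver_produce` binary, which this diff does not show.

```bash
#!/usr/bin/env bash
set -euo pipefail

# acked_ids: read "id status" pairs on stdin and print only the ids whose
# status is 200. Everything else (5xx, or 000 as curl's %{http_code} reports
# when no response arrived) is not an obligation.
acked_ids() {
  local id status
  while read -r id status; do
    if [ "$status" = "200" ]; then
      echo "$id"
    fi
  done
}

# Hypothetical outcomes of four POSTs to the http_server source.
ledger="$(printf '%s\n' 'id-1 200' 'id-2 503' 'id-3 000' 'id-4 200' | acked_ids)"
echo "obligations:"
echo "$ledger"
```

Only `id-1` and `id-4` enter the ledger. If the node crashed while holding `id-2` or `id-3` in its memory buffer, the conservation check would have nothing to complain about.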
