kafka: various test fixes #4301

Open
mmatczuk wants to merge 10 commits into main from mmt/kafka-fixes-v2

@mmatczuk
Contributor

Commits

  • fix data race on msgChan in connectExplicitTopics
  • adapt soak test duration to test deadline
  • use 127.0.0.1 instead of localhost in test broker address
  • unnest range_of_partitions from explicit_partitions test group
  • use lighter suite for partition-specific test groups
  • reduce partition cache ordering test batch count
  • use lighter suite for unordered partition-specific test groups
  • parallelize manual_partitioner subtest and raise package timeout
  • use lighter suite for unordered SASL test
  • shorten soak test default duration to 1 minute

Jira

  • CON-407
  • CON-413
  • CON-425
  • CON-426
  • CON-427
  • CON-431
  • CON-439
  • CON-440
  • CON-443
  • CON-448

mmatczuk added 10 commits April 20, 2026 15:26
fix data race on msgChan in connectExplicitTopics

Partition consumer goroutines were spawned before k.msgChan was assigned,
causing a data race between the goroutine reads (line 104) and the
subsequent write (line 304). Move the assignment before the spawn loop.

Fixes CON-407
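
The ordering this commit describes can be illustrated with a minimal sketch. This is not the actual connectExplicitTopics code: the `reader` type and `consumePartitions` method are hypothetical stand-ins, and only the `msgChan` field name comes from the commit message. The point is that the field is written before any `go` statement, so every goroutine observes the assigned channel.

```go
package main

import (
	"fmt"
	"sync"
)

// reader is a hypothetical stand-in for the input that owns msgChan.
type reader struct {
	msgChan chan string
}

// consumePartitions assigns msgChan before spawning any goroutine.
// A go statement is synchronized before the start of the goroutine it
// launches, so the field write is visible to every consumer below.
func (r *reader) consumePartitions(partitions []int32) []string {
	// Assign the shared channel first. Doing this after the loop would
	// race with the field reads inside the goroutines.
	r.msgChan = make(chan string)

	var wg sync.WaitGroup
	for _, p := range partitions {
		wg.Add(1)
		go func(p int32) {
			defer wg.Done()
			// Reads r.msgChan, which is already assigned.
			r.msgChan <- fmt.Sprintf("partition-%d", p)
		}(p)
	}

	// Close the channel once all partition goroutines have sent.
	go func() {
		wg.Wait()
		close(r.msgChan)
	}()

	var got []string
	for m := range r.msgChan {
		got = append(got, m)
	}
	return got
}

func main() {
	r := &reader{}
	fmt.Println(len(r.consumePartitions([]int32{0, 1, 2}))) // 3
}
```

Running a variant that moves the assignment below the loop under `go test -race` (or `go run -race`) is one way to reproduce the kind of report this commit addresses.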

adapt soak test duration to test deadline

TestRedpandaRecordOrderSoakTest had a hardcoded 3-minute soak duration
but didn't account for time already consumed by other tests in the
package. When run with a 5-minute timeout alongside other kafka
integration tests, the test was killed before it could finish.

The soak duration now adapts to the remaining time budget from
t.Deadline(), reserving 2 minutes for container setup/teardown. If
insufficient time remains (<30s for soaking), the test skips with a
clear message.

Fixes CON-413
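
The budget arithmetic described above can be sketched as a small pure function. This is an assumption-laden illustration rather than the code in the PR: `soakDuration` and its parameter names are hypothetical, and the caller is assumed to derive `remaining` from `t.Deadline()` (for example with `time.Until(deadline)`), then call `t.Skip` when the second return value is false.

```go
package main

import (
	"fmt"
	"time"
)

// soakDuration (hypothetical helper) returns how long to soak given the
// time remaining until the test deadline.
//
//	remaining: time left before the deadline
//	want:      the preferred soak duration
//	reserve:   time held back for container setup/teardown
//	min:       below this soak budget the test should skip
//
// The boolean is false when the test should skip.
func soakDuration(remaining, want, reserve, min time.Duration) (time.Duration, bool) {
	budget := remaining - reserve
	if budget < min {
		return 0, false
	}
	if budget < want {
		return budget, true
	}
	return want, true
}

func main() {
	// 4 minutes left, 2 reserved: soak is capped at the 2-minute budget.
	d, ok := soakDuration(4*time.Minute, 3*time.Minute, 2*time.Minute, 30*time.Second)
	fmt.Println(d, ok) // 2m0s true

	// 2m10s left, 2 reserved: only 10s of budget, below the 30s minimum.
	d, ok = soakDuration(2*time.Minute+10*time.Second, 3*time.Minute, 2*time.Minute, 30*time.Second)
	fmt.Println(d, ok) // 0s false
}
```

When `t.Deadline()` reports no deadline (its second return value is false), a caller would presumably use the preferred duration unchanged.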

use 127.0.0.1 instead of localhost in test broker address

KafkaSeedBroker() returns localhost:<port> which fails when the system
DNS resolver cannot resolve "localhost" (e.g. under Docker networking
issues). Use 127.0.0.1 to bypass DNS resolution entirely.

Fixes CON-425
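
One way to express the address rewrite is shown below. This is a sketch under stated assumptions, not the helper added in this PR: `loopbackAddr` is a hypothetical name, and it assumes the broker address has the `host:port` form that the commit message describes.

```go
package main

import (
	"fmt"
	"net"
)

// loopbackAddr (hypothetical helper) rewrites "localhost:<port>" to
// "127.0.0.1:<port>" so that dialing never depends on name resolution.
// Addresses with any other host are returned unchanged.
func loopbackAddr(addr string) (string, error) {
	host, port, err := net.SplitHostPort(addr)
	if err != nil {
		return "", err
	}
	if host == "localhost" {
		host = "127.0.0.1"
	}
	return net.JoinHostPort(host, port), nil
}

func main() {
	addr, err := loopbackAddr("localhost:9092")
	fmt.Println(addr, err) // 127.0.0.1:9092 <nil>
}
```

Using `net.SplitHostPort` and `net.JoinHostPort` instead of string replacement avoids accidentally rewriting a hostname that merely contains "localhost".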

unnest range_of_partitions from explicit_partitions test group

range_of_partitions was nested inside explicit_partitions as a child
t.Run. Go's test runner releases PAUSEd subtests only when the parent
function returns, so the direct explicit_partitions subtests were blocked
until range_of_partitions completed (~19s), leaving them with a tight
remaining deadline that caused flaky 10s timeouts.

Moving range_of_partitions to a sibling t.Run ensures both groups get
their own share of the parent deadline.

Fixes CON-426

use lighter suite for partition-specific test groups

TestRedpandaIntegration runs 5 sequential suite groups that each include
1000-message sequential and parallel tests. Later groups (especially
manual_partitioner) start with a nearly exhausted parent test deadline
because preceding groups consume most of the available time.

Partition-specific groups only need to validate routing correctness, not
throughput. Reduce their message count from 1000 to 100, keeping the
top-level suite at 1000 for thorough coverage.

Fixes CON-427

reduce partition cache ordering test batch count

Lower batches from 1M to 100K (still 1M messages) so the test
completes in ~2s instead of ~20s, avoiding timeout when CPU-starved
by long-running integration tests in the same package.

Fixes CON-431

use lighter suite for unordered partition-specific test groups

TestIntegrationUnordered runs 4 sequential suite groups that each include
1000-message sequential and parallel tests. Later groups (especially
manual_partitioner) start with a nearly exhausted parent test deadline
because preceding groups consume most of the available time, causing
the redpanda container to become unresponsive under sustained load.

Partition-specific groups only need to validate routing correctness,
not throughput. Reduce their message count from 1000 to 100, keeping
the top-level suite at 1000 for thorough coverage. Mirrors the same
fix applied to TestRedpandaIntegration in edb48668f.

Fixes CON-439

parallelize manual_partitioner subtest and raise package timeout

TestIntegrationUnordered's manual_partitioner group did not call
t.Parallel() like its sibling subtests, so it ran sequentially after
the top-level suite, adding to the total elapsed time of the kafka
integration package. The go test binary-level 5-minute timeout would
fire before TestIntegrationUnordered could even complete its pre-test
topic creation.

Call t.Parallel() in manual_partitioner to match the other groups,
and bump the kafka package timeout to 10m in packages.json, matching
other heavy integration packages (redpanda, mssqlserver, etc.).

Fixes CON-443

use lighter suite for unordered SASL test

TestIntegrationUnorderedSasl's stream/parallel groups each ran 1000
messages, the same as the non-SASL TestIntegrationUnordered. The SASL
variant only validates authentication wiring, not throughput, so the
heavy message counts just consumed the parent package budget with no
additional coverage. When other tests in the kafka package run first
under the package timeout, the SASL subtests were left with no time
to even connect to the broker.

Drop the message count from 1000 to 100 to match the pattern used by
the partition-specific groups in TestRedpandaIntegration (commit
edb48668f). Standalone runtime drops from ~18s to ~6s.

Fixes CON-440

shorten soak test default duration to 1 minute

TestRedpandaRecordOrderSoakTest used to consume ~3 minutes of soak
(plus 2 minutes of setup/teardown budget) per run, dominating the
kafka package runtime. When it ran before TestRedpandaRecordOrderIntegration
under the package timeout, the latter was starved and killed by the
package-level test timeout before it could complete.

The chaos scenarios (leadership transfer + broker restarts) produce
ample coverage within 1 minute, and the comment at the top of the
test already notes that long soaks are meant to be run explicitly
via `-count 1000` overnight. Standalone runtime drops from ~184s to
~72s.

Fixes CON-448

claude Bot commented Apr 20, 2026

Commits
LGTM

Review
Bundle of test-stability fixes for the kafka integration package: real data race fix in connectExplicitTopics, adaptive soak-test duration, lighter suites for partition/SASL subtests, test-group restructuring, container-addr helper, and a package timeout bump. All changes are well-motivated by the commit messages and look sound.

LGTM
