Fix Waterdrop background thread leak causing flaky Rails.logger mock failures #27807
Conversation
11 successful. Running 20 more.
1 failure, but not waterdrop-related: I'll keep rerunning. This was on the flaky spec doc already 👍
Failed on the 21st run with Rerunning again. Edit: I saw this on
The waterdrop error was not seen in any of the 60 CI runs I did. The errors we did hit were known, so this doesn't appear to introduce any new errors.
Pull request overview
Fixes a CI-flake in the Kafka/WaterDrop specs by ensuring the real Rdkafka::Producer created during a non-test-environment spec is properly shut down, preventing its background polling thread from leaking into later test examples.
Changes:
- Close the Kafka::ProducerManager WaterDrop producer in the spec after hook before resetting the singleton.
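The change listed above comes down to teardown order: close the producer, then reset the singleton. Below is a minimal sketch in plain Ruby. FakeProducer, FakeProducerManager, and reset_producer_manager are hypothetical stand-ins, not the real Kafka::ProducerManager or WaterDrop classes; only Singleton.__init__ is the same stdlib call the spec uses.

```ruby
require 'singleton'

# Hypothetical stand-in for a producer that owns a background polling thread.
class FakeProducer
  def initialize
    @running = true
    @thread = Thread.new { sleep 0.01 while @running }
  end

  # Stop the polling loop and wait for the thread to exit.
  def close
    @running = false
    @thread.join
  end

  def polling?
    @thread.alive?
  end
end

# Hypothetical stand-in for the singleton that holds the producer.
class FakeProducerManager
  include Singleton

  attr_reader :producer

  def initialize
    @producer = FakeProducer.new
  end
end

# Teardown in the order the PR describes: close first, then reset.
# Resetting alone would drop the reference while the thread keeps running.
def reset_producer_manager
  FakeProducerManager.instance.producer.close
  Singleton.__init__(FakeProducerManager)
end

old_producer = FakeProducerManager.instance.producer
reset_producer_manager
puts old_producer.polling? # prints false
```

If the close call were skipped, old_producer.polling? would still be true after the reset, which is the leak in miniature.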
Summary
Bug investigated and fixed by AI
Several CI test groups have been experiencing intermittent failures with this pattern:
The failures were appearing in unrelated specs across multiple files (appointment_service_spec.rb, vnp_veteran_spec.rb, immunizations_spec.rb, etc.) — none of which have anything to do with Kafka.
Root Cause
spec/lib/kafka/avro_producer_spec.rb has a context that tests behavior in non-test environments. To do this, it stubs Rails.env.to_s to return 'development' and calls Singleton.__init__(Kafka::ProducerManager) to re-initialize the singleton. This causes ProducerManager to boot with a real Rdkafka::Producer and register the error.occurred callback that calls Rails.logger.error.
The after block then calls Singleton.__init__ again to reset the singleton reference, but never closes the producer first. The Rdkafka native polling thread keeps running in the background after the spec finishes. In CI, there is no reachable Kafka broker, so rd_kafka_poll immediately and repeatedly generates a connection error, which fires:
This asynchronous call lands on whatever Rails.logger mock happens to be active in a later test group. Any spec using a strict expect(Rails.logger).to receive(:error).with(...) is vulnerable: the WaterDrop noise consumes the expectation before the intended log message arrives.
Fix
Call producer.close before resetting the singleton. WaterDrop::Producer#close flushes pending messages, closes the rdkafka handle, and stops the native polling thread. The leak stops entirely.
Related issue(s)
None. Noticed while on support.
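To make the root cause described earlier concrete, here is a simplified model of a leaked background thread hitting a strict logger expectation in a later, unrelated spec. Everything in it is an assumption for illustration: App.logger stands in for Rails.logger, StrictLoggerMock is a toy approximation of a strict message expectation (not RSpec's implementation), and the message strings are placeholders rather than real rdkafka output.

```ruby
# Hypothetical stand-in for Rails.logger: a swappable, process-wide logger.
module App
  class << self
    attr_accessor :logger
  end
end

class NullLogger
  def error(_message); end
end

# Toy strict mock: records every call and flags the first message
# that differs from the expected one.
class StrictLoggerMock
  attr_reader :received, :failure

  def initialize(expected)
    @expected = expected
    @received = []
    @failure = nil
    @lock = Mutex.new
  end

  def error(message)
    @lock.synchronize do
      @received << message
      @failure ||= "unexpected error log: #{message}" unless message == @expected
    end
  end

  def stray_seen?
    @lock.synchronize { !@failure.nil? }
  end
end

App.logger = NullLogger.new

# "Earlier spec": starts a background poller and never stops it.
leaked_poller = Thread.new do
  loop do
    App.logger.error('simulated connection error') # placeholder text
    sleep 0.005
  end
end

# "Later, unrelated spec": installs a strict expectation on the shared logger.
mock = StrictLoggerMock.new('expected failure message')
App.logger = mock

sleep 0.001 until mock.stray_seen? # wait for the leaked thread to fire
App.logger.error('expected failure message')

puts mock.failure # prints "unexpected error log: simulated connection error"
leaked_poller.kill
```

The later spec did nothing wrong; it fails only because a thread it never created is still logging through the same global logger.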
Testing done
Screenshots
Note: Optional
What areas of the site does it impact?
(Describe what parts of the site are impacted and if code touched other areas)
Acceptance criteria