Skip to content

Fix Waterdrop background thread leak causing flaky Rails.logger mock failures#27807

Merged
rmtolmach merged 2 commits intomasterfrom
waterdrop_error
Apr 21, 2026
Merged

Fix Waterdrop background thread leak causing flaky Rails.logger mock failures#27807
rmtolmach merged 2 commits intomasterfrom
waterdrop_error

Conversation

@rmtolmach
Copy link
Copy Markdown
Contributor

@rmtolmach rmtolmach commented Apr 16, 2026

Summary

Bug investigated and fixed by AI
Several CI test groups have been experiencing intermittent failures with this pattern:

  RSpec::Mocks::MockExpectationError:                                                                                                                                 
    received :error with unexpected arguments                                                                                                                         
      expected: (/some expected message/)                                                                                                                             
           got: ("Waterdrop [waterdrop-abc123]: librdkafka.error occurred")                                                                                           

The failures were appearing in unrelated specs across multiple files (appointment_service_spec.rb, vnp_veteran_spec.rb, immunizations_spec.rb, etc.) — none of which have anything to do with Kafka.

Root Cause

spec/lib/kafka/avro_producer_spec.rb has a context that tests behavior in non-test environments. To do this, it stubs Rails.env.to_s to return 'development' and calls Singleton.__init__(Kafka::ProducerManager) to re-initialize the singleton. This causes ProducerManager to boot with a real Rdkafka::Producer and register the error.occurred callback that calls Rails.logger.error.

The after block then calls Singleton.__init__ again to reset the singleton reference, but never closes the producer first. The Rdkafka native polling thread keeps running in the background after the spec finishes. In CI, there is no reachable Kafka broker, so rd_kafka_poll immediately and repeatedly generates a connection error, which fires:

 Rails.logger.error "Waterdrop [#{producer_id}]: librdkafka.error occurred"

This asynchronous call lands on whatever Rails.logger mock happens to be active in a later test group. Any spec using a strict expect(Rails.logger).to receive(:error).with(...) is vulnerable — the Waterdrop noise consumes the expectation before the intended log message arrives.

Fix

Call producer.close before resetting the singleton. WaterDrop::Producer#close flushes pending messages, closes the rdkafka handle, and stops the native polling thread. The leak stops entirely.

Related issue(s)

None. Noticed while on support.

Testing done

  • Run tests in a loop 50 times

Screenshots

Note: Optional

What areas of the site does it impact?

(Describe what parts of the site are impacted andifcode touched other areas)

Acceptance criteria

  • I fixed|updated|added unit tests and integration tests for each feature (if applicable).
  • No error nor warning in the console.
  • Events are being sent to the appropriate logging solution
  • Documentation has been updated (link to documentation)
  • No sensitive information (i.e. PII/credentials/internal URLs/etc.) is captured in logging, hardcoded, or specs
  • Feature/bug has a monitor built into Datadog (if applicable)
  • If app impacted requires authentication, did you login to a local build and verify all authenticated routes work as expected
  • I added a screenshot of the developed feature

@rmtolmach
Copy link
Copy Markdown
Contributor Author

11 successful. Running 20 more.

@rmtolmach
Copy link
Copy Markdown
Contributor Author

1 failure, but not waterdrop-related:

  1) Mobile::V0::Claim GET /v0/claim/:id with lighthouse upstream service when the claim is found when :cst_multi_claim_provider_mobile is enabled when no type parameter is provided defaults to lighthouse and returns claim successfully (backward compatibility)
     Failure/Error: assert_schema_conform(200)

     Committee::InvalidResponse:
       Expected `200` status code, but it was `401`.
     # /usr/local/bundle/cache/ruby/3.3.0/gems/committee-5.6.1/lib/committee/test/methods.rb:32:in `assert_response_schema_confirm'
     # /usr/local/bundle/cache/ruby/3.3.0/gems/committee-5.6.1/lib/committee/test/methods.rb:8:in `assert_schema_conform'
     # ./modules/mobile/spec/requests/mobile/v0/lighthouse_claim_spec.rb:268:in `block (6 levels) in <top (required)>'
     # /usr/local/bundle/cache/ruby/3.3.0/gems/webmock-3.26.2/lib/webmock/rspec.rb:39:in `block (2 levels) in <top (required)>'
     # ./spec/spec_helper.rb:105:in `block (2 levels) in <top (required)>'
     # /usr/local/bundle/cache/ruby/3.3.0/gems/rspec-retry-0.6.2/lib/rspec/retry.rb:124:in `block in run'
     # /usr/local/bundle/cache/ruby/3.3.0/gems/rspec-retry-0.6.2/lib/rspec/retry.rb:110:in `run'
     # /usr/local/bundle/cache/ruby/3.3.0/gems/rspec-retry-0.6.2/lib/rspec_ext/rspec_ext.rb:12:in `run_with_retry'
     # /usr/local/bundle/cache/ruby/3.3.0/gems/rspec-retry-0.6.2/lib/rspec/retry.rb:37:in `block (2 levels) in setup'

  2) Mobile::V0::Claim GET /v0/claim/:id with lighthouse upstream service when the claim is found when :cst_multi_claim_provider_mobile is enabled when type=lighthouse parameter is provided routes to lighthouse and returns claim successfully
     Failure/Error: assert_schema_conform(200)

     Committee::InvalidResponse:
       Expected `200` status code, but it was `401`.
     # /usr/local/bundle/cache/ruby/3.3.0/gems/committee-5.6.1/lib/committee/test/methods.rb:32:in `assert_response_schema_confirm'
     # /usr/local/bundle/cache/ruby/3.3.0/gems/committee-5.6.1/lib/committee/test/methods.rb:8:in `assert_schema_conform'
     # ./modules/mobile/spec/requests/mobile/v0/lighthouse_claim_spec.rb:280:in `block (6 levels) in <top (required)>'
     # /usr/local/bundle/cache/ruby/3.3.0/gems/webmock-3.26.2/lib/webmock/rspec.rb:39:in `block (2 levels) in <top (required)>'
     # ./spec/spec_helper.rb:105:in `block (2 levels) in <top (required)>'
     # /usr/local/bundle/cache/ruby/3.3.0/gems/rspec-retry-0.6.2/lib/rspec/retry.rb:124:in `block in run'
     # /usr/local/bundle/cache/ruby/3.3.0/gems/rspec-retry-0.6.2/lib/rspec/retry.rb:110:in `run'
     # /usr/local/bundle/cache/ruby/3.3.0/gems/rspec-retry-0.6.2/lib/rspec_ext/rspec_ext.rb:12:in `run_with_retry'
     # /usr/local/bundle/cache/ruby/3.3.0/gems/rspec-retry-0.6.2/lib/rspec/retry.rb:37:in `block (2 levels) in setup'

  3) Mobile::V0::Claim GET /v0/claim/:id with lighthouse upstream service when the claim is found when :cst_multi_claim_provider_mobile is enabled provider field in response includes provider field when no type specified (defaults to lighthouse)
     Failure/Error: expect(response.parsed_body.dig('data', 'attributes', 'provider')).to eq('lighthouse')
       Expected nil to eq "lighthouse".
     # ./modules/mobile/spec/requests/mobile/v0/lighthouse_claim_spec.rb:307:in `block (6 levels) in <top (required)>'
     # /usr/local/bundle/cache/ruby/3.3.0/gems/webmock-3.26.2/lib/webmock/rspec.rb:39:in `block (2 levels) in <top (required)>'
     # ./spec/spec_helper.rb:105:in `block (2 levels) in <top (required)>'
     # /usr/local/bundle/cache/ruby/3.3.0/gems/rspec-retry-0.6.2/lib/rspec/retry.rb:124:in `block in run'
     # /usr/local/bundle/cache/ruby/3.3.0/gems/rspec-retry-0.6.2/lib/rspec/retry.rb:110:in `run'
     # /usr/local/bundle/cache/ruby/3.3.0/gems/rspec-retry-0.6.2/lib/rspec_ext/rspec_ext.rb:12:in `run_with_retry'
     # /usr/local/bundle/cache/ruby/3.3.0/gems/rspec-retry-0.6.2/lib/rspec/retry.rb:37:in `block (2 levels) in setup'

  4) Mobile::V0::Claim GET /v0/claim/:id with lighthouse upstream service when the claim is found when :cst_multi_claim_provider_mobile is enabled provider field in response includes provider field when type=lighthouse specified
     Failure/Error: expect(response.parsed_body.dig('data', 'attributes', 'provider')).to eq('lighthouse')
       Expected nil to eq "lighthouse".
     # ./modules/mobile/spec/requests/mobile/v0/lighthouse_claim_spec.rb:316:in `block (6 levels) in <top (required)>'
     # /usr/local/bundle/cache/ruby/3.3.0/gems/webmock-3.26.2/lib/webmock/rspec.rb:39:in `block (2 levels) in <top (required)>'
     # ./spec/spec_helper.rb:105:in `block (2 levels) in <top (required)>'
     # /usr/local/bundle/cache/ruby/3.3.0/gems/rspec-retry-0.6.2/lib/rspec/retry.rb:124:in `block in run'
     # /usr/local/bundle/cache/ruby/3.3.0/gems/rspec-retry-0.6.2/lib/rspec/retry.rb:110:in `run'
     # /usr/local/bundle/cache/ruby/3.3.0/gems/rspec-retry-0.6.2/lib/rspec_ext/rspec_ext.rb:12:in `run_with_retry'
     # /usr/local/bundle/cache/ruby/3.3.0/gems/rspec-retry-0.6.2/lib/rspec/retry.rb:37:in `block (2 levels) in setup'

I'll keep rerunning. This was on the flaky spec doc already 👍

@rmtolmach
Copy link
Copy Markdown
Contributor Author

rmtolmach commented Apr 17, 2026

Failed on the 21st run with Error: The action 'Run Specs' has timed out after 15 minutes. 😳 which is NOT a good error. We've seen this in the past, but we don't know where it comes from...

Rerunning again.

Edit: I saw this on master on 4/20 so it's not related to this PR.

@rmtolmach
Copy link
Copy Markdown
Contributor Author

The waterdrop error was not seen in any of the 60 CI runs I did. The errors we did hit were known, so this doesn't appear to introduce any new errors.

@rmtolmach rmtolmach marked this pull request as ready for review April 20, 2026 21:53
@rmtolmach rmtolmach requested a review from a team as a code owner April 20, 2026 21:53
Copilot AI review requested due to automatic review settings April 20, 2026 21:53
@rmtolmach rmtolmach requested a review from a team as a code owner April 20, 2026 21:53
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a CI-flake in the Kafka/WaterDrop specs by ensuring the real Rdkafka::Producer created during a non-test-environment spec is properly shut down, preventing its background polling thread from leaking into later test examples.

Changes:

  • Close the Kafka::ProducerManager WaterDrop producer in the spec after hook before resetting the singleton.

@rmtolmach rmtolmach merged commit 67ee2e1 into master Apr 21, 2026
355 of 361 checks passed
@rmtolmach rmtolmach deleted the waterdrop_error branch April 21, 2026 13:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants