Skip to content

Conversation

@dodomorandi
Copy link

@dodomorandi dodomorandi commented Apr 24, 2023

  • Link to issue, e.g. Resolves #NNN
  • Documentation added (if applicable)
  • Tests added
  • Branch rebased on top of current main (git pull --rebase origin main)
  • Changes squashed to a single commit (described here)
  • Build is green in Travis CI
  • You have certified that the contribution is your original work and that you license the work to the project under the Apache 2 license

Main issue: #4094

Changes proposed in this pull request:

  • Added Heartbeat field to StreamSource
  • Added getMirrorHeartbeat helper function
  • Added getStreamSourceHeartbeat and getSourceHeartbeat helper functions
  • Use getMirrorHeartbeat to define mirror heartbeat and related health checks
  • Use getStreamSourceHeartbeat and getSourceHeartbeat to define stream sources heartbeat and related health checks

CC @paolobarbolini

@dodomorandi dodomorandi requested a review from a team as a code owner April 24, 2023 14:19
@dodomorandi
Copy link
Author

Except for the failing CI (for which TBH I am a bit perplexed), feel free to give me some feedback about the value of the PR (and the relative issue). If you think it is basically fine, I will add some tests for the new behavior.

@derekcollison
Copy link
Member

The data race in TestJetStreamClusterSourceWithOptStartTime is legit.

@dodomorandi
Copy link
Author

@derekcollison can you give me a hint about a possible cause? The reason I don't understand is that, without specifying any idle_heartbeat parameter, the behavior should have not been changed in any possible way. What I am not seeing?

@derekcollison
Copy link
Member

The stack for the datarace will give you the info needed, in terms of when it violates and what was previous operation, etc..

@dodomorandi
Copy link
Author

@derekcollison Thanks for the patient! There was just a single point of data race it slipped through. Tests are now green.

Let me know if the change is reasonable for you, in that case I can add a few tests.

Copy link
Member

@derekcollison derekcollison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable, may need to be targeted at dev branch for 2.10 vs main which is 2.9.x atm.

Let's add some tests etc.

Thanks!

@dodomorandi
Copy link
Author

@derekcollison Sorry to bother, but I am banging my head against the tests and I probably need a hint from someone who knows the codebase.

The ideal test would be to use both (relatively) small and large custom heartbeat and monitor the behavior of the consumers in order to check everything is working as expected. For instance, I should expect a failing test if I change back the heartbeat value to the previously default one in the checks for stalled streams.

The issue I am facing is I am unable to verify this situation. For instance, the test TestJetStreamPushConsumerIdleHeartbeats exactly performs this task, but it has direct access to the testing consumer (because it just creates it) in order to subscribe to the replies and check for the heartbeat. On the other hand, when a stream is configured to use a mirror or some sources, the created consumer is not exposed in any way, and I am unable to get the heartbeat messages just subscribing to >.

I honestly do not know if I am missing something obvious or the situation is really a bit tricky, in any case I think people with the knowledge of the codebase could give me very useful information. Even if my changes have a small impact, I think that there are a couple of sharp edges that really need some additional tests.

Thank you in advance!

@derekcollison
Copy link
Member

Yes can be tricky in there, the consumer is gettable, but a bit nuanced.

Can you point me to a line in your test where you would need direct access and I can probably help.

@dodomorandi dodomorandi changed the base branch from main to dev May 2, 2023 09:52
@dodomorandi
Copy link
Author

Sure, and thank you for helping me! 😊

I just pushed a wip commit, which is pretty much stolen from TestJetStreamPushConsumerIdleHeartbeats.

As you can see I am trying to replicate the subscription to the consumer in order to get the heartbeat messages.

However (and pretty obviously), in the original test the subject is used as a delivery subject for the consumer. However, in my case the subject of the consumer is changed a bit and other parameters are set -- and my noobness with the codebase does not let me grasp what is the specific reason I am unable to see the the heartbeat messages from the inbox subscription 😅.

I would be really happy to solve this without bothering you, but I think I need some help 😞. Thank you in advance for all the support!

@derekcollison
Copy link
Member

We have a customer that is looking at this functionality so we will introduce for 2.10 in some for or fashion.

}

func (mset *stream) getMirrorHeartbeat() time.Duration {
if mset.cfg.Mirror.Heartbeat == 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dodomorandi My guess is that the data race is in this method since no read lock is being used (unlike all of the other methods on mset when fields are accessed).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The data race now is fixed, the only thing is that I keep using the initial value of the heartbeat. AFAIK this should not be an issue because I don't expect the value to change across locking blocks. Is my assumption correct or it would be better to use the current value of the heartbeat here?

@dodomorandi dodomorandi force-pushed the source-heartbeat branch 3 times, most recently from 255b205 to b4d89b2 Compare July 6, 2023 09:11
@bruth
Copy link
Member

bruth commented Aug 5, 2023

Looks like the final bit to remove is the local replace in go.mod?

Instead of using the default heartbeat for stream sources and mirror,
allow more customizability. The mechanism of the health check keeps
working in a linear fashion. Every time the heartbeat is not specified,
things work just like before.
@dodomorandi
Copy link
Author

@bruth Sorry for the delay. I just resynced the repo. Regarding to your last comment, I think that this PR needs a change in nats.go to correctly support the new Heartbeat field before being merged, and the change is pretty simple as you can see from here.

If you want I can open a PR in nats.go in order to add that field and link to this PR, just let me know.

@neilalexander neilalexander added the post-freeze We'll come back to this after the freeze period label Aug 21, 2023
@paolobarbolini
Copy link

paolobarbolini commented Sep 25, 2023

The approach of #4352 looks interesting. Since this didn't make it in 2.10 we're going to patch nats-server in our Yocto build by hard-coding a higher value for heartbeat interval. We'll see whether this option makes sense to be configured globally or at a per stream-by-stream level.

@maurobalbi
Copy link

We have a customer that is looking at this functionality so we will introduce for 2.10 in some for or fashion.

Has this been introduced?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

post-freeze We'll come back to this after the freeze period

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants