Member

@jehiah jehiah commented Nov 21, 2020

This updates nsqd to add --topology-region and --topology-zone config arguments and to prefer sending messages to same-zone and same-region consumers, as proposed in #1300.

@jehiah jehiah self-assigned this Nov 21, 2020
Member Author

jehiah commented Nov 21, 2020

I think this is ready for a first round of review @mreiferson @ploxiln (along with its pair, nsqio/go-nsq#312).

Some items to decide on -

  • Should -mem-queue-size=0 disable zone local and region local consumption? (i.e. should it continue to effectively write to disk first?)
  • Is there an approach to changing the consumption of the disk read chan so it prefers zoneLocal and regionLocal consumers? (This would probably mean a goroutine that consumes it and does put(), instead of messagePump consuming the disk backend directly.) If there is an approach, is it important to do as part of this?

Once this is squared away and we are happy with it, I'll follow up with documentation PRs and with exposing region/zone in nsqadmin.

@jehiah jehiah requested review from mreiferson and ploxiln November 21, 2020 06:45
Member

@ploxiln ploxiln left a comment

I had some ideas for what I think are minor simplifications that leave the overall design the same. Overall, looks good to me.

flagSet.Duration("http-client-connect-timeout", opts.HTTPClientConnectTimeout, "timeout for HTTP connect")
flagSet.Duration("http-client-request-timeout", opts.HTTPClientRequestTimeout, "timeout for HTTP request")
flagSet.String("topology-region", opts.TopologyRegion, "A region represents a larger domain, made up of one or more zones")
flagSet.String("topology-zone", opts.TopologyZone, "A zone represents a logical failure domain")

should probably mention something about "for preferring closer consumer"
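
To make the suggestion concrete, here is a small self-contained sketch using Go's standard `flag` package, with help text that mentions the purpose of each flag. Only the two flag names come from the diff above; the `topology` struct, the `parseTopology` helper, the exact help wording, and the sample values are hypothetical.

```go
package main

import (
	"flag"
	"fmt"
)

// topology is a hypothetical holder for the two settings.
type topology struct {
	Region string
	Zone   string
}

// parseTopology registers the two flags with help text that states
// what they are used for, then parses the given arguments.
func parseTopology(args []string) (topology, error) {
	var topo topology
	fs := flag.NewFlagSet("nsqd-sketch", flag.ContinueOnError)
	fs.StringVar(&topo.Region, "topology-region", "",
		"A region represents a larger domain, made up of one or more zones (used for preferring closer consumers)")
	fs.StringVar(&topo.Zone, "topology-zone", "",
		"A zone represents a logical failure domain (used for preferring closer consumers)")
	err := fs.Parse(args)
	return topo, err
}

func main() {
	// The region and zone values are placeholders.
	topo, err := parseTopology([]string{
		"--topology-region=region-a",
		"--topology-zone=zone-a1",
	})
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("region=%s zone=%s\n", topo.Region, topo.Zone)
}
```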

@jehiah jehiah force-pushed the topology_aware_msg_delivery_1301 branch from f390edc to 96adf24 Compare November 28, 2020 01:39
@zoemccormick zoemccormick force-pushed the topology_aware_msg_delivery_1301 branch 2 times, most recently from 91517a4 to fa8d2b4 Compare October 24, 2023 19:02
@zoemccormick zoemccormick force-pushed the topology_aware_msg_delivery_1301 branch from 2b90b29 to 42d28f5 Compare November 27, 2023 16:50
@zoemccormick zoemccormick force-pushed the topology_aware_msg_delivery_1301 branch from 88161cb to af1edde Compare January 2, 2024 17:37
@zoemccormick zoemccormick force-pushed the topology_aware_msg_delivery_1301 branch from af1edde to 4631064 Compare January 31, 2024 21:19
@zoemccormick zoemccormick force-pushed the topology_aware_msg_delivery_1301 branch 2 times, most recently from e93d2b5 to b84cfe2 Compare April 5, 2024 17:21
@zoemccormick zoemccormick force-pushed the topology_aware_msg_delivery_1301 branch 6 times, most recently from 8f51ed1 to 00fcfa9 Compare April 17, 2024 21:05

zoemccormick commented May 15, 2024

@jehiah @ploxiln @mreiferson This PR is now officially ready for review - in conjunction with nsqio/go-nsq#312 and nsqio/nsqio.github.io#89.

We have written up an experience report based on our observations running these changes for 2 months in our production environment - it can be found here. It also lays out the changes to nsqadmin.

Please let me know if you have any other questions! Thanks in advance!

@mreiferson
Member

@jehiah hello! what's the plan here for actually landing this? Are y'all waiting on me to review everything 😏?

Member Author

jehiah commented Oct 17, 2024

@mreiferson sorry for the radio silence on this. I think we are ready to give more attention to this in Q4, so if you have any comments on this PR (or the related go-nsq changes) in the next few weeks, please share them; otherwise the topology changes are ready to land and we'll do that next month.

I think the general plan is still to land this behind the feature flag, get it into a 1.4 release, and consider removing the FF in 1.5. We have been happy with this running at Bitly for a while now, but I'm keeping an eye on kubernetes/enhancements#4747 as it will make adopting this in a K8s environment easier.

@zoemccormick zoemccormick force-pushed the topology_aware_msg_delivery_1301 branch from 051b99f to 641bae3 Compare January 21, 2025 15:43
@zoemccormick zoemccormick force-pushed the topology_aware_msg_delivery_1301 branch from 641bae3 to 1cd6297 Compare January 21, 2025 19:30
@jehiah jehiah merged commit 3103474 into nsqio:master Jan 27, 2025
8 checks passed
@martin-sucha

I tried this in our deployment, and enabling the experiment didn't have the intended effect. I didn't see any effect on cross-zone traffic after enabling it; messages were still sent to all zones.

I changed the zoneLocalMsgChan to have a large capacity (30k), then I observed an effect on consumer clients.
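
One possible reading of why capacity matters, offered as an assumption rather than a confirmed diagnosis: if the zone-local send were a non-blocking send, an unbuffered channel would accept a message only when a consumer is already waiting, while a buffered channel accepts it whenever there is room. The `trySend` helper below is hypothetical and only demonstrates that general Go behavior.

```go
package main

import "fmt"

// trySend attempts a non-blocking send and reports whether it succeeded.
func trySend(ch chan int, v int) bool {
	select {
	case ch <- v:
		return true
	default:
		return false
	}
}

func main() {
	unbuffered := make(chan int)      // no capacity
	buffered := make(chan int, 30000) // large capacity, as in the experiment above

	// No goroutine is receiving from either channel.
	fmt.Println(trySend(unbuffered, 1)) // false: nobody is waiting to receive
	fmt.Println(trySend(buffered, 1))   // true: the buffer has room
}
```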

However, I observed that the message routing probably interfered with how concurrency is managed, as our consumer pods scaled down to almost zero after a few minutes. Since the nsq client divides the concurrency (RDY) between the nsqd instances, it seems to me that the 2/3 of the in-flight capacity that would otherwise carry cross-zone traffic is not used for sending messages.

So perhaps the go-nsq client should divide the RDY requests only by the number of nsqd instances in the current zone? Or, alternatively, divide MaxInFlight for each zone separately?

What value of nsq.Config.MaxInFlight do you usually use with the topology-aware consumption? We have low values (7-10) per pod.
