
Conversation

@rolodato
Contributor

When failing to connect to SSE, we currently retry forever, immediately after each failure. This results in an endless stream of connection attempts:

2025/01/16 18:52:23.753837 ERROR FLAGSMITH: Error connecting to realtime server: Get "http://127.0.0.1/sse/environments/ihAHot7taBsh6hbHJKdYHs/stream": dial tcp 127.0.0.1:80: connect: connection refused
2025/01/16 18:52:23.753874 ERROR FLAGSMITH: Error connecting to realtime server: Get "http://127.0.0.1/sse/environments/ihAHot7taBsh6hbHJKdYHs/stream": dial tcp 127.0.0.1:80: connect: connection refused
2025/01/16 18:52:23.753937 ERROR FLAGSMITH: Error connecting to realtime server: Get "http://127.0.0.1/sse/environments/ihAHot7taBsh6hbHJKdYHs/stream": dial tcp 127.0.0.1:80: connect: connection refused
2025/01/16 18:52:23.753982 ERROR FLAGSMITH: Error connecting to realtime server: Get "http://127.0.0.1/sse/environments/ihAHot7taBsh6hbHJKdYHs/stream": dial tcp 127.0.0.1:80: connect: connection refused
...

We might also want to limit the number of consecutive failures, but that can come later.

Also, panic if we were able to connect to the realtime endpoint but it did not open an SSE connection. This can happen if you accidentally use any other endpoint, such as https://api.flagsmith.com, or anything that responds over HTTP, such as https://example.com. We might want something less heavy-handed than a panic to allow for cleanup, but this scenario is a clear misconfiguration where we can't do anything else.

Tested manually, but no tests were added for these scenarios.

@gagantrivedi
Member

SSE is now part of the library that we use for HTTP; we should use that instead.

@rolodato
Contributor Author

Could we release this change as-is and create a backlog item to move to resty SSE later?

@gagantrivedi
Member

Could we release this change as-is and create a backlog item to move to resty SSE later?

We'd still need some tests

@rolodato rolodato requested review from gagantrivedi and removed request for gagantrivedi March 3, 2025 20:31
@rolodato
Contributor Author

rolodato commented Mar 3, 2025

@gagantrivedi Added a test for when the SSE service responds with a non-SSE connection. Not sure why pre-commit is failing, any idea what's up with that?

defer resp.Body.Close()
if resp.Header.Get("Content-Type") != "text/event-stream" {
panic("realtime server did not open an SSE connection")
}
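As an aside on the snippet above: servers may append parameters to the header, e.g. `text/event-stream; charset=utf-8`, so a stricter check could parse the media type first. A sketch of that idea (not the PR's actual code):

```go
package main

import (
	"fmt"
	"mime"
)

// isSSE reports whether a Content-Type header value denotes an SSE
// stream, tolerating parameters such as "charset=utf-8".
func isSSE(contentType string) bool {
	mediaType, _, err := mime.ParseMediaType(contentType)
	return err == nil && mediaType == "text/event-stream"
}

func main() {
	fmt.Println(isSSE("text/event-stream; charset=utf-8")) // true
	fmt.Println(isSSE("application/json"))                 // false
}
```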
Member

I don't see the need for a panic here; we should log it as an error and exit.

Contributor Author

@rolodato rolodato Mar 4, 2025

With the current API for realtime updates, we have 2 options:

  • Log an error and panic
  • Log an error and stop receiving realtime updates

Neither option is great. Panicking is not ideal, but it's the only way that users can handle any errors. If we don't want to panic, I would propose implementing a different API, which could look something like this:

  • Deprecate WithRealtime config option
  • Add methods to start and stop listening for realtime updates. The method to start listening should return a channel where messages are returned, which can be either actual update messages or errors.
  • If SSE connection drops, or anything else went wrong, an error should be sent on the returned channel. The SDK should transparently handle reconnecting, sending a message on the channel when this happens.

Having explicit methods to start or stop receiving updates gives more control to the user, allows proper error handling and makes this much easier to test. WDYT?
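One possible shape for such an API (entirely hypothetical; the names and types below are illustrative and not part of the SDK):

```go
package main

import "fmt"

// RealtimeEvent is either an environment update or a connection error.
// Exactly one of the fields is meaningful per event.
type RealtimeEvent struct {
	Updated bool  // true when a new environment document was received
	Err     error // non-nil when the connection dropped or was re-established
}

// RealtimeListener sketches the proposed start/stop API: StartRealtime
// returns a channel of events, and StopRealtime ends the subscription.
// Reconnection would still be handled transparently by the SDK.
type RealtimeListener interface {
	StartRealtime() (<-chan RealtimeEvent, error)
	StopRealtime()
}

func main() {
	// Consumers would range over the channel, reacting to updates and
	// falling back to polling on persistent errors. Simulated here with
	// a locally constructed channel:
	events := make(chan RealtimeEvent, 1)
	events <- RealtimeEvent{Updated: true}
	close(events)
	for ev := range events {
		fmt.Println(ev.Updated, ev.Err)
	}
}
```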

Member

@gagantrivedi gagantrivedi Mar 6, 2025

I think it's a much bigger lift with little upside for the following reasons:

  • We would have to update all the SDKs.
  • It would make the API more complex.
  • I don't think many people would want that level of control.

Now, I see another option here:
Use exponential backoff retry and log the error. If the user is really interested in when and what error the SDK is receiving, they can simply parse the logs.

Contributor Author

We would have to update all the SDKs.

Not necessarily - each SDK can have whatever API is most convenient for using real-time updates. We can update other SDKs later if we want to, or not.

It would make the API more complex.

It's more complex in the sense that there are more methods, but it's less opinionated. We're removing assumptions from our code and letting customers make those decisions themselves.

Instead of using channels, one alternative approach would be to support message/error handlers like resty already does for SSE, e.g. OnMessage and OnError: https://resty.dev/docs/server-sent-events/. We already have WithErrorHandler which we could use or extend.
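A handler-based variant could be as small as a functional option. This is a hypothetical sketch (the `config` type and `WithRealtimeErrorHandler` name are invented here for illustration; only `WithErrorHandler` is mentioned as existing in the SDK):

```go
package main

import (
	"errors"
	"fmt"
)

// config and Option sketch the usual functional-options pattern.
// These are illustrative types, not the SDK's actual ones.
type config struct {
	onRealtimeError func(error)
}

type Option func(*config)

// WithRealtimeErrorHandler registers a callback invoked whenever the
// SSE connection errors or reconnects. Hypothetical name.
func WithRealtimeErrorHandler(h func(error)) Option {
	return func(c *config) { c.onRealtimeError = h }
}

func main() {
	var c config
	WithRealtimeErrorHandler(func(err error) {
		fmt.Println("sse error:", err)
	})(&c)
	// The SDK would call the handler internally; simulated here:
	c.onRealtimeError(errors.New("connection dropped"))
}
```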

I don't think many people would want that level of control.

This is more about reliability than control, especially for customers who are self-hosting SSE, or customers using Flagsmith-managed deployments where reducing the number of API calls is important.

Currently there is no way to handle any SSE errors, whether that's in the initial connection or when receiving a message.

We have at least one user who is interested in this (@ajinkyasurya) - maybe you could expand with your use case and how the current API is not suitable?

If the user is really interested in when and what error the SDK is receiving, they can simply parse the logs.

I don't think this is a good solution. Logs should not be part of our public API, especially when it comes to error handling.

Member

Sorry, I think I am missing some context here:

We're removing assumptions from our code and letting customers make those decisions themselves.

What decisions are we talking about here? And why exponential backoff retry will not work?

Contributor

If the user is really interested in when and what error the SDK is receiving, they can simply parse the logs.

This is not a feasible solution. We would be relying entirely on the SSE service to fetch any updates.

  1. If the SDK's connection to the SSE service is dropped or interrupted for any reason, we want to be able to send out metrics as and when this happens, since that enables us to track uptime for our platform.
  2. We are planning to depend entirely on the SSE service to get updates. If, under any circumstances, the connection was dropped and could not be re-established, we would like to fall back to polling, or to fetching updates even once. Any updates/events sent over the channel would help us decide that.
  3. That said, as @rolodato mentioned, we expect the SDK to handle reconnection asynchronously. We simply want to be informed of any events for metrics and debugging.

Contributor

I agree that the panic is not ideal. Can we please use an exponential backoff retry and log the error for now, and continue with the bigger refactor outside of this PR?

@@ -1,18 +1,17 @@
-package flagsmith_test
+package flagsmith
Member

Are we moving from flagsmith_test to flagsmith here to access a private method of the realtime module? If so, we should create a separate file for that, like analytics_test.

@rolodato rolodato closed this Mar 22, 2025