-
Notifications
You must be signed in to change notification settings - Fork 17
fix: Improve SSE connection error handling #147
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
SSE is now part of the library that we use for http we should use that instead |
|
Could we release this change as-is and create a backlog item to move to resty SSE later? |
We'd still need some tests |
|
@gagantrivedi Added a test for when the SSE service responds with a non-SSE connection. Not sure why pre-commit is failing, any idea what's up with that? |
| defer resp.Body.Close() | ||
| if resp.Header.Get("Content-Type") != "text/event-stream" { | ||
| panic("realtime server did not open an SSE connection") | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see the need of panic here, we should log it as error and exit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the current API for realtime updates, we have 2 options:
- Log an error and panic
- Log an error and stop receiving realtime updates
Neither option is great. Panicking is not ideal, but it's the only way that users can handle any errors. If we don't want to panic, I would propose implementing a different API, which could look something like this:
- Deprecate
WithRealtimeconfig option - Add methods to start and stop listening for realtime updates. The method to start listening should return a channel where messages are returned, which can be either actual update messages or errors.
- If SSE connection drops, or anything else went wrong, an error should be sent on the returned channel. The SDK should transparently handle reconnecting, sending a message on the channel when this happens.
Having explicit methods to start or stop receiving updates gives more control to the user, allows proper error handling and makes this much easier to test. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a much bigger lift with little upside for the following reasons:
- We would have to update all the SDKs.
- It would make the API more complex.
- I don't think many people would want that level of control.
Now, I see another option here:
Use exponential backoff retry and log the error. If the user is really interested in when and what error the SDK is receiving, they can simply parse the logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would have to update all the SDKs.
Not necessarily - each SDK can have whatever API is most convenient for using real-time updates. We can update other SDKs later if we want to, or not.
It would make the API more complex.
It's more complex in the sense that there are more methods, but it's less opinionated. We're removing assumptions from our code and letting customers make those decisions themselves.
Instead of using channels, one alternative approach would be to support message/error handlers like resty already does for SSE, e.g. OnMessage and OnError: https://resty.dev/docs/server-sent-events/. We already have WithErrorHandler which we could use or extend.
I don't think many people would want that level of control.
This is more about reliability than control, especially for customers who are self-hosting SSE, or customers using Flagsmith-managed deployments where reducing the number of API calls is important.
Currently there is no way to handle any SSE errors, whether that's in the initial connection or when receiving a message.
We have at least one user who is interested in this (@ajinkyasurya) - maybe you could expand with your use case and how the current API is not suitable?
If the user is really interested in when and what error the SDK is receiving, they can simply parse the logs.
I don't think this is a good solution. Logs should not be part of our public API, especially when it comes to error handling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I think I am missing some context here:
We're removing assumptions from our code and letting customers make those decisions themselves.
What decisions are we talking about here? And why exponential backoff retry will not work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the user is really interested in when and what error the SDK is receiving, they can simply parse the logs.
This is not feasible solution. We will be relying on SSE service completely to fetch any update.
- If the sdk connection with SSE service is dropped/interrupted due to any reason, we want to be able to send out metrics as and when this happens since it enables us to track uptime for our platform.
- We are planning to entirely depend on SSE service to get updates. If under any circumstances, the connection was dropped and could not be established again, we would like to fallback to polling or fetching updates even once. Any updates/events sent over channel will help us decide that.
- Having said that, as @rolodato said, we expect you to asynchronously handle reconnection. We simply want to be informed of any events for metrics and debugging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that panic is not ideal though, can we please use exponential backoff retry and log the error for now and you can continue with a bigger refactor outside of this PR?
| @@ -1,18 +1,17 @@ | |||
| package flagsmith_test | |||
| package flagsmith | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we moving from flagsmith_test to flagsmith here to access private method of real-time module? If yes, we should create a different file for that like analytics_test
When failing to connect to SSE, we currently retry forever and immediately after each failure. This results in spamming connection attempts forever:
We might also want to have a limit of the number of consecutive failures, but that can come later.
Also, panic if we were able to connect to the realtime endpoint and it did not open an SSE connection. This can happen if you accidentally use any other endpoint such as
https://api.flagsmith.com, or anything that responds over HTTP such ashttps://example.com. We might want to use something less heavy-handed than a panic to allow for cleanups, but this scenario is a clear misconfiguration where we can't do anything else.Tested manually, but no tests were added for these scenarios.