
Conversation

@rolodato
Contributor

When failing to connect to SSE, we currently retry forever, immediately after each failure. This results in an endless stream of connection attempts:

2025/01/16 18:52:23.753837 ERROR FLAGSMITH: Error connecting to realtime server: Get "http://127.0.0.1/sse/environments/ihAHot7taBsh6hbHJKdYHs/stream": dial tcp 127.0.0.1:80: connect: connection refused
2025/01/16 18:52:23.753874 ERROR FLAGSMITH: Error connecting to realtime server: Get "http://127.0.0.1/sse/environments/ihAHot7taBsh6hbHJKdYHs/stream": dial tcp 127.0.0.1:80: connect: connection refused
2025/01/16 18:52:23.753937 ERROR FLAGSMITH: Error connecting to realtime server: Get "http://127.0.0.1/sse/environments/ihAHot7taBsh6hbHJKdYHs/stream": dial tcp 127.0.0.1:80: connect: connection refused
2025/01/16 18:52:23.753982 ERROR FLAGSMITH: Error connecting to realtime server: Get "http://127.0.0.1/sse/environments/ihAHot7taBsh6hbHJKdYHs/stream": dial tcp 127.0.0.1:80: connect: connection refused
...

We might also want to limit the number of consecutive failures, but that can come later.

Also, panic if we were able to connect to the realtime endpoint but it did not open an SSE connection. This can happen if you accidentally use any other endpoint, such as https://api.flagsmith.com, or anything that responds over HTTP, such as https://example.com. We might want something less heavy-handed than a panic to allow for cleanup, but this scenario is a clear misconfiguration where we can't do anything else.

Tested manually, but no tests were added for these scenarios.

@gagantrivedi
Member

SSE is now part of the library that we use for HTTP; we should use that instead.

@rolodato
Contributor Author

Could we release this change as-is and create a backlog item to move to resty SSE later?

@gagantrivedi
Member

Could we release this change as-is and create a backlog item to move to resty SSE later?

We'd still need some tests

@rolodato rolodato requested review from gagantrivedi and removed request for gagantrivedi March 3, 2025 20:31
@rolodato
Contributor Author

rolodato commented Mar 3, 2025

@gagantrivedi Added a test for when the SSE service responds with a non-SSE connection. Not sure why pre-commit is failing, any idea what's up with that?

defer resp.Body.Close()
if resp.Header.Get("Content-Type") != "text/event-stream" {
panic("realtime server did not open an SSE connection")
}
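As an aside on the snippet above: servers may append parameters to the header, e.g. `text/event-stream; charset=utf-8`, so a stricter check could parse the media type first. A sketch of that idea (not the PR's actual code):

```go
package main

import (
	"fmt"
	"mime"
)

// isSSE reports whether a Content-Type header value denotes an SSE
// stream, tolerating parameters such as "charset=utf-8".
func isSSE(contentType string) bool {
	mediaType, _, err := mime.ParseMediaType(contentType)
	return err == nil && mediaType == "text/event-stream"
}

func main() {
	fmt.Println(isSSE("text/event-stream; charset=utf-8")) // true
	fmt.Println(isSSE("application/json"))                 // false
}
```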
Member

I don't see the need for a panic here; we should log it as an error and exit.

Contributor Author

@rolodato rolodato Mar 4, 2025

With the current API for realtime updates, we have 2 options:

  • Log an error and panic
  • Log an error and stop receiving realtime updates

Neither option is great. Panicking is not ideal, but it's the only way that users can handle any errors. If we don't want to panic, I would propose implementing a different API, which could look something like this:

  • Deprecate WithRealtime config option
  • Add methods to start and stop listening for realtime updates. The method to start listening should return a channel where messages are returned, which can be either actual update messages or errors.
  • If SSE connection drops, or anything else went wrong, an error should be sent on the returned channel. The SDK should transparently handle reconnecting, sending a message on the channel when this happens.

Having explicit methods to start or stop receiving updates gives more control to the user, allows proper error handling and makes this much easier to test. WDYT?
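One possible shape for such an API (entirely hypothetical; the names and types below are illustrative and not part of the SDK):

```go
package main

import "fmt"

// RealtimeEvent is either an environment update or a connection error.
// Exactly one of the fields is meaningful per event.
type RealtimeEvent struct {
	Updated bool  // true when a new environment document was received
	Err     error // non-nil when the connection dropped or was re-established
}

// RealtimeListener sketches the proposed start/stop API: StartRealtime
// returns a channel of events, and StopRealtime ends the subscription.
// Reconnection would still be handled transparently by the SDK.
type RealtimeListener interface {
	StartRealtime() (<-chan RealtimeEvent, error)
	StopRealtime()
}

func main() {
	// Consumers would range over the channel, reacting to updates and
	// falling back to polling on persistent errors. Simulated here with
	// a locally constructed channel:
	events := make(chan RealtimeEvent, 1)
	events <- RealtimeEvent{Updated: true}
	close(events)
	for ev := range events {
		fmt.Println(ev.Updated, ev.Err)
	}
}
```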

Member

@gagantrivedi gagantrivedi Mar 6, 2025

I think it's a much bigger lift with little upside for the following reasons:

  • We would have to update all the SDKs.
  • It would make the API more complex.
  • I don't think many people would want that level of control.

Now, I see another option here:
Use exponential backoff retry and log the error. If the user is really interested in when and what error the SDK is receiving, they can simply parse the logs.

Contributor Author

We would have to update all the SDKs.

Not necessarily - each SDK can have whatever API is most convenient for using real-time updates. We can update other SDKs later if we want to, or not.

It would make the API more complex.

It's more complex in the sense that there are more methods, but it's less opinionated. We're removing assumptions from our code and letting customers make those decisions themselves.

Instead of using channels, one alternative approach would be to support message/error handlers like resty already does for SSE, e.g. OnMessage and OnError: https://resty.dev/docs/server-sent-events/. We already have WithErrorHandler which we could use or extend.
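A handler-based variant could be as small as a functional option. This is a hypothetical sketch (the `config` type and `WithRealtimeErrorHandler` name are invented here for illustration; only `WithErrorHandler` is mentioned as existing in the SDK):

```go
package main

import (
	"errors"
	"fmt"
)

// config and Option sketch the usual functional-options pattern.
// These are illustrative types, not the SDK's actual ones.
type config struct {
	onRealtimeError func(error)
}

type Option func(*config)

// WithRealtimeErrorHandler registers a callback invoked whenever the
// SSE connection errors or reconnects. Hypothetical name.
func WithRealtimeErrorHandler(h func(error)) Option {
	return func(c *config) { c.onRealtimeError = h }
}

func main() {
	var c config
	WithRealtimeErrorHandler(func(err error) {
		fmt.Println("sse error:", err)
	})(&c)
	// The SDK would call the handler internally; simulated here:
	c.onRealtimeError(errors.New("connection dropped"))
}
```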

I don't think many people would want that level of control.

This is more about reliability than control, especially for customers who are self-hosting SSE, or customers using Flagsmith-managed deployments where reducing the number of API calls is important.

Currently there is no way to handle any SSE errors, whether that's in the initial connection or when receiving a message.

We have at least one user who is interested in this (@ajinkyasurya) - maybe you could expand with your use case and how the current API is not suitable?

If the user is really interested in when and what error the SDK is receiving, they can simply parse the logs.

I don't think this is a good solution. Logs should not be part of our public API, especially when it comes to error handling.

Member

Sorry, I think I am missing some context here:

We're removing assumptions from our code and letting customers make those decisions themselves.

What decisions are we talking about here? And why exponential backoff retry will not work?

Contributor

If the user is really interested in when and what error the SDK is receiving, they can simply parse the logs.

This is not a feasible solution. We would be relying entirely on the SSE service to fetch any updates.

  1. If the SDK's connection to the SSE service is dropped or interrupted for any reason, we want to be able to send out metrics as and when this happens, since that enables us to track uptime for our platform.
  2. We are planning to depend entirely on the SSE service to get updates. If, under any circumstances, the connection was dropped and could not be re-established, we would like to fall back to polling, or to fetching updates even once. Any updates/events sent over the channel would help us decide that.
  3. That said, as @rolodato mentioned, we expect the SDK to handle reconnection asynchronously. We simply want to be informed of any events for metrics and debugging.

Contributor

I agree that the panic is not ideal. Can we please use an exponential backoff retry and log the error for now, and continue with the bigger refactor outside of this PR?

@@ -1,18 +1,17 @@
-package flagsmith_test
+package flagsmith
Member

Are we moving from flagsmith_test to flagsmith here to access a private method of the realtime module? If so, we should create a separate file for that, like analytics_test.

@rolodato rolodato closed this Mar 22, 2025