
feat: graphql-yoga #11077

Open · wants to merge 17 commits into master

Conversation

@mattkrick (Member) commented Mar 27, 2025

Description

fix #11065

This PR is big. I'm sorry. Here's a video walkthrough of the changes: https://www.loom.com/share/6905c3b988e946bb9e3341f54747662b

What it does:

  • removes graphql-executor in favor of handling all graphql requests with the socket server
  • removes trebuchet in favor of graphql-ws
  • adds graphql-yoga instead of our homegrown solution
  • replaces sentry with datadog for all graphql errors
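Since trebuchet is replaced by graphql-ws, client/server traffic now follows the standard GraphQL over WebSocket protocol. A dependency-free sketch of the core message shapes, assuming the protocol as graphql-ws documents it (the helper names are illustrative, not library APIs):

```typescript
// Message shapes from the GraphQL over WebSocket protocol that graphql-ws
// implements. The client sends connection_init once after opening the
// socket, then any number of subscribe operations, each keyed by an id.
type SubscribeMessage = {
  id: string
  type: 'subscribe'
  payload: {query: string; variables?: Record<string, unknown>}
}

type NextMessage = {id: string; type: 'next'; payload: {data: unknown}}
type CompleteMessage = {id: string; type: 'complete'}

const connectionInit = {type: 'connection_init' as const}

const buildSubscribe = (id: string, query: string): SubscribeMessage => ({
  id,
  type: 'subscribe',
  payload: {query}
})

// A query or mutation over the socket is just a subscribe operation that
// yields a single `next` result followed by `complete`, which is how one
// socket server can handle all GraphQL requests.
const singleResult = (
  id: string,
  data: unknown
): [NextMessage, CompleteMessage] => [
  {id, type: 'next', payload: {data}},
  {id, type: 'complete'}
]
```

This is why no separate HTTP executor fleet is needed: queries, mutations, and subscriptions all share the one socket transport.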

Signed-off-by: Matt Krick <[email protected]>
github-actions bot (Contributor):

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR will be delayed and might be rejected due to its size.


@rafaelromcar-parabol (Contributor):

I worry that graphql-ws is a one-man project. Only its author is coding on it, and we don't control it at all. Even so, it may be worth switching from our custom solution to it, but I wanted to share my concern.

@mattkrick (Member, Author):

Totally valid! It is just one person, but it seems pretty stable, as it's the de facto solution for graphql-yoga: https://github.com/graphql-hive/graphql-yoga/graphs/contributors. Then again, graphql-yoga only has 3 contributors 😅.

@rafaelromcar-parabol (Contributor):

I mean, if you think it's the best we can get, let's go with it and hope it doesn't stall or drift from what we need 🤞🏼 Curious: why are we using graphql-ws instead of graphql-yoga? I haven't checked them, so this question might be stupid 😅


@mattkrick mattkrick requested a review from Dschoordsch April 2, 2025 23:51

@rafaelromcar-parabol (Contributor):

This looks GREAT! Thanks for the work, man, this is huge. Some notes and questions:

  • Please hold this PR and do not merge and release until all questions are answered and until we prepare the release in the Helm chart too.

    • This might break stuff and might cause downtime, so we might even want to release it after hours. We should also do a release containing only this change, nothing else at all, and then wait, say, 48h before releasing anything else, so we isolate any problems.
    • We should also test this in an on-demand env.
    • What happens to users on the previous version? Have you tested?
  • You are saying we only need one server, so just WS and no more GQL-EX:

    • What about scalability? Could we add "any" number of WS servers?
    • What about the SERVER_ID? Imagine we have WS 1, 2 and 3, then we do a release or a restart and now we have WS 4, 5 and 6. Is it OK for users if the SERVER_IDs are new every time, or must we keep the same IDs?
  • Getting rid of intranet-graphql breaks our pings and thus all our alerting. We need to change that before releasing in any environment. Should we replace it with calls to /graphql?

  • Chronos and the Embedder would now communicate with the server through the API instead of through Redis, right? That would mean that, if they use the appOrigin (a public URL), they would be going through unnecessary network hops. Could we have a different endpoint for those services to use when they want to communicate with the server? That would improve performance in real-world scenarios, like staying on the private network in our Kube clusters!

  • ReadinessCheck still depends on external services, which means removing a server from the load balancer if the database is unavailable, instead of having the app handle a db outage itself (return a controlled error and retry). But it's great that we already have something! It's a step toward success!

@rafaelromcar-parabol rafaelromcar-parabol self-requested a review April 3, 2025 16:10

@mattkrick (Member, Author) commented Apr 3, 2025

> We should also test this in an on-demand env.

Sure!

> What happens to users on the previous version? Have you tested?

When the new version is deployed they must upgrade their client to the new version. That should happen automatically; if it doesn't, they'll get an error, and when they hit refresh everything will be good.
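That forced-upgrade flow can be sketched as client-side close handling. This is a hypothetical illustration, not Parabol's actual implementation: the close code and handler names are assumptions (4xxx codes are the application-defined range in WebSocket).

```typescript
// Hypothetical close code a server might send when it rejects a client
// built against the previous release. 4999 is an assumption for illustration.
const OUTDATED_CLIENT = 4999

// reload: fetch the new client bundle; retry: reconnect with backoff.
const handleSocketClose = (
  code: number,
  handlers: {reload: () => void; retry: () => void}
) => {
  if (code === OUTDATED_CLIENT) {
    // A new server version was deployed: refresh to pick up the new client.
    handlers.reload()
  } else {
    // Transient disconnect: keep the session and reconnect.
    handlers.retry()
  }
}
```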

> You are saying we only need one server, so just WS and no more GQL-EX:

Yep!

> What about scalability? Could we add "any" number of WS servers?

Yep.

> What about the SERVER_ID? Imagine we have WS 1, 2 and 3, then we do a release or a restart and now we have WS 4, 5 and 6. Is it OK for users if the SERVER_IDs are new every time, or must we keep the same IDs?

Yeah, the ID doesn't matter, as long as they're all unique.
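Since only uniqueness matters, each server can simply mint a fresh ID at boot. A minimal sketch under that assumption (the names here are illustrative, not the actual Parabol code):

```typescript
import {randomUUID} from 'node:crypto'

// A fresh, collision-resistant ID per process boot. Nothing needs to be
// persisted or reused across restarts; it only matters that the IDs of
// the currently running servers are distinct from one another.
const SERVER_ID = randomUUID()

// Sanity helper: a fleet of IDs is valid as long as no two collide.
const allUnique = (ids: string[]) => new Set(ids).size === ids.length
```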

> Getting rid of intranet-graphql breaks our pings and thus all our alerting. We need to change that before releasing in any environment. Should we replace it with calls to /graphql?

Good call! Yep. The shape of the payload also changes: you must remove isPrivate from the request.body.
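As a sketch of that payload change, a migration helper for the alerting pings might just drop the field. The old body shape here (a query plus an isPrivate flag) is an assumption for illustration, not the actual alerting code:

```typescript
// Old intranet-graphql ping bodies carried an `isPrivate` flag; the new
// /graphql endpoint no longer accepts it, so strip it and keep the rest.
type LegacyPingBody = {
  query: string
  variables?: Record<string, unknown>
  isPrivate?: boolean
}

const migratePingBody = ({isPrivate: _drop, ...rest}: LegacyPingBody) => rest
```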

> Chronos and the Embedder would now communicate with the server through the API instead of through Redis, right? ... Could we have a different endpoint for those services to use when they want to communicate with the server?

Chronos now talks HTTP via the public URL (open to changing that, but it seems difficult... the busiest cronjob fires once every 5 mins; I think we'll be OK if our public endpoint gets 1-2 extra requests every 5 mins 😅).

The Embedder talks via Postgres, so no change is necessary.

> ReadinessCheck still depends on external services, which means removing a server from the load balancer if the database is unavailable ...

Here's my hot take: readiness CAN include upstream dependencies, liveness CANNOT. If a server can't connect to PG, we want to take it out of rotation (i.e. stop serving it traffic), but we shouldn't kill it immediately. Maybe the DB is down, maybe it's under heavy load. The readiness probe will keep firing, and if the DB comes back up, we have a pre-warmed container ready to go.

From the docs:

> Sometimes, applications are temporarily unable to serve traffic. For example, an application might need to load large data or configuration files during startup, or depend on external services after startup. In such cases, you don't want to kill the application, but you don't want to send it requests either. Kubernetes provides readiness probes to detect and mitigate these situations. A pod with containers reporting that they are not ready does not receive traffic through Kubernetes Services.

https://elastisys.com/the-difference-between-liveness-and-readiness-probes-in-kubernetes/
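The split described above can be sketched as two probe handlers. The names, paths, and status codes are illustrative assumptions, not Parabol's actual endpoints:

```typescript
// Liveness: is the process itself healthy? Never check upstream deps here,
// or a DB outage would get every pod killed and restarted in a loop.
const livenessProbe = () => ({status: 200})

// Readiness: can this pod serve traffic right now? Upstream deps belong
// here: an unready pod is pulled from the Service (no traffic) but not
// killed, so it rejoins as a pre-warmed container once the DB recovers.
const readinessProbe = (canReachPostgres: boolean) =>
  canReachPostgres ? {status: 200} : {status: 503}
```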


Successfully merging this pull request may close these issues.

Test graphql-yoga
2 participants