Test graphql-yoga #11065

Open

@mattkrick

Description

graphql-yoga is a batteries-included graphql server. It comes with a transport layer called graphql-ws, which could replace trebuchet. I'm enticed by this combo because it offers @stream out of the box, has good logging & error reporting, and involves a LOT less custom code than the server we've been running for 6 years or so. Another perk is that we could get rid of the gqlExecutor & only have a single server, which means less redis throughput, faster resolutions, and far fewer errors.

Hurdles

  • resubscribing is tricky. either we call the websocket onMessage handler with a mock subscribe message from the client, or we handle the messaging ourselves on resubscribe, or we modify the async iterator without ending it (i.e. changing the channels it is subscribed to). the last option is more imperative, but performs best. the imperative part is that we have to know the channels to add/remove, and that logic currently lives in the subscription resolver.
  • reliable messaging and causal ordering - reliable messaging will be hard. will start with just causal ordering
  • pings need testing
  • ignore replies if no opId was supplied - can't do this unless we rewrite parts
  • queue and retry messages? - won't include at first
  • test uploads
  • cache parsing/validation
  • remove all calls to publishInternal, etc.
  • reuse dataLoader for subs
  • datadog + other extensions
  • troubleshoot reconnect logic. change a line in the server so it disconnects; the client should reconnect
  • test memory leaks (context, dataloaders)
  • test SSR
  • test webhooks
  • figure out what to do with graphql-jit. it doesn't work with stream/defer yet, and we'll have to rewrite our wrapper to make it work with datadog. and some real-world tests show it's not all that great https://deezer.io/graphql-jit-is-it-worth-it-64e66f21dbb8. Holding off for now
  • permanent TTL with warning, just in case i forgot something -- not gonna do this, will just log the workerCount every minute to verify
  • test chronos
  • test SSO
  • create a new org, make sure the resubscribe happens with proper authToken
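The channel-swapping approach from the resubscribe bullet could look something like the sketch below. Everything here (`ChannelBus`, `ChannelIterator`, the channel names) is hypothetical, not a yoga or trebuchet API; the point is just that an async iterator can keep yielding while its underlying channel set changes.

```typescript
// Hypothetical sketch: an async iterator whose subscribed channels can be
// added/removed imperatively without ending the iteration.
type Listener = (value: string) => void

class ChannelBus {
  private listeners = new Map<string, Set<Listener>>()
  publish(channel: string, value: string) {
    this.listeners.get(channel)?.forEach((fn) => fn(value))
  }
  on(channel: string, fn: Listener) {
    if (!this.listeners.has(channel)) this.listeners.set(channel, new Set())
    this.listeners.get(channel)!.add(fn)
  }
  off(channel: string, fn: Listener) {
    this.listeners.get(channel)?.delete(fn)
  }
}

class ChannelIterator implements AsyncIterableIterator<string> {
  private queue: string[] = []
  private pending: ((r: IteratorResult<string>) => void) | null = null
  private bus: ChannelBus
  private channels: Set<string>
  // buffer values when no one is awaiting; resolve immediately otherwise
  private listener: Listener = (value) => {
    if (this.pending) {
      this.pending({value, done: false})
      this.pending = null
    } else {
      this.queue.push(value)
    }
  }
  constructor(bus: ChannelBus, channels: Set<string>) {
    this.bus = bus
    this.channels = channels
    channels.forEach((c) => bus.on(c, this.listener))
  }
  // the imperative part: swap channels mid-flight without ending the iterator
  addChannel(channel: string) {
    this.channels.add(channel)
    this.bus.on(channel, this.listener)
  }
  removeChannel(channel: string) {
    this.channels.delete(channel)
    this.bus.off(channel, this.listener)
  }
  next(): Promise<IteratorResult<string>> {
    if (this.queue.length > 0) {
      return Promise.resolve({value: this.queue.shift()!, done: false})
    }
    return new Promise((resolve) => (this.pending = resolve))
  }
  return(): Promise<IteratorResult<string>> {
    this.channels.forEach((c) => this.bus.off(c, this.listener))
    return Promise.resolve({value: undefined, done: true})
  }
  [Symbol.asyncIterator]() {
    return this
  }
}
```

The catch noted above still applies: something outside the iterator has to know which channels to add/remove, and today that knowledge lives in the subscription resolver.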

problem with the dataloaders: 1 server has the dataloader, but subscribers exist on n servers. previously, we sent each subscriber back to the mutating GQL executor in order to reuse the dataloader.

options for solving the dataloader problem:

  • we could do the same thing as before. each subscriber sends the SourceStream payload to the mutating server that still holds the dataloader. unfortunately, this means we'd have to write custom subscribe code since yoga does not allow separation of source stream and response stream.
  • when we publish, we could eagerly serialize & publish the dataloader to redis. then the subscribers could attempt to get the dataloader locally, then try to get it from redis if a local copy doesn't exist. since publish can be called more than once, the mutation server would have to keep track of whether it already serialized & published the dataloader before.
  • we could wait until a subscriber needs the dataloader, then have the subscriber publish a request to the mutating server, which replies with the serialized dataloader. this gets around the need to track which dataloaders were already published, but increases latency

i like option 2 because it does not increase latency & reduces traffic between servers.
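A minimal sketch of option 2, with a plain `Map` standing in for redis and hypothetical names throughout (`MutationServer`, `SubscriberServer`, the key scheme): the mutating server serializes & publishes once no matter how many times publish fires, and subscribers check local memory before falling back to redis.

```typescript
// Hypothetical sketch of option 2. A Map stands in for redis here.
type SerializedLoader = Record<string, unknown>

const fakeRedis = new Map<string, string>() // stand-in for redis SET/SETEX

class MutationServer {
  private published = new Set<string>() // mutation ids already serialized
  constructor(private loaders: Map<string, SerializedLoader>) {}
  publish(mutationId: string) {
    // publish can be called more than once per mutation, so track
    // whether this dataloader was already serialized & published
    if (this.published.has(mutationId)) return
    this.published.add(mutationId)
    const loader = this.loaders.get(mutationId) ?? {}
    fakeRedis.set(`dataLoader:${mutationId}`, JSON.stringify(loader))
  }
}

class SubscriberServer {
  private local = new Map<string, SerializedLoader>() // hydrated loaders
  getLoader(mutationId: string): SerializedLoader | undefined {
    const hit = this.local.get(mutationId)
    if (hit) return hit // local copy exists: no redis round trip
    const raw = fakeRedis.get(`dataLoader:${mutationId}`)
    if (!raw) return undefined
    const hydrated = JSON.parse(raw) as SerializedLoader
    this.local.set(mutationId, hydrated) // cache for later subscribers on this server
    return hydrated
  }
}
```

The first subscriber on each server pays the redis read; everyone after that gets the in-memory copy, which is what keeps latency flat.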

there's a problem though.

  • mutation 1 publishes an updateTask, the TaskSubscription sets context.dataLoader to the dataloader of mutation 1
  • mutation 2 publishes an updateTask, the TaskSubscription sets context.dataLoader to the dataloader of mutation 2
  • mutation 1 is still resolving, but is now using the dataloader for mutation 2

The above isn't a problem because graphql-yoga resolves subscriptions in the order that they're received. this is nice because it eliminates the causal ordering problem, but it's a pain because it means dataloaders must have a TTL > the duration it takes to resolve the pending queue.
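To make that in-order guarantee concrete, here's a sketch (hypothetical names, not yoga internals) of a subscription draining its queue one payload at a time: even if a later payload could resolve faster, it never starts until the earlier one finishes, so mutation 2's dataloader is never touched mid-flight by mutation 1.

```typescript
// Hypothetical sketch: sequential resolution of a subscription's pending queue.
type SubEvent = {mutationId: string; resolve: () => Promise<string>}

async function drainInOrder(events: SubEvent[]): Promise<string[]> {
  const results: string[] = []
  for (const event of events) {
    // await each payload before starting the next: no interleaving
    results.push(await event.resolve())
  }
  return results
}
```

The cost is exactly the pain described above: a slow payload (e.g. a jira fetch) blocks everything queued behind it, which is why the dataloader TTL has to outlive the whole pending queue, not just one resolution.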

  • the serialized version must live in redis long enough for 1 subscriber on each server to grab it. each subscriber may have a backlog of response stream payloads to resolve. generally these take < 10ms, but if it involves a fetch to a 3rd party like jira, it could take 3 seconds each. so, let's target 30 seconds?
  • once the serialized version is hydrated onto a server, all subscribers on that server must use the in-memory version within an appropriate TTL. since subscribers are generally team members receiving roughly the same payloads, they should all resolve at approximately the same time. that said, they could all be backlogged from jira fetches, etc.
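The two TTLs above could be modeled with a single expiring store instantiated twice, once per tier. The 30-second redis target comes from this issue; the in-memory TTL below is a placeholder, and the injectable clock is just there to make the expiry testable.

```typescript
// Hypothetical sketch of the two-tier TTL. Numbers are targets, not measurements.
const REDIS_TTL_MS = 30_000 // long enough for 1 subscriber per server to grab it
const LOCAL_TTL_MS = 10_000 // placeholder window for the remaining local subscribers

class TTLStore<T> {
  private entries = new Map<string, {value: T; expiresAt: number}>()
  constructor(private ttlMs: number, private now: () => number = Date.now) {}
  set(key: string, value: T) {
    this.entries.set(key, {value, expiresAt: this.now() + this.ttlMs})
  }
  get(key: string): T | undefined {
    const entry = this.entries.get(key)
    if (!entry) return undefined
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key) // lazily expire on read
      return undefined
    }
    return entry.value
  }
}
```

Usage would be `new TTLStore(REDIS_TTL_MS)` for the serialized tier (in practice a redis SETEX) and `new TTLStore(LOCAL_TTL_MS)` for the hydrated in-memory tier.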
