Performance Degradation Under Load From Fragments (Blueprint Walk Churns Memory?) #1352

Open
@jeffutter

Description

Summary

My team and I have been struggling to track down a GraphQL-related performance 'regression' for the past couple of weeks. We have two versions of a query that are effectively identical: one has more of its fragments as spread/named fragments, the other has more of them as inline fragments. The text documents for both are > 200 kB and the blueprints are very large (> 5 MB from :erts_debug.size, IIRC).
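
For concreteness, the two shapes look roughly like this (field and fragment names here are made up for illustration, not our actual query):

# "Good": the selection is reached through a named fragment spread
good_query = """
query ($withDetails: Boolean!) {
  item {
    ...Details @include(if: $withDetails)
  }
}

fragment Details on Item {
  id
  name
}
"""

# "Bad": the same selection written as an inline fragment
bad_query = """
query ($withDetails: Boolean!) {
  item {
    ... on Item @include(if: $withDetails) {
      id
      name
    }
  }
}
"""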

The version with more inline fragments has a slightly smaller text size and blueprint size.

Under heavy load, the version with the inline fragments has 2-5x worse p95 latency and uses ~2x the CPU.

[Image: p95 latency and CPU comparison between the two query versions under load]

I'm hoping folks can read this explanation and see if my hypothesis makes sense. I'm making some logical jumps that I haven't been able to specifically measure and substantiate yet.

Details & Background

  • The two versions of the query are:
    • A.) Handwritten by our client's devs (referred to as the "Good" query from here on out)
    • B.) That same query passed through Apollo Router with only one subgraph behind the router, i.e. essentially Apollo applying whatever optimizations it thinks are appropriate (referred to as the "Bad" query from here on out, because who likes slow p95s?)
  • Both versions
    • Have a lot of fragments
    • Often have @skip or @include directives on the fragments
    • Return the exact same data
    • Seem to execute the same as far as resolvers go
  • The "Bad" query has
    • Smaller byte size
    • Smaller blueprint size
    • Many of the fragments that were only used once converted to inline fragments (by Apollo Router)
  • We've seen similarly poor performance with a third variation of the query where all fragments were converted to named fragments
  • We have completely removed Apollo Router from the equation by dumping the queries it makes and injecting them into our load test framework
    • Now the load tests make the same requests the router would make as if the router were in place
    • This removes any theory about it being some kind of connection pooling / load balancing issue with Apollo Router
  • In any kind of isolated testing the "Bad" query actually performs faster (it's slightly smaller, why wouldn't it?)
    • We've tried looking at the following
      • Traces in Datadog
      • eprof profiles
      • cprof profiles
      • flamegraphs
    • In all of these, the "Bad" query actually looks (as one might expect) better

The only difference

The only inkling of a difference I can find between the two queries is this:

I instrumented the application with the Datadog native continuous profiler and compared two load tests (left "Good", right "Bad"). The time spent doing Elixir work is nearly identical between the two, but on the "Bad" side we see a big blob of time spent in... garbage collection.

[Image: continuous profiler flame graph comparison, "Good" left vs "Bad" right, with a large garbage collection block on the "Bad" side]

When we look at the comparison table it's even more obvious.

[Image: profiler comparison table highlighting the garbage collection difference]

For some reason the "Bad" query is causing more churn in memory, so the GC burns more CPU cycles, taking those cycles away from actual work and leading to the increased CPU usage and p95 latency.
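
One cheap way to corroborate the churn outside of the profiler is to read the per-process GC counters around a single request (a sketch; run_query is a hypothetical zero-arity fun that executes the document):

{:garbage_collection, gc_before} = :erlang.process_info(self(), :garbage_collection)

run_query.()

{:garbage_collection, gc_after} = :erlang.process_info(self(), :garbage_collection)

# Note: minor_gcs resets when a fullsweep happens, so treat this as a
# rough signal rather than an exact count.
IO.puts("minor GCs during request: #{gc_after[:minor_gcs] - gc_before[:minor_gcs]}")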

Hypothesis

Why would inline fragments vs spread fragments make any difference to memory usage? One wouldn't think they would, especially when the fragments being inlined were only used once to begin with (so we're not duplicating them inline in different objects).

My hypothesis, while rather unsatisfactory, is that something about the overall shape of the blueprint crosses a tipping point in how the BEAM manages memory. Other, larger queries, with or without fragments, might fall into the same pitfall. I suspect it has something to do with the width or depth of maps or the size of lists and how walk traverses and updates them. In particular, since walk maps over the list of children and re-sets it on the parent map, even when no data changes, this may cause the BEAM to copy the data and GC the old copies.
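
To make the suspected pattern concrete, here is a simplified sketch of the walk shape I mean (not Absinthe's actual Blueprint.Transform code):

defmodule WalkSketch do
  # Visit every node, map over its children, and re-set them on the parent.
  # Even when fun/1 and the recursion change nothing, Enum.map builds a new
  # list and %{node | ...} builds a new map, so every pass produces garbage
  # roughly proportional to the size of the tree.
  def walk(%{selections: children} = node, fun) when is_list(children) do
    new_children = Enum.map(children, &walk(&1, fun))
    fun.(%{node | selections: new_children})
  end

  def walk(node, fun), do: fun.(node)
end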

Supporting Findings

In trying to figure out why inline vs spread fragments would behave any differently here, I 'instrumented' the run_phase function of Absinthe.Pipeline as follows:

diff --git a/lib/absinthe/pipeline.ex b/lib/absinthe/pipeline.ex
index 16c7d1c2..26d91bef 100644
--- a/lib/absinthe/pipeline.ex
+++ b/lib/absinthe/pipeline.ex
@@ -405,7 +405,19 @@ defmodule Absinthe.Pipeline do
   def run_phase([phase_config | todo] = all_phases, input, done) do
     {phase, options} = phase_invocation(phase_config)
 
-    case phase.run(input, options) do
+    :erlang.garbage_collect()
+    {:memory, before} = :erlang.process_info(self(), :memory)
+
+    res = phase.run(input, options)
+    {:memory, aft} = :erlang.process_info(self(), :memory)
+
+    :erlang.garbage_collect()
+
+    {:memory, aft_gc} = :erlang.process_info(self(), :memory)
+
+    IO.puts("Phase: #{phase} | Allocated: #{aft - before} | Churn: #{aft - aft_gc}")
+
+    case res do
       {:record_phases, result, fun} ->
         result = fun.(result, all_phases)
         run_phase(todo, result, [phase | done])

The idea here is to measure memory used by each phase by doing the following:

  • Force a GC
  • Measure memory used in the process (MemA)
  • Run the phase
  • Measure memory used in the process (MemB)
  • Run a GC
  • Measure memory used in the process (MemC)

This tells us two things:

  • MemB - MemA tells us how much the memory grew while running that pipeline step
  • MemB - MemC tells us how much of that growth is garbage
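
For reuse outside the pipeline, the same measurement can be pulled into a small helper (a sketch, not Absinthe code):

defmodule MemProbe do
  # Wrap a zero-arity fun; print how much the current process's memory grew
  # while it ran (Allocated = MemB - MemA) and how much of that growth was
  # collectable garbage afterwards (Churn = MemB - MemC).
  def measure(label, fun) do
    :erlang.garbage_collect()
    {:memory, mem_a} = :erlang.process_info(self(), :memory)

    result = fun.()

    {:memory, mem_b} = :erlang.process_info(self(), :memory)
    :erlang.garbage_collect()
    {:memory, mem_c} = :erlang.process_info(self(), :memory)

    IO.puts("#{label} | Allocated: #{mem_b - mem_a} | Churn: #{mem_b - mem_c}")
    result
  end
end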

I then ran this against a sample project I created (I'm working through the red tape to publish a public repo for it).
The sample project lets me build, on the fly, schemas for an interface with N implementers and then generates queries against those with N either spread/named or inline fragments, along the lines of the sketch below.
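
A sketch of the generator's idea (module, type, and field names here are hypothetical, not the actual project code):

defmodule QueryGen do
  # Build a query against an interface field `iface` whose type has `n`
  # implementers, using either named/spread fragments or inline fragments.
  def build(n, :spread) do
    spreads = Enum.map_join(1..n, " ", fn i -> "...F#{i}" end)

    frags =
      Enum.map_join(1..n, "\n", fn i ->
        "fragment F#{i} on Impl#{i} { field#{i} }"
      end)

    "query { iface { #{spreads} } }\n#{frags}"
  end

  def build(n, :inline) do
    inlines = Enum.map_join(1..n, " ", fn i -> "... on Impl#{i} { field#{i} }" end)
    "query { iface { #{inlines} } }"
  end
end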

Given the same number of fragments I see a difference like this:

[Image: per-phase Allocated/Churn output, inline version on the left, spread version on the right]

You can see the inline version on the left shows far more allocation and churn than the spread version on the right. The spread version seems to allocate nothing at all for many of the phases.

What gives?

This is the part I'm having difficulty answering. I dug into some of the more unassuming-looking phases, like Phase.Document.Arguments.CoerceLists, and found that they should essentially do nothing given my query, yet running them still allocates memory. This led me to the walk implementation I highlighted under "Hypothesis".

The final leap I'm making, which I haven't been able to substantiate, is that the inline ("Bad") version's blueprint depth or width crosses some threshold that causes the reduce over the children, and the reassignment to the map, to copy. I wonder if it's related to how small vs large maps are handled?

From the Erlang Efficiency Guide: "Avoid having more than 32 elements in the map. As soon as there are more than 32 elements in the map, it will require more memory and keys can no longer be shared with other instances of the map."

I'm not even sure there are any large maps involved here, though; it could also have something to do with updating the lists of children. The transition the guide describes is easy to check, as sketched below.
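
The flatmap-to-hashmap transition can be observed with the same :erts_debug.size trick used earlier (sizes are in words and vary by OTP version):

small = Map.new(1..32, fn i -> {i, i} end)
large = Map.new(1..33, fn i -> {i, i} end)

# The 33-key map switches to the hashmap representation and gets noticeably
# bigger per entry; its keys also stop being shared with other instances.
IO.inspect(:erts_debug.size(small), label: "32 keys")
IO.inspect(:erts_debug.size(large), label: "33 keys")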

In conclusion

As you can see, the last bit of my theory depends on a fuzzy and partial understanding of how the BEAM manages memory, particularly when updating maps. If my hypothesis is accurate, I would love to discuss how this could be improved. I've pondered it a bit and frankly don't have many good ideas; it doesn't seem feasible for Absinthe to avoid walking the blueprint and reducing/mapping the children as it goes.

Are there any other completely different avenues I should approach this from?

Are there any good techniques for gathering more information about this specific issue?
