🐛 [Bug]: Cache stampede with caching middleware #2182

Open
@demming

Bug Description

I've come across several instances of the so-called cache stampede in a variety of web app frameworks. Here are the issues I opened in the respective GitHub repos:

  • ASP.NET Core (only Response Caching affected, performance issue remaining),
  • Quarkus (WIP & performance),
  • PlayFramework (awesomely fixed, only minor performance issue remaining),
  • Micronaut (status unknown).

In those cases a cache stampede brings the GC to its knees. With Fiber the effects are less pronounced, but RSS still climbs from the typical 20 MB to over 300 MB within a couple of minutes at just 100 connections. In relative terms that is a lot: it is up to 50% more memory than my Akka HTTP implementation consumes running on OpenJDK 19 HotSpot (not even the low-profile Semeru OpenJ9).

It appears Fiber's caching middleware suffers from it too. It's a kind of cascading failure in distributed systems. Specifically, when I open several concurrent connections to a cached Fiber endpoint, multiple handler invocations and (likely) cache populations take place, ranging from "for more than one connection" to "for each connection", which is incredibly inefficient.

There are three basic mitigation strategies; the most common one introduces a mutex, so that all but one (virtual) thread must wait.

For additional details, just look into the issues I've linked above. Let me know if you need additional input and how I can help out here. I would appreciate it if someone could confirm my findings by reproducing them.

How to Reproduce

One can set up a cached endpoint that does whatever one likes. To make it a bit less boring, I have it sanitize remote HTML; this way I can also gauge the throughput and framework-induced latencies.

One can run

bombardier -c 100 -d 10s -l "http://localhost:3000/sanitize?address=http://localhost:8080"

or any other target URL. You can also run something less trivial, such as a chain of services to check data equivalence:

bombardier -c 100 -d 10s -l "http://localhost:3000/sanitize?address=http://localhost:8080/website?address=<remote_url>"

or even run a recursive request.

As you are certainly aware, this creates 100 concurrent connections and uses them for a short 10-second burst of load, with implicit connection reuse, and displays a histogram of observed latencies.

Expected Behavior

I expect only one cache population on the initial cache miss, not 100. Specifically, getWebpageHandler below should not be invoked more than once per cache cycle, regardless of the number of inbound concurrent connections.

Fiber Version

v2.39.0

Code Snippet (optional)

package main

import (
  "log"
  "sync/atomic"
  "time"

  "github.com/gofiber/fiber/v2"
  "github.com/gofiber/fiber/v2/middleware/cache"
  "github.com/microcosm-cc/bluemonday"
)

// Invocation counters, updated atomically because handlers run concurrently.
var counterGetWebpage int64
var counterGetWebpageHandler int64

func main() {
  log.Print("Starting the app...")

  cacheConfig := cache.Config{
    // Bypass the cache when the client sends ?refresh=true.
    Next: func(c *fiber.Ctx) bool {
      return c.Query("refresh") == "true"
    },
    Expiration:   30 * time.Second,
    CacheControl: true,
    // Key on the path only, so every query string shares one cache entry.
    KeyGenerator: func(c *fiber.Ctx) string {
      return c.Path()
    },
  }

  app := fiber.New()

  // ![ ] FIXME: Cache stampede!
  app.Use(cache.New(cacheConfig))

  app.Get("/sanitize/", getWebpageHandler)

  log.Fatal(app.Listen(":3000"))
}

func getWebpageHandler(ctx *fiber.Ctx) error {
  n := atomic.AddInt64(&counterGetWebpageHandler, 1)
  log.Print("getWebpageHandler invoked #", n)

  // The bombardier commands above pass the target as ?address=...
  url := ctx.Query("address")
  webpage := getWebpage(url)

  // Sanitize the remote HTML; the UGC policy is an assumption, any policy works here.
  return ctx.SendString(bluemonday.UGCPolicy().Sanitize(webpage))
}

func getWebpage(url string) string {
  n := atomic.AddInt64(&counterGetWebpage, 1)
  log.Print("getWebpage invoked #", n)

  client := fiber.AcquireClient()
  defer fiber.ReleaseClient(client)

  // Agent.String returns the status code, the body, and a slice of errors.
  status, body, errs := client.Get(url).String()

  if len(errs) > 0 {
    log.Print("Could not obtain the remote resource: ", errs)
  }

  if status > 200 {
    log.Print("Bad return status: ", status)
  }

  return body
}

Checklist:

  • I agree to follow Fiber's Code of Conduct.
  • I have checked for existing issues that describe my problem prior to opening this one.
  • I understand that improperly formatted bug reports may be closed without explanation.
