Floci load testing & bugs #937

steve-hb · 2026-05-19T12:34:14Z

We hit a cluster of bugs while applying a non-trivial backend with terraform-aws against floci, captured across #926/#935/#936. The root cause is coherent: floci handles 1-2 creations cleanly, but past ~10 Lambdas per apply, port/socket reuse, counter wrapping, and synchronization start to break down. The failure modes vary: wrong responses, deployments that succeed but whose invocations fail (usually via EventBridge Scheduler), and stuck deploys where a real quota error retries indefinitely.

The second-order problem is observability. Real AWS rarely hits these paths because customer quotas sit well above defaults. When floci correctly emulates a quota error, the terraform-aws-provider's MaxAttempts=25 burns ~7 minutes before surfacing anything, and floci's own logs say nothing. The response was correct, but nothing told the operator what to set or look out for.

Two prevention paths:

1. Per-operation load tests in the existing suite. Cheap, but would only have caught one of the three Lambda bugs we found. The others lived in the interaction between IAM, Lambda, Scheduler, SQS, and log groups, at terraform-driven concurrency.
2. Expand the terraform compatibility suite: a ~100-resource fixture covering services that actually interact in real applies, with deploy and invoke assertions (the "deploy works, invoke fails" pattern hits every time). This approach would be expensive to maintain, but high signal.

I'd weight strongly toward option 2.

Three smaller asks alongside it:

Treat operator-facing observability as a deliverable: every retryable 4xx that floci returns should also emit a WARN naming the cause and the env var to tune (#936, "fix(lambda): make PortAllocator a checkout/release pool", is the template).
Would a tiny SDK-side chaos harness be in scope? Hammer each API at rising concurrency and assert the AWS SDK terminates within a bounded time on every retryable-error path. Would have caught the LimitExceeded retry storm in 30s of CI.
Budget terraform-E2E to nightly with a smaller per-PR smoke variant - full apply is expensive.
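
To make the chaos-harness ask concrete, here is a minimal sketch of the bounded-time assertion, assuming the API call under test can be wrapped in a `Callable`. `BoundedTimeHarness` and `allTerminateWithin` are hypothetical names for illustration, not existing floci or AWS SDK code. A call that fails with an exception still counts as terminating; only a call that is still running when the budget expires fails the check.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical harness: runs the same call at rising concurrency levels and
// reports whether every invocation finished (normally or with an error)
// within the time budget.
final class BoundedTimeHarness {
    private BoundedTimeHarness() {}

    static <T> boolean allTerminateWithin(Callable<T> call, long budgetMillis, int... levels) {
        for (int level : levels) {
            ExecutorService pool = Executors.newFixedThreadPool(level);
            try {
                List<Callable<T>> tasks = Collections.nCopies(level, call);
                // invokeAll cancels every task that has not completed when the timeout expires.
                List<Future<T>> futures = pool.invokeAll(tasks, budgetMillis, TimeUnit.MILLISECONDS);
                for (Future<T> future : futures) {
                    if (future.isCancelled()) {
                        return false; // still retrying (or hung) when the budget ran out
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            } finally {
                pool.shutdownNow();
            }
        }
        return true;
    }
}
```

In CI the `Callable` would wrap a real SDK call pointed at floci, with the quota configured low enough to force the retryable-error path; that wiring is left out here because it depends on the client setup.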

References: #926, #935, #936

What are your thoughts?

hectorvent · 2026-05-19T12:58:37Z

Maintainer

Hi @steve-hb,

Thanks for the fantastic write-up — the root-cause framing across #926, #935, and #936 really clarifies where the cracks show up under terraform-driven concurrency.

I'm aligned on weighting toward option 2. A ~100-resource fixture with deploy and invoke assertions is exactly the shape of test that would catch the "deploy succeeds, invoke fails" class of bugs the per-operation suite misses. On CI cost, let's budget the full terraform-E2E to nightly, with a smaller smoke variant gating each PR — that way we get the signal without paying the full apply cost on every push.

On the smaller asks:

The operator-facing WARN on every retryable 4xx is a great call. I'll adopt the PortAllocator log from #936 ("fix(lambda): make PortAllocator a checkout/release pool") as the template going forward.
The SDK-side chaos harness is in scope — a bounded-time assertion on retryable paths is cheap insurance against future retry-storm regressions, and it pairs naturally with the WARN log work.

I'll open tracking issues for the nightly suite, the PR smoke variant, and the chaos harness so we can move on them independently.

Really appreciate the time you put into this.

3 replies

hectorvent May 23, 2026
Maintainer

@steve-hb I'd like to proceed with the next phase on this. Could you create issues/tasks for anything still pending adjustment?

steve-hb May 26, 2026
Author

Have to check the next few days, currently busy with a project based on floci for testing actually :)

steve-hb Jun 23, 2026
Author

fyi: still busy with other deadline stuff, but I looked into this more.

Confirmed the IAM one and have a fix on fix/iam-role-policy-concurrency. The services are @ApplicationScoped singletons and storage hands back the same entity instance to every request, so putRolePolicy does an unsynchronized read-modify-write on a plain HashMap and writes get lost (last writer wins). That's where the "PutRolePolicy succeeds but the GetRolePolicy right after says NoSuchEntity" comes from under terraform -parallelism. Fix is concurrent collections plus locking the compound ops, and I've got a test that reproduces it.
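
For anyone who wants the shape of the fix without reading the branch, a minimal sketch, assuming a role entity that keeps its inline policies in a map. `RoleSketch`, `PolicyVersion`, and `RolePolicyHammer` are hypothetical stand-ins, not the actual classes on fix/iam-role-policy-concurrency. The read-modify-write goes through `ConcurrentHashMap.compute`, which applies the update atomically per key, so concurrent writers cannot overwrite each other's result.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Immutable value: a policy document plus a revision counter (hypothetical).
final class PolicyVersion {
    final String document;
    final int revision;

    PolicyVersion(String document, int revision) {
        this.document = document;
        this.revision = revision;
    }
}

// Hypothetical stand-in for the shared role entity that every request sees.
final class RoleSketch {
    private final Map<String, PolicyVersion> inlinePolicies = new ConcurrentHashMap<>();

    void putRolePolicy(String name, String document) {
        // compute() runs the read-modify-write atomically for this key.
        inlinePolicies.compute(name, (key, old) ->
                new PolicyVersion(document, old == null ? 1 : old.revision + 1));
    }

    PolicyVersion getRolePolicy(String name) {
        return inlinePolicies.get(name);
    }
}

// Reproduction-style check: many threads write the same policy name.
final class RolePolicyHammer {
    static int revisionAfter(int threads, int writesPerThread) {
        RoleSketch role = new RoleSketch();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                try {
                    start.await(); // release all writers at once
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < writesPerThread; i++) {
                    role.putRolePolicy("inline", "{}");
                }
            });
        }
        start.countDown();
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("writers did not finish");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        PolicyVersion latest = role.getRolePolicy("inline");
        return latest == null ? 0 : latest.revision;
    }
}
```

With the atomic update, `RolePolicyHammer.revisionAfter(8, 1000)` counts every write. Swapping in a plain `HashMap` with a separate get and put is the variant that drops writes under the same load.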

The same pattern is all over the place: singleton service, shared entity from storage, mutated in place with no locking. Did a quick pass and hit it in:

Secrets Manager: the version map and the AWSCURRENT/AWSPREVIOUS stage moves aren't guarded at all, so you can end up with two or zero AWSCURRENT, or lost versions
API Gateway v1: methods, integration/method responses, stage variables and tags all mutate plain HashMaps in place
API Gateway v2: tag map race, plus deleteApi's cascade delete races concurrent child creates and leaves orphaned routes/integrations
S3: the put path is locked but deleteObject and the tagging getters aren't
Lambda: tags, permissions, shard sequence numbers, and some container/port leak stuff in the runtime
ACM: tag add/remove and the status transition
CloudFront is the one that's actually fine. It just synchronizes every mutator, which is roughly what the rest need.
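
As an illustration of the "synchronize every mutator" approach applied to the Secrets Manager case, a minimal sketch, assuming stage labels are stored per version ID. `SecretSketch` and `SecretStageHammer` are hypothetical names and do not mirror floci's actual storage model. Because the whole stage move happens under one lock, a reader can never observe two or zero AWSCURRENT holders.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical secret entity: version ID -> stage labels, guarded by one lock.
final class SecretSketch {
    static final String CURRENT = "AWSCURRENT";
    static final String PREVIOUS = "AWSPREVIOUS";

    private final Map<String, Set<String>> stagesByVersion = new HashMap<>();

    // The compound stage move is a single critical section.
    synchronized void putSecretValue(String versionId) {
        Set<String> existing = stagesByVersion.get(versionId);
        if (existing != null && existing.contains(CURRENT)) {
            return; // already current, nothing to move
        }
        String oldCurrent = null;
        for (Map.Entry<String, Set<String>> entry : stagesByVersion.entrySet()) {
            entry.getValue().remove(PREVIOUS);
            if (entry.getValue().remove(CURRENT)) {
                oldCurrent = entry.getKey();
            }
        }
        if (oldCurrent != null) {
            stagesByVersion.get(oldCurrent).add(PREVIOUS);
        }
        stagesByVersion.computeIfAbsent(versionId, key -> new HashSet<>()).add(CURRENT);
    }

    synchronized int countWithStage(String stage) {
        int count = 0;
        for (Set<String> stages : stagesByVersion.values()) {
            if (stages.contains(stage)) {
                count++;
            }
        }
        return count;
    }

    synchronized int versionCount() {
        return stagesByVersion.size();
    }
}

// Concurrent writers, then a check of the stage invariants.
final class SecretStageHammer {
    // Returns {AWSCURRENT holders, AWSPREVIOUS holders, total versions}.
    static int[] stageCountsAfter(int threads, int putsPerThread) {
        SecretSketch secret = new SecretSketch();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int id = t;
            pool.submit(() -> {
                for (int i = 0; i < putsPerThread; i++) {
                    secret.putSecretValue("v-" + id + "-" + i);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("writers did not finish");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new int[] {
            secret.countWithStage(SecretSketch.CURRENT),
            secret.countWithStage(SecretSketch.PREVIOUS),
            secret.versionCount()
        };
    }
}
```

The coarse lock serializes all writers on one secret, which is the performance cost in question; the scan over versions is linear and would need an index if version counts get large.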

Haven't fixed any of these yet, just flagging. IAM fix is ready to look at whenever.
But when looking at fixing those, we need to consider the trade-offs: synchronized is easy, but some of these patches might decrease performance a little - worth it imo. Rather have a race-free platform than a racy-but-performant one. (despite this one guy "benchmarking" his own alternative project against floci thus "proving" his project is somehow superior...)
