Hi @steve-hb, Thanks for the fantastic write-up — the root-cause framing across #926, #935, and #936 really clarifies where the cracks show up under terraform-driven concurrency. I'm aligned on weighting toward option 2. A ~100-resource fixture with deploy and invoke assertions is exactly the shape of test that would catch the "deploy succeeds, invoke fails" class of bugs the per-operation suite misses. On CI cost, let's budget the full terraform-E2E to nightly, with a smaller smoke variant gating each PR — that way we get the signal without paying the full apply cost on every push. On the smaller asks:
I'll open tracking issues for the nightly suite, the PR smoke variant, and the chaos harness so we can move on them independently. Really appreciate the time you put into this.
We hit a cluster of bugs while applying a non-trivial backend with terraform-aws against floci, captured across #926/#935/#936. The root cause is coherent: floci handles 1-2 creations cleanly, but past ~10 Lambdas per apply, port/socket reuse, counter wrapping, and synchronization start to break down. The failure modes vary: wrong responses, deployments that succeed but whose invocations fail (usually via EventBridge Scheduler), and stuck deploys where a real quota error retries indefinitely.
The second-order problem is observability. Real AWS rarely hits these paths because customer quotas sit well above defaults. When floci correctly emulates a quota error, the terraform-aws-provider's MaxAttempts=25 burns ~7 minutes before surfacing anything, and floci's own logs say nothing. The response was correct, but nothing told the operator what to set or look out for.
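For a rough sense of where a figure like ~7 minutes can come from, here is a back-of-the-envelope sketch. It assumes capped exponential backoff that doubles from 1 s up to a 20 s ceiling, with no jitter; those parameters are assumptions chosen for illustration, not values taken from the provider's source, and jitter would make the real total shorter.

```python
def worst_case_wait_seconds(max_attempts, base_s=1.0, cap_s=20.0):
    """Upper bound on total sleep time across retries.

    Assumes capped exponential backoff without jitter (an assumption,
    not the provider's documented behaviour).
    """
    waits = max_attempts - 1  # no sleep after the final attempt
    return sum(min(base_s * 2 ** i, cap_s) for i in range(waits))


total = worst_case_wait_seconds(25)
print(f"{total:.0f} s")  # 411 s, i.e. just under 7 minutes
```

Under these assumptions, almost all of the wait comes from the 19 retries that sit at the cap, so lowering either the attempt count or the ceiling shortens the silent period far more than tuning the base delay.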
Two prevention paths:
1. Per-operation load tests in the existing suite. Cheap, but would only have caught one of the three Lambda bugs we found. The others lived in the interaction between IAM, Lambda, Scheduler, SQS, and log groups, at terraform-driven concurrency.
2. Expand the terraform compatibility suite: a ~100-resource fixture covering services that actually interact in real applies, with deploy and invoke assertions (the "deploy works, invoke fails" pattern hits every time). This approach would be expensive to maintain, but high signal.
I'd weight strongly toward option 2.
Three smaller asks alongside it:
- Treat operator-facing observability as a deliverable: every retryable 4xx that floci returns should also emit a WARN naming the cause and the env var to tune (#936, "fix(lambda): make PortAllocator a checkout/release pool", is the template).
- Would a tiny SDK-side chaos harness be in scope? Hammer each API at rising concurrency and assert the AWS SDK terminates within a bounded time on every retryable-error path. Would have caught the LimitExceeded retry storm in 30s of CI.
- Budget terraform-E2E to nightly with a smaller per-PR smoke variant - full apply is expensive.
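
To make the first two asks concrete, here is a minimal, self-contained sketch. It is not floci code and does not use the AWS SDK: `StubEmulator`, `RetryableError`, `call_with_retries`, and the `EMULATOR_MAX_CONCURRENT_CREATES` variable are all hypothetical stand-ins. The stub logs a WARN naming the cause and the knob to tune whenever it returns a retryable error, and the harness hammers it at rising concurrency and asserts that every call terminates within a bound.

```python
import logging
import threading
import time
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger("emulator")

# Hypothetical name: this env var is illustrative, not a real floci setting.
TUNING_ENV_VAR = "EMULATOR_MAX_CONCURRENT_CREATES"


class RetryableError(Exception):
    """Stand-in for a retryable 4xx such as LimitExceeded."""


class StubEmulator:
    """Toy service that rejects calls above a concurrency limit."""

    def __init__(self, limit=4):
        self._limit = limit
        self._in_flight = 0
        self._lock = threading.Lock()

    def create(self):
        with self._lock:
            if self._in_flight >= self._limit:
                # Ask 1: name the cause and the setting to tune.
                log.warning(
                    "LimitExceeded: %d creates in flight (limit %d); raise %s",
                    self._in_flight, self._limit, TUNING_ENV_VAR,
                )
                raise RetryableError("LimitExceeded")
            self._in_flight += 1
        try:
            time.sleep(0.005)  # simulated work
        finally:
            with self._lock:
                self._in_flight -= 1


def call_with_retries(op, max_attempts=5, base_delay=0.001, max_delay=0.01):
    """Bounded retry loop; returns True on success, False when exhausted."""
    for attempt in range(max_attempts):
        try:
            op()
            return True
        except RetryableError:
            if attempt < max_attempts - 1:
                time.sleep(min(base_delay * 2 ** attempt, max_delay))
    return False


def hammer(emulator, concurrency, deadline_s):
    """Run `concurrency` parallel calls; return (elapsed seconds, results)."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [
            pool.submit(call_with_retries, emulator.create)
            for _ in range(concurrency)
        ]
        results = [f.result(timeout=deadline_s) for f in futures]
    return time.monotonic() - start, results


def run_chaos(levels=(1, 4, 16, 64), deadline_s=2.0):
    """Ask 2: at each concurrency level, every call must end within the bound."""
    report = {}
    for n in levels:
        elapsed, results = hammer(StubEmulator(), n, deadline_s)
        assert elapsed < deadline_s, f"{n} callers took {elapsed:.2f}s"
        report[n] = (sum(results), len(results))  # (succeeded, total)
    return report
```

The assertion is on termination, not success: a call that exhausts its retries and returns `False` still passes, while a client that retries indefinitely fails the deadline. A real harness would swap the stub for the emulator's HTTP endpoint and the retry loop for the actual SDK client.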
References: #926, #935, #936
What are your thoughts?