Skip to content

feat(e2e): Crossplane DLQ support + drift detection tests + cold-start scaling#270

Open
atemate wants to merge 12 commits intomainfrom
tech-debt-crossplane-e2e-tests/1f2y.fix-scaling-performance-e2e-tests-cold-start-backlog
Open

feat(e2e): Crossplane DLQ support + drift detection tests + cold-start scaling#270
atemate wants to merge 12 commits intomainfrom
tech-debt-crossplane-e2e-tests/1f2y.fix-scaling-performance-e2e-tests-cold-start-backlog

Conversation

@atemate
Copy link
Collaborator

@atemate atemate commented Mar 6, 2026

Summary

  • SQS DLQ queue: Add shared asya-{namespace}-dlq Crossplane Queue resource in asya-crew chart. Each actor queue gets a redrivePolicy pointing at it (constructed from awsAccountId + awsRegion in chart values).
  • Un-skip DLQ tests: Remove @pytest.mark.skip from test_poison_message_moves_to_dlq_e2e and test_dlq_preserves_message_metadata_e2e.
  • Crossplane drift tests: Rewrite 4 queue-health tests — replace "operator recreates queue" framing with Crossplane drift reconciliation, reduce timeout from 360s to CROSSPLANE_RECONCILE_TIMEOUT_SECONDS (default 120s).
  • Cold-start test: Add test_cold_start_backlog_processing — scales actor to 0, enqueues 20-message backlog, asserts 90%+ complete after KEDA scales up.

Merges aints: 1f3k (DLQ un-xfail) + 1fbq (queue health un-skip) + 1f2y (cold-start scaling) → single task 1f2y.

Test plan

  • helm lint deploy/helm-charts/asya-crew/ passes
  • helm lint deploy/helm-charts/asya-crossplane/ passes
  • helm template with awsAccountId=000000000000 shows redrivePolicy in SQS queue
  • helm template with awsAccountId= shows no redrivePolicy
  • DLQ tests collected without skip (grep h6h2 returns 0)
  • Queue health tests collected without skip (grep pwx6 returns 0), renamed to test_crossplane_recreates_*
  • Cold-start test collected, polls for actual pod drain before enqueuing
  • make lint clean

@github-actions github-actions bot added charts feat New feature implementation test labels Mar 6, 2026
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively adds SQS DLQ support, enables the corresponding tests, and refactors the queue health tests to align with the new Crossplane-based reconciliation strategy. A new cold-start scaling test is also introduced, which is a great addition for ensuring scale-from-zero reliability. However, a security audit identified two injection vulnerabilities in the Crossplane composition template where Helm values are rendered without proper escaping. Specifically, a potential template injection via awsAccountId and a JSON injection via maxReceiveCount were found. These issues should be remediated by using the toJson filter to ensure values are safely handled during template rendering. Additionally, the implementation of the new cold-start test could be optimized, as waiting for task completions can be significantly improved to reduce test execution time.

atemate added 4 commits March 6, 2026 21:51
…t test, increase Crossplane reconcile timeout to 300s
Crossplane's SQS provider default drift detection cycle is ~10 minutes.
Queue health chaos tests delete a queue and expect Crossplane to recreate
it within 300s, which is impossible with the 10-min default poll interval.

Add a DeploymentRuntimeConfig for provider-aws-sqs that passes
--poll-interval to the provider pod, gated by providers.aws.pollInterval
in values.yaml (empty = use provider default). Set it to "10s" in the
sqs-s3 E2E test profile so drift is detected within seconds, making the
chaos tests reliably complete within the 300s window.

The annotation-based trigger (_trigger_crossplane_reconcile) remains as
an explicit immediate trigger, complementing the short poll interval.
…-aws-sqs

provider-aws-sqs v1.19.0 (upjet-generated family provider) does not expose
--poll-interval as a CLI flag. Passing it via DeploymentRuntimeConfig causes
the provider pod to crash-loop with 'unknown long flag --poll-interval', which
fails the entire 'Deploy E2E cluster' step before any tests can run.

The annotation-based trigger (_trigger_crossplane_reconcile) added in the
previous commit already forces immediate reconciliation after queue deletion,
making the poll interval override unnecessary for the chaos tests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

charts feat New feature implementation test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant