Skip to content

Conversation

@t0mmylam
Copy link
Collaborator

Adds comprehensive end-to-end tests for DeploymentPolicy functionality, including multi-compartment rollouts, various strategies (Fixed, Linear, Exponential), and backwards compatibility with InterruptionBudget.

Changes

  • 4 new E2E test suites covering core deployment policy features
  • Prometheus metrics validation for all 8 rollout metrics
  • CI integration with new deployment-policy-tests job
  • Makefile targets for easy local testing
  • Podman keyring quota fix for macOS and Linux

Testing

  • All 4 test suites passing locally
  • Metrics validation working correctly
  • CI job configured and ready to run
  • Tested on 15-node Kind cluster

Test Coverage

  • ✅ Multi-compartment rollouts (15 nodes)
  • ✅ Linear ramp-up strategy (8 nodes)
  • ✅ Overlapping label selectors (6 nodes)
  • ✅ Legacy InterruptionBudget compatibility (6 nodes)
  • ✅ Prometheus metrics export and validation
  • ✅ Budget enforcement (count and percent-based)
  • ✅ Batch progression and state tracking

@t0mmylam t0mmylam changed the title add e2e tests for deployment policy feat: add e2e tests for deployment policy Oct 30, 2025
Copy link
Collaborator

@lockwobr lockwobr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests are not making assertions it seems

try:
- apply:
file: skyhook.yaml
- script:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 54 to 70
- script:
content: |
echo "=== Verifying synthetic __default__ compartment ==="
COMPARTMENTS=$(kubectl get skyhook legacy-interruption-budget-test -o jsonpath='{.status.compartmentStatuses}' | jq -r 'keys[]')
echo "Compartments found: $COMPARTMENTS"
DEFAULT=$(echo "$COMPARTMENTS" | grep -c "__default__" || echo "0")
if [ "$DEFAULT" != "1" ]; then
echo "ERROR: Expected __default__ compartment to be created"
echo "Found compartments: $COMPARTMENTS"
kubectl get skyhook legacy-interruption-budget-test -o yaml
exit 1
fi
echo "✓ Synthetic __default__ compartment created"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be done with an assertion I think, its cleaner and is how the other tests work.
https://kyverno.github.io/chainsaw/latest/operations/assert/#examples

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't there already a metrics_test.py?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe there is already a script to do this as well.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind there isn't, but I see this same script a few times in different tests. I would just make a top-level script that can add or remove given labels and then call that in the tests when needed instead of writing a new script for each test.

# Label all 8 nodes as production tier
echo "Labeling production nodes (0-7)..."
for i in {0..7}; do
kubectl label node ${WORKER_ARRAY[$i]} tier=production skyhook.nvidia.com/test-node=skyhooke2e --overwrite
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing here with these scripts, I see it a few times, I would make a somewhat generalized script at the top level and then call that in these tests.

@if command -v podman >/dev/null 2>&1; then \
echo "📝 Detected Podman - increasing kernel keyring quota..."; \
if [ "$$(uname)" = "Darwin" ]; then \
podman machine ssh sudo sysctl -w kernel.keys.maxkeys=20000 2>/dev/null || true; \
Copy link
Collaborator

@riceriley59 riceriley59 Oct 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this needed? Were the 15 nodes causing memory issues?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kinda, basically Podman has default keyring limits since each container uses keys for security stuff (i think). This just bumps those numbers up so the nodes don't get into a crashloop

Comment on lines 116 to 164
steps:
- uses: actions/checkout@v4
with:
fetch-tags: true
fetch-depth: 0
- name: Setup Go ${{ env.GO_VERSION }}
uses: actions/setup-go@v5
with:
go-version: ${{ env.GO_VERSION }}
cache-dependency-path: operator/go.sum
- name: Log in to the Container registry
uses: docker/login-action@v3
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Create 15-node Kind Cluster
id: kind
uses: helm/kind-action@v1
with:
version: v0.30.0
node_image: kindest/node:v1.34.0
config: k8s-tests/chainsaw/deployment-policy/kind-config.yaml
cluster_name: skyhook-dp-test
# Cache build tools and dependencies for faster builds
- name: Restore cached Binaries
id: cached-binaries
uses: actions/cache/restore@v4
with:
key: ${{ env.GO_VERSION }}-${{ runner.os }}-${{ runner.arch }}-bin-${{ hashFiles('operator/deps.mk') }}
restore-keys: ${{ env.GO_VERSION }}-${{ runner.os }}-${{ runner.arch }}-bin-
path: |
${{ github.workspace }}/operator/bin
~/.cache/go-build
- name: Install dependencies
if: steps.cached-binaries.outputs.cache-hit != 'true'
run: |
cd operator
make install-deps
- name: Save cached Binaries
id: save-cached-binaries
if: steps.cached-binaries.outputs.cache-hit != 'true'
uses: actions/cache/save@v4
with:
key: ${{ env.GO_VERSION }}-${{ runner.os }}-${{ runner.arch }}-bin-${{ hashFiles('operator/deps.mk') }}
path: |
${{ github.workspace }}/operator/bin
~/.cache/go-build
# Run deployment policy E2E tests
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a clone of the other test. Is there away we reduce this boilerplate? maybe yaml anchor? Also this seems to run pretty fast. I wonder if we make the other tests fast by breaking them up and making the parallel? maybe that should be a different PR.

Copy link
Collaborator Author

@t0mmylam t0mmylam Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could split unit/e2e/helm into separate matrix jobs, but with 4 K8s versions that'd create 12+ parallel jobs. Maybe we could split just the e2e tests into its own matrix as a follow-up PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants