Add Scale Testing Pipeline for Cilium L7 & Kubernetes Network Policies #554

Open
wants to merge 43 commits into
base: main

Conversation

sanamsarath

This pull request establishes a new pipeline for scale testing Cilium L7 network policies and Kubernetes network policies. It can also be configured to run feature, soak, and load tests. It currently handles network policies that match HTTP traffic; support for benchmarking and scale testing other L4 and L7 network policies is planned.

[Copilot generated Summary]
This pull request introduces several new configurations and functionalities for Cilium and network policy scale testing in the clusterloader2 framework. The changes include adding new measurement modules, updating configuration files, and enhancing the main script to support these new features.

Key changes include:

New Measurement Modules:

  • Added cilium-envoy-measurments.yaml to collect various Cilium Envoy HTTP and memory metrics using Prometheus queries.
  • Added cilium-measurements.yaml to gather additional Cilium metrics such as queueing delay, CPU usage, and memory usage using Prometheus queries.

Configuration Updates:

  • Updated netpol-scale-config.yaml to include parameters for enabling Cilium and Cilium Envoy, and to define the steps for starting and gathering measurements.

Script Enhancements:

  • Enhanced netpol_scale.py to include functions for configuring, validating, executing, and collecting results from clusterloader2 tests. The script now supports command-line arguments for various test parameters and configurations.

Pipeline Configuration:

  • Added a new pipeline configuration file netpol-scale-testing.yml to define the CI/CD pipeline for network policy scale testing using clusterloader2.
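
As a rough illustration of the script layout described under "Script Enhancements", the four stages could be exposed as argparse subcommands. This is a minimal sketch under stated assumptions, not the code in this PR: the subcommand names `configure`, `validate`, `execute`, and `collect` and the `build_parser` helper are hypothetical and only mirror the wording of the summary.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI with one subcommand per stage (stage names are assumed)."""
    parser = argparse.ArgumentParser(
        description="Network policy scale test helper (illustrative only)"
    )
    # required=True makes argparse reject an invocation with no subcommand.
    subparsers = parser.add_subparsers(dest="command", required=True)

    # Hypothetical stages mirroring configure / validate / execute / collect.
    for name in ("configure", "validate", "execute", "collect"):
        subparsers.add_parser(name, help=f"Run the {name} stage")
    return parser
```

Calling `build_parser().parse_args(["validate"])` yields a namespace whose `command` attribute names the selected stage, which the script can then dispatch on.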

…date related references in configuration files
…er2 function by removing unused parameters for improved clarity
…updating cloud_info parameter for consistency
@sanamsarath sanamsarath marked this pull request as ready for review March 10, 2025 15:16
@@ -10,6 +10,8 @@ Repository Bloat Risks
*.gz
bin/
debug/
venv/
.gitignore

I don't think we want the gitignore in this file

displayName: "Every 8 hours"
branches:
include:
- sarathsa/cilium-l7-scale

Should this still be present?

- $(LOCATION)
engine: clusterloader2
engine_input:
image: "ghcr.io/sanamsarath/clusterloader2:vtest" # TODO: Fix this after perf-tests PR is merged

Order of operations here?

# TODO: Remove aks once CL2 update provider name to be azure


def configure_clusterloader2(
Contributor

@rafael-mendes-pereira rafael-mendes-pereira Mar 12, 2025

Please add unit tests for all these new methods.
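
One possible shape for such a test, as a sketch only: `write_test_config` below is a hypothetical stand-in for the part of the configure step that writes the test duration, and the `CL2_DURATION` key is an assumed name, not one taken from this PR. The test writes to a temporary directory so it does not depend on any cluster or on clusterloader2 being installed.

```python
import os
import tempfile
import unittest


def write_test_config(path, test_duration_secs):
    """Hypothetical stand-in for the code under test (not the PR's function)."""
    with open(path, "w", encoding="utf-8") as file:
        file.write("# Test config\n")
        # Append "s" so the value reads as a duration in seconds.
        # CL2_DURATION is an assumed key name used only for illustration.
        file.write(f"CL2_DURATION: {test_duration_secs}s\n")


class WriteTestConfigTest(unittest.TestCase):
    def test_duration_gets_seconds_suffix(self):
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "overrides.yaml")
            write_test_config(path, 120)
            with open(path, encoding="utf-8") as file:
                lines = file.read().splitlines()
        self.assertEqual(lines, ["# Test config", "CL2_DURATION: 120s"])
```

The same pattern (temporary directory, call the function, assert on the file contents) should carry over to the real configure function once its parameters are substituted in.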

Comment on lines +15 to +16
DAEMONSETS_PER_NODE = {"aws": 2, "azure": 6, "aks": 6}
CPU_CAPACITY = {"aws": 0.94, "azure": 0.87, "aks": 0.87}
Contributor

You could import it from

DAEMONSETS_PER_NODE = {
"aws": 2,
"azure": 6,
"aks": 6
}
CPU_CAPACITY = {
"aws": 0.94,
"azure": 0.87,
"aks": 0.87
}

Comment on lines +52 to +59
# test config
# add "s" at the end of test_duration_secs
file.write("# Test config\n")
test_duration = str(test_duration_secs) + "s"
# Test config
# add "s" at the end of test_duration_secs
file.write("# Test config\n")
test_duration = f"{test_duration_secs}s"
Contributor

Duplicated

file.close()


def validate_clusterloader2(node_count=2, operation_timeout_in_minutes=10):
Contributor

)


def execute_clusterloader2(
Contributor

content = ""
for f in os.listdir(cl2_report_dir):
file_path = os.path.join(cl2_report_dir, f)
with open(file_path, "r", encoding="utf-8") as file:
Contributor

Add a try/except so it can continue reading the other files if one of them fails.
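
A minimal sketch of what this suggestion could look like, assuming the loop only needs the text of each report file. The `read_reports` helper and its return shape are hypothetical, not part of the PR; the point is that the `try` wraps a single file, so one unreadable or badly encoded file is logged and skipped instead of aborting the loop.

```python
import os
from typing import Dict, List, Tuple


def read_reports(report_dir: str) -> Tuple[Dict[str, str], List[str]]:
    """Read every regular file in report_dir, skipping files that fail to load."""
    contents: Dict[str, str] = {}
    failed: List[str] = []
    for name in sorted(os.listdir(report_dir)):
        file_path = os.path.join(report_dir, name)
        if not os.path.isfile(file_path):
            continue  # ignore subdirectories
        try:
            with open(file_path, "r", encoding="utf-8") as file:
                contents[name] = file.read()
        except (OSError, UnicodeDecodeError) as err:
            # Record the failure and move on to the next file.
            print(f"Skipping {file_path}: {err}")
            failed.append(name)
    return contents, failed
```

Returning the list of failed names lets the caller decide whether a partial result is acceptable or the run should be marked as failed.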

help="Number of workers per client",
)
parser_configure.add_argument(
"--netpol_type", type=str, required=True, help="Type of network policy"
Contributor

Could add the valid options
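
For instance, argparse can enforce and document the allowed values through `choices`. This is a sketch only: the values in `NETPOL_TYPES` are placeholders, since the set of policy types the script actually supports is not stated in this thread, and `build_configure_parser` is a hypothetical helper standing in for `parser_configure`.

```python
import argparse

# Placeholder values; replace with the policy types the script really supports.
NETPOL_TYPES = ("k8s", "cnp", "ccnp")


def build_configure_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for parser_configure with validated options."""
    parser = argparse.ArgumentParser(prog="configure")
    parser.add_argument(
        "--netpol_type",
        type=str,
        required=True,
        choices=NETPOL_TYPES,  # argparse rejects any other value
        help="Type of network policy; one of: " + ", ".join(NETPOL_TYPES),
    )
    return parser
```

With `choices` set, an unsupported value fails at argument parsing with a message listing the valid options, rather than surfacing later inside the test run.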

Comment on lines +18 to +21
- script: |
run_id=$(Build.BuildId)-$(System.JobId)
echo "Run ID: $run_id"
echo "##vso[task.setvariable variable=RUN_ID]$run_id"
Contributor

This is not required as the run id is already set here

Comment on lines +29 to +31
matrix:
azure_cilium:
cl2_config_file: netpol-scale-config.yaml
Contributor

What about the other parameters? Example:

matrix:
azure_cilium:
cpu_per_node: 4
node_count: 1000
node_per_step: 1000
max_pods: 20
repeats: 10
scale_timeout: "15m"
cilium_enabled: True
network_policy: cilium
network_dataplane: cilium
service_test: True
cl2_config_file: load-config.yaml

3 participants