Commit 50de5fe

Add a test plan for the pulp-worker scalability test.
1 parent f1d6655 commit 50de5fe

1 file changed, +81 -0 lines changed
@@ -0,0 +1,81 @@
# Worker Scalability Test Plan

<!--toc:start-->
- [Worker Scalability Test Plan](#worker-scalability-test-plan)
  - [Current State](#current-state)
  - [Metrics to follow](#metrics-to-follow)
  - [Test Plan](#test-plan)
  - [Results](#results)
  - [An example to verify different worker names from time to time](#an-example-to-verify-different-worker-names-from-time-to-time)
<!--toc:end-->

The workers keep a permanent connection to the database and keep sending [heartbeats]()
as a signal that they are alive and ready to pick up a task.
The goal of this test plan is to find the maximum number of idle workers that can run before their
heartbeats start to time out.

## Current State
We're going to use the Pulp stage instance to run these tests. Each pulp-worker pod
must request the minimum amount of memory needed, 128MiB, and have access to 250m of CPU.

For the database, we're using AWS RDS PostgreSQL 16 on a db.m7g.2xlarge instance class.
In theory, the maximum number of connections is defined by `LEAST({DBInstanceClassMemory/9531392}, 5000)`,
as written [here](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Limits.html#RDS_Limits.MaxConnections).

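As a rough illustration of the formula above, the sketch below evaluates `LEAST(DBInstanceClassMemory/9531392, 5000)` with shell arithmetic. The function name `max_connections` is hypothetical, and the 32 GiB input is only the nominal memory of a db.m7g.2xlarge. The real `DBInstanceClassMemory` is lower, because part of the memory is reserved for the operating system and the RDS management processes, so the result should be read as an upper bound.

```bash
#!/bin/bash

# Hypothetical helper: evaluate LEAST(DBInstanceClassMemory/9531392, 5000)
# for a memory size given in bytes (integer division).
max_connections() {
    by_memory=$(( $1 / 9531392 ))
    if [ "$by_memory" -lt 5000 ]; then
        echo "$by_memory"
    else
        echo 5000
    fi
}

# Nominal memory of a db.m7g.2xlarge (32 GiB). This is an assumption:
# the real DBInstanceClassMemory value is somewhat lower.
max_connections $(( 32 * 1024 * 1024 * 1024 ))
```

With the nominal 32 GiB this prints `3604`, so the memory-based limit, not the 5000 cap, is the ceiling to expect. The effective value can be read from the database itself with `SHOW max_connections;`.
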
## Metrics to follow
On the database level:
- Active connections
- Connection wait time
- Slow queries
- CPU and memory utilization

Most of those metrics can be checked [here](https://us-east-1.console.aws.amazon.com/rds/home?region=us-east-1#database:id=pulp-prod;is-cluster=false).

For the application, we need to track the heartbeat timeouts using the logs.
You can start [here](https://grafana.app-sre.devshift.net/explore?schemaVersion=1&panes=%7B%22vse%22%3A%7B%22datasource%22%3A%22P1A97A9592CB7F392%22%2C%22queries%22%3A%5B%7B%22id%22%3A%22%22%2C%22region%22%3A%22us-east-1%22%2C%22namespace%22%3A%22%22%2C%22refId%22%3A%22A%22%2C%22queryMode%22%3A%22Logs%22%2C%22expression%22%3A%22fields+%40logStream%2C+%40message%2C++kubernetes.namespace_name+%7C+filter+%40logStream+like+%2Fpulp-stage_pulp-%28worker%7Capi%7Ccontent%29%2F%5Cn%5Cn%5Cn%5Cn%22%2C%22statsGroups%22%3A%5B%5D%2C%22datasource%22%3A%7B%22type%22%3A%22cloudwatch%22%2C%22uid%22%3A%22P1A97A9592CB7F392%22%7D%2C%22logGroups%22%3A%5B%7B%22arn%22%3A%22arn%3Aaws%3Alogs%3Aus-east-1%3A744086762512%3Alog-group%3Acrcs02ue1.pulp-stage%3A*%22%2C%22name%22%3A%22crcs02ue1.pulp-stage%22%2C%22accountId%22%3A%22744086762512%22%7D%5D%7D%5D%2C%22range%22%3A%7B%22from%22%3A%22now-30m%22%2C%22to%22%3A%22now%22%7D%7D%7D&orgId=1)
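
Besides the RDS console, the active-connection count can also be sampled directly from PostgreSQL. The sketch below is one possible way to do it, assuming `psql` is installed and a connection URI is available in a `DATABASE_URL` variable (both are assumptions, not part of the setup described here). The function names are hypothetical. It reads the standard `pg_stat_activity` view, where rows without a state are background processes, and sums the per-state counts.

```bash
#!/bin/bash

# Hypothetical helper: print "state|count" lines, one per connection state.
# $1 is a PostgreSQL connection URI (assumed to be provided by the operator).
connections_by_state() {
    psql "$1" --no-align --tuples-only --field-separator '|' --command \
        "SELECT coalesce(state, 'no state'), count(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC;"
}

# Hypothetical helper: sum the counts from "state|count" lines read on stdin.
total_connections() {
    awk -F '|' '{ total += $2 } END { print total + 0 }'
}

# Usage (not run here):
#   connections_by_state "$DATABASE_URL" | total_connections
```

Comparing this total against the connection limit while the number of workers grows shows how much headroom is left before new workers fail to connect.
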

## Test Plan

1. Warn the Pulp stakeholders (@pulp-service-stakeholders) that a test is going to happen in stage. All tasks must be stopped.
2. Create an MR [here](https://gitlab.cee.redhat.com/service/app-interface/-/blob/master/data/services/pulp/deploy.yml#L75) reducing the resources for each worker and increasing the number of workers.
   It can be approved by anyone from the team. Then just wait for the app-sre bot to merge it.
3. After it is merged, check the database metrics and the application logs. The results from the run will be recorded in the Results section of this document.
4. Check whether any new worker is being started. For that, query the `WorkersAPI` periodically and compare the set of worker names between checks.
   An example of a script to check the workers from time to time is written [here](#an-example-to-verify-different-worker-names-from-time-to-time).
5. When the test is done, open a new MR reducing the number of workers to 0 (zero), and then another MR returning
   the number to its usual state (2 pulp-workers in staging) with the proper resource settings.
6. Send a new message to the Pulp stakeholders (@pulp-service-stakeholders) saying that the test has concluded.

## Results
To be added in the future.

Date of the test: YYYY/mm/dd

A possible template for the table could be:

| Instance Count | Timeout (Yes/No) | Bottleneck Identified |
|----------------|------------------|-----------------------|
| 100            | {{value}}        | `None`                |
| 250            | {{value}}        | `...`                 |
| 500            | {{value}}        | `...`                 |
| 1000           | {{value}}        | `...`                 |

**Key:**
- Use `-` for unavailable metrics.

## An example to verify different worker names from time to time
```bash
#!/bin/bash

# Extract the current list of online worker names.
# -r prints raw strings (no quotes); the list is sorted because comm requires sorted input.
pulp status | jq -r '.online_workers[].name' | sort > current_list.txt

# Compare with the previous list (if it exists).
# comm -3 hides the names present in both lists and shows only the differences.
if [ -f previous_list.txt ]; then
    echo "Differences since last check:"
    comm -3 previous_list.txt current_list.txt
fi

# Update the previous list
mv current_list.txt previous_list.txt
```
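
`comm -3` separates added and removed names only by indentation, which is easy to misread. As a possible variation on the script above, the sketch below reports the two cases on separate, prefixed lines. The function name `report_worker_changes` is hypothetical, and both input files are assumed to be sorted with one worker name per line, as the script above produces.

```bash
#!/bin/bash

# Hypothetical helper: compare two sorted worker lists.
# $1 = previous list, $2 = current list
report_worker_changes() {
    # Lines only in the current list: workers that appeared since the last check
    comm -13 "$1" "$2" | sed 's/^/+ /'
    # Lines only in the previous list: workers that disappeared
    # (possibly a heartbeat timeout or a pod restart)
    comm -23 "$1" "$2" | sed 's/^/- /'
}

# Usage (not run here), polling once a minute:
#   while true; do
#       pulp status | jq -r '.online_workers[].name' | sort > current_list.txt
#       [ -f previous_list.txt ] && report_worker_changes previous_list.txt current_list.txt
#       mv current_list.txt previous_list.txt
#       sleep 60
#   done
```

During an idle-worker test, no output between checks is the expected result; any `+` or `-` line is worth correlating with the logs and the database metrics for the same interval.
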
