Commit 090b7bd

Add a test plan for the pulp-worker scalability test.

1 parent f1d6655

2 files changed: +190 −0

@@ -0,0 +1,80 @@
# Worker Scalability Test Plan

<!--toc:start-->
- [Worker Scalability Test Plan](#worker-scalability-test-plan)
  - [Current State](#current-state)
  - [Metrics to follow](#metrics-to-follow)
  - [Test Plan](#test-plan)
  - [Results](#results)
  - [An example to verify different worker names from time to time](#an-example-to-verify-different-worker-names-from-time-to-time)
<!--toc:end-->

The workers hold a permanent connection to the database and keep sending [heartbeats]()
as a signal that they are alive and ready to take on a task.
The idea of this test plan is to find the maximum number of idle workers we can run
before their heartbeats start to time out.

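Since the heartbeats live in the database, a quick database-level staleness check can complement the metrics below. This is a minimal sketch, assuming direct `psql` access to the instance, pulpcore's default `core_worker` table with a `last_heartbeat` column, and the default 30-second worker TTL; verify all three against your deployment.

```bash
#!/bin/bash
# Minimal sketch: count workers whose last heartbeat is older than the
# assumed 30-second TTL. Table/column names and the TTL are assumptions
# based on pulpcore defaults; adjust if your schema differs.
psql "$DATABASE_URL" -c "
  SELECT count(*) AS stale_workers
  FROM core_worker
  WHERE last_heartbeat < now() - interval '30 seconds';"
```
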
## Current State

We're going to use the Pulp stage instance to run these tests. Each pulp-worker pod
must request the minimal amount needed: 128MiB of memory and 250m of CPU.

For the database, we're using AWS RDS PostgreSQL 16 on a db.m7g.2xlarge instance class.
In theory, the maximum number of connections is defined by `LEAST({DBInstanceClassMemory/9531392}, 5000)`,
as documented [here](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Limits.html#RDS_Limits.MaxConnections).

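As a rough sanity check (assuming the formula sees the full 32 GiB of a db.m7g.2xlarge, which RDS may reduce for OS overhead), `LEAST(34359738368 / 9531392, 5000)` works out to roughly 3,604 connections, well below the 5,000 cap. That puts an approximate ceiling on how many always-connected workers the database can hold.
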
## Metrics to follow

At the database level:
- Active connections
- Connection wait time
- Slow queries
- CPU and memory utilization

Most of these metrics can be checked [here](https://us-east-1.console.aws.amazon.com/rds/home?region=us-east-1#database:id=pulp-prod;is-cluster=false).

For the application, we need to watch for timeouts in the logs.
You can start [here](https://grafana.app-sre.devshift.net/explore?schemaVersion=1&panes=%7B%22vse%22%3A%7B%22datasource%22%3A%22P1A97A9592CB7F392%22%2C%22queries%22%3A%5B%7B%22id%22%3A%22%22%2C%22region%22%3A%22us-east-1%22%2C%22namespace%22%3A%22%22%2C%22refId%22%3A%22A%22%2C%22queryMode%22%3A%22Logs%22%2C%22expression%22%3A%22fields+%40logStream%2C+%40message%2C++kubernetes.namespace_name+%7C+filter+%40logStream+like+%2Fpulp-stage_pulp-%28worker%7Capi%7Ccontent%29%2F%5Cn%5Cn%5Cn%5Cn%22%2C%22statsGroups%22%3A%5B%5D%2C%22datasource%22%3A%7B%22type%22%3A%22cloudwatch%22%2C%22uid%22%3A%22P1A97A9592CB7F392%22%7D%2C%22logGroups%22%3A%5B%7B%22arn%22%3A%22arn%3Aaws%3Alogs%3Aus-east-1%3A744086762512%3Alog-group%3Acrcs02ue1.pulp-stage%3A*%22%2C%22name%22%3A%22crcs02ue1.pulp-stage%22%2C%22accountId%22%3A%22744086762512%22%7D%5D%7D%5D%2C%22range%22%3A%7B%22from%22%3A%22now-30m%22%2C%22to%22%3A%22now%22%7D%7D%7D&orgId=1)

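If you prefer querying CloudWatch Logs Insights directly, the query below mirrors the one embedded in the Grafana link above; the extra `timeout` filter is an assumption about the exact log wording, so adjust the pattern to match what the workers actually emit.

```
fields @logStream, @message, kubernetes.namespace_name
| filter @logStream like /pulp-stage_pulp-(worker|api|content)/
| filter @message like /timeout/ or @message like /Timeout/
| sort @timestamp desc
```
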
## Test Plan

1. Warn the pulp stakeholders (@pulp-service-stakeholders) that a test is going to happen in stage. All tasks must be stopped.
2. Create an MR [here](https://gitlab.cee.redhat.com/service/app-interface/-/blob/master/data/services/pulp/deploy.yml#L75) reducing the resources for each worker and increasing the number of workers.
   It can be approved by anyone on the team; just wait for the app-sre bot to merge it.
3. After it's merged, watch the database metrics and application logs. The results from the run will be recorded below in this document.
4. Check whether any new workers are being started. For that, use the `WorkersAPI` and compare the set of worker names from time to time.
   An example script that does this is included [below](#an-example-to-verify-different-worker-names-from-time-to-time).
5. When the test is done, open a new MR reducing the number of workers to 0 (zero), then another MR returning
   the worker count to its usual state (2 pulp-workers in staging) and restoring its usual resource requests.
6. Send a message to the Pulp stakeholders (@pulp-service-stakeholders) saying that the test is concluded.

## Results
To be added in the future.

Date of the test: YYYY/mm/dd

A possible template for the table could be:

| Instance Count | Timeout (Yes/No) | Bottleneck Identified |
|----------------|------------------|-----------------------|
| 100            | {{value}}        | `None`                |
| 250            | {{value}}        | `...`                 |
| 500            | {{value}}        | `...`                 |
| 1000           | {{value}}        | `...`                 |

**Key:**
- Use `-` for unavailable metrics.

## An example to verify different worker names from time to time

```bash
#!/bin/bash

# Extract the current list of online worker names.
# -r makes jq emit raw strings so comm compares bare names.
pulp status | jq -r '.online_workers[].name' | sort > current_list.txt

# Compare with the previous list (if it exists)
if [ -f previous_list.txt ]; then
    echo "Differences since last check:"
    comm -3 previous_list.txt current_list.txt
fi

# Update the previous list
mv current_list.txt previous_list.txt
```
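
To keep this running for the duration of the test, save the script (the filename `check_workers.sh` and the 60-second interval here are arbitrary choices) and loop it:

```bash
# Re-run the comparison every 60 seconds until interrupted with Ctrl-C.
chmod +x check_workers.sh
while true; do
    ./check_workers.sh
    sleep 60
done
```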
@@ -0,0 +1,110 @@
From 15d6f725a1304d0850c1569e75ad5c9f41024a96 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Andr=C3=A9=20=22decko=22=20de=20Brito?= <[email protected]>
Date: Tue, 15 Apr 2025 11:16:12 -0300
Subject: [PATCH] Add the task name to the task log.

Close #6482
---
 CHANGES/6482.feature       |  1 +
 pulpcore/tasking/tasks.py  | 19 +++++++++++++------
 pulpcore/tasking/worker.py |  4 ++--
 3 files changed, 16 insertions(+), 8 deletions(-)
 create mode 100644 CHANGES/6482.feature

diff --git a/CHANGES/6482.feature b/CHANGES/6482.feature
new file mode 100644
index 000000000..a700a8761
--- /dev/null
+++ b/CHANGES/6482.feature
@@ -0,0 +1 @@
+Added the `task_name` and if it's a `task.immediate` to the `Starting task ...` log entry.
diff --git a/pulpcore/tasking/tasks.py b/pulpcore/tasking/tasks.py
index 5018f96ca..16616214b 100644
--- a/pulpcore/tasking/tasks.py
+++ b/pulpcore/tasking/tasks.py
@@ -60,7 +60,14 @@ def _execute_task(task):
     task.set_running()
     domain = get_domain()
     try:
-        _logger.info(_("Starting task %s in domain: %s"), task.pk, domain.name)
+        _logger.info(
+            "Starting task id: %s in domain: %s, task_type: %s, immediate: %s, deferred: %s",
+            task.pk,
+            domain.name,
+            task.name,
+            str(task.immediate),
+            str(task.deferred),
+        )

         # Execute task
         module_name, function_name = task.name.rsplit(".", 1)
@@ -70,7 +77,7 @@ def _execute_task(task):
         kwargs = task.enc_kwargs or {}
         result = func(*args, **kwargs)
         if asyncio.iscoroutine(result):
-            _logger.debug(_("Task is coroutine %s"), task.pk)
+            _logger.debug("Task is coroutine %s", task.pk)
             loop = asyncio.get_event_loop()
             loop.run_until_complete(result)

@@ -78,7 +85,7 @@ def _execute_task(task):
         exc_type, exc, tb = sys.exc_info()
         task.set_failed(exc, tb)
         _logger.info(
-            _("Task[{task_type}] {task_pk} failed ({exc_type}: {exc}) in domain: {domain}").format(
+            "Task[{task_type}] {task_pk} failed ({exc_type}: {exc}) in domain: {domain}".format(
                 task_type=task.name,
                 task_pk=task.pk,
                 exc_type=exc_type.__name__,
@@ -90,7 +97,7 @@ def _execute_task(task):
         send_task_notification(task)
     else:
         task.set_completed()
-        _logger.info(_("Task completed %s in domain: %s"), task.pk, domain.name)
+        _logger.info("Task completed %s in domain: %s", task.pk, domain.name)
         send_task_notification(task)


@@ -170,7 +177,7 @@ def dispatch(
     with contextlib.ExitStack() as stack:
         with transaction.atomic():
             # Task creation need to be serialized so that pulp_created will provide a stable order
-            # at every time. We specifically need to ensure that each task, when commited to the
+            # at every time. We specifically need to ensure that each task, when committed to the
             # task table will be the newest with respect to `pulp_created`.
             with connection.cursor() as cursor:
                 # Wait for exclusive access and release automatically after transaction.
@@ -275,7 +282,7 @@ def cancel_task(task_id):
         )
         return task
     _logger.info(
-        _("Canceling task: {id} in domain: {name}").format(id=task_id, name=task.pulp_domain.name)
+        "Canceling task: {id} in domain: {name}".format(id=task_id, name=task.pulp_domain.name)
     )

     # This is the only valid transition without holding the task lock
diff --git a/pulpcore/tasking/worker.py b/pulpcore/tasking/worker.py
index 185b6802a..6a938f9c8 100644
--- a/pulpcore/tasking/worker.py
+++ b/pulpcore/tasking/worker.py
@@ -69,7 +69,7 @@ class PulpcoreWorker:
         self.versions = {app.label: app.version for app in pulp_plugin_configs()}
         self.cursor = connection.cursor()
         self.worker = self.handle_worker_heartbeat()
-        # This defaults to immediate task cancelation.
+        # This defaults to immediate task cancellation.
         # It will be set into the future on moderately graceful worker shutdown,
         # and set to None for fully graceful shutdown.
         self.task_grace_timeout = timezone.now()
@@ -401,7 +401,7 @@ class PulpcoreWorker:
                         seconds=TASK_KILL_INTERVAL
                     )
                     _logger.info(
-                        "Aborting current task %s in domain: %s due to cancelation.",
+                        "Aborting current task %s in domain: %s due to cancellation.",
                         task.pk,
                         domain.name,
                     )
--
2.49.0