Commit 090b7bd

Add a test plan for the pulp-worker scalability test.

1 parent f1d6655

2 files changed: +190 −0

@@ -0,0 +1,80 @@
# Worker Scalability Test Plan

<!--toc:start-->
- [Worker Scalability Test Plan](#worker-scalability-test-plan)
  - [Current State](#current-state)
  - [Metrics to follow](#metrics-to-follow)
  - [Test Plan](#test-plan)
  - [Results](#results)
  - [An example to verify different worker names from time to time](#an-example-to-verify-different-worker-names-from-time-to-time)
<!--toc:end-->

The workers hold a permanent connection to the database and keep sending [heartbeats]()
as a signal that they are alive and ready to take on a task.
The idea of this test plan is to find the maximum number of idle workers we can run
before their heartbeats start to time out.

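Since the heartbeats live in the database, a quick database-level staleness check can complement the metrics below. This is a minimal sketch, assuming direct `psql` access to the instance, pulpcore's default `core_worker` table with a `last_heartbeat` column, and the default 30-second worker TTL; verify all three against your deployment.

```bash
#!/bin/bash
# Minimal sketch: count workers whose last heartbeat is older than the
# assumed 30-second TTL. Table/column names and the TTL are assumptions
# based on pulpcore defaults; adjust if your schema differs.
psql "$DATABASE_URL" -c "
  SELECT count(*) AS stale_workers
  FROM core_worker
  WHERE last_heartbeat < now() - interval '30 seconds';"
```
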
## Current State

We're going to use the Pulp stage instance to run these tests. Each pulp-worker pod
must request the minimal amount needed: 128MiB of memory and 250m of CPU.

For the database, we're using AWS RDS PostgreSQL 16 on a db.m7g.2xlarge instance class.
In theory, the maximum number of connections is defined by `LEAST({DBInstanceClassMemory/9531392}, 5000)`,
as documented [here](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Limits.html#RDS_Limits.MaxConnections).

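As a rough sanity check (assuming the formula sees the full 32 GiB of a db.m7g.2xlarge, which RDS may reduce for OS overhead), `LEAST(34359738368 / 9531392, 5000)` works out to roughly 3,604 connections, well below the 5,000 cap. That puts an approximate ceiling on how many always-connected workers the database can hold.
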
## Metrics to follow

At the database level:
- Active connections
- Connection wait time
- Slow queries
- CPU and memory utilization

Most of these metrics can be checked [here](https://us-east-1.console.aws.amazon.com/rds/home?region=us-east-1#database:id=pulp-prod;is-cluster=false).

For the application, we need to watch for timeouts in the logs.
You can start [here](https://grafana.app-sre.devshift.net/explore?schemaVersion=1&panes=%7B%22vse%22%3A%7B%22datasource%22%3A%22P1A97A9592CB7F392%22%2C%22queries%22%3A%5B%7B%22id%22%3A%22%22%2C%22region%22%3A%22us-east-1%22%2C%22namespace%22%3A%22%22%2C%22refId%22%3A%22A%22%2C%22queryMode%22%3A%22Logs%22%2C%22expression%22%3A%22fields+%40logStream%2C+%40message%2C++kubernetes.namespace_name+%7C+filter+%40logStream+like+%2Fpulp-stage_pulp-%28worker%7Capi%7Ccontent%29%2F%5Cn%5Cn%5Cn%5Cn%22%2C%22statsGroups%22%3A%5B%5D%2C%22datasource%22%3A%7B%22type%22%3A%22cloudwatch%22%2C%22uid%22%3A%22P1A97A9592CB7F392%22%7D%2C%22logGroups%22%3A%5B%7B%22arn%22%3A%22arn%3Aaws%3Alogs%3Aus-east-1%3A744086762512%3Alog-group%3Acrcs02ue1.pulp-stage%3A*%22%2C%22name%22%3A%22crcs02ue1.pulp-stage%22%2C%22accountId%22%3A%22744086762512%22%7D%5D%7D%5D%2C%22range%22%3A%7B%22from%22%3A%22now-30m%22%2C%22to%22%3A%22now%22%7D%7D%7D&orgId=1)

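If you prefer querying CloudWatch Logs Insights directly, the query below mirrors the one embedded in the Grafana link above; the extra `timeout` filter is an assumption about the exact log wording, so adjust the pattern to match what the workers actually emit.

```
fields @logStream, @message, kubernetes.namespace_name
| filter @logStream like /pulp-stage_pulp-(worker|api|content)/
| filter @message like /timeout/ or @message like /Timeout/
| sort @timestamp desc
```
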
## Test Plan

1. Warn the pulp stakeholders (@pulp-service-stakeholders) that a test is going to happen in stage. All tasks must be stopped.
2. Create an MR [here](https://gitlab.cee.redhat.com/service/app-interface/-/blob/master/data/services/pulp/deploy.yml#L75) reducing the resources for each worker and increasing the number of workers.
   It can be approved by anyone on the team; just wait for the app-sre bot to merge it.
3. After it's merged, watch the database metrics and application logs. The results from the run will be recorded below in this document.
4. Check whether any new workers are being started. For that, use the `WorkersAPI` and compare the set of worker names from time to time.
   An example script that does this is included [below](#an-example-to-verify-different-worker-names-from-time-to-time).
5. When the test is done, open a new MR reducing the number of workers to 0 (zero), then another MR returning
   the worker count to its usual state (2 pulp-workers in staging) and restoring its usual resource requests.
6. Send a message to the Pulp stakeholders (@pulp-service-stakeholders) saying that the test is concluded.

## Results
To be added in the future.

Date of the test: YYYY/mm/dd

A possible template for the table could be:

| Instance Count | Timeout (Yes/No) | Bottleneck Identified |
|----------------|------------------|-----------------------|
| 100            | {{value}}        | `None`                |
| 250            | {{value}}        | `...`                 |
| 500            | {{value}}        | `...`                 |
| 1000           | {{value}}        | `...`                 |

**Key:**
- Use `-` for unavailable metrics.

## An example to verify different worker names from time to time

```bash
#!/bin/bash

# Extract the current list of online worker names.
# -r makes jq emit raw strings so comm compares bare names.
pulp status | jq -r '.online_workers[].name' | sort > current_list.txt

# Compare with the previous list (if it exists)
if [ -f previous_list.txt ]; then
    echo "Differences since last check:"
    comm -3 previous_list.txt current_list.txt
fi

# Update the previous list
mv current_list.txt previous_list.txt
```
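
To keep this running for the duration of the test, save the script (the filename `check_workers.sh` and the 60-second interval here are arbitrary choices) and loop it:

```bash
# Re-run the comparison every 60 seconds until interrupted with Ctrl-C.
chmod +x check_workers.sh
while true; do
    ./check_workers.sh
    sleep 60
done
```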
@@ -0,0 +1,110 @@
From 15d6f725a1304d0850c1569e75ad5c9f41024a96 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Andr=C3=A9=20=22decko=22=20de=20Brito?= <[email protected]>
Date: Tue, 15 Apr 2025 11:16:12 -0300
Subject: [PATCH] Add the task name to the task log.

Close #6482
---
 CHANGES/6482.feature       |  1 +
 pulpcore/tasking/tasks.py  | 19 +++++++++++++------
 pulpcore/tasking/worker.py |  4 ++--
 3 files changed, 16 insertions(+), 8 deletions(-)
 create mode 100644 CHANGES/6482.feature

diff --git a/CHANGES/6482.feature b/CHANGES/6482.feature
new file mode 100644
index 000000000..a700a8761
--- /dev/null
+++ b/CHANGES/6482.feature
@@ -0,0 +1 @@
+Added the `task_name` and if it's a `task.immediate` to the `Starting task ...` log entry.
diff --git a/pulpcore/tasking/tasks.py b/pulpcore/tasking/tasks.py
index 5018f96ca..16616214b 100644
--- a/pulpcore/tasking/tasks.py
+++ b/pulpcore/tasking/tasks.py
@@ -60,7 +60,14 @@ def _execute_task(task):
     task.set_running()
     domain = get_domain()
     try:
-        _logger.info(_("Starting task %s in domain: %s"), task.pk, domain.name)
+        _logger.info(
+            "Starting task id: %s in domain: %s, task_type: %s, immediate: %s, deferred: %s",
+            task.pk,
+            domain.name,
+            task.name,
+            str(task.immediate),
+            str(task.deferred),
+        )

         # Execute task
         module_name, function_name = task.name.rsplit(".", 1)
@@ -70,7 +77,7 @@ def _execute_task(task):
         kwargs = task.enc_kwargs or {}
         result = func(*args, **kwargs)
         if asyncio.iscoroutine(result):
-            _logger.debug(_("Task is coroutine %s"), task.pk)
+            _logger.debug("Task is coroutine %s", task.pk)
             loop = asyncio.get_event_loop()
             loop.run_until_complete(result)

@@ -78,7 +85,7 @@ def _execute_task(task):
         exc_type, exc, tb = sys.exc_info()
         task.set_failed(exc, tb)
         _logger.info(
-            _("Task[{task_type}] {task_pk} failed ({exc_type}: {exc}) in domain: {domain}").format(
+            "Task[{task_type}] {task_pk} failed ({exc_type}: {exc}) in domain: {domain}".format(
                 task_type=task.name,
                 task_pk=task.pk,
                 exc_type=exc_type.__name__,
@@ -90,7 +97,7 @@ def _execute_task(task):
         send_task_notification(task)
     else:
         task.set_completed()
-        _logger.info(_("Task completed %s in domain: %s"), task.pk, domain.name)
+        _logger.info("Task completed %s in domain: %s", task.pk, domain.name)
         send_task_notification(task)


@@ -170,7 +177,7 @@ def dispatch(
     with contextlib.ExitStack() as stack:
         with transaction.atomic():
             # Task creation need to be serialized so that pulp_created will provide a stable order
-            # at every time. We specifically need to ensure that each task, when commited to the
+            # at every time. We specifically need to ensure that each task, when committed to the
             # task table will be the newest with respect to `pulp_created`.
             with connection.cursor() as cursor:
                 # Wait for exclusive access and release automatically after transaction.
@@ -275,7 +282,7 @@ def cancel_task(task_id):
         )
         return task
     _logger.info(
-        _("Canceling task: {id} in domain: {name}").format(id=task_id, name=task.pulp_domain.name)
+        "Canceling task: {id} in domain: {name}".format(id=task_id, name=task.pulp_domain.name)
     )

     # This is the only valid transition without holding the task lock
diff --git a/pulpcore/tasking/worker.py b/pulpcore/tasking/worker.py
index 185b6802a..6a938f9c8 100644
--- a/pulpcore/tasking/worker.py
+++ b/pulpcore/tasking/worker.py
@@ -69,7 +69,7 @@ class PulpcoreWorker:
         self.versions = {app.label: app.version for app in pulp_plugin_configs()}
         self.cursor = connection.cursor()
         self.worker = self.handle_worker_heartbeat()
-        # This defaults to immediate task cancelation.
+        # This defaults to immediate task cancellation.
         # It will be set into the future on moderately graceful worker shutdown,
         # and set to None for fully graceful shutdown.
         self.task_grace_timeout = timezone.now()
@@ -401,7 +401,7 @@ class PulpcoreWorker:
                         seconds=TASK_KILL_INTERVAL
                     )
                     _logger.info(
-                        "Aborting current task %s in domain: %s due to cancelation.",
+                        "Aborting current task %s in domain: %s due to cancellation.",
                         task.pk,
                         domain.name,
                     )
--
2.49.0