# Worker Scalability Test Plan

<!--toc:start-->
- [Worker Scalability Test Plan](#worker-scalability-test-plan)
  - [Current State](#current-state)
  - [Metrics to follow](#metrics-to-follow)
  - [Test Plan](#test-plan)
  - [Results](#results)
  - [An example to verify different worker names from time to time](#an-example-to-verify-different-worker-names-from-time-to-time)
<!--toc:end-->
| 11 | + |
| 12 | +The workers have a permanent connection to the database and keep sending [heartbeats]() |
| 13 | +as signal that they are alive and ready to accomplish a task. |
| 14 | +The idea of this test plan is to verify the maximum number of idle workers before its |
| 15 | +heartbeat starts to timeout. |
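As a quick sanity check, each online worker's heartbeat can be inspected through the status endpoint. A minimal sketch, assuming the `pulp status` payload exposes a `last_heartbeat` field per entry in `online_workers`:

```bash
#!/bin/bash

# Print each online worker's name and last heartbeat timestamp.
# Assumes `pulp status` returns JSON with an `online_workers` array whose
# entries carry `name` and `last_heartbeat` fields.
list_heartbeats() {
  pulp status | jq -r '.online_workers[] | "\(.name)\t\(.last_heartbeat)"'
}
```

Run `list_heartbeats` before the test to confirm the baseline workers are reporting recent timestamps.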

## Current State
We will use the Pulp stage instance to run these tests. Each pulp-worker pod
must request the minimum amount needed, 128MiB of memory, and have access to 250m of CPU.

For the database, we are using AWS RDS PostgreSQL 16 on a db.m7g.2xlarge instance class.
In theory, the maximum number of connections is defined by `LEAST({DBInstanceClassMemory/9531392}, 5000)`,
as documented [here](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Limits.html#RDS_Limits.MaxConnections).
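To get a ballpark figure for that limit, the formula can be evaluated directly in shell. This sketch assumes a db.m7g.2xlarge provides 32 GiB of memory; verify the actual figure for the instance class before relying on it:

```bash
#!/bin/bash

# Estimate RDS max_connections: LEAST(DBInstanceClassMemory / 9531392, 5000).
# The 32 GiB figure below is an assumption for db.m7g.2xlarge.
memory_bytes=$((32 * 1024 * 1024 * 1024))
max_conn=$((memory_bytes / 9531392))
if [ "$max_conn" -gt 5000 ]; then
  max_conn=5000
fi
echo "Estimated max_connections: $max_conn"  # prints 3604 for 32 GiB
```

Note that `DBInstanceClassMemory` is somewhat less than the nominal instance memory (the OS takes a share), so the real ceiling will land a bit below this estimate.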

## Metrics to follow
At the database level:
- Active connections
- Connection wait time
- Slow queries
- CPU and memory utilization

Most of these metrics can be checked [here](https://us-east-1.console.aws.amazon.com/rds/home?region=us-east-1#database:id=pulp-prod;is-cluster=false).

For the application, we need to watch for heartbeat timeouts in the logs.
You can start [here](https://grafana.app-sre.devshift.net/explore?schemaVersion=1&panes=%7B%22vse%22%3A%7B%22datasource%22%3A%22P1A97A9592CB7F392%22%2C%22queries%22%3A%5B%7B%22id%22%3A%22%22%2C%22region%22%3A%22us-east-1%22%2C%22namespace%22%3A%22%22%2C%22refId%22%3A%22A%22%2C%22queryMode%22%3A%22Logs%22%2C%22expression%22%3A%22fields+%40logStream%2C+%40message%2C++kubernetes.namespace_name+%7C+filter+%40logStream+like+%2Fpulp-stage_pulp-%28worker%7Capi%7Ccontent%29%2F%5Cn%5Cn%5Cn%5Cn%22%2C%22statsGroups%22%3A%5B%5D%2C%22datasource%22%3A%7B%22type%22%3A%22cloudwatch%22%2C%22uid%22%3A%22P1A97A9592CB7F392%22%7D%2C%22logGroups%22%3A%5B%7B%22arn%22%3A%22arn%3Aaws%3Alogs%3Aus-east-1%3A744086762512%3Alog-group%3Acrcs02ue1.pulp-stage%3A*%22%2C%22name%22%3A%22crcs02ue1.pulp-stage%22%2C%22accountId%22%3A%22744086762512%22%7D%5D%7D%5D%2C%22range%22%3A%7B%22from%22%3A%22now-30m%22%2C%22to%22%3A%22now%22%7D%7D%7D&orgId=1)
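For reference, the Grafana link above encodes the following CloudWatch Logs Insights query, which selects log streams from the stage worker, API, and content pods:

```
fields @logStream, @message, kubernetes.namespace_name
| filter @logStream like /pulp-stage_pulp-(worker|api|content)/
```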

## Test Plan

1. Warn the pulp stakeholders (@pulp-service-stakeholders) that a test is going to happen in stage. All tasks must be stopped.
2. Create an MR [here](https://gitlab.cee.redhat.com/service/app-interface/-/blob/master/data/services/pulp/deploy.yml#L75) reducing the resources for each worker and increasing the number of workers.
It can be approved by anyone on the team; just wait for the app-sre bot to merge it.
3. After it is merged, check the database metrics and application logs. The results of the run will be recorded later in this document.
4. We need to check whether any new workers are being started. For that, use the `WorkersAPI` and compare the set of worker names from time to time.
An example script to check the workers periodically is provided [here](#an-example-to-verify-different-worker-names-from-time-to-time).
5. When the test is done, open a new MR reducing the number of workers to 0 (zero), and then another MR returning
the count to its usual state (2 pulp-workers in staging) and the resources to their usual values.
6. Send a new message to the Pulp stakeholders (@pulp-service-stakeholders) saying that the test is concluded.
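For step 4, beyond comparing names, it can be handy to track the raw count of online workers while the test runs. A minimal sketch built on the same `pulp status` output used in the example script below (the polling interval is arbitrary):

```bash
#!/bin/bash

# Count how many workers are currently online according to `pulp status`.
count_online_workers() {
  pulp status | jq '.online_workers | length'
}

# Example polling loop: log the count once a minute.
# while true; do
#   echo "$(date -u +%FT%TZ) online workers: $(count_online_workers)"
#   sleep 60
# done
```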

## Results
To be added in the future.

Date of the test: YYYY/mm/dd

A possible template for the table could be:

| Instance Count | Timeout (Yes/No) | Bottleneck Identified |
|----------------|------------------|-----------------------|
| 100            | {{value}}        | `None`                |
| 250            | {{value}}        | `...`                 |
| 500            | {{value}}        | `...`                 |
| 1000           | {{value}}        | `...`                 |

**Key:**
- Use `-` for unavailable metrics.

## An example to verify different worker names from time to time
```bash
#!/bin/bash

# Extract the current list of online worker names (raw strings, sorted for comm)
pulp status | jq -r '.online_workers[].name' | sort > current_list.txt

# Compare with the previous list (if it exists)
if [ -f previous_list.txt ]; then
  echo "Differences since last check:"
  comm -3 previous_list.txt current_list.txt
fi

# Update the previous list
mv current_list.txt previous_list.txt
```