Skip to content

improvement(images): drop supervisor from scylla SCT image #10627

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 24, 2025

Conversation

dimakr
Copy link
Contributor

@dimakr dimakr commented Apr 10, 2025

The change drops rebuilding scylladb image as scylla-sct image. Instead, scylladb docker image is used as is for the docker backend.
For that purpose the new DockerCmdRunner abstraction was introduced, to allow operating docker based nodes directly using docker API (instead of sshing
into container).

Additionally, artifacts_test was updated to skip subtests/operations for the services, which are no longer supported by scylladb docker image (or are about
to be dropped) - node-exporter and scylla-hosekeeping .

Testing

❯ SCT_CONFIG_FILES='["test-cases/PR-provision-test-docker.yaml"]' SCT_ENABLE_ARGUS=false SCT_SCYLLA_VERSION=supervisor-less-test1 SCT_N_MONITORS_NODES=0 SCT_DOCKER_IMAGE=syuu1228/scylla ./sct.py run-test longevity_test.LongevityTest.test_custom_time --backend docker
logged in as arn:aws:sts::797456418907:assumed-role/DeveloperAccessRole/[email protected]
New directory created: /home/dmitriy/sct-results/20250612-212338-491746
Symlink `/home/dmitriy/sct-results/latest' updated to `/home/dmitriy/sct-results/20250612-212338-491746'
< t:2025-06-12 21:23:38,662 f:base.py         l:376  c:sdcm.sct_events.base p:INFO  > The run can be interrupted by following critical events:
< t:2025-06-12 21:23:38,662 f:base.py         l:376  c:sdcm.sct_events.base p:INFO  >   * ClusterHealthValidatorEvent.NodeStatus
...
< t:2025-06-12 21:27:43,350 f:tester.py       l:3132 c:LongevityTest        p:INFO  > ================================= TEST RESULTS =================================
< t:2025-06-12 21:27:43,350 f:tester.py       l:3132 c:LongevityTest        p:INFO  > ================================================================================
< t:2025-06-12 21:27:43,350 f:tester.py       l:3132 c:LongevityTest        p:INFO  > SUCCESS :)
...
.
----------------------------------------------------------------------
Ran 1 test in 245.231s

OK

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add New configuration option and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

@dimakr
Copy link
Contributor Author

dimakr commented Apr 10, 2025

@fruch This is a draft. Can you take a look and suggest if the direction is OK and we can continue with it? (a run of artifacts-docker-test with this change - https://argus.scylladb.com/tests/scylla-cluster-tests/45752905-3c3e-4ab4-bda9-621ec5f220b6)

For now this change manages the processes directly, but nothing is dropped from the scylla-sct image (supervisor is just ignored).
Couple points to think about/discuss are:

  • as there is no scylladb image without supervisor, I made a local copy of docker-entrypoint.py. In this local copy the supervisor daemon is not started, but instead we start standalone processes that we need (e.g. sshd, node-exporter, etc.). Then in SCT when the scylla-sct image is built, e.g. during artifacts-dockerf-test, we copy the local version of docker-entrypoint.py into image.
    Once the supervisor is not used/started in the scylladb docker image itself, this workaround can be dropped.
  • I added abstractions/wrappers for docker container operations and service management operations. Although the docker-py lib itself provides clear APIs, I wrapped few APIs we are interested in with additional logic.
    Should we keep it, or maybe it is complicating the thing unnecessarily?

@dimakr dimakr force-pushed the drop_supervisor_from_image branch 3 times, most recently from ab44553 to b9f99c1 Compare May 12, 2025 15:45
@dimakr dimakr force-pushed the drop_supervisor_from_image branch from b9f99c1 to da218d0 Compare May 27, 2025 23:12
@dimakr dimakr requested a review from Copilot May 28, 2025 09:54
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Drops the use of Supervisor in the Scylla SCT Docker image and adds direct Docker-based service management.

  • Introduces DockerCmdRunner for executing commands and transferring files via Docker API.
  • Adds DockerServiceManager to start/stop/restart services without Supervisor.
  • Updates cluster_docker.py and tests to remove supervisorctl calls and adapt to the new runner and manager.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
unit_tests/test_docker_cmd_runner.py Adds unit tests for the new DockerCmdRunner functionality.
sdcm/utils/docker_service_manager.py Implements DockerServiceManager for container-based services.
sdcm/remote/docker_cmd_runner.py Defines DockerCmdRunner with exec, tar create/extract, send/receive.
sdcm/cluster_docker.py Replaces supervisor calls with direct container operations and integrates DockerCmdRunner.
artifacts_test.py Skips supervisor-specific checks when using Docker backend.
Comments suppressed due to low confidence (1)

sdcm/cluster_docker.py:273

  • The build_container_image step was removed, so run_container may fail if the custom image isn't pre-built; consider restoring image build or ensuring the image exists before running.
ContainerManager.run_container(

@dimakr dimakr force-pushed the drop_supervisor_from_image branch 6 times, most recently from a13d309 to f6c3682 Compare June 2, 2025 09:52
@dimakr dimakr added the backport/none Backport is not required label Jun 2, 2025
@dimakr dimakr marked this pull request as ready for review June 2, 2025 10:48
@dimakr dimakr force-pushed the drop_supervisor_from_image branch from f6c3682 to 08b2eb7 Compare June 12, 2025 19:23
@dimakr dimakr requested review from fruch and a team June 12, 2025 19:39
@soyacz
Copy link
Contributor

soyacz commented Jun 13, 2025

I've tried hydra investigate show-monitor 23df17b2-7171-4b7c-af7f-919e03bf63ad (provision test) and it failed:

Restoring monitoring stack from archive 23df17b2-7171-4b7c-af7f-919e03bf63ad/20250602_101740/monitor-set-23df17b2.tar.zst
Download file https://cloudius-jenkins-test.s3.amazonaws.com/23df17b2-7171-4b7c-af7f-919e03bf63ad/20250602_101740/monitor-set-23df17b2.tar.zst to directory /tmp/tmpnt8h6dv9
Downloading 23df17b2-7171-4b7c-af7f-919e03bf63ad/20250602_101740/monitor-set-23df17b2.tar.zst from cloudius-jenkins-test
Downloaded finished
Creating monitoring stack directories for monitor-node-23df17b2-7171-4b7c-af7f-919e03bf63ad-0 cluster
Docker containers for monitoring stack are started
Nemesis dashboard file for the cluster monitor-node-23df17b2-7171-4b7c-af7f-919e03bf63ad-0 is '/tmp/tmpnt8h6dv9/scylla-monitoring-src/sct_monitoring_addons/sct-monitor-J1N4A-scylla-dash-per-server-nemesis.master.json'
Dashboard /tmp/tmpnt8h6dv9/scylla-monitoring-src/sct_monitoring_addons/sct-monitor-J1N4A-scylla-dash-per-server-nemesis.master.json loaded successfully
Error during annotations data upload Expecting value: line 1 column 1 (char 0)
Uploading annotations data [try #1]
Error during annotations data upload Expecting value: line 1 column 1 (char 0)
Uploading annotations data [try #2]
Error during annotations data upload Expecting value: line 1 column 1 (char 0)
'restore_annotations_data': Number of retries exceeded!
Error during uploading sct monitoring data Expecting value: line 1 column 1 (char 0)
Errors were found when restoring Scylla monitoring stack
Killing Grafana service
Killing Prometheus service
Killing Alert service

did you try with some other test?
I wonder how it looks when there's no node_exporter anymore.

Copy link
Contributor

@soyacz soyacz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about logging (e.g. syslog-ng)? also we no longer ship iproute or rsync pkg which might be needed by some nemesis.
Did you try running with podman?

@dimakr
Copy link
Contributor Author

dimakr commented Jun 13, 2025

I've tried hydra investigate show-monitor 23df17b2-7171-4b7c-af7f-919e03bf63ad (provision test) and it failed:

did you try with some other test? I wonder how it looks when there's no node_exporter anymore.

yes, @fruch raised this concern that without node-exporter in the image, we won't have metrics, etc.
In the original core task of dropping supervisor, in scylladb/scylladb#22883 (comment), it was mentioned that a dedicated container should be used for node-exporter in the context of operator. Probably we can follow their approach, when they implement this (or reuse the image if possible)

how about logging (e.g. syslog-ng)? also we no longer ship iproute or rsync pkg which might be needed by some nemesis. Did you try running with podman?

good question
This is what a basic test (pr-provision-docker) has for db node logs when executed locally:

❯ ll latest/PR-provision-docker-dmitriy-db-cluster-2b51a250
total 12K
drwxrwxr-x 3 dmitriy 4,0K cze 13 14:27 .
drwxrwxr-x 7 dmitriy 4,0K cze 13 14:33 ..
drwxrwxr-x 2 dmitriy 4,0K cze 13 14:27 PR-provision-docker-dmitriy-db-node-2b51a250-0
❯ ll latest/PR-provision-docker-dmitriy-db-cluster-2b51a250/PR-provision-docker-dmitriy-db-node-2b51a250-0

And this is for comparison what we have for pr-provision test:

❯ ll ./20250403-104249-259616/PR-provision-test-dmitriy-db-cluster-9820234b/PR-provision-test-dmitriy-db-node-9820234b-1
total 9,4M
drwxrwxr-x 2 dmitriy 4,0K kwi  3 10:56 .
drwxrwxr-x 7 dmitriy 4,0K kwi  3 11:04 ..
-rw-rw-r-- 1 dmitriy  900 kwi  3 10:56 autossh.log
-rw-rw-r-- 1 dmitriy 9,4M kwi  3 10:47 kallsyms_20250403_084342
lrwxrwxrwx 1 dmitriy   69 kwi  3 10:50 system.log -> ../../hosts/PR-provision-test-dmitriy-db-node-9820234b-1/messages.log

So without the dedicated scylla-sct image and its tooling we lack some logs.
We need to think what and how can be used instead.

@fruch
Copy link
Contributor

fruch commented Jun 15, 2025

For logging in docker backend we never used syslong, so nothing new here

As for metrics, that an issue for scylla core, of how gonna be building the docker image, we'll need a confirmation from that exporter is gonna be removed, and we'll see how to approach it.

Even it mean that in the meanwhile docker backend won't gonna have (some) monitoring data

@fruch
Copy link
Contributor

fruch commented Jun 16, 2025

I would recommend to rebase, and give CI one more run

@dimakr dimakr force-pushed the drop_supervisor_from_image branch 3 times, most recently from 596d6ed to 4474a84 Compare June 19, 2025 19:05
The change drops rebuilding scylladb image as `scylla-sct` image.
Instead, scylladb docker image is used as is for the docker backend.
For that purpose the new DockerCmdRunner abstraction was introduced, to allow
operating docker based nodes directly using docker API (instead of sshing
into container)
Additionally, artifacts_test was updated to skip subtests/operations for
the services, which are no longer supported by scylladb docker image (or are about
to be dropped) - node-exporter and scylla-hosekeeping .
@dimakr dimakr force-pushed the drop_supervisor_from_image branch from 4474a84 to 1855f49 Compare June 19, 2025 19:50
@dimakr
Copy link
Contributor Author

dimakr commented Jun 19, 2025

@fruch
After recent fixes and few adjustments, have this change passing against scylla-nightly:latest image on both, pr-rpovision-docker and artifacts-docker tests, as we discussed:
🟢 pr-provision-test
🟢 artifactsdocker
('adjustment' - had to return to installing sudo on db node, as during scylla config setup flow in

with self.remote_scylla_yaml() as scylla_yml:
, it operates on remote config under sudo)

@fruch
Copy link
Contributor

fruch commented Jun 19, 2025

@fruch After recent fixes and few adjustments, have this change passing against scylla-nightly:latest image on both, pr-rpovision-docker and artifacts-docker tests, as we discussed: 🟢 pr-provision-test 🟢 artifactsdocker ('adjustment' - had to return to installing sudo on db node, as during scylla config setup flow in

with self.remote_scylla_yaml() as scylla_yml:

, it operates on remote config under sudo)

raise issue for that, we should be able to change scylla_yaml to be able to work without sudo for docker backend

@dimakr
Copy link
Contributor Author

dimakr commented Jun 20, 2025

raise issue for that, we should be able to change scylla_yaml to be able to work without sudo for docker backend

Created #11230 task for this.

Copy link
Contributor

@fruch fruch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove docker cli installation from any docker instance

@fruch
Copy link
Contributor

fruch commented Jun 23, 2025

@dimakr
please raise an issue for the docker backend loader

@dimakr
Copy link
Contributor Author

dimakr commented Jun 23, 2025

@dimakr please raise an issue for the docker backend loader

Created a follow-up task #11251.
@fruch

@dimakr dimakr requested a review from fruch June 24, 2025 12:41
Copy link
Contributor

@fruch fruch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@fruch fruch merged commit 4541ac1 into scylladb:master Jun 24, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants