Commit 8eab954

[docs] Add zuul retention docs
1 parent a72138d commit 8eab954

File tree

4 files changed (+32 -2 lines changed)

docs/dictionary/en-custom.txt

Lines changed: 3 additions & 0 deletions
```diff
@@ -19,6 +19,8 @@ arx
 arxcruz
 auth
 authfile
+autohold
+autoholds
 autoscale
 autostart
 awk
@@ -349,6 +351,7 @@ nncp
 nobuild
 nodeexporter
 nodenetworkconfigurationpolicy
+nodepool
 nodeps
 nodeset
 nodesets
```

docs/source/index.rst

Lines changed: 2 additions & 2 deletions
```diff
@@ -73,10 +73,10 @@ In case of emergency, or if we didn't come back to you in a reasonable time (exp
 
 .. toctree::
    :maxdepth: 1
-   :caption: Cookbooks
+   :caption: Zuul
    :glob:
 
-   cookbooks/*
+   zuul/*
 
 .. toctree::
    :maxdepth: 1
```
File renamed without changes.
Lines changed: 27 additions & 0 deletions
# Autoholds resources retention mechanism

## Why is it needed?

Zuul uses Nodepool to manage the life-cycle of the instances required to run a job, also known as the nodeset. Zuul only manages the default network used for initial access; all the other networks required for our complex testing are managed via the CI-Framework, external to Zuul.

The ci-framework uses a special set of [playbooks](https://review.rdoproject.org/r/plugins/gitiles/config.git/+/refs/heads/master/playbooks/crc/) to create network resources around the Nodepool-deployed instances. Those resources are cleaned up on each run, no matter how the run finishes, which leaves an environment that may be useless from a debugging perspective.

The autohold retention mechanism checks with Zuul to see if the run has an autohold request and, in that case, skips the cleanup process so the network resources remain.

The skipped network resources are then cleaned up periodically by a script managed by the infrastructure team.

## Where is the code?

The code that handles skipping the cleanup lives in the ci-framework [repo](https://github.com/openstack-k8s-operators/ci-framework/blob/main/ci/playbooks/multinode-autohold.yml).

## How does it work?

Basically, the [code](https://github.com/openstack-k8s-operators/ci-framework/blob/main/ci/playbooks/multinode-autohold.yml) checks against the Zuul API to see whether an autohold request exists for the run, based on the information stored in the `zuul` Ansible variable.

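As an illustrative sketch only (the playbook's actual matching logic may differ), the check can be thought of as filtering the autohold list returned by the API for a request that matches the run's tenant, project, and job, all of which are available in the `zuul` variable. The field names below are assumptions for illustration:

```python
def matching_autoholds(autoholds, zuul_vars):
    """Keep only autohold requests that apply to this run.

    `autoholds` is the parsed JSON list returned by the autohold API;
    `zuul_vars` mimics the `zuul` Ansible variable. Field names here
    are illustrative, not necessarily the playbook's exact keys.
    """
    return [
        req for req in autoholds
        if req.get("tenant") == zuul_vars["tenant"]
        and req.get("project") == zuul_vars["project"]["name"]
        and req.get("job") == zuul_vars["job"]
    ]

# Hypothetical data: only the first request matches this run.
held = matching_autoholds(
    [
        {"tenant": "rdo", "project": "ci-framework", "job": "multinode"},
        {"tenant": "rdo", "project": "other-project", "job": "multinode"},
    ],
    {"tenant": "rdo", "project": {"name": "ci-framework"}, "job": "multinode"},
)
```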
The code uses the `krb_request` role, which uses Kerberos underneath if needed and if a Kerberos ticket is present. If the Zuul API is not secured, the request will not use any kind of authentication.

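A rough sketch of that decision (the `krb_request` role's real implementation lives in the ci-framework; the function and return values below are hypothetical):

```python
def pick_auth(api_secured, krb_ticket_present):
    """Mirror the krb_request behaviour described above: Kerberos only
    when the API is secured and a ticket is available, otherwise no
    authentication at all."""
    if api_secured and krb_ticket_present:
        return "kerberos"
    return "none"
```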
Some Zuul instances are not configured to expose the API through the executor, so for those cases `zuul_autohold_endpoint` needs to be set to point to the autohold URL of the Zuul instance. If the variable is not set, the URL is auto-generated assuming the API is reachable through the executor.

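The endpoint selection can be sketched as follows; the generated URL shape is an assumption for illustration, not the exact path the playbook builds:

```python
def autohold_endpoint(zuul_vars, zuul_autohold_endpoint=None):
    """Pick the autohold API URL: an explicit `zuul_autohold_endpoint`
    wins; otherwise build one assuming the API is reachable through the
    executor. The URL shape below is a hypothetical example."""
    if zuul_autohold_endpoint:
        return zuul_autohold_endpoint
    executor = zuul_vars["executor"]["hostname"]
    tenant = zuul_vars["tenant"]
    return f"https://{executor}/api/tenant/{tenant}/autohold"
```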
This check is only done if the job failed: if the job passed, the autohold will not retain the instances, so we follow the same approach with the network resources and clean them up before finishing the job.
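Putting the two conditions together, the overall cleanup decision boils down to something like this minimal sketch:

```python
def skip_network_cleanup(job_failed, autohold_matched):
    """Keep the extra network resources only when the job failed AND an
    autohold request matched; in every other case cleanup runs as usual."""
    return job_failed and autohold_matched
```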

0 commit comments