[cert.ci.jenkins.io] Run ephemeral VM agents in the sponsored subscription #5004

@dduportal

Service(s)

cert.ci.jenkins.io

Summary

As per #5003, we have credits to spend in the Jenkins sponsored subscription.

Let's move the cert.ci.jenkins.io ephemeral agents to this new subscription, including the data transiting from/to the NAT gateway. This means the following costs will move to the new subscription (excerpt of the past 6 months):

[Image: cost excerpt of the past 6 months]

Prerequisites:

Task list:

  • Azure Net: a new vnet + subnet is required for ephemeral VM agents. We should use the same pattern as last year, except that the controller might need to move into this vnet as well: we should size the vnet larger than in 2025 right at creation, so we avoid vnet overlaps or a resize in the upcoming weeks
  • Azure: the following resources are expected (same as old setup with 2025 subscription):
    • Data sources for the new vnet, subnet and their resource groups (RG)
    • A new RG for the "non agent" resources. Usually we name it xxx_ci_jenkins_io_controller_jenkins_sponsored, where xxx is the specific name (cert here). It will also be used if we move the controller VM.
    • A new "azure-vm" module instantiation in this new subscription (to create the usual resources) - RG, storage, Network Security Group (NSG), etc.
    • Missing permissions such as vnet reader
      • Nit: I'm not sure why it's not in the module. Might be missing OR there might have been a reason 🤔
    • A new User-Assigned Managed Identity (UAID) for the Azure VM agents (it must be in the same subscription as its role assignments and their scopes), along with its assignments to allow management by the controller Service Principal (SP) and to allow writing to the buildreports file share for agents
      • Nit: might be useful to have this UAID integrated into the module in the future as we want this by default
    • Azure Container Registry (ACR) setup: a Private Endpoint (PE) in the agents subnet to reach the ACR's Private Link Service (PLS), the NSG rules associated with it
    • Output the required values for cert.ci.jenkins.io JCasC Puppet setup (see below)
      • Nit: can be done as a second non functional PR if need be
  • Puppet: set up cert.ci controller to use the new vnet, subnet, their RG and the agent UAID (if I recall correctly, should be all)
    • Tip: testing the values manually in the controller UI and triggering the "agent health" check helps to verify that the minimum is set up. If agents do not allocate within 2-3 min, check the controller logs and correct the errors you find
    • Once the manual tests are ok, puppet hieradata can be updated and deployed
  • Set up VPN access (required to access agents with an SSH bounce through the VPN):
    • Add the new vnet in the VPN routes (Docker image, e.g. VPN client side)
    • With the new image tagged, update it in puppet along with server side routes (automated PR recently fixed by Jay)
    • Allocate an agent from cert.ci (pipeline replay) and verify you can SSH to it from your machine. If you cannot access it, try through the OpenVPN VM and compare.
  • Verify ACR from a cert.ci.jenkins.io agent
    • From an allocated agent (with SSH), check access with curl -v https://<acr DNS name>. Fix missing requirements based on any errors you get (DNS record absent from the private network? TCP unable to establish a connection? etc. - see https://docs.azure.cn/en-us/container-registry/container-registry-troubleshoot-access).
    • Reminder: ACR must stay private using PE/PLS. No network peering, no public access, because it cannot be authenticated (limitation of Docker/Podman registry-mirror)
    • Once curl can reach the ACR, check whether Docker Engine is able to use it (requires root access to the agent, with journalctl -u docker -f)
  • Finally, clean up once cert.ci.jenkins.io uses the new subscription for agents:
    • Remove the old agent resources in Azure (including data sources unless used somewhere else)
    • Then remove the old subnet/resources from Azure Net in the CDF subscription (they do not cost much, but better to clean up for clarity)
    • Finally remove routes from OpenVPN
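
For the vnet sizing in the Azure Net task, a candidate address range can be checked against the ranges already in use before the vnet is created. The following is a minimal Python sketch using only the standard library; the network names and CIDR values are made up for illustration (the real ranges are not listed in this issue), and `find_overlaps` is a hypothetical helper name.

```python
import ipaddress


def find_overlaps(candidate, existing):
    """Return the sorted names of existing networks that overlap the candidate CIDR."""
    new_net = ipaddress.ip_network(candidate)
    return sorted(
        name
        for name, cidr in existing.items()
        if new_net.overlaps(ipaddress.ip_network(cidr))
    )


# Hypothetical values for illustration only, not the real jenkins-infra ranges.
EXISTING_VNETS = {
    "cert-ci-2025-agents": "10.10.0.0/24",
    "private-vpn": "10.20.0.0/22",
}

if __name__ == "__main__":
    for candidate in ("10.10.0.0/22", "10.30.0.0/22"):
        overlaps = find_overlaps(candidate, EXISTING_VNETS)
        print(candidate, "overlaps:", overlaps or "none")
```

Running the same check with the intended larger range (agents plus a possible controller move) before opening the Azure Net PR should help catch an overlap early rather than after the VPN routes are updated.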
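
For the ACR verification task, the two failure modes mentioned (DNS record absent from the private network, TCP connection not establishing) can be told apart with a small script run from an allocated agent. This is only a sketch, assuming Python 3 is available on the agent and that the Private Endpoint resolves to an IPv4 address; `diagnose` is a hypothetical helper, and it complements the curl -v check rather than replacing it (it does not test TLS or the registry API).

```python
import ipaddress
import socket
import sys


def diagnose(hostname, port=443, timeout=5.0):
    """Report the first failing step: DNS resolution, private address, or TCP connect."""
    try:
        # Assumption: the Private Endpoint is IPv4-only, so only A records are queried.
        infos = socket.getaddrinfo(
            hostname, port, family=socket.AF_INET, proto=socket.IPPROTO_TCP
        )
    except socket.gaierror as err:
        return f"dns-failure: {err}"

    address = infos[0][4][0]
    # A public IP suggests the private DNS record is not visible from this network.
    if not ipaddress.ip_address(address).is_private:
        return f"public-address: {address} (expected the Private Endpoint IP)"

    try:
        with socket.create_connection((address, port), timeout=timeout):
            return f"ok: {address}:{port} reachable"
    except OSError as err:
        # Covers timeouts and refused connections (for example a missing NSG rule).
        return f"tcp-failure: {address}:{port} ({err})"


if __name__ == "__main__":
    # Usage: pass the ACR DNS name as the first argument.
    if len(sys.argv) > 1:
        print(diagnose(sys.argv[1]))
```

A `dns-failure` or `public-address` result points at the private DNS setup, while a `tcp-failure` points at the Private Endpoint or NSG rules.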
Pinned by lemeurherve
