Open
Description
Service(s)
trusted.ci.jenkins.io
Summary
As per #5003, we have credits to spend in the Jenkins sponsored subscription.
Let's move the trusted.ci.jenkins.io ephemeral agents to this new subscription, including the data transiting from/to the NAT gateway. This means the following costs will move to the new subscription (excerpt of the past 6 months):
- The ephemeral agents themselves are ~$300 monthly
- The NAT gateway (in and out data, mostly out) shared with the permanent agents is ~$400. We won't save the whole $400 monthly as the permanent agent (remaining in the CDF subscription) is also a consumer of the NAT gateway, but we can still expect at least $100 monthly 👍
Prerequisites:
- [Azure Sponsored Subscription 2026] Set up permissions #5005
- Check the old "cleanup" PR or commits from the former 2025 sponsored subscription
- Check current code (for both repositories azure and azure-net) to ensure we have the same objects and naming conventions (as the old cleanup might be stuck on former techniques or naming we dropped)
List of expected Jobs running with these agents (e.g. potential impacts):
- All "Docker" publication jobs: controller, agents (inbound and SSH) images
- www.jenkins.io publication
- javadoc publication
- RPU
- crawler (except its publication)
- core-taglibs-report-generator
- Other, less important utility scripts (purge-fastly-for-security-advisory, update-center-sync-recent-releases, reindex_maven)
Task list:
- Azure Net: a new vnet + subnet is required for ephemeral VM agents. We should use the same pattern as last year, except that both the controller and the permanent agent might need to move into this vnet as well: we should increase its size compared to 2025 right at creation so we won't have to deal with vnet overlap or resizing in the upcoming weeks
- Proposal (for vnet sizing): from `10.204.0.0/24` to `10.204.0.0/22` (`10.204.0.0` -> `10.204.3.254`) to allow having three `/24` subnets
- Proposal (for subnet sizing): we keep a `/24` (big enough). We ought to have 3 subnets: 1 for ephemeral agents (dynamic and nondeterministic IPv4 allocation by azure-vm), 1 for the permanent agent (the current one, maybe 1 or 2 more for census and usage in the future) and 1 for the controller. Yes, oversized, but it helps keep the subnet division clear
- Azure: the following resources are expected (same as old setup with 2025 subscription):
- data sources to the new vnet, subnet and their RG
- A new RG for the "non agent" resources. Usually we name it `xxx_ci_jenkins_io_controller_jenkins_sponsorship` with `xxx` the specific name (`trusted` here). Will also be used if we move the controller VM of course.
- A new "azure-vm" module instantiation in this new subscription (to create the usual resources) - RG, storage, NSG, etc.
- Missing permissions such as vnet reader
- Nit: I'm not sure why it's not in the module. Might be missing OR there might have been a reason 🤔
- A new UAID for the Azure VM agents (required to be in the same subscription as the role assignments and their scopes) and its assignments: to allow management by the controller SP and to allow agents to write to the buildreports file share
- Nit: might be useful to have this UAID integrated into the module in the future as we want this by default
- NSG rules:
- To reach archives.jenkins.io in DigitalOcean
- Note (⚠️): we must NOT add any rules related to the `pkg` VM as it's gone now (wasn't gone in 2025)
- ⚠️ No need for the PE/PLS for update.ci.jenkins.io's mirrorbits/rsync setup... yet.
- ACR setup: a PE in the agents subnet to reach the ACR's PLS, and the NSG rules associated with it
- ⚠️ Check for the access to data storage in the rest of the repository (might need to add the subnet for these agents to some locals) to ensure www.jenkins.io / javadoc can continue publishing
- Output the required values for trusted.ci.jenkins.io JCasC Puppet setup (see below)
- Nit: can be done as a second non functional PR if need be
- Puppet: set up trusted.ci controller to use the new vnet, subnet, their RG and the agent UAID (if I recall correctly, should be all)
- Tip: testing the values manually in the controller UI and triggering the "agent health" helps to verify the minimum is set up. If agents do not allocate after 2-3 min, then check the controller logs and correct the errors you find
- Once the manual tests are ok, puppet hieradata can be updated and deployed
- Setup VPN access (required to access agents with SSH bounce through the VPN):
- Add the new vnet in the VPN routes (Docker image, e.g. VPN client side)
- With the new image tagged, update it in puppet along with server side routes (automated PR recently fixed by Jay)
- Allocate an agent from trusted.ci (pipeline replay) and verify you can SSH to it from your machine. If you cannot access it, then try through the OpenVPN VM and compare.
- Verify ACR from a trusted.ci.jenkins.io agent
- From an allocated agent (with SSH), check access with a `curl -v https://<acr DNS name>`. Fix missing requirements based on any errors (DNS record absent from private network? TCP not able to establish connection? etc. - see https://docs.azure.cn/en-us/container-registry/container-registry-troubleshoot-access).
- Reminder: ACR must stay private using PE/PLS. No network peering, no public access, because it cannot be authenticated (limitation of Docker/Podman `registry-mirror`)
- Once `curl` can reach the ACR, check if Docker Engine is able to use it (need `root` access to the agent, with `journalctl -u docker -f`)
- Finally, cleanup: once trusted.ci.jenkins.io uses the new subscription for agents
- Remove old agents resources in Azure (including `data` sources unless used somewhere else)
- Then remove old subnet/resources from Azure Net in the CDF (does not cost much, but better to cleanup for clarity)
- Finally remove routes from OpenVPN
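The vnet and subnet sizing proposal from the task list can be sanity-checked with Python's standard `ipaddress` module. This is only a sketch: the role labels are hypothetical placeholders (not actual resource names), and the "5 reserved addresses per subnet" figure is taken from Azure's documented behaviour as an assumption. Note that a `/22` splits into four `/24` blocks, so the three planned subnets leave one block spare, and the range formally ends at `10.204.3.255` (with `10.204.3.254` as the last host address).

```python
import ipaddress

# Proposed vnet range from the task list.
vnet = ipaddress.ip_network("10.204.0.0/22")

# Hypothetical labels for the three planned subnets (not real resource names).
roles = ["ephemeral-agents", "permanent-agents", "controller"]

# Assumption: Azure reserves 5 addresses in every subnet.
AZURE_RESERVED_PER_SUBNET = 5

# Split the /22 into /24 blocks and assign them in order.
blocks = list(vnet.subnets(new_prefix=24))
plan = dict(zip(roles, blocks))
spare = blocks[len(roles):]

print(f"vnet {vnet}: {vnet.network_address} -> {vnet.broadcast_address}")
for role, subnet in plan.items():
    usable = subnet.num_addresses - AZURE_RESERVED_PER_SUBNET
    print(f"{role}: {subnet} ({usable} usable addresses)")
print(f"spare: {[str(block) for block in spare]}")
```

The spare `/24` gives some room if another subnet is needed later without resizing the vnet.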
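For the ACR verification step, the two failure modes mentioned in the task list (DNS record absent from the private network, TCP connection not established) can be told apart with a small script run from an allocated agent. This is a rough sketch using only the Python standard library, not a replacement for `curl -v`; the function name and return labels are hypothetical, and it assumes that a working PE setup makes the ACR name resolve to a private IP.

```python
import ipaddress
import socket


def check_endpoint(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Classify reachability of host:port as seen from the current machine."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # No DNS record visible from this network.
        return "dns_failure"

    # Use the first resolved address only (enough for a quick check).
    address = infos[0][4][0]
    try:
        with socket.create_connection((address, port), timeout=timeout):
            pass
    except OSError:
        # The name resolves, but TCP cannot be established (NSG rule? PE?).
        return "tcp_failure"

    # Assumption: through a private endpoint, the address should be private.
    if ipaddress.ip_address(address).is_private:
        return "ok_private"
    return "ok_public"
```

Calling `check_endpoint("<acr DNS name>")` with the real registry name would then return one of the four labels; an `ok_public` result would suggest the name is not resolved through the private DNS zone.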
Reproduction steps
No response