Contributing guidelines
- I've read the contributing guidelines and wholeheartedly agree
I've found a bug, and:
- The documentation does not mention anything about my problem
- There are no open or closed issues that are related to my problem
Description
Issue: Self-Hosted Runners on GHA Workflows with Kubernetes Driver
Background
We have configured our GitHub Actions (GHA) workflows to use self-hosted runners. Our typical workflow involves:
- Installing buildx
- Building, pushing, and caching with buildx
Problem
We are encountering an issue when using the Kubernetes (k8s) driver for our builds. Our self-hosted runners are deployed on our k8s cluster. Builds fail intermittently with the following error:
Kubernetes Container Logs:
time="2023-11-30T22:28:11Z" level=error msg="/moby.buildkit.v1.Control/Solve returned error: rpc error: code = Canceled desc = context canceled"
Hypothesis
We suspect that the issue might be related to our runners being behind a VPN. It seems buildx may not be adequately handling network latency associated with a VPN connection.
Observations
- The issue is isolated to our runners or the k8s driver/buildx combination. This is evident because switching to GitHub's hosted runners resolves the issue, indicating no problems with our workflow or Dockerfile.
- The failure is intermittent: approximately 1 in 5 runs hits this error, and the rest complete successfully.
References
For additional context, see this related issue.
Seeking insights or suggestions to resolve this intermittent failure with our self-hosted runners in GHA workflows.
Expected Behavior
When using self-hosted runners in GitHub Actions workflows with the Kubernetes (k8s) driver for buildx, we expect the following:
- Stable Connection to Build Services: The runners should maintain a stable connection to Docker's build services, regardless of being behind a VPN. Network latency typically associated with VPN connections should not disrupt the build process.
- Consistent Build Process: Each action initiated by the workflow should complete successfully without intermittent failures. The build, push, and cache processes via buildx should be executed reliably.
- Error-Free Operation: The buildx command, especially when interacting with Kubernetes, should execute without returning errors like /moby.buildkit.v1.Control/Solve returned error: rpc error: code = Canceled desc = context canceled.
- Consistency with GitHub Hosted Runners: The performance and reliability of builds using self-hosted runners should be comparable to those observed with GitHub's hosted runners.
The expectation is that the self-hosted runners on our Kubernetes cluster should work as efficiently and reliably as GitHub's hosted runners, ensuring a smooth CI/CD pipeline.
Actual Behavior
When using self-hosted runners in GitHub Actions workflows with the Kubernetes (k8s) driver for buildx, we are encountering the following issues:
- Unstable Connection to Build Services: The runners, especially when operating behind a VPN, are experiencing unstable connections to Docker's build services. This is evident from frequent connection cancellations and errors during the build process.
- Inconsistent Build Process: The actions initiated by the workflow are not completing consistently. Approximately 20% of the actions (1 in 5) fail intermittently, showcasing a lack of reliability in the build, push, and cache processes via buildx.
- Frequent Errors: We are frequently encountering errors such as /moby.buildkit.v1.Control/Solve returned error: rpc error: code = Canceled desc = context canceled. These errors suggest issues with the interaction between buildx and Kubernetes.
- Disparity with GitHub Hosted Runners: Unlike the smooth operation observed with GitHub's hosted runners, our self-hosted runners exhibit inconsistent and error-prone behavior, leading to a disrupted CI/CD pipeline.
In summary, our self-hosted runners on the Kubernetes cluster are not performing as efficiently or reliably as expected, particularly in comparison to GitHub's hosted runners.
Repository URL
No response
Workflow run URL
No response
YAML workflow
name: Build and Push Docker Image
on:
  workflow_call:
jobs:
  build-and-push-image:
    runs-on: [gha-runner-scale-set]
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v1
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Set ECR repository name
        id: set_repo_name
        run: |
          REPO_NAME="${{ github.event.repository.name }}"
          ECR_REPO_NAME="${REPO_NAME//./-}"
          echo "ECR_REPO_NAME=$ECR_REPO_NAME" >> $GITHUB_ENV
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          push: true
          tags: <ecr>/${{ env.ECR_REPO_NAME }}:${{ github.sha }}
          context: .
          build-args: |
            GITHUB_UN=${{ secrets.GITHUBUSERMAME }}
            GITHUB_PW=${{ secrets.GITHUBPASSWORD }}
          cache-from: type=registry,ref=<ecr>/${{ env.ECR_REPO_NAME }}/cache:dockercache
          cache-to: type=registry,ref=<ecr>/${{ env.ECR_REPO_NAME }}/cache:dockercache,mode=max,image-manifest=true
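
The "Set up Docker Buildx" step in the workflow above passes no inputs, so the Kubernetes driver selection is not visible in it. For anyone trying to reproduce this, a minimal sketch of selecting the driver explicitly through the action's driver and driver-opts inputs could look like the step below. The namespace name and replica count are illustrative assumptions, not values taken from our cluster.

```yaml
# Sketch only: the namespace and replica count are assumed for illustration.
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3
  with:
    # Run BuildKit as pods in the cluster instead of a local container.
    driver: kubernetes
    # One driver option per line, in key=value form.
    driver-opts: |
      namespace=buildkit
      replicas=1
```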
Workflow logs
No response
BuildKit logs
No response
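
We have not captured BuildKit daemon logs yet. One way to collect them, as far as we understand, is to start the daemon in debug mode through the buildkitd-flags input of docker/setup-buildx-action. This is a sketch of what we would try, not something we have run:

```yaml
# Sketch only: enables debug-level logging in the BuildKit daemon.
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3
  with:
    buildkitd-flags: --debug
```

With the Kubernetes driver, we would expect the extra output to appear in the BuildKit pod's container logs, alongside the Solve error quoted earlier.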
Additional info
It is also worth noting that the job is only ever canceled during the build and push step. We use GitHub Actions for other tasks, and those jobs never cancel unexpectedly.