Job hangs indefinitely sometimes #1305

Open
@Alex17Li

Description

Describe the bug

I don't know the cause, but when using this action I've started to see failures where the credential step crashes and then neither retries nor exits; it just hangs (5+ hours).
This had been working consistently for a long time (about a year), but now it sometimes fails like this, which ties up all of our runners.

Expected Behavior

It should never hang. If the network connection fails, it is fine for the step to crash.

Current Behavior

Run aws-actions/configure-aws-credentials@v4
  with:
    role-session-name: GithubActionsRoleSession
    role-to-assume: arn:aws:iam::425642425116:role/github
    aws-region: us-west-2
    role-duration-seconds: 21600
    output-credentials: true
    audience: sts.amazonaws.com
  env:
    HOME: /root
    ADK_GITHUB_TOKEN: ***
    REMOTE_ROOT: /mnt/ssd/bazeltest_github
    VPU_ADDR: bazelvpu
Error: getIDToken call failed: Error message: Failed to get ID Token. 
 
        Error Code : undefined
 
        Error Message: read ECONNRESET
context canceled
Error: The operation was canceled.

The error above only appeared after the job was cancelled.

Reproduction Steps

This probably won't be easy to reproduce. We are running in a company Docker container, though I don't see why that would be an issue. With everything irrelevant removed, the job looks like this:

jobs:
  run_embedded_vpu_bundle:
    runs-on: adk-vpu2-jp5
    container:
      image: artifactory.bluerivertech.com/dev-adk-docker/autonomy/adk/ubuntu_2204_build:2025-02-26
      volumes:
        - ghrunner_ci_cache_adk-vpu2-jp5:/ci_cache
      options: --shm-size 32G
    steps:
      - name: Assume AWS-Github role using OIDC (prod)
        id: aws-role
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-session-name: GithubActionsRoleSession
          role-to-assume: arn:aws:iam::425642425116:role/github
          aws-region: us-west-2
          role-duration-seconds: 21600
          output-credentials: true
          retry-max-attempts: 50

Possible Solution

The "read ECONNRESET" error message suggests that the network connection is breaking at an inopportune time during the step. Perhaps there is a point where the action waits for a packet and doesn't fail if it hasn't arrived within a few seconds.
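
Until the root cause is found, a step-level timeout might at least keep a hung step from occupying a runner for hours. This is only a sketch of a possible workaround, not a fix: `timeout-minutes` is a standard GitHub Actions step setting, but whether it actually interrupts this particular hang is an assumption I have not verified, and the 5-minute value is arbitrary.

```yaml
jobs:
  run_embedded_vpu_bundle:
    runs-on: adk-vpu2-jp5
    steps:
      - name: Assume AWS-Github role using OIDC (prod)
        id: aws-role
        uses: aws-actions/configure-aws-credentials@v4
        # Assumption: the runner kills the step once this limit is reached,
        # so a hang turns into a failed step instead of a stuck runner.
        # 5 minutes is an arbitrary illustrative value.
        timeout-minutes: 5
        with:
          role-session-name: GithubActionsRoleSession
          role-to-assume: arn:aws:iam::425642425116:role/github
          aws-region: us-west-2
          role-duration-seconds: 21600
          output-credentials: true
          retry-max-attempts: 50
```

With `retry-max-attempts: 50`, legitimate retries with backoff may need more than a few minutes, so the limit would have to be tuned against that.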

Additional Information/Context

No response


Labels: bug (Something isn't working), p2
