
(PTFE-3070) Add multi part upload for large file#576

Open
matthiasL-scality wants to merge 8 commits into main from PTFE-3070-add-multi-part-upload

Conversation

@matthiasL-scality
Contributor

This pull request updates the GitHub Actions workflows to use the latest actions/checkout@v6 and introduces improved multipart upload support for large files in the artifact upload action. It also updates the test workflows to cover the multipart upload path, and updates the container image reference for artifacts. These changes improve compatibility, performance, and test coverage.

Workflow and Dependency Updates:

  • All workflows now use actions/checkout@v6 instead of older versions, ensuring compatibility and access to the latest features and security updates.

  • The artifacts container image in test-promote.yaml is updated from registry.scality.com/artifacts/artifacts:4.2.6 to ghcr.io/scality/artifacts:4.3.3, ensuring the use of the latest image from GitHub Container Registry.

Multipart Upload Support:

  • Added multipart upload support for files larger than 100 MB in src/artifacts.ts, with each part set to 64 MB and up to 4 parts uploaded concurrently. This improves reliability and performance for large file uploads.

  • A new test job, test-upload-multipart, is added to test-upload.yaml to verify multipart upload functionality and ensure file integrity for files above the 100 MB threshold.

  • The artifact upload workflow dependencies are updated to include the new multipart upload test, ensuring it runs as part of the CI process.

These changes collectively modernize the CI workflows and strengthen support for large artifact uploads.

matthiasL-scality force-pushed the PTFE-3070-add-multi-part-upload branch from b08f5a3 to 24b487f on April 8, 2026 08:30
Comment thread src/artifacts.ts Outdated
@claude

claude Bot commented May 7, 2026

  • No timeout on raw https.request calls in fileUploadPresigned (line 144) and fileUploadMultipart.uploadPart (line 286): a stalled S3 connection will hang the action indefinitely. Add a timeout option and a req.on('timeout', ...) handler to destroy hung requests.
    • Add timeout: 300000 to the request options object
    • Add req.on('timeout', () => req.destroy(new Error('S3 upload timed out'))) after creating the request
  • Multipart initiate (line 235) and complete (line 371) calls are not wrapped in retryWithBackoff, unlike the individual part uploads. A transient failure on initiate fails the upload before any part is sent; one on complete wastes every uploaded part.
    • Wrap both in retryWithBackoff — both operations are idempotent

Review by Claude Code
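The suggested timeout handling could be sketched roughly as follows. This uses Node's http module for brevity (the action's real code targets https and S3 presigned URLs), and requestWithTimeout is a hypothetical helper, not the PR's actual code:

```typescript
import * as http from 'http';

// Issue a request with a hard timeout. The 'timeout' option only emits
// a 'timeout' event; req.destroy() is what actually aborts the socket.
function requestWithTimeout(
  options: http.RequestOptions,
  body: string,
  timeoutMs: number
): Promise<number> {
  return new Promise((resolve, reject) => {
    const req = http.request({...options, timeout: timeoutMs}, res => {
      res.resume(); // drain the response so 'end' fires
      res.on('end', () => resolve(res.statusCode ?? 0));
    });
    req.on('timeout', () => req.destroy(new Error('upload timed out')));
    req.on('error', reject);
    req.end(body);
  });
}
```

The destroy call is the important part: without it, the timeout event fires and the socket keeps waiting anyway.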

Comment thread src/artifacts.ts
Comment thread src/artifacts.ts
Comment thread action.yaml
@claude

claude Bot commented May 7, 2026

  • retryWithBackoff retries all errors including non-transient 4xx (401, 403), causing ~100s of wasted delay on auth failures. Consider skipping retries for 4xx like the existing retryArtifacts does for 5xx.
    • Add a guard to throw immediately on AxiosError with status 400-499
  • fileStream / partStream not destroyed on retry failure in fileUploadPresigned (line 137) and uploadPart (line 292), leaking file descriptors (up to 10 per file).
    • Wrap the S3 PUT promise in try/finally with stream.destroy()
  • node20 to node24 in action.yaml is a breaking change for callers — should be called out prominently.
    • Consider node22 as a stepping stone, or document the requirement

Review by Claude Code
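The 4xx guard could look like this sketch. HttpError and retryTransient are hypothetical names standing in for the action's typed error and retryWithBackoff:

```typescript
// Hypothetical typed error carrying an HTTP status, so the retry
// wrapper can tell permanent failures from transient ones.
class HttpError extends Error {
  constructor(public status: number) {
    super(`HTTP ${status}`);
  }
}

// Retry with exponential backoff, but surface 4xx immediately:
// auth and bad-request failures will not heal on retry.
async function retryTransient<T>(
  fn: () => Promise<T>,
  attempts = 10,
  baseDelayMs = 100
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (err instanceof HttpError && err.status >= 400 && err.status < 500) {
        throw err; // permanent: don't waste ~100s of backoff
      }
      if (attempt + 1 >= attempts) throw err;
      await new Promise(r => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
}
```

5xx and plain network errors still get the full backoff schedule; only status codes in the 400-499 range short-circuit.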

Comment thread src/artifacts.ts Outdated
Comment thread action.yaml
Comment thread src/artifacts.ts Outdated
@claude

claude Bot commented May 7, 2026

  • File descriptor leak on upload retry: fileStream (line 181) and partStream (line 350) in src/artifacts.ts are not destroyed when https.request errors or times out. On retry, new streams are created but old ones leak FDs. Fix: call stream.destroy() in the error handler.
  • Orphaned S3 parts when complete fails: the abort logic (lines 374-386) only covers part-upload errors. If the complete POST (line 402) fails after exhausting retries, parts remain in S3 consuming storage. Fix: move the complete call inside the existing try block.
  • Breaking change (node24 runtime): action.yaml now uses node24, which requires runner >= 2.323.0. Callers on older self-hosted runners will fail. Document this in release notes.

Review by Claude Code

Comment thread action.yaml
@claude

claude Bot commented May 7, 2026

  • Breaking change: action.yaml jumps from node20 to node24, skipping node22. Self-hosted runners on versions older than v2.322.0 will fail. Document the minimum runner version in the release notes.

The multipart upload implementation is well-structured: proper abort on failure to avoid orphaned S3 parts, presigned URLs re-fetched on each retry attempt (handling expiry), resource cleanup via finally blocks, and a clear concurrency model (8 files × 4 parts = 32 peak connections matching maxSockets). The capability probing fallback to legacy fileUpload is a clean migration path.

Review by Claude Code

Comment thread action.yaml
Comment thread src/artifacts.ts
Comment thread src/artifacts.ts
@claude

claude Bot commented May 7, 2026

  • Breaking change not documented: action.yaml changes using: 'node20' to using: 'node24'. This affects every repo that consumes this action and is not mentioned in the PR description.
    • Call this out in the PR description and verify runner compatibility.
  • S3 4xx errors retried unnecessarily: retryWithBackoff only short-circuits retries for AxiosError with status < 500, but presigned and multipart uploads use raw https.request and throw plain Error. Permanent S3 failures (403, 400) are retried 10 times (~3.5 min wasted).
    • Introduce a typed error with a status field so the retry logic can distinguish permanent from transient failures across both Axios and raw HTTP calls.
  • Capability probe false positives on 5xx: probeServerCapabilities treats any non-404 response as capability available. A transient 500 would falsely enable multipart/presigned paths.
    • Add && resp.status < 500 to the probe check.

Review by Claude Code
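The probe guard amounts to treating only definitive responses as capability signals. A minimal sketch, where probeIndicatesCapability is a hypothetical helper (the PR's probeServerCapabilities does more, including the actual request):

```typescript
// A 404 means the server lacks the endpoint; a 5xx means we don't know.
// Only a non-error response should enable the multipart/presigned paths,
// so a transient 500 never falsely advertises the capability.
function probeIndicatesCapability(status: number): boolean {
  return status !== 404 && status < 500;
}
```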

matthiasL-scality and others added 7 commits May 11, 2026 09:30
Transient 5xx errors (e.g. 502 Bad Gateway) on a single multipart part
caused the entire upload to abort with no recovery. fileUploadMultipart
now retries each part independently (up to 10 times with exponential
backoff) before aborting the multipart session. The same retry wrapper
is applied to fileUploadPresigned, which had the same gap vs the legacy
fileUpload path that already used artifactsRetry(client, 10).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add 5 min timeout to raw https.request calls in fileUploadPresigned
  and uploadPart (multipart): a stalled S3 connection would otherwise
  hang the action until the 6h GitHub Actions workflow timeout.
  The 'timeout' option only emits an event; req.destroy() is needed to
  actually abort the socket.
- Wrap multipart initiate and complete in retryWithBackoff: a transient
  failure on initiate aborts before any upload starts; a failure on
  complete wastes all uploaded parts (the abort handler only covers the
  parts-upload phase, not the finish line).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The three arrow functions passed to retryWithBackoff returned Promises
without being declared async, triggering the
@typescript-eslint/promise-function-async rule.
Also apply prettier reformatting picked up by the format step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n complete failure

- retryWithBackoff: skip retry on AxiosError with status < 500.
  Auth failures (401/403) were being retried 10 times with ~100s of
  exponential backoff before surfacing. Only 5xx and network errors
  are transient and worth retrying.

- fileUploadPresigned / uploadPart: wrap the S3 https.request in
  try/finally with stream.destroy(). Without this, each failed attempt
  leaves an open ReadStream/FD until GC — up to 10 leaks per file.

- fileUploadMultipart: move the complete POST inside the existing try
  block so the catch handler aborts the multipart upload when complete
  fails after exhausting retries. Previously, parts would be left
  orphaned on S3 indefinitely if complete failed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
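The try/finally cleanup described in this commit can be sketched generically. withStream is a hypothetical helper; the real code wraps an fs.ReadStream around the https PUT:

```typescript
// Guarantee the stream (and its file descriptor) is released no matter
// how the upload attempt ends: success, error, or timeout.
async function withStream<S extends {destroy(): void}, T>(
  stream: S,
  use: (s: S) => Promise<T>
): Promise<T> {
  try {
    return await use(stream);
  } finally {
    stream.destroy();
  }
}
```

Because each retry attempt creates a fresh stream, scoping the destroy to the attempt (rather than the whole retry loop) is what caps the leak at zero instead of one FD per failed attempt.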
The 8-file limit was chosen to cap peak S3 connections at 8×4=32 when
multipart is active. For workloads with only small files (< 100 MB),
each file uses one connection regardless, so the cap halved throughput
for no benefit.

Inspect the file list before uploading: if no file exceeds
MULTIPART_THRESHOLD (or if multipart is not supported), use 16
concurrent file uploads instead of 8 to match the previous behaviour.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
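The selection logic described in this commit can be sketched as follows. pickFileConcurrency is a hypothetical name; the 100 MB threshold and the 8/16 values come from the commit message:

```typescript
const MULTIPART_THRESHOLD = 100 * 1024 * 1024; // 100 MB

// 8 files x 4 parts = 32 peak S3 connections when multipart is active;
// workloads with only small files keep the previous 16-way fan-out,
// since each small file uses a single connection regardless.
function pickFileConcurrency(
  fileSizes: number[],
  multipartSupported: boolean
): number {
  const anyLarge = fileSizes.some(size => size > MULTIPART_THRESHOLD);
  return multipartSupported && anyLarge ? 8 : 16;
}
```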
Comment thread action.yaml
value: ${{ steps.artifacts.outputs.redirect-link }}
runs:
-using: 'node20'
+using: 'node24'

Breaking change: node20 → node24 changes the runtime for every consumer of this action. Workflows running on self-hosted runners without Node.js 24 will fail. This is independent of the multipart upload feature and should be called out in the PR description and release notes so callers know to expect a runtime change.

— Claude Code

Comment thread src/artifacts.ts
}
}
await Promise.all(
Array.from({length: Math.min(MULTIPART_CONCURRENCY, partCount)}, worker)

When one worker rejects, Promise.all rejects immediately but the remaining workers continue running. Their eventual rejections become unhandled promise rejections, which in Node.js 24 defaults to crashing the process (--unhandled-rejections=throw). Consider using Promise.allSettled and rethrowing the first failure so all worker promises settle cleanly before the catch block aborts the upload.

— Claude Code

@claude

claude Bot commented May 11, 2026

  • action.yaml: node20 → node24 is a breaking runtime change for all consumers of this action, independent of the multipart feature. It should be documented in the PR description and release notes.
    • Call this out explicitly so callers on self-hosted runners without Node 24 aren't caught off guard.
  • src/artifacts.ts:393: Promise.all over the worker pool can leave orphaned worker promises whose rejections are unhandled under Node.js 24's default --unhandled-rejections=throw.
    • Use Promise.allSettled and rethrow the first failure, or drain the shared queue on error to prevent workers from picking up new parts.

Review by Claude Code

When GITHUB_RUN_ATTEMPT > 1, fileVersion backs up the previous file before
overwriting it. If versioning fails (transient S3 error, or the file was
never uploaded in a prior attempt), the upload was being blocked entirely,
preventing re-runs from succeeding.

Make versioning non-fatal: log a warning and proceed with the upload so that
re-runs always make progress regardless of whether the backup step succeeded.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread action.yaml
value: ${{ steps.artifacts.outputs.redirect-link }}
runs:
-using: 'node20'
+using: 'node24'

Breaking change: switching the action runtime from node20 to node24 affects every caller. Node.js 24 is still in Current (not LTS). Self-hosted runners without node24 installed will fail. This should be called out as a breaking change in the PR description.

— Claude Code

@claude

claude Bot commented May 11, 2026

  • Breaking change not documented: action.yaml switches the runtime from node20 to node24. Node.js 24 is Current, not LTS. Callers on self-hosted runners without node24 will break.
    • Document the node24 requirement in the PR description and consider whether this warrants a major version bump.
  • No unit tests for new multipart/presigned upload code: retryWithBackoff, fileUploadPresigned, fileUploadMultipart, and probeServerCapabilities are complex functions with retry logic and concurrency, but they are only covered by a workflow integration test. Unit tests would catch regressions faster and more reliably.
    • Add unit tests covering at least: retry behavior on transient vs permanent errors, multipart part counting, the abort-on-failure path, and capability probe logic.

Review by Claude Code
