OPCT-428: add must-gather-clean to artifacts-collector for sensitive data cleaning#88
Conversation
…data cleaning Embed must-gather-clean (openshift/must-gather-clean) in the artifacts-collector plugin image to clean sensitive data at source before packing archives to sonobuoy. This prevents leaktk-gcs-filter from removing OPCT CI archives from GCS due to sha256~ OAuth tokens and JWTs in e2e metadata and must-gather files. Changes: - Containerfile: add must-gather-clean v0.0.5 binary via build stage - collector.sh: replace sed-only clean_must_gather() with must-gather-clean, add clean_e2e_metadata() to unpack/clean/repack e2e metadata tar.gz files - mgc-config-mustgather.yaml: regex patterns + omit Secret/MachineConfig resources - mgc-config-e2e.yaml: regex-only patterns for sha256~ tokens and JWTs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughSummary by CodeRabbit
WalkthroughThis PR adds must-gather-clean support to the artifacts-collector image and updates collector.sh to sanitize must-gather, metrics, and e2e metadata outputs before final packaging. ChangesMust-gather-clean Sensitive Data Cleanup
Estimated code review effort: 3 (Moderate) | ~25 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
artifacts-collector/Containerfile (1)
10-14: 🔒 Security & Privacy | 🔵 Trivial | ⚡ Quick winVerify checksum of downloaded
must-gather-cleanbinary.The binary is fetched from GitHub releases and piped straight into
tarwith no integrity check, and without-f/pipefaila failed or tampered download could silently succeed. The upstream project publishes aSHA256_SUMfile alongside each release specifically for this purpose.🔒 Proposed fix to add checksum verification
FROM quay.io/fedora/fedora-minimal:41 AS mgc ARG MGC_VERSION=0.0.5 -RUN curl -sL https://github.com/openshift/must-gather-clean/releases/download/v${MGC_VERSION}/must-gather-clean-linux-amd64.tar.gz \ - | tar xz -C /usr/local/bin must-gather-clean +RUN set -euo pipefail && \ + curl -sSLf -o /tmp/mgc.tar.gz https://github.com/openshift/must-gather-clean/releases/download/v${MGC_VERSION}/must-gather-clean-linux-amd64.tar.gz && \ + curl -sSLf -o /tmp/mgc.sha256 https://github.com/openshift/must-gather-clean/releases/download/v${MGC_VERSION}/SHA256_SUM && \ + echo "$(cat /tmp/mgc.sha256 | grep must-gather-clean-linux-amd64.tar.gz) /tmp/mgc.tar.gz" | sha256sum --check --strict - && \ + tar xz -C /usr/local/bin -f /tmp/mgc.tar.gz must-gather-cleanPlease confirm the exact checksum asset filename on the release page before adopting this diff.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@artifacts-collector/Containerfile` around lines 10 - 14, The must-gather-clean download in the Containerfile is extracted without any integrity verification. Update the build step around the mgc stage to fetch the release checksum asset first, verify the downloaded tarball against the upstream SHA256_SUM file, and only then extract it into /usr/local/bin; also make the shell fail on download errors so a broken or tampered fetch cannot pass silently.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@artifacts-collector/collector.sh`:
- Around line 50-64: The clean_e2e_metadata workflow is leaving either a
corrupted archive or an unredacted archive behind. Update clean_e2e_metadata to
repack through a temporary output file instead of overwriting "${archive}" in
place, check the tar czf exit status before replacing the original, and always
clean up temp dirs on failure. Also add a fallback redaction path in the
must-gather-clean failure branch, similar to clean_must_gather, so
artifacts_e2e-metadata archives still get sanitized when must-gather-clean
cannot run.
- Around line 38-48: The directory swap in collector.sh can lose data because rm
-rf on MUST_GATHER_DIR happens before the cleaned mv succeeds. Update the
must-gather cleanup flow so the original MUST_GATHER_DIR is only removed after
mv "${mg_clean_dir}" "${MUST_GATHER_DIR}" completes successfully, and keep the
existing must-gather-clean fallback path intact; use the must-gather-clean, rm,
and mv steps in collector.sh as the place to adjust the order.
---
Nitpick comments:
In `@artifacts-collector/Containerfile`:
- Around line 10-14: The must-gather-clean download in the Containerfile is
extracted without any integrity verification. Update the build step around the
mgc stage to fetch the release checksum asset first, verify the downloaded
tarball against the upstream SHA256_SUM file, and only then extract it into
/usr/local/bin; also make the shell fail on download errors so a broken or
tampered fetch cannot pass silently.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 7003bdea-dcc3-4e11-9f82-4f2da3d392d3
📒 Files selected for processing (4)
artifacts-collector/Containerfileartifacts-collector/collector.shartifacts-collector/mgc-config-e2e.yamlartifacts-collector/mgc-config-mustgather.yaml
- Containerfile: add SHA256 checksum verification for must-gather-clean download - collector.sh: safe directory swap in clean_must_gather (mv before rm) - collector.sh: write to temp file in clean_e2e_metadata, replace only on success Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fedora-minimal:41 does not include tar by default, causing the must-gather-clean download stage to fail during image build. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tterns Integration test showed 9 remaining findings after initial configs: - 5 in controllerconfigs (JWTs and registry pull secret) - 2 in e2e metadata (Authorization: Basic headers) - 1 in imageregistry config (httpSecret) - 1 in oauthclients (secret field) Fixed by adding: - ControllerConfig omit rule in mustgather config - Authorization header regex in both configs - Generic secret/httpSecret regex in mustgather config Verified locally: 9 remaining findings reduced to 0. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@artifacts-collector/mgc-config-e2e.yaml`:
- Around line 9-11: Update the Authorization redaction regex in the FileContents
rule to be case-insensitive so lowercase and mixed-case authorization headers
are also matched; apply the same change to the corresponding regex entry in the
mustgather config. Keep the existing token-matching portion intact and add the
case-insensitive flag directly in the pattern so the redaction covers common log
and curl output variants.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: f7a0c25f-7dcd-4871-9559-794fb2109b81
📒 Files selected for processing (2)
artifacts-collector/mgc-config-e2e.yamlartifacts-collector/mgc-config-mustgather.yaml
✅ Files skipped from review due to trivial changes (1)
- artifacts-collector/mgc-config-mustgather.yaml
|
Local tests shows leaktk findings are 0 while scanning reports Now findings shows 0 in local testing |
mtulio
left a comment
There was a problem hiding this comment.
Couple nits related to the verbosity as well the MGM plugin.
|
/hold |
- Add -v4 flag to must-gather-clean calls, redirect output to dedicated log files (artifacts_log-mgc-*.log) to avoid mixing with pod stdout - Add must-gather-clean step for must-gather-monitoring (metrics) archive before packing, consistent with must-gather cleaning
Keep errors visible for debugging instead of redirecting to /dev/null.
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
artifacts-collector/collector.sh (1)
62-64: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick winKeep a fallback for e2e archive redaction.
If
must-gather-cleanfails here, the originalartifacts_e2e-metadata-*.tar.gzremains unredacted and packaging continues, which reintroduces the token findings this PR targets. Add a fallback redaction path or fail the archive collection explicitly.Suggested fallback shape
must-gather-clean -c /plugin/mgc-config-e2e.yaml \ -i "${tmpdir}" -o "${cleandir}" \ - -v4 >>./artifacts_log-mgc-e2e-metadata.log 2>&1 || { rm -rf "${tmpdir}" "${cleandir}"; continue; } + -v4 >>./artifacts_log-mgc-e2e-metadata.log 2>&1 || { + os_log_info "[executor][PluginID#${PLUGIN_ID}] must-gather-clean failed for ${archive##*/}; applying fallback redaction" + find "${tmpdir}" -type f -exec sed -i \ + -e 's/sha256~[A-Za-z0-9_-]\+/sha256~<sensitive>/g' \ + -e 's/eyJ[A-Za-z0-9_-]\{10,\}\.eyJ[A-Za-z0-9_-]\{10,\}\.[A-Za-z0-9_-]\+/JWT-REDACTED/g' {} + + cleandir="${tmpdir}" + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@artifacts-collector/collector.sh` around lines 62 - 64, The e2e metadata redaction step in collector.sh currently falls back by cleaning up temp dirs and continuing, which leaves the original artifacts_e2e-metadata-*.tar.gz unredacted and allows packaging to proceed. Update the must-gather-clean handling in the archive collection flow to either run an alternate redaction path when the clean step fails or stop the collection explicitly, so the final artifacts are never packaged without redaction. Keep the fix localized around the must-gather-clean invocation and the surrounding archive packaging logic.Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@artifacts-collector/collector.sh`:
- Around line 194-201: The metrics cleanup flow in collector.sh can still
archive the original must-gather-metrics content when must-gather-clean fails
because the current && block is simply skipped. Update the must-gather-metrics
handling around must-gather-clean, metrics_src_dir, and metrics_clean_dir to add
an explicit failure path that either applies fallback redaction or skips/fails
metrics packaging entirely, and ensure the pack step only uses sanitized
content.
- Around line 198-200: The metrics directory swap in collector.sh is not
failure-safe: if the second mv from metrics_clean_dir to metrics_src_dir fails,
the original stays renamed as *-orig and later packaging can break. Update the
swap logic around the metrics_src_dir and metrics_clean_dir moves to check each
mv result and restore the original directory from the *-orig backup before
continuing or exiting, so the original metrics directory is always preserved on
failure.
---
Duplicate comments:
In `@artifacts-collector/collector.sh`:
- Around line 62-64: The e2e metadata redaction step in collector.sh currently
falls back by cleaning up temp dirs and continuing, which leaves the original
artifacts_e2e-metadata-*.tar.gz unredacted and allows packaging to proceed.
Update the must-gather-clean handling in the archive collection flow to either
run an alternate redaction path when the clean step fails or stop the collection
explicitly, so the final artifacts are never packaged without redaction. Keep
the fix localized around the must-gather-clean invocation and the surrounding
archive packaging logic.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 340dd2e8-3868-4c2e-9135-139bfeae04a0
📒 Files selected for processing (1)
artifacts-collector/collector.sh
…e regex - Skip metrics packing if must-gather-clean fails (don't pack uncleaned) - Restore original metrics directory if swap fails - Add (?i) to Authorization header regex for case-insensitive matching
mtulio
left a comment
There was a problem hiding this comment.
/lgtm
/approve
Feel free to drop hold label.
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mtulio The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Removing approval labels as we are discussing on slack how we can keep secret manifests obfuscated, not removing it |
…cation Must-gather already obfuscates secret values (e.g., 'token: MTAwOSBieXRlcyBsb25n' which decodes to '1009 bytes long'). The omit rule was removing files that are already safe. Removing it to preserve secrets.yaml for audit with obfuscated values. leaktk-gcs-filter in CI does not flag these obfuscated values (confirmed by must-gather.tar being available in regular CI artifacts). Will evaluate in OPCT CI after merge.
|
Integration test results (4.22.3 IPI, latest PR #88 image + PR #223 binary): Collector image verified: Retrieve timing: leaktk scan: All in must-gather secrets.yaml files (obfuscated values e.g. "token: MTAwOSBieXRlcyBsb25n" = "1009 bytes long") All 30 findings are must-gather's built-in obfuscation — not real secrets. leaktk-gcs-filter in CI does not remove must-gather with these values (confirmed by must-gather.tar being available in regular CI artifacts). MGC log file sizes: |
Summary
This prevents leaktk-gcs-filter from removing OPCT CI archives from GCS. The bulk of findings (3070 of 3157) are sha256~ OAuth tokens inside
artifacts_e2e-metadata-*.tar.gzfrom openshift-tests output.Verified locally: must-gather-clean with regex config reduces sha256~ findings in e2e metadata from 1212 to 0.
Related: redhat-openshift-ecosystem/opct#223 (non-nested scanner improvements, defense in depth)
Test plan
/cc @mtulio
🤖 Generated with Claude Code