BUILD-10745 Add audit script #235

Closed
mikolaj-matuszny-ext-sonarsource wants to merge 2 commits into master from feat/mmatuszny/BUILD-10745-audit-script

mikolaj-matuszny-ext-sonarsource (Contributor) commented Mar 20, 2026

Summary

Adds a Python audit script (tools/audit-action-version.py) that scans a GitHub organization to find all usages of a specified GitHub Action and reports which repositories are not using an allowed version (tag or SHA).

How it works

  1. Search - Uses GitHub Code Search API (via gh CLI) to find all references to the target action in .github/ directories across the org
  2. Inspect - Fetches each matched file and extracts action version references using regex
  3. Report - Outputs a CSV (stdout or file) with repo, file, line number, current ref, and compliance status
  4. Summary - Prints a compliance summary to stderr; exits non-zero if any non-compliant refs are found
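Steps 3 and 4 above can be sketched roughly as follows. This is an illustrative sketch, not the code from tools/audit-action-version.py: the `Finding` record, the `write_report` function, the column names, and the yes/no values are all assumptions made for the example.

```python
import csv
from typing import Iterable, NamedTuple, Set, TextIO


class Finding(NamedTuple):
    # Hypothetical record shape; the real script's fields may differ.
    repo: str
    file: str
    line: int
    ref: str


def write_report(findings: Iterable[Finding], allowed_refs: Set[str],
                 out: TextIO, err: TextIO) -> int:
    """Write one CSV row per finding to `out` and a summary line to `err`.

    Returns the exit code the caller should use: 1 if any ref is not in
    `allowed_refs`, otherwise 0.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["repo", "file", "line", "ref", "compliant"])
    total = 0
    bad = 0
    for f in findings:
        compliant = f.ref in allowed_refs
        writer.writerow([f.repo, f.file, f.line, f.ref,
                         "yes" if compliant else "no"])
        total += 1
        if not compliant:
            bad += 1
    print(f"{total - bad}/{total} refs compliant", file=err)
    return 1 if bad else 0
```

A caller would typically pass `sys.stdout` and `sys.stderr` and hand the return value to `sys.exit`, which keeps the CSV pipeable while the summary stays visible in CI logs.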

Usage

python tools/audit-action-version.py \
    --org SonarSource \
    --action SonarSource/gh-action_cache \
    --allowed-refs v1,54a48984cf6564fd48f3c6c67c0891d7fe89604c \
    [--output report.csv] [--verbose]

Prerequisites

  • gh CLI (authenticated)
  • Python 3.7+

hashicorp-vault-sonar-prod bot commented Mar 20, 2026

BUILD-10745

sonar-review-alpha bot commented Mar 20, 2026

Summary

This PR adds a Python audit script to detect non-compliant GitHub Action versions across an organization. It searches all .github/ files for a target action, extracts version refs from workflow files, and reports which repos aren't using an allowed version. The script tolerates API limits and file fetch errors, writes a CSV report, and exits with code 1 if any non-compliant refs are found.

What reviewers should know

Key implementation details to watch:

  1. Rate limiting (lines 103, 247): Respects GitHub's 10 req/min search API cap (6s between pages) and secondary limits (0.5s between file fetches). These are deliberate and necessary.

  2. Version extraction (line 191): The regex handles variable YAML formatting—quoted/unquoted refs, optional subpaths, inline whitespace. Test with real workflow syntax if you're unfamiliar with the variations.

  3. Deduplication (line 108): Search API returns duplicates across paginated results. The dedup logic runs on search results, not after fetching—efficient but worth understanding why.

  4. File handling (lines 171–183): Uses GitHub API's base64-encoded content endpoint (not raw.githubusercontent.com). Falls back gracefully if a file can't be fetched or decoded—won't stop the audit.

  5. Output behavior: CSV goes to stdout (unless --output specified), summary to stderr. Exit code is 1 if any non-compliant refs found—important for CI pipelines.

Review focus: Regex correctness (does it match the action syntax you care about?), rate limit sleeps (necessary but could be tuned if you hit API errors), and the error recovery strategy (is failing silently on unreachable files acceptable?).
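To make the file-handling point (item 4) concrete, here is a minimal sketch of decoding a contents-endpoint payload without aborting on bad input. It assumes the JSON has already been parsed into a dict with `encoding` and `content` keys, as the GitHub contents API returns for files; the function name `decode_contents_payload` is hypothetical and not taken from the script.

```python
import base64
import binascii
from typing import Optional


def decode_contents_payload(payload: dict) -> Optional[str]:
    """Return the decoded file text, or None if the payload is unusable.

    Returning None instead of raising lets the caller log a warning and
    continue with the next file.
    """
    if payload.get("encoding") != "base64":
        return None
    try:
        # The API wraps base64 content in newlines; b64decode ignores them
        # by default (validate=False).
        raw = base64.b64decode(payload.get("content", ""))
        return raw.decode("utf-8")
    except (binascii.Error, UnicodeDecodeError, TypeError):
        return None
```

One possible answer to the "failing silently" question is for the caller to count the `None` results and include that count in the stderr summary, so an incomplete report is recognizable as such.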


sonar-review-alpha bot left a comment

Conclusion: Clean, well-structured tool. Logic is correct — the for…else on the pagination loop, the deduplication, and the regex ref extraction all behave as intended. No issues worth blocking on.

mikolaj-matuszny-ext-sonarsource (Contributor, Author) commented

Code review

Found 4 issues:

  1. Regex matches action name prefixes, causing false positives. The character class [^@\s'"#]* between the action name and @ absorbs any extra characters, so auditing SonarSource/gh-action_cache would also match SonarSource/gh-action_cache-extra@v1. Consider allowing only an optional /subpath after the action name. A \b word boundary alone would not help here, because - is a non-word character.

# Match "uses: owner/action[/optional/subpath]@ref" with optional quotes and whitespace
pattern = re.compile(
    rf"uses:\s*['\"]?{re.escape(action)}[^@\s'\"#]*@([^\s'\"#]+)"
)

  2. Search API rate limit sleep is too short. time.sleep(2) allows ~30 req/min, but the GitHub Code Search API limit for authenticated users is 10 req/min. The code comment's "30 req/min" is also incorrect. This will cause silent truncation of search results on larger orgs. The sleep should be at least 6 seconds (60 s / 10 requests).

if page < max_pages:
    time.sleep(2)  # Respect 30 req/min search rate limit
else:

  3. No rate-limit delay between file content fetches. The search phase sleeps between pages, but the content-fetch loop (up to 1000 files) has no delay. GitHub's secondary rate limit can respond with 403s, which are caught and logged only as warnings, producing an incomplete report that appears complete.

for i, file_info in enumerate(matched_files):
    repo = file_info["repo"]
    filepath = file_info["path"]
    versions = extract_versions_from_file(repo, filepath, args.action)
    for v in versions:
        compliant = v["ref"] in allowed_refs

  4. Docstring claims Python 3.6+ but code requires 3.7+. The script uses from __future__ import annotations (PEP 563) and dataclasses, both of which require Python 3.7+.

Prerequisites: gh CLI (authenticated), Python 3.6+
Usage:
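The prefix false positive from issue 1 can be reproduced in isolation. In the sketch below, `loose_pattern` reproduces the pattern quoted in the review, while `strict_pattern` is one possible fix under the assumption that only an optional /subpath may sit between the action name and @. Both function names are made up for this example, and the actual fix in the follow-up commit may differ.

```python
import re


def loose_pattern(action):
    # Pattern quoted in the review: any run of characters other than
    # @, whitespace, quotes, or # may follow the action name.
    return re.compile(
        rf"uses:\s*['\"]?{re.escape(action)}[^@\s'\"#]*@([^\s'\"#]+)"
    )


def strict_pattern(action):
    # Assumed fix: only an optional "/subpath" may follow the action name.
    return re.compile(
        rf"uses:\s*['\"]?{re.escape(action)}(?:/[^@\s'\"#]*)?@([^\s'\"#]+)"
    )


ACTION = "SonarSource/gh-action_cache"
SAMPLES = [
    "uses: SonarSource/gh-action_cache@v1",
    "uses: SonarSource/gh-action_cache/save@v1",
    "uses: 'SonarSource/gh-action_cache@v1'  # pinned",
    "uses: SonarSource/gh-action_cache-extra@v1",  # a different action
]

for sample in SAMPLES:
    loose = bool(loose_pattern(ACTION).search(sample))
    strict = bool(strict_pattern(ACTION).search(sample))
    print(f"{sample!r}: loose={loose} strict={strict}")
```

Only the last sample should differ between the two patterns: the loose one reports it as a usage of the audited action, the strict one does not.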

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

- Fix regex to only allow /subpath after action name, preventing false
  positives on actions sharing a name prefix
- Increase search API sleep from 2s to 6s to respect 10 req/min limit
- Replace for/else with explicit check for 1000-result API cap
- Add 0.5s delay between file content fetches to avoid secondary rate limits
- Fix docstring: Python 3.7+ (not 3.6+), correct script path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

sonar-review-alpha bot left a comment

Conclusion: The commit fixes several real bugs from the previous review (rate limit value, for...else logic, regex group index). One new bug was introduced in the process: the secondary rate limit sleep is placed after the API call it's meant to throttle.

versions = extract_versions_from_file(repo, filepath, args.action)

if i > 0:

The sleep is placed after extract_versions_from_file (which makes the gh api call), not before it. This means there's zero delay between the 0th and 1st file fetch — the pair most likely to trigger a secondary rate limit since they fire back-to-back immediately after the search completes.

Move the sleep before the fetch:

for i, file_info in enumerate(matched_files):
    repo = file_info["repo"]
    filepath = file_info["path"]

    if i > 0:
        time.sleep(0.5)  # Avoid hitting GitHub secondary rate limits

    versions = extract_versions_from_file(repo, filepath, args.action)
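One way to make the sleep ordering hard to get wrong, and easy to test, is to pull it into a small helper with an injectable sleep function. This is only a sketch of the idea; `fetch_throttled` and its parameters are hypothetical and not part of the script.

```python
import time


def fetch_throttled(items, fetch, delay=0.5, sleep=time.sleep):
    """Yield (item, fetch(item)) pairs.

    Sleeps `delay` seconds before every call to `fetch` except the first,
    so no two fetches run back-to-back.
    """
    for i, item in enumerate(items):
        if i > 0:
            sleep(delay)
        yield item, fetch(item)
```

In a unit test, `sleep` can be replaced with a recorder to assert that every fetch after the first is preceded by a delay, without actually waiting.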