CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Commands

Data Collection

Monthly PR collection: ./collect-monthly.sh "YYYY-MM" [true/false] - Collects PR data for specified month, optional Google Sheets update
Daily updates: ./update-daily.sh - Updates current month's data and failing PR statistics
Bulk collection with retries: ./retry-collection.sh - Collects all data from July 2024 onwards with exponential backoff
JDK 25 testing: ./test-jdk-25.sh - Tests top 250 plugins with JDK 25, writes results to CSV
Install JDK versions: ./install-jdk-versions.sh - Installs multiple JDK versions for compatibility testing

Development Setup

Go dependencies: go mod download && go mod tidy
Go build: go build jenkins-pr-collector.go - Builds the main collector binary
Go run directly: go run jenkins-pr-collector.go -start YYYY-MM-DD -end YYYY-MM-DD -output file.json
Python environment: python -m venv venv && source venv/bin/activate && pip install -r requirements.txt
Environment check: ./check-env.sh - Validates required tools and credentials

Testing and Analysis

Find JUnit 5 migration PRs: ./find-junit5-prs.sh - Searches for JUnit 5 migration-related PRs
Test plugin builds: ./test-pr-builds.sh - Tests building plugins from PR branches
Validate plugin list: ./validate-top-plugins.sh - Validates top-250-plugins.csv format
Analyze JUnit 5 PRs: ./analyze-junit5-prs.sh - Analyzes JUnit 5 migration patterns
Generate reports: ./generate-report.sh - Creates consolidated statistics reports

Utility Scripts

Filter PRs: ./filter-prs.sh input.json - Filters PR data by Jenkins-related criteria
Group PRs: ./group-prs.sh input.json plugins.json - Groups PRs by repository/plugin
Count PRs: ./count_prs.sh repos.txt year - Counts PRs for specific repositories
Process PRs: ./process_prs.sh - General PR data processing pipeline

GitHub Profile Tools

Build analyzer: (cd github-profile-tools && go build -o ../github-user-analyzer ./cmd/github-user-analyzer) - Builds the GitHub profile analyzer binary
Run tests: (cd github-profile-tools && go test ./...) - Runs unit test suite
Test specific package: (cd github-profile-tools && go test -v ./internal/cache) - Tests specific package with verbose output
Analyze user: ./github-user-analyzer -user=username - Generates comprehensive GitHub profile analysis with all templates by default
Analyze with specific template: ./github-user-analyzer -user=username -template=resume - Generates profile with specific template (resume, technical, executive, ats)
Analyze with custom usernames: ./github-user-analyzer -user=username -docker-user=dockerhub_user -discourse-user=discourse_user - Uses separate usernames for different platforms
Cache management: ./github-user-analyzer -cache-stats - Shows cache statistics, ./github-user-analyzer -clear-cache - Clears cache, ./github-user-analyzer -force-refresh - Forces refresh ignoring cache
Analyze with token: ./github-user-analyzer -user=username -token="$GITHUB_TOKEN" - Uses explicit GitHub token for API access

Architecture

Core Components

jenkins-pr-collector.go - Main Go application that queries GitHub GraphQL API to collect PR data
Shell scripts ecosystem - Bash scripts orchestrate data collection, processing, and reporting
Python integration - Handles Google Sheets uploads and data processing via upload_to_sheets.py
GitHub Actions workflows - Automated scheduling and execution in .github/workflows/
GitHub Profile Tools - Standalone Go application for analyzing GitHub user profiles and generating professional documentation

Data Flow

Collection: jenkins-pr-collector.go fetches PR data via GitHub GraphQL API
Processing: Scripts filter, group, and transform raw PR data
Storage: Data stored in data/ directory (monthly/, consolidated/, archive/)
Reporting: Processed data uploaded to Google Sheets and stored as JSON artifacts

Key Data Structures

Raw PR data: JSON files containing GitHub PR objects with metadata
Filtered data: PRs filtered by criteria (Jenkins-related, specific time periods)
Grouped data: PRs organized by plugin/repository for analysis
Build results: CSV files with plugin build status and JDK compatibility

Authentication

GitHub API: Requires GITHUB_TOKEN or PAT_TOKEN environment variable with repo, read:org, read:user scopes
Google Sheets: Requires GOOGLE_CREDENTIALS JSON service account file (set via environment or file path)
Rate limiting: Built-in exponential backoff and retry mechanisms for both GitHub and Google APIs

Automation Workflows

pr-stats.yml: Monthly collection (2nd of month at 00:00 UTC) and daily updates (midnight UTC)
test-jdk-25.yml: Weekly JDK 25 compatibility testing (Tuesdays at 00:00 UTC) with 6-hour timeout
updatecli.yml: Daily updates to top-250-plugins.csv from upstream Jenkins data
auto-merge-bot-prs.yml: Automatically merges bot-created PRs with "automation" label
pr-collector-test.yml: Weekly testing of PR collector functionality (Tuesdays at 07:18 UTC)
generate-top-plugins.yml: Updates plugin popularity data
run-update-daily-on-merge.yml: Triggers daily update when changes are merged
release-github-profile-tools.yml: Automated releases of GitHub Profile Tools with cross-platform binaries

Release Automation Plan

The GitHub Profile Tools binary will be automatically released through GitHub Actions:

Release Strategy

Trigger: Tag-based releases using semantic versioning (v1.0.0, v1.1.0, etc.)
Platforms: Cross-platform binaries for Windows (x64), Linux (x64, ARM64), and macOS (x64, ARM64)
Artifacts: Compressed binaries with checksums for verification
Changelog: Auto-generated release notes from commit messages and PR titles

Build Matrix

All builds use ubuntu-latest with Go cross-compilation:

Windows x64: GOOS=windows GOARCH=amd64 → github-user-analyzer-windows-amd64.exe
Linux x64: GOOS=linux GOARCH=amd64 → github-user-analyzer-linux-amd64
Linux ARM64: GOOS=linux GOARCH=arm64 → github-user-analyzer-linux-arm64
macOS x64: GOOS=darwin GOARCH=amd64 → github-user-analyzer-darwin-amd64
macOS ARM64: GOOS=darwin GOARCH=arm64 → github-user-analyzer-darwin-arm64

Release Process

Tag Creation: Create and push a version tag (e.g., git tag v1.0.0 && git push origin v1.0.0)
Automated Build: GitHub Action builds binaries for all supported platforms
Testing: Run basic smoke tests on each binary
Packaging: Compress binaries and generate SHA256 checksums
Release: Create GitHub release with auto-generated changelog and download links
Notification: Optional notifications to relevant channels

Directory Structure

data/monthly/ - Monthly PR collection files (prs_YYYY_MM.json, filtered_, grouped_)
data/consolidated/ - Aggregated data across time periods
data/archive/ - Older data files (>6 months)
data/junit5/ - JUnit 5 migration analysis data
data/profiles/ - Generated GitHub profile analyses and templates
data/cache/ - Cached analysis data for efficient template regeneration and incremental updates
data/progress/ - Temporary progress files for resuming interrupted analyses
github-profile-tools/ - GitHub profile analyzer Go application
- cmd/github-user-analyzer/ - Main CLI application entry point
- internal/github/ - GitHub API client with exponential backoff and retry logic (8 attempts)
- internal/profile/ - Profile analysis logic with incremental processing (50 repos per page)
- internal/cache/ - File-based cache system with TTL, compression, and thread-safety
- internal/docker/ - Docker Hub integration and expertise scoring
- internal/discourse/ - Discourse community engagement analysis
- internal/markdown/ - Template generator for multiple profile formats
- templates/ - Profile generation templates (resume, technical, executive, ats)
updatecli/ - Updatecli configuration for dependency updates
.github/workflows/ - GitHub Actions automation workflows

Key Files and Formats

Raw PR data: prs_YYYY_MM.json - Complete GitHub PR objects from GraphQL API
Filtered data: filtered_prs_YYYY_MM.json - Jenkins-related PRs only
Grouped data: grouped_prs_YYYY_MM.json - PRs organized by repository/plugin
JDK compatibility: jdk-25-build-results.csv - Plugin build results with JDK 25
Plugin list: top-250-plugins.csv - Most popular Jenkins plugins for testing

Error Handling

Scripts use set -e for immediate exit on errors
Rate limiting handled with exponential backoff for GitHub and Google APIs
Partial data preservation during collection failures with resume capability
Comprehensive logging to stdout/stderr and dedicated log files
Debug logging to build-debug.log, fetch_prs_debug.log, and other log files

GitHub Profile Tools - Key Implementation Details

Caching System

File-based cache with gzip compression and JSON serialization
TTL support with configurable expiration (default 24 hours)
Thread-safe operations with proper mutex locking
Cache key types: profile, repositories, organizations, contributions, languages, skills
Cache management: Statistics, clearing, force refresh, and invalidation by user
Storage location: data/cache/ with files named by cache key type and username

Incremental Analysis

Page-by-page processing: Repositories fetched in batches of 50 to handle large profiles
Progress saving: Analysis state saved after each major step for resumability
Graceful degradation: Continues with partial data if API calls fail
Progress files: Stored in data/progress/ for interrupted analysis resumption
Analysis steps: 8 sequential steps from basic info → Docker/Discourse → insights generation

API Resilience

Retry logic: Up to 8 attempts with exponential backoff and jitter
Retryable errors: Infrastructure failures (502, 503, 504), rate limits, timeouts, stream cancellations
Rate limit handling: Respects GitHub API rate limits with automatic backoff
Context support: All API calls respect context timeouts (default 6 hours for large analyses)

Docker & Discourse Integration

Docker Hub: Fetches public repositories, pull counts, expertise scoring (0-10 scale)
Docker expertise: Proficiency levels from beginner to expert based on usage patterns
Docker file detection: Scans for Dockerfile, docker-compose.yml, docker-bake.hcl, .dockerignore
Discourse: Jenkins community engagement analysis with separate username support
Optional platforms: Both Docker Hub and Discourse are optional and gracefully handle missing data

Testing Best Practices

Comprehensive unit tests for cache system covering concurrency, expiration, data integrity
String conversion: Use strconv.Itoa() instead of string(rune()) to avoid control character injection
Test coverage: Critical paths tested including error scenarios and edge cases
Test location: github-profile-tools/internal/cache/*_test.go, internal/profile/*_test.go

Cache Key Scoping (PR #193)

Scoped cache keys: Prevents cache poisoning when same GitHub user has different Docker/Discourse usernames
Scope format: "docker:{dockerUsername},discourse:{discourseUsername}" appended to cache key
Scope conditions: Only applied when Docker/Discourse usernames differ from GitHub username
Implementation: GetUserProfileKeyWithScope(), GetUserProfileWithCustomUsernames(), SetUserProfileWithCustomUsernames()
Cache invalidation: DeleteByPrefix() removes all scoped variants when invalidating user cache
- Pattern: "profile_username" matches "profile_username" and "profile_username_scope:..."
- Ensures -force-refresh and cache clearing work with scoped keys
Files: internal/cache/manager.go, internal/cache/storage.go, internal/profile/cache.go

Planned Improvements

Follow-up Refactoring (Post PR #193)

Code duplication: Refactor runAnalysis() and runAnalysisWithCache() in cmd/github-user-analyzer/main.go
- Both functions share similar structure and logic
- Extract common analysis workflow into shared helper function
- Reduce maintenance burden and improve code clarity
- Identified in CodeRabbit review comment
Progress file cache key alignment: Ensure progress files use same scoped key format as cache
- Store dockerUsername and discourseUsername in ProgressData struct
- Validate on resume that usernames match requested analysis
- Prevents resuming with wrong Docker/Discourse username data
- Identified in CodeRabbit review comment on analyzer.go:1182-1246

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CLAUDE.md

Commands

Data Collection

Development Setup

Testing and Analysis

Utility Scripts

GitHub Profile Tools

Architecture

Core Components

Data Flow

Key Data Structures

Authentication

Automation Workflows

Release Automation Plan

Release Strategy

Build Matrix

Release Process

Directory Structure

Key Files and Formats

Error Handling

GitHub Profile Tools - Key Implementation Details

Caching System

Incremental Analysis

API Resilience

Docker & Discourse Integration

Testing Best Practices

Cache Key Scoping (PR #193)

Planned Improvements

Follow-up Refactoring (Post PR #193)

FilesExpand file tree

CLAUDE.md

Latest commit

History

CLAUDE.md

File metadata and controls

CLAUDE.md

Commands

Data Collection

Development Setup

Testing and Analysis

Utility Scripts

GitHub Profile Tools

Architecture

Core Components

Data Flow

Key Data Structures

Authentication

Automation Workflows

Release Automation Plan

Release Strategy

Build Matrix

Release Process

Directory Structure

Key Files and Formats

Error Handling

GitHub Profile Tools - Key Implementation Details

Caching System

Incremental Analysis

API Resilience

Docker & Discourse Integration

Testing Best Practices

Cache Key Scoping (PR #193)

Planned Improvements

Follow-up Refactoring (Post PR #193)