This repository contains two AI-powered pipelines that generate executive summaries of GitHub repositories, projects, and portfolios over a configurable time window. Both pipelines produce the same three-tier output — repo-level, project-level, and portfolio-level summaries — and are designed for direct comparison.
| Pipeline | Approach | Models | Output suffix |
|---|---|---|---|
| Chat-based (`src/`) | Pre-fetches activity data, feeds it to OpenAI via Python | Configurable per tier | `_chatbased` |
| Agentic (`agentic/`) | GitHub Copilot CLI reads repos and activity autonomously | GitHub Copilot | `_agentbased` |
Both pipelines write to the same reports/ and reports_pdf/ folders. Each pipeline only cleans its own output files, so you can run both and keep both sets of results side by side.
Both pipelines share the same .env file. Use .env_example as a template.
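For reference, a minimal `.env` might look like the following. The variable names here are assumptions based on the credentials described below; check `.env_example` for the exact keys:

```
GITHUB_TOKEN=ghp_your_token_here
OPENAI_API_KEY=sk-your_key_here
```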
- GitHub token: Go to GitHub.com > Settings > Developer settings > Personal access tokens > Tokens (classic) > Generate a token. Select the `repo` and `read:org` scopes. You need NIH-CFDE organization access.
- OpenAI API key: Go to the OpenAI API dashboard and generate an API key.
- GitHub Copilot subscription: You need an active GitHub Copilot subscription and must authenticate locally before running (see Section 2).
The agentic pipeline uses GitHub Copilot CLI, which must be authenticated with your GitHub account before running. This only needs to be done once on your local machine.
```bash
# Install Copilot CLI locally if you haven't already
npm install -g @githubnext/github-copilot-cli

# Authenticate; this saves credentials to ~/.config/github-copilot/
copilot auth
```

Follow the prompts to log in with your GitHub account. You need an active GitHub Copilot subscription. Once authenticated, the Docker run command mounts these credentials into the container so Copilot can identify you without re-authenticating.
Agentic pipeline:

```bash
docker build -t cfde-pipeline --no-cache .

docker run --rm --env-file .env \
  -v "$PWD/data:/app/data" \
  -v "$PWD/reports:/app/reports" \
  -v "$PWD/reports_pdf:/app/reports_pdf" \
  -v "$HOME/.config/github-copilot:/root/.config/github-copilot:ro" \
  cfde-pipeline
```

Chat-based pipeline:

```bash
docker build -t cfde-pipeline --build-arg PIPELINE=chatbased --no-cache .

docker run --rm --env-file .env \
  -v "$PWD/data:/app/data" \
  -v "$PWD/reports:/app/reports" \
  -v "$PWD/reports_pdf:/app/reports_pdf" \
  cfde-pipeline
```

The `-v "$HOME/.config/github-copilot:/root/.config/github-copilot:ro"` mount is only needed for the agentic pipeline; the chat-based pipeline does not need it.
- Time window: Change `--days=365` in `src/full.sh` or `agentic/full.sh`.
- Models (chat-based only): Change the `--model` flags in `src/full.sh`.
- Project cohort: `projects_seed.csv` is auto-generated by `build_projects_seed.py`. To use a different set of projects, update `data/projects_seed.csv` manually and remove the `build_projects_seed.py` call from `full.sh`.
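The `--days` and `--model` flags above could be wired up with standard argument parsing; a hypothetical sketch (the flag names come from this README, but the actual handling lives in the `full.sh` scripts and the Python scripts they call):

```python
import argparse

def parse_args(argv=None):
    """Hypothetical CLI wiring for the flags mentioned in this README."""
    p = argparse.ArgumentParser()
    p.add_argument("--days", type=int, default=365,
                   help="length of the activity window in days")
    p.add_argument("--model", default=None,
                   help="OpenAI model for this tier (chat-based pipeline only)")
    return p.parse_args(argv)
```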
Both pipelines share steps 1–4 and diverge at step 5. Steps 5–7 are listed below first for the chat-based pipeline, then for the agentic pipeline; step 8 is shared again.
1) Clean outputs
Removes only the current pipeline's output files from reports/ and reports_pdf/.
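The suffix-scoped cleanup can be sketched as follows. This is a hypothetical helper, not the project's actual code; the glob pattern assumes the `*__<suffix>.md` naming scheme described in this README:

```python
from pathlib import Path

def clean_outputs(suffix: str) -> list[str]:
    """Remove only this pipeline's files from reports/ and reports_pdf/.

    `suffix` would be "_chatbased" or "_agentbased"; files from the other
    pipeline are left untouched, so both result sets can coexist.
    """
    removed = []
    for folder in ("reports", "reports_pdf"):
        for f in Path(folder).glob(f"*{suffix}*"):
            f.unlink()
            removed.append(str(f))
    return removed
```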
2) src/build_projects_seed.py
Fetches repository and project information from the CFDE-Eval core private repository and writes data/projects_seed.csv. To use a custom project cohort, skip this step and edit the CSV directly.
3) src/fetch_github_activity.py
Uses GraphQL to fetch all GitHub activity (commits, PRs, issues, releases, stars, forks) for all repositories in projects_seed.csv. Retry logic is included for network failures.
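The retry behaviour might look something like this stdlib sketch; the real query strings, pagination, and error handling live in `fetch_github_activity.py`:

```python
import json
import time
import urllib.request

GQL_ENDPOINT = "https://api.github.com/graphql"

def gql_query(token: str, query: str, variables: dict, retries: int = 3) -> dict:
    """POST a GraphQL query to the GitHub API, retrying transient failures."""
    payload = json.dumps({"query": query, "variables": variables}).encode()
    req = urllib.request.Request(
        GQL_ENDPOINT,
        data=payload,
        headers={
            "Authorization": f"bearer {token}",
            "Content-Type": "application/json",
        },
    )
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                return json.loads(resp.read())
        except OSError:  # URLError and socket timeouts both subclass OSError
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # exponential backoff before retrying
```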
4) src/normalize_activity.py and src/rollup_projects.py
Normalize raw data into cleaned parquet tables and per-project JSON rollups for downstream consumption.
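A per-project rollup could be shaped like the sketch below. The field names are illustrative only; the real schema (and the parquet writing, typically via pandas) is defined in `normalize_activity.py` and `rollup_projects.py`:

```python
from collections import Counter

def rollup_project(events: list[dict], project_id: str) -> dict:
    """Collapse normalized activity rows into one per-project JSON rollup.

    Each event is assumed to carry at least a "repo" and an event "type"
    (commit, pr, issue, release, star, fork).
    """
    return {
        "project_id": project_id,
        "repos": sorted({e["repo"] for e in events}),
        "counts": dict(Counter(e["type"] for e in events)),
    }
```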
5) src/summarize_repos.py
Shallow-clones each repository and uses a map-reduce approach over the codebase to infer its goal. Combines this with fetched activity data and calls an OpenAI model to produce a ## Summary and Goal + ## Recent Developments report per repository. Output: reports/<PROJECT_ID>__<owner>__<repo>__chatbased.md.
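The map-reduce pass can be sketched as below, with `llm` as a stand-in callable (prompt in, text out); the real script calls an OpenAI model and merges the fetched activity data into the final prompt:

```python
def summarize_repo(files: dict[str, str], llm, chunk_chars: int = 8000) -> str:
    """Map-reduce summarization sketch: summarize chunks of the codebase,
    then summarize the partial summaries into one repo-level report."""
    # Map: pack file contents into bounded chunks and summarize each one.
    chunks, buf = [], ""
    for path, text in files.items():
        buf += f"\n# {path}\n{text}"
        if len(buf) >= chunk_chars:
            chunks.append(buf)
            buf = ""
    if buf:
        chunks.append(buf)
    partials = [llm(f"Summarize this code:\n{c}") for c in chunks]
    # Reduce: combine the partial summaries into a single summary.
    return llm("Combine into one repo summary:\n" + "\n".join(partials))
```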
6) src/summarize_projects.py
Reads all repo-level _chatbased.md files for each project and calls an OpenAI model to synthesize a single project-level executive summary. Output: reports/<PROJECT_ID>__chatbased.md.
7) src/summarize_portfolio.py
Reads all project-level _chatbased.md files and calls an OpenAI model to produce a single portfolio-wide summary. Output: reports/_portfolio_full__chatbased.md.
5) agentic/run_repo_summaries.sh + agentic/build_activity_context.py
Shallow-clones each repository, injects the same activity data as the chat-based pipeline into a _activity_context.md file, then runs GitHub Copilot CLI inside the clone using the repo-summary skill. Copilot reads the code and activity autonomously and writes the summary. Output: reports/<PROJECT_ID>__<owner>__<repo>__agentbased.md.
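The context-injection step might look like this sketch; the file name comes from this README, but the fields and layout are assumptions (see `agentic/build_activity_context.py` for the real logic):

```python
import json
from pathlib import Path

def write_activity_context(rollup: dict, clone_dir: Path) -> Path:
    """Drop the project's activity rollup into the clone as Markdown so
    Copilot CLI can read it alongside the code."""
    lines = ["# Activity context", ""]
    for key, value in rollup.items():
        lines.append(f"- **{key}**: {json.dumps(value)}")
    out = clone_dir / "_activity_context.md"
    out.write_text("\n".join(lines))
    return out
```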
6) agentic/run_project_summaries.sh
Creates a temporary working directory per project containing all its repo-level _agentbased.md files, then runs Copilot CLI using the project-summary skill to synthesize a project-level summary. Output: reports/<PROJECT_ID>__agentbased.md.
7) agentic/run_portfolio_summary.sh
Gathers all project-level _agentbased.md files into a working directory and runs Copilot CLI using the portfolio-summary skill to produce a portfolio-wide summary. Output: reports/_portfolio_full__agentbased.md.
8) src/make_pdfs.py
Converts all Markdown reports in reports/ to styled PDFs saved in reports_pdf/. Each pipeline's outputs are named with their respective suffix so both sets can coexist.