prism-collect

Data collection engine for OUSPG-PRISM. Mirrors structured PR metadata and repository snapshots from the GitHub GraphQL API into a local PostgreSQL database, tracking 1,383 repositories across six analytical domains and five temporal cohorts.

Raw API responses are stored verbatim as JSONB for archival fidelity, while extracted columns are indexed for fast analytical queries. The database can be queried directly via SQL, attached to DuckDB for columnar analytics, or accessed from Python through SQLAlchemy.

Sample selection

The 1,383 tracked repositories were selected from an exhaustive API-driven population census of ~11,200 GitHub repos using a reproducible three-layer strategy:

Stratified random sampling (1,376 repos) -- from the eligible population, up to 80 repos drawn per (domain x star_tier) cell. Eligibility requires at least 20 merged PRs, a push after 2020-01-01, enrichment_status=ok, and not archived. The 6 domains and 3 star tiers produce 18 cells, ensuring balanced coverage across project types and popularity levels.
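The stratified draw described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual selection code: the dictionary keys (`merged_prs`, `pushed_at`, `enrichment_status`, `archived`) and the function names are hypothetical, and only the thresholds come from the text.

```python
import random
from collections import defaultdict
from datetime import date

CELL_CAP = 80                    # "up to 80 repos drawn per cell"
MIN_MERGED_PRS = 20              # "at least 20 merged PRs"
PUSH_CUTOFF = date(2020, 1, 1)   # "a push after 2020-01-01"


def is_eligible(repo: dict) -> bool:
    """Apply the four eligibility rules; the field names are hypothetical."""
    return (
        repo["merged_prs"] >= MIN_MERGED_PRS
        and repo["pushed_at"] > PUSH_CUTOFF
        and repo["enrichment_status"] == "ok"
        and not repo["archived"]
    )


def stratified_sample(population: list, seed: int = 0) -> list:
    """Draw up to CELL_CAP eligible repos from each (domain, star_tier) cell."""
    rng = random.Random(seed)  # a fixed seed keeps the draw reproducible
    cells = defaultdict(list)
    for repo in population:
        if is_eligible(repo):
            cells[(repo["domain"], repo["star_tier"])].append(repo)

    sample = []
    for key in sorted(cells):  # deterministic cell order
        members = cells[key]
        # Cells with fewer than CELL_CAP eligible repos are taken whole.
        sample.extend(rng.sample(members, min(CELL_CAP, len(members))))
    return sample
```

Cells with fewer than 80 eligible repos contribute everything they have, which is one plausible reading of why some cells in the composition table fall below 80.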

Sample composition

Domain	Tier 1 (anchor)	Tier 2 (midtier)	Tier 3 (longtail)	Total
agentic_coding_tools	29	76	80	185
devtools	80	80	80	240
ml_ai	80	80	80	240
security_critical	71	80	80	231
systems	80	80	80	240
web	80	80	80	240
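The totals in the table can be cross-checked with a short script. The cell counts below are copied from the table; the variable names are only for illustration. The grand total comes to 1,376, which matches the stratified random sampling count.

```python
# Cell counts copied from the sample composition table: (tier 1, tier 2, tier 3).
CELLS = {
    "agentic_coding_tools": (29, 76, 80),
    "devtools": (80, 80, 80),
    "ml_ai": (80, 80, 80),
    "security_critical": (71, 80, 80),
    "systems": (80, 80, 80),
    "web": (80, 80, 80),
}

# Sum across tiers for each domain (the "Total" column).
row_totals = {domain: sum(counts) for domain, counts in CELLS.items()}

# Sum across domains for each tier (column totals).
tier_totals = [sum(column) for column in zip(*CELLS.values())]

grand_total = sum(row_totals.values())

print(row_totals["agentic_coding_tools"])  # 185
print(tier_totals)                         # [420, 476, 480]
print(grand_total)                         # 1376
```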

Top languages: Python (420), TypeScript (283), Go (262), Rust (147), JavaScript (86), C (75), C++ (63), Java (35).

Temporal cohorts

The sample spans five eras aligned with AI-tooling inflection points, enabling difference-in-differences analysis:

Cohort	Repos	Share	Era boundary
pre_ai	984	71.1%	Created before Copilot GA (2022-06-21)
ai_assisted_era	181	13.1%	Copilot GA to Devin announcement (2024-03-12)
agentic_discourse_era	21	1.5%	Devin to SWE-agent ecosystem (2024-07-01)
swe_agent_era	175	12.7%	SWE-agent ecosystem to normalization (2026-01-15)
agentic_normalized	22	1.6%	After agentic normalization

The pre_ai cohort (71.1%) well exceeds the 40% minimum needed for the treated group in the difference-in-differences design. Each repo also carries 18 per-inflection-point cohort flags and a vibecoding-era marker stored as JSONB in repos.cohort_flags.
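As a rough illustration of how the era boundaries partition the sample, the sketch below maps a date to a cohort name and recomputes the share column. It assumes that cohorts are assigned by repository creation date (the table only says so explicitly for pre_ai) and that a repo created exactly on a boundary date belongs to the later era. `assign_cohort` is a hypothetical helper, not part of the PRISM code.

```python
from bisect import bisect_right
from datetime import date

# Era boundaries as listed in the cohort table, in chronological order.
BOUNDARIES = [
    date(2022, 6, 21),  # Copilot GA
    date(2024, 3, 12),  # Devin announcement
    date(2024, 7, 1),   # SWE-agent ecosystem
    date(2026, 1, 15),  # agentic normalization
]
COHORTS = [
    "pre_ai",
    "ai_assisted_era",
    "agentic_discourse_era",
    "swe_agent_era",
    "agentic_normalized",
]


def assign_cohort(created: date) -> str:
    """Return the cohort whose era contains `created`.

    bisect_right counts the boundaries that are <= created, so a date that
    falls exactly on a boundary lands in the later era (an assumption).
    """
    return COHORTS[bisect_right(BOUNDARIES, created)]


# Repo counts copied from the cohort table.
COUNTS = {
    "pre_ai": 984,
    "ai_assisted_era": 181,
    "agentic_discourse_era": 21,
    "swe_agent_era": 175,
    "agentic_normalized": 22,
}
total = sum(COUNTS.values())  # 1383
shares = {name: round(100 * n / total, 1) for name, n in COUNTS.items()}
```

The recomputed shares agree with the table, and the counts sum to the 1,383 tracked repositories.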

Architecture

                    GitHub GraphQL API
                           |
                           v
┌──────────────────────────────────────────────────────┐
│  Host machine                                        │
│                                                      │
│  ┌────────────┐       ┌────────────┐                 │
│  │  Postgres   │<──────│   PRISM    │                 │
│  │  (JSONB +   │       │   CLI /    │                 │
│  │   indexed)  │       │   Worker   │                 │
│  └─────┬──────┘       └────────────┘                 │
│        │                                             │
│        │  postgres_scanner / psycopg2                 │
│        v                                             │
│  ┌───────────┐    ┌───────────┐    ┌───────────────┐ │
│  │  DuckDB   │    │  psql /   │    │  Python       │ │
│  │ (columnar │    │  pgAdmin  │    │ (SQLAlchemy / │ │
│  │  queries) │    │           │    │  pandas)      │ │
│  └───────────┘    └───────────┘    └───────────────┘ │
└──────────────────────────────────────────────────────┘

Quickstart

git clone https://github.com/ouspg/PRISM && cd PRISM/prism-collect
cp .env.example .env
# Edit .env -- set GITHUB_PAT to a GitHub PAT with repo + read:org scope

# Start Postgres
docker compose up db -d

# Install Python dependencies
uv sync

# Initialize database schema
prism db init

# Seed the repo tracking list
prism seed --csv-file example/repo_sample_list.csv

# (Optional) Import previously collected PR CSVs so collection resumes
# from where it left off rather than re-fetching everything
python scripts/import_csv_prs.py --csv-dir example/github_prs

# Start collecting
prism collect --domain all

Database schema

PRISM uses four core tables:

Table	Purpose
`repos`	Tracked repositories with domain, temporal cohort, star tier, selection metadata, and cohort flags
`pull_requests`	PR metadata (author, dates, review counts, labels, diffs) plus the full raw API response as JSONB
`repo_snapshots`	Point-in-time repository metrics (stars, forks, issues, PRs, community files)
`sync_log`	Audit trail for each collection run (status, item count, errors)

The schema is defined in sql/001_init.sql and applied idempotently by prism db init.

Key columns on `pull_requests`

author_login, author_type (User/Bot/Mannequin), author_association, was_merged, created_at, merged_at, additions, deletions, changed_files, total_review_count, total_comment_count, label_names, has_closing_issue_reference, first_review_submitted_at, plus the complete raw JSONB.
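These columns are enough to derive common review metrics without touching the raw JSONB. The sketch below computes time to first review from created_at and first_review_submitted_at. It assumes rows arrive as dictionaries keyed by column name and that first_review_submitted_at is empty for PRs that were never reviewed; `hours_to_first_review` is a hypothetical helper, and the sample rows are made up for illustration.

```python
from datetime import datetime, timezone
from statistics import median
from typing import Optional


def hours_to_first_review(
    created_at: datetime,
    first_review_submitted_at: Optional[datetime],
) -> Optional[float]:
    """Hours between PR creation and its first review, or None if unreviewed."""
    if first_review_submitted_at is None:
        return None
    return (first_review_submitted_at - created_at).total_seconds() / 3600


# Made-up rows shaped like the pull_requests columns listed above.
rows = [
    {
        "created_at": datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc),
        "first_review_submitted_at": datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc),
    },
    {
        "created_at": datetime(2024, 1, 2, 0, 0, tzinfo=timezone.utc),
        "first_review_submitted_at": datetime(2024, 1, 3, 0, 0, tzinfo=timezone.utc),
    },
    {
        "created_at": datetime(2024, 1, 4, 0, 0, tzinfo=timezone.utc),
        "first_review_submitted_at": None,  # never reviewed
    },
]

latencies = [
    hours
    for row in rows
    if (hours := hours_to_first_review(
        row["created_at"], row["first_review_submitted_at"]
    )) is not None
]
median_latency = median(latencies)  # median over reviewed PRs only
```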

Key columns on `repos`

domain (agentic_coding_tools, devtools, ml_ai, security_critical, systems, web), temporal_cohort (pre_ai, ai_assisted_era, agentic_discourse_era, swe_agent_era, agentic_normalized), star_tier (tier1_anchor, tier2_midtier, tier3_longtail), selection_method (purposive, stratified_random), cohort_flags (JSONB with per-inflection-point pre/post markers).

CLI reference

Command	Description
`prism db init`	Initialize (or re-initialize) the database schema
`prism seed --csv-file path/to/repos.csv`	Load the repo tracking list into the `repos` table
`prism seed --yaml-file path/to/repos.yaml`	Load repos from YAML format
`prism collect --domain all`	Collect PR data and repo snapshots for all active repos
`prism collect --domain ml_ai`	Collect for a single domain
`prism collect --repo owner/name`	Collect for a single repository
`prism collect --collector pulls`	Run only the PR collector (skip snapshots)
`prism status`	Print sync status: repo counts by domain, total PRs, snapshots, recent activity

Importing pre-collected data

If you have PR data from a previous collection run (CSV files in example/github_prs/), import it before running prism collect so that incremental collection resumes from last_synced_at instead of re-fetching everything:

python scripts/import_csv_prs.py --csv-dir example/github_prs

Each CSV file is named {owner}__{repo}.csv. The script upserts all PR rows and sets repos.last_synced_at to the max created_at per repo.
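The two conventions above (file naming and the last_synced_at rule) can be sketched as follows. This is not the code in scripts/import_csv_prs.py. It assumes each CSV has a created_at column holding ISO 8601 timestamps with an explicit offset, and both function names are hypothetical.

```python
import csv
import io
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional, TextIO, Tuple


def slug_from_filename(path: str) -> Tuple[str, str]:
    """Split '{owner}__{repo}.csv' into (owner, repo).

    Splits on the first double underscore, so repo names that contain
    underscores or dots (for example 'next.js') survive intact.
    """
    owner, sep, repo = Path(path).stem.partition("__")
    if not sep or not owner or not repo:
        raise ValueError(f"expected '{{owner}}__{{repo}}.csv', got {path!r}")
    return owner, repo


def max_created_at(handle: TextIO) -> Optional[datetime]:
    """Return the latest created_at in a PR CSV, or None if it has no rows."""
    reader = csv.DictReader(handle)
    stamps = [
        datetime.fromisoformat(row["created_at"])
        for row in reader
        if row.get("created_at")
    ]
    return max(stamps, default=None)


# Usage with an in-memory CSV standing in for a real export.
sample = io.StringIO(
    "number,created_at\n"
    "1,2024-01-05T10:00:00+00:00\n"
    "2,2024-03-01T08:30:00+00:00\n"
)
owner, repo = slug_from_filename("example/github_prs/vercel__next.js.csv")
last_synced_at = max_created_at(sample)
```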

Querying the data

Direct SQL (psql)

psql -h localhost -U prism -d prism

-- PR merge throughput by domain, monthly
SELECT r.domain, date_trunc('month', pr.merged_at) AS month, count(*)
FROM pull_requests pr JOIN repos r ON r.id = pr.repo_id
WHERE pr.was_merged = true
GROUP BY r.domain, month
ORDER BY r.domain, month;

-- Bot-authored PR fraction by temporal cohort
SELECT r.temporal_cohort,
       count(*) FILTER (WHERE pr.author_type = 'Bot') AS bot_prs,
       count(*) AS total_prs,
       round(100.0 * count(*) FILTER (WHERE pr.author_type = 'Bot') / count(*), 2) AS bot_pct
FROM pull_requests pr JOIN repos r ON r.id = pr.repo_id
GROUP BY r.temporal_cohort
ORDER BY r.temporal_cohort;

-- Repos with highest review-to-PR ratio
SELECT r.owner || '/' || r.repo AS slug,
       count(*) AS prs,
       round(avg(pr.total_review_count), 1) AS avg_reviews
FROM pull_requests pr JOIN repos r ON r.id = pr.repo_id
WHERE pr.was_merged = true
GROUP BY r.owner, r.repo HAVING count(*) > 50
ORDER BY avg_reviews DESC LIMIT 20;

Pre-built query examples are available in sql/queries/.

DuckDB (columnar analytics)

DuckDB can attach directly to the running Postgres instance via the postgres extension (also known as postgres_scanner), giving you columnar query performance without data export:

import duckdb

con = duckdb.connect()
con.sql("INSTALL postgres; LOAD postgres;")
con.sql("""
    ATTACH 'dbname=prism user=prism password=changeme host=localhost port=5432'
    AS prism (TYPE POSTGRES);
""")

# Now query as if tables were local
df = con.sql("""
    SELECT r.domain, r.temporal_cohort,
           count(*) AS pr_count,
           avg(pr.additions + pr.deletions) AS avg_churn
    FROM prism.pull_requests pr
    JOIN prism.repos r ON r.id = pr.repo_id
    WHERE pr.was_merged = true
    GROUP BY r.domain, r.temporal_cohort
""").df()

print(df)

Python (SQLAlchemy / pandas)

from prism.settings import Settings
from prism.db import get_engine, get_session
from prism.models import PullRequest, Repo

settings = Settings()
engine = get_engine(settings.database_url)
session = get_session(engine)

# ORM query
bot_prs = (
    session.query(PullRequest)
    .filter(PullRequest.author_type == "Bot")
    .limit(10)
    .all()
)

# Or use pandas with raw SQL
import pandas as pd
df = pd.read_sql("SELECT * FROM pull_requests LIMIT 1000", engine)

Configuration

File	Purpose
`.env`	Environment variables (see `.env.example`)
`example/repo_sample_list.csv`	Canonical list of 1,383 tracked repositories with selection metadata
`config/repos.yaml`	Alternative YAML repo list (generated from CSV via `scripts/csv_to_repos_yaml.py`)
`sql/001_init.sql`	Database DDL (tables, indexes, enums)

Environment variables

Settings are loaded via pydantic-settings. Variables prefixed PRISM_ map to collection tuning parameters. Core variables:

Variable	Default	Description
`GITHUB_PAT`	(required)	GitHub personal access token
`POSTGRES_USER`	`prism`	Postgres username
`POSTGRES_PASSWORD`	`changeme`	Postgres password
`POSTGRES_DB`	`prism`	Database name
`POSTGRES_HOST`	`localhost`	Hostname (`db` inside Docker)
`PRISM_LOG_LEVEL`	`INFO`	Logging verbosity
`PRISM_DEFAULT_PAGE_SIZE`	`100`	GraphQL pagination page size
`PRISM_RATE_LIMIT_BUFFER`	`300`	Stop this many requests before GitHub's limit
`PRISM_MAX_RETRIES`	`10`	Max retries on transient failures
`PRISM_REQUEST_TIMEOUT`	`90`	HTTP request timeout (seconds)
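To make the defaults concrete, here is a standard-library sketch of how these variables could be resolved and combined into a connection URL. It is not the project's pydantic-settings class. `resolve_settings` and `database_url` are hypothetical names, the `postgresql://` URL scheme is an assumption, and port 5432 is taken from the DuckDB example above.

```python
import os
from typing import Mapping
from urllib.parse import quote

# Defaults copied from the table above.
DEFAULTS = {
    "POSTGRES_USER": "prism",
    "POSTGRES_PASSWORD": "changeme",
    "POSTGRES_DB": "prism",
    "POSTGRES_HOST": "localhost",
    "PRISM_LOG_LEVEL": "INFO",
    "PRISM_DEFAULT_PAGE_SIZE": "100",
    "PRISM_RATE_LIMIT_BUFFER": "300",
    "PRISM_MAX_RETRIES": "10",
    "PRISM_REQUEST_TIMEOUT": "90",
}
INT_KEYS = (
    "PRISM_DEFAULT_PAGE_SIZE",
    "PRISM_RATE_LIMIT_BUFFER",
    "PRISM_MAX_RETRIES",
    "PRISM_REQUEST_TIMEOUT",
)


def resolve_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Overlay environment values on the defaults; GITHUB_PAT has no default."""
    if not env.get("GITHUB_PAT"):
        raise KeyError("GITHUB_PAT is required")
    settings = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    settings["GITHUB_PAT"] = env["GITHUB_PAT"]
    for key in INT_KEYS:
        settings[key] = int(settings[key])  # numeric tuning parameters
    return settings


def database_url(settings: Mapping[str, object], port: int = 5432) -> str:
    """Build a Postgres URL; credentials are percent-encoded for safety."""
    user = quote(str(settings["POSTGRES_USER"]), safe="")
    password = quote(str(settings["POSTGRES_PASSWORD"]), safe="")
    host = settings["POSTGRES_HOST"]
    name = settings["POSTGRES_DB"]
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"
```

Passing a plain dictionary instead of `os.environ` makes it easy to check what a given `.env` would resolve to, for example with `POSTGRES_HOST` set to `db` for the Docker worker.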

Docker usage

# Postgres only (recommended for local development)
docker compose up db -d

# With pgAdmin for visual inspection
docker compose --profile debug up -d
# pgAdmin at http://localhost:5050 (admin@local.dev / admin)

# Run collection via the Docker worker
docker compose --profile worker run --rm worker prism collect --domain all

Continuous collection

PRISM collects incrementally -- each run picks up where the last left off using repos.last_synced_at. To keep the database current, schedule periodic collection via cron or systemd.

Cron (simplest)

# Edit your crontab
crontab -e

# Run collection every 6 hours, logging to a file
0 */6 * * * cd /path/to/PRISM/prism-collect && /path/to/.venv/bin/prism collect --domain all >> /var/log/prism-collect.log 2>&1

Or if running via Docker:

0 */6 * * * cd /path/to/PRISM/prism-collect && docker compose --profile worker run --rm worker prism collect --domain all >> /var/log/prism-collect.log 2>&1

Systemd timer (recommended for servers)

Create two files:

# /etc/systemd/system/prism-collect.service
[Unit]
Description=PRISM data collection run
Wants=network-online.target
After=network-online.target postgresql.service

[Service]
Type=oneshot
User=prism
WorkingDirectory=/path/to/PRISM/prism-collect
ExecStart=/path/to/.venv/bin/prism collect --domain all
Environment=GITHUB_PAT=ghp_your_token
Environment=POSTGRES_HOST=localhost
StandardOutput=append:/var/log/prism-collect.log
StandardError=append:/var/log/prism-collect.log

# /etc/systemd/system/prism-collect.timer
[Unit]
Description=Run PRISM collection every 6 hours

[Timer]
OnCalendar=*-*-* 00/6:00:00
Persistent=true
RandomizedDelaySec=300

[Install]
WantedBy=timers.target

sudo systemctl daemon-reload
sudo systemctl enable --now prism-collect.timer

# Check status
systemctl status prism-collect.timer
journalctl -u prism-collect.service --since "6 hours ago"

Monitoring collection progress

# Quick status check
prism status

# Watch the sync log for recent failures
psql -h localhost -U prism -d prism -c "
  SELECT r.owner || '/' || r.repo AS repo, s.collector, s.status,
         s.items_collected, s.error, s.finished_at
  FROM sync_log s JOIN repos r ON r.id = s.repo_id
  WHERE s.started_at > now() - interval '24 hours'
  ORDER BY s.finished_at DESC LIMIT 20;
"

Rate limits

PRISM uses a single GitHub PAT. With the default PRISM_RATE_LIMIT_BUFFER=300, it stops 300 requests before hitting the GitHub rate ceiling and sleeps until the reset window. A full collection run across all 1,383 repos typically takes 2--4 hours depending on PR volume. The PRISM_POLITE_SLEEP_SECS=0.5 delay between repos avoids hammering the API.
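The buffer logic can be pictured as a small decision function. The sketch below is an illustration under assumptions, not PRISM's collector code: it takes the `remaining` and `resetAt` values that GitHub's GraphQL `rateLimit` object reports, and it assumes the collector pauses as soon as the remaining budget is at or below the buffer. `seconds_to_wait` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

RATE_LIMIT_BUFFER = 300   # default PRISM_RATE_LIMIT_BUFFER
POLITE_SLEEP_SECS = 0.5   # PRISM_POLITE_SLEEP_SECS from the text


def seconds_to_wait(
    remaining: int,
    reset_at: datetime,
    now: datetime,
    buffer: int = RATE_LIMIT_BUFFER,
) -> float:
    """Return how long to pause before starting the next repo.

    While the remaining budget is above the buffer, only the polite delay
    applies. Once it drops to the buffer or below (assumed threshold), wait
    until the reset time; a reset time in the past means no wait.
    """
    if remaining > buffer:
        return POLITE_SLEEP_SECS
    return max((reset_at - now).total_seconds(), 0.0)


# Usage with fixed timestamps so the result does not depend on the clock.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
reset_at = now + timedelta(minutes=10)

plenty_left = seconds_to_wait(remaining=1000, reset_at=reset_at, now=now)
at_buffer = seconds_to_wait(remaining=300, reset_at=reset_at, now=now)
```

A caller would pass the result to `time.sleep` between repositories.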

Development

uv add <package>          # Add a runtime dependency
uv add --dev <package>    # Add a dev dependency
ruff check prism/ tests/  # Lint
pytest                    # Test

License

AGPL-3.0-or-later. See LICENSE.
