
Commit d8f34e0

Merge pull request #6 from sdsc-ordes/feat-gimie-driven-properties
Feat gimie driven properties
2 parents e0099d5 + c69ee86 commit d8f34e0

49 files changed

Lines changed: 2595 additions & 4271 deletions


.devcontainer/.env.example

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+# Copy to `.devcontainer/.env` for docker-compose variable substitution.
+# Compose reads this file from the `.devcontainer/` directory (not repo-root `.env` for these keys).
+#
+# Host port mappings (optional):
+# SSH_PORT=2220
+# APP_PORT=1234
+# DEV_PORT=8888
+#
+# DNS inside the container (optional; defaults are 1.1.1.1 + 8.8.8.8 in docker-compose.yml).
+# Use your corporate resolvers if public DNS is blocked:
+# DEVCONTAINER_DNS_1=10.0.0.1
+# DEVCONTAINER_DNS_2=10.0.0.2

.devcontainer/devcontainer.json

Lines changed: 21 additions & 21 deletions
@@ -1,33 +1,33 @@
 {
-  "name": "open-pulse-crawler",
-  "build": {
-    "dockerfile": "Dockerfile"
+  "name": "open-pulse-crawler-dev",
+  "dockerComposeFile": "docker-compose.yml",
+  "service": "devcontainer",
+  "workspaceFolder": "/workspaces/project",
+  "containerEnv": {
+    "UV_CACHE_DIR": "/workspaces/project/.uv-cache"
+  },
+  "overrideCommand": false,
+  "features": {
+    "ghcr.io/devcontainers/features/sshd:1": {
+      "version": "latest"
+    }
   },
-  "runArgs": [
-    "--env-file",
-    "${localWorkspaceFolder}/.env",
-    "--network",
-    "dev"
-  ],
-
-  // This is where your repo will be mounted inside the container
   "remoteUser": "vscode",
-  "workspaceFolder": "/workspaces/${localWorkspaceFolderBasename}",
-
   "customizations": {
     "vscode": {
-      "settings": {
-        "python.defaultInterpreterPath": "${workspaceFolder}/.venv/bin/python",
-        "python.envFile": "${workspaceFolder}/.env"
-      },
+      "settings": {
+        "python.defaultInterpreterPath": "${workspaceFolder}/.venv/bin/python",
+        "python.envFile": "${workspaceFolder}/.env"
+      },
       "extensions": [
         "ms-python.python",
         "ms-python.vscode-pylance",
-        "tamasfe.even-better-toml"
+        "tamasfe.even-better-toml",
+        "github.copilot",
+        "github.copilot-chat"
       ]
     }
   },
-
-  // Install project in editable mode after the container is built
-  "postCreateCommand": "rm -rf .venv && uv venv && uv pip install -e .[viz,dev] && echo '. $PWD/.venv/bin/activate' >> /home/vscode/.bashrc"
+  "postCreateCommand": "mkdir -p .uv-cache && rm -rf .venv && uv venv && uv pip install -e .[dev] && echo '. $PWD/.venv/bin/activate' >> /home/vscode/.bashrc",
+  "postStartCommand": "bash .devcontainer/set-vscode-password.sh"
 }

.devcontainer/docker-compose.yml

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
+# Dev container stack. Compose publishes ports on the host (more reliable than
+# devcontainer forwardPorts in some setups). Interpolation vars (SSH_PORT, etc.)
+# can be set in `.devcontainer/.env` (see `.devcontainer/.env.example`).
+#
+# Internal SSH: devcontainers `sshd` feature listens on 2222, not 22 — map host:2222.
+#
+# Explicit DNS: containers on external networks (e.g. `dev`) sometimes get no working resolver
+# and `uv pip` fails with "dns error" / "failed to lookup address information".
+# Override in `.devcontainer/.env`: DEVCONTAINER_DNS_1 / DEVCONTAINER_DNS_2.
+services:
+  devcontainer:
+    build:
+      context: ..
+      dockerfile: .devcontainer/Dockerfile
+    dns:
+      - "${DEVCONTAINER_DNS_1:-1.1.1.1}"
+      - "${DEVCONTAINER_DNS_2:-8.8.8.8}"
+    env_file:
+      - ../.env
+    environment:
+      # Avoid ~/.cache/uv (often root-owned after sshd/common-utils); workspace is bind-mounted as vscode.
+      UV_CACHE_DIR: /workspaces/project/.uv-cache
+    ports:
+      - "${SSH_PORT:-2222}:2222"
+      - "${APP_PORT:-1234}:1234"
+      - "${DEV_PORT:-8888}:8888"
+    volumes:
+      - ..:/workspaces/project:cached
+    command: sleep infinity
+    networks:
+      - dev
+
+networks:
+  dev:
+    external: true
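
The `${SSH_PORT:-2222}` style entries in this compose file rely on Compose's default-value interpolation, where the default is used when the variable is unset or empty. The following is a minimal Python sketch of that one rule, for illustration only. The `resolve` helper is hypothetical, is not part of this repository, and handles only the `:-` form rather than the full Compose syntax.

```python
import re

# Matches only the ${NAME:-default} form; other interpolation forms are out of scope.
_PATTERN = re.compile(r"\$\{(?P<name>[A-Za-z_][A-Za-z0-9_]*):-(?P<default>[^}]*)\}")


def resolve(template: str, env: dict[str, str]) -> str:
    """Substitute ${NAME:-default}, using the default when NAME is unset or empty."""

    def _sub(match: re.Match) -> str:
        value = env.get(match.group("name"), "")
        return value if value else match.group("default")

    return _PATTERN.sub(_sub, template)


if __name__ == "__main__":
    # No override: the host side of the mapping falls back to the default.
    print(resolve("${SSH_PORT:-2222}:2222", {}))
    # Override as it might appear in .devcontainer/.env.
    print(resolve("${SSH_PORT:-2222}:2222", {"SSH_PORT": "2220"}))
```

With `SSH_PORT=2220` set in `.devcontainer/.env`, only the host side of the mapping changes; the container side stays on 2222, the port the file's comments give for the `sshd` feature.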

.devcontainer/set-vscode-password.sh

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+#!/usr/bin/env bash
+# Apply VSCODE_PASSWORD to user vscode at container start (not baked into the image).
+# Set VSCODE_PASSWORD in .env (this repo loads it via the `env_file` entry in .devcontainer/docker-compose.yml).
+set -euo pipefail
+if [[ -z "${VSCODE_PASSWORD:-}" ]]; then
+  exit 0
+fi
+printf 'vscode:%s\n' "$VSCODE_PASSWORD" | sudo chpasswd

.env.dist

Lines changed: 74 additions & 1 deletion
@@ -1,2 +1,75 @@
+# Open Pulse Crawler — environment template
+#
+# Copy to `.env` at the repo root, then either:
+# • API only: `docker compose -f infra/docker-compose.yml --env-file ../.env up -d`
+# → FastAPI on http://localhost:${OPC_API_PORT:-8000}
+# • API + GUI + Nginx: `docker compose -f infra/docker-compose.yml --env-file ../.env --profile gui up -d`
+# → Browser GUI on http://localhost:${OPC_PORT:-80}
+# • or run the API/CLI directly via `uv run` after exporting the same vars.
+#
+# All variables marked REQUIRED must be set; everything else has sensible
+# defaults baked into either the compose file or the application code.
+
+# ----- REQUIRED: API authentication ----------------------------------------
+#
+# Bearer token clients must send to authenticated endpoints.
+# Generate one with:
+# python -c 'import secrets; print(secrets.token_urlsafe(32))'
+API_TOKEN=
+
+# ----- REQUIRED: GitHub access ---------------------------------------------
+#
+# GitHub Personal Access Token used by the BFS crawler AND by the gimie
+# hybrid path when enabled.
+#
+# Comma-separated for multi-token rotation:
+# GITHUB_TOKEN=ghp_aaa…,ghp_bbb…,ghp_ccc…
+#
+# Required scopes:
+# read:org — required by gimie's GraphQL query (org avatarUrl/name/etc).
+# Without it the hybrid gimie path returns INSUFFICIENT_SCOPES.
+# read:user — recommended; gimie reads contributor profiles.
+# public_repo — required for crawling public repos (or `repo` for private).
+#
+# A classic PAT scoped read:org + read:user + public_repo is the simplest
+# choice. Fine-grained PATs work too but need org-level approval.
 GITHUB_TOKEN=
-API_TOKEN=
+
+# ----- Optional: data directory --------------------------------------------
+#
+# Where the API stores per-job artifacts (JSON-LD payloads, archives).
+# Defaults to /tmp/open-pulse-crawler inside the container; change only if
+# you mount a persistent volume.
+# OPC_DATA_DIR=/tmp/open-pulse-crawler
+
+# ----- Optional: GUI password (only relevant with --profile gui) -----------
+#
+# Used by the Streamlit GUI to gate the browser interface. If unset, the
+# GUI shows a visible warning and stays open. Pick anything non-trivial:
+# GUI_PASSWORD=$(python -c 'import secrets; print(secrets.token_urlsafe(16))')
+# GUI_PASSWORD=
+
+# ----- Optional: gimie hybrid (deployment-time, server-side) ---------------
+#
+# When enabled, the crawler enriches repository nodes from a gimie JSON-LD
+# service (users/orgs still come from GitHub via PyGithub). These are server
+# concerns, not per-request — clients of /api/v1/crawl don't see or pass them.
+#
+# Truthy values: true, 1, yes, on (case-insensitive). Anything else: false.
+# GIMIE_ENABLED=false
+# GIMIE_API_BASE=http://host.docker.internal:1234
+# GIMIE_STORE_JSONLD=false
+# GIMIE_SKIP_EXISTING_JSONLD=false
+
+# ----- Optional: compose-only knobs ----------------------------------------
+#
+# These are read by infra/docker-compose.yml, not by the Python code.
+#
+# Override the published image tag (defaults to :latest on GHCR).
+# OPC_IMAGE=ghcr.io/sdsc-ordes/open-pulse-crawler:1.0.0
+#
+# Host port for the FastAPI service (default profile, defaults to 8000).
+# OPC_API_PORT=8000
+#
+# Host port for the Nginx reverse proxy (only with --profile gui, defaults to 80).
+# OPC_PORT=8080
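
The template above states two parsing rules: the gimie flags are truthy only for `true`, `1`, `yes`, `on` (case-insensitive), and `GITHUB_TOKEN` may hold a comma-separated list for rotation. The following is a minimal sketch of how such values could be read. The helper names `env_flag` and `split_tokens` are hypothetical, the whitespace stripping is an assumption, and the round-robin order is only one plausible rotation policy; the project's actual implementation is not shown in this diff.

```python
import os
from itertools import cycle

# Truthy spellings listed in the .env template; everything else is False.
TRUTHY = {"true", "1", "yes", "on"}


def env_flag(name: str, env=os.environ) -> bool:
    """Return True only for true/1/yes/on (case-insensitive); unset counts as False."""
    return env.get(name, "").strip().lower() in TRUTHY


def split_tokens(raw: str) -> list[str]:
    """Split a comma-separated token value, dropping empty entries (assumed behavior)."""
    return [part.strip() for part in raw.split(",") if part.strip()]


if __name__ == "__main__":
    # Placeholder values, not real tokens.
    env = {"GIMIE_ENABLED": "Yes", "GITHUB_TOKEN": "tok_a, tok_b,tok_c"}
    print(env_flag("GIMIE_ENABLED", env))
    # Round-robin rotation is an assumption used here for illustration.
    rotation = cycle(split_tokens(env["GITHUB_TOKEN"]))
    print([next(rotation) for _ in range(4)])
```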

.github/workflows/docs.yaml

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
+name: Docs
+
+on:
+  push:
+    branches: ["main", "develop"]
+    paths:
+      - "docs/**"
+      - "mkdocs.yml"
+      - "pyproject.toml"
+      - ".github/workflows/docs.yaml"
+  pull_request:
+    branches: ["main", "develop"]
+    paths:
+      - "docs/**"
+      - "mkdocs.yml"
+      - "pyproject.toml"
+      - ".github/workflows/docs.yaml"
+  workflow_dispatch:
+
+# Required by actions/deploy-pages.
+permissions:
+  contents: read
+  pages: write
+  id-token: write
+
+# Avoid stomping on each other if pushes land back-to-back.
+concurrency:
+  group: pages
+  cancel-in-progress: false
+
+jobs:
+  build:
+    name: Build MkDocs site
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+        with:
+          fetch-depth: 0 # mkdocs-material uses git for "last updated" metadata
+
+      - name: Set up uv
+        uses: astral-sh/setup-uv@v4
+        with:
+          python-version: "3.12"
+
+      - name: Install docs dependencies
+        run: uv sync --extra docs
+
+      - name: Build site (strict)
+        # --strict turns warnings (broken links, missing nav targets, …) into
+        # errors so we fail fast in CI rather than ship a broken site.
+        run: uv run mkdocs build --strict --site-dir site
+
+      - name: Upload Pages artifact
+        uses: actions/upload-pages-artifact@v3
+        with:
+          path: site
+
+  deploy:
+    name: Deploy to GitHub Pages
+    # Only deploy on pushes to the default branch — PR builds validate but
+    # do not publish. workflow_dispatch on main also publishes.
+    if: github.event_name != 'pull_request' && github.ref == 'refs/heads/main'
+    needs: build
+    runs-on: ubuntu-latest
+    environment:
+      name: github-pages
+      url: ${{ steps.deployment.outputs.page_url }}
+    steps:
+      - name: Deploy to GitHub Pages
+        id: deployment
+        uses: actions/deploy-pages@v4
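
The `deploy` job's `if:` expression decides which runs publish. As a reading aid, here is the same condition as a small Python predicate. The function name `should_deploy` is hypothetical, and the sample refs are illustrative values rather than output from a real workflow run.

```python
def should_deploy(event_name: str, ref: str) -> bool:
    """Mirror of: github.event_name != 'pull_request' && github.ref == 'refs/heads/main'."""
    return event_name != "pull_request" and ref == "refs/heads/main"


if __name__ == "__main__":
    # Illustrative (event, ref) pairs; the PR number is made up.
    samples = [
        ("push", "refs/heads/main"),
        ("push", "refs/heads/develop"),
        ("pull_request", "refs/pull/7/merge"),
        ("workflow_dispatch", "refs/heads/main"),
    ]
    for event, ref in samples:
        print(event, ref, should_deploy(event, ref))
```

Pushes to `develop` and all pull requests still run the `build` job with `--strict`; only runs on `main` that are not pull requests reach `actions/deploy-pages`.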

CHANGELOG.md

Lines changed: 26 additions & 0 deletions
@@ -9,6 +9,22 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Added
 
+- `--max-contributors` skip rule (CLI flag, REST API field on `POST /api/v1/crawl`, Streamlit GUI input under "Performance & filtering", and constructor argument on `GitHubCrawler`). When set, repos with more than N contributors stay in the graph but their contributor users are not queued for BFS expansion — owner / fork / dependency / dependent edges are unaffected. Useful for avoiding mega-projects (linux kernel, etc.) that would otherwise dominate the frontier.
+- `RepoModel.contributor_count` and `RepoModel.skipped_high_contributors` fields. The total contributor count is captured the first time a repo is fetched (one cheap `per_page=1` request via `repo.get_contributors().totalCount`) and cached alongside the existing `repo:{full_name}` entry, so repeat crawls re-apply the skip rule with zero additional API calls.
+- `GitHubClient.get_contributor_count(repo_full_name)` helper — cache-first lookup with a single live fallback on miss; write-throughs the count so the next call hits cache.
+- Streamlit GUI brought to parity with the REST API: form now exposes `crawl_dependencies`, `crawl_dependents`, `min_stars`, `max_dependents`, `max_contributors`, and `batch_size` (under collapsible "Dependency crawling" and "Performance & filtering" expanders, with sensible defaults that don't surface the params in the request body unless changed). Results panel now shows live progress (current round, nodes processed, queue size, ETA) for active jobs, includes pause / resume / cancel / delete controls gated by current status, and offers a graph download button once the job is `completed`. API errors are surfaced with their `detail` message instead of a generic HTTP error string.
+- MkDocs documentation site (Material theme) at `mkdocs.yml`, with a `docs` extra group in `pyproject.toml` (`mkdocs`, `mkdocs-material`, `pymdown-extensions`). Rendered Mermaid diagrams via `pymdownx.superfences`: an architecture diagram on the home page, a job-lifecycle state diagram and a request-flow sequence diagram in `docs/API.md`, and a Compose-stack flow diagram in `docs/DEPLOYMENT.md`. Build verified locally with `mkdocs build --strict`.
+- `docs/index.md` landing page introducing the project and linking into the existing docs.
+- GitHub Action `.github/workflows/docs.yaml` that builds the site on push / PR to `main` / `develop` (with `--strict`) and deploys to GitHub Pages from `main` via `actions/deploy-pages`. **Requires Pages source = "GitHub Actions"** in repo Settings → Pages.
+- Job lifecycle endpoints in the REST API:
+  - `GET /api/v1/jobs` — list every job in the in-memory registry, newest first, with optional `status_filter`.
+  - `POST /api/v1/crawl/{job_id}/pause` — pause at the next BFS round boundary.
+  - `POST /api/v1/crawl/{job_id}/resume` — resume a paused job.
+  - `POST /api/v1/crawl/{job_id}/cancel` — cooperative cancel; partial graph is preserved.
+  - `DELETE /api/v1/crawl/{job_id}` — drop a terminal job from the registry.
+- Live progress fields on `GET /api/v1/crawl/{job_id}` for running jobs: `started_at`, `completed_at`, `current_round`, `nodes_processed`, `nodes_in_queue`, and a best-effort `estimated_completion_at`.
+- `paused` and `cancelled` job-status values.
+- Server-side gimie configuration via env vars: `GIMIE_ENABLED`, `GIMIE_API_BASE`, `GIMIE_STORE_JSONLD`, `GIMIE_SKIP_EXISTING_JSONLD`. When enabled, each crawled repo is enriched via a sibling git-metadata-extractor service and (optionally) JSON-LD payloads are persisted under `${OPC_DATA_DIR}/<job_id>/jsonld/`.
 - Nginx reverse-proxy container (`infra/nginx/Dockerfile` + `infra/nginx/nginx.conf`): routes `/api/*` to FastAPI (port 8000) and `/` to Streamlit (port 8501) with WebSocket upgrade support.
 - Streamlit placeholder GUI (`src/open_pulse_crawler/gui.py`) with token input sidebar, crawl form (seeds + BFS rounds), and live job-status results area.
 - `streamlit` and `httpx` added to project dependencies.
@@ -29,16 +45,26 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - `fastapi` and `uvicorn[standard]` added to project dependencies; `httpx` added to dev dependencies.
 - API test suite (`tests/test_api.py`).
 - API documentation (`docs/API.md`).
+- Optional gimie JSON-LD hybrid repository fetching (initially as a `gimie_repos` request flag; superseded by env-var configuration — see `Added` above).
 - Docker Compose stack (`infra/docker-compose.yml`) for `api`, `gui`, and `nginx` services on a shared network with env-file configuration and per-service health checks.
 - End-to-end Docker integration test script (`tests/test_integration.sh`) validating health, auth behavior, and GUI/API routing through Nginx.
 
 ### Changed
 
+- Fixed `infra/docker-compose.yml` nginx healthcheck: replaced `wget http://localhost/api/v1/health` with `wget http://127.0.0.1/api/v1/health`. Inside the Alpine container, `localhost` resolves to `::1` (IPv6) but `infra/nginx/nginx.conf` only declares `listen 80;` (IPv4-only), so the probe was getting "Connection refused" while external access worked fine. Stack went from `unhealthy` to healthy in 7s after recreate.
+- Crawl export filenames: timestamp first, then kind — e.g. `YYYYMMDDHHMMSS.graph.json`, `YYYYMMDDHHMMSS.edges.csv`, `YYYYMMDDHHMMSS.nodes.csv`, `YYYYMMDDHHMMSS.graph.png`, directory `YYYYMMDDHHMMSS.clusters/`. Incremental round folders are `YYYYMMDDHHMMSS.round_NN/` with the same inner naming.
+- Gimie JSON-LD: on success (HTTP 2xx) or when using an existing payload file with skip-existing, remove matching `jsonld_errors/<repo>.*.json` files for that repository.
+- Gimie JSON-LD: log a short preview of the HTTP error response body when the gimie endpoint returns non-2xx (in addition to optional `jsonld_errors/` files).
+- Gimie JSON-LD: `force_refresh=true` is always sent on HTTP fetches (not a CLI/API flag). `--gimie-skip-existing-jsonld` only checks for existing files under the crawl `jsonld/` output directory.
+- Renamed `jsonld/*.json` export/processing filenames to include timestamp as `name.timestamp.json` (timestamp derived from crawl export time or filesystem metadata by the provided rename script).
 - Expanded deployment docs in `docs/DEPLOYMENT.md` with Docker Compose setup, environment configuration, health verification, and integration test usage.
 - Updated `README.md` with dedicated REST API and Docker/GUI quick-start sections and links to deployment/API docs.
 - Completed API docs (`docs/API.md`) with reverse-proxy base URL notes and practical `curl` examples for crawl/status/graph flows.
 - Extended `POST /api/v1/crawl` to accept CLI-aligned crawl controls: `crawl_dependencies`, `crawl_dependents`, `min_stars`, `max_dependents`, `batch_size`, and inline `epfl_entities`.
 - Moved Docker infrastructure files into `infra/` and updated commands/scripts to use `docker compose -f infra/docker-compose.yml ...`.
 - Updated compose services to pull the app container from `ghcr.io/sdsc-ordes/open-pulse-crawler:latest` by default (`OPC_IMAGE` override supported).
+- Gimie hybrid extraction is now configured via environment variables (`GIMIE_ENABLED`, `GIMIE_API_BASE`, `GIMIE_STORE_JSONLD`, `GIMIE_SKIP_EXISTING_JSONLD`), not the per-request `gimie_repos` flag. Operators decide whether the gimie path is on; clients submitting crawls don't need to know.
+- Cleaned up `docs/`: removed completion-report markdown (`*_COMPLETE`, `*_FIX_SUMMARY`, `*_IMPLEMENTATION`, `IMPROVEMENTS_SUMMARY`, `PLAN_*`, `QUICK_REFERENCE`, etc.) and the duplicate copies of files that already live under `docs/dev/dependency-graph/`. The remaining doc set is `API.md`, `DEPLOYMENT.md`, `CONCURRENCY.md`, `PROGRESS_TRACKING.md`, `TIMESTAMPS.md`, `VISUALIZATION.md`, plus `docs/dev/`. README's broken `RATE_LIMITING.md` link now points at `docs/CONCURRENCY.md`.
+- Refreshed `docs/API.md` to match the current API surface (job list / pause / resume / cancel / delete, live progress fields with ETA, env-driven gimie config) and corrected the Dockerfile path in `docs/DEPLOYMENT.md` (`tools/image/Dockerfile`).
 
 [Unreleased]: https://github.com/sdsc-ordes/open-pulse-crawler/compare/v0.1.0...HEAD
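
The changelog entry on crawl export filenames describes a timestamp-first naming scheme. Below is a minimal sketch of that scheme, assuming the stamp is a plain `strftime("%Y%m%d%H%M%S")` rendering and that `NN` in `round_NN` means a zero-padded two-digit round number. The helper names `export_names` and `round_dir` are hypothetical and are not taken from the crawler's code.

```python
from datetime import datetime

STAMP_FORMAT = "%Y%m%d%H%M%S"  # renders as YYYYMMDDHHMMSS


def export_names(ts: datetime) -> dict[str, str]:
    """Build the timestamp-first export names listed in the changelog."""
    stamp = ts.strftime(STAMP_FORMAT)
    return {
        "graph": f"{stamp}.graph.json",
        "edges": f"{stamp}.edges.csv",
        "nodes": f"{stamp}.nodes.csv",
        "image": f"{stamp}.graph.png",
        "clusters_dir": f"{stamp}.clusters/",
    }


def round_dir(ts: datetime, round_number: int) -> str:
    """Name of an incremental round folder; two-digit zero padding is an assumption."""
    return f"{ts.strftime(STAMP_FORMAT)}.round_{round_number:02d}/"


if __name__ == "__main__":
    when = datetime(2025, 1, 2, 3, 4, 5)  # arbitrary example time
    for kind, name in export_names(when).items():
        print(kind, name)
    print(round_dir(when, 3))
```

Putting the timestamp first means a plain lexicographic directory listing groups all artifacts from one export together and orders exports chronologically.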

README.md

Lines changed: 2 additions & 1 deletion
@@ -225,13 +225,14 @@ open-pulse-crawler crawl DeepLabCut/DeepLabCut \
 - `--crawl-dependents`: Crawl upstream dependents ("Used by")
 - `--min-stars`: Minimum stars for filtering dependents/dependencies (default: 0)
 - `--max-dependents`: Maximum number of dependents to fetch (default: all)
+- `--max-contributors`: Skip contributor expansion for repos with more than N contributors. The repo node still lands in the graph (with owner / fork / deps); only its contributors are not queued. Useful for avoiding mega-projects (e.g. linux kernel) that would dominate the BFS frontier. The total count is cached, so this is roughly free on re-crawls. Default: unlimited.
 
 #### Rate Limiting Options (New!)
 - `--request-delay`: Minimum delay in seconds between API requests (default: 0.0)
 - `--max-concurrent`: Maximum number of concurrent API requests (default: 5)
 - `--rate-limit-buffer`: Buffer of requests to keep before waiting (default: 50)
 
-See [RATE_LIMITING.md](./RATE_LIMITING.md) for detailed guide on rate limiting and API management.
+See [docs/CONCURRENCY.md](./docs/CONCURRENCY.md) for the detailed guide on concurrency, rate limiting, and multi-token rotation.
 
 ## Output Formats
 
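
The `--max-contributors` rule combines two ideas from this commit: a cache-first contributor count with a single live fallback, and a skip decision that leaves the repo in the graph but does not queue its contributors. The following sketch shows one way these could fit together. `Repo`, `CountCache`, and `should_expand_contributors` are hypothetical stand-ins, not the project's `RepoModel`, `GitHubClient`, or `GitHubCrawler`; the live fetch is injected as a plain callable so the example runs without network access.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Repo:
    """Hypothetical stand-in for a repo node; field names follow the changelog."""
    full_name: str
    contributor_count: Optional[int] = None
    skipped_high_contributors: bool = False


class CountCache:
    """Cache-first contributor count with a single live fallback on a miss."""

    def __init__(self, fetch_live: Callable[[str], int]) -> None:
        self._fetch_live = fetch_live
        self._counts: dict[str, int] = {}

    def get(self, full_name: str) -> int:
        if full_name not in self._counts:
            # Miss: fetch once and store, so later calls are served from cache.
            self._counts[full_name] = self._fetch_live(full_name)
        return self._counts[full_name]


def should_expand_contributors(
    repo: Repo, cache: CountCache, max_contributors: Optional[int]
) -> bool:
    """Return False when the repo has more than N contributors; None means unlimited."""
    repo.contributor_count = cache.get(repo.full_name)
    repo.skipped_high_contributors = (
        max_contributors is not None and repo.contributor_count > max_contributors
    )
    return not repo.skipped_high_contributors


if __name__ == "__main__":
    # Made-up repo names and counts, for illustration only.
    fake_counts = {"big/project": 5000, "small/lib": 3}
    cache = CountCache(lambda name: fake_counts[name])
    for name in fake_counts:
        repo = Repo(name)
        print(name, should_expand_contributors(repo, cache, max_contributors=100))
```

In this sketch the skip flag is recorded on the node itself, so a later pass can tell a repo whose contributors were deliberately skipped from one that simply has none.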
