All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- `--max-contributors` skip rule (CLI flag, REST API field on `POST /api/v1/crawl`, Streamlit GUI input under "Performance & filtering", and constructor argument on `GitHubCrawler`). When set, repos with more than N contributors stay in the graph, but their contributor users are not queued for BFS expansion; owner / fork / dependency / dependent edges are unaffected. Useful for avoiding mega-projects (the Linux kernel, etc.) that would otherwise dominate the frontier.
- `RepoModel.contributor_count` and `RepoModel.skipped_high_contributors` fields. The total contributor count is captured the first time a repo is fetched (one cheap `per_page=1` request via `repo.get_contributors().totalCount`) and cached alongside the existing `repo:{full_name}` entry, so repeat crawls re-apply the skip rule with zero additional API calls.
- `GitHubClient.get_contributor_count(repo_full_name)` helper: cache-first lookup with a single live fallback on a miss; the count is written through to the cache so the next call hits it.
- Streamlit GUI brought to parity with the REST API: form now exposes `crawl_dependencies`, `crawl_dependents`, `min_stars`, `max_dependents`, `max_contributors`, and `batch_size` (under collapsible "Dependency crawling" and "Performance & filtering" expanders, with sensible defaults that don't surface the params in the request body unless changed). Results panel now shows live progress (current round, nodes processed, queue size, ETA) for active jobs, includes pause / resume / cancel / delete controls gated by current status, and offers a graph download button once the job is `completed`. API errors are surfaced with their `detail` message instead of a generic HTTP error string.
- MkDocs documentation site (Material theme) at `mkdocs.yml`, with a `docs` extra group in `pyproject.toml` (`mkdocs`, `mkdocs-material`, `pymdown-extensions`). Rendered Mermaid diagrams via `pymdownx.superfences`: an architecture diagram on the home page, a job-lifecycle state diagram and a request-flow sequence diagram in `docs/API.md`, and a Compose-stack flow diagram in `docs/DEPLOYMENT.md`. Build verified locally with `mkdocs build --strict`.
- `docs/index.md` landing page introducing the project and linking into the existing docs.
- GitHub Action `.github/workflows/docs.yaml` that builds the site on push / PR to `main` / `develop` (with `--strict`) and deploys to GitHub Pages from `main` via `actions/deploy-pages`. Requires Pages source = "GitHub Actions" in repo Settings → Pages.
- Job lifecycle endpoints in the REST API:
  - `GET /api/v1/jobs`: list every job in the in-memory registry, newest first, with optional `status_filter`.
  - `POST /api/v1/crawl/{job_id}/pause`: pause at the next BFS round boundary.
  - `POST /api/v1/crawl/{job_id}/resume`: resume a paused job.
  - `POST /api/v1/crawl/{job_id}/cancel`: cooperative cancel; partial graph is preserved.
  - `DELETE /api/v1/crawl/{job_id}`: drop a terminal job from the registry.
- Live progress fields on `GET /api/v1/crawl/{job_id}` for running jobs: `started_at`, `completed_at`, `current_round`, `nodes_processed`, `nodes_in_queue`, and a best-effort `estimated_completion_at`.
- `paused` and `cancelled` job-status values.
- Server-side gimie configuration via env vars: `GIMIE_ENABLED`, `GIMIE_API_BASE`, `GIMIE_STORE_JSONLD`, `GIMIE_SKIP_EXISTING_JSONLD`. When enabled, each crawled repo is enriched via a sibling git-metadata-extractor service and (optionally) JSON-LD payloads are persisted under `${OPC_DATA_DIR}/<job_id>/jsonld/`.
- Nginx reverse-proxy container (`infra/nginx/Dockerfile` + `infra/nginx/nginx.conf`): routes `/api/*` to FastAPI (port 8000) and `/` to Streamlit (port 8501) with WebSocket upgrade support.
- Streamlit placeholder GUI (`src/open_pulse_crawler/gui.py`) with token input sidebar, crawl form (seeds + BFS rounds), and live job-status results area.
- `streamlit` and `httpx` added to project dependencies.
- Root `Dockerfile`: multi-stage build (uv builder + python:3.12-slim runtime), exposes port 8000, runs uvicorn as non-root user.
- `.dockerignore` to keep production images lean.
- Dockerfile structure and API entrypoint tests (`tests/test_dockerfile.py`).
- Deployment documentation (`docs/DEPLOYMENT.md`).
- `AGENTS.md` with contributor and AI-agent guidelines.
- `CHANGELOG.md` following Keep a Changelog format.
- Bearer-token auth module (`src/open_pulse_crawler/auth.py`) using `HTTPBearer` + `secrets.compare_digest` against `API_TOKEN` env var.
- Auth test suite (`tests/test_auth.py`).
- `API_TOKEN` variable added to `.env.dist`.
- FastAPI REST API (`src/open_pulse_crawler/api.py`) with endpoints:
  - `GET /api/v1/health`: public health check.
  - `POST /api/v1/crawl`: start a background crawl job (Bearer auth).
  - `GET /api/v1/crawl/{job_id}`: job status and summary.
  - `GET /api/v1/graph/{job_id}`: full graph data for completed jobs.
- `fastapi` and `uvicorn[standard]` added to project dependencies; `httpx` added to dev dependencies.
- API test suite (`tests/test_api.py`).
- API documentation (`docs/API.md`).
- Optional gimie JSON-LD hybrid repository fetching (initially as a `gimie_repos` request flag; superseded by env-var configuration, see Added above).
- Docker Compose stack (`infra/docker-compose.yml`) for `api`, `gui`, and `nginx` services on a shared network with env-file configuration and per-service health checks.
- End-to-end Docker integration test script (`tests/test_integration.sh`) validating health, auth behavior, and GUI/API routing through Nginx.
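
To illustrate the `--max-contributors` skip rule listed above, here is a minimal sketch of the frontier decision. The `Repo` dataclass, `should_expand_contributors`, and `enqueue_contributors` are hypothetical names made up for this example and are not the crawler's actual API. Only the behaviour comes from the entry: a repo with more than N contributors stays in the graph, and its contributors are not queued.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional


@dataclass
class Repo:
    """Hypothetical stand-in for a crawled repo node (not the real RepoModel)."""
    full_name: str
    contributor_count: int
    skipped_high_contributors: bool = False


def should_expand_contributors(repo: Repo, max_contributors: Optional[int]) -> bool:
    """Return True when the repo's contributors may be queued for BFS expansion."""
    if max_contributors is None:  # rule not set: always expand
        return True
    # "more than N contributors" triggers the skip, so N itself still expands
    return repo.contributor_count <= max_contributors


def enqueue_contributors(
    repo: Repo,
    contributors: List[str],
    frontier: Deque[str],
    max_contributors: Optional[int],
) -> None:
    """Queue contributor users unless the skip rule applies; the repo node is kept either way."""
    if not should_expand_contributors(repo, max_contributors):
        repo.skipped_high_contributors = True  # flag the node, leave it in the graph
        return
    frontier.extend(contributors)


frontier: Deque[str] = deque()
small = Repo("octo/small", contributor_count=3)
huge = Repo("octo/huge", contributor_count=5000)
enqueue_contributors(small, ["alice", "bob", "carol"], frontier, max_contributors=100)
enqueue_contributors(huge, ["dave"], frontier, max_contributors=100)
```

In this sketch only the contributor edge type is gated; owner, fork, dependency, and dependent expansion would go through their own paths, matching the note that those edges are unaffected.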
- Fixed `infra/docker-compose.yml` nginx healthcheck: replaced `wget http://localhost/api/v1/health` with `wget http://127.0.0.1/api/v1/health`. Inside the Alpine container, `localhost` resolves to `::1` (IPv6) but `infra/nginx/nginx.conf` only declares `listen 80;` (IPv4-only), so the probe was getting "Connection refused" while external access worked fine. Stack went from `unhealthy` to healthy in 7s after recreate.
- Crawl export filenames: timestamp first, then kind, e.g. `YYYYMMDDHHMMSS.graph.json`, `YYYYMMDDHHMMSS.edges.csv`, `YYYYMMDDHHMMSS.nodes.csv`, `YYYYMMDDHHMMSS.graph.png`, directory `YYYYMMDDHHMMSS.clusters/`. Incremental round folders are `YYYYMMDDHHMMSS.round_NN/` with the same inner naming.
- Gimie JSON-LD: on success (HTTP 2xx) or when using an existing payload file with skip-existing, remove matching `jsonld_errors/<repo>.*.json` files for that repository.
- Gimie JSON-LD: log a short preview of the HTTP error response body when the gimie endpoint returns non-2xx (in addition to optional `jsonld_errors/` files).
- Gimie JSON-LD: `force_refresh=true` is always sent on HTTP fetches (not a CLI/API flag). `--gimie-skip-existing-jsonld` only checks for existing files under the crawl `jsonld/` output directory.
- Renamed `jsonld/*.json` export/processing filenames to include timestamp as `name.timestamp.json` (timestamp derived from crawl export time or filesystem metadata by the provided rename script).
- Expanded deployment docs in `docs/DEPLOYMENT.md` with Docker Compose setup, environment configuration, health verification, and integration test usage.
- Updated `README.md` with dedicated REST API and Docker/GUI quick-start sections and links to deployment/API docs.
- Completed API docs (`docs/API.md`) with reverse-proxy base URL notes and practical `curl` examples for crawl/status/graph flows.
- Extended `POST /api/v1/crawl` to accept CLI-aligned crawl controls: `crawl_dependencies`, `crawl_dependents`, `min_stars`, `max_dependents`, `batch_size`, and inline `epfl_entities`.
- Moved Docker infrastructure files into `infra/` and updated commands/scripts to use `docker compose -f infra/docker-compose.yml ...`.
- Updated compose services to pull the app container from `ghcr.io/sdsc-ordes/open-pulse-crawler:latest` by default (`OPC_IMAGE` override supported).
- Gimie hybrid extraction is now configured via environment variables (`GIMIE_ENABLED`, `GIMIE_API_BASE`, `GIMIE_STORE_JSONLD`, `GIMIE_SKIP_EXISTING_JSONLD`), not the per-request `gimie_repos` flag. Operators decide whether the gimie path is on; clients submitting crawls don't need to know.
- Cleaned up `docs/`: removed completion-report markdown (`*_COMPLETE`, `*_FIX_SUMMARY`, `*_IMPLEMENTATION`, `IMPROVEMENTS_SUMMARY`, `PLAN_*`, `QUICK_REFERENCE`, etc.) and the duplicate copies of files that already live under `docs/dev/dependency-graph/`. The remaining doc set is `API.md`, `DEPLOYMENT.md`, `CONCURRENCY.md`, `PROGRESS_TRACKING.md`, `TIMESTAMPS.md`, `VISUALIZATION.md`, plus `docs/dev/`. README's broken `RATE_LIMITING.md` link now points at `docs/CONCURRENCY.md`.
- Refreshed `docs/API.md` to match the current API surface (job list / pause / resume / cancel / delete, live progress fields with ETA, env-driven gimie config) and corrected the Dockerfile path in `docs/DEPLOYMENT.md` (`tools/image/Dockerfile`).
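
The timestamp-first export naming described in the list above can be sketched as follows. `export_stamp`, `export_name`, and `round_dir` are hypothetical helpers written for this illustration, not functions from the project. The two-digit zero padding for `round_NN` and the use of a UTC timestamp are assumptions, since the entry only gives the `YYYYMMDDHHMMSS.<kind>` pattern.

```python
from datetime import datetime, timezone


def export_stamp(when: datetime) -> str:
    """Format a datetime as the YYYYMMDDHHMMSS prefix used in export names."""
    return when.strftime("%Y%m%d%H%M%S")


def export_name(when: datetime, kind: str) -> str:
    """Build '<timestamp>.<kind>', e.g. kind='graph.json' or 'edges.csv'."""
    return f"{export_stamp(when)}.{kind}"


def round_dir(when: datetime, round_index: int) -> str:
    """Build the incremental round folder name (two-digit padding is an assumption)."""
    return f"{export_stamp(when)}.round_{round_index:02d}/"


# Example crawl export time (UTC chosen here only for the illustration)
when = datetime(2025, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
names = [export_name(when, kind) for kind in ("graph.json", "edges.csv", "nodes.csv", "graph.png")]
clusters = export_name(when, "clusters/")
```

Because the fixed-width timestamp comes first, a plain lexicographic sort of the output directory keeps each crawl's files together and in chronological order.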