Feat gimie driven properties #6
… related enhancements
- Added new parameters to the API and CLI for enabling gimie JSON-LD fetching, including `gimie_repos`, `gimie_api_base`, `gimie_store_jsonld`, `gimie_skip_existing_jsonld`, and `gimie_archive_on_download`.
- Updated the crawl job to handle JSON-LD data, including storing payloads and creating a zip archive for downloaded data.
- Enhanced error handling and logging for HTTP responses from the gimie API.
- Updated export filename formats to include timestamps and improved directory structure for crawl outputs.
- Added tests to validate the new functionality and ensure proper integration with the existing crawler logic.
…ment setup
- Added `.env.example` for environment variable configuration in the development container.
- Updated `devcontainer.json` to use Docker Compose, specifying service and environment settings.
- Created `docker-compose.yml` to define the development container stack, including port mappings and DNS settings.
- Introduced `set-vscode-password.sh` script to set the VSCode user password at container start based on `.env` configuration.
- Expanded `.env.dist` with detailed instructions and required variables for API authentication and GitHub access.
- Updated `docker-compose.yml` to support profiles for API and GUI, including port mappings and health checks.
- Refactored API and CLI to remove deprecated `epfl_entities` parameter and streamline gimie hybrid configuration.
- Introduced new job summary and response models to track job progress and timing.
- Enhanced GUI with password protection and improved session management.
- Updated tests to reflect changes in API parameters and validate new functionality.
…nagement
- Added MkDocs configuration and Material theme for project documentation, including a new landing page and structured navigation.
- Implemented GitHub Actions workflow for automatic documentation builds and deployment to GitHub Pages.
- Expanded REST API with new job lifecycle endpoints for managing jobs (pause, resume, cancel) and retrieving live progress metrics.
- Updated API documentation to reflect new endpoints and job status values, along with improved diagrams for job lifecycle and request flow.
- Refined environment variable configuration for gimie hybrid extraction, moving from request-based to server-side settings.
- Cleaned up and refreshed existing documentation, removing outdated files and correcting links.
- Introduced `--max-contributors` option in CLI, REST API, and Streamlit GUI to skip contributor expansion for repositories exceeding a specified contributor count, while retaining the repo node in the graph.
- Added `RepoModel.contributor_count` and `RepoModel.skipped_high_contributors` fields to track contributor counts and skipped repos.
- Implemented `GitHubClient.get_contributor_count(repo_full_name)` for efficient contributor count retrieval with caching.
- Updated Streamlit GUI to reflect new parameters and provide live job progress metrics, including controls for job management.
- Fixed nginx healthcheck in `docker-compose.yml` to ensure proper service health monitoring.
- Enhanced API documentation to include new parameters and usage examples.
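For readers skimming the commit message, the `--max-contributors` behaviour described above could look roughly like the sketch below. This is not the PR's code: the `RepoModel` here is a stripped-down stand-in carrying only the two fields named in the commit, `should_expand_contributors` is a hypothetical helper, and `get_count` stands in for `GitHubClient.get_contributor_count`.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RepoModel:
    # Stand-in model: only the fields named in the commit message are included.
    full_name: str
    contributor_count: Optional[int] = None
    skipped_high_contributors: bool = False


def should_expand_contributors(
    repo: RepoModel,
    get_count: Callable[[str], int],  # stand-in for GitHubClient.get_contributor_count
    max_contributors: Optional[int],
) -> bool:
    """Hypothetical helper: decide whether to expand a repo's contributors.

    The repo node itself is always kept; only the expansion is skipped.
    """
    if max_contributors is None:
        # No limit configured, so every repo is expanded.
        return True
    repo.contributor_count = get_count(repo.full_name)
    # "Exceeding" the limit is read here as strictly greater than it.
    repo.skipped_high_contributors = repo.contributor_count > max_contributors
    return not repo.skipped_high_contributors
```

Recording the count and the skip flag on the model, rather than dropping the repo, is what lets the graph keep the node while omitting its contributor edges.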
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit c69ee86. Configure here.
# DELETE endpoint so we don't drop a still-active job out from under itself.
_TERMINAL_STATES = {
    JobStatus.COMPLETED, JobStatus.FAILED, JobStatus.CANCELLED, JobStatus.PENDING,
}
Deleting a PENDING job crashes the background task
High Severity
_TERMINAL_STATES includes JobStatus.PENDING, which allows the DELETE endpoint to remove a job that already has a background task scheduled via BackgroundTasks.add_task. If the delete occurs before _run_crawl begins executing, it will crash with a KeyError on record = _jobs[job_id] because the entry was removed from _jobs. PENDING is not truly terminal — the background task is about to start.
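One possible shape for a fix, as a minimal sketch rather than the repository's actual code: drop `PENDING` from the terminal set and make the background task tolerate a missing record. The enum string values, the dict-based `jobs` store, and the function names below are assumptions made for illustration.

```python
from enum import Enum


class JobStatus(str, Enum):
    # String values are illustrative; the real enum may differ.
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


# PENDING is deliberately excluded: a background task is already scheduled for it.
TERMINAL_STATES = {JobStatus.COMPLETED, JobStatus.FAILED, JobStatus.CANCELLED}

# Hypothetical in-memory job store, keyed by job id.
jobs: dict[str, dict] = {}


def delete_job(job_id: str) -> bool:
    """Remove a job only when no worker can still pick it up."""
    record = jobs.get(job_id)
    if record is None or record["status"] not in TERMINAL_STATES:
        return False
    del jobs[job_id]
    return True


def run_crawl(job_id: str) -> None:
    """Background task entry point; returns quietly if the record is gone."""
    record = jobs.get(job_id)
    if record is None:
        return
    record["status"] = JobStatus.RUNNING
```

Either change alone would avoid the `KeyError`; doing both keeps the task safe even if another code path removes the record later.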
# CANCELLED when the loop actually exits, so callers see the
# transition rather than a phantom.
record.detail = "cancellation requested"
return JobActionResponse(job_id=job_id, status=record.status, detail=record.detail)
Cancelling a PENDING job has no effect
Medium Severity
The cancel endpoint accepts PENDING jobs (line 585), but when the job is pending, record.crawler is None — so no cancel_requested flag is set. _run_crawl then creates a fresh crawler with cancel_requested = False, runs the crawl to completion, and finishes with status=COMPLETED while detail still says "cancellation requested". The cancellation is silently ignored.
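A hedged sketch of one way to make the cancellation stick: store the request on the job record itself, so it survives until the crawler exists, and check it before the crawl starts. The `JobRecord.cancel_requested` field, the simplified `Crawler`, and the plain-string statuses are hypothetical stand-ins, not the PR's types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Crawler:
    # Simplified stand-in for the real crawler.
    cancel_requested: bool = False

    def run(self) -> str:
        return "cancelled" if self.cancel_requested else "completed"


@dataclass
class JobRecord:
    status: str = "pending"
    detail: str = ""
    crawler: Optional[Crawler] = None
    # Hypothetical field: remembers a cancel issued before the crawler exists.
    cancel_requested: bool = False


def cancel(record: JobRecord) -> None:
    """Record the request on the job, and forward it if a crawler is live."""
    record.cancel_requested = True
    record.detail = "cancellation requested"
    if record.crawler is not None:
        record.crawler.cancel_requested = True


def run_crawl(record: JobRecord) -> None:
    """Honour a cancel that arrived while the job was still pending."""
    if record.cancel_requested:
        record.status = "cancelled"
        return
    record.crawler = Crawler()
    record.status = "running"
    record.status = record.crawler.run()
```

With this shape a pending job that is cancelled never constructs a crawler, so the final status and the detail message agree.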


Note
Medium Risk
Medium risk: expands the REST API surface with pause/resume/cancel/delete semantics and live progress/ETA derived from in-memory crawler state, which could affect long-running job behavior and concurrency. Also shifts gimie integration to server-side env configuration, so misconfiguration could change crawl outputs or JSON-LD persistence paths.
Overview
Adds job lifecycle management to the FastAPI service: new `GET /api/v1/jobs` listing plus `pause`/`resume`/`cancel`/`DELETE` endpoints for crawl jobs, and extends `GET /api/v1/crawl/{job_id}` with live progress fields (round, processed/queue counts) and a best-effort ETA.

Introduces a new crawl control, `max_contributors`, wired through the CLI and `POST /api/v1/crawl` to skip contributor expansion for very large repos while keeping repo nodes/other edges.

Switches gimie JSON-LD enrichment to be configured server-side via `GIMIE_*` env vars in the API (including optional JSON-LD persistence under `${OPC_DATA_DIR}/<job_id>/jsonld`), while the CLI gains corresponding `--gimie-*` flags.

Adds a MkDocs Material documentation site (`mkdocs.yml`, `docs/index.md`) with a GitHub Pages workflow, refreshes API/deployment docs, and cleans out a large set of obsolete `docs/*_SUMMARY`/`*_COMPLETE`/`*_IMPLEMENTATION` files.

Updates Docker/dev ergonomics: `infra/docker-compose.yml` publishes the API port by default, gates `gui`/`nginx` behind a `gui` profile, fixes nginx healthcheck IPv6 localhost issue, and adds a docker-compose-based devcontainer setup (with DNS overrides and optional vscode password).