Feat gimie driven properties #6

Merged
caviri merged 5 commits into develop from feat-gimie-driven-properties on May 6, 2026

Conversation

@caviri (Member) commented May 6, 2026

Note

Medium Risk
Medium risk: expands the REST API surface with pause/resume/cancel/delete semantics and live progress/ETA derived from in-memory crawler state, which could affect long-running job behavior and concurrency. Also shifts gimie integration to server-side env configuration, so misconfiguration could change crawl outputs or JSON-LD persistence paths.

Overview
Adds job lifecycle management to the FastAPI service: new GET /api/v1/jobs listing plus pause/resume/cancel/DELETE endpoints for crawl jobs, and extends GET /api/v1/crawl/{job_id} with live progress fields (round, processed/queue counts) and a best-effort ETA.
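The PR does not show how the best-effort ETA is computed, so the following is only a minimal sketch of one plausible approach: average time per processed item multiplied by the number of items still queued. The function name `estimate_eta_seconds` and its parameters are hypothetical and not taken from the codebase.

```python
from typing import Optional


def estimate_eta_seconds(
    elapsed_seconds: float, processed: int, queued: int
) -> Optional[float]:
    """Hypothetical best-effort ETA from live progress counters.

    Returns None when there is not yet enough data to estimate.
    """
    if processed <= 0 or elapsed_seconds <= 0:
        return None  # nothing processed yet, so no rate to extrapolate from
    seconds_per_item = elapsed_seconds / processed
    return seconds_per_item * queued
```

Because a crawl queue can grow while the job runs, an estimate of this kind is likely to undershoot early in a job, which is consistent with describing it as best-effort.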

Introduces a new crawl control, max_contributors, wired through the CLI and POST /api/v1/crawl to skip contributor expansion for very large repos while keeping repo nodes/other edges.
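As an illustration of the `max_contributors` behavior described above, here is a minimal sketch. The field names `contributor_count` and `skipped_high_contributors` come from the commit messages in this PR; the `RepoNode` class and `should_expand_contributors` helper are hypothetical stand-ins, and the strict "greater than" comparison is an assumption based on the wording "exceeding a specified contributor count".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RepoNode:
    """Hypothetical stand-in for the crawler's repo model."""
    full_name: str
    contributor_count: int
    skipped_high_contributors: bool = False


def should_expand_contributors(
    repo: RepoNode, max_contributors: Optional[int]
) -> bool:
    """Return False (and mark the repo) when it exceeds the limit.

    The repo node itself stays in the graph; only contributor
    expansion is skipped.
    """
    if max_contributors is None:
        return True  # no limit configured
    if repo.contributor_count > max_contributors:
        repo.skipped_high_contributors = True
        return False
    return True
```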

Switches gimie JSON-LD enrichment to be configured server-side via GIMIE_* env vars in the API (including optional JSON-LD persistence under ${OPC_DATA_DIR}/<job_id>/jsonld), while the CLI gains corresponding --gimie-* flags.
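To make the persistence path concrete, the sketch below resolves `${OPC_DATA_DIR}/<job_id>/jsonld` from the environment. `OPC_DATA_DIR` is named in this PR; the variable `GIMIE_STORE_JSONLD`, the accepted truthy values, and the `./data` fallback are assumptions that merely follow the `GIMIE_*` pattern and may not match the actual configuration.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def jsonld_dir_for_job(
    job_id: str, env: Mapping[str, str] = os.environ
) -> Optional[Path]:
    """Return the JSON-LD directory for a job, or None if storage is off."""
    # GIMIE_STORE_JSONLD is an assumed name, not confirmed by the PR.
    flag = env.get("GIMIE_STORE_JSONLD", "").strip().lower()
    if flag not in {"1", "true", "yes"}:
        return None
    # The "./data" default is an assumption for this sketch.
    data_dir = env.get("OPC_DATA_DIR", "./data")
    return Path(data_dir) / job_id / "jsonld"
```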

Adds a MkDocs Material documentation site (mkdocs.yml, docs/index.md) with a GitHub Pages workflow, refreshes API/deployment docs, and cleans out a large set of obsolete docs/*_SUMMARY/*_COMPLETE/*_IMPLEMENTATION files.

Updates Docker/dev ergonomics: infra/docker-compose.yml publishes the API port by default, gates gui/nginx behind a gui profile, fixes an nginx healthcheck IPv6 localhost issue, and adds a docker-compose-based devcontainer setup (with DNS overrides and an optional VS Code password).

Reviewed by Cursor Bugbot for commit c69ee86. Bugbot is set up for automated code reviews on this repo.

caviri added 5 commits May 2, 2026 14:25
… related enhancements

- Added new parameters to the API and CLI for enabling gimie JSON-LD fetching, including `gimie_repos`, `gimie_api_base`, `gimie_store_jsonld`, `gimie_skip_existing_jsonld`, and `gimie_archive_on_download`.
- Updated the crawl job to handle JSON-LD data, including storing payloads and creating a zip archive for downloaded data.
- Enhanced error handling and logging for HTTP responses from the gimie API.
- Updated export filename formats to include timestamps and improved directory structure for crawl outputs.
- Added tests to validate the new functionality and ensure proper integration with the existing crawler logic.
…ment setup

- Added `.env.example` for environment variable configuration in the development container.
- Updated `devcontainer.json` to use Docker Compose, specifying service and environment settings.
- Created `docker-compose.yml` to define the development container stack, including port mappings and DNS settings.
- Introduced `set-vscode-password.sh` script to set the VSCode user password at container start based on `.env` configuration.
- Expanded `.env.dist` with detailed instructions and required variables for API authentication and GitHub access.
- Updated `docker-compose.yml` to support profiles for API and GUI, including port mappings and health checks.
- Refactored API and CLI to remove deprecated `epfl_entities` parameter and streamline gimie hybrid configuration.
- Introduced new job summary and response models to track job progress and timing.
- Enhanced GUI with password protection and improved session management.
- Updated tests to reflect changes in API parameters and validate new functionality.
…nagement

- Added MkDocs configuration and Material theme for project documentation, including a new landing page and structured navigation.
- Implemented GitHub Actions workflow for automatic documentation builds and deployment to GitHub Pages.
- Expanded REST API with new job lifecycle endpoints for managing jobs (pause, resume, cancel) and retrieving live progress metrics.
- Updated API documentation to reflect new endpoints and job status values, along with improved diagrams for job lifecycle and request flow.
- Refined environment variable configuration for gimie hybrid extraction, moving from request-based to server-side settings.
- Cleaned up and refreshed existing documentation, removing outdated files and correcting links.
- Introduced `--max-contributors` option in CLI, REST API, and Streamlit GUI to skip contributor expansion for repositories exceeding a specified contributor count, while retaining the repo node in the graph.
- Added `RepoModel.contributor_count` and `RepoModel.skipped_high_contributors` fields to track contributor counts and skipped repos.
- Implemented `GitHubClient.get_contributor_count(repo_full_name)` for efficient contributor count retrieval with caching.
- Updated Streamlit GUI to reflect new parameters and provide live job progress metrics, including controls for job management.
- Fixed nginx healthcheck in `docker-compose.yml` to ensure proper service health monitoring.
- Enhanced API documentation to include new parameters and usage examples.
@caviri caviri merged commit d8f34e0 into develop May 6, 2026
5 checks passed
@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

# DELETE endpoint so we don't drop a still-active job out from under itself.
_TERMINAL_STATES = {
    JobStatus.COMPLETED, JobStatus.FAILED, JobStatus.CANCELLED, JobStatus.PENDING,
}

Deleting a PENDING job crashes the background task

High Severity

_TERMINAL_STATES includes JobStatus.PENDING, which allows the DELETE endpoint to remove a job that already has a background task scheduled via BackgroundTasks.add_task. If the delete occurs before _run_crawl begins executing, it will crash with a KeyError on record = _jobs[job_id] because the entry was removed from _jobs. PENDING is not truly terminal — the background task is about to start.


# CANCELLED when the loop actually exits, so callers see the
# transition rather than a phantom.
record.detail = "cancellation requested"
return JobActionResponse(job_id=job_id, status=record.status, detail=record.detail)

Cancelling a PENDING job has no effect

Medium Severity

The cancel endpoint accepts PENDING jobs (line 585), but when the job is pending, record.crawler is None — so no cancel_requested flag is set. _run_crawl then creates a fresh crawler with cancel_requested = False, runs the crawl to completion, and finishes with status=COMPLETED while detail still says "cancellation requested". The cancellation is silently ignored.
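A minimal sketch of one way to honor cancellation for a job that has not started yet: store the request on the job record itself and check it before the crawler is created. The `JobRecord` shape, the record-level `cancel_requested` flag, and both helper functions are hypothetical; the actual record and `_run_crawl` logic are not shown in this PR excerpt.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobRecord:
    """Hypothetical simplified job record."""
    status: str = "pending"
    detail: Optional[str] = None
    cancel_requested: bool = False  # assumed flag; survives crawler == None


def request_cancel(record: JobRecord) -> None:
    """Record the request even when no crawler exists yet."""
    record.cancel_requested = True
    record.detail = "cancellation requested"


def start_crawl(record: JobRecord) -> bool:
    """Check the flag before doing any work; return True if the crawl runs."""
    if record.cancel_requested:
        record.status = "cancelled"
        record.detail = "cancelled before start"
        return False
    record.status = "running"
    return True
```

With a flag on the record, the final status and the detail message stay consistent: a job cancelled while pending ends as cancelled rather than completed.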


@caviri caviri deleted the feat-gimie-driven-properties branch May 6, 2026 21:25