Releases: Imaging-Plaza/git-metadata-extractor

Release v2.0.1

16 Feb 22:32
f2751af

[2.0.1] - 2026-02-16

Added

  • Documentation and CI for GitHub Pages

Changed

  • Bumped project version to 2.0.1.
  • Updated API version metadata and root welcome message to v2.0.1.

Release v2.0.0

16 Feb 20:40
885b5ea

[2.0.0] - 2026-02-16

Added

  • Project restructuring for improved maintainability and modularity:
    • Reorganized the monolithic src/core/ directory into categorized subdirectories under src/:
      • src/agents/ - PydanticAI agents for organization and user enrichment
      • src/cache/ - Caching infrastructure and SQLite cache manager
      • src/data_models/ - Pydantic models and schemas (Person, Organization, SoftwareSourceCode, etc.)
      • src/gimie/ - GIMIE integration methods for repository metadata extraction
      • src/llm/ - LLM processing and GenAI model wrapper
      • src/parsers/ - Organization and user parsers for structured data extraction
      • src/validation/ - Verification and validation logic
    • Created proper __init__.py files with explicit exports for all modules
    • Improved import paths throughout the codebase (e.g., from src.agents import... instead of from src.core.organization_enrichment import...)
    • Enhanced code organization and discoverability
  • SQLite-based caching system for external API calls (GitHub, ORCID, GIMIE, LLM)
    • Automatic TTL (Time To Live) expiration with configurable settings per API type
    • Default TTL: 30 days (LLM), 7 days (GitHub users/orgs), 14 days (ORCID), 1 day (GIMIE)
    • Thread-safe operations for concurrent access
    • JSON storage for complex API responses
  • Force refresh capability via force_refresh query parameter on all data endpoints
  • Cache management endpoints:
    • GET /v1/cache/stats - View comprehensive cache statistics
    • POST /v1/cache/cleanup - Remove expired cache entries
    • POST /v1/cache/clear - Clear all cache entries
    • POST /v1/cache/enable - Enable caching system
    • POST /v1/cache/disable - Disable caching system
    • DELETE /v1/cache/invalidate/{api_type} - Invalidate specific cache entries
  • Environment-based cache configuration:
    • CACHE_ENABLED - Enable/disable caching
    • CACHE_DEFAULT_TTL_DAYS - Default TTL in days
    • CACHE_DB_PATH - Custom database location
    • API-specific TTL overrides (e.g., CACHE_GITHUB_USER_TTL_DAYS)
    • Cache size and cleanup settings
  • Enhanced FastAPI documentation:
    • Comprehensive API metadata (title, description, version, contact, license)
    • Detailed endpoint docstrings with parameter and return descriptions
    • Organized API endpoints with tags (Repository, User, Organization, Cache Management, System)
    • OpenAPI schema improvements for better interactive documentation
  • Cache statistics and monitoring:
    • Total entries and active/expired counts
    • Entries breakdown by API type
    • Hit counts for cache effectiveness analysis
    • Database size reporting
  • Performance benefits:
    • Up to 90% reduction in external API requests
    • Faster response times with instant cache retrieval
    • Rate limit protection for GitHub/ORCID APIs
    • Cost savings on LLM API calls
  • ORCID affiliation enrichment:
    • Automatic extraction of ORCID IDs from author metadata
    • Selenium-based scraping of ORCID profiles for employment and education history
    • Smart affiliation merging that preserves existing affiliations and adds ORCID data
    • Support for both Zod format (schema:author, md4i:orcidId) and plain format (author, orcidId)
    • Integration with both main extraction and LLM JSON endpoints
  • Enhanced logging system:
    • Comprehensive logging for ORCID enrichment process
    • Detailed error handling and debugging information
    • Cache operation logging for monitoring and troubleshooting
    • Selenium operation logging for ORCID scraping
  • GPT-5 model support - Full support for GPT-5 and reasoning models
    • Support for GPT-5, GPT-5 variants (gpt-5-mini, gpt-5-nano), o3-mini, and o4-mini models
    • Proper model detection logic to handle GPT-5 and reasoning models
    • Uses beta.chat.completions.parse() with structured outputs for all models
    • Lazy initialization for async OpenAI client to prevent API key issues at module load
    • Comprehensive error logging with error type and detailed debugging information
    • Retry logic with exponential backoff for handling connection errors
    • Unified response parsing for all OpenAI models using .parsed attribute
  • Organization Enrichment System using PydanticAI for agentic analysis
    • Second-pass analysis to refine and enrich organization information
    • PydanticAI agent with intelligent tool usage for:
      • ROR (Research Organization Registry) API queries for standardized org data
      • Web search integration (DuckDuckGo) for additional context
      • Email domain analysis for institutional affiliation detection
    • Enhanced Organization model with new fields:
      • alternateNames - Other names the organization is known by
      • organizationType - Type classification (university, lab, company, etc.)
      • parentOrganization - Parent organization for hierarchical relationships
      • country - Country location
      • website - Official website URL
    • Optional enrich_orgs=true parameter on existing /v1/repository/llm/json endpoint
      • Non-breaking change - enrichment only runs when explicitly requested
      • Analyzes git author emails, ORCID affiliations, and existing metadata
      • Provides detailed EPFL relationship analysis with evidence
      • Graceful error handling - errors don't break the main request
    • Comprehensive documentation in docs/ORGANIZATION_ENRICHMENT.md
    • Example script: examples/example_organization_enrichment.py
    • Test suite: tests/test_organization_enrichment.py
  • Organization enrichment for User and Organization endpoints
    • Added enrich_orgs=true query parameter to /v1/user/llm/json/{full_path:path} endpoint
    • Added enrich_orgs=true query parameter to /v1/org/llm/json/{full_path:path} endpoint
    • Both endpoints now support ROR (Research Organization Registry) enrichment
    • Consistent enrichment functionality across repository, user, and organization endpoints
    • Enhanced organization metadata with ROR IDs, types, countries, websites, and hierarchical relationships
    • Detailed EPFL relationship analysis for user and organization profiles
  • Git commit temporal tracking:
    • Added Commits model with firstCommitDate and lastCommitDate fields per author
    • Enhanced extract_git_authors() to extract first and last commit dates using git log
    • Dates stored in ISO format (YYYY-MM-DD) for consistency
    • JSON-LD context mappings added for imag:firstCommitDate and imag:lastCommitDate
  • Organization confidence scoring system:
    • Added confidenceOfAttribution field to Organization model (0.0-1.0 scale)
    • Added relatedToEPFLConfidence field to OrganizationEnrichmentResult model
    • Enhanced PydanticAI agent with detailed confidence scoring guidelines:
      • 0.9-1.0: Strong evidence (verified affiliations, official emails, ORCID data)
      • 0.7-0.89: Good evidence (domain match, indirect affiliation)
      • 0.5-0.69: Moderate evidence (collaborations, shared projects)
      • 0.3-0.49: Weak evidence (geographical proximity, field similarity)
      • 0.0-0.29: Minimal or no evidence
    • Confidence assessment considers temporal alignment between commit dates and affiliation dates
    • JSON-LD context mapping for imag:confidenceOfAttribution
  • ORCID parser overhaul - Complete rewrite for reliability and data completeness:
    • Fixed employment extraction to parse line-by-line text content instead of unreliable HTML containers
    • Fixed education extraction with same line-by-line parsing approach
    • Enhanced date extraction to support multiple formats:
      • Full dates: YYYY-MM-DD to YYYY-MM-DD
      • Year ranges: YYYY to YYYY
      • Ongoing: YYYY-MM-DD to present
    • Fixed role extraction to recognize ORCID's | separator format (e.g., "Institut Pasteur | PhD Student")
    • Fixed degree extraction for education entries (MSc, BSc, PhD, etc.)
    • Enhanced duration calculation to handle full date formats with decimal precision (e.g., 3.2 years)
    • Fixed location parsing to eliminate double commas and clean formatting
    • All fields now reliably extracted: dates, roles, degrees, locations, durations
    • Validated with real ORCID profiles (e.g., 0000-0002-1126-1535)
  • Dependencies: Added httpx for async HTTP requests in organization enrichment
  • Docker volume mounting for persistent cache storage:
    • Support for mounting ./data directory to /app/data in container
    • Environment variable CACHE_DB_PATH for custom cache database location
    • Enables cache persistence across container restarts
  • Environment-based log level configuration:
    • Added LOG_LEVEL environment variable support (DEBUG, INFO, WARNING, ERROR)
    • Allows dynamic logging configuration without code changes
    • New serve-dev-debug justfile recipe for easy debug mode startup
    • Enhanced subprocess logging with full stderr/stdout output (no truncation)
  • Enhanced debugging capabilities for repository processing:
    • Comprehensive debug logging for git clone operations with directory contents
    • Full error output from repo-to-text subprocess (complete tracebacks)
    • Directory existence checks and file listing for troubleshooting
    • Detailed diagnostics when no .txt files are found after repo-to-text
  • ORCID validation and normalization:
    • Added normalize_orcid_to_url() function to convert ORCID IDs to standard URL format
    • ORCID validation now accepts both ID format (0000-0002-1234-5678) and URL format (https://orcid.org/0000-0002-1234-5678)
    • Automatic normalization to URL format before enrichment and scraping
    • Enhanced validation in both scraping flow and enrichment flow
  • Auto-enrichment flag for conditional ORCID enrichment:
    • Added auto_enrich_orcid query parameter (default: true) to repository endpoints
    • All...
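
The SQLite cache with per-API TTLs and the force_refresh bypass described in this release can be pictured with a minimal sketch. This is not the project's cache manager: the class name TTLCache, the table layout, and the API-type keys (llm, github_user, github_org, orcid, gimie) are assumptions made for illustration. Only the default TTL values come from the notes above.

```python
import json
import sqlite3
import threading
import time

# Default TTLs in days, as listed in the release notes.
# The dictionary keys are hypothetical API-type names.
DEFAULT_TTL_DAYS = {
    "llm": 30,
    "github_user": 7,
    "github_org": 7,
    "orcid": 14,
    "gimie": 1,
}


class TTLCache:
    """Hypothetical minimal TTL cache; not the project's actual implementation."""

    def __init__(self, db_path=":memory:"):
        # One shared connection guarded by a lock for thread-safe access.
        self._lock = threading.Lock()
        self._conn = sqlite3.connect(db_path, check_same_thread=False)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "api_type TEXT, key TEXT, value TEXT, expires_at REAL, "
            "PRIMARY KEY (api_type, key))"
        )

    def set(self, api_type, key, value, now=None):
        """Store a JSON-serializable value with the TTL of its API type."""
        now = time.time() if now is None else now
        ttl_seconds = DEFAULT_TTL_DAYS[api_type] * 86400
        with self._lock:
            self._conn.execute(
                "INSERT OR REPLACE INTO cache VALUES (?, ?, ?, ?)",
                (api_type, key, json.dumps(value), now + ttl_seconds),
            )
            self._conn.commit()

    def get(self, api_type, key, force_refresh=False, now=None):
        """Return the cached value, or None if missing, expired, or bypassed."""
        if force_refresh:
            return None
        now = time.time() if now is None else now
        with self._lock:
            row = self._conn.execute(
                "SELECT value FROM cache "
                "WHERE api_type = ? AND key = ? AND expires_at > ?",
                (api_type, key, now),
            ).fetchone()
        return json.loads(row[0]) if row else None

    def cleanup(self, now=None):
        """Delete expired entries and return how many were removed."""
        now = time.time() if now is None else now
        with self._lock:
            cur = self._conn.execute(
                "DELETE FROM cache WHERE expires_at <= ?", (now,)
            )
            self._conn.commit()
        return cur.rowcount
```

In this sketch a caller would try get() first, fall back to the external API on None, and then call set(). Passing force_refresh=True skips the lookup so the fresh response overwrites the stored one.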
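
The confidence scoring guidelines listed for confidenceOfAttribution can be read as a lookup from score to evidence band. The sketch below is illustrative only: the function name confidence_band and the short labels are hypothetical, and it assumes each band starts at its lower bound, so a score such as 0.895 falls into the "good" band.

```python
# Lower bound of each band, highest first, following the release notes.
# The labels are shorthand chosen for this sketch.
BANDS = [
    (0.9, "strong"),
    (0.7, "good"),
    (0.5, "moderate"),
    (0.3, "weak"),
    (0.0, "minimal"),
]


def confidence_band(score: float) -> str:
    """Map a 0.0-1.0 confidence score to its evidence band (hypothetical helper)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be within [0.0, 1.0], got {score}")
    for lower_bound, label in BANDS:
        if score >= lower_bound:
            return label
    # Unreachable for valid scores, kept so the function always returns or raises.
    raise ValueError(f"no band found for {score}")
```

A consumer of the API could use such a mapping to filter organizations, for example by keeping only attributions in the "good" band or above.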

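The ORCID validation and normalization item says both the bare ID and the URL form are accepted and converted to the URL form. The release notes name normalize_orcid_to_url() but do not show its body, so the following is a sketch of one way such a function might behave, not the project's code. The regular expression and the ValueError on invalid input are assumptions.

```python
import re

ORCID_URL_PREFIX = "https://orcid.org/"

# Four groups of four characters; the final character may be the checksum "X".
ORCID_ID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def normalize_orcid_to_url(value: str) -> str:
    """Return the URL form of an ORCID iD given either the ID or the URL form.

    Hypothetical implementation: raises ValueError for anything else.
    """
    candidate = value.strip()
    if candidate.startswith(ORCID_URL_PREFIX):
        candidate = candidate[len(ORCID_URL_PREFIX):]
    if not ORCID_ID_PATTERN.match(candidate):
        raise ValueError(f"Not a valid ORCID iD: {value!r}")
    return ORCID_URL_PREFIX + candidate
```

Normalizing once, before enrichment and scraping, means downstream code only has to handle a single format.
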
Release v1.0.0

06 Aug 13:40
6207a80

Changelog

All notable changes to this project will be documented in this file.

[1.0.0] - 2025-08-06

Added

  • Users and Organization compatibility
  • Endpoints refactoring
  • Parallel calling
  • Multi-worker entrypoint

[0.1.0] - 2025-06-25

Added

  • Initial project setup.
  • Dockerfile for containerization.
  • GitHub Actions workflow for automated publishing and releases.

Release v0.1.0

25 Jun 14:37

[0.1.0] - 2025-06-25

Added

  • Initial project setup.
  • Dockerfile for containerization.
  • GitHub Actions workflow for automated publishing and releases.