Releases: Imaging-Plaza/git-metadata-extractor
Release v2.0.1
[2.0.1] - 2026-02-16
Added
- Documentation and CI for github-pages
Changed
- Bumped project version to `2.0.1`.
- Updated API version metadata and root welcome message to `v2.0.1`.
Release v2.0.0
[2.0.0] - 2026-02-16
Added
- Project restructuring for improved maintainability and modularity:
- Reorganized `src/core/` monolithic directory into categorized subdirectories under `src/`:
  - `src/agents/` - PydanticAI agents for organization and user enrichment
  - `src/cache/` - Caching infrastructure and SQLite cache manager
  - `src/data_models/` - Pydantic models and schemas (Person, Organization, SoftwareSourceCode, etc.)
  - `src/gimie/` - GIMIE integration methods for repository metadata extraction
  - `src/llm/` - LLM processing and GenAI model wrapper
  - `src/parsers/` - Organization and user parsers for structured data extraction
  - `src/validation/` - Verification and validation logic
- Created proper `__init__.py` files with explicit exports for all modules
- Improved import paths throughout the codebase (e.g., `from src.agents import ...` instead of `from src.core.organization_enrichment import ...`)
- Enhanced code organization and discoverability
- SQLite-based caching system for external API calls (GitHub, ORCID, GIMIE, LLM)
- Automatic TTL (Time To Live) expiration with configurable settings per API type
- Default TTL: 30 days (LLM), 7 days (GitHub users/orgs), 14 days (ORCID), 1 day (GIMIE)
- Thread-safe operations for concurrent access
- JSON storage for complex API responses
- Force refresh capability via `force_refresh` query parameter on all data endpoints
- Cache management endpoints:
  - `GET /v1/cache/stats` - View comprehensive cache statistics
  - `POST /v1/cache/cleanup` - Remove expired cache entries
  - `POST /v1/cache/clear` - Clear all cache entries
  - `POST /v1/cache/enable` - Enable caching system
  - `POST /v1/cache/disable` - Disable caching system
  - `DELETE /v1/cache/invalidate/{api_type}` - Invalidate specific cache entries
- Environment-based cache configuration:
  - `CACHE_ENABLED` - Enable/disable caching
  - `CACHE_DEFAULT_TTL_DAYS` - Default TTL in days
  - `CACHE_DB_PATH` - Custom database location
  - API-specific TTL overrides (e.g., `CACHE_GITHUB_USER_TTL_DAYS`)
  - Cache size and cleanup settings
- Enhanced FastAPI documentation:
- Comprehensive API metadata (title, description, version, contact, license)
- Detailed endpoint docstrings with parameter and return descriptions
- Organized API endpoints with tags (Repository, User, Organization, Cache Management, System)
- OpenAPI schema improvements for better interactive documentation
- Cache statistics and monitoring:
- Total entries and active/expired counts
- Entries breakdown by API type
- Hit counts for cache effectiveness analysis
- Database size reporting
- Performance benefits:
- Up to 90% reduction in external API requests
- Faster response times with instant cache retrieval
- Rate limit protection for GitHub/ORCID APIs
- Cost savings on LLM API calls
- ORCID affiliation enrichment:
- Automatic extraction of ORCID IDs from author metadata
- Selenium-based scraping of ORCID profiles for employment and education history
- Smart affiliation merging that preserves existing affiliations and adds ORCID data
- Support for both Zod format (`schema:author`, `md4i:orcidId`) and plain format (`author`, `orcidId`)
- Integration with both main extraction and LLM JSON endpoints
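
As an illustration of handling both key styles, the sketch below collects ORCID IDs from author metadata. The function name `extract_orcid_ids`, the assumed payload shape (a list of author objects under `schema:author` or `author`), and the `{"@id": ...}` wrapper handling are guesses for the example, not the project's actual code.

```python
import re

# Four groups of four characters; the final character may be the checksum "X".
ORCID_RE = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def extract_orcid_ids(metadata):
    """Collect unique ORCID IDs from authors in either key style."""
    authors = metadata.get("schema:author") or metadata.get("author") or []
    if isinstance(authors, dict):  # a single author given as one object
        authors = [authors]
    found = []
    for author in authors:
        if not isinstance(author, dict):
            continue
        raw = author.get("md4i:orcidId") or author.get("orcidId")
        # Some payloads may wrap the value as {"@id": ...}; this is an assumption.
        if isinstance(raw, dict):
            raw = raw.get("@id")
        match = ORCID_RE.search(raw) if isinstance(raw, str) else None
        if match and match.group(0) not in found:
            found.append(match.group(0))
    return found
```

Searching with a regular expression rather than comparing whole strings means the same code accepts bare IDs and full `https://orcid.org/...` URLs.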
- Enhanced logging system:
- Comprehensive logging for ORCID enrichment process
- Detailed error handling and debugging information
- Cache operation logging for monitoring and troubleshooting
- Selenium operation logging for ORCID scraping
- GPT-5 model support - Full support for GPT-5 and reasoning models
- Support for GPT-5, GPT-5 variants (gpt-5-mini, gpt-5-nano), o3-mini, and o4-mini models
- Proper model detection logic to handle GPT-5 and reasoning models
- Uses `beta.chat.completions.parse()` with structured outputs for all models
- Lazy initialization for async OpenAI client to prevent API key issues at module load
- Comprehensive error logging with error type and detailed debugging information
- Retry logic with exponential backoff for handling connection errors
- Unified response parsing for all OpenAI models using `.parsed` attribute
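
The retry and lazy-initialization items above follow a common pattern, sketched here without the OpenAI client so it stays self-contained. The helper names `retry_with_backoff` and `get_client`, the attempt count, the delays, and the use of `ConnectionError` as the retried exception are assumptions for illustration; the release notes do not say which values or exception types the project uses.

```python
import asyncio
import random


async def retry_with_backoff(call, max_attempts=4, base_delay=0.5, sleep=asyncio.sleep):
    """Await call() and retry on ConnectionError with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return await call()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The delay doubles on each attempt (0.5s, 1s, 2s, ...) plus a little jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            await sleep(delay)


_client = None


def get_client(factory):
    """Create the client on first use instead of at import time."""
    global _client
    if _client is None:
        _client = factory()
    return _client
```

Deferring client construction to the first call means a missing API key fails the request that needs it rather than the import of the module.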
- Organization Enrichment System using PydanticAI for agentic analysis
- Second-pass analysis to refine and enrich organization information
- PydanticAI agent with intelligent tool usage for:
  - ROR (Research Organization Registry) API queries for standardized org data
  - Web search integration (DuckDuckGo) for additional context
  - Email domain analysis for institutional affiliation detection
- Enhanced `Organization` model with new fields:
  - `alternateNames` - Other names the organization is known by
  - `organizationType` - Type classification (university, lab, company, etc.)
  - `parentOrganization` - Parent organization for hierarchical relationships
  - `country` - Country location
  - `website` - Official website URL
- Optional `enrich_orgs=true` parameter on existing `/v1/repository/llm/json` endpoint
  - Non-breaking change - enrichment only runs when explicitly requested
- Analyzes git author emails, ORCID affiliations, and existing metadata
- Provides detailed EPFL relationship analysis with evidence
- Graceful error handling - errors don't break the main request
- Comprehensive documentation in `docs/ORGANIZATION_ENRICHMENT.md`
- Example script: `examples/example_organization_enrichment.py`
- Test suite: `tests/test_organization_enrichment.py`
- Organization enrichment for User and Organization endpoints
- Added `enrich_orgs=true` query parameter to `/v1/user/llm/json/{full_path:path}` endpoint
- Added `enrich_orgs=true` query parameter to `/v1/org/llm/json/{full_path:path}` endpoint
- Both endpoints now support ROR (Research Organization Registry) enrichment
- Consistent enrichment functionality across repository, user, and organization endpoints
- Enhanced organization metadata with ROR IDs, types, countries, websites, and hierarchical relationships
- Detailed EPFL relationship analysis for user and organization profiles
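
For orientation, a request URL for these endpoints could be assembled as below. The base URL, the helper name `build_llm_json_url`, and the idea that `full_path` carries a full profile URL are assumptions; only the endpoint paths and the `enrich_orgs` and `force_refresh` parameters come from the notes above.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:8000"  # assumption: wherever the API is served


def build_llm_json_url(kind, full_path, enrich_orgs=False, force_refresh=False):
    """Build a URL for the user or org LLM JSON endpoint."""
    if kind not in ("user", "org"):
        raise ValueError("kind must be 'user' or 'org'")
    params = {}
    if enrich_orgs:
        params["enrich_orgs"] = "true"
    if force_refresh:
        params["force_refresh"] = "true"
    # Keep "/" and ":" unescaped so a nested URL stays readable in the path.
    url = f"{BASE_URL}/v1/{kind}/llm/json/{quote(full_path, safe='/:')}"
    return f"{url}?{urlencode(params)}" if params else url
```

Because enrichment is opt-in, leaving `enrich_orgs` at its default produces the same URL a pre-2.0.0 client would have used.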
- Git commit temporal tracking:
- Added `Commits` model with `firstCommitDate` and `lastCommitDate` fields per author
- Enhanced `extract_git_authors()` to extract first and last commit dates using git log
- Dates stored in ISO format (YYYY-MM-DD) for consistency
- JSON-LD context mappings added for `imag:firstCommitDate` and `imag:lastCommitDate`
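
One way to derive first and last commit dates per author from `git log` is sketched below. It is not the project's `extract_git_authors()`: the helper names, the `email|date` output format, and keying by author email are assumptions chosen for the example.

```python
import subprocess


def parse_commit_dates(log_output):
    """Map author email -> first/last commit date from 'email|YYYY-MM-DD' lines."""
    dates = {}
    for line in log_output.splitlines():
        email, sep, day = line.strip().partition("|")
        if not sep or not day:
            continue  # skip blank or malformed lines
        entry = dates.setdefault(email, {"firstCommitDate": day, "lastCommitDate": day})
        # ISO dates sort lexicographically, so string comparison is enough.
        entry["firstCommitDate"] = min(entry["firstCommitDate"], day)
        entry["lastCommitDate"] = max(entry["lastCommitDate"], day)
    return dates


def git_author_dates(repo_path):
    """Run git log in repo_path and summarize commit dates per author email."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--format=%ae|%ad", "--date=short"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_commit_dates(out)
```

Keeping the parsing separate from the subprocess call makes the date logic testable without a repository on disk.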
- Organization confidence scoring system:
- Added `confidenceOfAttribution` field to `Organization` model (0.0-1.0 scale)
- Added `relatedToEPFLConfidence` field to `OrganizationEnrichmentResult` model
- Enhanced PydanticAI agent with detailed confidence scoring guidelines:
  - 0.9-1.0: Strong evidence (verified affiliations, official emails, ORCID data)
  - 0.7-0.89: Good evidence (domain match, indirect affiliation)
  - 0.5-0.69: Moderate evidence (collaborations, shared projects)
  - 0.3-0.49: Weak evidence (geographical proximity, field similarity)
  - 0.0-0.29: Minimal or no evidence
- Confidence assessment considers temporal alignment between commit dates and affiliation dates
- JSON-LD context mapping for `imag:confidenceOfAttribution`
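
The scoring bands above can be read as a simple threshold lookup. The sketch below shows how a consumer of `confidenceOfAttribution` might turn a score into its band; the function name and the short labels are hypothetical, since in the project the scores themselves are assigned by the PydanticAI agent.

```python
# Lower bound of each band, highest first, following the guidelines above.
CONFIDENCE_BANDS = [
    (0.9, "strong"),
    (0.7, "good"),
    (0.5, "moderate"),
    (0.3, "weak"),
    (0.0, "minimal"),
]


def confidence_label(score):
    """Return the evidence band for a confidence score on the 0.0-1.0 scale."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    for lower_bound, label in CONFIDENCE_BANDS:
        if score >= lower_bound:
            return label
```

Comparing against lower bounds only also covers values that fall between the listed ranges, such as 0.895, which lands in the "good" band.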
- ORCID parser overhaul - Complete rewrite for reliability and data completeness:
- Fixed employment extraction to parse line-by-line text content instead of unreliable HTML containers
- Fixed education extraction with same line-by-line parsing approach
- Enhanced date extraction to support multiple formats:
  - Full dates: `YYYY-MM-DD to YYYY-MM-DD`
  - Year ranges: `YYYY to YYYY`
  - Ongoing: `YYYY-MM-DD to present`
- Fixed role extraction to recognize ORCID's `|` separator format (e.g., "Institut Pasteur | PhD Student")
- Fixed degree extraction for education entries (MSc, BSc, PhD, etc.)
- Enhanced duration calculation to handle full date formats with decimal precision (e.g., 3.2 years)
- Fixed location parsing to eliminate double commas and clean formatting
- All fields now reliably extracted: dates, roles, degrees, locations, durations
- Validated with real ORCID profiles (e.g., 0000-0002-1126-1535)
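
The date formats and the `|` separator described above could be handled along these lines. This is a sketch under assumptions, not the rewritten parser: the function names, treating year-only values as January 1st, and dividing by 365.25 days per year are illustrative choices.

```python
import re
from datetime import date

# Matches "YYYY[-MM-DD] to YYYY[-MM-DD]" and "YYYY[-MM-DD] to present".
RANGE_RE = re.compile(
    r"(\d{4})(?:-(\d{2})-(\d{2}))?\s+to\s+(present|(\d{4})(?:-(\d{2})-(\d{2}))?)"
)


def _to_date(year, month, day):
    # Year-only values fall back to January 1st of that year.
    return date(int(year), int(month or 1), int(day or 1))


def parse_date_range(text, today=None):
    """Return (start, end, duration_years), or None if no range is found."""
    m = RANGE_RE.search(text)
    if m is None:
        return None
    start = _to_date(m.group(1), m.group(2), m.group(3))
    if m.group(4) == "present":
        end = today or date.today()
    else:
        end = _to_date(m.group(5), m.group(6), m.group(7))
    # One decimal place, e.g. 3.2 years.
    years = round((end - start).days / 365.25, 1)
    return start, end, years


def split_org_and_role(line):
    """Split 'Organization | Role' on ORCID's pipe separator."""
    org, _, role = line.partition("|")
    return org.strip(), role.strip() or None
```

For example, `split_org_and_role("Institut Pasteur | PhD Student")` yields the organization and the role as separate strings, and a line without a pipe returns `None` for the role.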
- Dependencies: Added `httpx` for async HTTP requests in organization enrichment
- Docker volume mounting for persistent cache storage:
- Support for mounting `./data` directory to `/app/data` in container
- Environment variable `CACHE_DB_PATH` for custom cache database location
- Enables cache persistence across container restarts
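
Reading `CACHE_DB_PATH` and the other cache settings from the environment might look like the following. The variable names come from these release notes, but the function name, the fallback values (including the `/app/data/cache.db` file name), and the accepted truthy strings are assumptions.

```python
import os


def load_cache_settings(env=None):
    """Read cache settings from environment variables, with assumed defaults."""
    env = os.environ if env is None else env
    enabled = env.get("CACHE_ENABLED", "true").strip().lower() in ("1", "true", "yes")
    default_ttl = int(env.get("CACHE_DEFAULT_TTL_DAYS", "30"))
    return {
        "enabled": enabled,
        # Assumed default: a file inside the mounted /app/data directory.
        "db_path": env.get("CACHE_DB_PATH", "/app/data/cache.db"),
        "default_ttl_days": default_ttl,
        # An API-specific override falls back to the general default TTL.
        "github_user_ttl_days": int(env.get("CACHE_GITHUB_USER_TTL_DAYS", default_ttl)),
    }
```

Pointing `CACHE_DB_PATH` at a path under the mounted volume is what lets the SQLite file survive a container restart.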
- Environment-based log level configuration:
- Added `LOG_LEVEL` environment variable support (DEBUG, INFO, WARNING, ERROR)
- Allows dynamic logging configuration without code changes
- New `serve-dev-debug` justfile recipe for easy debug mode startup
- Enhanced subprocess logging with full stderr/stdout output (no truncation)
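
A minimal sketch of environment-driven log levels, assuming the standard `logging` module is used. The helper names, the `INFO` fallback for missing or unrecognized values, and the log format string are assumptions.

```python
import logging
import os

VALID_LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR")


def resolve_log_level(env=None):
    """Translate LOG_LEVEL into a logging constant, defaulting to INFO."""
    env = os.environ if env is None else env
    name = env.get("LOG_LEVEL", "INFO").strip().upper()
    if name not in VALID_LEVELS:
        name = "INFO"  # assumed fallback for unrecognized values
    return getattr(logging, name)


def configure_logging(env=None):
    """Configure the root logger once at startup."""
    logging.basicConfig(
        level=resolve_log_level(env),
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
```

With this in place, starting the service with `LOG_LEVEL=DEBUG` in the environment raises verbosity without touching the code.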
- Enhanced debugging capabilities for repository processing:
- Comprehensive debug logging for git clone operations with directory contents
- Full error output from repo-to-text subprocess (complete tracebacks)
- Directory existence checks and file listing for troubleshooting
- Detailed diagnostics when no .txt files are found after repo-to-text
- ORCID validation and normalization:
- Added `normalize_orcid_to_url()` function to convert ORCID IDs to standard URL format
- ORCID validation now accepts both ID format (0000-0002-1234-5678) and URL format (https://orcid.org/0000-0002-1234-5678)
- Automatic normalization to URL format before enrichment and scraping
- Enhanced validation in both scraping flow and enrichment flow
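
The normalization described above could be implemented roughly as follows. The function name matches the one in the notes, but this body is an illustrative guess: the real signature and its handling of invalid input are not documented here, and returning `None` is an assumption. The sketch checks the format only, not the ORCID checksum.

```python
import re

_ORCID_ID = r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]"
# Accepts a bare ID or an orcid.org URL, with an optional trailing slash.
_ORCID_RE = re.compile(rf"^(?:https?://orcid\.org/)?({_ORCID_ID})/?$")


def normalize_orcid_to_url(value):
    """Return the https://orcid.org/... form, or None if the value is not an ORCID."""
    match = _ORCID_RE.match(value.strip())
    if match is None:
        return None
    return f"https://orcid.org/{match.group(1)}"
```

Normalizing once at the boundary means the scraping and enrichment flows only ever see the URL form.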
- Auto-enrichment flag for conditional ORCID enrichment:
- Added `auto_enrich_orcid` query parameter (default: `true`) to repository endpoints
- All...
Release v1.0.0
Changelog
All notable changes to this project will be documented in this file.
[1.0.0] - 2025-08-06
Added
- Users and Organization compatibility
- Endpoints refactoring
- Parallel calling
- Multiworkers entrypoint
Release v0.1.0
[0.1.0] - 2025-06-25
Added
- Initial project setup.
- Dockerfile for containerization.
- GitHub Actions workflow for automated publishing and releases.