@@ -16,7 +16,7 @@ The Academic Catalog Enrichment system provides integration with academic reposi
1616 │
1717 ▼
1818┌─────────────────────────────────────────────────────────────────┐
19- │ AcademicCatalogEnrichmentResult │
19+ │ linkedEntitiesEnrichmentResult │
2020│ ┌────────────────┬──────────────────┬──────────────────────┐ │
2121│ │ repository_ │ author_relations │ organization_ │ │
2222│ │ relations │ Dict[str, List] │ relations │ │
@@ -36,13 +36,13 @@ The Academic Catalog Enrichment system provides integration with academic reposi
3636## Data Models
3737
3838### Location
39- - **Path**: `src/data_models/academic_catalog.py`
39+ - **Path**: `src/data_models/linked_entities.py`
4040
4141### Key Models
4242
43- #### 1. AcademicCatalogRelation
43+ #### 1. linkedEntitiesRelation
4444```python
45- class AcademicCatalogRelation(BaseModel):
45+ class linkedEntitiesRelation(BaseModel):
4646 """A single relation to an academic catalog entity."""
4747
4848 catalogType: CatalogType # "infoscience", "openalex", "epfl_graph"
@@ -54,19 +54,19 @@ class AcademicCatalogRelation(BaseModel):
5454 # Note: externalId field has been removed
5555```
5656
57- #### 2. AcademicCatalogEnrichmentResult (Structured Output)
57+ #### 2. linkedEntitiesEnrichmentResult (Structured Output)
5858```python
59- class AcademicCatalogEnrichmentResult(BaseModel):
59+ class linkedEntitiesEnrichmentResult(BaseModel):
6060 """Organized results by what was searched for."""
6161
6262 # Publications about the repository/project itself
63- repository_relations: List[AcademicCatalogRelation] = []
63+ repository_relations: List[linkedEntitiesRelation] = []
6464
6565 # Keyed by exact author name provided
66- author_relations: Dict[str, List[AcademicCatalogRelation]] = {}
66+ author_relations: Dict[str, List[linkedEntitiesRelation]] = {}
6767
6868 # Keyed by exact organization name provided
69- organization_relations: Dict[str, List[AcademicCatalogRelation]] = {}
69+ organization_relations: Dict[str, List[linkedEntitiesRelation]] = {}
7070
7171 # Metadata
7272 searchStrategy: Optional[str] = None
@@ -102,7 +102,7 @@ result = agent.run(prompt, authors=["Alexander Mathis", ...])
102102 }
103103}
104104# Direct assignment:
105- author.academicCatalogRelations = result.author_relations[author.name]
105+ author.linkedEntities = result.author_relations[author.name]
106106```
107107
108108### Agent Responsibilities
@@ -124,8 +124,8 @@ Python code is responsible for:
124124
125125### 1. Agent Call
126126```python
127- # src/agents/academic_catalog_enrichment.py
128- async def enrich_repository_academic_catalog(
127+ # src/agents/linked_entities_enrichment.py
128+ async def enrich_repository_linked_entities(
129129 repository_url: str,
130130 repository_name: str,
131131 description: str,
@@ -161,9 +161,9 @@ async def enrich_repository_academic_catalog(
161161### 3. Direct Assignment
162162```python
163163# src/analysis/repositories.py
164- async def run_academic_catalog_enrichment(self):
164+ async def run_linked_entities_enrichment(self):
165165 # Call agent
166- result = await enrich_repository_academic_catalog(
166+ result = await enrich_repository_linked_entities(
167167 repository_url=self.full_path,
168168 repository_name=repository_name,
169169 authors=author_names, # ["Alexander Mathis", ...]
@@ -173,21 +173,21 @@ async def run_academic_catalog_enrichment(self):
173173 enrichment_data = result.get("data")
174174
175175 # 1. Repository-level
176- self.data.academicCatalogRelations = enrichment_data.repository_relations
176+ self.data.linkedEntities = enrichment_data.repository_relations
177177
178178 # 2. Author-level (direct lookup by name)
179179 for author in self.data.author:
180180 if author.name in enrichment_data.author_relations:
181- author.academicCatalogRelations = enrichment_data.author_relations[author.name]
181+ author.linkedEntities = enrichment_data.author_relations[author.name]
182182 else:
183- author.academicCatalogRelations = []
183+ author.linkedEntities = []
184184
185185 # 3. Organization-level (direct lookup by name)
186186 for org in self.data.author:
187187 if org.legalName in enrichment_data.organization_relations:
188- org.academicCatalogRelations = enrichment_data.organization_relations[org.legalName]
188+ org.linkedEntities = enrichment_data.organization_relations[org.legalName]
189189 else:
190- org.academicCatalogRelations = []
190+ org.linkedEntities = []
191191```
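
The author-level and organization-level branches above both reduce to a dictionary lookup with an empty-list default. The following is a minimal sketch of that pattern, not the project's implementation: `Person` and `assign_linked_entities` are hypothetical stand-ins, and relations are represented as plain dicts rather than the Pydantic models.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Person:
    """Hypothetical stand-in for the real author model."""
    name: str
    linkedEntities: List[dict] = field(default_factory=list)


def assign_linked_entities(
    authors: List[Person],
    author_relations: Dict[str, List[dict]],
) -> None:
    """Assign relations by exact name; unmatched authors get an empty list."""
    for author in authors:
        # dict.get collapses the if/else lookup into a single call
        author.linkedEntities = list(author_relations.get(author.name, []))


authors = [Person("Alexander Mathis"), Person("Unknown Author")]
relations = {"Alexander Mathis": [{"catalogType": "infoscience"}]}
assign_linked_entities(authors, relations)
```

Because the lookup is by exact name, any difference in spelling or whitespace between the name passed to the agent and the name on the model yields an empty list rather than an error.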
192192
193193**Key Points**:
@@ -309,10 +309,10 @@ return InfoscienceAuthor(
309309
3103104. **Agent extracts from markdown**:
311311 - Agent prompt explicitly instructs: "Extract UUID from '*UUID:* <uuid>' in markdown"
312- - Agent populates `AcademicCatalogRelation.uuid` field
312+ - Agent populates `linkedEntitiesRelation.uuid` field
313313 - Agent populates `entity.uuid` in the full entity object
314314
315- **Chain of custody**: API → Parser → Pydantic Model → Markdown → Agent → AcademicCatalogRelation
315+ **Chain of custody**: API → Parser → Pydantic Model → Markdown → Agent → linkedEntitiesRelation
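
In the pipeline the agent itself reads the UUID out of the markdown. For tests, a deterministic check that the markdown actually carries a UUID can be useful. The sketch below assumes the `*UUID:* <uuid>` marker quoted in the agent prompt and a canonical 8-4-4-4-12 hexadecimal UUID; `extract_uuid` and the sample text are hypothetical.

```python
import re
from typing import Optional

# Assumed format: the '*UUID:* <uuid>' marker followed by a canonical UUID.
UUID_PATTERN = re.compile(
    r"\*UUID:\*\s*([0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})"
)


def extract_uuid(markdown: str) -> Optional[str]:
    """Return the first UUID found after a '*UUID:*' marker, or None."""
    match = UUID_PATTERN.search(markdown)
    return match.group(1) if match else None


# Made-up sample; real tool output may contain more fields.
sample = "**Jane Doe**\n*UUID:* 123e4567-e89b-12d3-a456-426614174000\n"
found = extract_uuid(sample)
missing = extract_uuid("no identifier in this text")
```

A `None` result on real tool output would point to a break earlier in the chain (parser or markdown rendering) rather than in the agent.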
316316
317317#### 3. Markdown as Transport Layer
318318Since tools return markdown (not structured data), markdown must include ALL critical fields:
@@ -466,42 +466,69 @@ ENV_VAR_MAPPINGS = {
466466| LLM Analysis | `run_llm_analysis` | o4-mini | Main repository analysis |
467467| User Enrichment | `run_user_enrichment` | o4-mini | Author enrichment with ORCID |
468468| Org Enrichment | `run_organization_enrichment` | o4-mini | ROR matching |
469- | Academic Catalog | `run_academic_catalog_enrichment` | o4-mini | Infoscience searches (tool-heavy) |
469+ | Academic Catalog | `run_linked_entities_searcher` | o4-mini | Infoscience searches (tool-heavy, repository-level only) |
470470| EPFL Assessment | `run_epfl_assessment` | o4-mini | Final holistic assessment |
471+ | Repository Classifier | `run_repository_classifier` | o4-mini | Repository type and discipline classification |
472+ | Organization Identifier | `run_organization_identifier` | o4-mini | Organization identification |
471473
472474## Integration into Analysis Pipeline
473475
474476### Repository Analysis Flow
475477
476478```python
477479# src/analysis/repositories.py
478- async def run_analysis(self):
480+ async def run_analysis(self, run_author_linked_entities: bool = False):
479481 # 1. Extract metadata with GIMIE
480482 await self.run_gimie()
481483
482- # 2. LLM analysis (main agent)
483- await self.run_llm_analysis()
484+ # 2. Atomic LLM pipeline (stages 1-5)
485+ await self.run_atomic_llm_pipeline()
486+ # Stage 1: Context compiler
487+ # Stage 2: Structured output
488+ # Stage 3: Repository classifier
489+ # Stage 4: Organization identifier
490+ # Stage 5: Linked entities searcher (repository-level only)
484491
485492 # 3. ORCID enrichment (no LLM)
486493 self.run_authors_enrichment()
487494
488- # 4. Organization enrichment (ROR agent)
495+ # 4. User enrichment (optional)
496+ await self.run_user_enrichment()
497+
498+ # 5. Organization enrichment (optional)
489499 await self.run_organization_enrichment()
490500
491- # 5. User enrichment (author agent)
492- await self.run_user_enrichment()
501+ # 6. Academic catalog enrichment (repository-level - runs in atomic pipeline)
502+ # Already completed in Stage 5 of atomic pipeline
493503
494- # 6. Academic catalog enrichment (NEW!)
495- await self.run_academic_catalog_enrichment()
504+ # 7. Optional: Author-level linked entities enrichment
505+ if run_author_linked_entities:
506+ await self.run_author_linked_entities_enrichment()
496507
497- # 7. Final EPFL assessment (holistic)
508+ # 8. Final EPFL assessment (holistic)
498509 await self.run_epfl_final_assessment()
499510```
500511
501512**Order matters**:
502- - Academic catalog enrichment runs AFTER user/org enrichment (needs author names)
513+ - Academic catalog enrichment (repository-level) runs in Stage 5 of the atomic pipeline
514+ - Author-level linked entities enrichment is optional and runs separately, after the atomic pipeline
503515- EPFL assessment runs LAST (reviews all collected data)
504516
517+ ### Linked Entities Enrichment Scope
518+
519+ **Repository-Level (Default)**:
520+ - Runs automatically in Stage 5 of atomic pipeline
521+ - Searches Infoscience for publications about the repository/tool name
522+ - Stores results in `repository.linkedEntities`
523+ - Uses `search_infoscience_publications_tool` with repository name as query
524+
525+ **Author-Level (Optional)**:
526+ - Controlled by `run_author_linked_entities` parameter
527+ - Separate method: `run_author_linked_entities_enrichment()`
528+ - Searches Infoscience for each author individually
529+ - Assigns results to `author.linkedEntities` for each Person
530+ - Only runs when explicitly requested (default: `False`)
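
The opt-in behaviour described above can be sketched as follows. This is an illustration under assumptions, not the project's code: `search_author` is a hypothetical stub standing in for the per-author Infoscience search, and `run_analysis_sketch` models only the flag handling.

```python
import asyncio
from typing import Dict, List


async def search_author(name: str) -> List[dict]:
    """Hypothetical stub for a per-author Infoscience search."""
    return [{"catalogType": "infoscience", "query": name}]


async def run_analysis_sketch(
    author_names: List[str],
    run_author_linked_entities: bool = False,
) -> Dict[str, List[dict]]:
    """Return author-level relations only when explicitly requested."""
    author_relations: Dict[str, List[dict]] = {}
    if run_author_linked_entities:
        # One search per author, keyed by the exact name provided
        for name in author_names:
            author_relations[name] = await search_author(name)
    return author_relations


default_run = asyncio.run(run_analysis_sketch(["Alexander Mathis"]))
opt_in_run = asyncio.run(
    run_analysis_sketch(["Alexander Mathis"], run_author_linked_entities=True)
)
```

With the default `False`, no author searches are issued at all, which keeps the extra tool calls out of the standard run.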
531+
505532### Estimated Token Accumulation
506533
507534**EVERY agent must accumulate estimated tokens**:
@@ -517,7 +544,7 @@ if usage and "estimated_input_tokens" in usage:
517544- ✅ `run_llm_analysis()`
518545- ✅ `run_organization_enrichment()`
519546- ✅ `run_user_enrichment()`
520- - ✅ `run_academic_catalog_enrichment()`
547+ - ✅ `run_linked_entities_enrichment()`
521548- ✅ `run_epfl_final_assessment()`
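
One way to keep accumulation consistent across all of these methods is a small shared accumulator. The sketch below reuses the `usage` guard quoted in this section; the `TokenTotals` class and the `estimated_output_tokens` key are assumptions made for illustration.

```python
from typing import Dict, Optional


class TokenTotals:
    """Hypothetical accumulator for estimated token counts across agents."""

    def __init__(self) -> None:
        self.estimated_input_tokens = 0
        self.estimated_output_tokens = 0

    def add(self, usage: Optional[Dict[str, int]]) -> None:
        # Skip agents that returned no usage or no estimates
        if usage and "estimated_input_tokens" in usage:
            self.estimated_input_tokens += usage["estimated_input_tokens"]
            # "estimated_output_tokens" is an assumed key name
            self.estimated_output_tokens += usage.get("estimated_output_tokens", 0)


totals = TokenTotals()
totals.add({"estimated_input_tokens": 1200, "estimated_output_tokens": 300})
totals.add(None)  # an agent that reported nothing
totals.add({"estimated_input_tokens": 800})
```

Calling `totals.add(...)` once at the end of each agent method makes a missing accumulation easy to spot in review.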
522549
523550## Testing Guidelines
@@ -539,8 +566,8 @@ curl "http://0.0.0.0:1234/v1/extract/json/https://github.com/DeepLabCut/DeepLabC
539566
540567### Verification Checklist
541568
542- - [ ] Repository `academicCatalogRelations` populated
543- - [ ] Each author has `academicCatalogRelations` (may be empty)
569+ - [ ] Repository `linkedEntities` populated
570+ - [ ] Each author has `linkedEntities` (may be empty)
544571- [ ] Relations include full entity objects (not just UUIDs)
545572- [ ] **UUIDs are populated** (not null) for all matched entities
546573- [ ] **URLs/profile_urls are populated** for all matched entities
@@ -552,7 +579,7 @@ curl "http://0.0.0.0:1234/v1/extract/json/https://github.com/DeepLabCut/DeepLabC
552579
553580## Common Issues & Solutions
554581
555- ### Issue: UUID is null in academicCatalogRelations
582+ ### Issue: UUID is null in linkedEntities
556583**Cause**: Field name mismatch in parser (e.g., `url=` instead of `profile_url=`)
557584**Symptoms**:
558585```json
@@ -604,6 +631,26 @@ curl "http://0.0.0.0:1234/v1/extract/json/https://github.com/DeepLabCut/DeepLabC
604631**Cause**: Parser passing wrong field name to Pydantic model
605632**Solution**: Pydantic silently ignores unknown fields - verify field names match model definition
606633
634+ ### Issue: Validation errors for union fields (entityInfosciencePublication, entityInfoscienceAuthor, entityInfoscienceLab)
635+ **Cause**: LLM populating all three union fields with the same data, or wrong entity type in wrong field
636+ **Symptoms**:
637+ ```json
638+ {
639+ "entityType": "publication",
640+ "entityInfosciencePublication": {...}, // ✅ Correct
641+ "entityInfoscienceAuthor": {...}, // ❌ Should be None/omitted
642+ "entityInfoscienceLab": {...} // ❌ Should be None/omitted
643+ }
644+ ```
645+
646+ **Solution**:
647+ 1. **System prompt**: Explicitly instruct LLM to populate ONLY the field matching `entityType`
648+ 2. **Reconciliation method**: `_reconcile_entity_union()` in `repositories.py`:
649+ - Checks `entityType` to select correct union variant
650+ - Removes other two fields
651+ - Converts `None` to empty lists for list fields (`subjects`, `authors`, `keywords`)
652+ 3. **List field handling**: Convert `None` to `[]` for list fields before validation
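
The reconciliation steps listed above can be sketched on plain dicts as shown below. This is not the actual `_reconcile_entity_union()` from `repositories.py`: the function name, the `entityType` values `"author"` and `"lab"`, and the choice to drop all three fields for an unknown type are assumptions.

```python
from typing import Any, Dict

# Assumed mapping; only "publication" is confirmed by the example above.
UNION_FIELDS = {
    "publication": "entityInfosciencePublication",
    "author": "entityInfoscienceAuthor",
    "lab": "entityInfoscienceLab",
}
LIST_FIELDS = ("subjects", "authors", "keywords")


def reconcile_entity_union(relation: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the union field matching entityType; normalize None lists."""
    cleaned = dict(relation)
    keep = UNION_FIELDS.get(cleaned.get("entityType"))
    # Remove the union fields that do not match (all of them if type is unknown)
    for field_name in UNION_FIELDS.values():
        if field_name != keep:
            cleaned.pop(field_name, None)
    entity = cleaned.get(keep) if keep else None
    if isinstance(entity, dict):
        entity = dict(entity)  # copy so the input is not mutated
        for list_field in LIST_FIELDS:
            if list_field in entity and entity[list_field] is None:
                entity[list_field] = []
        cleaned[keep] = entity
    return cleaned


raw = {
    "entityType": "publication",
    "entityInfosciencePublication": {"title": "Example", "authors": None},
    "entityInfoscienceAuthor": {"title": "Example"},
    "entityInfoscienceLab": {"title": "Example"},
}
fixed = reconcile_entity_union(raw)
```

Running the cleanup before model validation means a single over-populated relation from the LLM does not fail the whole enrichment result.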
653+
607654## Future Extensions
608655
609656### Adding New Catalogs
@@ -642,7 +689,7 @@ Future enhancement: Match same entities across catalogs using:
642689
643690```python
644691# Example future feature
645- def deduplicate_across_catalogs(relations: List[AcademicCatalogRelation]):
692+ def deduplicate_across_catalogs(relations: List[linkedEntitiesRelation]):
646693 """Merge same entities from different catalogs."""
647694 # Group by DOI, ORCID, or other stable identifiers
648695 # Provide unified view across catalogs
@@ -660,8 +707,8 @@ def deduplicate_across_catalogs(relations: List[AcademicCatalogRelation]):
660707
661708## References
662709
663- - Implementation: `src/agents/academic_catalog_enrichment.py`
664- - Data Models: `src/data_models/academic_catalog.py`
710+ - Implementation: `src/agents/linked_entities_enrichment.py`
711+ - Data Models: `src/data_models/linked_entities.py`
665712- Infoscience Client: `src/context/infoscience.py`
666713- Integration: `src/analysis/repositories.py`
667- - Documentation: `ACADEMIC_CATALOG_OPTION_B_IMPLEMENTATION.md` (if exists)
714+ - Documentation: `linked_entities_OPTION_B_IMPLEMENTATION.md` (if exists)