Commit 0b4e035

Merge pull request #23 from Imaging-Plaza/json-to-rdf
feat: add vibecoded but working convert_json_jsonld.py
2 parents cb23439 + 2a8dcc8 commit 0b4e035

77 files changed

Lines changed: 20031 additions & 1657 deletions

.cursor/rules/academic-catalog-enrichment.mdc

Lines changed: 87 additions & 40 deletions
@@ -16,7 +16,7 @@ The Academic Catalog Enrichment system provides integration with academic reposi
 
 
 ┌─────────────────────────────────────────────────────────────────┐
-AcademicCatalogEnrichmentResult
+linkedEntitiesEnrichmentResult
 │ ┌────────────────┬──────────────────┬──────────────────────┐ │
 │ │ repository_ │ author_relations │ organization_ │ │
 │ │ relations │ Dict[str, List] │ relations │ │
@@ -36,13 +36,13 @@ The Academic Catalog Enrichment system provides integration with academic reposi
 ## Data Models
 
 ### Location
-- **Path**: `src/data_models/academic_catalog.py`
+- **Path**: `src/data_models/linked_entities.py`
 
 ### Key Models
 
-#### 1. AcademicCatalogRelation
+#### 1. linkedEntitiesRelation
 ```python
-class AcademicCatalogRelation(BaseModel):
+class linkedEntitiesRelation(BaseModel):
     """A single relation to an academic catalog entity."""
 
     catalogType: CatalogType  # "infoscience", "openalex", "epfl_graph"
@@ -54,19 +54,19 @@ class AcademicCatalogRelation(BaseModel):
     # Note: externalId field has been removed
 ```
 
-#### 2. AcademicCatalogEnrichmentResult (Structured Output)
+#### 2. linkedEntitiesEnrichmentResult (Structured Output)
 ```python
-class AcademicCatalogEnrichmentResult(BaseModel):
+class linkedEntitiesEnrichmentResult(BaseModel):
     """Organized results by what was searched for."""
 
     # Publications about the repository/project itself
-    repository_relations: List[AcademicCatalogRelation] = []
+    repository_relations: List[linkedEntitiesRelation] = []
 
     # Keyed by exact author name provided
-    author_relations: Dict[str, List[AcademicCatalogRelation]] = {}
+    author_relations: Dict[str, List[linkedEntitiesRelation]] = {}
 
     # Keyed by exact organization name provided
-    organization_relations: Dict[str, List[AcademicCatalogRelation]] = {}
+    organization_relations: Dict[str, List[linkedEntitiesRelation]] = {}
 
     # Metadata
     searchStrategy: Optional[str] = None
@@ -102,7 +102,7 @@ result = agent.run(prompt, authors=["Alexander Mathis", ...])
     }
 }
 # Direct assignment:
-author.academicCatalogRelations = result.author_relations[author.name]
+author.linkedEntities = result.author_relations[author.name]
 ```
 
 ### Agent Responsibilities
@@ -124,8 +124,8 @@ Python code is responsible for:
 
 ### 1. Agent Call
 ```python
-# src/agents/academic_catalog_enrichment.py
-async def enrich_repository_academic_catalog(
+# src/agents/linked_entities_enrichment.py
+async def enrich_repository_linked_entities(
     repository_url: str,
     repository_name: str,
     description: str,
@@ -161,9 +161,9 @@ async def enrich_repository_academic_catalog(
 ### 3. Direct Assignment
 ```python
 # src/analysis/repositories.py
-async def run_academic_catalog_enrichment(self):
+async def run_linked_entities_enrichment(self):
     # Call agent
-    result = await enrich_repository_academic_catalog(
+    result = await enrich_repository_linked_entities(
         repository_url=self.full_path,
         repository_name=repository_name,
         authors=author_names,  # ["Alexander Mathis", ...]
@@ -173,21 +173,21 @@ async def run_academic_catalog_enrichment(self):
     enrichment_data = result.get("data")
 
     # 1. Repository-level
-    self.data.academicCatalogRelations = enrichment_data.repository_relations
+    self.data.linkedEntities = enrichment_data.repository_relations
 
     # 2. Author-level (direct lookup by name)
     for author in self.data.author:
         if author.name in enrichment_data.author_relations:
-            author.academicCatalogRelations = enrichment_data.author_relations[author.name]
+            author.linkedEntities = enrichment_data.author_relations[author.name]
         else:
-            author.academicCatalogRelations = []
+            author.linkedEntities = []
 
     # 3. Organization-level (direct lookup by name)
     for org in self.data.author:
         if org.legalName in enrichment_data.organization_relations:
-            org.academicCatalogRelations = enrichment_data.organization_relations[org.legalName]
+            org.linkedEntities = enrichment_data.organization_relations[org.legalName]
         else:
-            org.academicCatalogRelations = []
+            org.linkedEntities = []
 ```
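
The direct-assignment step in this hunk amounts to a dictionary lookup keyed by the exact name that was passed to the agent, with an empty list as the fallback. A minimal sketch of that pattern, assuming simplified stand-in classes: `Person`, `EnrichmentResult`, and `assign_author_relations` are hypothetical names used for illustration, not the project's Pydantic models.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Person:
    """Hypothetical stand-in for the project's author model."""
    name: str
    linkedEntities: List[dict] = field(default_factory=list)


@dataclass
class EnrichmentResult:
    """Hypothetical stand-in for linkedEntitiesEnrichmentResult."""
    author_relations: Dict[str, List[dict]] = field(default_factory=dict)


def assign_author_relations(authors: List[Person], result: EnrichmentResult) -> None:
    """Attach relations to each author by exact-name lookup, defaulting to []."""
    for author in authors:
        # dict.get collapses the if/else branch into a single lookup
        author.linkedEntities = result.author_relations.get(author.name, [])


authors = [Person("Alexander Mathis"), Person("Unknown Author")]
result = EnrichmentResult(
    author_relations={
        "Alexander Mathis": [{"entityType": "publication", "uuid": "example-uuid"}]
    }
)
assign_author_relations(authors, result)
```

Because the keys are exact names, an author whose name the agent returns with different spelling or casing would silently end up with an empty list, which is worth checking when results look sparse.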
 
 **Key Points**:
@@ -309,10 +309,10 @@ return InfoscienceAuthor(
 
 4. **Agent extracts from markdown**:
    - Agent prompt explicitly instructs: "Extract UUID from '*UUID:* <uuid>' in markdown"
-   - Agent populates `AcademicCatalogRelation.uuid` field
+   - Agent populates `linkedEntitiesRelation.uuid` field
    - Agent populates `entity.uuid` in the full entity object
 
-**Chain of custody**: API → Parser → Pydantic Model → Markdown → Agent → AcademicCatalogRelation
+**Chain of custody**: API → Parser → Pydantic Model → Markdown → Agent → linkedEntitiesRelation
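
Since the UUID travels through markdown before the agent reads it, it can help to check that the rendered text really carries a parseable `*UUID:* <uuid>` line when debugging null UUIDs. A minimal sketch, assuming the marker format quoted in the agent prompt; `extract_uuid` and the sample markdown are hypothetical and not part of the project.

```python
import re
import uuid
from typing import Optional

# Matches the "*UUID:* <uuid>" marker quoted in the agent prompt.
UUID_LINE = re.compile(r"\*UUID:\*\s*([0-9a-fA-F-]{36})")


def extract_uuid(markdown: str) -> Optional[str]:
    """Return the first well-formed UUID after a '*UUID:*' marker, else None."""
    match = UUID_LINE.search(markdown)
    if match is None:
        return None
    try:
        # uuid.UUID raises ValueError for malformed strings
        return str(uuid.UUID(match.group(1)))
    except ValueError:
        return None


# Hypothetical tool output used only for this illustration
sample = (
    "**Some Publication**\n"
    "*UUID:* 123e4567-e89b-12d3-a456-426614174000\n"
    "*URL:* https://example.org/record"
)
found = extract_uuid(sample)
```

A check like this only verifies the transport step (Pydantic Model → Markdown); it says nothing about whether the agent then copies the value into the relation.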
 
 #### 3. Markdown as Transport Layer
 Since tools return markdown (not structured data), markdown must include ALL critical fields:
@@ -466,42 +466,69 @@ ENV_VAR_MAPPINGS = {
 | LLM Analysis | `run_llm_analysis` | o4-mini | Main repository analysis |
 | User Enrichment | `run_user_enrichment` | o4-mini | Author enrichment with ORCID |
 | Org Enrichment | `run_organization_enrichment` | o4-mini | ROR matching |
-| Academic Catalog | `run_academic_catalog_enrichment` | o4-mini | Infoscience searches (tool-heavy) |
+| Academic Catalog | `run_linked_entities_searcher` | o4-mini | Infoscience searches (tool-heavy, repository-level only) |
 | EPFL Assessment | `run_epfl_assessment` | o4-mini | Final holistic assessment |
+| Repository Classifier | `run_repository_classifier` | o4-mini | Repository type and discipline classification |
+| Organization Identifier | `run_organization_identifier` | o4-mini | Organization identification |
 
 ## Integration into Analysis Pipeline
 
 ### Repository Analysis Flow
 
 ```python
 # src/analysis/repositories.py
-async def run_analysis(self):
+async def run_analysis(self, run_author_linked_entities: bool = False):
     # 1. Extract metadata with GIMIE
     await self.run_gimie()
 
-    # 2. LLM analysis (main agent)
-    await self.run_llm_analysis()
+    # 2. Atomic LLM pipeline (stages 1-5)
+    await self.run_atomic_llm_pipeline()
+    # Stage 1: Context compiler
+    # Stage 2: Structured output
+    # Stage 3: Repository classifier
+    # Stage 4: Organization identifier
+    # Stage 5: Linked entities searcher (repository-level only)
 
     # 3. ORCID enrichment (no LLM)
     self.run_authors_enrichment()
 
-    # 4. Organization enrichment (ROR agent)
+    # 4. User enrichment (optional)
+    await self.run_user_enrichment()
+
+    # 5. Organization enrichment (optional)
     await self.run_organization_enrichment()
 
-    # 5. User enrichment (author agent)
-    await self.run_user_enrichment()
+    # 6. Academic catalog enrichment (repository-level - runs in atomic pipeline)
+    # Already completed in Stage 5 of atomic pipeline
 
-    # 6. Academic catalog enrichment (NEW!)
-    await self.run_academic_catalog_enrichment()
+    # 7. Optional: Author-level linked entities enrichment
+    if run_author_linked_entities:
+        await self.run_author_linked_entities_enrichment()
 
-    # 7. Final EPFL assessment (holistic)
+    # 8. Final EPFL assessment (holistic)
     await self.run_epfl_final_assessment()
 ```
 
 **Order matters**:
-- Academic catalog enrichment runs AFTER user/org enrichment (needs author names)
+- Academic catalog enrichment (repository-level) runs in Stage 5 of atomic pipeline
+- Author-level linked entities enrichment is optional and runs separately
 - EPFL assessment runs LAST (reviews all collected data)
 
+### Linked Entities Enrichment Scope
+
+**Repository-Level (Default)**:
+- Runs automatically in Stage 5 of atomic pipeline
+- Searches Infoscience for publications about the repository/tool name
+- Stores results in `repository.linkedEntities`
+- Uses `search_infoscience_publications_tool` with repository name as query
+
+**Author-Level (Optional)**:
+- Controlled by `run_author_linked_entities` parameter
+- Separate method: `run_author_linked_entities_enrichment()`
+- Searches Infoscience for each author individually
+- Assigns results to `author.linkedEntities` for each Person
+- Only runs when explicitly requested (default: `False`)
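
The split between the default repository-level search and the opt-in author-level search can be sketched as follows. This is an illustration only: `FakeRepository` and its stubbed method bodies are hypothetical stand-ins that merely record call order, while the method and parameter names mirror those listed in this hunk.

```python
import asyncio
from typing import List


class FakeRepository:
    """Hypothetical stand-in that records which pipeline steps ran."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    async def run_atomic_llm_pipeline(self) -> None:
        # Stage 5 of the atomic pipeline covers the repository-level search
        self.calls.append("atomic_pipeline")

    async def run_author_linked_entities_enrichment(self) -> None:
        self.calls.append("author_linked_entities")

    async def run_epfl_final_assessment(self) -> None:
        self.calls.append("epfl_assessment")

    async def run_analysis(self, run_author_linked_entities: bool = False) -> None:
        await self.run_atomic_llm_pipeline()
        if run_author_linked_entities:
            await self.run_author_linked_entities_enrichment()
        # The assessment runs last so it can review everything collected
        await self.run_epfl_final_assessment()


default_run = FakeRepository()
asyncio.run(default_run.run_analysis())

opt_in_run = FakeRepository()
asyncio.run(opt_in_run.run_analysis(run_author_linked_entities=True))
```

Keeping the flag at `False` by default means callers pay for per-author Infoscience searches only when they ask for them.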
+
 ### Estimated Token Accumulation
 
 **EVERY agent must accumulate estimated tokens**:
@@ -517,7 +544,7 @@ if usage and "estimated_input_tokens" in usage:
 - ✅ `run_llm_analysis()`
 - ✅ `run_organization_enrichment()`
 - ✅ `run_user_enrichment()`
-- ✅ `run_academic_catalog_enrichment()`
+- ✅ `run_linked_entities_enrichment()`
 - ✅ `run_epfl_final_assessment()`
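
The accumulation rule for the methods listed here could look roughly like the sketch below. Only the guard `if usage and "estimated_input_tokens" in usage:` comes from the hunk header; the `TokenTally` class and the `estimated_output_tokens` key are assumptions made for illustration.

```python
from typing import Any, Dict, Optional


class TokenTally:
    """Hypothetical accumulator for estimated token usage across agent calls."""

    def __init__(self) -> None:
        self.estimated_input_tokens = 0
        self.estimated_output_tokens = 0

    def add(self, usage: Optional[Dict[str, Any]]) -> None:
        """Add one agent's usage; skip agents that returned no estimate."""
        if usage and "estimated_input_tokens" in usage:
            self.estimated_input_tokens += usage.get("estimated_input_tokens", 0)
            # Assumed key: the output-side name is not shown in the diff
            self.estimated_output_tokens += usage.get("estimated_output_tokens", 0)


tally = TokenTally()
tally.add({"estimated_input_tokens": 120, "estimated_output_tokens": 30})
tally.add(None)  # an agent that reported nothing
tally.add({"estimated_input_tokens": 80})
```

Calling the same `add` helper from every `run_*` method would make it harder to forget one agent when a new stage is introduced.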
 
 ## Testing Guidelines
@@ -539,8 +566,8 @@ curl "http://0.0.0.0:1234/v1/extract/json/https://github.com/DeepLabCut/DeepLabC
 
 ### Verification Checklist
 
-- [ ] Repository `academicCatalogRelations` populated
-- [ ] Each author has `academicCatalogRelations` (may be empty)
+- [ ] Repository `linkedEntities` populated
+- [ ] Each author has `linkedEntities` (may be empty)
 - [ ] Relations include full entity objects (not just UUIDs)
 - [ ] **UUIDs are populated** (not null) for all matched entities
 - [ ] **URLs/profile_urls are populated** for all matched entities
@@ -552,7 +579,7 @@ curl "http://0.0.0.0:1234/v1/extract/json/https://github.com/DeepLabCut/DeepLabC
 
 ## Common Issues & Solutions
 
-### Issue: UUID is null in academicCatalogRelations
+### Issue: UUID is null in linkedEntities
 **Cause**: Field name mismatch in parser (e.g., `url=` instead of `profile_url=`)
 **Symptoms**:
 ```json
@@ -604,6 +631,26 @@ curl "http://0.0.0.0:1234/v1/extract/json/https://github.com/DeepLabCut/DeepLabC
 **Cause**: Parser passing wrong field name to Pydantic model
 **Solution**: Pydantic silently ignores unknown fields - verify field names match model definition
 
+### Issue: Validation errors for union fields (entityInfosciencePublication, entityInfoscienceAuthor, entityInfoscienceLab)
+**Cause**: LLM populating all three union fields with the same data, or wrong entity type in wrong field
+**Symptoms**:
+```json
+{
+  "entityType": "publication",
+  "entityInfosciencePublication": {...}, // ✅ Correct
+  "entityInfoscienceAuthor": {...}, // ❌ Should be None/omitted
+  "entityInfoscienceLab": {...} // ❌ Should be None/omitted
+}
+```
+
+**Solution**:
+1. **System prompt**: Explicitly instruct LLM to populate ONLY the field matching `entityType`
+2. **Reconciliation method**: `_reconcile_entity_union()` in `repositories.py`:
+   - Checks `entityType` to select correct union variant
+   - Removes other two fields
+   - Converts `None` to empty lists for list fields (`subjects`, `authors`, `keywords`)
+3. **List field handling**: Convert `None` to `[]` for list fields before validation
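
The reconciliation steps described in this solution can be sketched on plain dictionaries as below. This is not the project's `_reconcile_entity_union()`; `reconcile_entity_union` is a hypothetical illustration, and the `"author"` and `"lab"` values for `entityType` are assumptions (only `"publication"` appears in the symptom example).

```python
from typing import Any, Dict

# entityType -> union field. Only "publication" is confirmed by the
# symptom example; the "author" and "lab" keys are assumptions.
UNION_FIELDS = {
    "publication": "entityInfosciencePublication",
    "author": "entityInfoscienceAuthor",
    "lab": "entityInfoscienceLab",
}
LIST_FIELDS = ("subjects", "authors", "keywords")


def reconcile_entity_union(relation: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the union field matching entityType and normalise list fields."""
    cleaned = dict(relation)  # shallow copy so the input is left untouched
    keep = UNION_FIELDS.get(cleaned.get("entityType"))
    for field_name in UNION_FIELDS.values():
        if field_name != keep:
            cleaned.pop(field_name, None)
    entity = cleaned.get(keep) if keep else None
    if isinstance(entity, dict):
        entity = dict(entity)
        for list_field in LIST_FIELDS:
            # None would fail validation for a list-typed field
            if list_field in entity and entity[list_field] is None:
                entity[list_field] = []
        cleaned[keep] = entity
    return cleaned


raw = {
    "entityType": "publication",
    "entityInfosciencePublication": {"title": "T", "authors": None},
    "entityInfoscienceAuthor": {"title": "T"},
    "entityInfoscienceLab": {"title": "T"},
}
fixed = reconcile_entity_union(raw)
```

Running the clean-up before model validation means a single over-populated response degrades into a valid record rather than a validation error.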
+
 ## Future Extensions
 
 ### Adding New Catalogs
@@ -642,7 +689,7 @@ Future enhancement: Match same entities across catalogs using:
 
 ```python
 # Example future feature
-def deduplicate_across_catalogs(relations: List[AcademicCatalogRelation]):
+def deduplicate_across_catalogs(relations: List[linkedEntitiesRelation]):
     """Merge same entities from different catalogs."""
     # Group by DOI, ORCID, or other stable identifiers
     # Provide unified view across catalogs
@@ -660,8 +707,8 @@ def deduplicate_across_catalogs(relations: List[AcademicCatalogRelation]):
 
 ## References
 
-- Implementation: `src/agents/academic_catalog_enrichment.py`
-- Data Models: `src/data_models/academic_catalog.py`
+- Implementation: `src/agents/linked_entities_enrichment.py`
+- Data Models: `src/data_models/linked_entities.py`
 - Infoscience Client: `src/context/infoscience.py`
 - Integration: `src/analysis/repositories.py`
-- Documentation: `ACADEMIC_CATALOG_OPTION_B_IMPLEMENTATION.md` (if exists)
+- Documentation: `linked_entities_OPTION_B_IMPLEMENTATION.md` (if exists)
