Overview
A simplified approach for adding government sources (parliaments, ministries, etc.) as data sources: instead of scraper configs, we send index pages directly to the LLM, just as we already do with detail pages. No pagination support.
Architecture
Refactor WikipediaLink → Source
Generalize the current WikipediaLink model to handle any source type:
```python
class SourceType(enum.Enum):
    INDEX = "INDEX"    # List page with links to politicians
    DETAIL = "DETAIL"  # Individual politician page


class Source(Base):
    id: UUID
    url: str
    source_type: SourceType

    # Optional Wikipedia-specific fields
    wikipedia_project_id: str | None  # FK to WikipediaProject (NULL for non-Wikipedia)
    politician_id: UUID | None        # FK to Politician (NULL for INDEX sources)

    # Campaign association
    campaign_id: UUID | None          # FK to Campaign (NULL for Wikipedia links)

    # Relationships
    languages: list[Language]         # via SourceLanguage link table


class SourceLanguage(Base):
    """Link table between sources and language entities (many-to-many)."""
    source_id: UUID    # FK to Source, PK
    language_id: str   # FK to Language entity, PK
```

Sources can have multiple languages (Wikipedia projects can have multiple LANGUAGE_OF_WORK relations in Wikidata). The SourceLanguage link table mirrors the existing ArchivedPageLanguage pattern.
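To make the split concrete, here is a hedged sketch of the two source flavors; the `session` object, the URLs, and the language QID are illustrative assumptions, not part of the schema:

```python
from uuid import UUID, uuid4

# Hedged sketch, not actual PoliLoom code: shows which fields each source
# flavor uses. The session object, URLs, and language QID are placeholders.

def create_example_sources(session, politician_id: UUID, campaign_id: UUID) -> None:
    # Wikipedia detail page: tied to a project and a politician, no campaign.
    wiki_detail = Source(
        id=uuid4(),
        url="https://en.wikipedia.org/wiki/Example_Politician",
        source_type=SourceType.DETAIL,
        wikipedia_project_id="enwiki",  # FK to WikipediaProject
        politician_id=politician_id,
        campaign_id=None,
    )

    # Government index page: belongs to a campaign, has no Wikipedia fields,
    # and no politician until detail pages have been extracted from it.
    gov_index = Source(
        id=uuid4(),
        url="https://www.parliament.example/members",
        source_type=SourceType.INDEX,
        wikipedia_project_id=None,
        politician_id=None,
        campaign_id=campaign_id,
    )

    # Language now lives on the source via the SourceLanguage link table.
    session.add_all([
        wiki_detail,
        gov_index,
        SourceLanguage(source_id=gov_index.id, language_id="Q1860"),  # English
    ])
```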
Campaign Model
Campaigns group index sources with metadata for batch processing:
```python
class Campaign(Base):
    id: UUID
    name: str
    country_id: str | None   # FK to Country entity (optional filter)
    position_ids: list[str]  # Array of Position QIDs (optional filter)
    created_at: datetime

    # Relationships
    sources: list[Source]    # INDEX sources belonging to this campaign
```
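For illustration, a hedged sketch of creating a campaign with one index source; the name, QIDs, and URL are placeholder values, and the session handling is assumed rather than taken from existing code:

```python
from datetime import datetime, timezone
from uuid import uuid4

# Hedged sketch: a campaign filtered to one country and one position,
# holding a single INDEX source. All identifiers here are placeholders.

def create_example_campaign(session) -> Campaign:
    campaign = Campaign(
        id=uuid4(),
        name="Bundestag members 2025",
        country_id="Q183",          # Germany, as an optional country filter
        position_ids=["Q1939555"],  # placeholder Position QID filter
        created_at=datetime.now(timezone.utc),
    )
    campaign.sources.append(
        Source(
            id=uuid4(),
            url="https://www.bundestag.de/en/members",
            source_type=SourceType.INDEX,
            campaign_id=campaign.id,
        )
    )
    session.add(campaign)
    return campaign
```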
Language on Source vs ArchivedPage
Move the language association from ArchivedPage to Source:
- Language is known at source creation time (from Wikipedia project relations or user input)
- ArchivedPage becomes purely about content storage
- Simplifies the data model: source metadata stays with the source
- Preserves support for multiple languages per source
Workflow
Index Page Processing
- Create Campaign with index URLs as Source records (`source_type=INDEX`)
- Run campaign (a sketch of this loop follows the list):
  - Fetch each index URL via Playwright → ArchivedPage
  - Send rendered HTML to the LLM with the prompt: "Extract all politician detail page URLs from this index"
  - Create new Source records (`source_type=DETAIL`) for the extracted URLs
  - Link detail sources to the campaign's country/positions for filtering
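A minimal sketch of that campaign run loop, assuming Playwright's sync API and an OpenAI-style chat client; the helper names `archive_page` and `create_detail_source`, the prompt wording, and the model name are assumptions standing in for the ArchivedPage storage and Source creation described above, not existing PoliLoom APIs:

```python
from openai import OpenAI
from playwright.sync_api import sync_playwright

# Hedged sketch of the campaign run loop, not PoliLoom's actual implementation.

PROMPT = "Extract all politician detail page URLs from this index. Return one URL per line."

def run_campaign(campaign, archive_page, create_detail_source) -> None:
    client = OpenAI()
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        for source in campaign.sources:
            if source.source_type != SourceType.INDEX:
                continue

            # Fetch and render the index page, then archive the HTML.
            page.goto(source.url)
            html = page.content()
            archive_page(source, html)

            # Ask the LLM to pull detail-page URLs out of the rendered HTML.
            response = client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder model name
                messages=[
                    {"role": "system", "content": PROMPT},
                    {"role": "user", "content": html},
                ],
            )
            urls = response.choices[0].message.content.splitlines()

            # Create DETAIL sources linked to the campaign's country/position filters.
            for url in filter(None, (u.strip() for u in urls)):
                create_detail_source(campaign, url)
        browser.close()
```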
Detail Page Processing
Detail pages go through the existing enrichment pipeline:
- Triggered via the `enrich-wikipedia` CLI or API
- Same LLM extraction for politician properties
- Same evaluation workflow
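For completeness, a hedged sketch of how a campaign's detail sources might be handed to that pipeline; `enrich_source` is a hypothetical stand-in for the existing enrichment entry point, and the query assumes the models are mapped with SQLAlchemy:

```python
from sqlalchemy import select

# Hedged sketch: hands a campaign's DETAIL sources to the existing enrichment
# pipeline. enrich_source() is a hypothetical stand-in for that entry point.

def enrich_campaign_details(session, campaign_id, enrich_source) -> None:
    stmt = select(Source).where(
        Source.campaign_id == campaign_id,
        Source.source_type == SourceType.DETAIL,
    )
    for source in session.scalars(stmt):
        # Same LLM extraction and evaluation workflow as Wikipedia detail pages.
        enrich_source(source)
```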
Supersedes #109