Simplified source scraping with campaigns #112

@monneyboi

Description

Overview

A simplified approach for adding government websites (parliaments, ministries, etc.) as data sources. Instead of per-source scraper configs, we send index pages directly to the model, as we already do with detail pages. Pagination is not supported.

Architecture

Refactor WikipediaLink → Source

Generalize the current WikipediaLink model to handle any source type:

class SourceType(enum.Enum):
    INDEX = "INDEX"      # List page with links to politicians
    DETAIL = "DETAIL"    # Individual politician page

class Source(Base):
    id: UUID
    url: str
    source_type: SourceType
    
    # Optional Wikipedia-specific fields
    wikipedia_project_id: str | None  # FK to WikipediaProject (NULL for non-Wikipedia)
    politician_id: UUID | None  # FK to Politician (NULL for INDEX sources)
    
    # Campaign association
    campaign_id: UUID | None  # FK to Campaign (NULL for Wikipedia links)
    
    # Relationships
    languages: list[Language]  # via SourceLanguage link table

class SourceLanguage(Base):
    """Link table between sources and language entities (many-to-many)."""
    source_id: UUID  # FK to Source, PK
    language_id: str  # FK to Language entity, PK

Sources can have multiple languages (Wikipedia projects can have multiple LANGUAGE_OF_WORK relations in Wikidata). The SourceLanguage link table mirrors the existing ArchivedPageLanguage pattern.
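
For illustration, an existing Wikipedia link would map onto the generalized model roughly like this (the project id, the politician row, and the Language row below are example values, not part of the proposal):

# Example only: a DETAIL source taking the place of a WikipediaLink row
source = Source(
    url="https://de.wikipedia.org/wiki/Angela_Merkel",
    source_type=SourceType.DETAIL,
    wikipedia_project_id="dewiki",  # FK to the German Wikipedia project (example id)
    politician_id=merkel.id,        # existing Politician row (example)
    campaign_id=None,               # Wikipedia links don't belong to a campaign
)
source.languages.append(german)     # Language row for German, via SourceLanguage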

Campaign Model

Campaigns group index sources with metadata for batch processing:

class Campaign(Base):
    id: UUID
    name: str
    country_id: str | None  # FK to Country entity (optional filter)
    position_ids: list[str]  # Array of Position QIDs (optional filter)
    created_at: datetime
    
    # Relationships
    sources: list[Source]  # INDEX sources belonging to this campaign
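
Creating a campaign might then look like the following sketch (the URL and Wikidata QIDs are illustrative examples):

campaign = Campaign(
    name="Bundestag members",
    country_id="Q183",           # Germany (example country filter)
    position_ids=["Q1939555"],   # member of the German Bundestag (example position filter)
)
campaign.sources.append(
    Source(url="https://www.bundestag.de/abgeordnete", source_type=SourceType.INDEX)
)
session.add(campaign)
session.commit()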

Language on Source vs ArchivedPage

Move the language association from ArchivedPage to Source (a short before/after sketch follows the list):

  • Language is known at source creation time (from Wikipedia project relations or user input)
  • ArchivedPage becomes purely a content-storage record
  • Simplifies the data model: source metadata stays with the source
  • Preserves support for multiple languages per source
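
A minimal before/after sketch of the lookup, assuming ArchivedPage keeps a reference to the Source it was fetched for (the source relationship name is an assumption):

# Before: languages stored per page
langs = archived_page.languages          # via ArchivedPageLanguage

# After: languages resolved through the source
langs = archived_page.source.languages   # via SourceLanguage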

Workflow

Index Page Processing

  1. Create Campaign with index URLs as Source records (source_type=INDEX)
  2. Run the campaign (sketched after this list):
    • Fetch each index URL via Playwright → ArchivedPage
    • Send rendered HTML to LLM with prompt: "Extract all politician detail page URLs from this index"
    • Create new Source records (source_type=DETAIL) for extracted URLs
    • Link detail sources to campaign's country/positions for filtering
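
A sketch of the run step, assuming thin wrappers around the existing Playwright fetch and LLM extraction (fetch_page, extract_detail_urls, and the archived.html field are hypothetical names):

async def run_campaign(session, campaign: Campaign) -> None:
    for index in (s for s in campaign.sources if s.source_type == SourceType.INDEX):
        # Playwright render -> ArchivedPage (hypothetical wrapper)
        archived = await fetch_page(index.url)
        # LLM prompt: "Extract all politician detail page URLs from this index"
        urls = await extract_detail_urls(archived.html)
        for url in urls:
            session.add(Source(
                url=url,
                source_type=SourceType.DETAIL,
                campaign_id=campaign.id,  # inherits the campaign's country/position filters
            ))
    session.commit()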

Detail Page Processing

Detail pages go through the existing enrichment pipeline (see the sketch after this list):

  • Triggered via enrich-wikipedia CLI or API
  • Same LLM extraction for politician properties
  • Same evaluation workflow
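
A sketch of how campaign detail sources could be handed to that pipeline (enrich_source is a hypothetical wrapper around the same extraction used by enrich-wikipedia):

from sqlalchemy import select

async def enrich_campaign_details(session, campaign: Campaign) -> None:
    detail_sources = session.scalars(
        select(Source).where(
            Source.campaign_id == campaign.id,
            Source.source_type == SourceType.DETAIL,
        )
    ).all()
    for source in detail_sources:
        await enrich_source(source)  # same LLM extraction and evaluation workflow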

Supersedes #109
