
Commit 789802a

unify positions with the rest of the wiki entities, semantic search for all
1 parent c618b9c commit 789802a

26 files changed: +365 −1356 lines

poliloom/CLAUDE.md

Lines changed: 9 additions & 13 deletions

@@ -21,9 +21,8 @@ This document outlines the high-level architecture and strategy for the PoliLoom
 - **API:** FastAPI with MediaWiki OAuth 2.0 authentication
 - **Database:** PostgreSQL with SQLAlchemy ORM and Alembic migrations
 - **LLM Integration:** OpenAI API for structured data extraction
-- **Vector Search:** SentenceTransformers ('paraphrase-multilingual-MiniLM-L12-v2') + pgvector extension
+- **Search:** Meilisearch with OpenAI embeddings for hybrid search (keyword + semantic)
 - **Storage:** Google Cloud Storage (GCS) for dump processing (automatic gs:// path detection)
-- **PyTorch:** CPU-only in Docker (`uv sync --extra cpu`), GPU for development (`uv sync --extra cu128`)

 **Important:** Always use `uv` for running Python commands and managing dependencies.

@@ -44,7 +43,7 @@ Each Wikidata entity type has a dedicated class (`WikidataPolitician`, `Wikidata
 For entity-linked properties (OpenAI's 500 enum limit):

 1. **Free-form Extraction:** LLM extracts natural language descriptions
-2. **Vector Mapping:** Generate embeddings → similarity search top 100 → LLM maps to specific Wikidata entity or None
+2. **Entity Mapping:** Meilisearch hybrid search (keyword + semantic) → top 100 candidates → LLM maps to specific Wikidata entity or None

 ## **4. Core Functionality**

@@ -60,7 +59,7 @@
 - **Wikipedia Content:** Fetch and process linked articles
 - **LLM Extraction:** OpenAI structured data API for politician properties
 - **Conflict Detection:** Flag discrepancies between extracted and existing Wikidata values
-- **Similarity Search:** Match unlinked entities using embeddings
+- **Similarity Search:** Match unlinked entities using Meilisearch hybrid search

 ### **API Endpoints**

@@ -87,11 +86,11 @@ _Use `--help` for detailed command documentation._
 - Actions: **Accept** new extracted data (submit to Wikidata), **Reject** incorrect extracted data (soft delete), **Deprecate** existing statements (mark as deprecated in Wikidata)
 - Supports multiple users and threshold-based workflows

-### **Embedding Workflow**
+### **Search & Similarity**

-- Position/Location embeddings initially NULL during import
-- Generated separately in batch processing for optimal performance
-- Used for similarity search in two-stage extraction
+- All entities indexed to Meilisearch with labels during import
+- Meilisearch uses OpenAI embeddings for hybrid search (keyword + semantic)
+- Position entities use higher semantic ratio (0.8) for better matching

 ### **Conflict Handling**

@@ -163,9 +162,6 @@ uv run poliloom import-hierarchy --file ./dump.json
 uv run poliloom import-entities --file ./dump.json
 uv run poliloom import-politicians --file ./dump.json

-# Generate embeddings
-uv run poliloom embed-entities
-
 # Enrich politician data
 uv run poliloom enrich-wikipedia --id Q6279
 uv run poliloom enrich-wikipedia --limit 100

@@ -187,15 +183,15 @@ uv run poliloom garbage-collect

 - **Framework**: pytest with asyncio support
 - **Database**: PostgreSQL test database (port 5433)
-- **Mocking**: External APIs (OpenAI, sentence-transformers) mocked in `conftest.py`
+- **Mocking**: External APIs (OpenAI, Meilisearch) mocked in `conftest.py`
 - **Coverage Focus**: Entity classes, database models, core data pipeline
 - **Approach**: Minimal, behavior-focused testing. Test business logic and data transformations, not language mechanics (inheritance, type checking). Avoid over-engineering tests.

 ### **Key Patterns**

 - **Entity-Oriented Architecture**: Each Wikidata entity type has dedicated class
 - **Date Handling**: Store incomplete dates as strings ('1962', 'JUN 1982')
-- **Embedding Strategy**: NULL during import, batch-generated separately
+- **Search Indexing**: Entities indexed to Meilisearch during import, embeddings generated by Meilisearch
 - **Error Handling**: Comprehensive logging and graceful degradation

 ### **Pre-commit Configuration**
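
As context for the two-stage mapping and the position-specific semantic ratio described above, here is a minimal sketch of the candidate-retrieval stage. It is not code from this commit: the index name (`positions`), embedder name (`openai`), document field `wikidata_id`, and connection details are all assumptions.

import meilisearch

client = meilisearch.Client("http://localhost:7700", "masterKey")


def candidate_entities(query: str, limit: int = 100) -> list[str]:
    """Hybrid (keyword + semantic) search returning candidate Wikidata IDs."""
    results = client.index("positions").search(
        query,
        {
            # semanticRatio: 0.0 = pure keyword, 1.0 = pure semantic;
            # 0.8 mirrors the higher ratio described above for positions.
            "hybrid": {"semanticRatio": 0.8, "embedder": "openai"},
            "limit": limit,
        },
    )
    return [hit["wikidata_id"] for hit in results["hits"]]

The resulting candidate list (up to 100 entries) is what the LLM would then map to a single Wikidata entity or None.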

poliloom/Dockerfile

Lines changed: 4 additions & 4 deletions

@@ -10,11 +10,11 @@ ENV UV_COMPILE_BYTECODE=1 \
 # Set working directory
 WORKDIR /app

-# Install all dependencies first with CPU-only PyTorch
+# Install all dependencies first
 RUN --mount=type=cache,target=/root/.cache/uv \
     --mount=type=bind,source=pyproject.toml,target=pyproject.toml \
     --mount=type=bind,source=uv.lock,target=uv.lock \
-    uv sync --frozen --no-install-project --no-dev --extra cpu
+    uv sync --frozen --no-install-project --no-dev

 # Copy package source and setup files
 COPY pyproject.toml ./

@@ -45,8 +45,8 @@ ENV PATH="/app/.venv/bin:$PATH"
 # Create non-root user
 RUN groupadd -r poliloom && useradd -r -g poliloom poliloom
 # Create cache directories with correct ownership
-RUN mkdir -p /var/cache/wikidata /var/cache/huggingface /var/cache/playwright && \
-    chown -R poliloom:poliloom /var/cache/wikidata /var/cache/huggingface /var/cache/playwright
+RUN mkdir -p /var/cache/wikidata /var/cache/playwright && \
+    chown -R poliloom:poliloom /var/cache/wikidata /var/cache/playwright

 # Install Playwright browsers as root, then fix ownership
 ENV PLAYWRIGHT_BROWSERS_PATH=/var/cache/playwright

poliloom/README.md

Lines changed: 2 additions & 5 deletions

@@ -5,7 +5,8 @@ The Python backend for PoliLoom — processes Wikidata dumps, extracts politicia
 ## Requirements

 - Python 3.12+ with [uv](https://docs.astral.sh/uv/)
-- PostgreSQL with pgvector extension
+- PostgreSQL
+- Meilisearch
 - Linux or macOS (Windows not supported due to multiprocessing requirements)
 - OpenAI API key

@@ -41,9 +42,6 @@ make extract-wikidata-dump
 uv run poliloom import-hierarchy    # Build entity relationship trees
 uv run poliloom import-entities     # Import positions, locations, countries
 uv run poliloom import-politicians  # Import politicians
-
-# Generate embeddings for semantic search
-uv run poliloom embed-entities
 ```

 ### Extract politician data

@@ -74,7 +72,6 @@ API documentation available at http://localhost:8000/docs
 | `poliloom import-hierarchy`   | Build position/location hierarchy trees from Wikidata |
 | `poliloom import-entities`    | Import positions, locations, and countries |
 | `poliloom import-politicians` | Import politicians linking to existing entities |
-| `poliloom embed-entities`     | Generate vector embeddings for semantic search |
 | `poliloom enrich-wikipedia`   | Extract politician data from Wikipedia using AI |
 | `poliloom garbage-collect`    | Remove entities deleted from Wikidata |
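
Since Meilisearch is now a requirement, a typical way to run it locally for development — illustrative only, as the diff doesn't specify a setup; the image tag and master key here are assumptions:

docker run -d -p 7700:7700 -e MEILI_MASTER_KEY=devkey getmeili/meilisearch:latest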

Lines changed: 65 additions & 0 deletions

@@ -0,0 +1,65 @@
+"""remove embedding column from positions
+
+Revision ID: b351d1ca5357
+Revises: d6772e534c56
+Create Date: 2025-12-14 15:30:55.249491
+
+"""
+
+from typing import Sequence, Union
+
+from alembic import op
+import sqlalchemy as sa
+import pgvector.sqlalchemy
+
+
+# revision identifiers, used by Alembic.
+revision: str = "b351d1ca5357"
+down_revision: Union[str, None] = "d6772e534c56"
+branch_labels: Union[str, Sequence[str], None] = None
+depends_on: Union[str, Sequence[str], None] = None
+
+
+def upgrade() -> None:
+    """Upgrade schema."""
+    # Drop the embedding reset trigger and function first
+    op.execute(
+        "DROP TRIGGER IF EXISTS wikidata_entity_name_change_trigger ON wikidata_entities"
+    )
+    op.execute("DROP FUNCTION IF EXISTS reset_embedding_on_name_change()")
+
+    # Drop the embedding column
+    op.drop_column("positions", "embedding")
+
+
+def downgrade() -> None:
+    """Downgrade schema."""
+    # Re-add the embedding column
+    op.add_column(
+        "positions",
+        sa.Column(
+            "embedding",
+            pgvector.sqlalchemy.vector.VECTOR(dim=384),
+            autoincrement=False,
+            nullable=True,
+        ),
+    )
+
+    # Re-create the embedding reset function and trigger
+    op.execute("""
+        CREATE OR REPLACE FUNCTION reset_embedding_on_name_change()
+        RETURNS TRIGGER AS $$
+        BEGIN
+            IF OLD.name IS DISTINCT FROM NEW.name THEN
+                UPDATE positions SET embedding = NULL WHERE wikidata_id = NEW.wikidata_id;
+            END IF;
+            RETURN NEW;
+        END;
+        $$ LANGUAGE plpgsql;
+    """)
+    op.execute("""
+        CREATE TRIGGER wikidata_entity_name_change_trigger
+        AFTER UPDATE ON wikidata_entities
+        FOR EACH ROW
+        EXECUTE FUNCTION reset_embedding_on_name_change();
+    """)

poliloom/poliloom/api/entities.py

Lines changed: 3 additions & 3 deletions

@@ -6,7 +6,7 @@
 from sqlalchemy import select, func, and_, case

 from ..database import get_db_session
-from ..search import SearchService, get_search_service
+from ..search import SearchService
 from ..models import (
     Language,
     Country,

@@ -174,15 +174,15 @@ async def endpoint(
         description=f"Maximum number of {entity_name} to return",
     ),
     db: Session = Depends(get_db_session),
-    search_service: SearchService = Depends(get_search_service),
     current_user: User = Depends(get_current_user),
 ):
     f"""
     Search {entity_name} by name/label using semantic similarity.

     Returns matching {entity_name} ranked by relevance with hierarchy data.
     """
-    entity_ids = model_class.find_similar(q, db, search_service, limit=limit)
+    search_service = SearchService()
+    entity_ids = model_class.find_similar(q, search_service, limit=limit)
     if not entity_ids:
         return []

poliloom/poliloom/api/politicians.py

Lines changed: 3 additions & 3 deletions

@@ -17,7 +17,7 @@
     enrich_batch,
     has_enrichable_politicians,
 )
-from ..search import SearchService, get_search_service
+from ..search import SearchService
 from ..models import (
     ArchivedPage,
     ArchivedPageLanguage,

@@ -250,15 +250,15 @@ async def search_politicians(
         default=50, le=100, description="Maximum number of politicians to return"
     ),
     db: Session = Depends(get_db_session),
-    search_service: SearchService = Depends(get_search_service),
     current_user: User = Depends(get_current_user),
 ):
     """
     Search politicians by name/label using semantic similarity.

     Returns matching politicians ranked by relevance with their properties.
     """
-    entity_ids = Politician.find_similar(q, db, search_service, limit=limit)
+    search_service = SearchService()
+    entity_ids = Politician.find_similar(q, search_service, limit=limit)
     if not entity_ids:
         return []
264264
