 - **Wikipedia Content:** Fetch and process linked articles
 - **LLM Extraction:** OpenAI structured data API for politician properties
 - **Conflict Detection:** Flag discrepancies between extracted and existing Wikidata values
-- **Similarity Search:** Match unlinked entities using embeddings
+- **Similarity Search:** Match unlinked entities using Meilisearch hybrid search
 
 ### **API Endpoints**
 
@@ -87,11 +86,11 @@ _Use `--help` for detailed command documentation._
 - Actions: **Accept** new extracted data (submit to Wikidata), **Reject** incorrect extracted data (soft delete), **Deprecate** existing statements (mark as deprecated in Wikidata)
 - Supports multiple users and threshold-based workflows
 
-### **Embedding Workflow**
+### **Search & Similarity**
 
-- Position/Location embeddings initially NULL during import
-- Generated separately in batch processing for optimal performance
-- Used for similarity search in two-stage extraction
+- All entities indexed to Meilisearch with labels during import
+- Meilisearch uses OpenAI embeddings for hybrid search (keyword + semantic)
+- Position entities use higher semantic ratio (0.8) for better matching
 
 ### **Conflict Handling**
 
@@ -163,9 +162,6 @@ uv run poliloom import-hierarchy --file ./dump.json
 uv run poliloom import-entities --file ./dump.json
 uv run poliloom import-politicians --file ./dump.json
 
-# Generate embeddings
-uv run poliloom embed-entities
-
 # Enrich politician data
 uv run poliloom enrich-wikipedia --id Q6279
 uv run poliloom enrich-wikipedia --limit 100
@@ -187,15 +183,15 @@ uv run poliloom garbage-collect
 
 - **Framework**: pytest with asyncio support
 - **Database**: PostgreSQL test database (port 5433)
-- **Mocking**: External APIs (OpenAI, sentence-transformers) mocked in `conftest.py`
+- **Mocking**: External APIs (OpenAI, Meilisearch) mocked in `conftest.py`
 - **Coverage Focus**: Entity classes, database models, core data pipeline
 - **Approach**: Minimal, behavior-focused testing. Test business logic and data transformations, not language mechanics (inheritance, type checking). Avoid over-engineering tests.
 
 ### **Key Patterns**
 
 - **Entity-Oriented Architecture**: Each Wikidata entity type has dedicated class
 - **Date Handling**: Store incomplete dates as strings ('1962', 'JUN 1982')
-- **Embedding Strategy**: NULL during import, batch-generated separately
+- **Search Indexing**: Entities indexed to Meilisearch during import, embeddings generated by Meilisearch
 - **Error Handling**: Comprehensive logging and graceful degradation
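
For reviewers unfamiliar with Meilisearch hybrid search, the per-entity semantic ratio described in the "Search & Similarity" notes could look roughly like the sketch below. It only builds the JSON body for Meilisearch's search endpoint (`q`, `limit`, and a `hybrid` object with `semanticRatio` and `embedder`). The function name `build_hybrid_query`, the index names, the embedder name `"default"`, and the fallback ratio of 0.5 are assumptions for illustration; only the 0.8 ratio for positions comes from this change, and the repository's actual code may be structured differently.

```python
from typing import Any

# Only the 0.8 ratio for positions is stated in the notes above.
# The index names and the fallback ratio are hypothetical.
SEMANTIC_RATIOS = {"positions": 0.8}
DEFAULT_SEMANTIC_RATIO = 0.5  # assumption, not taken from the repository


def build_hybrid_query(
    index: str,
    text: str,
    limit: int = 10,
    embedder: str = "default",  # assumed embedder name
) -> dict[str, Any]:
    """Build a Meilisearch search body that mixes keyword and semantic ranking.

    A semanticRatio of 0.0 is pure keyword search and 1.0 is pure vector search.
    """
    ratio = SEMANTIC_RATIOS.get(index, DEFAULT_SEMANTIC_RATIO)
    return {
        "q": text,
        "limit": limit,
        "hybrid": {"semanticRatio": ratio, "embedder": embedder},
    }


if __name__ == "__main__":
    # Position lookups lean on semantic similarity; other indexes use the fallback.
    print(build_hybrid_query("positions", "Member of the House of Representatives"))
    print(build_hybrid_query("locations", "Springfield"))
```

The returned dictionary would be sent as the body of a search request against the chosen index. Keeping the ratio in a lookup table makes it easy to tune per entity type without touching the call sites.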