feat: introduce pgvector backend support and update deployment config #121
LeslieLeung wants to merge 9 commits into
…urations
- Added support for PostgreSQL's pgvector extension as an alternative vector backend.
- Updated .env.example to include configuration options for pgvector.
- Modified deployment documentation to reflect the new pgvector setup.
- Created new docker-compose files for pgvector development and deployment.
- Refactored existing code to accommodate vector backend selection, enhancing flexibility for users.
PR Review: feat: introduce pgvector backend support
This is a solid, well-structured PR that adds pgvector as an alternative vector backend while keeping Milvus support. The abstraction layer via
Security
SQL Injection risk in
text(
    f"SELECT name, model_signature "  # noqa: S608
    f"FROM {pgvector_config.metadata_table} "
    "WHERE name IN ('entries', 'preferences')"
)
The The same pattern appears in
Dependency Concern
pgvector is now a hard dependency even for Milvus-only deployments
Both
# In pyproject.toml
[project.optional-dependencies]
pgvector = ["pgvector>=0.3.6"]
The current
Missing Input Validation
from pydantic import field_validator

@field_validator("backend")
@classmethod
def validate_backend(cls, v: str) -> str:
    allowed = {"milvus", "pgvector"}
    if v.lower() not in allowed:
        raise ValueError(f"VECTOR_BACKEND must be one of {allowed}, got '{v}'")
    return v.lower()
Performance / Correctness
No vector index on
The migration comment acknowledges this:
This means all similarity searches via
CREATE INDEX IF NOT EXISTS idx_entry_embeddings_embedding_hnsw
ON entry_embeddings USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
The downgrade migration already references
When pgvector uses the same database as the main app (the default fallback to
In
published_ts = (
    int(published_at.timestamp()) if published_at else int(datetime.now().timestamp())
)
and
now_ts = int(datetime.now().timestamp())
Should use
Job ID collision risk in
batch_prefix = f"maint_{int(time.time())}"
If the maintenance cron fires multiple times in the same second (unlikely but possible in tests or under clock skew), the job IDs will collide and arq will deduplicate them. Using
Minor Issues
Test only exercises happy path / unit behavior for pgvector
Given that this is the core of the new backend, integration tests (even with a mocked engine) would significantly improve confidence.
def _is_duplicate_feed_guid_error(error: IntegrityError) -> bool:
    error_text = str(error)
    return (
        "uq_feed_guid" in error_text
        or "duplicate key value violates unique constraint" in error_text
    )
The second branch (
return "uq_feed_guid" in str(error.orig)
What's Done Well
The issues above range from a real security concern (SQL identifier interpolation) to architectural notes (shared connection pool, hard dependency). I'd recommend addressing the security and dependency points before merging; the rest can be follow-ups.
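The SQL identifier concern raised in this review can be illustrated with a small quoting helper. This is only a sketch: `quote_ident` and `build_metadata_query` are hypothetical names, not the PR's own code, and the allowed character set is an assumption about what table names the configuration should accept.

```python
import re

# Hypothetical helper, not taken from the PR: validates and quotes a
# PostgreSQL identifier before it is interpolated into SQL text.
_IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def quote_ident(name: str) -> str:
    """Return a double-quoted identifier, rejecting anything unusual."""
    if not _IDENT_RE.fullmatch(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    # Doubling embedded quotes is the standard escape. The pattern above
    # already excludes quotes, so this is defense in depth.
    return '"' + name.replace('"', '""') + '"'


def build_metadata_query(metadata_table: str) -> str:
    """Build the metadata lookup with a quoted, validated table name."""
    return (
        "SELECT name, model_signature "
        f"FROM {quote_ident(metadata_table)} "
        "WHERE name IN ('entries', 'preferences')"
    )
```

Rejecting unexpected names outright, rather than only escaping them, keeps a misconfigured table name from reaching the database at all.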
- Changed PostgreSQL service image to pgvector/pgvector:pg16 for improved compatibility.
- Modified embedding maintenance task to use UUIDs for batch prefixes instead of timestamps, ensuring uniqueness.
- Updated datetime handling in PgVectorClient to use UTC for consistency across time-related operations.
- Enhanced SQL commands in PgVectorClient and validation service to properly quote identifiers, improving SQL safety and correctness.
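The UTC and UUID changes described in this commit could look roughly like the sketch below. The function names `to_epoch_seconds` and `new_batch_prefix` are hypothetical, and treating naive datetimes as UTC is an assumption about the codebase rather than something the PR states.

```python
import uuid
from datetime import datetime, timezone
from typing import Optional


def to_epoch_seconds(published_at: Optional[datetime] = None) -> int:
    """Epoch seconds for published_at, falling back to the current UTC time."""
    if published_at is None:
        published_at = datetime.now(timezone.utc)
    elif published_at.tzinfo is None:
        # Assumption: naive datetimes are meant as UTC, not local time.
        published_at = published_at.replace(tzinfo=timezone.utc)
    return int(published_at.timestamp())


def new_batch_prefix() -> str:
    """Batch prefix that stays unique when two runs start in the same second."""
    return f"maint_{uuid.uuid4().hex}"
```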
- Added SQL command to create the vector extension during the database initialization process in the test engine function, ensuring the extension is available for use.
1135683 to 9f23044
@claude Review this PR.
Claude finished @LeslieLeung's task in 2m 35s
PR Review: feat: introduce pgvector backend support
This is a solid PR that adds pgvector as a lightweight alternative to Milvus. The
What Was Fixed Since Previous Review
Remaining Issues
Bug: Naive
| Issue | Severity | Status |
|---|---|---|
| datetime.now() without UTC in embedding_service.py:159 | Bug | Unresolved |
| time.time() job prefix in embedding_rebuild.py:168 | Minor bug | Unresolved (fixed in maintenance only) |
| _is_duplicate_feed_guid_error overly broad match | Bug | Unresolved |
| SQL identifier quoting (_quote_ident duplication) | Minor | Partially resolved |
| pgvector hard dep + dead try/except fallbacks | Architecture | Unresolved |
| No HNSW similarity index | Performance | Acknowledged, needs docs |
| Dual connection pool | Architecture | Unresolved |
| Minimal PgVectorClient tests | Testing | Unresolved |
| VECTOR_BACKEND unvalidated string | DX | Unresolved |
The first three items (datetime bug, job ID inconsistency, error matching) are straightforward fixes. Everything else can be addressed in follow-up issues. The overall implementation quality is good and the PR is close to mergeable.
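For the error-matching item, a narrower check might inspect only the driver-level error and only the constraint name, as the earlier review suggested. The sketch below uses a stand-in exception class so it runs without SQLAlchemy; `FakeIntegrityError` is hypothetical, while the `orig` attribute mirrors the one SQLAlchemy attaches to its DBAPI errors.

```python
class FakeIntegrityError(Exception):
    """Stand-in for sqlalchemy.exc.IntegrityError, used only in this sketch."""

    def __init__(self, orig: Exception) -> None:
        super().__init__(str(orig))
        self.orig = orig  # the underlying driver exception


def is_duplicate_feed_guid_error(error: Exception) -> bool:
    """True only when the driver error names the uq_feed_guid constraint."""
    orig = getattr(error, "orig", None)
    return orig is not None and "uq_feed_guid" in str(orig)
```

With this shape, a unique violation on any other constraint is no longer mistaken for a duplicate feed GUID and propagates as a real error.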
- Modified datetime imports and usage in MilvusClient, EmbeddingService, and PreferenceService to ensure consistent UTC handling for timestamps.
@codex Review this.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4df65ca333
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
if Vector is None:
    raise RuntimeError("pgvector is required to run this migration")

op.execute("CREATE EXTENSION IF NOT EXISTS vector")
Don't require pgvector in every schema upgrade
This Alembic revision runs for every deployment, including existing Milvus installs, but upgrade() now unconditionally imports pgvector and executes CREATE EXTENSION vector. On any PostgreSQL instance that does not have the extension package installed or does not grant CREATE EXTENSION to the app role, the entire migration chain stops here even though the operator is not using the pgvector backend at all.
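One possible way to avoid blocking Milvus-only installs is to issue the extension DDL only when pgvector is the selected backend. The sketch below is an assumption-laden illustration: `maybe_create_vector_extension` is a hypothetical helper, `execute` stands in for Alembic's `op.execute`, and the backend value is passed in explicitly rather than read from the project's real settings object.

```python
from typing import Callable


def maybe_create_vector_extension(execute: Callable[[str], None], backend: str) -> bool:
    """Issue CREATE EXTENSION only for the pgvector backend.

    `execute` is a stand-in for Alembic's op.execute. Returns True when
    the DDL statement was issued, False when it was skipped.
    """
    if backend.strip().lower() != "pgvector":
        return False
    execute("CREATE EXTENSION IF NOT EXISTS vector")
    return True
```

A conditional migration does mean that schemas can differ between deployments, so switching backends later would need its own setup step.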
vector_client = create_vector_store_client()
vector_client.connect()
await vector_client.ensure_collections(config.dimension, config.provider, config.model)
Avoid schema initialization in the per-request score dependency
get_score_service() is invoked on every GET /entries request, and with the new pgvector backend this ensure_collections() call is not read-only: a fresh PgVectorClient starts with _schema_ensured=False and _last_model_signature=None, so ensure_collections() reruns the DDL checks and upserts vector_store_metadata before scoring. In pgvector mode that turns an ordinary read path into a write/DDL path, which will fail on read-only replicas or roles without extension/schema privileges.
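Keeping schema setup off the request path usually means running it once per process, for example at application startup. A minimal sketch, assuming a hypothetical `SchemaInitializer` wrapper (not part of the PR) around whatever coroutine performs the setup:

```python
import asyncio
from typing import Awaitable, Callable


class SchemaInitializer:
    """Runs an async setup coroutine at most once per process."""

    def __init__(self, setup: Callable[[], Awaitable[None]]) -> None:
        self._setup = setup
        self._done = False
        self._lock = asyncio.Lock()

    async def ensure(self) -> None:
        if self._done:
            return  # fast path: no lock, no DDL, no writes
        async with self._lock:
            # Re-check under the lock so concurrent callers run setup once.
            if not self._done:
                await self._setup()
                self._done = True
```

Request handlers would then reuse the app-scoped client and never call the setup coroutine themselves.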
},
)
continue
Preserve last_entry_at when a concurrent insert loses the race
In the new duplicate-savepoint path, a concurrent fetch that loses the uq_feed_guid race immediately continues here, so the later latest_entry_time update is skipped for that parsed item. Because latest_entry_time was initialized from this transaction's stale feed.last_entry_at, that worker can commit after the winner and overwrite the newer timestamp, causing the feed to keep revisiting already-ingested entries.
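One way to keep the timestamp from moving backwards is to compute it from every parsed item, including those skipped as duplicates. The sketch below is illustrative only; `compute_last_entry_at` is a hypothetical helper and the PR's actual loop structure may differ.

```python
from datetime import datetime
from typing import Iterable, Optional


def compute_last_entry_at(
    stored: Optional[datetime],
    parsed_times: Iterable[Optional[datetime]],
) -> Optional[datetime]:
    """Newest timestamp among the stored value and every parsed item.

    Duplicates are included on purpose: an item that lost the insert race
    still shows the feed has content at least that recent.
    """
    latest = stored
    for ts in parsed_times:
        if ts is not None and (latest is None or ts > latest):
            latest = ts
    return latest
```

Because the stored value can itself be stale inside a long transaction, a monotonic write at the SQL level (for example with PostgreSQL's `GREATEST`) would be a natural complement.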
Move vector collection initialization out of the request dependency so score service resolution reuses the app-scoped client without triggering request-time schema setup. Made-with: Cursor
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ad6e4a6fa9
for entry_id in pending_embedding_ids:
    await ctx["redis"].enqueue_job("generate_entry_embedding", entry_id)
Preserve embedding job enqueue on post-commit failures
This loop enqueues embedding jobs only after the DB transaction has already committed, so a transient Redis failure here raises the task after entries are persisted but before their embedding jobs are queued. On retry, those entries are now treated as existing and skipped earlier in fetch_feed_task, so they never get re-enqueued for embedding and can remain permanently unembedded (and therefore unscored).
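A partial mitigation is to make each enqueue idempotent and to report which entries could not be queued, so the caller can retry or record them. This is a sketch under assumptions: `enqueue_embeddings` is a hypothetical helper, `redis` is assumed to expose arq's `enqueue_job` coroutine, and the `embed:` job ID scheme is invented for illustration.

```python
from typing import Iterable, List


async def enqueue_embeddings(redis, entry_ids: Iterable[str]) -> List[str]:
    """Enqueue one embedding job per entry and return the IDs that failed.

    The fixed _job_id makes re-enqueueing an entry a no-op while its
    earlier job is still queued, so retries do not create duplicates.
    """
    failed: List[str] = []
    for entry_id in entry_ids:
        try:
            await redis.enqueue_job(
                "generate_entry_embedding", entry_id, _job_id=f"embed:{entry_id}"
            )
        except Exception:
            # Assumption: any enqueue error is treated as retryable.
            failed.append(entry_id)
    return failed
```

On its own this does not close the gap, since the failed IDs still have to be persisted or rediscovered; a periodic sweep for entries without embeddings would be one way to do that.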
success=True,
message="pgvector connection successful",
details={
    "database_url_configured": True,
    "vector_extension_installed": has_extension,
Fail pgvector validation when vector extension is missing
validate_pgvector() always returns success=True once it can connect, even when vector_extension_installed is false, so validate_and_rebuild_embeddings() treats the backend as valid and proceeds. In environments where the role cannot create/install vector, this produces a false-positive validation and then fails later during collection creation, rather than surfacing the backend incompatibility during validation.
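The suggested behavior amounts to tying `success` to the extension check instead of to connectivity alone. A minimal sketch, assuming a hypothetical `ValidationResult` type and helper name (the PR's own result class and failure message may differ):

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ValidationResult:
    """Hypothetical result type used only for this sketch."""

    success: bool
    message: str
    details: Dict[str, Any] = field(default_factory=dict)


def pgvector_validation_result(has_extension: bool) -> ValidationResult:
    """Report failure when the connection works but the extension is absent."""
    details = {
        "database_url_configured": True,
        "vector_extension_installed": has_extension,
    }
    if not has_extension:
        return ValidationResult(
            False,
            "pgvector connection succeeded but the vector extension is not installed",
            details,
        )
    return ValidationResult(True, "pgvector connection successful", details)
```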