Skip to content

feat: introduce pgvector backend support and update deployment config#121

Open
LeslieLeung wants to merge 9 commits into
mainfrom
feature/pgvector
Open

feat: introduce pgvector backend support and update deployment config#121
LeslieLeung wants to merge 9 commits into
mainfrom
feature/pgvector

Conversation

@LeslieLeung
Copy link
Copy Markdown
Owner

@LeslieLeung LeslieLeung commented Feb 25, 2026

  • Added support for PostgreSQL's pgvector extension as an alternative vector backend.
  • Updated .env.example to include configuration options for pgvector.
  • Modified deployment documentation to reflect the new pgvector setup.
  • Created new docker-compose files for pgvector development and deployment.
  • Refactored existing code to accommodate vector backend selection, enhancing flexibility for users.

…urations

- Added support for PostgreSQL's pgvector extension as an alternative vector backend.
- Updated .env.example to include configuration options for pgvector.
- Modified deployment documentation to reflect the new pgvector setup.
- Created new docker-compose files for pgvector development and deployment.
- Refactored existing code to accommodate vector backend selection, enhancing flexibility for users.
@LeslieLeung LeslieLeung changed the title feat: introduce pgvector backend support and update deployment config… feat: introduce pgvector backend support and update deployment config Feb 25, 2026
@claude
Copy link
Copy Markdown

claude Bot commented Feb 25, 2026

PR Review: feat: introduce pgvector backend support

This is a solid, well-structured PR that adds pgvector as an alternative vector backend while keeping Milvus support. The abstraction layer via VectorStoreClient Protocol is clean, and several important correctness fixes are bundled in (SELECT FOR UPDATE SKIP LOCKED, post-commit embedding enqueue, distributed rebuild lock, stuck-entry recovery). Overall the approach is sound. Here are my findings:


Security

SQL Injection risk in validate_pgvector (validation_service.py)

text(
    f"SELECT name, model_signature "  # noqa: S608
    f"FROM {pgvector_config.metadata_table} "
    "WHERE name IN ('entries', 'preferences')"
)

The # noqa: S608 comment suppresses the linter warning but the underlying concern is real. pgvector_config.metadata_table comes from an env var, so it is not direct user input, but it's still an interpolated string in raw SQL. If the env var is misconfigured it could cause unexpected query behavior. Prefer sqlalchemy.text with a bound parameter for the table name, or quote the identifier explicitly with asyncpg's quote_ident.

The same pattern appears in recreate_collections with bare DROP TABLE IF EXISTS {table.name}. PostgreSQL identifier quoting ("table_name") would be safer here.


Dependency Concern

pgvector is now a hard dependency even for Milvus-only deployments

Both glean_database/pyproject.toml and glean_vector/pyproject.toml add pgvector>=0.3.6 as a required dependency. This means users who only use Milvus now pull in the pgvector Python package. Consider making it an optional dependency:

# In pyproject.toml
[project.optional-dependencies]
pgvector = ["pgvector>=0.3.6"]

The current try/except ImportError fallbacks in the model files (entry_embedding.py, user_preference_vector.py) suggest this was originally intended to be optional. Since pgvector is now unconditionally required, those fallbacks are dead code and create confusion about whether the package is actually required.


Missing Input Validation

VECTOR_BACKEND is an unvalidated string

VectorBackendConfig.backend is a plain str with no enum/validator. An invalid value (e.g., VECTOR_BACKEND=clickhouse) will pass startup silently and only fail at runtime when create_vector_store_client() raises ValueError. Catching it early with a Pydantic validator would give a clearer error message at startup:

from pydantic import field_validator

@field_validator("backend")
@classmethod
def validate_backend(cls, v: str) -> str:
    allowed = {"milvus", "pgvector"}
    if v.lower() not in allowed:
        raise ValueError(f"VECTOR_BACKEND must be one of {allowed}, got '{v}'")
    return v.lower()

Performance / Correctness

No vector index on entry_embeddings.embedding

The migration comment acknowledges this:

HNSW index requires fixed-dimension vector columns. We intentionally keep vector columns dimension-agnostic here.

This means all similarity searches via search_similar_entries will do sequential full-table scans until an HNSW or IVFFlat index is created manually. At even modest scale (tens of thousands of entries) this will be slow. Consider:

  1. Documenting the recommended post-migration index creation command in DEPLOY.md.
  2. Or creating the index in ensure_collections after create_all, using the actual dimension (which is available at that point) — e.g.:
CREATE INDEX IF NOT EXISTS idx_entry_embeddings_embedding_hnsw
ON entry_embeddings USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

The downgrade migration already references DROP INDEX IF EXISTS idx_entry_embeddings_embedding_hnsw, so this was clearly planned but not included in the upgrade.


PgVectorClient creates a second independent connection pool

When pgvector uses the same database as the main app (the default fallback to DATABASE_URL), two separate connection pools will be maintained. This doubles connection overhead and complicates observability. Consider accepting an existing engine/session-maker as a constructor argument, or at minimum documenting the trade-off.


datetime.now() without timezone

In pgvector_client.py:

published_ts = (
    int(published_at.timestamp()) if published_at else int(datetime.now().timestamp())
)

and

now_ts = int(datetime.now().timestamp())

Should use datetime.now(UTC) for consistency with the rest of the codebase (the maintenance task already uses datetime.now(UTC)).


Job ID collision risk in _re_enqueue_pending

batch_prefix = f"maint_{int(time.time())}"

If the maintenance cron fires multiple times in the same second (unlikely but possible in tests or under clock skew), the job IDs will collide and arq will deduplicate them. Using uuid4() or adding microseconds would be safer.


Minor Issues

Test only exercises happy path / unit behavior for pgvector

test_pgvector_client.py only tests two trivial default-state checks. No tests cover:

  • ensure_collections idempotency
  • insert_entry_embedding / upsert behavior
  • search_similar_entries filter logic
  • recreate_collections dropping and recreating tables

Given that this is the core of the new backend, integration tests (even with a mocked engine) would significantly improve confidence.

_is_duplicate_feed_guid_error is fragile

def _is_duplicate_feed_guid_error(error: IntegrityError) -> bool:
    error_text = str(error)
    return (
        "uq_feed_guid" in error_text
        or "duplicate key value violates unique constraint" in error_text
    )

The second branch ("duplicate key value violates unique constraint") matches any unique constraint violation on the table, not just uq_feed_guid. This could mask legitimate constraint errors. Match on the constraint name only:

return "uq_feed_guid" in str(error.orig)

What's Done Well

  • The VectorStoreClient Protocol is a clean abstraction — both backends become interchangeable without changing calling code.
  • SELECT FOR UPDATE SKIP LOCKED in batch/retry operations is exactly the right pattern for multi-worker correctness.
  • The post-commit embedding enqueue fix in feed_fetcher.py eliminates a real race condition.
  • Distributed rebuild lock with Redis prevents duplicate concurrent rebuilds.
  • _recover_stuck_entries with a generous threshold is a solid resilience mechanism.
  • The _safe_handle_error / _mark_failed_safe helpers properly handle asyncpg's failed-transaction state.
  • Bilingual documentation is well-maintained and accurate.
  • Factory tests (test_vector_factory.py) and recovery tests (test_vector_client_recovery.py) are good additions.

The issues above range from a real security concern (SQL identifier interpolation) to architectural notes (shared connection pool, hard dependency). I'd recommend addressing the security and dependency points before merging; the rest can be follow-ups.

- Changed PostgreSQL service image to pgvector/pgvector:pg16 for improved compatibility.
- Modified embedding maintenance task to use UUIDs for batch prefixes instead of timestamps, ensuring uniqueness.
- Updated datetime handling in PgVectorClient to use UTC for consistency across time-related operations.
- Enhanced SQL commands in PgVectorClient and validation service to properly quote identifiers, improving SQL safety and correctness.
- Added SQL command to create the vector extension during the database initialization process in the test engine function, ensuring the extension is available for use.
@LeslieLeung
Copy link
Copy Markdown
Owner Author

@claude Review this pr.

@claude
Copy link
Copy Markdown

claude Bot commented Mar 11, 2026

Claude finished @LeslieLeung's task in 2m 35s —— View job


PR Review: feat: introduce pgvector backend support

  • Gather context and understand PR scope
  • Read core implementation files
  • Review models and migration
  • Review config and factory pattern
  • Review worker tasks
  • Post final review

This is a solid PR that adds pgvector as a lightweight alternative to Milvus. The VectorStoreClient Protocol abstraction is clean and the bundled correctness fixes (SELECT FOR UPDATE SKIP LOCKED, post-commit embedding enqueue, distributed rebuild lock, stuck-entry recovery) are real improvements. The previous review comments have been partially addressed. Here's a fresh look at what remains.


What Was Fixed Since Previous Review

  • datetime.now()datetime.now(UTC) in pgvector_client.py
  • Job ID collision in embedding_maintenance.py now uses uuid4()
  • SQL identifier quoting in pgvector_client.py uses _quote_ident()
  • Table existence checks in validate_pgvector now use bound parameters via to_regclass(:table_name)
  • Both docker-compose.dev.yml and docker-compose.yml switched to pgvector/pgvector:pg16, so the migration's CREATE EXTENSION IF NOT EXISTS vector won't fail for Milvus users ✓

Remaining Issues

Bug: Naive datetime.now() in embedding_service.py

embedding_service.py:159 still uses timezone-naive datetime:

embedding_at=datetime.now(),   # ← should be datetime.now(UTC)

Every other datetime call in this PR was updated to datetime.now(UTC), but this one was missed. This is inconsistent and could cause timezone-related issues for timezone-aware comparisons. Fix this →


Inconsistency: time.time() Still Used in embedding_rebuild.py

embedding_rebuild.py:168:

batch_prefix = f"rebuild_{int(time.time())}"

embedding_maintenance.py was correctly fixed to use uuid4().hex for job ID uniqueness, but the same fix wasn't applied to the rebuild task. If a rebuild is triggered twice within the same second (e.g., under high load or in tests), job IDs will collide and arq will silently deduplicate them. Fix this →


Overly Broad Error Matching in _is_duplicate_feed_guid_error

feed_fetcher.py:25-31:

def _is_duplicate_feed_guid_error(error: IntegrityError) -> bool:
    error_text = str(error)
    return (
        "uq_feed_guid" in error_text
        or "duplicate key value violates unique constraint" in error_text  # ← too broad
    )

The second branch matches any unique constraint violation on the table (e.g., a future uq_feed_url constraint), not just uq_feed_guid. A non-GUID duplicate would be silently swallowed as if it were a normal concurrent-insert race. Match on the specific constraint name only:

return "uq_feed_guid" in str(error.orig)

Fix this →


Remaining SQL Identifier Interpolation in validate_pgvector

validation_service.py:314-322:

_quoted_table = '"' + pgvector_config.metadata_table.replace('"', '""') + '"'
rows = await conn.execute(
    text(  # noqa: S608
        f"SELECT name, model_signature FROM {_quoted_table}"
        " WHERE name IN ('entries', 'preferences')"
    )
)

The double-quoting is correct for identifier safety, but the # noqa: S608 suppression is still needed, which indicates the linter still sees a concern. Now that _quote_ident() exists in pgvector_client.py, consider importing and reusing it here, or moving it to a shared utility. The logic is duplicated (the inline quoting vs. the _quote_ident function).


pgvector as Hard Dependency; Dead try/except Fallbacks

Both glean_database/pyproject.toml and glean_vector/pyproject.toml declare pgvector>=0.3.6 as a required dependency. The try/except ImportError guards in the model files then become dead code that creates confusion:

The previous review suggested making pgvector optional. If you're keeping it as a hard dependency (acceptable given the migration always creates the extension), remove the dead fallback code to avoid confusion.


No Vector Similarity Index (Acknowledged, Needs Docs)

The migration comment acknowledges that no HNSW index is created, leaving all search_similar_entries calls as sequential full-table scans. The downgrade migration even has DROP INDEX IF EXISTS idx_entry_embeddings_embedding_hnsw, showing this was planned.

At minimum, add the recommended index creation command to DEPLOY.md so operators know to run it after initial setup:

-- Run after migrations; adjust m/ef_construction for your dataset size
CREATE INDEX IF NOT EXISTS idx_entry_embeddings_embedding_hnsw
ON entry_embeddings USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

PgVectorClient Independent Connection Pool

When PGVECTOR_DATABASE_URL is unset, the client falls back to DATABASE_URL, which is the main app database. Two separate async connection pools then run against the same PostgreSQL instance — doubling connection overhead and making connection-limit tuning harder. This is an architectural note rather than a blocker, but worth documenting explicitly in .env.example or DEPLOY.md.


Minimal Test Coverage for PgVectorClient

test_pgvector_client.py only tests two default-state properties. Given this is the core of the new backend, consider adding at least basic integration tests (mocked engine) for:

  • ensure_collections idempotency
  • insert_entry_embedding upsert behavior
  • search_similar_entries with and without filters

Summary Table

Issue Severity Status
datetime.now() without UTC in embedding_service.py:159 Bug Unresolved
time.time() job prefix in embedding_rebuild.py:168 Minor bug Unresolved (fixed in maintenance only)
_is_duplicate_feed_guid_error overly broad match Bug Unresolved
SQL identifier quoting (_quote_ident duplication) Minor Partially resolved
pgvector hard dep + dead try/except fallbacks Architecture Unresolved
No HNSW similarity index Performance Acknowledged, needs docs
Dual connection pool Architecture Unresolved
Minimal PgVectorClient tests Testing Unresolved
VECTOR_BACKEND unvalidated string DX Unresolved

The first three items (datetime bug, job ID inconsistency, error matching) are straightforward fixes. Everything else can be addressed in follow-up issues. The overall implementation quality is good and the PR is close to mergeable.

- Modified datetime imports and usage in MilvusClient, EmbeddingService, and PreferenceService to ensure consistent UTC handling for timestamps.
@LeslieLeung
Copy link
Copy Markdown
Owner Author

@codex Review this.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4df65ca333

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +27 to +30
if Vector is None:
raise RuntimeError("pgvector is required to run this migration")

op.execute("CREATE EXTENSION IF NOT EXISTS vector")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Don't require pgvector in every schema upgrade

This Alembic revision runs for every deployment, including existing Milvus installs, but upgrade() now unconditionally imports pgvector and executes CREATE EXTENSION vector. On any PostgreSQL instance that does not have the extension package installed or does not grant CREATE EXTENSION to the app role, the entire migration chain stops here even though the operator is not using the pgvector backend at all.

Useful? React with 👍 / 👎.

Comment on lines +245 to +247
vector_client = create_vector_store_client()
vector_client.connect()
await vector_client.ensure_collections(config.dimension, config.provider, config.model)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid schema initialization in the per-request score dependency

get_score_service() is invoked on every GET /entries request, and with the new pgvector backend this ensure_collections() call is not read-only: a fresh PgVectorClient starts with _schema_ensured=False and _last_model_signature=None, so ensure_collections() reruns the DDL checks and upserts vector_store_metadata before scoring. In pgvector mode that turns an ordinary read path into a write/DDL path, which will fail on read-only replicas or roles without extension/schema privileges.

Useful? React with 👍 / 👎.

Comment on lines +208 to +210
},
)
continue
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve last_entry_at when a concurrent insert loses the race

In the new duplicate-savepoint path, a concurrent fetch that loses the uq_feed_guid race immediately continues here, so the later latest_entry_time update is skipped for that parsed item. Because latest_entry_time was initialized from this transaction's stale feed.last_entry_at, that worker can commit after the winner and overwrite the newer timestamp, causing the feed to keep revisiting already-ingested entries.

Useful? React with 👍 / 👎.

Move vector collection initialization out of the request dependency so score service resolution reuses the app-scoped client without triggering request-time schema setup.

Made-with: Cursor
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ad6e4a6fa9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +286 to +287
for entry_id in pending_embedding_ids:
await ctx["redis"].enqueue_job("generate_entry_embedding", entry_id)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve embedding job enqueue on post-commit failures

This loop enqueues embedding jobs only after the DB transaction has already committed, so a transient Redis failure here raises the task after entries are persisted but before their embedding jobs are queued. On retry, those entries are now treated as existing and skipped earlier in fetch_feed_task, so they never get re-enqueued for embedding and can remain permanently unembedded (and therefore unscored).

Useful? React with 👍 / 👎.

Comment on lines +347 to +351
success=True,
message="pgvector connection successful",
details={
"database_url_configured": True,
"vector_extension_installed": has_extension,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Fail pgvector validation when vector extension is missing

validate_pgvector() always returns success=True once it can connect, even when vector_extension_installed is false, so validate_and_rebuild_embeddings() treats the backend as valid and proceeds. In environments where the role cannot create/install vector, this produces a false-positive validation and then fails later during collection creation, rather than surfacing the backend incompatibility during validation.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant