feat: introduce pgvector backend support and update deployment config by LeslieLeung · Pull Request #121 · LeslieLeung/glean

LeslieLeung · 2026-02-25T02:12:26Z

Added support for PostgreSQL's pgvector extension as an alternative vector backend.
Updated .env.example to include configuration options for pgvector.
Modified deployment documentation to reflect the new pgvector setup.
Created new docker-compose files for pgvector development and deployment.
Refactored existing code to accommodate vector backend selection, enhancing flexibility for users.

…urations - Added support for PostgreSQL's pgvector extension as an alternative vector backend. - Updated .env.example to include configuration options for pgvector. - Modified deployment documentation to reflect the new pgvector setup. - Created new docker-compose files for pgvector development and deployment. - Refactored existing code to accommodate vector backend selection, enhancing flexibility for users.

claude · 2026-02-25T02:15:07Z

PR Review: feat: introduce pgvector backend support

This is a solid, well-structured PR that adds pgvector as an alternative vector backend while keeping Milvus support. The abstraction layer via VectorStoreClient Protocol is clean, and several important correctness fixes are bundled in (SELECT FOR UPDATE SKIP LOCKED, post-commit embedding enqueue, distributed rebuild lock, stuck-entry recovery). Overall the approach is sound. Here are my findings:

Security

SQL Injection risk in validate_pgvector (validation_service.py)

text(
    f"SELECT name, model_signature "  # noqa: S608
    f"FROM {pgvector_config.metadata_table} "
    "WHERE name IN ('entries', 'preferences')"
)

The # noqa: S608 comment suppresses the linter warning but the underlying concern is real. pgvector_config.metadata_table comes from an env var, so it is not direct user input, but it's still an interpolated string in raw SQL. If the env var is misconfigured it could cause unexpected query behavior. Prefer sqlalchemy.text with a bound parameter for the table name, or quote the identifier explicitly with asyncpg's quote_ident.

The same pattern appears in recreate_collections with bare DROP TABLE IF EXISTS {table.name}. PostgreSQL identifier quoting ("table_name") would be safer here.

Dependency Concern

pgvector is now a hard dependency even for Milvus-only deployments

Both glean_database/pyproject.toml and glean_vector/pyproject.toml add pgvector>=0.3.6 as a required dependency. This means users who only use Milvus now pull in the pgvector Python package. Consider making it an optional dependency:

# In pyproject.toml
[project.optional-dependencies]
pgvector = ["pgvector>=0.3.6"]

The current try/except ImportError fallbacks in the model files (entry_embedding.py, user_preference_vector.py) suggest this was originally intended to be optional. Since pgvector is now unconditionally required, those fallbacks are dead code and create confusion about whether the package is actually required.

Missing Input Validation

VECTOR_BACKEND is an unvalidated string

VectorBackendConfig.backend is a plain str with no enum/validator. An invalid value (e.g., VECTOR_BACKEND=clickhouse) will pass startup silently and only fail at runtime when create_vector_store_client() raises ValueError. Catching it early with a Pydantic validator would give a clearer error message at startup:

from pydantic import field_validator

@field_validator("backend")
@classmethod
def validate_backend(cls, v: str) -> str:
    allowed = {"milvus", "pgvector"}
    if v.lower() not in allowed:
        raise ValueError(f"VECTOR_BACKEND must be one of {allowed}, got '{v}'")
    return v.lower()

Performance / Correctness

No vector index on entry_embeddings.embedding

The migration comment acknowledges this:

HNSW index requires fixed-dimension vector columns. We intentionally keep vector columns dimension-agnostic here.

This means all similarity searches via search_similar_entries will do sequential full-table scans until an HNSW or IVFFlat index is created manually. At even modest scale (tens of thousands of entries) this will be slow. Consider:

Documenting the recommended post-migration index creation command in DEPLOY.md.
Or creating the index in ensure_collections after create_all, using the actual dimension (which is available at that point) — e.g.:

CREATE INDEX IF NOT EXISTS idx_entry_embeddings_embedding_hnsw
ON entry_embeddings USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

The downgrade migration already references DROP INDEX IF EXISTS idx_entry_embeddings_embedding_hnsw, so this was clearly planned but not included in the upgrade.

PgVectorClient creates a second independent connection pool

When pgvector uses the same database as the main app (the default fallback to DATABASE_URL), two separate connection pools will be maintained. This doubles connection overhead and complicates observability. Consider accepting an existing engine/session-maker as a constructor argument, or at minimum documenting the trade-off.

datetime.now() without timezone

In pgvector_client.py:

published_ts = (
    int(published_at.timestamp()) if published_at else int(datetime.now().timestamp())
)

and

now_ts = int(datetime.now().timestamp())

Should use datetime.now(UTC) for consistency with the rest of the codebase (the maintenance task already uses datetime.now(UTC)).

Job ID collision risk in _re_enqueue_pending

batch_prefix = f"maint_{int(time.time())}"

If the maintenance cron fires multiple times in the same second (unlikely but possible in tests or under clock skew), the job IDs will collide and arq will deduplicate them. Using uuid4() or adding microseconds would be safer.

Minor Issues

Test only exercises happy path / unit behavior for pgvector

test_pgvector_client.py only tests two trivial default-state checks. No tests cover:

ensure_collections idempotency
insert_entry_embedding / upsert behavior
search_similar_entries filter logic
recreate_collections dropping and recreating tables

Given that this is the core of the new backend, integration tests (even with a mocked engine) would significantly improve confidence.

_is_duplicate_feed_guid_error is fragile

def _is_duplicate_feed_guid_error(error: IntegrityError) -> bool:
    error_text = str(error)
    return (
        "uq_feed_guid" in error_text
        or "duplicate key value violates unique constraint" in error_text
    )

The second branch ("duplicate key value violates unique constraint") matches any unique constraint violation on the table, not just uq_feed_guid. This could mask legitimate constraint errors. Match on the constraint name only:

return "uq_feed_guid" in str(error.orig)

What's Done Well

The VectorStoreClient Protocol is a clean abstraction — both backends become interchangeable without changing calling code.
SELECT FOR UPDATE SKIP LOCKED in batch/retry operations is exactly the right pattern for multi-worker correctness.
The post-commit embedding enqueue fix in feed_fetcher.py eliminates a real race condition.
Distributed rebuild lock with Redis prevents duplicate concurrent rebuilds.
_recover_stuck_entries with a generous threshold is a solid resilience mechanism.
The _safe_handle_error / _mark_failed_safe helpers properly handle asyncpg's failed-transaction state.
Bilingual documentation is well-maintained and accurate.
Factory tests (test_vector_factory.py) and recovery tests (test_vector_client_recovery.py) are good additions.

The issues above range from a real security concern (SQL identifier interpolation) to architectural notes (shared connection pool, hard dependency). I'd recommend addressing the security and dependency points before merging; the rest can be follow-ups.

- Changed PostgreSQL service image to pgvector/pgvector:pg16 for improved compatibility. - Modified embedding maintenance task to use UUIDs for batch prefixes instead of timestamps, ensuring uniqueness. - Updated datetime handling in PgVectorClient to use UTC for consistency across time-related operations. - Enhanced SQL commands in PgVectorClient and validation service to properly quote identifiers, improving SQL safety and correctness.

- Added SQL command to create the vector extension during the database initialization process in the test engine function, ensuring the extension is available for use.

LeslieLeung · 2026-03-11T04:56:08Z

@claude Review this pr.

claude · 2026-03-11T04:56:25Z

Claude finished @LeslieLeung's task in 2m 35s —— View job

PR Review: feat: introduce pgvector backend support

This is a solid PR that adds pgvector as a lightweight alternative to Milvus. The VectorStoreClient Protocol abstraction is clean and the bundled correctness fixes (SELECT FOR UPDATE SKIP LOCKED, post-commit embedding enqueue, distributed rebuild lock, stuck-entry recovery) are real improvements. The previous review comments have been partially addressed. Here's a fresh look at what remains.

What Was Fixed Since Previous Review

datetime.now() → datetime.now(UTC) in pgvector_client.py ✓
Job ID collision in embedding_maintenance.py now uses uuid4() ✓
SQL identifier quoting in pgvector_client.py uses _quote_ident() ✓
Table existence checks in validate_pgvector now use bound parameters via to_regclass(:table_name) ✓
Both docker-compose.dev.yml and docker-compose.yml switched to pgvector/pgvector:pg16, so the migration's CREATE EXTENSION IF NOT EXISTS vector won't fail for Milvus users ✓

Remaining Issues

Bug: Naive `datetime.now()` in `embedding_service.py`

embedding_service.py:159 still uses timezone-naive datetime:

embedding_at=datetime.now(),   # ← should be datetime.now(UTC)

Every other datetime call in this PR was updated to datetime.now(UTC), but this one was missed. This is inconsistent and could cause timezone-related issues for timezone-aware comparisons. Fix this →

Inconsistency: `time.time()` Still Used in `embedding_rebuild.py`

embedding_rebuild.py:168:

batch_prefix = f"rebuild_{int(time.time())}"

embedding_maintenance.py was correctly fixed to use uuid4().hex for job ID uniqueness, but the same fix wasn't applied to the rebuild task. If a rebuild is triggered twice within the same second (e.g., under high load or in tests), job IDs will collide and arq will silently deduplicate them. Fix this →

Overly Broad Error Matching in `_is_duplicate_feed_guid_error`

feed_fetcher.py:25-31:

def _is_duplicate_feed_guid_error(error: IntegrityError) -> bool:
    error_text = str(error)
    return (
        "uq_feed_guid" in error_text
        or "duplicate key value violates unique constraint" in error_text  # ← too broad
    )

The second branch matches any unique constraint violation on the table (e.g., a future uq_feed_url constraint), not just uq_feed_guid. A non-GUID duplicate would be silently swallowed as if it were a normal concurrent-insert race. Match on the specific constraint name only:

return "uq_feed_guid" in str(error.orig)

Fix this →

Remaining SQL Identifier Interpolation in `validate_pgvector`

validation_service.py:314-322:

_quoted_table = '"' + pgvector_config.metadata_table.replace('"', '""') + '"'
rows = await conn.execute(
    text(  # noqa: S608
        f"SELECT name, model_signature FROM {_quoted_table}"
        " WHERE name IN ('entries', 'preferences')"
    )
)

The double-quoting is correct for identifier safety, but the # noqa: S608 suppression is still needed, which indicates the linter still sees a concern. Now that _quote_ident() exists in pgvector_client.py, consider importing and reusing it here, or moving it to a shared utility. The logic is duplicated (the inline quoting vs. the _quote_ident function).

`pgvector` as Hard Dependency; Dead `try/except` Fallbacks

Both glean_database/pyproject.toml and glean_vector/pyproject.toml declare pgvector>=0.3.6 as a required dependency. The try/except ImportError guards in the model files then become dead code that creates confusion:

entry_embedding.py:12-17: VECTOR_TYPE = Vector() if Vector is not None else JSON — the JSON fallback is unreachable given the hard dep.
user_preference_vector.py: same pattern.

The previous review suggested making pgvector optional. If you're keeping it as a hard dependency (acceptable given the migration always creates the extension), remove the dead fallback code to avoid confusion.

No Vector Similarity Index (Acknowledged, Needs Docs)

The migration comment acknowledges that no HNSW index is created, leaving all search_similar_entries calls as sequential full-table scans. The downgrade migration even has DROP INDEX IF EXISTS idx_entry_embeddings_embedding_hnsw, showing this was planned.

At minimum, add the recommended index creation command to DEPLOY.md so operators know to run it after initial setup:

-- Run after migrations; adjust m/ef_construction for your dataset size
CREATE INDEX IF NOT EXISTS idx_entry_embeddings_embedding_hnsw
ON entry_embeddings USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

`PgVectorClient` Independent Connection Pool

When PGVECTOR_DATABASE_URL is unset, the client falls back to DATABASE_URL, which is the main app database. Two separate async connection pools then run against the same PostgreSQL instance — doubling connection overhead and making connection-limit tuning harder. This is an architectural note rather than a blocker, but worth documenting explicitly in .env.example or DEPLOY.md.

Minimal Test Coverage for `PgVectorClient`

test_pgvector_client.py only tests two default-state properties. Given this is the core of the new backend, consider adding at least basic integration tests (mocked engine) for:

ensure_collections idempotency
insert_entry_embedding upsert behavior
search_similar_entries with and without filters

Summary Table

Issue	Severity	Status
`datetime.now()` without UTC in `embedding_service.py:159`	Bug	Unresolved
`time.time()` job prefix in `embedding_rebuild.py:168`	Minor bug	Unresolved (fixed in maintenance only)
`_is_duplicate_feed_guid_error` overly broad match	Bug	Unresolved
SQL identifier quoting (`_quote_ident` duplication)	Minor	Partially resolved
`pgvector` hard dep + dead `try/except` fallbacks	Architecture	Unresolved
No HNSW similarity index	Performance	Acknowledged, needs docs
Dual connection pool	Architecture	Unresolved
Minimal `PgVectorClient` tests	Testing	Unresolved
`VECTOR_BACKEND` unvalidated string	DX	Unresolved

The first three items (datetime bug, job ID inconsistency, error matching) are straightforward fixes. Everything else can be addressed in follow-up issues. The overall implementation quality is good and the PR is close to mergeable.

- Modified datetime imports and usage in MilvusClient, EmbeddingService, and PreferenceService to ensure consistent UTC handling for timestamps.

LeslieLeung · 2026-03-19T12:23:44Z

@codex Review this.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4df65ca333

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-19T12:31:04Z

+    if Vector is None:
+        raise RuntimeError("pgvector is required to run this migration")
+
+    op.execute("CREATE EXTENSION IF NOT EXISTS vector")


Don't require pgvector in every schema upgrade

This Alembic revision runs for every deployment, including existing Milvus installs, but upgrade() now unconditionally imports pgvector and executes CREATE EXTENSION vector. On any PostgreSQL instance that does not have the extension package installed or does not grant CREATE EXTENSION to the app role, the entire migration chain stops here even though the operator is not using the pgvector backend at all.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-03-19T12:31:04Z

+        vector_client = create_vector_store_client()
+        vector_client.connect()
+        await vector_client.ensure_collections(config.dimension, config.provider, config.model)


Avoid schema initialization in the per-request score dependency

get_score_service() is invoked on every GET /entries request, and with the new pgvector backend this ensure_collections() call is not read-only: a fresh PgVectorClient starts with _schema_ensured=False and _last_model_signature=None, so ensure_collections() reruns the DDL checks and upserts vector_store_metadata before scoring. In pgvector mode that turns an ordinary read path into a write/DDL path, which will fail on read-only replicas or roles without extension/schema privileges.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-03-19T12:31:04Z

+                            },
+                        )
+                        continue


Preserve last_entry_at when a concurrent insert loses the race

In the new duplicate-savepoint path, a concurrent fetch that loses the uq_feed_guid race immediately continues here, so the later latest_entry_time update is skipped for that parsed item. Because latest_entry_time was initialized from this transaction's stale feed.last_entry_at, that worker can commit after the winner and overwrite the newer timestamp, causing the feed to keep revisiting already-ingested entries.

Useful? React with 👍 / 👎.

Move vector collection initialization out of the request dependency so score service resolution reuses the app-scoped client without triggering request-time schema setup. Made-with: Cursor

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ad6e4a6fa9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-08T20:09:14Z

+    for entry_id in pending_embedding_ids:
+        await ctx["redis"].enqueue_job("generate_entry_embedding", entry_id)


Preserve embedding job enqueue on post-commit failures

This loop enqueues embedding jobs only after the DB transaction has already committed, so a transient Redis failure here raises the task after entries are persisted but before their embedding jobs are queued. On retry, those entries are now treated as existing and skipped earlier in fetch_feed_task, so they never get re-enqueued for embedding and can remain permanently unembedded (and therefore unscored).

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-04-08T20:09:14Z

+                    success=True,
+                    message="pgvector connection successful",
+                    details={
+                        "database_url_configured": True,
+                        "vector_extension_installed": has_extension,


Fail pgvector validation when vector extension is missing

validate_pgvector() always returns success=True once it can connect, even when vector_extension_installed is false, so validate_and_rebuild_embeddings() treats the backend as valid and proceeds. In environments where the role cannot create/install vector, this produces a false-positive validation and then fails later during collection creation, rather than surfacing the backend incompatibility during validation.

Useful? React with 👍 / 👎.

LeslieLeung changed the title ~~feat: introduce pgvector backend support and update deployment config…~~ feat: introduce pgvector backend support and update deployment config Feb 25, 2026

LeslieLeung mentioned this pull request Mar 5, 2026

Pgvectors v.s. Milvus #59

Open

LeslieLeung added 3 commits March 11, 2026 11:55

Merge branch 'main' into feature/pgvector

44ae30b

feat: add PostgreSQL vector extension creation to database setup

9f23044

- Added SQL command to create the vector extension during the database initialization process in the test engine function, ensuring the extension is available for use.

LeslieLeung force-pushed the feature/pgvector branch from 1135683 to 9f23044 Compare March 11, 2026 04:49

Merge branch 'main' into feature/pgvector

784d876

fix: update datetime handling to use UTC across services

4df65ca

- Modified datetime imports and usage in MilvusClient, EmbeddingService, and PreferenceService to ensure consistent UTC handling for timestamps.

chatgpt-codex-connector Bot reviewed Mar 19, 2026

View reviewed changes

LeslieLeung added 3 commits April 9, 2026 03:43

Fix pgvector migration and score client init

48b3aa4

Finalize pgvector backend integration updates

f2dbfcb

fix: initialize vector schema during API startup

ad6e4a6

Move vector collection initialization out of the request dependency so score service resolution reuses the app-scoped client without triggering request-time schema setup. Made-with: Cursor

chatgpt-codex-connector Bot reviewed Apr 8, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: introduce pgvector backend support and update deployment config#121

feat: introduce pgvector backend support and update deployment config#121
LeslieLeung wants to merge 9 commits into
mainfrom
feature/pgvector

LeslieLeung commented Feb 25, 2026 •

edited

Loading

Uh oh!

claude Bot commented Feb 25, 2026

Uh oh!

LeslieLeung commented Mar 11, 2026

Uh oh!

claude Bot commented Mar 11, 2026 •

edited

Loading

Uh oh!

LeslieLeung commented Mar 19, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Mar 19, 2026

Uh oh!

chatgpt-codex-connector Bot Mar 19, 2026

Uh oh!

chatgpt-codex-connector Bot Mar 19, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Apr 8, 2026

Uh oh!

chatgpt-codex-connector Bot Apr 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		for entry_id in pending_embedding_ids:
		await ctx["redis"].enqueue_job("generate_entry_embedding", entry_id)

Conversation

LeslieLeung commented Feb 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

claude Bot commented Feb 25, 2026

PR Review: feat: introduce pgvector backend support

Security

Dependency Concern

Missing Input Validation

Performance / Correctness

Minor Issues

What's Done Well

Uh oh!

LeslieLeung commented Mar 11, 2026

Uh oh!

claude Bot commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: feat: introduce pgvector backend support

What Was Fixed Since Previous Review

Remaining Issues

Bug: Naive datetime.now() in embedding_service.py

Inconsistency: time.time() Still Used in embedding_rebuild.py

Overly Broad Error Matching in _is_duplicate_feed_guid_error

Remaining SQL Identifier Interpolation in validate_pgvector

pgvector as Hard Dependency; Dead try/except Fallbacks

No Vector Similarity Index (Acknowledged, Needs Docs)

PgVectorClient Independent Connection Pool

Minimal Test Coverage for PgVectorClient

Summary Table

Uh oh!

LeslieLeung commented Mar 19, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Mar 19, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Mar 19, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Mar 19, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

LeslieLeung commented Feb 25, 2026 •

edited

Loading

claude Bot commented Mar 11, 2026 •

edited

Loading

Bug: Naive `datetime.now()` in `embedding_service.py`

Inconsistency: `time.time()` Still Used in `embedding_rebuild.py`

Overly Broad Error Matching in `_is_duplicate_feed_guid_error`

Remaining SQL Identifier Interpolation in `validate_pgvector`

`pgvector` as Hard Dependency; Dead `try/except` Fallbacks

`PgVectorClient` Independent Connection Pool

Minimal Test Coverage for `PgVectorClient`