🐛 Bug fixes
NULL vector handling bug
Bug description
In PyTiDB 0.0.13, to address the NULL vector issue, the client automatically appended a clause such as HAVING embedding IS NOT NULL to filter out NULL vectors. However, this prevented vector search queries from using the vector index.
Bug fix
PyTiDB 0.0.14 introduces the following changes:
- NULL vector filtering is disabled by default.
- A .skip_null_vectors(True) option is provided, allowing developers to control whether NULL vectors should be filtered.
- To keep filters from making vector indexes ineffective, PyTiDB now uses post-filtering mode by default for vector search:
  - The ANN query is executed in the inner subquery.
  - Filtering is applied in the outer query.
In PyTiDB 0.0.13, the NULL vector filtering condition was placed in the inner query, which caused the Vector Index to be bypassed. In PyTiDB 0.0.14, the filtering is moved to the outer query.
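The post-filtering shape described above can be sketched as a small SQL builder. This is an illustrative sketch only, not the SQL that PyTiDB actually generates: `build_search_sql`, the `chunks` table, the `_distance` and `candidates` aliases, and the choice of `VEC_COSINE_DISTANCE` as the distance function are all assumptions made for the example.

```python
import json


def build_search_sql(table, column, query_vector, k, skip_null_vectors=False):
    """Return a post-filtering vector search statement as a SQL string.

    Hypothetical helper for illustration only; it is not part of PyTiDB.
    Values are interpolated directly to keep the sketch short; real code
    should use parameter binding instead.
    """
    vec = json.dumps(query_vector, separators=(",", ":"))
    # Inner subquery: a plain ANN search (ORDER BY distance + LIMIT) with no
    # extra predicate, so the vector index can still be used.
    inner = (
        f"SELECT *, VEC_COSINE_DISTANCE({column}, '{vec}') AS _distance "
        f"FROM {table} ORDER BY _distance ASC LIMIT {k}"
    )
    sql = f"SELECT * FROM ({inner}) AS candidates"
    # Outer query: the NULL filter is applied to the candidates afterwards.
    if skip_null_vectors:
        sql += f" WHERE {column} IS NOT NULL"
    return sql


default_sql = build_search_sql("chunks", "embedding", [1, 2, 3], k=5)
filtered_sql = build_search_sql(
    "chunks", "embedding", [1, 2, 3], k=5, skip_null_vectors=True
)
print(filtered_sql)
```

One consequence of this design is that the filter runs after LIMIT, so a filtered search may return fewer than k rows when some of the nearest candidates have NULL vectors.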
What is the NULL Vector issue?
In real-world RAG applications, the vector column is often populated asynchronously by an embedding process after the database record has been created. Until the embedding completes, the vector column is NULL.
ANN queries are typically executed with ORDER BY … ASC. Because MySQL semantics sort NULL values before all non-NULL values in ascending order, rows with NULL vectors can rank ahead of genuine matches, so a large number of them can crowd relevant rows out of the returned results.
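The ordering effect can be reproduced with a small, self-contained sketch. It uses Python's built-in sqlite3 module as a stand-in, since SQLite also places NULL first in an ascending sort. The `chunks` table and its precomputed `distance` column are assumptions for illustration; in a real vector search the distance would be computed from the vector column at query time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, distance REAL)")
# `distance` stands in for the result of a vector distance function;
# NULL models a row whose embedding has not been written yet.
conn.executemany(
    "INSERT INTO chunks (id, distance) VALUES (?, ?)",
    [(1, 0.12), (2, None), (3, 0.05), (4, None), (5, 0.30)],
)

# Without a filter, the NULL rows sort first and take up top-k slots.
top3 = conn.execute(
    "SELECT id, distance FROM chunks ORDER BY distance ASC, id ASC LIMIT 3"
).fetchall()

# With the NULL rows filtered out, only real matches are returned.
filtered_top3 = conn.execute(
    "SELECT id, distance FROM chunks WHERE distance IS NOT NULL "
    "ORDER BY distance ASC, id ASC LIMIT 3"
).fetchall()

print(top3)
print(filtered_top3)
conn.close()
```

In this toy data set, two of the three unfiltered results are rows without an embedding, which is the degradation described above.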
📝 Documentation & Examples
- docs: add vector index example by @Mini256 in #258
- docs: add example of vector search with realtime data by @Icemap in #199
- docs: use tidb_client.db_engine in README example (fixes #193) #195 by @haseebpvt in #196
New Contributors
- @haseebpvt made their first contribution in #196
Full Changelog: v0.0.13...v0.0.14