eth-library/CHNOBLi-vectordb

VectorDB Operations

A robust toolkit for managing Milvus vector database operations, featuring high-performance bulk data utilities, automated collection management, and a local development sandbox.

Key Features

  • Pythonic API: Clean interface for common Milvus tasks (storing, querying, and managing embeddings).
  • High-Performance CLI: milvus_dump utility for fast export/import using Parquet files.
  • Memory Efficient: Built-in support for --mmap to optimize memory usage for large indices.
  • Validation: Pydantic-powered configuration and metadata validation.
  • Developer Sandbox: Pre-configured Docker Compose environment (Milvus + Attu + MinIO) for rapid testing.
  • Rich Observability: CLI output with real-time progress bars and styled logging.

Installation

Using uv (Recommended)

uv sync
source .venv/bin/activate

Using pip

python3.12 -m venv .venv # Windows: py -3.12 -m venv .venv
source .venv/bin/activate # Windows:  .venv\Scripts\activate
pip install -e .

Configuration

The package uses Pydantic for configuration. It automatically loads baseline values from config/env.example and overrides them with any values found in a .env file in the project root.

Variable          Default                 Description
MILVUS_HOST       localhost               Milvus server address
MILVUS_PORT       19530                   Milvus server port
MINIO_ENDPOINT    http://localhost:9000   Object storage for bulk writer
MINIO_ACCESS_KEY  minioadmin              MinIO access key
MINIO_SECRET_KEY  minioadmin              MinIO secret key

Note: The milvus_dump CLI utility uses these values as the defaults for its arguments. To see the currently active Milvus connection (host:port), run milvus_dump {export,import} --help.
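
The layering described in this section (baseline values from config/env.example, overridden by a .env file) can be illustrated with a minimal standard-library sketch. This is not the package's actual implementation, which uses Pydantic; parse_env_file and load_layered_config are hypothetical names chosen for illustration only.

```python
from pathlib import Path


def parse_env_file(path):
    """Parse KEY=VALUE lines from a dotenv-style file; a missing file yields {}."""
    values = {}
    file_path = Path(path)
    if not file_path.is_file():
        return values
    for raw_line in file_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_layered_config(baseline_path="config/env.example", override_path=".env"):
    """Return the baseline values with any overrides applied on top."""
    config = parse_env_file(baseline_path)
    config.update(parse_env_file(override_path))
    return config
```

With this sketch, a key present in both files takes its value from .env, and a key present only in config/env.example keeps its baseline value.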

Local Sandbox (Docker Compose)

A docker-compose.yml is included to spin up a fully self-contained Milvus environment for local development and testing. It starts:

Service  Description                        Port
etcd     Metadata store for Milvus          -
MinIO    Object storage (S3 API & Console)  9000 (API), 9001 (console)
Milvus   Vector database (standalone)       19530
Attu     Web UI for browsing collections    8000

Start the sandbox

docker compose up -d

Wait until Milvus is healthy (usually ~30 s):

docker compose ps          # all services should show "healthy" or "running"
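
For scripted setups, the manual check can be complemented by polling the Milvus port until it accepts connections. The following is a minimal sketch using only the standard library; wait_for_port is a hypothetical helper, not part of this package, and an open TCP port is only a rough proxy for the container's "healthy" status.

```python
import socket
import time


def wait_for_port(host="localhost", port=19530, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not reachable yet (refused, timed out, ...): wait and retry.
            time.sleep(interval)
    return False
```

Calling wait_for_port() with the defaults waits up to a minute for localhost:19530, which matches the sandbox port listed above.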

Open the Attu web UI at http://localhost:8000 — connect to milvus:19530 (no credentials needed for the sandbox).

Stop and clean up

docker compose down        # stop containers, keep volumes
docker compose down -v     # stop containers and delete all data

Library Usage (Python API)

The vector_db module provides a high-level wrapper for common operations.

import numpy as np
from vectordb_operations.vector_db import (
    create_collection,
    store_embedding,
    store_embedding_bulk,
    query_embedding,
    query_similar_by_vector,
    get_all_ids_in_namespace,
)

# --- create a collection with 384-dimensional embeddings ---
collection = create_collection("my_collection", dim=384)

# --- store a single embedding ---
vec = np.random.rand(384).tolist()
store_embedding("doc-001", vec, collection)

# --- store multiple embeddings at once ---
ids   = [f"doc-{i:03d}" for i in range(2, 1001)]
vecs  = [np.random.rand(384).tolist() for _ in ids]
store_embedding_bulk(ids, vecs, collection)

# --- retrieve embeddings by ID ---
result = query_embedding(["doc-001", "doc-002"], "my_collection")
print(result["text_ids"])       # ['doc-001', 'doc-002']
print(len(result["embeddings"])) # 2

# --- find top-5 nearest neighbours for a query vector ---
query_vec = np.random.rand(384).tolist()
hits = query_similar_by_vector(query_vec, "my_collection", top_k=5)
for hit in hits:
    print(hit["text_id"], hit["distance"])

# --- list all IDs stored in the collection ---
all_ids = get_all_ids_in_namespace("my_collection")
print(f"{len(all_ids)} embeddings in collection")
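
When the number of embeddings is much larger than in the example above, it may be convenient to feed store_embedding_bulk in fixed-size chunks. The sketch below shows one way to split parallel ID and vector lists; iter_batches is a hypothetical helper, and whether store_embedding_bulk already batches internally is not documented here, so treat this as optional.

```python
def iter_batches(ids, vectors, batch_size=1000):
    """Yield (ids, vectors) chunks of at most batch_size items each."""
    if len(ids) != len(vectors):
        raise ValueError("ids and vectors must have the same length")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        yield ids[start:end], vectors[start:end]


# Hypothetical usage with the API shown above:
# for id_batch, vec_batch in iter_batches(ids, vecs, batch_size=500):
#     store_embedding_bulk(id_batch, vec_batch, collection)
```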

CLI Utility: milvus_dump

The milvus_dump utility handles bulk data mobility.

Export a Collection

milvus_dump export -c my_collection -o ./my_dumps --mmap

Import a Collection

milvus_dump import -d ./my_dumps/my_collection_dump -m ./my_dumps/metadata.json # add --drop-existing to force-replace an existing collection

To import the data for CHNOBLi, download the data here, unzip the files, and move the parquet files and the metadata JSON into a single folder gnd_de_snowflakearctic. Then use the import functionality.

The import uses MinIO's S3 API port. If this port has not been exposed from the container, you can find the container's address by running:

# Get IP for the milvus database
docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' milvus-standalone
# Get IP for MinIO
docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' milvus-minio

where milvus-standalone and milvus-minio are the container names. The output will look like 172.##.#.## and can then be added to the .env file.
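
As a small illustration, the address printed by docker inspect can be turned into a MINIO_ENDPOINT value in the same form as the default from the configuration table. This is a sketch only; minio_endpoint_from_ip is a hypothetical helper, and port 9000 is assumed from the sandbox defaults.

```python
import ipaddress


def minio_endpoint_from_ip(ip, port=9000, scheme="http"):
    """Validate an IP address string and format it as an endpoint URL."""
    # ip_address raises ValueError for anything that is not a valid IP.
    address = ipaddress.ip_address(ip.strip())
    # IPv6 literals must be wrapped in brackets inside a URL.
    host = f"[{address}]" if address.version == 6 else str(address)
    return f"{scheme}://{host}:{port}"
```

The returned string can be written to the .env file as the value of MINIO_ENDPOINT.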

Enable Autocomplete

eval "$(register-python-argcomplete milvus_dump)"

Stability & Performance

  • Memory Efficiency (--mmap): Using the --mmap flag allows Milvus to map index files from disk instead of loading them into the heap. This is highly recommended for large collections to prevent memory spikes.
  • Rich Logging: Uses the rich library for real-time progress tracking and clear status markers.
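
To illustrate the general idea behind memory mapping (not how Milvus implements it internally), the sketch below reads a single value from a binary file through Python's mmap module instead of loading the whole file into memory. The file layout (little-endian float64 values) and the read_float_at helper are assumptions made for this example.

```python
import mmap
import struct

FLOAT_SIZE = 8  # bytes per little-endian float64


def read_float_at(path, index):
    """Read the index-th float64 from a binary file via a read-only memory map."""
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            offset = index * FLOAT_SIZE
            if index < 0 or offset + FLOAT_SIZE > len(mapped):
                raise IndexError("index out of range for mapped file")
            # Only the pages backing this slice need to be brought into memory.
            return struct.unpack("<d", mapped[offset:offset + FLOAT_SIZE])[0]
```

The operating system pages data in on demand, which is why mapped access tends to keep resident memory lower than reading an entire file up front.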

Development & Testing

Run the test suite using pytest:

pytest

FAQ

"error during connect"

If you're on Windows, make sure Docker Desktop is running before you use docker compose; it has to be started manually.
