Vectrain is a high-performance vector embedding service built in Go. It provides a flexible and modular pipeline architecture for sourcing data, generating embeddings, and storing them in vector databases. Designed for scalability, Vectrain enables seamless integration of embeddings into semantic search, analytics, and recommendation systems.
- Pipeline Architecture: Modular system with source, embedder, and storage components
- Sources
  - Kafka Integration: Stream processing from Kafka topics
  - REST Integration: Accept data from external services via REST endpoints
- Embedding
  - Vector Embeddings: Generate embeddings using Ollama models
- Storage
  - Qdrant Storage: Store and query vector embeddings in Qdrant database
- Configurable Components: Easily adjust batch sizes and worker counts
- HTTP API: RESTful API for controlling and interacting with the pipeline
Vectrain’s pipeline manages the entire process of transforming raw data into vector embeddings.
Vectrain implements a three-stage pipeline architecture:
- Source - Reads data from configured sources (Kafka or REST)
- Embedder - Generates vector embeddings from the source data
- Storage - Persists embeddings to vector databases (Qdrant)
- Configurable Adapters: When starting the pipeline, you specify the types of `source`, `storage`, and `embedder`. These adapters interface with external services.
- Supported Sources: Currently, Vectrain supports `Kafka` and `REST` services for data ingestion.
- Embedding Services: Currently, the pipeline supports `Ollama` as the embedding service.
  - Since ML models are the main bottleneck, multiple embedder instances can be run in parallel.
  - External embedding services can also be scaled with multiple instances and a load balancer to accelerate processing.
- Vector Database: Currently, Vectrain supports `Qdrant` for storing and querying embeddings.
- The pipeline is started via a REST API request (`http://127.0.0.1/api/start`).
- It connects to the configured source and storage adapters.
- The source retrieves data from the external system (Kafka or REST, depending on configuration). More sources may be added in the future.
- Data is sent to the embedding service API (e.g., Ollama, OpenAI, or other providers).
- Once embeddings are generated, the pipeline stores them in the vector database.
- The process repeats until all messages are processed or the pipeline receives a stop command (`http://127.0.0.1/api/stop`).
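To make the flow concrete, here is a minimal sketch of that loop in Go. The `Message`, `Source`, `Embedder`, and `Storage` shapes below are assumptions made for this example only; the actual interfaces live in `pkg/types` and may differ.

```go
// Illustrative sketch only; the real interfaces in pkg/types may look different.
package main

import "context"

// Message is an assumed unit of work: text to embed plus arbitrary payload fields.
type Message struct {
	Text    string
	Payload map[string]string
}

type Source interface {
	// Next returns the next message, or nil when the source is drained.
	Next(ctx context.Context) (*Message, error)
}

type Embedder interface {
	Embed(ctx context.Context, text string) ([]float32, error)
}

type Storage interface {
	Store(ctx context.Context, vector []float32, payload map[string]string) error
}

// runPipeline mirrors the loop described above: read, embed, store, repeat.
func runPipeline(ctx context.Context, src Source, emb Embedder, st Storage) error {
	for {
		msg, err := src.Next(ctx)
		if err != nil {
			return err
		}
		if msg == nil { // all messages processed, or a stop was requested via ctx
			return nil
		}
		vec, err := emb.Embed(ctx, msg.Text)
		if err != nil {
			return err
		}
		if err := st.Store(ctx, vec, msg.Payload); err != nil {
			return err
		}
	}
}

func main() {} // placeholder so the sketch compiles on its own
```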
- Go 1.24+
- Docker (for running dependencies like Kafka, Ollama, Qdrant)
```bash
git clone https://github.com/torys877/vectrain.git
cd vectrain
go mod tidy
```

Update the example configuration files in `./config` with your endpoints and settings. Create `config/config.yaml`, using `kafka_config.yaml` or `rest_config.yaml` as an example, depending on whether you want to receive data from Kafka or a REST service.
```bash
go run cmd/vectrain/main.go --config=config/config.yaml
```

Or build and run the binary:

```bash
go build -o vectrain cmd/vectrain/main.go
./vectrain --config=config/config.yaml
```

The service exposes HTTP endpoints for controlling the pipeline and retrieving information:

- `GET /api/health`: Check service health
- `POST /api/start`: Start workflow
- `POST /api/stop`: Stop workflow
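As an example, a minimal Go client for the control API might look like this. It assumes the start endpoint takes no request body and that the service listens on `127.0.0.1`, as in the URLs above; response bodies are simply printed.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	base := "http://127.0.0.1" // address used in this README; adjust to your deployment

	// Check service health.
	resp, err := http.Get(base + "/api/health")
	if err != nil {
		panic(err)
	}
	body, _ := io.ReadAll(resp.Body)
	resp.Body.Close()
	fmt.Println("health:", resp.Status, string(body))

	// Start the pipeline (assumed to take an empty body).
	resp, err = http.Post(base+"/api/start", "application/json", nil)
	if err != nil {
		panic(err)
	}
	resp.Body.Close()
	fmt.Println("start:", resp.Status)
}
```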
Each component is pluggable and configurable through the YAML configuration file.
- Implement the `Source` interface from `pkg/types/source.go` in `internal/app/sources` (see the sketch below)
- Implement the `Embedder` interface from `pkg/types/embedder.go` in `internal/app/embedders`
- Implement the `Storage` interface from `pkg/types/storage.go` in `internal/app/storages`
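For a new source, the result might look roughly like the sketch below. The interface shape here is a placeholder written for the example; check the real `Source` definition in `pkg/types/source.go` before implementing.

```go
// Hypothetical example of a custom source; the placeholder types below
// stand in for the real definitions in pkg/types.
package sources

import "context"

type Message struct {
	Text    string
	Payload map[string]string
}

type Source interface {
	Open(ctx context.Context) error
	Next(ctx context.Context) (*Message, error)
	Close() error
}

// StaticSource serves a fixed, in-memory batch of messages (useful for testing).
type StaticSource struct {
	messages []Message
	pos      int
}

var _ Source = (*StaticSource)(nil) // compile-time check against the placeholder interface

func (s *StaticSource) Open(ctx context.Context) error { return nil }

func (s *StaticSource) Next(ctx context.Context) (*Message, error) {
	if s.pos >= len(s.messages) {
		return nil, nil // drained
	}
	msg := s.messages[s.pos]
	s.pos++
	return &msg, nil
}

func (s *StaticSource) Close() error { return nil }
```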
To register a new source, embedder, or storage, update the factory implementation in `app/factory/factory.go`. The factory should return the correct instance based on the specified type.
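In practice this is usually a type switch keyed on the configured type name. The sketch below is hypothetical; the real code in `app/factory/factory.go` will differ, and the constructor names are invented for illustration.

```go
package factory

import "fmt"

// Placeholder for the real Source interface in pkg/types.
type Source interface{}

// Hypothetical constructors for built-in and custom sources.
func newKafkaSource() Source  { return struct{}{} }
func newRestSource() Source   { return struct{}{} }
func newStaticSource() Source { return struct{}{} }

// NewSource returns a source instance for the type name given in the configuration.
func NewSource(sourceType string) (Source, error) {
	switch sourceType {
	case "kafka":
		return newKafkaSource(), nil
	case "rest":
		return newRestSource(), nil
	case "static": // a new source type is registered by adding a case here
		return newStaticSource(), nil
	default:
		return nil, fmt.Errorf("unknown source type: %q", sourceType)
	}
}
```

Keeping construction behind the factory means a new adapter only needs to be wired up in one place; the rest of the pipeline works against the interfaces.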
The HTTP source has its own API for receiving messages. To send a message, use the endpoint `http://127.0.0.1/source/send` with a body like:

```json
{
  "Text": "Some text that needs embedding",
  "Payload": {
    "genres": "Comedy, Drama, Romance, Crime",
    "rating": "7.10",
    "title": "SomeTitle",
    "year": "1988-10-21"
  }
}
```

Note: The source API remains available even if the pipeline is stopped. However, messages will not be embedded until the pipeline is started.
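For example, a small Go program could post the message shown above (the address is the one used in this README and may differ in your deployment):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Request body matching the JSON example above.
	msg := map[string]any{
		"Text": "Some text that needs embedding",
		"Payload": map[string]string{
			"genres": "Comedy, Drama, Romance, Crime",
			"rating": "7.10",
			"title":  "SomeTitle",
			"year":   "1988-10-21",
		},
	}

	body, err := json.Marshal(msg)
	if err != nil {
		panic(err)
	}

	resp, err := http.Post("http://127.0.0.1/source/send", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```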
