Text Chunker for Embeddings

A Cloudflare Worker that reads text files from R2 storage, splits them into word-count-based chunks, and sends the chunks in batches to an embedding generator service that creates vector embeddings.

Overview

This service is designed to:

Listen for events from a queue when new files are uploaded to R2 storage
Retrieve files from the R2 bucket
Split text content into smaller chunks (default 300 words per chunk)
Batch these chunks (default 100 chunks per batch)
Send batches to an embedding generator service that creates vector embeddings
Store the embeddings for later use in retrieval and search

The service runs as a Cloudflare Worker that consumes from a queue, so file processing is decoupled from uploads and large text files are handled asynchronously.

Architecture

┌─────────────┐    ┌────────────────┐    ┌───────────────────┐
│ R2 Bucket   │───>│ Queue          │───>│ Text Chunker      │
│ (File Store)│    │ (Events)       │    │ (This Worker)     │
└─────────────┘    └────────────────┘    └─────────┬─────────┘
                                                   │
                                                   ▼
                                         ┌───────────────────┐
                                         │ Embedding         │
                                         │ Generator Service │
                                         └───────────────────┘

Features

Efficient Text Chunking: Splits text by word count to create optimal chunks for embedding
Batch Processing: Groups chunks into batches to optimize embedding requests
File Path Organization: Extracts collection IDs from file paths automatically
Error Handling: Errors are caught and logged at each processing stage, so a single failure does not halt the rest of the batch
Logging: Comprehensive logging for monitoring and debugging

Configuration

The service is configured using the following Cloudflare Worker settings:

Bindings

FILES_BUCKET: R2 bucket containing the text files to process
EMBEDDING_GENERATOR: Service binding to the embedding generator worker
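
As a rough illustration of how these two bindings might be used, the sketch below reads a file from the bucket and posts one batch of chunks to the embedding generator. The `BucketLike` and `ServiceLike` interfaces are minimal stand-ins for the R2 bucket and service-binding types, and the helper names, the endpoint URL, and the JSON payload shape are all assumptions rather than details taken from this repository.

```typescript
// Minimal structural stand-ins for the R2 bucket and service-binding types,
// so the sketch compiles without @cloudflare/workers-types.
interface BucketLike {
  get(key: string): Promise<{ text(): Promise<string> } | null>;
}
interface ServiceLike {
  fetch(
    url: string,
    init: { method: string; headers: Record<string, string>; body: string }
  ): Promise<{ ok: boolean; status: number }>;
}
interface Env {
  FILES_BUCKET: BucketLike;
  EMBEDDING_GENERATOR: ServiceLike;
}

// Read a file from R2; resolves to null when the key does not exist.
async function readFile(env: Env, key: string): Promise<string | null> {
  const object = await env.FILES_BUCKET.get(key);
  return object === null ? null : object.text();
}

// Send one batch of chunks to the embedding generator.
// The URL and the payload shape are hypothetical.
async function sendBatch(
  env: Env,
  collectionId: string,
  chunks: string[]
): Promise<void> {
  const response = await env.EMBEDDING_GENERATOR.fetch(
    "https://embedding-generator/embed",
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ collectionId, chunks }),
    }
  );
  if (!response.ok) {
    throw new Error(`Embedding request failed with status ${response.status}`);
  }
}
```

Keeping the bindings behind small interfaces like these also makes the helpers easy to exercise with in-memory fakes during local testing.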

Queue Consumers

The worker consumes events from a queue named file-processing-queue with:

Max Batch Size: 10 messages
Max Batch Timeout: 30 seconds

Development

Prerequisites

Node.js (latest LTS version recommended)
Cloudflare Workers account
Wrangler CLI installed (npm install -g wrangler)

Setup

Clone the repository
Install dependencies:
```
npm install
```
Configure your Cloudflare account with Wrangler:
```
wrangler login
```

Local Development

Run the worker locally:

```
npm run dev
```

Deployment

Deploy to Cloudflare Workers:

```
npm run deploy
```

File Path Format

Files in the R2 bucket should follow this path format:

{user}/{collection_id}/{file_id}/{file_name}

The collection_id is extracted from this path and used to organize embeddings.
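
A minimal sketch of how the collection_id could be pulled from an object key in this format. The function name `extractCollectionId` and the choice to return `null` for keys that do not match are assumptions for illustration, not the worker's actual code.

```typescript
// Hypothetical helper for keys of the form
// {user}/{collection_id}/{file_id}/{file_name}.
function extractCollectionId(key: string): string | null {
  const parts = key.split("/");
  // A valid key has at least four slash-separated segments.
  if (parts.length < 4) return null;
  // The second segment is the collection ID; treat an empty one as invalid.
  return parts[1] || null;
}

console.log(extractCollectionId("alice/col-42/file-7/notes.txt")); // col-42
console.log(extractCollectionId("notes.txt")); // null
```

Returning `null` instead of throwing lets the caller decide whether a malformed key should be skipped or reported.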

Customization

The chunking and batching behavior can be customized by modifying:

maxWords parameter in the chunkText function (default: 300 words per chunk)
batchSize parameter in the createBatches function (default: 100 chunks per batch)
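
To make the two parameters concrete, here is one way `chunkText` and `createBatches` could be written. The function and parameter names come from this README; the bodies are an assumed implementation (splitting on whitespace and slicing in fixed-size steps) and may differ from the code in src.

```typescript
// Split text into chunks of at most `maxWords` whitespace-separated words.
function chunkText(text: string, maxWords: number = 300): string[] {
  if (maxWords < 1) throw new RangeError("maxWords must be at least 1");
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += maxWords) {
    chunks.push(words.slice(i, i + maxWords).join(" "));
  }
  return chunks;
}

// Group items into arrays of at most `batchSize` elements.
function createBatches<T>(items: T[], batchSize: number = 100): T[][] {
  if (batchSize < 1) throw new RangeError("batchSize must be at least 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Small limits make the result easy to inspect.
const sampleChunks = chunkText("one two three four five", 2);
console.log(sampleChunks); // [ 'one two', 'three four', 'five' ]
console.log(createBatches(sampleChunks, 2).length); // 2
```

With the defaults, a 45,000-word file yields 150 chunks, which are sent as two batches (100 and 50 chunks).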

Error Handling

Errors are isolated at each stage so that one failure does not abort the rest of the work:

Individual file processing errors are logged but don't stop the overall batch
Embedding errors for specific batches are captured and logged
Message parsing errors are handled gracefully
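
The isolation described above can be sketched as a loop that validates and processes each message independently. The message shape (`{ object: { key } }`), the `parseFileEvent` and `processMessages` names, and the `processFile` callback are assumptions made for this example, not the worker's actual interfaces.

```typescript
// Hypothetical message shape: only the object key is needed here.
interface FileEvent {
  object: { key: string };
}

// Narrow an unknown message body to FileEvent without throwing.
function parseFileEvent(body: unknown): FileEvent | null {
  if (typeof body !== "object" || body === null) return null;
  const object = (body as { object?: unknown }).object;
  if (typeof object !== "object" || object === null) return null;
  const key = (object as { key?: unknown }).key;
  return typeof key === "string" ? { object: { key } } : null;
}

// Process each message independently so one failure does not stop the rest.
async function processMessages(
  bodies: unknown[],
  processFile: (key: string) => Promise<void>
): Promise<{ succeeded: string[]; failed: number }> {
  const succeeded: string[] = [];
  let failed = 0;
  for (const body of bodies) {
    const event = parseFileEvent(body);
    if (event === null) {
      console.error("Skipping message with unexpected shape");
      failed++;
      continue;
    }
    try {
      await processFile(event.object.key);
      succeeded.push(event.object.key);
    } catch (err) {
      // Log and move on; the remaining messages are still processed.
      console.error(`Failed to process ${event.object.key}:`, err);
      failed++;
    }
  }
  return { succeeded, failed };
}
```

In a queue consumer, the same pattern could be paired with per-message `ack()` and `retry()` calls so that only the failed messages are redelivered.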

License

[Add license information here]
