history-lab/chunker

Text Chunker for Embeddings

A Cloudflare Worker that reads text files from R2 storage, splits them into word-based chunks, and sends the chunks in batches to an embedding generator service.

Overview

This service is designed to:

  1. Listen for events from a queue when new files are uploaded to R2 storage
  2. Retrieve files from the R2 bucket
  3. Split text content into smaller chunks (default 300 words per chunk)
  4. Batch these chunks (default 100 chunks per batch)
  5. Send batches to an embedding generator service that creates vector embeddings
  6. Store the embeddings for later use in retrieval and search

The service runs as a Cloudflare Worker that consumes from a queue, so large text files are processed asynchronously after upload.

Architecture

┌─────────────┐    ┌────────────────┐    ┌───────────────────┐
│ R2 Bucket   │───>│ Queue          │───>│ Text Chunker      │
│ (File Store)│    │ (Events)       │    │ (This Worker)     │
└─────────────┘    └────────────────┘    └─────────┬─────────┘
                                                   │
                                                   ▼
                                         ┌───────────────────┐
                                         │ Embedding         │
                                         │ Generator Service │
                                         └───────────────────┘

Features

  • Text Chunking: Splits text by word count into fixed-size chunks for embedding
  • Batch Processing: Groups chunks into batches to optimize embedding requests
  • File Path Organization: Extracts collection IDs from file paths automatically
  • Error Handling: Robust error management for each processing stage
  • Logging: Comprehensive logging for monitoring and debugging

Configuration

The service is configured using the following Cloudflare Worker settings:

Bindings

  • FILES_BUCKET: R2 bucket containing the text files to process
  • EMBEDDING_GENERATOR: Service binding to the embedding generator worker

Queue Consumers

The worker consumes events from a queue named file-processing-queue with:

  • Max Batch Size: 10 messages
  • Max Batch Timeout: 30 seconds

Development

Prerequisites

  • Node.js (latest LTS version recommended)
  • Cloudflare Workers account
  • Wrangler CLI installed (npm install -g wrangler)

Setup

  1. Clone the repository
  2. Install dependencies:
    npm install
    
  3. Configure your Cloudflare account with Wrangler:
    wrangler login
    

Local Development

Run the worker locally:

npm run dev

Deployment

Deploy to Cloudflare Workers:

npm run deploy

File Path Format

Files in the R2 bucket should follow this path format:

{user}/{collection_id}/{file_id}/{file_name}

The collection_id is extracted from this path and used to organize embeddings.
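
As a rough illustration of the extraction, the sketch below splits an object key on "/" and returns the second segment. The function name extractCollectionId and the choice to throw on malformed keys are assumptions made for this example, not necessarily how the worker does it.

```typescript
// Hypothetical helper: the name and the error behavior are assumptions.
// Expected key layout: {user}/{collection_id}/{file_id}/{file_name}
function extractCollectionId(key: string): string {
  const parts = key.split("/");
  // Require at least four segments and non-empty user, collection, and file IDs.
  if (parts.length < 4 || parts.slice(0, 3).some((p) => p.length === 0)) {
    throw new Error(`Unexpected object key format: ${key}`);
  }
  return parts[1];
}

// Example with a made-up key:
console.log(extractCollectionId("alice/col-42/file-7/notes.txt"));
```

Rejecting keys that do not match the layout keeps embeddings from being filed under a wrong or empty collection; a real worker might prefer to log and skip such files instead.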

Customization

The chunking and batching behavior can be customized by modifying:

  • maxWords parameter in the chunkText function (default: 300 words per chunk)
  • batchSize parameter in the createBatches function (default: 100 chunks per batch)
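
To make the two parameters concrete, here is a minimal sketch of word-count chunking and batching. The function names and defaults come from this README, but the bodies are an assumption about how they might be written (whitespace splitting, chunks rejoined with single spaces), not the repository's actual code.

```typescript
// Sketch only: the real chunkText and createBatches may differ in detail.

// Split text into chunks of at most maxWords whitespace-separated words.
// Words are rejoined with single spaces, so original spacing and line
// breaks are not preserved.
function chunkText(text: string, maxWords: number = 300): string[] {
  if (!Number.isInteger(maxWords) || maxWords < 1) {
    throw new RangeError("maxWords must be a positive integer");
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += maxWords) {
    chunks.push(words.slice(i, i + maxWords).join(" "));
  }
  return chunks;
}

// Group items into consecutive batches of at most batchSize items.
function createBatches<T>(items: T[], batchSize: number = 100): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Small values make the behavior easy to see.
const exampleChunks = chunkText("one two three four five", 2);
console.log(exampleChunks);
console.log(createBatches(exampleChunks, 2));
```

With the defaults, a 65,000-word file would produce 217 chunks (65,000 / 300, rounded up) and therefore 3 batches (217 / 100, rounded up), so three requests to the embedding generator.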

Error Handling

Errors are contained at each processing stage:

  • Individual file processing errors are logged but don't stop the overall batch
  • Embedding errors for specific batches are captured and logged
  • Message parsing errors are handled gracefully
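
The sketch below shows one way this kind of per-message isolation could look. Several things here are assumptions: the message body shape ({ object: { key } }), the names parseFileEvent, handleMessages, and processFile, and the policy of acknowledging malformed messages while retrying failed files. The QueueMessage interface is a local stand-in that mirrors only the id, body, ack(), and retry() members of a Cloudflare Queues message.

```typescript
// Local stand-in so the sketch compiles on its own; a real worker would
// use the types from @cloudflare/workers-types instead.
interface QueueMessage {
  readonly id: string;
  readonly body: unknown;
  ack(): void;
  retry(): void;
}

type ParseResult = { ok: true; key: string } | { ok: false; error: string };

// Assumed payload shape: { object: { key: string } }, as an object or JSON string.
function parseFileEvent(body: unknown): ParseResult {
  try {
    const data: unknown = typeof body === "string" ? JSON.parse(body) : body;
    const key = (data as { object?: { key?: unknown } } | null)?.object?.key;
    if (typeof key !== "string" || key.length === 0) {
      return { ok: false, error: "missing object.key" };
    }
    return { ok: true, key };
  } catch (err) {
    return { ok: false, error: `invalid JSON: ${String(err)}` };
  }
}

// Process each message independently so one failure does not stop the rest.
// processFile is a hypothetical callback that fetches, chunks, and embeds a file.
async function handleMessages(
  messages: readonly QueueMessage[],
  processFile: (key: string) => Promise<void>,
): Promise<{ succeeded: number; failed: number }> {
  let succeeded = 0;
  let failed = 0;
  for (const message of messages) {
    const parsed = parseFileEvent(message.body);
    if (!parsed.ok) {
      console.error(`Skipping message ${message.id}: ${parsed.error}`);
      message.ack(); // assumption: a malformed message would not succeed on retry
      failed++;
      continue;
    }
    try {
      await processFile(parsed.key);
      message.ack();
      succeeded++;
    } catch (err) {
      console.error(`Failed to process ${parsed.key}: ${String(err)}`);
      message.retry(); // assumption: transient failures are worth retrying
      failed++;
    }
  }
  return { succeeded, failed };
}
```

Returning a result object from the parser instead of throwing keeps the "log and continue" path explicit, and the same pattern could be applied per embedding batch inside processFile.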

License

[Add license information here]
