A Cloudflare Worker that processes text files from R2 storage, chunks them into manageable pieces, and sends them to an embedding generator service for AI-powered text processing.
This service is designed to:
- Listen for events from a queue when new files are uploaded to R2 storage
- Retrieve files from the R2 bucket
- Split text content into smaller chunks (default 300 words per chunk)
- Batch these chunks (default 100 chunks per batch)
- Send batches to an embedding generator service that creates vector embeddings
- Store the embeddings for later use in retrieval and search
Because the worker consumes from a queue, it scales efficiently to process large text files.
┌─────────────┐    ┌────────────────┐    ┌───────────────────┐
│ R2 Bucket   │───>│     Queue      │───>│   Text Chunker    │
│ (File Store)│    │    (Events)    │    │   (This Worker)   │
└─────────────┘    └────────────────┘    └─────────┬─────────┘
                                                   │
                                                   ▼
                                         ┌───────────────────┐
                                         │     Embedding     │
                                         │ Generator Service │
                                         └───────────────────┘
- Efficient Text Chunking: Splits text by word count to create optimal chunks for embedding
- Batch Processing: Groups chunks into batches to optimize embedding requests
- File Path Organization: Extracts collection IDs from file paths automatically
- Error Handling: Robust error management for each processing stage
- Logging: Comprehensive logging for monitoring and debugging
The service is configured with the following Cloudflare Worker bindings:
- `FILES_BUCKET`: R2 bucket containing the text files to process
- `EMBEDDING_GENERATOR`: Service binding to the embedding generator worker
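In `wrangler.toml`, these bindings would look something like the following (the bucket and service names are placeholders):

```toml
# Binding names match the worker code; bucket/service names are placeholders.
[[r2_buckets]]
binding = "FILES_BUCKET"
bucket_name = "my-files-bucket"

[[services]]
binding = "EMBEDDING_GENERATOR"
service = "embedding-generator"
```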
The worker consumes events from a queue named `file-processing-queue` with:
- Max Batch Size: 10 messages
- Max Batch Timeout: 30 seconds
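In `wrangler.toml`, that consumer configuration corresponds to something like:

```toml
[[queues.consumers]]
queue = "file-processing-queue"
max_batch_size = 10    # up to 10 messages per delivered batch
max_batch_timeout = 30 # deliver a partial batch after 30 seconds
```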
- Node.js (latest LTS version recommended)
- A Cloudflare Workers account
- Wrangler CLI installed (`npm install -g wrangler`)
- Clone the repository
- Install dependencies: `npm install`
- Configure your Cloudflare account with Wrangler: `wrangler login`

Run the worker locally: `npm run dev`

Deploy to Cloudflare Workers: `npm run deploy`
Files in the R2 bucket should follow this path format:
`{user}/{collection_id}/{file_id}/{file_name}`
The `collection_id` is extracted from this path and used to organize embeddings.
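For example, extracting the collection ID from an object key might look like this (a hypothetical helper for illustration; the worker's actual function may differ):

```ts
// Pulls collection_id out of "{user}/{collection_id}/{file_id}/{file_name}".
function getCollectionId(key: string): string | null {
  const parts = key.split("/");
  // Expect at least user/collection_id/file_id/file_name.
  return parts.length >= 4 ? parts[1] : null;
}

// Example: getCollectionId("alice/recipes/42/pasta.txt") returns "recipes".
```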
The chunking and batching behavior can be customized by modifying:
- the `maxWords` parameter in the `chunkText` function (default: 300 words per chunk)
- the `batchSize` parameter in the `createBatches` function (default: 100 chunks per batch)
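For reference, here is a plausible shape for these two functions (a sketch inferred from the parameter names above, not the worker's exact code):

```ts
// Split text into chunks of at most maxWords words each.
function chunkText(text: string, maxWords = 300): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += maxWords) {
    chunks.push(words.slice(i, i + maxWords).join(" "));
  }
  return chunks;
}

// Group chunks into batches of at most batchSize items each.
function createBatches<T>(items: T[], batchSize = 100): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```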
The worker implements robust error handling:
- Individual file processing errors are logged but don't stop the overall batch
- Embedding errors for specific batches are captured and logged
- Message parsing errors are handled gracefully
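In a queue consumer, that per-file isolation typically takes the form of explicit `ack()`/`retry()` calls per message (a sketch of the pattern; `processFile` is a hypothetical stand-in for the chunk/batch/embed pipeline):

```ts
interface Env {
  FILES_BUCKET: R2Bucket;
  EMBEDDING_GENERATOR: Fetcher;
}

// Hypothetical stand-in for the full chunk/batch/embed pipeline.
declare function processFile(key: string, env: Env): Promise<void>;

export default {
  async queue(batch: MessageBatch<{ key: string }>, env: Env): Promise<void> {
    for (const msg of batch.messages) {
      try {
        await processFile(msg.body.key, env);
        msg.ack(); // success: remove the message from the queue
      } catch (err) {
        // One failing file doesn't abort the rest of the batch.
        console.error(`Failed to process ${msg.body.key}:`, err);
        msg.retry(); // redeliver this message later
      }
    }
  },
};
```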
[Add license information here]