This document outlines the implementation of offline PDF handling in the Electron/React application while maintaining knowledge base functionality. The solution enables local PDF file reading without CDN dependencies while preserving the existing knowledge base search capabilities.
- Package Version: Installed
pdfjs-distversion 3.11.174npm install pdfjs-dist@3.11.174
Created a custom Vite plugin to handle PDF worker file copying:
// vite.config.ts
import { defineConfig } from 'vite';
import react from '@vitejs/plugin-react';
import path from 'path';
import fs from 'fs-extra';
function copyPdfWorker() {
return {
name: 'copy-pdf-worker',
buildStart() {
const workerSrc = path.resolve(
__dirname,
'node_modules/pdfjs-dist/build/pdf.worker.min.js'
);
const workerDest = path.resolve(
__dirname,
'public/pdf.worker.min.js'
);
fs.copySync(workerSrc, workerDest);
}
};
}
export default defineConfig({
plugins: [react(), copyPdfWorker()],
build: {
rollupOptions: {
output: {
manualChunks: {
pdfjs: ['pdfjs-dist']
}
}
}
}
});Updated documentUtils.ts to properly initialize the PDF worker:
import * as pdfjsLib from 'pdfjs-dist';
// Configure PDF.js worker
const pdfWorkerSrc = '/pdf.worker.min.js';
pdfjsLib.GlobalWorkerOptions.workerSrc = pdfWorkerSrc;
export async function readPdfContent(file: File): Promise<string> {
const arrayBuffer = await file.arrayBuffer();
const pdf = await pdfjsLib.getDocument({ data: arrayBuffer }).promise;
let fullText = '';
for (let i = 1; i <= pdf.numPages; i++) {
const page = await pdf.getPage(i);
const textContent = await page.getTextContent();
const pageText = textContent.items
.map((item: any) => item.str)
.join(' ');
fullText += pageText + '\n';
}
return fullText;
}To maintain both local document reading and knowledge base functionality:
// Handling both local and server-based content
async function getCombinedContext(userQuery: string) {
let combinedContext = '';
// 1. Local document handling
if (temporaryDocs.length > 0) {
const localContext = temporaryDocs
.map(doc => `Content from ${doc.name}:\n${doc.content}`)
.join('\n\n');
combinedContext += localContext;
}
// 2. Knowledge base search
if (ragEnabled && pythonPort) {
try {
const results = await searchDocuments(userQuery, pythonPort, [], ragEnabled);
if (results?.results?.length > 0) {
const serverContext = results.results
.map(r => {
const source = r.metadata?.source_file
? ` (from ${r.metadata.source_file})`
: '';
return `${r.content}${source}`;
})
.join('\n\n');
if (combinedContext) {
combinedContext += '\n\n--- Knowledge Base Content ---\n\n';
}
combinedContext += serverContext;
}
} catch (error) {
console.error('Error fetching knowledge base content:', error);
}
}
return combinedContext;
}- PDF processing is done client-side
- Consider implementing file size limits for large PDFs
- Monitor memory usage when processing multiple documents
- Implement proper error handling for PDF processing failures
- Add user feedback for processing status
- Handle network errors gracefully when knowledge base is unavailable
- Validate file types before processing
- Implement proper sanitization of extracted text
- Consider implementing file access restrictions if needed
- Keep
pdfjs-distversion updated while ensuring compatibility - Monitor worker file copying during builds
- Regularly test both local and knowledge base functionality
- PDF loading works offline
- Multiple PDF documents can be loaded simultaneously
- Knowledge base search still functions
- Combined results are properly formatted
- Error handling works as expected
- Memory usage remains stable
- Build process completes successfully
- Worker file is properly copied to public directory
-
Worker Not Found
- Verify worker file is in public directory
- Check worker path configuration
- Ensure build process copies worker file
-
Memory Issues
- Implement pagination for large PDFs
- Add file size limits
- Clean up document objects after processing
-
Knowledge Base Integration
- Verify Python backend connection
- Check RAG configuration
- Validate search endpoint responses
-
Performance Optimization
- Implement worker pooling for multiple PDFs
- Add caching for frequently accessed documents
- Optimize text extraction process
-
Feature Enhancements
- Add PDF preview capability
- Implement text highlighting
- Add document management interface
-
Integration Improvements
- Better error reporting
- Progress indicators for large documents
- Enhanced context merging algorithms