Handling Text Files with PDF Extensions

Issue Summary

Problem: Files in the Archive directory with .pdf extensions that are actually plain text files cause "Failed to load PDF document" errors.

Example: Archive/ WEBSITE PACKAGE/2025/Latha_2025_Veterinary Clinic.pdf was a 92-byte text file, not a valid PDF.

Solutions Implemented

✅ Option 1: Automatic Conversion (NEW - Recommended!)

The backend now automatically converts text files to proper PDFs!

Location: backend/server.py

Features:

  • Fully automatic - No user action required
  • Seamless - Happens during file upload
  • Transparent - User doesn't see any difference
  • Robust - Falls back to direct text extraction (Option 2) if conversion fails
  • Logged - Server logs show conversion status

How it works:

  1. User uploads a file through LibraDigit AI
  2. Backend detects if it's a text file with .pdf extension
  3. Automatically converts to proper PDF format
  4. Extracts text from the converted PDF
  5. Continues normal workflow
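
The five steps above can be sketched roughly as follows. This is a minimal illustration, not the code in backend/server.py: the function name process_upload and the two injected callables (convert_text_to_pdf, extract_pdf_text) are hypothetical stand-ins for whatever the backend actually uses.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)

PDF_MAGIC = b"%PDF"


def process_upload(
    data: bytes,
    convert_text_to_pdf: Callable[[str], bytes],
    extract_pdf_text: Callable[[bytes], str],
) -> str:
    """Return extracted text for an uploaded file that claims to be a PDF.

    Both callables are hypothetical hooks: one turns plain text into PDF
    bytes, the other pulls text out of PDF bytes.
    """
    if data.startswith(PDF_MAGIC):
        # Already a real PDF: go straight to extraction.
        return extract_pdf_text(data)

    # Not a PDF: treat the uploaded bytes as plain text.
    text = data.decode("utf-8", errors="replace")
    try:
        pdf_bytes = convert_text_to_pdf(text)
    except Exception as exc:
        # Fallback: conversion failed, so continue with the raw text.
        logger.warning("Text-to-PDF conversion failed: %s", exc)
        return text

    logger.info("Converted text upload to PDF (%d bytes)", len(pdf_bytes))
    return extract_pdf_text(pdf_bytes)
```

Passing the converter and extractor in as arguments keeps the sketch independent of any particular PDF library and makes the fallback path easy to exercise.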

User Experience:

Upload file → ✅ Auto-converted → ✅ Text extracted → Continue workflow

✅ Option 2: Manual Detection (Fallback)

The backend can also handle text files without conversion:

Location: backend/server.py

Features:

  • Detects text files masquerading as PDFs
  • Extracts content from text files
  • Adds informative notes to the extracted text
  • Allows normal workflow to continue

How it works:

  1. When a file is uploaded, the system checks the file header
  2. If it's a text file (doesn't start with %PDF), it's handled specially
  3. Text content is extracted and processed
  4. User is notified about the file type
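
As a rough sketch of this fallback path, assuming the file is UTF-8 text: the function name extract_text_fallback and the wording of the note are illustrative only and are not taken from backend/server.py.

```python
from pathlib import Path
from typing import Optional


def extract_text_fallback(path: str) -> Optional[str]:
    """Return the file's text with a note if it is not a real PDF.

    Returns None for files that start with %PDF so the caller can hand
    them to the normal PDF pipeline instead.
    """
    raw = Path(path).read_bytes()
    if raw.startswith(b"%PDF"):
        return None

    # Undecodable bytes are replaced rather than raising, so a slightly
    # damaged file still yields usable text.
    text = raw.decode("utf-8", errors="replace")
    note = "[Note: this file has a .pdf extension but contains plain text.]"
    return f"{note}\n\n{text}"
```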

✅ Option 3: Batch Conversion (Utility Script)

A Python utility script to convert text files to proper PDFs:

Location: backend/convert_text_to_pdf.py

Features:

  • Converts single files or entire directories
  • Preserves metadata (Author, Year, Subject, Title)
  • Creates professionally formatted PDFs
  • Safe to run multiple times (skips already-converted files)
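
The "safe to run multiple times" property can be obtained by checking each file's header before converting it. The sketch below shows one way to do that. It is an assumption about the approach, not the contents of convert_text_to_pdf.py, and convert_file is a hypothetical callable that rewrites a file in place as a PDF.

```python
from pathlib import Path
from typing import Callable, List


def convert_tree(root: str, convert_file: Callable[[Path], None]) -> List[Path]:
    """Convert every non-PDF '*.pdf' file under root.

    Returns the list of paths that were converted on this run.
    """
    converted = []
    # rglob walks subdirectories; the pattern is case-sensitive on most
    # Linux filesystems, so '.PDF' files would need a second pattern.
    for path in sorted(Path(root).rglob("*.pdf")):
        if not path.is_file():
            continue
        with path.open("rb") as f:
            if f.read(4) == b"%PDF":
                continue  # already a valid PDF, so re-runs skip it
        convert_file(path)
        converted.append(path)
    return converted
```

Because the skip decision is based on file content rather than a separate "done" marker, a second run over the same directory converts nothing.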

Usage Guide

Method 1: Use the Application (Automatic)

Simply upload your file through the LibraDigit AI interface:

  1. Go to Upload & OCR page
  2. Upload your file (even if it's a text file with .pdf extension)
  3. The system detects the file type and converts text files to PDF automatically
  4. Text is extracted and you can proceed normally

Method 2: Convert Files Manually

Convert a Single File

cd backend
python convert_text_to_pdf.py "Archive/  WEBSITE PACKAGE/2025/Latha_2025_Veterinary Clinic.pdf"

Convert All Files in Archive Directory

cd backend
python convert_text_to_pdf.py "Archive/"

This will scan recursively and convert all text files with .pdf extensions.

Installation

If you need to use the conversion script, ensure reportlab is installed:

cd backend
pip install -r requirements.txt

Or install just reportlab:

pip install reportlab

Example Output

Before Conversion

File: Latha_2025_Veterinary Clinic.pdf
Size: 92 bytes
Type: Plain text
Status: ✗ Cannot be opened by PDF readers

After Conversion

File: Latha_2025_Veterinary Clinic.pdf
Size: 1,822 bytes
Type: Valid PDF document
Status: ✓ Can be opened by any PDF reader

Technical Details

File Detection Logic

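def is_text_file(filepath):
    """Return True if the file does not begin with the PDF signature."""
    with open(filepath, 'rb') as f:
        header = f.read(4)
    # Valid PDF files start with the bytes %PDF. Anything else, including
    # empty files and other binary formats, is reported as text here.
    return not header.startswith(b'%PDF')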
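
The header check above treats every non-PDF file as text, which would also match an image or an empty file saved with a .pdf extension. A stricter variant might additionally require that the content decodes as UTF-8. This is a sketch of one possible refinement, not the logic used in the backend, and looks_like_plain_text and sample_size are hypothetical names.

```python
import codecs


def looks_like_plain_text(filepath: str, sample_size: int = 4096) -> bool:
    """Return True only if the file is non-empty, not a PDF, and valid UTF-8."""
    with open(filepath, "rb") as f:
        sample = f.read(sample_size)

    if not sample or sample.startswith(b"%PDF"):
        return False
    if b"\x00" in sample:
        # NUL bytes are a strong hint that the file is binary.
        return False

    # An incremental decoder with final=False tolerates a multi-byte
    # character cut off at the end of the sample but still rejects
    # invalid bytes elsewhere.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True
```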

Supported File Types

The system now handles:

  • ✓ Valid PDF documents
  • ✓ Text files with .pdf extension
  • ✓ Image files (PNG, JPG, JPEG, TIFF, BMP)
  • ✓ Scanned PDFs (with Tesseract OCR)
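
One common way to tell these types apart is to compare the first bytes of the file against well-known signatures. The sketch below uses the standard signatures for PDF, PNG, JPEG, TIFF and BMP. The function name sniff_type and the returned labels are illustrative, and the backend may identify files differently.

```python
# (signature, label) pairs; the labels are arbitrary names for this sketch.
SIGNATURES = [
    (b"%PDF", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"II*\x00", "tiff"),   # little-endian TIFF
    (b"MM\x00*", "tiff"),   # big-endian TIFF
    (b"BM", "bmp"),
]


def sniff_type(header: bytes) -> str:
    """Classify a file from its leading bytes; 'unknown' if nothing matches."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"
```

A header check cannot distinguish a scanned PDF from a text-based one, since both start with %PDF, and the two-byte BMP signature can match plain text that happens to begin with "BM". Treat the result as a hint rather than proof.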

Troubleshooting

"Failed to load PDF document" Error

Cause: File is a text file with .pdf extension

Solutions:

  1. Upload through LibraDigit AI (automatic handling)
  2. Run the conversion script manually
  3. To confirm the cause, check the file size (very small files, such as 92 bytes, are likely text files)

Conversion Script Not Working

Check:

  1. Is Python installed? Run: python --version
  2. Is reportlab installed? Run: pip show reportlab
  3. Is the file path correct? (use quotes for paths with spaces)
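
The same three checks can be run from Python in one step. This is a convenience sketch using only the standard library; check_environment is a hypothetical helper and is not part of the conversion script.

```python
import importlib.util
import sys
from pathlib import Path


def check_environment(file_path: str) -> dict:
    """Report the Python version, whether reportlab is importable,
    and whether file_path points at an existing file."""
    return {
        "python_version": sys.version.split()[0],
        "reportlab_installed": importlib.util.find_spec("reportlab") is not None,
        "file_exists": Path(file_path).is_file(),
    }
```

Passing the path as a Python string sidesteps shell quoting problems with paths that contain spaces.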

Backend Not Detecting Text Files

Verify:

  1. The backend server is running
  2. The latest version of server.py is deployed
  3. The server logs show no errors

Additional Resources

  • Conversion Script Documentation: backend/CONVERT_TEXT_TO_PDF.md
  • Backend API Documentation: backend/server.py
  • Main README: README.md

Summary

Two complementary approaches are now in place:

  1. Automatic Handling: The backend detects text files during upload and converts them to PDF, or extracts their text directly as a fallback (Options 1 and 2)
  2. Manual Conversion: Use the utility script to batch-convert existing files (Option 3)

You can use either approach or both together for maximum flexibility!