Commit a604804

fix(docs): add poppler-utils dependency for PDF document processing
- Add poppler-utils to Dockerfile runtime dependencies
- Document poppler-utils installation in README.md prerequisites
- Add early validation check for poppler-utils availability
- Improve error messages with platform-specific installation instructions
- Fix logger initialization order in nemo_retriever.py

Resolves PDF processing failures when poppler-utils is not installed. Required by pdf2image package for PDF to image conversion.
1 parent 371992e commit a604804

3 files changed

Lines changed: 27 additions & 1 deletion

Dockerfile

Lines changed: 1 addition & 0 deletions
@@ -60,6 +60,7 @@ WORKDIR /app
 RUN apt-get update && apt-get install -y \
     curl \
     git \
+    poppler-utils \
     && rm -rf /var/lib/apt/lists/*
 
 # Copy Python dependencies from backend-deps stage

README.md

Lines changed: 5 additions & 0 deletions
@@ -218,6 +218,11 @@ See the [Local Development Setup](#local-development-setup) section below for ma
   - **macOS**: `brew install postgresql` or `brew install libpq`
   - **Windows**: Install from [PostgreSQL downloads](https://www.postgresql.org/download/windows/)
   - **Alternative**: Use Docker (see [DEPLOYMENT.md](DEPLOYMENT.md))
+- **Poppler utilities** (`poppler-utils`) - Required for PDF document processing
+  - **Ubuntu/Debian**: `sudo apt-get install poppler-utils`
+  - **macOS**: `brew install poppler`
+  - **Windows**: Install from [Poppler for Windows](http://blog.alivate.com.au/poppler-windows/) or use Chocolatey: `choco install poppler`
+  - **Note**: Required by `pdf2image` package for converting PDF pages to images
 - **CUDA (for GPU acceleration)** - Optional but recommended for RAPIDS GPU-accelerated forecasting
   - **Recommended**: CUDA 12.x (default for RAPIDS packages)
   - **Supported**: CUDA 11.x (via `install_rapids.sh` auto-detection)

src/api/agents/document/preprocessing/nemo_retriever.py

Lines changed: 21 additions & 1 deletion
@@ -29,6 +29,8 @@
 from PIL import Image
 import io
 
+logger = logging.getLogger(__name__)
+
 # Try to import pdf2image, fallback to None if not available
 try:
     from pdf2image import convert_from_path
@@ -37,7 +39,16 @@
     PDF2IMAGE_AVAILABLE = False
     logger.warning("pdf2image not available. PDF processing will be limited. Install with: pip install pdf2image")
 
-logger = logging.getLogger(__name__)
+
+def _check_poppler_available() -> bool:
+    """
+    Check if poppler-utils is installed and available in PATH.
+
+    Returns:
+        True if poppler-utils is available, False otherwise
+    """
+    import shutil
+    return shutil.which("pdfinfo") is not None
 
 
 class NeMoRetrieverPreprocessor:
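
Note on the logger reordering in the two hunks above: the `except ImportError` branch calls `logger.warning(...)` while the module is still being imported, so with the old order a missing `pdf2image` would presumably have surfaced as a `NameError` rather than the intended warning. The sketch below reproduces that failure mode in isolation. It is an illustration only, not code from this commit: `MISSING` is a hypothetical module name assumed not to be installed, and `run_as_module` is a helper invented for the demonstration.

```python
import logging

# Hypothetical module name, assumed not to be installed; it stands in for pdf2image.
MISSING = "some_missing_optional_dependency_xyz"

# Old order: the except branch runs before `logger` has been assigned.
OLD_ORDER = f"""
try:
    import {MISSING}
except ImportError:
    logger.warning("optional dependency missing")
logger = logging.getLogger("demo")
"""

# New order (as in this commit): `logger` exists before the guarded import.
NEW_ORDER = f"""
logger = logging.getLogger("demo")
try:
    import {MISSING}
except ImportError:
    logger.warning("optional dependency missing")
"""


def run_as_module(source: str) -> str:
    """Execute `source` like a module body and report how it ended."""
    try:
        exec(source, {"logging": logging})
    except NameError as exc:
        return f"NameError: {exc}"
    return "ok"


if __name__ == "__main__":
    print(run_as_module(OLD_ORDER))
    print(run_as_module(NEW_ORDER))
```

With the new order, the warning is logged and the import of the module completes.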
@@ -205,6 +216,15 @@ async def _extract_pdf_images(self, file_path: str) -> List[Image.Image]:
                 "Also requires poppler-utils system package: sudo apt-get install poppler-utils"
             )
 
+        # Check if poppler-utils is available before attempting conversion
+        if not _check_poppler_available():
+            raise RuntimeError(
+                "poppler-utils is not installed or not in PATH. "
+                "Install it with: sudo apt-get install poppler-utils (Ubuntu/Debian) "
+                "or brew install poppler (macOS). "
+                "This is required for PDF to image conversion."
+            )
+
         logger.info(f"Converting PDF to images: {file_path}")
 
         # Limit pages for faster processing
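
The error message added in the last hunk lists every platform's install command at once. One possible variation, sketched here under assumptions and not part of this commit, is to pick the hint for the current platform only. The names `poppler_install_hint` and `require_poppler` are hypothetical. The commands are the ones from the README hunk, and mapping all of `Linux` to `apt-get` is a simplification that only holds for Debian-based distributions.

```python
import platform
import shutil
from typing import Optional

# Install commands taken from the README hunk in this commit.
# Assumption: "Linux" means a Debian/Ubuntu system with apt-get.
_INSTALL_HINTS = {
    "Linux": "sudo apt-get install poppler-utils",
    "Darwin": "brew install poppler",
    "Windows": "choco install poppler",
}


def poppler_install_hint(system: Optional[str] = None) -> str:
    """Return an install command for the given (or the current) platform."""
    system = system or platform.system()
    return _INSTALL_HINTS.get(system, "install poppler-utils with your package manager")


def require_poppler(binary: str = "pdfinfo") -> str:
    """Return the resolved path of a Poppler binary, or raise RuntimeError."""
    path = shutil.which(binary)
    if path is None:
        raise RuntimeError(
            f"{binary} not found in PATH. Install Poppler with: {poppler_install_hint()}"
        )
    return path
```

Like `_check_poppler_available()` in the commit, this probes for `pdfinfo` with `shutil.which`, which only confirms that a binary is on `PATH`, not that it runs correctly.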
