Commit 4452186

feat: add flash-attn to reduce VRAM usage and speed up inference
Without flash-attention, eager attention materializes an O(N²) attention matrix in each layer, where N is the sequence length. On high-res PDF pages this needs 7+ GB for activations alone, exceeding the MPS memory limit. Flash-attention reduces the attention memory to O(N).
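The scaling described in the commit message can be illustrated with a rough back-of-the-envelope sketch. The head count, sequence length, and element sizes below are hypothetical values chosen for illustration, not figures taken from the Qwen3-VL model, and the O(N) term is modeled simply as one fp32 softmax statistic per query row.

```python
# Rough memory model for one attention layer (illustrative assumptions only).

def eager_attention_bytes(seq_len: int, num_heads: int,
                          bytes_per_elem: int = 2, batch: int = 1) -> int:
    """Bytes for the full (batch x heads x N x N) score matrix in eager attention."""
    return batch * num_heads * seq_len * seq_len * bytes_per_elem


def row_stats_bytes(seq_len: int, num_heads: int,
                    bytes_per_stat: int = 4, batch: int = 1) -> int:
    """Bytes for one softmax statistic per query row: the O(N) term."""
    return batch * num_heads * seq_len * bytes_per_stat


if __name__ == "__main__":
    n, heads = 16384, 16  # hypothetical sequence length and head count
    print("eager score matrix:", eager_attention_bytes(n, heads) / 1024**3, "GiB")
    print("per-row statistics:", row_stats_bytes(n, heads) / 1024**2, "MiB")
```

Under these assumed values the quadratic term dominates by several orders of magnitude, which is the effect the commit relies on.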
1 parent 89cf547 commit 4452186

1 file changed

Lines changed: 3 additions & 1 deletion

Containerfile.mayflower-qwen3vl

@@ -9,8 +9,10 @@ USER 0
 ENV PIP_DISABLE_PIP_VERSION_CHECK=1

 RUN /opt/app-root/bin/python -m pip install --no-cache-dir \
+flash-attn --no-build-isolation && \
+/opt/app-root/bin/python -m pip install --no-cache-dir \
 "git+${QWEN3VL_PLUGIN_REPO}@${QWEN3VL_PLUGIN_REF}" && \
-/opt/app-root/bin/python -c "from docling_ocr_qwen3vl.options import DEFAULT_QWEN3VL_MODEL_REPO_ID; print('default_model_repo_id=', DEFAULT_QWEN3VL_MODEL_REPO_ID)"
+/opt/app-root/bin/python -c "import flash_attn; print('flash-attn', flash_attn.__version__); from docling_ocr_qwen3vl.options import DEFAULT_QWEN3VL_MODEL_REPO_ID; print('default_model_repo_id=', DEFAULT_QWEN3VL_MODEL_REPO_ID)"

 ENV DOCLING_SERVE_ALLOW_EXTERNAL_PLUGINS=true
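The build step above only verifies that flash_attn imports inside the image. A minimal sketch of a matching runtime check follows, assuming the model is loaded through Hugging Face transformers, whose from_pretrained accepts an attn_implementation argument with values such as "flash_attention_2", "sdpa", and "eager". The helper name pick_attn_implementation is hypothetical, and the commit does not show whether the docling_ocr_qwen3vl plugin selects its backend this way.

```python
import importlib.util


def pick_attn_implementation(prefer_flash: bool = True) -> str:
    """Return "flash_attention_2" if flash_attn is importable, else "sdpa".

    Hypothetical helper: the returned string is meant to be passed as
    attn_implementation=... when loading a transformers model.
    """
    if prefer_flash and importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"  # PyTorch scaled-dot-product attention as the fallback


if __name__ == "__main__":
    print("attn_implementation =", pick_attn_implementation())
```

Falling back to "sdpa" rather than failing keeps the same code usable on hosts where flash-attn cannot be built.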
