
Fix Stanza model permissions and ANALYZER_CONF_FILE section override in Docker#1911

Draft
Copilot wants to merge 5 commits into main from copilot/fix-stanza-models-accessibility

Conversation

Contributor

Copilot AI commented Mar 17, 2026

This PR fixes two configuration bugs that surface in Docker-based deployments using a custom ANALYZER_CONF_FILE.

Bug 1: Stanza models written to /root/.cache — inaccessible at runtime

stanza.download() runs as root during build, writing models to /root/.cache/stanza/. The container runs as user 1001 (presidio) at runtime, making those models unreachable.

Fix: Set ENV STANZA_RESOURCES_DIR=/app/stanza_resources in Dockerfile.stanza before the model download step. /app is already chown-ed to the presidio user.
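In Dockerfile.stanza, the change amounts to something like the following sketch (the surrounding lines are illustrative and may not match the actual Dockerfile exactly):

```dockerfile
# Redirect Stanza's model cache away from /root/.cache so the
# non-root runtime user (1001, presidio) can read the models.
ENV STANZA_RESOURCES_DIR=/app/stanza_resources

# The download step now writes under /app, which is later
# chown-ed to the presidio user.
RUN poetry run python install_nlp_models.py --conf_file ${NLP_CONF_FILE}
```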

Bug 2: Default NLP_CONF_FILE / RECOGNIZER_REGISTRY_CONF_FILE silently override sections in ANALYZER_CONF_FILE

All Dockerfile variants (Dockerfile, Dockerfile.stanza, Dockerfile.transformers) bake default values for NLP_CONF_FILE and RECOGNIZER_REGISTRY_CONF_FILE into the image via ENV. When a user provides a single unified ANALYZER_CONF_FILE (containing nlp_configuration and recognizer_registry sections) and leaves the other two variables at their Dockerfile defaults, AnalyzerEngineProvider prioritizes the per-section files over the corresponding sections inside ANALYZER_CONF_FILE. The result is a fatal mismatch at startup:

Misconfigured engine, supported languages have to be consistent
registry.supported_languages: ['en'], analyzer_engine.supported_languages: ['en', 'th']
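For reference, a minimal unified analyzer conf of the kind that triggers this might look like the following. This is an illustrative sketch: the languages and model names are assumptions, not taken from the actual report.

```yaml
# analyzer.yaml — unified ANALYZER_CONF_FILE (illustrative)
supported_languages:
  - en
  - th

nlp_configuration:
  nlp_engine_name: stanza
  models:
    - lang_code: en
      model_name: en
    - lang_code: th
      model_name: th

recognizer_registry:
  supported_languages:
    - en
    - th
```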

Fix — Runtime (AnalyzerEngineProvider)

The priority logic in AnalyzerEngineProvider._load_nlp_engine() and _load_recognizer_registry() is updated so that inline sections embedded in the analyzer conf take priority over separately provided per-section files. A per-section file is only used when the corresponding inline section is absent from the analyzer conf.

Priority order (highest to lowest):
1. Inline section in analyzer conf (nlp_configuration / recognizer_registry)
2. Separately provided per-section file (nlp_engine_conf_file / recognizer_registry_conf_file)
3. Default
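The resulting selection rule can be sketched as a pure function (the helper names here are hypothetical; the real logic lives in AnalyzerEngineProvider._load_nlp_engine() and _load_recognizer_registry(), and the same rule applies to both sections):

```python
from typing import Callable, Optional


def resolve_section(
    analyzer_conf: dict,
    section_name: str,
    per_section_file: Optional[str],
    load_file: Callable[[str], dict],
) -> Optional[dict]:
    """Resolve one config section under the new priority order.

    1. An inline section in the analyzer conf wins outright.
    2. Otherwise a separately provided per-section file is loaded.
    3. Otherwise None is returned and the caller applies its default.
    """
    inline = analyzer_conf.get(section_name)
    if inline is not None:
        return inline  # 1. inline section takes priority
    if per_section_file:
        return load_file(per_section_file)  # 2. per-section file
    return None  # 3. fall back to the default


# An inline nlp_configuration beats a Dockerfile-baked per-section file:
conf = {"nlp_configuration": {"nlp_engine_name": "stanza"}}
winner = resolve_section(
    conf, "nlp_configuration", "conf/default.yaml", lambda p: {"from_file": p}
)
print(winner)  # {'nlp_engine_name': 'stanza'}
```

Passing the loader as a parameter keeps the rule trivially testable without touching the filesystem.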

This handles all use cases:

  • User passes a unified ANALYZER_CONF_FILE with inline nlp_configuration and recognizer_registry → inline sections win, Dockerfile-baked defaults are ignored ✓
  • User passes an ANALYZER_CONF_FILE with only top-level params (no inline sections) + separate NLP/registry files → per-section files are used ✓
  • User passes only per-section files (no unified file) → per-section files used ✓

Fix — Build-time model downloads (install_nlp_models.py + all Dockerfiles)

The model download script now accepts an optional --analyzer_conf_file argument. When provided and the file contains an nlp_configuration section (unified analyzer conf format), models from that section are downloaded and --conf_file is ignored. This ensures that a unified ANALYZER_CONF_FILE drives both build-time model downloads and runtime configuration without requiring a separate NLP_CONF_FILE.
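The decision in install_nlp_models.py reduces to a rule like this (the function name is hypothetical; the real script's argument parsing and download calls are omitted):

```python
from typing import Optional


def pick_models_conf(
    analyzer_conf: Optional[dict],
    nlp_conf: Optional[dict],
) -> Optional[dict]:
    """Choose which nlp_configuration drives build-time model downloads.

    A unified analyzer conf carrying an inline nlp_configuration
    section wins and the dedicated NLP conf is ignored; otherwise
    the legacy split-file workflow behaves exactly as before.
    """
    if analyzer_conf and "nlp_configuration" in analyzer_conf:
        return analyzer_conf["nlp_configuration"]  # unified conf wins
    return nlp_conf  # split-file workflow unchanged
```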

All three Dockerfiles now pass both arguments to the script:

RUN poetry run python install_nlp_models.py \
    --conf_file ${NLP_CONF_FILE} \
    --analyzer_conf_file ${ANALYZER_CONF_FILE}

Resulting behaviour:

  • User sets only ANALYZER_CONF_FILE (with nlp_configuration inside) → models from nlp_configuration are downloaded at build time and used at runtime ✓
  • User sets only NLP_CONF_FILE (split-file workflow) → unchanged behaviour ✓
  • User sets both to custom values → both are respected explicitly ✓
Original prompt

This section details the original issue you should resolve.

<issue_title>Dockerfile.stanza: Stanza models inaccessible at runtime + per-section env vars override unified ANALYZER_CONF_FILE</issue_title>
<issue_description>Two related configuration bugs were found when building and running Dockerfile.stanza with a custom ANALYZER_CONF_FILE.

Bug 1: Stanza models downloaded to /root/.cache are inaccessible at runtime

Steps to reproduce:

  1. Build Dockerfile.stanza with any NLP_CONF_FILE containing a Stanza model.
  2. Run the container.

Expected behavior: The Stanza NLP engine loads its models successfully.

Actual behavior: The container fails to load models because stanza.download() runs as root during the build step and writes models to /root/.cache/stanza/. At runtime, the process runs as user 1001 (presidio) and cannot access /root/.

Fix: Set ENV STANZA_RESOURCES_DIR=/app/stanza_resources before the model download step. Since /app is later chown-ed to the presidio user, the models will be accessible at runtime.


Bug 2: NLP_CONF_FILE and RECOGNIZER_REGISTRY_CONF_FILE env vars silently override sections in ANALYZER_CONF_FILE

Steps to reproduce:

  1. Create a single unified analyzer.yaml containing supported_languages, nlp_configuration, and recognizer_registry sections.
  2. Build with --build-arg ANALYZER_CONF_FILE=analyzer.yaml, leaving NLP_CONF_FILE and RECOGNIZER_REGISTRY_CONF_FILE at their defaults.
  3. Run the container.

Expected behavior: The analyzer uses the NLP engine and recognizer registry settings from analyzer.yaml.

Actual behavior: The Dockerfile bakes in default values for NLP_CONF_FILE (SpaCy) and RECOGNIZER_REGISTRY_CONF_FILE (English-only) as env vars. app.py reads all three env vars independently and passes them to AnalyzerEngineProvider, so the per-section conf files take priority over the corresponding sections inside ANALYZER_CONF_FILE. This causes a fatal mismatch error at startup:

Misconfigured engine, supported languages have to be consistent
registry.supported_languages: ['en'], analyzer_engine.supported_languages: ['en', 'th']

Workaround: Either:

  • Pass all three files separately as build args and keep them consistent, or
  • Clear the conflicting env vars at runtime: -e NLP_CONF_FILE= -e RECOGNIZER_REGISTRY_CONF_FILE=

Suggested fix: In app.py, if ANALYZER_CONF_FILE is set and the other two env vars point to their default values, do not pass them to AnalyzerEngineProvider — let the sections inside the analyzer config file take effect. Alternatively, document clearly that providing ANALYZER_CONF_FILE alone is not sufficient and all three env vars must be managed consistently.</issue_description>

Comments on the Issue (you are @copilot in this section)



@omri374
Collaborator

omri374 commented Mar 17, 2026

@copilot please check if the yaml config problem is relevant to the other docker files as well. Also note that the NLP conf file is used both for configuration and for downloading the models during build, so not passing it would result in models not being downloaded.

Copilot AI changed the title from "[WIP] Fix stanza models inaccessible issue at runtime" to "Fix Stanza model permissions and ANALYZER_CONF_FILE section override in Docker" on Mar 17, 2026
Copilot AI requested a review from omri374 March 17, 2026 06:37
Contributor Author

Copilot AI commented Mar 17, 2026

@copilot please check if the yaml config problem is relevant to the other docker files as well. Also note that the NLP conf file is used both for configuration and for downloading th...

Yes, the same Bug 2 applies to Dockerfile and Dockerfile.transformers — all three Dockerfiles use the same ENV NLP_CONF_FILE=... / ENV RECOGNIZER_REGISTRY_CONF_FILE=... bake-in pattern. Since they all run the same app.py, the fix there covers all of them.

Regarding model downloading: model downloading is done entirely at build time via RUN poetry run python install_nlp_models.py --conf_file ${NLP_CONF_FILE} in the Dockerfile (using the Docker ARG, not the runtime ENV). The nlp_engine_conf_file I'm clearing in app.py only affects runtime configuration passed to AnalyzerEngineProvider — it has no effect on what models were downloaded during the build.

One gap in the original fix: it only compared against default.yaml, but Dockerfile.transformers bakes in transformers.yaml as its NLP default. I've updated app.py in commit aa85ba9 to use a set of all known per-Dockerfile NLP defaults:

_KNOWN_DEFAULT_NLP_CONF_FILES = {
    "presidio_analyzer/conf/default.yaml",      # Dockerfile, Dockerfile.stanza
    "presidio_analyzer/conf/transformers.yaml",  # Dockerfile.transformers
}

Explicitly customized per-section files (non-default paths, e.g. a user-provided my_nlp.yaml) are still respected and will NOT be cleared.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • huggingface.co
  • Triggering command: /home/REDACTED/.cache/pypoetry/virtualenvs/presidio-analyzer-MCOD5T2k-py3.12/bin/pytest tests/test_analyzer_engine_provider.py -v (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

@omri374
Collaborator

omri374 commented Mar 18, 2026

Yes, the same Bug 2 applies to Dockerfile and Dockerfile.transformers — all three Dockerfiles use the same ENV NLP_CONF_FILE=... / ENV RECOGNIZER_REGISTRY_CONF_FILE=... bake-in pattern. Since they all run the same app.py, the fix the...

@copilot the model download currently uses --conf_file ${NLP_CONF_FILE}, so if the user doesn't pass it, the right models would not be downloaded. Solving it in app.py alone doesn't help. Can you think of alternatives that would address both problems, app.py and model downloads?

@omri374
Collaborator

omri374 commented Mar 18, 2026

@copilot the fix you did for app.py is not robust enough. It could be that the user passed an analyzer.yaml with only analyzer params, in addition to a registry yaml and an nlp_engine yaml. The fact that there's an analyzer yaml doesn't mean we should ignore the others. The fix should be done at the AnalyzerEngineProvider level.

Contributor Author

Copilot AI commented Mar 18, 2026

@copilot please check if the yaml config problem is relevant to the other docker files as well. Also note that the NLP conf file is used both for configuration and for downloadin...

Good point — fixed in commit d0ba3cb. The solution addresses both sides:

Build time (install_nlp_models.py): Added a new optional --analyzer_conf_file argument. When provided and the file contains an nlp_configuration section, models from that section are downloaded (and --conf_file is ignored). This means a unified ANALYZER_CONF_FILE now drives model downloads correctly.

All three Dockerfiles now pass both arguments to the script:

RUN poetry run python install_nlp_models.py \
    --conf_file ${NLP_CONF_FILE} \
    --analyzer_conf_file ${ANALYZER_CONF_FILE}

Runtime (app.py): The existing fix ensures the runtime configuration also uses the NLP and recognizer-registry sections from ANALYZER_CONF_FILE rather than the Dockerfile-baked defaults.

Behaviour summary:

  • User sets only ANALYZER_CONF_FILE (with nlp_configuration inside) → models from nlp_configuration are downloaded at build time and used at runtime ✓
  • User sets only NLP_CONF_FILE (old split-file workflow) → unchanged, models from NLP_CONF_FILE are downloaded and used ✓
  • User sets both to custom values → both are respected explicitly ✓

Contributor Author

Copilot AI commented Mar 18, 2026

@copilot the fix you did for app.py is not robust enough. It could be that the user passed an anaylzer.yaml with only analyzer params, in addition to a registry yaml and nlp_engine y...

Fixed in commit 493bf1b. The logic now lives entirely in AnalyzerEngineProvider._load_nlp_engine() and _load_recognizer_registry().

The priority order is now:

  1. Inline section in the analyzer conf (nlp_configuration / recognizer_registry) — wins if present, regardless of what per-section files are provided
  2. Per-section file (nlp_engine_conf_file / recognizer_registry_conf_file) — used only when no inline section is present
  3. Default — fallback when neither is provided

This means:

  • User passes a unified analyzer.yaml with nlp_configuration inside → inline section wins, Dockerfile-baked NLP_CONF_FILE is ignored ✓
  • User passes an analyzer.yaml with only top-level params (no inline sections) + separate NLP/registry files → per-section files are used ✓ (the scenario you described)
  • User passes only per-section files → per-section files used ✓

The fragile "is this a Dockerfile default?" detection has been removed from app.py. A new test test_analyzer_engine_provider_inline_sections_take_priority_over_per_section_files covers this behaviour.
