Fix Stanza model permissions and ANALYZER_CONF_FILE section override in Docker #1911
Conversation
@copilot please check if the yaml config problem is relevant to the other docker files as well. Also note that the NLP conf file is used both for configuration and for downloading the models during build, so not passing it would result in models not being downloaded.
…erriding ANALYZER_CONF_FILE Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com>
…nsformers.yaml) Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com>
Yes, the same Bug 2 applies to the other Dockerfiles as well. Regarding model downloading: model downloading is done entirely at build time via `install_nlp_models.py`. One gap in the original fix: it only compared against a fixed set of known default conf paths:

```python
_KNOWN_DEFAULT_NLP_CONF_FILES = {
    "presidio_analyzer/conf/default.yaml",       # Dockerfile, Dockerfile.stanza
    "presidio_analyzer/conf/transformers.yaml",  # Dockerfile.transformers
}
```

Explicitly customized per-section files (non-default paths, e.g. a user-provided …) still take priority over the analyzer conf sections.
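The gap can be illustrated with a quick sketch of such a known-defaults check (hypothetical code: the helper name and exact comparison are illustrative, not the repository's actual implementation):

```python
# Hypothetical sketch of the fragile "is this a Dockerfile default?" check.
# The set mirrors the paths quoted above; the helper name is made up.
_KNOWN_DEFAULT_NLP_CONF_FILES = {
    "presidio_analyzer/conf/default.yaml",       # Dockerfile, Dockerfile.stanza
    "presidio_analyzer/conf/transformers.yaml",  # Dockerfile.transformers
}

def is_dockerfile_default(nlp_conf_file: str) -> bool:
    """True only for an exact string match against the known default paths."""
    return nlp_conf_file in _KNOWN_DEFAULT_NLP_CONF_FILES

# An exact default path is detected:
print(is_dockerfile_default("presidio_analyzer/conf/default.yaml"))      # True
# ...but an absolute spelling of the same file slips through the check:
print(is_dockerfile_default("/app/presidio_analyzer/conf/default.yaml")) # False
```

Because the comparison is an exact path match, any rebased or absolute spelling of a default conf defeats the detection, which is why it was later removed in favour of an unconditional priority rule.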
@copilot the model download currently uses …
@copilot the fix you did for app.py is not robust enough. It could be that the user passed an analyzer.yaml with only analyzer params, in addition to a registry yaml and an nlp_engine yaml. The fact that there's an analyzer yaml doesn't mean we should ignore the others. The fix should be done at the AnalyzerEngineProvider level.
…nf_file to install_nlp_models.py and update all Dockerfiles Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com>
Good point — fixed in commit …. Build time (`install_nlp_models.py`): the script now accepts an optional `--analyzer_conf_file` argument. All three Dockerfiles now pass both arguments to the script:

```dockerfile
RUN poetry run python install_nlp_models.py \
    --conf_file ${NLP_CONF_FILE} \
    --analyzer_conf_file ${ANALYZER_CONF_FILE}
```

Runtime (…): …

Behaviour summary: …
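The build-time selection rule described above can be sketched as follows (a minimal sketch assuming the two conf files are already parsed into dicts; `pick_download_conf` is a hypothetical name, not the script's real function):

```python
def pick_download_conf(analyzer_conf, nlp_conf):
    """Pick the configuration that drives build-time model downloads.

    A unified analyzer conf containing an `nlp_configuration` section
    wins; otherwise the --conf_file content is used, as before.
    """
    if analyzer_conf and "nlp_configuration" in analyzer_conf:
        return analyzer_conf["nlp_configuration"]  # unified conf drives downloads
    return nlp_conf  # split-file workflow: unchanged behaviour

# Unified analyzer conf: its inline section is used, --conf_file is ignored.
unified = {"nlp_configuration": {"nlp_engine_name": "stanza"}}
default = {"nlp_engine_name": "spacy"}
print(pick_download_conf(unified, default))  # {'nlp_engine_name': 'stanza'}

# No analyzer conf (or no inline section): fall back to --conf_file.
print(pick_download_conf(None, default))     # {'nlp_engine_name': 'spacy'}
```

Passing both arguments unconditionally from the Dockerfiles keeps the decision in one place, inside the script, rather than in per-image `ENV` wiring.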
…e priority over per-section files Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com>
Fixed in commit …. The priority order is now:

1. Inline `nlp_configuration` / `recognizer_registry` sections embedded in the analyzer conf
2. Separately provided per-section files, used only when the corresponding inline section is absent

This means:

- An `ANALYZER_CONF_FILE` with inline sections wins over the Dockerfile-baked per-section defaults
- An `ANALYZER_CONF_FILE` with only top-level params still falls back to the separate NLP/registry files

The fragile "is this a Dockerfile default?" detection has been removed from `app.py`.
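The new priority can be sketched at the `AnalyzerEngineProvider` level like this (`resolve_section` is a hypothetical helper illustrating the rule, not the provider's actual method):

```python
def resolve_section(analyzer_conf, section_name, per_section_file_conf):
    """Inline sections embedded in the analyzer conf win; a separately
    provided per-section file is used only when the inline section is
    absent. No guessing about Dockerfile defaults is required."""
    if section_name in analyzer_conf:
        return analyzer_conf[section_name]  # inline section wins
    return per_section_file_conf            # fall back to the separate file

unified = {"nlp_configuration": {"nlp_engine_name": "stanza"}}
baked_default = {"nlp_engine_name": "spacy"}  # e.g. Dockerfile-baked NLP_CONF_FILE

# Inline section present -> it wins over the baked-in default:
print(resolve_section(unified, "nlp_configuration", baked_default))
# {'nlp_engine_name': 'stanza'}

# Analyzer conf with only top-level params -> per-section file is used:
print(resolve_section({"supported_languages": ["en"]},
                      "nlp_configuration", baked_default))
# {'nlp_engine_name': 'spacy'}
```

Because the rule depends only on whether the section exists in the analyzer conf, it behaves the same whether the per-section file is a Dockerfile default or a user-supplied path.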
Two configuration bugs when using Docker-based deployments with a custom `ANALYZER_CONF_FILE`.

**Bug 1: Stanza models written to `/root/.cache` — inaccessible at runtime**

`stanza.download()` runs as `root` during build, writing models to `/root/.cache/stanza/`. The container runs as user `1001` (presidio) at runtime, making those models unreachable.

**Fix:** Set `ENV STANZA_RESOURCES_DIR=/app/stanza_resources` in `Dockerfile.stanza` before the model download step. `/app` is already `chown`-ed to the presidio user.

**Bug 2: Default `NLP_CONF_FILE`/`RECOGNIZER_REGISTRY_CONF_FILE` silently override sections in `ANALYZER_CONF_FILE`**

All Dockerfile variants (`Dockerfile`, `Dockerfile.stanza`, `Dockerfile.transformers`) bake default values for `NLP_CONF_FILE` and `RECOGNIZER_REGISTRY_CONF_FILE` into the image via `ENV`. When a user provides a single unified `ANALYZER_CONF_FILE` (containing `nlp_configuration` and `recognizer_registry` sections) and leaves the other two vars at their Dockerfile defaults, `AnalyzerEngineProvider` prioritizes the per-section files over the corresponding sections inside `ANALYZER_CONF_FILE`, causing a fatal mismatch at startup.

**Fix — Runtime (`AnalyzerEngineProvider`)**

The priority logic in `AnalyzerEngineProvider._load_nlp_engine()` and `_load_recognizer_registry()` is updated so that inline sections embedded in the analyzer conf take priority over separately provided per-section files. A per-section file is only used when the corresponding inline section is absent from the analyzer conf.

This is robust for all use cases:

- `ANALYZER_CONF_FILE` with inline `nlp_configuration` and `recognizer_registry` → inline sections win, Dockerfile-baked defaults are ignored ✓
- `ANALYZER_CONF_FILE` with only top-level params (no inline sections) + separate NLP/registry files → per-section files are used ✓

**Fix — Build-time model downloads (`install_nlp_models.py` + all Dockerfiles)**

The model download script now accepts an optional `--analyzer_conf_file` argument. When provided and the file contains an `nlp_configuration` section (unified analyzer conf format), models from that section are downloaded and `--conf_file` is ignored. This ensures that a unified `ANALYZER_CONF_FILE` drives both build-time model downloads and runtime configuration without requiring a separate `NLP_CONF_FILE`.

All three Dockerfiles now pass both arguments to the script:

```dockerfile
RUN poetry run python install_nlp_models.py \
    --conf_file ${NLP_CONF_FILE} \
    --analyzer_conf_file ${ANALYZER_CONF_FILE}
```

Resulting behaviour:
- `ANALYZER_CONF_FILE` (with `nlp_configuration` inside) → models from `nlp_configuration` are downloaded at build time and used at runtime ✓
- `NLP_CONF_FILE` (split-file workflow) → unchanged behaviour ✓

**Original prompt**

…ANALYZER_CONF_FILE #1910
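For reference, a unified analyzer conf of the kind this PR supports could look like the following after YAML parsing (a sketch: only the two section keys come from the PR text, the inner fields are illustrative assumptions about presidio's conf format):

```python
# Sketch of a unified ANALYZER_CONF_FILE represented as the dict a YAML
# parser would produce. The section keys (nlp_configuration,
# recognizer_registry) are the ones discussed in this PR; the inner
# fields are illustrative assumptions.
unified_conf = {
    "supported_languages": ["en"],
    "nlp_configuration": {
        "nlp_engine_name": "stanza",
        "models": [{"lang_code": "en", "model_name": "en"}],
    },
    "recognizer_registry": {
        "recognizers": [],
    },
}

# Build time: the inline nlp_configuration section drives model downloads,
# so the Dockerfile-baked NLP_CONF_FILE is ignored.
# Runtime: both inline sections win over the baked per-section defaults.
print(sorted(unified_conf))
# ['nlp_configuration', 'recognizer_registry', 'supported_languages']
```

With both fixes applied, this single file controls which Stanza models land in `STANZA_RESOURCES_DIR` at build time and which NLP engine and recognizers the analyzer uses at runtime.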