Unified Analyzer Configuration #1970
Pull request overview
This PR consolidates Presidio Analyzer configuration into a single unified conf/analyzer.yaml, while keeping the legacy per-section YAML files for backward compatibility with deprecation messaging across code, Docker, and docs.
Changes:
- Added a unified analyzer configuration schema (`AnalyzerConfiguration`) and a new default `conf/analyzer.yaml`.
- Updated `AnalyzerEngineProvider` defaults/deprecation warnings to prefer the unified config file.
- Updated Dockerfiles and documentation to reference the unified configuration approach.
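
As a rough illustration of what the consolidation means, the sketch below models a unified configuration as the plain dict a YAML loader would return and splits it back into the sections the legacy files used to hold. The keys `nlp_configuration`, `nlp_engine_name`, `models`, and `supported_languages` appear elsewhere in this PR; `recognizer_registry`, `default_score_threshold`, the sample values, and the helper `split_unified_config` are assumptions made for the example and may not match the PR's actual schema.

```python
from typing import Any, Dict, Tuple

# Hypothetical unified configuration, shown as the dict a YAML loader would return.
# The "recognizer_registry" and "default_score_threshold" keys and all values
# are assumptions for illustration.
UNIFIED_CONFIG: Dict[str, Any] = {
    "supported_languages": ["en"],
    "default_score_threshold": 0.0,
    "nlp_configuration": {
        "nlp_engine_name": "spacy",
        "models": [{"lang_code": "en", "model_name": "en_core_web_lg"}],
    },
    "recognizer_registry": {
        "supported_languages": ["en"],
        "recognizers": [],
    },
}


def split_unified_config(
    config: Dict[str, Any],
) -> Tuple[Dict[str, Any], Dict[str, Any], Dict[str, Any]]:
    """Split a unified config into (analyzer, nlp, registry) sections.

    Hypothetical helper: missing sections come back as empty dicts.
    """
    nlp = dict(config.get("nlp_configuration") or {})
    registry = dict(config.get("recognizer_registry") or {})
    analyzer = {
        key: value
        for key, value in config.items()
        if key not in ("nlp_configuration", "recognizer_registry")
    }
    return analyzer, nlp, registry


analyzer_section, nlp_section, registry_section = split_unified_config(UNIFIED_CONFIG)
print(sorted(analyzer_section))  # top-level analyzer settings only
print(nlp_section["nlp_engine_name"])
```

Keeping the three sections under one root is what lets a single file replace the separate analyzer, NLP, and recognizer-registry files.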
Reviewed changes
Copilot reviewed 27 out of 27 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| presidio-analyzer/tests/test_analyzer_engine_provider.py | Updates default config filename expectation to analyzer.yaml. |
| presidio-analyzer/presidio_analyzer/recognizer_registry/recognizers_loader_utils.py | Adds deprecation warning when loading standalone recognizer-registry config files. |
| presidio-analyzer/presidio_analyzer/nlp_engine/nlp_engine_provider.py | Adds deprecation warning when loading standalone NLP config files. |
| presidio-analyzer/presidio_analyzer/configuration/analyzer_configuration.py | Introduces Pydantic-based unified configuration models and YAML loader. |
| presidio-analyzer/presidio_analyzer/configuration/__init__.py | Exposes unified configuration models as a package API. |
| presidio-analyzer/presidio_analyzer/conf/default.yaml | Adds deprecation banner pointing users to conf/analyzer.yaml. |
| presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml | Adds deprecation banner pointing users to conf/analyzer.yaml. |
| presidio-analyzer/presidio_analyzer/conf/default_analyzer.yaml | Adds deprecation banner pointing users to conf/analyzer.yaml. |
| presidio-analyzer/presidio_analyzer/conf/default_analyzer_full.yaml | Adds deprecation banner pointing users to conf/analyzer.yaml. |
| presidio-analyzer/presidio_analyzer/conf/analyzer.yaml | Adds the new unified default analyzer configuration file. |
| presidio-analyzer/presidio_analyzer/analyzer_engine_provider.py | Makes analyzer.yaml the default analyzer config and adds deprecation warnings for old patterns/params. |
| presidio-analyzer/install_nlp_models.py | Switches model installation to prefer NLP config embedded in analyzer config. |
| presidio-analyzer/Dockerfile.windows | Switches defaults toward unified config and changes model-install invocation. |
| presidio-analyzer/Dockerfile.transformers | Switches defaults toward unified config and changes model-install invocation. |
| presidio-analyzer/Dockerfile.stanza | Switches defaults toward unified config and changes model-install invocation. |
| presidio-analyzer/Dockerfile.dev | Updates build args to prefer unified config defaults. |
| presidio-analyzer/Dockerfile | Switches defaults toward unified config and changes model-install invocation. |
| presidio-analyzer/app.py | Adds deprecation warnings for legacy env vars and minor formatting fixes. |
| docs/tutorial/08_no_code.md | Updates tutorial to reference unified config and adds deprecation note for separate files. |
| docs/samples/python/no_code_config.ipynb | Updates sample notebook links/text to use the unified config approach. |
| docs/samples/python/langextract/index.md | Updates LangExtract sample to enable/configure recognizer via unified config and AnalyzerEngineProvider. |
| docs/analyzer/languages.md | Updates guidance to configure NLP models via nlp_configuration inside unified config. |
| docs/analyzer/customizing_nlp_models.md | Updates config references from default.yaml to unified config (but includes an inaccurate statement). |
| docs/analyzer/analyzer_engine_provider.md | Updates docs to recommend unified configuration and marks multi-file usage as deprecated. |
| docs/analyzer/adding_recognizers.md | Updates docs to add recognizers via conf/analyzer.yaml. |
| CONTRIBUTING.md | Updates contributor guidance to add recognizers in conf/analyzer.yaml. |
| .github/copilot-instructions.md | Updates repo instructions to reference conf/analyzer.yaml as the configuration target. |

    ARG ANALYZER_CONF_FILE=presidio_analyzer/conf/analyzer.yaml
    ARG NLP_CONF_FILE=
    ARG RECOGNIZER_REGISTRY_CONF_FILE=

NLP_CONF_FILE and RECOGNIZER_REGISTRY_CONF_FILE are now defaulted to an empty string, but the Dockerfile still unconditionally uses these build args in COPY instructions later. With empty defaults, the Docker build will fail when it reaches those COPY ${NLP_CONF_FILE} ... / COPY ${RECOGNIZER_REGISTRY_CONF_FILE} ... steps. Either remove those COPY steps entirely, or keep non-empty defaults (and rely on the unified analyzer config taking priority at runtime).

    ARG ANALYZER_CONF_FILE=presidio_analyzer/conf/analyzer.yaml
    ARG NLP_CONF_FILE=
    ARG RECOGNIZER_REGISTRY_CONF_FILE=

NLP_CONF_FILE / RECOGNIZER_REGISTRY_CONF_FILE are defaulted to empty, but this Dockerfile still uses them in later COPY instructions. With empty defaults, docker build will fail on those COPY steps. Either drop the per-section COPYs entirely or keep non-empty defaults (the unified analyzer_conf_file already takes priority in AnalyzerEngineProvider).

    ARG ANALYZER_CONF_FILE=presidio_analyzer/conf/analyzer.yaml
    ARG NLP_CONF_FILE=
    ARG RECOGNIZER_REGISTRY_CONF_FILE=

NLP_CONF_FILE / RECOGNIZER_REGISTRY_CONF_FILE are set to empty defaults, but later COPY steps still reference them. This will break Docker builds unless the args are overridden. Consider removing the per-section COPYs (preferred if deprecating them) or keeping non-empty defaults for backward compatibility.

    ARG ANALYZER_CONF_FILE=presidio_analyzer/conf/analyzer.yaml
    ARG NLP_CONF_FILE=
    ARG RECOGNIZER_REGISTRY_CONF_FILE=

NLP_CONF_FILE / RECOGNIZER_REGISTRY_CONF_FILE are defaulted to empty strings, but this Dockerfile still unconditionally COPYs them later. With empty defaults, the Docker build will fail on those COPY steps unless the args are always overridden. Either remove those COPY instructions or keep non-empty defaults.

      install_models(
    -     nlp_conf_file=args.nlp_conf_file or "presidio_analyzer/conf/default.yaml",
    -     analyzer_conf_file=args.analyzer_conf_file,
    +     nlp_conf_file=args.nlp_conf_file,
    +     analyzer_conf_file=args.analyzer_conf_file
    +     or "presidio_analyzer/conf/analyzer.yaml",
      )

The CLI entrypoint always passes analyzer_conf_file=args.analyzer_conf_file or "presidio_analyzer/conf/analyzer.yaml". Since the default analyzer.yaml contains nlp_configuration, this makes --nlp_conf_file (and the deprecated --conf_file) effectively ignored in the common case, and can also lead to errors if the analyzer conf file lacks nlp_configuration while nlp_conf_file is still None. Consider only defaulting analyzer_conf_file when neither an analyzer config nor an NLP config was provided, and otherwise default nlp_conf_file to the legacy conf/default.yaml for backward compatibility.
@copilot apply changes based on this feedback
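
One possible reading of the defaulting rule suggested above can be sketched as a small helper. The two paths come from the diff; the function name `resolve_conf_files` and the exact fallback order are assumptions for illustration, not the code that was committed.

```python
from typing import Optional, Tuple

# Paths taken from the diff above.
DEFAULT_ANALYZER_CONF = "presidio_analyzer/conf/analyzer.yaml"
LEGACY_NLP_CONF = "presidio_analyzer/conf/default.yaml"


def resolve_conf_files(
    analyzer_conf_file: Optional[str],
    nlp_conf_file: Optional[str],
) -> Tuple[Optional[str], Optional[str]]:
    """Hypothetical helper: pick (analyzer_conf_file, nlp_conf_file) defaults.

    Empty strings are treated like missing values, since Docker build args
    may default to "".
    """
    if not analyzer_conf_file and not nlp_conf_file:
        # Nothing supplied: use the unified default, which embeds nlp_configuration.
        return DEFAULT_ANALYZER_CONF, None
    if not nlp_conf_file:
        # Only an analyzer config was supplied; keep the legacy NLP file as a
        # fallback in case that config has no nlp_configuration section.
        return analyzer_conf_file, LEGACY_NLP_CONF
    # An explicit NLP config was supplied: pass it through and do not inject
    # the unified default, so the user's file is not silently ignored.
    return analyzer_conf_file or None, nlp_conf_file
```

With this shape, `--nlp_conf_file` on its own keeps working for existing callers, while a bare invocation picks up the unified file.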

    def from_yaml(cls, file_path: Union[str, Path]) -> "AnalyzerConfiguration":
        """Load and validate configuration from a YAML file.

        :param file_path: Path to the YAML configuration file.
        :return: Validated AnalyzerConfiguration instance.
        :raises FileNotFoundError: If the file does not exist.
        :raises ValueError: If the YAML content is invalid.
        """
        path = Path(file_path)
        if not path.exists():
            raise FileNotFoundError(f"Configuration file not found: {path}")
        if not path.is_file():
            raise ValueError(f"Path is not a file: {path}")

        with open(path) as f:
            raw = yaml.safe_load(f)

AnalyzerConfiguration.from_yaml() documents :raises ValueError: If the YAML content is invalid, but yaml.safe_load() can raise yaml.YAMLError, and return cls(**raw) can raise pydantic.ValidationError. Either catch/wrap these into ValueError (to match the API contract) or update the docstring/type hints to reflect the actual exceptions.
@copilot apply changes based on this feedback
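
The "catch and wrap" option from the comment above can be sketched with the standard library alone. To stay runnable without PyYAML or Pydantic, this example swaps in a toy `key: value` parser and a dataclass: `ParseError` stands in for `yaml.YAMLError`, and the `TypeError` raised by the dataclass constructor stands in for a model validation error. `DemoConfiguration`, `parse_key_values`, and `from_file` are hypothetical names, not part of Presidio.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Union


class ParseError(Exception):
    """Stand-in for yaml.YAMLError; deliberately not a ValueError."""


def parse_key_values(text: str) -> Dict[str, str]:
    """Toy parser standing in for yaml.safe_load: one 'key: value' per line."""
    result: Dict[str, str] = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        key, separator, value = line.partition(":")
        if not separator or not key.strip():
            raise ParseError(f"line {number}: expected 'key: value'")
        result[key.strip()] = value.strip()
    return result


@dataclass
class DemoConfiguration:
    """Hypothetical stand-in for AnalyzerConfiguration."""

    default_language: str = "en"

    @classmethod
    def from_file(cls, file_path: Union[str, Path]) -> "DemoConfiguration":
        """Load a config file, raising only FileNotFoundError or ValueError."""
        path = Path(file_path)
        if not path.exists():
            raise FileNotFoundError(f"Configuration file not found: {path}")
        try:
            raw = parse_key_values(path.read_text(encoding="utf-8"))
        except ParseError as exc:
            # Syntax problems surface as ValueError, matching the docstring.
            raise ValueError(f"Invalid syntax in {path}: {exc}") from exc
        try:
            # Unknown keys make the dataclass constructor raise TypeError.
            return cls(**raw)
        except TypeError as exc:
            raise ValueError(f"Invalid configuration in {path}: {exc}") from exc
```

Chaining with `from exc` keeps the original error available on `__cause__`, so callers who need the parser's detail can still reach it.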
@copilot address the comments on the COPY commands for the deprecated conf files: remove the COPY commands altogether and assume the files are already copied because they are in the root dir.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…kerfiles Agent-Logs-Url: https://github.com/microsoft/presidio/sessions/88e42ff4-6920-4698-9769-1c702ccf5556 Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>
Done in commit |
@omri374 Notice that the base branch is v3.

        """Load and validate configuration from a YAML file.

        :param file_path: Path to the YAML configuration file.
        :return: Validated AnalyzerConfiguration instance.
        :raises FileNotFoundError: If the file does not exist.
        :raises ValueError: If the YAML content is invalid.
        """
        path = Path(file_path)
        if not path.exists():
            raise FileNotFoundError(f"Configuration file not found: {path}")
        if not path.is_file():
            raise ValueError(f"Path is not a file: {path}")

        with open(path) as f:
            raw = yaml.safe_load(f)

        if raw is None:
            raise ValueError(f"Configuration file is empty: {path}")
        if not isinstance(raw, dict):
            raise ValueError(
                f"Configuration file must contain a YAML mapping, "
                f"got {type(raw).__name__}: {path}"
            )

        # Detect and warn about deprecated separate-file format keys
        deprecated_top_keys = {"nlp_engine_name", "models"}
        found_deprecated = deprecated_top_keys & set(raw.keys())
        if found_deprecated:
            warnings.warn(
                f"Configuration file '{path}' appears to use the deprecated "
                f"standalone NLP configuration format "
                f"(found top-level keys: {sorted(found_deprecated)}). "
                f"Please migrate to the unified analyzer configuration format. "
                f"See: https://microsoft.github.io/presidio/analyzer/"
                f"analyzer_engine_provider/",
                DeprecationWarning,
                stacklevel=2,
            )

        return cls(**raw)

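
The deprecated-key check in the snippet above can be tried in isolation. The sketch below reuses the two key names from the diff; the helper name `find_deprecated_keys`, the sample dicts, and the wording of the warning (including the hint to move keys under `nlp_configuration`) are assumptions for illustration.

```python
import warnings
from typing import Any, Dict, List

# Key names taken from the snippet above.
DEPRECATED_TOP_KEYS = {"nlp_engine_name", "models"}


def find_deprecated_keys(raw: Dict[str, Any], source: str = "<config>") -> List[str]:
    """Hypothetical helper: warn when legacy NLP keys sit at the top level."""
    found = sorted(DEPRECATED_TOP_KEYS & set(raw))
    if found:
        warnings.warn(
            f"Configuration '{source}' uses deprecated top-level keys: {found}. "
            "Move them under 'nlp_configuration'.",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller
        )
    return found


legacy = {"nlp_engine_name": "spacy", "models": []}
unified = {"nlp_configuration": {"nlp_engine_name": "spacy", "models": []}}

with warnings.catch_warnings(record=True) as caught:
    # Record every warning; Python's default filters hide most DeprecationWarnings.
    warnings.simplefilter("always")
    legacy_hits = find_deprecated_keys(legacy, "legacy.yaml")
    unified_hits = find_deprecated_keys(unified, "analyzer.yaml")

print(legacy_hits, unified_hits, len(caught))
```

Because Python's default filters suppress `DeprecationWarning` outside `__main__`, users may only see such a message when warnings are enabled explicitly, which is worth keeping in mind when relying on it for migration notices.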

    ENV RECOGNIZER_REGISTRY_CONF_FILE=${RECOGNIZER_REGISTRY_CONF_FILE}
    ENV NLP_CONF_FILE=${NLP_CONF_FILE}

    ENV PORT=3000
    ENV WORKERS=1

    COPY ${ANALYZER_CONF_FILE} /app/${ANALYZER_CONF_FILE}


    ENV ANALYZER_CONF_FILE=${ANALYZER_CONF_FILE}
    ENV RECOGNIZER_REGISTRY_CONF_FILE=${RECOGNIZER_REGISTRY_CONF_FILE}
    ENV NLP_CONF_FILE=${NLP_CONF_FILE}

    COPY ${ANALYZER_CONF_FILE} /app/${ANALYZER_CONF_FILE}


    ENV ANALYZER_CONF_FILE=${ANALYZER_CONF_FILE}
    ENV RECOGNIZER_REGISTRY_CONF_FILE=${RECOGNIZER_REGISTRY_CONF_FILE}
    ENV NLP_CONF_FILE=${NLP_CONF_FILE}

    ENV PORT=3000
    ENV WORKERS=1

    COPY ${ANALYZER_CONF_FILE} /app/${ANALYZER_CONF_FILE}


    ENV ANALYZER_CONF_FILE=${ANALYZER_CONF_FILE}
    ENV RECOGNIZER_REGISTRY_CONF_FILE=${RECOGNIZER_REGISTRY_CONF_FILE}
    ENV NLP_CONF_FILE=${NLP_CONF_FILE}

    RUN powershell -Command \
        "Write-Host 'Downloading VC++ Redistributable...'; \
        Invoke-WebRequest -Uri 'https://aka.ms/vs/16/release/vc_redist.x64.exe' -OutFile 'vc_redist.x64.exe' -UseBasicParsing; \
        Write-Host 'Installing VC++ Redistributable...'; \
        Start-Process -FilePath 'vc_redist.x64.exe' -ArgumentList '/quiet', '/install' -Wait; \
        Write-Host 'VC++ Redistributable installed successfully'; \
        Remove-Item 'vc_redist.x64.exe'"

    COPY ${ANALYZER_CONF_FILE} /app/${ANALYZER_CONF_FILE}


      - `low_score_entity_names`: A list of entity types to apply the low confidence score multiplier to.

    - The [default conf file](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/default.yaml) is read during the default initialization of the `AnalyzerEngine`. Alternatively, the path to a custom configuration file can be passed to the `NlpEngineProvider`:
    + To load NLP settings from a configuration file, use `NlpEngineProvider` with a config file such as the [analyzer configuration file](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/analyzer.yaml). Alternatively, NLP configuration can be passed directly:

omri374 left a comment
Thanks! There's one issue: the recognizer registry currently reads default_recognizers.yaml. Aside from this, it looks great.
We should think about what this will look like once it is deprecated completely rather than just raising a warning. Does it break? Do we have a fallback mechanism? Not for this PR, of course.

    ):
        warnings.warn(
            f"Configuration file '{conf_file}' uses the deprecated "
            f"partial-configuration format (only supported_languages and "

It's not clear what the partial-configuration format is. Is the warning about using deprecated files, or is it aimed at something more specific here?
Indeed, it's a warning about using deprecated files.

    # Load defaults if needed (no config provided,
    # or registry_configuration is incomplete)
    if use_defaults:
        with open(RecognizerConfigurationLoader._get_full_conf_path()) as file:

This still uses the "default_recognizers.yaml" file. Do we want to update this too?
Since it relates to recognizers_loader, I think we can keep it until the full deprecation. At that point, to answer your question, I think we should fail when passed a non-unified config.
Consolidate analyzer configuration into a single unified file
Migrates the presidio-analyzer from three separate config files (default_analyzer.yaml, default.yaml, default_recognizers.yaml) to a single unified analyzer.yaml. Old files are preserved with deprecation banners.