Unified Analyzer Configuration #1970

Open
SharonHart wants to merge 4 commits into v3 from shhart/feature/unified-analyzer-config

@SharonHart
Contributor

Consolidate analyzer configuration into a single unified file
Migrates the presidio-analyzer from three separate config files (default_analyzer.yaml, default.yaml, default_recognizers.yaml) to a single unified analyzer.yaml. Old files are preserved with deprecation banners.

Contributor

Copilot AI left a comment

Pull request overview

This PR consolidates Presidio Analyzer configuration into a single unified conf/analyzer.yaml, while keeping the legacy per-section YAML files for backward compatibility with deprecation messaging across code, Docker, and docs.

Changes:

  • Added a unified analyzer configuration schema (AnalyzerConfiguration) and a new default conf/analyzer.yaml.
  • Updated AnalyzerEngineProvider defaults/deprecation warnings to prefer the unified config file.
  • Updated Dockerfiles and documentation to reference the unified configuration approach.

Reviewed changes

Copilot reviewed 27 out of 27 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| presidio-analyzer/tests/test_analyzer_engine_provider.py | Updates default config filename expectation to analyzer.yaml. |
| presidio-analyzer/presidio_analyzer/recognizer_registry/recognizers_loader_utils.py | Adds deprecation warning when loading standalone recognizer-registry config files. |
| presidio-analyzer/presidio_analyzer/nlp_engine/nlp_engine_provider.py | Adds deprecation warning when loading standalone NLP config files. |
| presidio-analyzer/presidio_analyzer/configuration/analyzer_configuration.py | Introduces Pydantic-based unified configuration models and YAML loader. |
| presidio-analyzer/presidio_analyzer/configuration/__init__.py | Exposes unified configuration models as a package API. |
| presidio-analyzer/presidio_analyzer/conf/default.yaml | Adds deprecation banner pointing users to conf/analyzer.yaml. |
| presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml | Adds deprecation banner pointing users to conf/analyzer.yaml. |
| presidio-analyzer/presidio_analyzer/conf/default_analyzer.yaml | Adds deprecation banner pointing users to conf/analyzer.yaml. |
| presidio-analyzer/presidio_analyzer/conf/default_analyzer_full.yaml | Adds deprecation banner pointing users to conf/analyzer.yaml. |
| presidio-analyzer/presidio_analyzer/conf/analyzer.yaml | Adds the new unified default analyzer configuration file. |
| presidio-analyzer/presidio_analyzer/analyzer_engine_provider.py | Makes analyzer.yaml the default analyzer config and adds deprecation warnings for old patterns/params. |
| presidio-analyzer/install_nlp_models.py | Switches model installation to prefer NLP config embedded in analyzer config. |
| presidio-analyzer/Dockerfile.windows | Switches defaults toward unified config and changes model-install invocation. |
| presidio-analyzer/Dockerfile.transformers | Switches defaults toward unified config and changes model-install invocation. |
| presidio-analyzer/Dockerfile.stanza | Switches defaults toward unified config and changes model-install invocation. |
| presidio-analyzer/Dockerfile.dev | Updates build args to prefer unified config defaults. |
| presidio-analyzer/Dockerfile | Switches defaults toward unified config and changes model-install invocation. |
| presidio-analyzer/app.py | Adds deprecation warnings for legacy env vars and minor formatting fixes. |
| docs/tutorial/08_no_code.md | Updates tutorial to reference unified config and adds deprecation note for separate files. |
| docs/samples/python/no_code_config.ipynb | Updates sample notebook links/text to use the unified config approach. |
| docs/samples/python/langextract/index.md | Updates LangExtract sample to enable/configure recognizer via unified config and AnalyzerEngineProvider. |
| docs/analyzer/languages.md | Updates guidance to configure NLP models via nlp_configuration inside unified config. |
| docs/analyzer/customizing_nlp_models.md | Updates config references from default.yaml to unified config (but includes an inaccurate statement). |
| docs/analyzer/analyzer_engine_provider.md | Updates docs to recommend unified configuration and marks multi-file usage as deprecated. |
| docs/analyzer/adding_recognizers.md | Updates docs to add recognizers via conf/analyzer.yaml. |
| CONTRIBUTING.md | Updates contributor guidance to add recognizers in conf/analyzer.yaml. |
| .github/copilot-instructions.md | Updates repo instructions to reference conf/analyzer.yaml as the configuration target. |

Comment on lines +7 to +9
ARG ANALYZER_CONF_FILE=presidio_analyzer/conf/analyzer.yaml
ARG NLP_CONF_FILE=
ARG RECOGNIZER_REGISTRY_CONF_FILE=
Copilot AI Apr 13, 2026

NLP_CONF_FILE and RECOGNIZER_REGISTRY_CONF_FILE are now defaulted to an empty string, but the Dockerfile still unconditionally uses these build args in COPY instructions later. With empty defaults, the Docker build will fail when it reaches those COPY ${NLP_CONF_FILE} ... / COPY ${RECOGNIZER_REGISTRY_CONF_FILE} ... steps. Either remove those COPY steps entirely, or keep non-empty defaults (and rely on the unified analyzer config taking priority at runtime).

Copilot uses AI. Check for mistakes.
Comment on lines +5 to +7
ARG ANALYZER_CONF_FILE=presidio_analyzer/conf/analyzer.yaml
ARG NLP_CONF_FILE=
ARG RECOGNIZER_REGISTRY_CONF_FILE=
Copilot AI Apr 13, 2026

NLP_CONF_FILE / RECOGNIZER_REGISTRY_CONF_FILE are defaulted to empty, but this Dockerfile still uses them in later COPY instructions. With empty defaults, docker build will fail on those COPY steps. Either drop the per-section COPYs entirely or keep non-empty defaults (the unified analyzer_conf_file already takes priority in AnalyzerEngineProvider).

Comment on lines +5 to +7
ARG ANALYZER_CONF_FILE=presidio_analyzer/conf/analyzer.yaml
ARG NLP_CONF_FILE=
ARG RECOGNIZER_REGISTRY_CONF_FILE=
Copilot AI Apr 13, 2026

NLP_CONF_FILE / RECOGNIZER_REGISTRY_CONF_FILE are set to empty defaults, but later COPY steps still reference them. This will break Docker builds unless the args are overridden. Consider removing the per-section COPYs (preferred if deprecating them) or keeping non-empty defaults for backward compatibility.

Comment on lines +5 to +7
ARG ANALYZER_CONF_FILE=presidio_analyzer/conf/analyzer.yaml
ARG NLP_CONF_FILE=
ARG RECOGNIZER_REGISTRY_CONF_FILE=
Copilot AI Apr 13, 2026

NLP_CONF_FILE / RECOGNIZER_REGISTRY_CONF_FILE are defaulted to empty strings, but this Dockerfile still unconditionally COPYs them later. With empty defaults, the Docker build will fail on those COPY steps unless the args are always overridden. Either remove those COPY instructions or keep non-empty defaults.

Comment thread presidio-analyzer/Dockerfile.windows Outdated
Comment on lines 176 to 180
  install_models(
-     nlp_conf_file=args.nlp_conf_file or "presidio_analyzer/conf/default.yaml",
-     analyzer_conf_file=args.analyzer_conf_file,
+     nlp_conf_file=args.nlp_conf_file,
+     analyzer_conf_file=args.analyzer_conf_file
+     or "presidio_analyzer/conf/analyzer.yaml",
  )
Copilot AI Apr 13, 2026

The CLI entrypoint always passes analyzer_conf_file=args.analyzer_conf_file or "presidio_analyzer/conf/analyzer.yaml". Since the default analyzer.yaml contains nlp_configuration, this makes --nlp_conf_file (and the deprecated --conf_file) effectively ignored in the common case, and can also lead to errors if the analyzer conf file lacks nlp_configuration while nlp_conf_file is still None. Consider only defaulting analyzer_conf_file when neither an analyzer config nor an NLP config was provided, and otherwise default nlp_conf_file to the legacy conf/default.yaml for backward compatibility.
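
One possible reading of this suggestion, sketched in Python. The helper name `resolve_conf_files` and the two constant names are hypothetical; only the two file paths come from the diff under review. The sketch applies the unified default only when neither argument is given, and otherwise keeps an explicit NLP config in effect, falling back to the legacy file when an analyzer config is supplied on its own.

```python
from typing import Optional, Tuple

# Paths taken from the diff under review; the constant names are assumptions.
DEFAULT_ANALYZER_CONF = "presidio_analyzer/conf/analyzer.yaml"
LEGACY_NLP_CONF = "presidio_analyzer/conf/default.yaml"


def resolve_conf_files(
    analyzer_conf_file: Optional[str],
    nlp_conf_file: Optional[str],
) -> Tuple[Optional[str], Optional[str]]:
    """Return the (analyzer_conf_file, nlp_conf_file) pair to pass on."""
    if not analyzer_conf_file and not nlp_conf_file:
        # Nothing supplied: use the unified default only.
        return DEFAULT_ANALYZER_CONF, None
    if not nlp_conf_file:
        # Analyzer config supplied alone: keep a legacy NLP fallback in case
        # that file has no nlp_configuration section.
        return analyzer_conf_file, LEGACY_NLP_CONF
    # An explicit NLP config is passed through untouched, so it is not
    # shadowed by a defaulted analyzer config.
    return analyzer_conf_file, nlp_conf_file
```

With this shape, passing only `--nlp_conf_file` leaves `analyzer_conf_file` as `None`, so the legacy flag is no longer silently overridden by the default analyzer.yaml.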

Contributor Author

@copilot apply changes based on this feedback

Comment on lines +251 to +267
def from_yaml(cls, file_path: Union[str, Path]) -> "AnalyzerConfiguration":
    """Load and validate configuration from a YAML file.

    :param file_path: Path to the YAML configuration file.
    :return: Validated AnalyzerConfiguration instance.
    :raises FileNotFoundError: If the file does not exist.
    :raises ValueError: If the YAML content is invalid.
    """
    path = Path(file_path)
    if not path.exists():
        raise FileNotFoundError(f"Configuration file not found: {path}")
    if not path.is_file():
        raise ValueError(f"Path is not a file: {path}")

    with open(path) as f:
        raw = yaml.safe_load(f)

Copilot AI Apr 13, 2026

AnalyzerConfiguration.from_yaml() documents :raises ValueError: If the YAML content is invalid, but yaml.safe_load() can raise yaml.YAMLError, and return cls(**raw) can raise pydantic.ValidationError. Either catch/wrap these into ValueError (to match the API contract) or update the docstring/type hints to reflect the actual exceptions.
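
The "catch and wrap" option could look roughly like the sketch below. To keep it dependency-free, a toy `key: value` parser with its own `ParseError` stands in for `yaml.safe_load` / `yaml.YAMLError`, and a small dataclass stands in for the Pydantic model; all of these names are hypothetical. In the real loader the `except` clauses would name `yaml.YAMLError` and `pydantic.ValidationError` instead.

```python
from dataclasses import dataclass


class ParseError(Exception):
    """Hypothetical stand-in for yaml.YAMLError (not a ValueError subclass)."""


def parse_simple(text: str) -> dict:
    """Toy 'key: value' parser standing in for yaml.safe_load."""
    result = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ParseError(f"line {number}: expected 'key: value'")
        result[key.strip()] = value.strip()
    return result


@dataclass
class DemoConfig:
    """Hypothetical stand-in for AnalyzerConfiguration."""

    supported_languages: str


def config_from_text(text: str, source: str = "<string>") -> DemoConfig:
    """Parse and validate, surfacing every content problem as ValueError."""
    try:
        raw = parse_simple(text)
    except ParseError as exc:
        # Syntax problems are re-raised under the documented exception type;
        # `from exc` keeps the original error available as __cause__.
        raise ValueError(f"Invalid syntax in {source}: {exc}") from exc
    try:
        return DemoConfig(**raw)
    except TypeError as exc:
        # Unknown or missing fields fail construction; wrap those as well.
        raise ValueError(f"Invalid configuration in {source}: {exc}") from exc
```

Callers then only need `except ValueError` for content problems, which matches the docstring, while the chained cause still carries the library-specific detail.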

Contributor Author

@copilot apply changes based on this feedback

Comment thread docs/analyzer/customizing_nlp_models.md Outdated
@SharonHart SharonHart marked this pull request as ready for review April 13, 2026 09:47
@SharonHart
Contributor Author

@copilot address the comments on COPY command for deprecated conf file, remove the COPY command altogether and assume the files are copied as part of being in the root dir

SharonHart and others added 3 commits April 13, 2026 12:56
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Contributor

Copilot AI commented Apr 13, 2026

@copilot address the comments on COPY command for deprecated conf file, remove the COPY command altogether and assume the files are copied as part of being in the root dir

Done in commit 01edd76. Removed the COPY ${NLP_CONF_FILE} and COPY ${RECOGNIZER_REGISTRY_CONF_FILE} instructions from all 4 Dockerfiles (Dockerfile, Dockerfile.transformers, Dockerfile.stanza, Dockerfile.windows). Also removed the duplicate COPY ${ANALYZER_CONF_FILE} in Dockerfile.windows. These deprecated conf files are now picked up via the later COPY . /app/ (or COPY . .) step.

@SharonHart
Contributor Author

@omri374 Notice that the base branch is v3

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 27 out of 27 changed files in this pull request and generated 7 comments.

Comment on lines +252 to +291
"""Load and validate configuration from a YAML file.

:param file_path: Path to the YAML configuration file.
:return: Validated AnalyzerConfiguration instance.
:raises FileNotFoundError: If the file does not exist.
:raises ValueError: If the YAML content is invalid.
"""
path = Path(file_path)
if not path.exists():
    raise FileNotFoundError(f"Configuration file not found: {path}")
if not path.is_file():
    raise ValueError(f"Path is not a file: {path}")

with open(path) as f:
    raw = yaml.safe_load(f)

if raw is None:
    raise ValueError(f"Configuration file is empty: {path}")
if not isinstance(raw, dict):
    raise ValueError(
        f"Configuration file must contain a YAML mapping, "
        f"got {type(raw).__name__}: {path}"
    )

# Detect and warn about deprecated separate-file format keys
deprecated_top_keys = {"nlp_engine_name", "models"}
found_deprecated = deprecated_top_keys & set(raw.keys())
if found_deprecated:
    warnings.warn(
        f"Configuration file '{path}' appears to use the deprecated "
        f"standalone NLP configuration format "
        f"(found top-level keys: {sorted(found_deprecated)}). "
        f"Please migrate to the unified analyzer configuration format. "
        f"See: https://microsoft.github.io/presidio/analyzer/"
        f"analyzer_engine_provider/",
        DeprecationWarning,
        stacklevel=2,
    )

return cls(**raw)
Comment on lines 176 to 180
  install_models(
-     nlp_conf_file=args.nlp_conf_file or "presidio_analyzer/conf/default.yaml",
-     analyzer_conf_file=args.analyzer_conf_file,
+     nlp_conf_file=args.nlp_conf_file,
+     analyzer_conf_file=args.analyzer_conf_file
+     or "presidio_analyzer/conf/analyzer.yaml",
  )
Comment on lines 14 to 20
ENV RECOGNIZER_REGISTRY_CONF_FILE=${RECOGNIZER_REGISTRY_CONF_FILE}
ENV NLP_CONF_FILE=${NLP_CONF_FILE}

ENV PORT=3000
ENV WORKERS=1

COPY ${ANALYZER_CONF_FILE} /app/${ANALYZER_CONF_FILE}
Comment on lines 15 to 19
ENV ANALYZER_CONF_FILE=${ANALYZER_CONF_FILE}
ENV RECOGNIZER_REGISTRY_CONF_FILE=${RECOGNIZER_REGISTRY_CONF_FILE}
ENV NLP_CONF_FILE=${NLP_CONF_FILE}

COPY ${ANALYZER_CONF_FILE} /app/${ANALYZER_CONF_FILE}
Comment on lines 11 to 18
ENV ANALYZER_CONF_FILE=${ANALYZER_CONF_FILE}
ENV RECOGNIZER_REGISTRY_CONF_FILE=${RECOGNIZER_REGISTRY_CONF_FILE}
ENV NLP_CONF_FILE=${NLP_CONF_FILE}

ENV PORT=3000
ENV WORKERS=1

COPY ${ANALYZER_CONF_FILE} /app/${ANALYZER_CONF_FILE}
Comment on lines 13 to 25
ENV ANALYZER_CONF_FILE=${ANALYZER_CONF_FILE}
ENV RECOGNIZER_REGISTRY_CONF_FILE=${RECOGNIZER_REGISTRY_CONF_FILE}
ENV NLP_CONF_FILE=${NLP_CONF_FILE}

RUN powershell -Command \
"Write-Host 'Downloading VC++ Redistributable...'; \
Invoke-WebRequest -Uri 'https://aka.ms/vs/16/release/vc_redist.x64.exe' -OutFile 'vc_redist.x64.exe' -UseBasicParsing; \
Write-Host 'Installing VC++ Redistributable...'; \
Start-Process -FilePath 'vc_redist.x64.exe' -ArgumentList '/quiet', '/install' -Wait; \
Write-Host 'VC++ Redistributable installed successfully'; \
Remove-Item 'vc_redist.x64.exe'"

COPY ${ANALYZER_CONF_FILE} /app/${ANALYZER_CONF_FILE}
  - `low_score_entity_names`: A list of entity types to apply the low confidence score multiplier to.

- The [default conf file](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/default.yaml) is read during the default initialization of the `AnalyzerEngine`. Alternatively, the path to a custom configuration file can be passed to the `NlpEngineProvider`:
+ To load NLP settings from a configuration file, use `NlpEngineProvider` with a config file such as the [analyzer configuration file](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/analyzer.yaml). Alternatively, NLP configuration can be passed directly:
Collaborator

@omri374 omri374 left a comment

Thanks! There's one issue with the recognizer registry currently reading default_recognizers.yaml, but aside from this it looks great.
We should think about what this would look like once it is deprecated completely and not just a warning: does it break? Do we have a fallback mechanism? Not for this PR, of course.

):
    warnings.warn(
        f"Configuration file '{conf_file}' uses the deprecated "
        f"partial-configuration format (only supported_languages and "
Collaborator

It's not clear what "partial-configuration format" means. Is the warning about using deprecated files, or is it aimed at something more specific here?

Contributor Author

Indeed, it is a warning about using deprecated files.

# Load defaults if needed (no config provided,
# or registry_configuration is incomplete)
if use_defaults:
    with open(RecognizerConfigurationLoader._get_full_conf_path()) as file:
Collaborator

This still uses the "default_recognizers.yaml" file. Do we want to update this too?

Contributor Author

@SharonHart SharonHart May 4, 2026

Since this relates to recognizers_loader, I think we can keep it until the full deprecation. At that point, to answer your question, I think we should fail when given a non-unified config.
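
A minimal sketch of what that later hard failure might look like, reusing the deprecated top-level keys that `from_yaml` already checks for (`nlp_engine_name`, `models`). The function name `check_config_format` and the `strict` flag are hypothetical and not part of this PR; the message text is shortened for illustration.

```python
import warnings
from typing import Any, Dict

# Same keys that from_yaml currently treats as the deprecated NLP-only format.
DEPRECATED_TOP_KEYS = {"nlp_engine_name", "models"}


def check_config_format(raw: Dict[str, Any], strict: bool = False) -> None:
    """Warn about (or, when strict, reject) a non-unified configuration."""
    found = DEPRECATED_TOP_KEYS & set(raw)
    if not found:
        return
    message = (
        "Configuration uses the deprecated standalone NLP format "
        f"(found top-level keys: {sorted(found)})."
    )
    if strict:
        # Behaviour after full deprecation: refuse to load.
        raise ValueError(message)
    # Behaviour today: keep loading, but tell the user to migrate.
    warnings.warn(message, DeprecationWarning, stacklevel=2)
```

Flipping the default of such a flag in a later release would turn the warning into a failure without touching the detection logic.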
