Add a configurable LangExtract recognizer for use with any provider. #1815
base: main
Conversation
RonShakutai left a comment
What a great PR!
The lx_factory.ModelConfig approach is elegant!
Left a few comments to finalize it.
```yaml
- en
type: predefined
enabled: false
config_path: presidio-analyzer/presidio_analyzer/conf/langextract_config_ollama.yaml
```
Can we remove OllamaLangExtractRecognizer?
BasicLangExtractRecognizer already supports Ollama through provider configuration, as in the sketch below.
The dedicated Ollama recognizer seems redundant now.
Also, should we adjust the e2e tests as well? https://github.com/microsoft/presidio/blob/main/e2e-tests/tests/test_package_e2e_integration_flows.py#L68
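For instance, a provider-based config along these lines could select Ollama (a rough sketch only; the key names are inferred from the model/provider fields the recognizer reads, and the values are placeholders):

```yaml
# Rough sketch, not the PR's actual config file.
model:
  model_id: "gemma2:2b"   # any locally pulled Ollama model
  provider:
    name: "ollama"
```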
```python
model_config = self.config.get("model", {})
provider_config = model_config.get("provider", {})
self.model_id = model_config.get("model_id")
self.provider = provider_config.get("name")
```
Should we add validation here with descriptive error messages?
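For example, something along these lines (a sketch only; the exact error messages and placement are just suggestions):

```python
self.model_id = model_config.get("model_id")
self.provider = provider_config.get("name")

# Fail fast with a descriptive message instead of a later AttributeError.
if not self.model_id:
    raise ValueError(
        "LangExtract recognizer config is missing 'model.model_id'."
    )
if not self.provider:
    raise ValueError(
        "LangExtract recognizer config is missing 'model.provider.name'."
    )
```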
```python
    config_path if config_path else str(self.DEFAULT_CONFIG_PATH)
)

super().__init__(
```
Should we support extract_params in BasicLangExtractRecognizer?
Those parameters are needed for different scenarios; for example, with Ollama and a small LLM we would need max_char_buffer.
OllamaLangExtractRecognizer passes parameters like max_char_buffer, timeout, num_ctx, max_workers, language_model_params, and extraction_passes to the parent class, but BasicLangExtractRecognizer doesn't support these yet.
I thought about something like this:
```python
# Extract optional parameters from config
extract_params = {}
if "max_char_buffer" in model_config:
    extract_params["extract"] = {"max_char_buffer": model_config["max_char_buffer"]}

lang_model_params = {}
for key in ["timeout", "num_ctx"]:
    if key in model_config:
        lang_model_params[key] = model_config[key]
if lang_model_params:
    extract_params["language_model"] = lang_model_params

super().__init__(
    config_path=actual_config_path,
    name="Basic LangExtract PII",
    supported_language=supported_language,
    extract_params=extract_params or None,
)
```
```python
)

def _get_provider_params(self):
    """Return Azure OpenAI-specific params."""
```
Please fix the docstring.
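Presumably it should describe the class it actually lives in rather than Azure; e.g. (the wording here is just a guess):

```python
def _get_provider_params(self):
    """Return provider-specific params for this recognizer."""
```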
```python
    provider_kwargs=self.provider_kwargs,
)

def _get_provider_params(self):
```
This method can also be removed from the parent and from the AzureOpenAILangExtractRecognizer recognizer... but keep the abstraction, I guess.
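If it helps, one hypothetical way to keep the hook while letting subclasses drop their trivial overrides (names here are illustrative, not presidio's actual class hierarchy):

```python
class LangExtractRecognizerBase:
    """Illustrative base class, not the actual presidio hierarchy."""

    def _get_provider_params(self) -> dict:
        # Default hook: no provider-specific params; subclasses
        # override only when they genuinely need to.
        return {}
```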
Good comments! I'm happy to make the requested changes, but it may need to be late this week or early next.
No worries, whenever you get a chance :)
Hi :) @telackey |
Yes, the holidays just intervened. |
Change Description
Add a new, basic LangExtract-based recognizer class that is generic. The current implementations focus on Ollama or Azure support. This one instantiates an lx.ModelConfig from the YAML, so it can specify different providers and custom configurations (e.g., developed using Ollama via a LiteLLM OpenAI proxy).
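As an illustration, such a config could target an OpenAI-compatible endpoint like a LiteLLM proxy. This sketch is hypothetical: the key names are inferred from the model_id, provider name, and provider_kwargs values the PR's code reads, and the exact nesting, provider name, and URL are made up:

```yaml
# Illustrative only; the exact schema may differ from this PR.
model:
  model_id: "gpt-4o-mini"          # any model the endpoint serves
  provider:
    name: "openai"                 # OpenAI-compatible provider
  provider_kwargs:
    base_url: "http://localhost:4000/v1"   # e.g., a local LiteLLM proxy
```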
Issue reference
Fixes #XX
Checklist