
Conversation


@telackey telackey commented Dec 12, 2025

Change Description

Add a new, generic LangExtract-based recognizer class. The current implementations focus on Ollama or Azure support. This one instantiates an lx.ModelConfig from the YAML, so that it can specify different providers and custom configurations (e.g., developed using Ollama behind a LiteLLM OpenAI proxy).
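For illustration, such a provider-agnostic config might look like the sketch below. Only the model.model_id and model.provider.name keys are taken from the code quoted later in this review; every other key (including base_url) is a hypothetical example, not the recognizer's actual schema.

```yaml
# Hypothetical sketch of a provider-agnostic LangExtract config.
# Only model.model_id and model.provider.name appear in the reviewed
# code; the remaining keys are illustrative assumptions.
model:
  model_id: "gpt-4o-mini"
  provider:
    name: "openai"
    # e.g., point an OpenAI-compatible provider at a LiteLLM proxy
    base_url: "http://localhost:4000/v1"
```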

Issue reference

Fixes #XX

Checklist

  • I have reviewed the contribution guidelines
  • I have signed the CLA (if required)
  • My code includes unit tests
  • All unit tests and lint checks pass locally
  • My PR contains documentation updates / additions if required

@telackey telackey marked this pull request as ready for review December 12, 2025 21:10
@RonShakutai RonShakutai self-requested a review December 13, 2025 13:11
Collaborator

@RonShakutai RonShakutai left a comment


What a great PR!
The lx_factory.ModelConfig approach is elegant!

I left a few comments to finalize it.

- en
type: predefined
enabled: false
config_path: presidio-analyzer/presidio_analyzer/conf/langextract_config_ollama.yaml
Collaborator


Can we remove OllamaLangExtractRecognizer?

BasicLangExtractRecognizer already supports Ollama through provider configuration.
The dedicated Ollama recognizer seems redundant now.

Also, should we adjust the e2e tests as well? https://github.com/microsoft/presidio/blob/main/e2e-tests/tests/test_package_e2e_integration_flows.py#L68

model_config = self.config.get("model", {})
provider_config = model_config.get("provider", {})
self.model_id = model_config.get("model_id")
self.provider = provider_config.get("name")
Collaborator


Should we add validation here with descriptive error messages?
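A minimal sketch of what that validation could look like, assuming the config layout from the quoted code above. The helper name validate_model_config and the exact error wording are hypothetical, not part of the PR:

```python
def validate_model_config(config: dict) -> tuple:
    """Validate the 'model' section of a LangExtract recognizer config.

    Returns (model_id, provider_name) or raises ValueError with a
    descriptive message naming the missing field.
    """
    model_config = config.get("model", {})
    provider_config = model_config.get("provider", {})
    model_id = model_config.get("model_id")
    provider = provider_config.get("name")
    if not model_id:
        raise ValueError(
            "LangExtract config is missing required field 'model.model_id'"
        )
    if not provider:
        raise ValueError(
            "LangExtract config is missing required field 'model.provider.name'"
        )
    return model_id, provider
```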

config_path if config_path else str(self.DEFAULT_CONFIG_PATH)
)

super().__init__(
Collaborator


Should we support extract_params in BasicLangExtractRecognizer?
Those parameters are needed for different scenarios; for example, with Ollama, if we use a small LLM we would need max_char_buffer.

OllamaLangExtractRecognizer passes parameters like max_char_buffer, timeout, num_ctx, max_workers, language_model_params, and extraction_passes to the parent class, but BasicLangExtractRecognizer doesn't support these yet.

I have thought about something like this:

# Extract optional parameters from config
extract_params = {}
if "max_char_buffer" in model_config:
    extract_params["extract"] = {"max_char_buffer": model_config["max_char_buffer"]}

lang_model_params = {}
for key in ["timeout", "num_ctx"]:
    if key in model_config:
        lang_model_params[key] = model_config[key]
if lang_model_params:
    extract_params["language_model"] = lang_model_params

super().__init__(
    config_path=actual_config_path,
    name="Basic LangExtract PII",
    supported_language=supported_language,
    extract_params=extract_params or None
)
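The parameter-collection part of the suggestion above could also be a small standalone helper, which would keep the constructor short and be easy to unit-test. This is a sketch under the same key names as the suggested code; the helper name build_extract_params is hypothetical:

```python
from typing import Optional


def build_extract_params(model_config: dict) -> Optional[dict]:
    """Collect optional LangExtract tuning parameters from the YAML
    'model' section, returning None when nothing is configured."""
    extract_params = {}
    if "max_char_buffer" in model_config:
        extract_params["extract"] = {
            "max_char_buffer": model_config["max_char_buffer"]
        }
    lang_model_params = {
        key: model_config[key]
        for key in ("timeout", "num_ctx")
        if key in model_config
    }
    if lang_model_params:
        extract_params["language_model"] = lang_model_params
    return extract_params or None
```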

)

def _get_provider_params(self):
"""Return Azure OpenAI-specific params."""
Collaborator


Please fix the docstring.

provider_kwargs=self.provider_kwargs,
)

def _get_provider_params(self):
Collaborator


This method can also be removed from the parent and from the AzureOpenAILangExtractRecognizer... but keep the abstraction, I guess.

@telackey
Author

Good comments! I'm happy to make the requested changes, but it may need to be late this week or early next.

@RonShakutai
Collaborator

Good comments! I'm happy to make the requested changes, but it may need to be late this week or early next.

No worries, whenever you get a chance :)
Thanks again for your great contribution!

@SharonHart
Contributor

@telackey
This branch also needs rebasing to main, sorry for that

@RonShakutai
Collaborator

Good comments! I'm happy to make the requested changes, but it may need to be late this week or early next.

Hi :) @telackey
Are you still planning to complete this PR?

@telackey
Copy link
Author

telackey commented Jan 5, 2026

Yes, the holidays just intervened.

