fix(analyzer): honour language_model_params in BasicLangExtractRecognizer #1943
BasicLangExtractRecognizer loaded langextract.model.provider.language_model_params
from the yaml into self._language_model_params and forwarded it to
lx.extract(..., language_model_params=...), but langextract.extract() ignores
its language_model_params argument when a pre-built ModelConfig is passed via
config= — it takes the elif config: branch (extraction.py:240) and only the
else: branch at line 255 reads language_model_params. Because
BasicLangExtractRecognizer._get_provider_params() always returns
{"config": ModelConfig(...)}, values like timeout and num_ctx were silently
dropped and Ollama fell back to langextract's 120s default regardless of
what the yaml said.
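The branch behaviour is easier to see as code. Below is a minimal paraphrase of the dispatch described above; the function name and return shape are illustrative, not langextract's real signatures:

```python
# Illustrative paraphrase of the dispatch described above, not langextract's
# actual source: it only shows why a pre-built config wins over
# language_model_params.
def resolve_model(config=None, model_id=None, language_model_params=None):
    if config:
        # BasicLangExtractRecognizer always lands here, because it passes
        # config=ModelConfig(...); language_model_params is never consulted.
        return config
    # Only when no pre-built config is supplied do the extra params get
    # folded into the provider configuration.
    return {"model_id": model_id, **(language_model_params or {})}
```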
Merge _language_model_params into provider_kwargs before building the
ModelConfig so the values actually reach OllamaLanguageModel(**provider_kwargs).
setdefault() preserves precedence for explicit provider.kwargs: entries, so
configs that already work under kwargs: continue to work unchanged.
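A minimal sketch of that merge, assuming the attribute names used elsewhere in this PR (`self._language_model_params`, `self.provider_kwargs`) and that `_get_provider_params()` is where the `ModelConfig` is built; it is not the exact diff:

```python
def _get_provider_params(self) -> dict:
    # Fold language_model_params into provider_kwargs before the ModelConfig
    # is built. setdefault() keeps explicit provider.kwargs entries winning
    # over language_model_params values of the same name.
    for key, value in self._language_model_params.items():
        self.provider_kwargs.setdefault(key, value)
    return {
        # Constructor arguments assumed from the attributes asserted in the tests.
        "config": lx_factory.ModelConfig(
            model_id=self.model_id,
            provider=self.provider,
            provider_kwargs=self.provider_kwargs,
        )
    }
```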
Adds two regression tests and strengthens an existing assertion to cover
ModelConfig.provider_kwargs (not just the separate language_model_params
argument that langextract ignores in this branch).
AzureOpenAILangExtractRecognizer is unaffected because its
_get_provider_params() returns {"model_id": ..., "language_model_params": ...}
with no config= key, so lx.extract() takes the else: branch where
language_model_params is applied.
Fixes microsoft#1942
Pull request overview
Fixes BasicLangExtractRecognizer so values configured under langextract.model.provider.language_model_params (e.g., timeout, num_ctx) are applied at runtime by ensuring they reach ModelConfig.provider_kwargs, which is what the langextract.extract(config=...) path actually uses.
Changes:
- Merge `language_model_params` into `provider_kwargs` (with `kwargs` taking precedence) before constructing `lx_factory.ModelConfig`.
- Strengthen and add regression tests to assert `timeout`/`num_ctx` arrive on `ModelConfig.provider_kwargs`.
- Document the behavior fix in `CHANGELOG.md`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| presidio-analyzer/presidio_analyzer/predefined_recognizers/third_party/basic_langextract_recognizer.py | Ensures language_model_params are surfaced via ModelConfig.provider_kwargs so providers (e.g., Ollama) receive them. |
| presidio-analyzer/tests/test_basic_langextract_recognizer.py | Adds/strengthens regression tests verifying params reach ModelConfig.provider_kwargs and precedence rules. |
| CHANGELOG.md | Records the fix under Unreleased → Analyzer → Fixed. |
```diff
  self.model_id = model_config.get("model_id")
  self.provider = provider_config.get("name")
- self.provider_kwargs = provider_config.get("kwargs", {})
+ self.provider_kwargs = dict(provider_config.get("kwargs", {}))
```
dict(provider_config.get("kwargs", {})) will raise a TypeError if the YAML contains kwargs: null (or an empty kwargs: key). Consider using dict(provider_config.get("kwargs") or {}) (or otherwise normalizing/validating the value) so missing/empty kwargs are treated as an empty mapping and the error message remains actionable for users.
Suggested change:

```diff
- self.provider_kwargs = dict(provider_config.get("kwargs", {}))
+ self.provider_kwargs = dict(provider_config.get("kwargs") or {})
```
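For reference, the failure mode and the suggested normalization side by side (plain Python, no Presidio needed):

```python
provider_config = {"kwargs": None}  # what `kwargs:` or `kwargs: null` parses to

# dict(provider_config.get("kwargs", {}))  -> TypeError: 'NoneType' object is not
#                                             iterable, because the key exists and
#                                             get() returns None, not the default
dict(provider_config.get("kwargs") or {})   # -> {} for both null and missing keys
```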
RonShakutai left a comment:
LGTM,
Great PR!!
Left minor comments.
```diff
  self.model_id = model_config.get("model_id")
  self.provider = provider_config.get("name")
- self.provider_kwargs = provider_config.get("kwargs", {})
+ self.provider_kwargs = dict(provider_config.get("kwargs", {}))
  self._extract_params.update(provider_config.get("extract_params", {}))
  self._language_model_params.update(
      provider_config.get("language_model_params", {})
  )
```
.update(None) crashes if the YAML key is present but empty, so we should add the or {} like Copilot suggested above.
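The same normalization applies to the update() calls shown above; a quick standalone illustration:

```python
provider_config = {"language_model_params": None}  # key present but empty in the yaml

params = {}
# params.update(provider_config.get("language_model_params", {}))
#   -> TypeError: 'NoneType' object is not iterable
params.update(provider_config.get("language_model_params") or {})  # safe: stays {}
```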
```diff
@@ -574,3 +574,68 @@ def test_when_analyze_called_then_params_passed_to_langextract(self, tmp_path):
         assert call_kwargs["config"].provider == "ollama"
```
Should we add a test where kwargs: null is set in the yaml, and check that loading it doesn't crash?
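A rough sketch of what such a test could look like. The yaml keys follow the structure referenced in this PR, but the import path, constructor argument, and final assertion are assumptions, not the recognizer's verified API:

```python
from presidio_analyzer.predefined_recognizers import BasicLangExtractRecognizer  # assumed import path


def test_kwargs_null_in_yaml_does_not_crash(tmp_path):
    # Hypothetical regression test: kwargs and language_model_params present but null.
    config_file = tmp_path / "langextract_config.yaml"
    config_file.write_text(
        "langextract:\n"
        "  model:\n"
        "    model_id: some-model\n"
        "    provider:\n"
        "      name: ollama\n"
        "      kwargs: null\n"
        "      language_model_params: null\n"
    )
    recognizer = BasicLangExtractRecognizer(config_path=str(config_file))  # assumed kwarg
    assert recognizer.provider_kwargs == {}
```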
Hi @lsternlicht, would you be interested in merging this PR? There are a few minor changes to make.
Summary
Fixes #1942.
`BasicLangExtractRecognizer` loads `langextract.model.provider.language_model_params` from the yaml into `self._language_model_params` and forwards it to `lx.extract(..., language_model_params=...)`. But `langextract/extraction.py` ignores the `language_model_params` argument when a pre-built `ModelConfig` is passed via `config=` — it takes the `elif config:` branch at line 240, and only the `else:` branch at line 255 reads `language_model_params`. Because `BasicLangExtractRecognizer._get_provider_params()` always returns `{"config": ModelConfig(...)}`, values like `timeout` and `num_ctx` were silently dropped and Ollama fell back to langextract's 120 s default regardless of what the yaml said.

This reliably breaks any Ollama model whose first inference takes longer than 120 s, failing with a confusing error even when the user explicitly set `timeout: 1800` in the yaml. The default bundled config at `presidio-analyzer/presidio_analyzer/conf/langextract_config_basic.yaml` is also affected — its `timeout: 240` and `num_ctx: 8192` have no effect at runtime.

`AzureOpenAILangExtractRecognizer` is not affected, because its `_get_provider_params()` returns `{"model_id": ..., "language_model_params": ...}` with no `config=` key, so `lx.extract()` takes the `else:` branch where `language_model_params` is actually applied.

Fix
Merge `self._language_model_params` into `self.provider_kwargs` before building `lx_factory.ModelConfig` in `basic_langextract_recognizer.py`. `setdefault` preserves precedence for explicit `provider.kwargs:` entries, so configs that already work by placing `timeout` under `kwargs:` (as a workaround) continue to work unchanged.
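A toy illustration of that precedence rule; the 240/8192 values are the bundled defaults mentioned above, the 600 is made up:

```python
provider_kwargs = {"timeout": 600}                          # explicit provider.kwargs entry
language_model_params = {"timeout": 240, "num_ctx": 8192}   # from language_model_params

for key, value in language_model_params.items():
    provider_kwargs.setdefault(key, value)

# The explicit kwargs value wins; only the missing key is filled in.
assert provider_kwargs == {"timeout": 600, "num_ctx": 8192}
```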
Tests

Before the fix, existing tests were green while the behaviour was broken: `test_basic_langextract_recognizer.py:568-570` only asserted that `language_model_params` was passed to `lx.extract()`, but never that it reached the `ModelConfig` or the provider.

This PR:
- Strengthens `test_when_analyze_called_then_params_passed_to_langextract` to also assert that `timeout` and `num_ctx` land on `call_kwargs["config"].provider_kwargs`.
- Adds `test_language_model_params_reach_provider_kwargs` — a direct regression test for the reported behaviour, without going through the full `analyze()` flow.
- Adds `test_provider_kwargs_take_precedence_over_language_model_params` — asserts backwards compatibility for configs that already placed values under `kwargs:`.

All 83 LangExtract-related tests pass locally.
Test plan
- All `test_basic_langextract_recognizer.py` tests pass (21 tests).
- `ModelConfig.provider_kwargs` contains `timeout` and `num_ctx` from the yaml's `language_model_params` section.
- `provider.kwargs:` values win over `provider.language_model_params:` values of the same name.
- `gemma4:31b` run: with the fix, the configured `timeout: 3600` is passed to `requests.post(..., timeout=3600)` inside `langextract/providers/ollama.py`, replacing the hardcoded 120 s default.