feat: add negative_context support to reduce false positives in context-aware PII detection #1969
TheSabari07 wants to merge 17 commits into microsoft:main
Conversation
Hi @omri374, I worked on this as part of the discussion in #1686. This PR focuses specifically on adding negative context support to reduce false positives in rule-based detection. Would appreciate your thoughts on the approach. Also, I had a couple of quick questions:
| f"Got: {context_matching_mode}" | ||
| ) | ||
| self.context_matching_mode = context_matching_mode | ||
| self.negative_context_penalty = negative_context_penalty |
Can we add this to the parent ContextAwareEnhancer? It could serve other context enhancers too.
Okay @omri374, that makes sense. I'll move negative_context_penalty to the base ContextAwareEnhancer so it can be reused across other context enhancers as well.
omri374 left a comment:
This is a great start. Please add tests + update a recognizer to include negative context.
Pull request overview
Adds negative_context support to Presidio Analyzer’s context-aware scoring to reduce false positives by penalizing matches when “negative” keywords appear near an entity.
Changes:
- Extend recognizer configuration loading to read and pass negative_context (including per-language config handling).
- Add negative_context plumbing to EntityRecognizer/PatternRecognizer (init + serialization).
- Update LemmaContextAwareEnhancer to apply a configurable score penalty when negative context words appear near detected entities.
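As an illustration of how the feature might surface to users, here is a hypothetical recognizer configuration entry. The layout follows Presidio's existing YAML recognizer format, but the negative_context field and all values below are assumptions based on this PR's description, not a confirmed schema:

```yaml
recognizers:
  - name: "PhoneRecognizer"
    supported_entity: "PHONE_NUMBER"
    supported_language: "en"
    patterns:
      - name: "phone (medium)"
        regex: "\\b\\d{3}-\\d{4}\\b"
        score: 0.4
    context:
      - phone
      - call
    # Hypothetical: words suggesting the match is NOT a phone number
    negative_context:
      - invoice
      - order
```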
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| presidio-analyzer/presidio_analyzer/recognizer_registry/recognizers_loader_utils.py | Loads negative_context from config and filters it when recognizers don’t accept the argument. |
| presidio-analyzer/presidio_analyzer/pattern_recognizer.py | Adds negative_context to PatternRecognizer init and (de)serialization. |
| presidio-analyzer/presidio_analyzer/entity_recognizer.py | Adds negative_context to the base recognizer API and stores it on the instance. |
| presidio-analyzer/presidio_analyzer/context_aware_enhancers/lemma_context_aware_enhancer.py | Implements negative-context score penalty logic in the default context enhancer. |
```python
if negative_context_word != "":
    result.score -= self.negative_context_penalty
    result.score = max(result.score, ContextAwareEnhancer.MIN_SCORE)
    logger.debug("Applied negative context penalty for word '%s'", negative_context_word)
```
After applying the negative_context penalty, the RecognizerResult.analysis_explanation isn't updated (set_improved_score is only called after the positive boost). This makes the explainability fields (score/score_context_improvement) inconsistent with the final result.score; update the AnalysisExplanation to reflect the post-penalty score as well.
Suggested change:

```python
result.analysis_explanation.set_improved_score(result.score)
logger.debug(
    "Applied negative context penalty for word '%s'",
    negative_context_word,
)
```
@TheSabari07 please make sure you update the analysis explanation fields as well.
```python
),
"negative_context": RecognizerListLoader._get_recognizer_negative_context(
    recognizer=recognizer_conf
),
```
The added negative_context extraction call is on lines that exceed the configured Ruff line-length (88) (e.g., the 'negative_context' assignment in the returned dict). Please wrap these function calls to avoid E501 lint failures.
```python
# Apply negative context penalty if recognizer has negative_context defined
if recognizer.negative_context:
    negative_context_word = self._find_supportive_word_in_context(
        surrounding_words, recognizer.negative_context, self.context_matching_mode
    )
    if negative_context_word != "":
        result.score -= self.negative_context_penalty
        result.score = max(result.score, ContextAwareEnhancer.MIN_SCORE)
        logger.debug("Applied negative context penalty for word '%s'", negative_context_word)
```
negative_context introduces new scoring behavior (penalty application and clamping) but there are currently no unit tests covering negative_context in the test suite (no references found under presidio-analyzer/tests). Please add tests to validate: (1) penalty is applied when negative context appears in the window, (2) score is clamped at 0, (3) interaction with positive context (boost then penalty), and (4) backward compatibility when negative_context is unset.
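A self-contained sketch of the four requested test cases, using a stand-in helper for the enhancer's boost/penalty arithmetic (apply_context_adjustments is hypothetical, not Presidio's API; it only mirrors the clamping behavior described in this PR):

```python
MIN_SCORE = 0.0
MAX_SCORE = 1.0

def apply_context_adjustments(score, boost=0.0, penalty=0.0):
    """Mirror the enhancer logic: boost for positive context first,
    then subtract the negative-context penalty and clamp at MIN_SCORE."""
    if boost:
        score = min(score + boost, MAX_SCORE)
    if penalty:
        score = max(score - penalty, MIN_SCORE)
    return score

# (1) penalty is applied when negative context appears in the window
assert apply_context_adjustments(0.5, penalty=0.25) == 0.25
# (2) score is clamped at MIN_SCORE, never negative
assert apply_context_adjustments(0.1, penalty=0.25) == MIN_SCORE
# (3) interaction with positive context: boost applied, then penalty
assert apply_context_adjustments(0.5, boost=0.25, penalty=0.25) == 0.5
# (4) backward compatibility: no negative_context means no penalty
assert apply_context_adjustments(0.5) == 0.5
```

The real tests would exercise LemmaContextAwareEnhancer end-to-end, but the expected score arithmetic is the part sketched here.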
Thank you @omri374. I'll add unit tests to cover negative_context (penalty, edge cases, and backward compatibility) and also update an existing recognizer to explicitly include negative_context for validation.
Hi @omri374, I've made the changes suggested in the review. All tests are passing locally, and I've verified no regressions in existing analyzer tests. Please verify and let me know if any changes are needed.
Hi @omri374,
Apologies, will review shortly.
No issues @omri374, thanks for the update |
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 12 comments.
Comments suppressed due to low confidence (2)
presidio-analyzer/presidio_analyzer/analyzer_engine.py:165
negative_context is added to AnalyzerEngine.analyze, but the analyzer service (/analyze) builds its arguments from AnalyzerRequest (app.py), which currently doesn't parse/forward a negative_context field. As-is, REST clients can't use this feature; either wire it through the request object/endpoint or clarify that it's Python-only.
```python
def analyze(
    self,
    text: str,
    language: str,
    entities: Optional[List[str]] = None,
    correlation_id: Optional[str] = None,
    score_threshold: Optional[float] = None,
    return_decision_process: Optional[bool] = False,
    ad_hoc_recognizers: Optional[List[EntityRecognizer]] = None,
    context: Optional[List[str]] = None,
    negative_context: Optional[List[str]] = None,
    allow_list: Optional[List[str]] = None,
    allow_list_match: Optional[str] = "exact",
    regex_flags: Optional[int] = re.DOTALL | re.MULTILINE | re.IGNORECASE,
    nlp_artifacts: Optional[NlpArtifacts] = None,
) -> List[RecognizerResult]:
```
presidio-analyzer/presidio_analyzer/analyzer_engine.py:165
- There's no unit test exercising the new AnalyzerEngine.analyze(..., negative_context=...) parameter end-to-end (engine -> context enhancer). Adding a focused test would guard the public API behavior and ensure request-level negative context is applied correctly.
```python
nlp_artifacts = spacy_nlp_engine.process_text(text, "en")

results = recognizer.analyze(text, nlp_artifacts)
```

PatternRecognizer.analyze expects entities as the second argument; here nlp_artifacts is being passed positionally instead. Pass entities explicitly (and nlp_artifacts as a keyword) so the test exercises the real public signature and types.
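The failure mode flagged here can be reproduced with a minimal stand-in class (this is not Presidio's PatternRecognizer, just a stub with the same second-positional-parameter shape):

```python
class StubRecognizer:
    """Stub mimicking a signature like analyze(text, entities, nlp_artifacts=None)."""

    def analyze(self, text, entities, nlp_artifacts=None):
        # Return what each parameter actually received, for inspection.
        return {"entities": entities, "nlp_artifacts": nlp_artifacts}

recognizer = StubRecognizer()
nlp_artifacts = {"tokens": ["my", "phone"]}

# Buggy call: nlp_artifacts silently lands in the `entities` slot.
buggy = recognizer.analyze("my phone is 555-1234", nlp_artifacts)
assert buggy["entities"] is nlp_artifacts and buggy["nlp_artifacts"] is None

# Correct call: entities explicit, nlp_artifacts passed by keyword.
ok = recognizer.analyze(
    "my phone is 555-1234", ["PHONE_NUMBER"], nlp_artifacts=nlp_artifacts
)
assert ok["entities"] == ["PHONE_NUMBER"] and ok["nlp_artifacts"] is nlp_artifacts
```

Because Python accepts the buggy call without error, only a test asserting on the received values (or a type check) catches the mix-up.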
```python
    self,
    context_similarity_factor: float,
    min_score_with_context_similarity: float,
    context_prefix_count: int,
    context_suffix_count: int,
    negative_context_penalty: float = 0.3,
):
    self.context_similarity_factor = context_similarity_factor
    self.min_score_with_context_similarity = min_score_with_context_similarity
    self.context_prefix_count = context_prefix_count
    self.context_suffix_count = context_suffix_count
    self.negative_context_penalty = negative_context_penalty

@abstractmethod
def enhance_using_context(
    self,
    text: str,
    raw_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    recognizers: List[EntityRecognizer],
    context: Optional[List[str]] = None,
    negative_context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
```
This change adds a new negative_context keyword argument to ContextAwareEnhancer.enhance_using_context, and AnalyzerEngine now always passes it. Any user-provided custom enhancer implementing the old signature will raise TypeError; consider a backward-compatible call pattern (e.g., only pass negative_context if the enhancer accepts it) or clearly document this as a breaking change.
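One backward-compatible pattern would be to inspect the enhancer's signature and only forward negative_context when it is accepted. This is a sketch, not Presidio code; the two enhancer classes are hypothetical stand-ins for an old-signature and a new-signature implementation:

```python
import inspect

class OldEnhancer:
    # Pre-PR signature: no negative_context parameter.
    def enhance_using_context(self, text, raw_results, context=None):
        return "old"

class NewEnhancer:
    # Post-PR signature: accepts negative_context.
    def enhance_using_context(self, text, raw_results, context=None,
                              negative_context=None):
        return ("new", negative_context)

def call_enhancer(enhancer, text, raw_results, context, negative_context):
    """Forward negative_context only if the enhancer's signature accepts it."""
    params = inspect.signature(enhancer.enhance_using_context).parameters
    kwargs = {"context": context}
    if "negative_context" in params:
        kwargs["negative_context"] = negative_context
    return enhancer.enhance_using_context(text, raw_results, **kwargs)

# Old-style enhancers keep working; new ones receive the extra argument.
assert call_enhancer(OldEnhancer(), "t", [], None, ["no"]) == "old"
assert call_enhancer(NewEnhancer(), "t", [], None, ["no"]) == ("new", ["no"])
```

The inspect check runs once per call here; caching the result per enhancer class would avoid repeated signature parsing in a hot path.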
omri374 left a comment:
Thanks! Looks great, there are a few minor things here and there but it's mostly done.
Please consider adding this to the documentation, for example here: https://microsoft.github.io/presidio/tutorial/06_context/ or here
```python
patterns = patterns if patterns else self.PATTERNS
context = context if context else self.CONTEXT
negative_context = (
    negative_context if negative_context else self.NEGATIVE_CONTEXT
)
```
@TheSabari07 please change this to make sure a user can disable negative context by passing an empty list.
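The issue is that an empty list is falsy, so `negative_context if negative_context else self.NEGATIVE_CONTEXT` silently restores the default. A minimal sketch of the fix, using an explicit `is not None` check (NEGATIVE_CONTEXT here is a hypothetical class-level default):

```python
NEGATIVE_CONTEXT = ["fax", "extension"]  # hypothetical class-level default

def resolve_truthy(negative_context):
    # Current behavior: [] is falsy, so the default silently comes back.
    return negative_context if negative_context else NEGATIVE_CONTEXT

def resolve_none_check(negative_context):
    # Fixed behavior: only None falls back; [] explicitly disables the feature.
    return negative_context if negative_context is not None else NEGATIVE_CONTEXT

assert resolve_truthy([]) == NEGATIVE_CONTEXT        # user cannot disable
assert resolve_none_check([]) == []                  # explicit disable works
assert resolve_none_check(None) == NEGATIVE_CONTEXT  # default still applies
```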
```python
def analyze(
    self,
    text: str,
    language: str,
    entities: Optional[List[str]] = None,
    correlation_id: Optional[str] = None,
    score_threshold: Optional[float] = None,
    return_decision_process: Optional[bool] = False,
    ad_hoc_recognizers: Optional[List[EntityRecognizer]] = None,
    context: Optional[List[str]] = None,
    negative_context: Optional[List[str]] = None,
    allow_list: Optional[List[str]] = None,
    allow_list_match: Optional[str] = "exact",
    regex_flags: Optional[int] = re.DOTALL | re.MULTILINE | re.IGNORECASE,
```
@TheSabari07 please move it to the bottom of the parameter list
Hi @omri374, thanks for the detailed feedback.
Hi @omri374, all the suggested changes are done.
Thanks @TheSabari07, looks good, please fix the remaining Copilot issues. You are calling analyze(text, nlp_artifacts) but that's not working.
Okay @omri374, I will work on it and update you on the progress.
Change Description
Added support for negative_context in context-aware PII detection to reduce false positives.
Issue reference
Fixes #1686
Checklist