feat: add negative_context support to reduce false positives in context-aware PII detection #1969

Open
TheSabari07 wants to merge 17 commits into microsoft:main from TheSabari07:feature/negative-context

Conversation

@TheSabari07
Contributor

Change Description

Added support for negative_context in context-aware PII detection to reduce false positives.

  • Added support for negative_context in context-aware PII detection
  • Applied score penalty when negative context words appear near detected entities
  • Updated context enhancer to handle both positive and negative signals independently
  • Ensured compatibility with predefined recognizers by filtering unsupported arguments
  • Added tests to validate behavior, edge cases, and backward compatibility
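
The penalty step listed above can be sketched roughly as follows. This is a minimal standalone illustration, not the PR's code: the function name `apply_negative_context_penalty`, the plain lowercase word match, and the `MIN_SCORE = 0.0` floor are assumptions made for the example, while the 0.3 default mirrors the `negative_context_penalty` default that appears later in this thread.

```python
from typing import List

MIN_SCORE = 0.0  # assumed lower bound for a detection score


def apply_negative_context_penalty(
    score: float,
    surrounding_words: List[str],
    negative_context: List[str],
    penalty: float = 0.3,
) -> float:
    """Lower the score once if any negative context word is nearby."""
    nearby = {word.lower() for word in surrounding_words}
    if any(term.lower() in nearby for term in negative_context):
        score = max(score - penalty, MIN_SCORE)
    return score


# A pattern match that sits next to the word "order" loses confidence.
penalized = apply_negative_context_penalty(
    0.5, ["my", "order", "number", "is"], ["order", "invoice"]
)
# No negative word nearby: the score is returned unchanged.
unchanged = apply_negative_context_penalty(0.5, ["my", "ssn", "is"], ["order"])
```

The penalty is applied at most once per result here, regardless of how many negative words match; whether the PR does the same is not stated in the description.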

Issue reference

Fixes #1686

Checklist

  • I have reviewed the contribution guidelines
  • I have signed the CLA (if required)
  • My code includes unit tests
  • All unit tests and lint checks pass locally
  • My PR contains documentation updates / additions if required

@TheSabari07
Contributor Author

Hi @omri374

I worked on this as part of the discussion in #1686.

This PR focuses specifically on adding negative context support to reduce false positives in rule-based detection. I would appreciate your thoughts on the approach.

Also, I had a couple of quick questions:

  • Can I include the additional test file I used to verify the negative context behavior?
  • Is it okay if I start experimenting with this in the presidio-research repo to further validate the approach?

f"Got: {context_matching_mode}"
)
self.context_matching_mode = context_matching_mode
self.negative_context_penalty = negative_context_penalty
Collaborator

Can we add this to the parent ContextAwareEnhancer? it could serve other context enhancers too

Contributor Author

Okay @omri374

that makes sense.

I’ll move negative_context_penalty to the base ContextAwareEnhancer so it can be reused across other context enhancers as well.

Collaborator

@omri374 omri374 left a comment

This is a great start. Please add tests + update a recognizer to include negative context.

Contributor

Copilot AI left a comment

Pull request overview

Adds negative_context support to Presidio Analyzer’s context-aware scoring to reduce false positives by penalizing matches when “negative” keywords appear near an entity.

Changes:

  • Extend recognizer configuration loading to read and pass negative_context (including per-language config handling).
  • Add negative_context plumbing to EntityRecognizer/PatternRecognizer (init + serialization).
  • Update LemmaContextAwareEnhancer to apply a configurable score penalty when negative context words appear near detected entities.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| presidio-analyzer/presidio_analyzer/recognizer_registry/recognizers_loader_utils.py | Loads negative_context from config and filters it when recognizers don’t accept the argument. |
| presidio-analyzer/presidio_analyzer/pattern_recognizer.py | Adds negative_context to PatternRecognizer init and (de)serialization. |
| presidio-analyzer/presidio_analyzer/entity_recognizer.py | Adds negative_context to the base recognizer API and stores it on the instance. |
| presidio-analyzer/presidio_analyzer/context_aware_enhancers/lemma_context_aware_enhancer.py | Implements negative-context score penalty logic in the default context enhancer. |

if negative_context_word != "":
    result.score -= self.negative_context_penalty
    result.score = max(result.score, ContextAwareEnhancer.MIN_SCORE)
    logger.debug("Applied negative context penalty for word '%s'", negative_context_word)
Copilot AI Apr 13, 2026

After applying the negative_context penalty, the RecognizerResult.analysis_explanation isn't updated (set_improved_score is only called after the positive boost). This makes the explainability fields (score/score_context_improvement) inconsistent with the final result.score; update the AnalysisExplanation to reflect the post-penalty score as well.

Suggested change
- logger.debug("Applied negative context penalty for word '%s'", negative_context_word)
+ result.analysis_explanation.set_improved_score(result.score)
+ logger.debug(
+     "Applied negative context penalty for word '%s'",
+     negative_context_word,
+ )

Collaborator

@TheSabari07 please make sure you update the analysis explanation fields as well.
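
To illustrate the inconsistency being discussed, here is a small self-contained sketch. `ExplanationSketch` and `ResultSketch` are hypothetical stand-ins, not Presidio's `AnalysisExplanation` or `RecognizerResult`; the only behaviour taken from the thread is that `set_improved_score` should be called again after the penalty. That the helper also records the difference from the original score is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class ExplanationSketch:
    """Hypothetical stand-in for the explainability record."""

    original_score: float
    score: float = 0.0
    score_context_improvement: float = 0.0

    def set_improved_score(self, score: float) -> None:
        # Keep the reported score and its delta aligned with the result.
        self.score = score
        self.score_context_improvement = score - self.original_score


@dataclass
class ResultSketch:
    """Hypothetical stand-in for a recognizer result."""

    score: float
    analysis_explanation: ExplanationSketch


def penalize(result: ResultSketch, penalty: float = 0.3) -> ResultSketch:
    result.score = max(result.score - penalty, 0.0)
    # Without this call the explanation would still show the old score.
    result.analysis_explanation.set_improved_score(result.score)
    return result


result = penalize(
    ResultSketch(0.5, ExplanationSketch(original_score=0.5, score=0.5))
)
```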

Contributor Author

Sure @omri374

Comment on lines +127 to +134
),
"negative_context": RecognizerListLoader._get_recognizer_negative_context(
    recognizer=recognizer_conf
),
Copilot AI Apr 13, 2026

The added negative_context extraction call is on lines that exceed the configured Ruff line-length (88) (e.g., the 'negative_context' assignment in the returned dict). Please wrap these function calls to avoid E501 lint failures.

Comment on lines +162 to +170
# Apply negative context penalty if recognizer has negative_context defined
if recognizer.negative_context:
    negative_context_word = self._find_supportive_word_in_context(
        surrounding_words, recognizer.negative_context, self.context_matching_mode
    )
    if negative_context_word != "":
        result.score -= self.negative_context_penalty
        result.score = max(result.score, ContextAwareEnhancer.MIN_SCORE)
        logger.debug("Applied negative context penalty for word '%s'", negative_context_word)
Copilot AI Apr 13, 2026

negative_context introduces new scoring behavior (penalty application and clamping) but there are currently no unit tests covering negative_context in the test suite (no references found under presidio-analyzer/tests). Please add tests to validate: (1) penalty is applied when negative context appears in the window, (2) score is clamped at 0, (3) interaction with positive context (boost then penalty), and (4) backward compatibility when negative_context is unset.
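
The four requested cases could be sketched along these lines. `adjust_score` is a hypothetical stand-in for the enhancer (exact-word matching, boost first, then penalty), not the PR's implementation; the 0.35 boost is an illustrative value, and the 0.3 penalty follows the default shown elsewhere in this thread.

```python
from typing import List, Optional

MAX_SCORE = 1.0
MIN_SCORE = 0.0


def adjust_score(
    score: float,
    words: List[str],
    context: Optional[List[str]] = None,
    negative_context: Optional[List[str]] = None,
    boost: float = 0.35,
    penalty: float = 0.3,
) -> float:
    """Stand-in scorer: boost on positive words, then penalize on negative."""
    if context and set(context) & set(words):
        score = min(score + boost, MAX_SCORE)
    if negative_context and set(negative_context) & set(words):
        score = max(score - penalty, MIN_SCORE)
    return score


def test_penalty_applied_when_negative_word_in_window():
    assert adjust_score(0.5, ["order", "id"], negative_context=["order"]) < 0.5


def test_score_is_clamped_at_zero():
    assert adjust_score(0.1, ["order"], negative_context=["order"]) == MIN_SCORE


def test_boost_then_penalty():
    score = adjust_score(
        0.5, ["ssn", "order"], context=["ssn"], negative_context=["order"]
    )
    assert abs(score - 0.55) < 1e-9


def test_unset_negative_context_keeps_score():
    assert adjust_score(0.5, ["order"]) == 0.5
```

In the real suite these would go through the recognizer and enhancer rather than a stand-in function, but the assertions have the same shape.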

@TheSabari07
Contributor Author

> This is a great start. Please add tests + update a recognizer to include negative context.

Thank you @omri374

I’ll add unit tests to cover negative_context (penalty, edge cases, and backward compatibility) and also update an existing recognizer to explicitly include negative_context for validation.

@TheSabari07
Contributor Author

Hi @omri374

I’ve made the changes suggested in the review:

  • Updated negative_context handling to behave consistently with positive context (as discussed)
  • Moved negative_context support into EntityRecognizer level for cleaner design
  • Removed unnecessary filtering logic in recognizers_loader_utils.py as suggested
  • Ensured YAML/simple language configs don’t incorrectly extract context/negative_context
  • Aligned enhancer logic to avoid double boosting and correctly apply negative penalty independently
  • Updated tests to fully cover these changes and ensure backward compatibility

All tests are passing locally, and I’ve verified no regressions in existing analyzer tests.

Please verify and let me know if any changes are needed.

@TheSabari07
Contributor Author

Hi @omri374,
just a quick check on this PR. Happy to make any changes if needed.

@omri374
Collaborator

omri374 commented Apr 25, 2026

Apologies, will review shortly.

@TheSabari07
Contributor Author

> Apologies, will review shortly.

No issues @omri374, thanks for the update

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 12 comments.

Comments suppressed due to low confidence (2)

presidio-analyzer/presidio_analyzer/analyzer_engine.py:165

  • negative_context is added to AnalyzerEngine.analyze, but the analyzer service (/analyze) builds its arguments from AnalyzerRequest (app.py), which currently doesn't parse/forward a negative_context field. As-is, REST clients can't use this feature; either wire it through the request object/endpoint or clarify that it's Python-only.
    def analyze(
        self,
        text: str,
        language: str,
        entities: Optional[List[str]] = None,
        correlation_id: Optional[str] = None,
        score_threshold: Optional[float] = None,
        return_decision_process: Optional[bool] = False,
        ad_hoc_recognizers: Optional[List[EntityRecognizer]] = None,
        context: Optional[List[str]] = None,
        negative_context: Optional[List[str]] = None,
        allow_list: Optional[List[str]] = None,
        allow_list_match: Optional[str] = "exact",
        regex_flags: Optional[int] = re.DOTALL | re.MULTILINE | re.IGNORECASE,
        nlp_artifacts: Optional[NlpArtifacts] = None,
    ) -> List[RecognizerResult]:
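
If the REST route were chosen, the wiring might look roughly like the sketch below. `AnalyzerRequestSketch` is a hypothetical, trimmed-down stand-in for the service's request object, and the `negative_context` JSON field name is an assumption; the actual `AnalyzerRequest` in app.py is not shown in this thread.

```python
from typing import Any, Dict, List, Optional


class AnalyzerRequestSketch:
    """Hypothetical request wrapper that forwards optional context fields."""

    def __init__(self, req_data: Dict[str, Any]) -> None:
        self.text: Optional[str] = req_data.get("text")
        self.language: Optional[str] = req_data.get("language")
        self.context: Optional[List[str]] = req_data.get("context")
        # Assumed new field; None means the client did not send it.
        self.negative_context: Optional[List[str]] = req_data.get(
            "negative_context"
        )


payload = {
    "text": "Order 123-45-6789",
    "language": "en",
    "negative_context": ["order"],
}
request = AnalyzerRequestSketch(payload)
```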

presidio-analyzer/presidio_analyzer/analyzer_engine.py:165

  • There’s no unit test exercising the new AnalyzerEngine.analyze(..., negative_context=...) parameter end-to-end (engine -> context enhancer). Adding a focused test would guard the public API behavior and ensure request-level negative context is applied correctly.
    def analyze(
        self,
        text: str,
        language: str,
        entities: Optional[List[str]] = None,
        correlation_id: Optional[str] = None,
        score_threshold: Optional[float] = None,
        return_decision_process: Optional[bool] = False,
        ad_hoc_recognizers: Optional[List[EntityRecognizer]] = None,
        context: Optional[List[str]] = None,
        negative_context: Optional[List[str]] = None,
        allow_list: Optional[List[str]] = None,
        allow_list_match: Optional[str] = "exact",
        regex_flags: Optional[int] = re.DOTALL | re.MULTILINE | re.IGNORECASE,
        nlp_artifacts: Optional[NlpArtifacts] = None,
    ) -> List[RecognizerResult]:

Comment thread presidio-analyzer/tests/test_negative_context.py
Comment on lines +189 to +192
nlp_artifacts = spacy_nlp_engine.process_text(text, "en")

results = recognizer.analyze(text, nlp_artifacts)

Copilot AI Apr 26, 2026

PatternRecognizer.analyze expects entities as the second argument; here nlp_artifacts is being passed positionally instead. Pass entities explicitly (and nlp_artifacts as a keyword) so the test exercises the real public signature and types.

Collaborator

Please fix
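
The mismatch Copilot points out can be reproduced with a stub. `StubRecognizer` below is a hypothetical stand-in whose `analyze` mirrors the argument order described in the comment (text, then entities, then `nlp_artifacts`); it is not Presidio's `PatternRecognizer`, and the entity name is made up for the example.

```python
from typing import Any, List, Optional


class StubRecognizer:
    """Hypothetical recognizer with the (text, entities, nlp_artifacts) order."""

    def analyze(
        self,
        text: str,
        entities: List[str],
        nlp_artifacts: Optional[Any] = None,
    ) -> dict:
        # Echo the arguments back so the binding is visible.
        return {"entities": entities, "nlp_artifacts": nlp_artifacts}


recognizer = StubRecognizer()
artifacts = object()  # placeholder for the NLP engine output

# Positional call from the tests: the artifacts land in `entities`.
wrong = recognizer.analyze("some text", artifacts)

# Explicit call: each value reaches the parameter it was meant for.
right = recognizer.analyze(
    "some text", entities=["EXAMPLE_ENTITY"], nlp_artifacts=artifacts
)
```

With the positional form the recognizer silently runs without NLP artifacts, so a context-dependent test may not exercise the code path it is meant to cover.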

Comment thread presidio-analyzer/tests/test_negative_context.py
Comment on lines +320 to +323
nlp_artifacts = spacy_nlp_engine.process_text(text, "en")

results = recognizer.analyze(text, nlp_artifacts)
original_score = results[0].score
Copilot AI Apr 26, 2026

PatternRecognizer.analyze expects entities as the second argument; here nlp_artifacts is being passed positionally instead. Pass entities explicitly (and nlp_artifacts as a keyword) so the test exercises the real public signature and types.

Comment on lines +343 to +346
nlp_artifacts = spacy_nlp_engine.process_text(text, "en")

results = recognizer.analyze(text, nlp_artifacts)
original_score = results[0].score
Copilot AI Apr 26, 2026

PatternRecognizer.analyze expects entities as the second argument; here nlp_artifacts is being passed positionally instead. Pass entities explicitly (and nlp_artifacts as a keyword) so the test exercises the real public signature and types.

Comment on lines +165 to +168
nlp_artifacts = spacy_nlp_engine.process_text(text, "en")

results = recognizer.analyze(text, nlp_artifacts)
original_score = results[0].score
Copilot AI Apr 26, 2026

PatternRecognizer.analyze expects entities as the second argument; here nlp_artifacts is being passed positionally instead. Pass entities explicitly (and nlp_artifacts as a keyword) so the test exercises the real public signature and types.

Collaborator

Please fix

Comment on lines +216 to +219
nlp_artifacts = spacy_nlp_engine.process_text(text, "en")

results = recognizer.analyze(text, nlp_artifacts)
original_score = results[0].score
Copilot AI Apr 26, 2026

PatternRecognizer.analyze expects entities as the second argument; here nlp_artifacts is being passed positionally instead. Pass entities explicitly (and nlp_artifacts as a keyword) so the test exercises the real public signature and types.

Comment on lines +241 to +244
nlp_artifacts = spacy_nlp_engine.process_text(text, "en")

results = recognizer.analyze(text, nlp_artifacts)
original_score = results[0].score
Copilot AI Apr 26, 2026

PatternRecognizer.analyze expects entities as the second argument; here nlp_artifacts is being passed positionally instead. Pass entities explicitly (and nlp_artifacts as a keyword) so the test exercises the real public signature and types.

Comment on lines +292 to +295
nlp_artifacts = spacy_nlp_engine.process_text(text, "en")

results = recognizer.analyze(text, nlp_artifacts)
original_score = results[0].score
Copilot AI Apr 26, 2026

PatternRecognizer.analyze expects entities as the second argument; here nlp_artifacts is being passed positionally instead. Pass entities explicitly (and nlp_artifacts as a keyword) so the test exercises the real public signature and types.

Comment on lines 31 to 53
        self,
        context_similarity_factor: float,
        min_score_with_context_similarity: float,
        context_prefix_count: int,
        context_suffix_count: int,
        negative_context_penalty: float = 0.3,
    ):
        self.context_similarity_factor = context_similarity_factor
        self.min_score_with_context_similarity = min_score_with_context_similarity
        self.context_prefix_count = context_prefix_count
        self.context_suffix_count = context_suffix_count
        self.negative_context_penalty = negative_context_penalty

    @abstractmethod
    def enhance_using_context(
        self,
        text: str,
        raw_results: List[RecognizerResult],
        nlp_artifacts: NlpArtifacts,
        recognizers: List[EntityRecognizer],
        context: Optional[List[str]] = None,
        negative_context: Optional[List[str]] = None,
    ) -> List[RecognizerResult]:
Copilot AI Apr 26, 2026

This change adds a new negative_context keyword argument to ContextAwareEnhancer.enhance_using_context, and AnalyzerEngine now always passes it. Any user-provided custom enhancer implementing the old signature will raise TypeError; consider a backward-compatible call pattern (e.g., only pass negative_context if the enhancer accepts it) or clearly document this as a breaking change.
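
One way to get the backward-compatible call pattern mentioned here is to inspect the enhancer's signature before forwarding the new argument. The sketch below uses plain functions as stand-ins for enhancer methods; `call_enhancer`, `old_style`, and `new_style` are hypothetical names, and this is not necessarily how `AnalyzerEngine` would implement it.

```python
import inspect
from typing import Any, Callable, Dict, List, Optional


def call_enhancer(
    enhance: Callable[..., Any],
    negative_context: Optional[List[str]] = None,
    **kwargs: Any,
) -> Any:
    """Pass negative_context only when the callable can receive it."""
    params = inspect.signature(enhance).parameters
    accepts_it = "negative_context" in params or any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if accepts_it:
        kwargs["negative_context"] = negative_context
    return enhance(**kwargs)


def old_style(text: str, context: Optional[List[str]] = None) -> Dict[str, Any]:
    # Custom enhancer written against the previous signature.
    return {"text": text, "negative_context": "not supported"}


def new_style(
    text: str,
    context: Optional[List[str]] = None,
    negative_context: Optional[List[str]] = None,
) -> Dict[str, Any]:
    return {"text": text, "negative_context": negative_context}


old_result = call_enhancer(old_style, negative_context=["order"], text="abc")
new_result = call_enhancer(new_style, negative_context=["order"], text="abc")
```

The trade-off is that an old-style enhancer silently ignores request-level negative context, which may itself deserve a log message or a documentation note.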

Collaborator

@omri374 omri374 left a comment

Thanks! Looks great, there are a few minor things here and there but it's mostly done.
Please consider adding this to the documentation, for example here: https://microsoft.github.io/presidio/tutorial/06_context/ or here

Comment on lines +51 to +54
patterns = patterns if patterns else self.PATTERNS
context = context if context else self.CONTEXT
negative_context = (
    negative_context if negative_context else self.NEGATIVE_CONTEXT
Collaborator

@TheSabari07 please change this to make sure a user can disable negative context by passing an empty list.
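
The difference between the truthiness check in the hunk and an explicit `None` check can be shown in isolation. The function names and the `DEFAULT_NEGATIVE_CONTEXT` list below are illustrative only, standing in for the recognizer's `NEGATIVE_CONTEXT` class attribute.

```python
from typing import List, Optional

DEFAULT_NEGATIVE_CONTEXT = ["order", "invoice"]  # illustrative class default


def resolve_truthy(negative_context: Optional[List[str]]) -> List[str]:
    # Pattern from the hunk: an empty list is falsy, so the default wins.
    return negative_context if negative_context else DEFAULT_NEGATIVE_CONTEXT


def resolve_explicit(negative_context: Optional[List[str]]) -> List[str]:
    # Only None falls back to the default; [] is kept and disables the check.
    return (
        negative_context
        if negative_context is not None
        else DEFAULT_NEGATIVE_CONTEXT
    )
```

With `resolve_truthy`, passing `[]` quietly re-enables the defaults; with `resolve_explicit`, `[]` stays empty and only an omitted argument picks up the defaults.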

Comment on lines 150 to 163
    def analyze(
        self,
        text: str,
        language: str,
        entities: Optional[List[str]] = None,
        correlation_id: Optional[str] = None,
        score_threshold: Optional[float] = None,
        return_decision_process: Optional[bool] = False,
        ad_hoc_recognizers: Optional[List[EntityRecognizer]] = None,
        context: Optional[List[str]] = None,
        negative_context: Optional[List[str]] = None,
        allow_list: Optional[List[str]] = None,
        allow_list_match: Optional[str] = "exact",
        regex_flags: Optional[int] = re.DOTALL | re.MULTILINE | re.IGNORECASE,
Collaborator

@TheSabari07 please move it to the bottom of the parameter list

Comment thread presidio-analyzer/tests/test_negative_context.py
Comment on lines +165 to +168
nlp_artifacts = spacy_nlp_engine.process_text(text, "en")

results = recognizer.analyze(text, nlp_artifacts)
original_score = results[0].score
Collaborator

Please fix

Comment on lines +189 to +192
nlp_artifacts = spacy_nlp_engine.process_text(text, "en")

results = recognizer.analyze(text, nlp_artifacts)

Collaborator

Please fix

@TheSabari07
Contributor Author

Hi @omri374

Thanks for the detailed feedback.
I will work on the changes and post a progress update soon.

@TheSabari07
Contributor Author

Hi @omri374,

All the suggested changes are done.
Please verify and let me know if any further changes are needed.

@omri374
Collaborator

omri374 commented May 4, 2026

Thanks @TheSabari07, looks good, please fix the remaining copilot issues. You are calling analyze(text, nlp_artifacts) but that's not working.

@TheSabari07
Contributor Author

> Thanks @TheSabari07, looks good, please fix the remaining copilot issues. You are calling analyze(text, nlp_artifacts) but that's not working.

Okay @omri374, I will work on it and post an update on the progress.

Development

Successfully merging this pull request may close these issues.

Improved context awareness using ML - Embeddings, ML classifiers or other non-rule-based approaches

3 participants