@klockla klockla commented Nov 17, 2025

This PR adds a bolt that removes Personally Identifiable Information (PII).
The PiiBolt is meant to be used with a class that implements PiiInterface and provides the actual PII redaction logic.

This PR also implements the PresidioRedactor class, which uses Microsoft Presidio
(https://microsoft.github.io/presidio/) as a PII back-end.
It can be configured for different PII entities (names, phone numbers, locations, etc.) and different languages, depending on how you deployed the back-end.
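
To make the split between the bolt and the redaction back-end concrete, here is a minimal sketch. The interface name `PiiRedactorSketch`, its single `redact` method, and the regex-based implementation are illustrative assumptions, not the actual PiiInterface from this PR; a back-end such as Presidio would call an external service rather than apply a regex.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for the abstraction; the real PiiInterface may differ.
interface PiiRedactorSketch {
    /** Returns the text with PII replaced by placeholders. */
    String redact(String text);
}

// Toy implementation that only masks e-mail addresses with a simple regex.
final class EmailRedactorSketch implements PiiRedactorSketch {
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    @Override
    public String redact(String text) {
        // Every match is replaced by a fixed placeholder token.
        return EMAIL.matcher(text).replaceAll("<EMAIL>");
    }
}
```

A bolt written against the interface would not need to know which implementation it was given, which is what would allow other engines to be plugged in later.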

@klockla klockla marked this pull request as ready for review November 17, 2025 16:43

@rzo1 rzo1 left a comment

Hi @klockla,

Thanks for the PR and for proposing an abstraction for PII removal during crawls. I’ve added a few comments.

I also have a couple of questions:

What was the reasoning behind choosing a bolt instead of a (parse) filter for the redaction step?

Since this is a larger contribution, we’ll likely need an ICLA before we can accept it.

// Default value for language metadata field
private String languageFieldName = "parse.lang";

OutputCollector _collector;
Contributor

Why package private?

Contributor

and why in core?

Contributor Author

> and why in core?

My idea was to have the abstraction layer in core and the specific implementation(s) in external. Please let me know where to move it.

Contributor Author

> Why package private?

This is the default value of the metadata field that may contain the language. It can be configured through "pii.language.field" in the topology's configuration.

I see this as being at the same level as, for instance, queueMode in SimpleFetcherBolt or emitOutlinks in JSoupParserBolt, so why shouldn't it be private?
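
For readers unfamiliar with the pattern, the lookup being described can be sketched as below. The key `pii.language.field` and the default `parse.lang` come from this thread; the class name and the use of a plain `Map` instead of the topology's configuration helpers are simplifying assumptions.

```java
import java.util.Map;

// Hypothetical helper; not part of the PR.
final class LanguageFieldConfigSketch {
    static final String KEY = "pii.language.field";
    static final String DEFAULT_FIELD = "parse.lang";

    /** Returns the configured metadata field name, or the default if unset. */
    static String languageFieldName(Map<String, Object> conf) {
        Object value = conf.get(KEY);
        return value == null ? DEFAULT_FIELD : value.toString();
    }
}
```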

Contributor

I was looking at OutputCollector _collector; :)

Contributor Author

OK, I will change it to protected.

Contributor

@klockla, it can actually stay in core, since it will be extended by classes in modules.

LOG.info("No text to process for URL: {}", url);
metadata.addValue("pii.processed", "false");
// Force the binary content to a dummy content
emitTuple(input, url, REDACTED_BYTES, metadata, "");
Contributor

I'm unsure whether we should default to redacted HTML here when the original content was empty. Why not just return, as is done when PII redaction is disabled? Is there a reason for this?

piiRedacter.redact(text);

if (redacted == null) {
throw new Exception("PII Redacter returned null");
Contributor

Is this needed? Shouldn't we fall back or default to something instead of raising a hard exception here, which triggers a retry in the topology?
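
One possible shape for such a fallback is sketched below, under the assumption that a failed redaction should never let the original text through. The class name, method name, and placeholder value are hypothetical, and setting `pii.processed` to `true` on success is also an assumption; only the `pii.processed` key with the value `false` appears in the snippet under review.

```java
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical helper illustrating a fail-closed fallback; not part of the PR.
final class RedactionFallbackSketch {
    static final String PLACEHOLDER = "[REDACTION FAILED]";

    /**
     * Applies the redactor. If it throws or returns null, the tuple is flagged
     * as unprocessed and a placeholder is returned instead of the original text.
     */
    static String redactOrFallback(
            UnaryOperator<String> redactor, String text, Map<String, String> metadata) {
        String redacted;
        try {
            redacted = redactor.apply(text);
        } catch (RuntimeException e) {
            redacted = null; // handled the same way as a null result
        }
        if (redacted == null) {
            metadata.put("pii.processed", "false");
            return PLACEHOLDER;
        }
        metadata.put("pii.processed", "true");
        return redacted;
    }
}
```

Whether to emit a placeholder, drop the tuple, or fail and retry is a policy decision; the sketch only shows that the null case can be handled without throwing.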


private List<String> supportedLanguages = Arrays.asList("en", "fr", "de", "xx");

private OkHttpClient httpClient = new OkHttpClient();
Contributor

Should we make some of these properties configurable, or reuse existing configuration (e.g. user agent, timeouts, or the like)?

Contributor Author

I don't think a user agent is needed, as the client is only used for calls to the Presidio REST API.
If you see an existing timeout property I could reuse, please let me know.
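
If dedicated timeouts were introduced instead, the wiring could look roughly like this. The keys `pii.presidio.connect.timeout.ms` and `pii.presidio.read.timeout.ms` and their defaults are invented for illustration and do not exist in the project; the resulting `Duration` values could then be passed to `OkHttpClient.Builder#connectTimeout(Duration)` and `readTimeout(Duration)`.

```java
import java.time.Duration;
import java.util.Map;

// Hypothetical holder for client timeouts read from a configuration map.
final class PresidioTimeoutsSketch {
    final Duration connectTimeout;
    final Duration readTimeout;

    PresidioTimeoutsSketch(Map<String, Object> conf) {
        // Both keys and both default values are assumptions.
        connectTimeout = millis(conf, "pii.presidio.connect.timeout.ms", 5_000L);
        readTimeout = millis(conf, "pii.presidio.read.timeout.ms", 30_000L);
    }

    /** Reads a numeric value in milliseconds, falling back to the default. */
    private static Duration millis(Map<String, Object> conf, String key, long defaultMs) {
        Object value = conf.get(key);
        long ms = (value instanceof Number) ? ((Number) value).longValue() : defaultMs;
        return Duration.ofMillis(ms);
    }
}
```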

@rzo1 rzo1 requested review from jnioche and sigee November 17, 2025 20:41
@rzo1 rzo1 added this to the 3.5.1 milestone Nov 17, 2025
@rzo1 rzo1 added the enhancement, external, and java labels Nov 17, 2025

rzo1 commented Nov 17, 2025

In addition, see build output.

klockla commented Nov 18, 2025

> Hi @klockla,
>
> Thanks for the PR and for proposing an abstraction for PII removal during crawls. I’ve added a few comments.
>
> I also have a couple of questions:
>
> What was the reasoning behind choosing a bolt instead of a (parse) filter for the redaction step?
>
> Since this is a larger contribution, we’ll likely need an ICLA before we can accept it.

Hi @rzo1

I didn't really think about implementing it as a parse filter, but as the process (text analysis by the NLP engine in Presidio) is quite costly, I think it may be better to have this in a separate bolt to get better measurements of tuple processing time/latency.

I will fix the points related to your other comments and will need to have a look at this ICLA thing.

jnioche commented Nov 20, 2025

> I didn't really think about implementing it as a parse filter, but as the process (text analysis by the NLP engine in Presidio) is quite costly, I think it may be better to have this in a separate bolt to get better measurements of tuple processing time/latency.

I haven't looked at the details yet but I agree with @rzo1 that this feels like it should be a ParseFilter.

We log processing times in ParseFilters but could extend that to send proper metrics.
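
Whichever component ends up hosting the redaction, the measurement itself is simple to sketch. The wrapper below is a generic illustration with hypothetical names; it does not use the project's metrics API, and the elapsed value would still need to be reported to whatever metrics system the topology uses.

```java
import java.util.function.UnaryOperator;

// Hypothetical timing wrapper around a text-processing step.
final class TimedRedactionSketch {
    /** Result of one timed call: the output plus the elapsed time. */
    static final class Timed {
        final String output;
        final long elapsedNanos;

        Timed(String output, long elapsedNanos) {
            this.output = output;
            this.elapsedNanos = elapsedNanos;
        }
    }

    /** Runs the step once and records how long it took. */
    static Timed time(UnaryOperator<String> step, String input) {
        long start = System.nanoTime();
        String output = step.apply(input);
        return new Timed(output, System.nanoTime() - start);
    }
}
```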

rzo1 commented Nov 20, 2025

@jzonthemtn Could you take a look, particularly regarding how well the abstraction in "core" interoperates with other PII-redaction engines (such as Philter)?

jnioche commented Nov 25, 2025

> @jzonthemtn Could you take a look, particularly regarding how well the abstraction in "core" interoperates with other PII-redaction engines (such as Philter)?

https://github.com/philterd/philter

rzo1 commented Nov 25, 2025

I reached out to @jzonthemtn via ASF Slack. The essence of the reply was:

> I think it could work [...]

So it would be possible to integrate another implementation in the long run.

rzo1 commented Nov 25, 2025

> I haven't looked at the details yet but I agree with @rzo1 that this feels like it should be a ParseFilter.

I thought a bit about it (over the last weekend) and I think implementing this as a separate bolt would be ok, as it allows the NLP processing to run as an independent step, enabling better isolation, scaling, and more precise measurement of tuple processing time without impacting the parsing bolt (at all). This also allows redaction to be treated as an isolated use case, separate from the core parsing logic. So I would be ok with a bolt in that case.
