Added PiiBolt and PresidioRedacter implementation #1728

klockla · 2025-11-17T16:42:29Z

This PR adds a bolt which allows the removal of Personally Identifiable Information (PII).
The PiiBolt is to be used with a class that implements the PiiInterface and provides the actual PII redaction.

This PR also implements the PresidioRedacter class, which uses Microsoft Presidio
( https://microsoft.github.io/presidio/ ) as a PII back-end.
It can be configured for different PII entities (names, phone numbers, locations, etc.) and different languages, depending on how you deployed the back-end.
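
To make the split between the bolt and the back-end concrete, here is a minimal sketch of what such an abstraction could look like. The names `TextRedacter` and `EmailRegexRedacter`, and the two-argument `redact` signature, are hypothetical and not taken from the PR; the regex back-end only stands in for a real engine such as Presidio.

```java
import java.util.regex.Pattern;

/** Hypothetical abstraction: the bolt would depend only on this interface. */
interface TextRedacter {
    /** Returns the text with PII replaced, or null if redaction was not possible. */
    String redact(String text, String language);
}

/** Toy back-end that masks e-mail addresses; it stands in for a real PII engine. */
final class EmailRegexRedacter implements TextRedacter {
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    @Override
    public String redact(String text, String language) {
        if (text == null) {
            return null;
        }
        // The language is ignored here; an NLP back-end would use it to select a model.
        return EMAIL.matcher(text).replaceAll("<EMAIL>");
    }
}
```

With this shape, a different engine could be swapped in by providing another implementation of the interface, without touching the bolt.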

rzo1

Hi @klockla,

Thanks for the PR and for proposing an abstraction for PII removal during crawls. I’ve added a few comments.

I also have a couple of questions:

What was the reasoning behind choosing a bolt instead of a (parse) filter for the redaction step?

Since this is a larger contribution, we’ll likely need an ICLA before we can accept it.

rzo1 · 2025-11-17T20:25:06Z

core/src/main/java/org/apache/stormcrawler/pii/PiiBolt.java

+	// Default value for language metadata field
+	private String languageFieldName = "parse.lang";
+
+	OutputCollector _collector;


Why package private?

and why in core?

> and why in core?

My idea was to have the abstraction layer in core and the specific implementation(s) in external. Please let me know where to move it.

> Why package private?

This is the default value of the metadata field that may contain the language. It can be configured through "pii.language.field" in the topology's configuration.
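
As a rough illustration of that lookup, the sketch below resolves the field name from a configuration map and falls back to the default. The key `pii.language.field` and the default `parse.lang` come from this thread; the class and method names are hypothetical, and a plain `Map` is used instead of the project's own configuration helpers to keep the example self-contained.

```java
import java.util.Map;

/** Hypothetical helper showing how the language field name could be resolved. */
final class PiiLanguageConfig {
    static final String LANGUAGE_FIELD_KEY = "pii.language.field";
    static final String DEFAULT_LANGUAGE_FIELD = "parse.lang";

    private PiiLanguageConfig() {}

    /** Returns the configured metadata field name, or the default when the key is unset. */
    static String languageField(Map<String, Object> conf) {
        Object value = conf.get(LANGUAGE_FIELD_KEY);
        return value == null ? DEFAULT_LANGUAGE_FIELD : value.toString();
    }
}
```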

I see this as being at the same level as, for instance, queueMode in SimpleFetcherBolt or emitOutlinks in JSoupParserBolt,
so why shouldn't it be private?

I was looking at OutputCollector _collector; :)

ok, will change it to protected

@klockla it can stay in core actually, since it will be extended by classes in modules

rzo1 · 2025-11-17T20:29:04Z

core/src/main/java/org/apache/stormcrawler/pii/PiiBolt.java

+			LOG.info("No text to process for URL: {}", url);
+			metadata.addValue("pii.processed", "false");
+			// Force the binary content to a dummy content
+			emitTuple(input, url, REDACTED_BYTES, metadata, "");


I am unsure whether we should default to redacted HTML here when the original content was empty. Why not just return, as is done when PII redaction is disabled? Is there a reason for this?

rzo1 · 2025-11-17T20:30:49Z

core/src/main/java/org/apache/stormcrawler/pii/PiiBolt.java

+					piiRedacter.redact(text);
+
+			if (redacted == null) {
+				throw new Exception("PII Redacter returned null");


Is this needed? Shouldn't we fall back or default to something instead of raising a hard exception here, which triggers a retry in the topology?
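
One possible shape for such a fallback is sketched below. It is only an illustration under assumptions: `SafeRedaction`, `Outcome`, and the empty-string placeholder are hypothetical names and choices, and the `processed` flag is meant to mirror the `pii.processed` metadata value already used in the bolt.

```java
import java.util.function.UnaryOperator;

/** Hypothetical helper: turns a failed redaction into a flagged fallback instead of an exception. */
final class SafeRedaction {
    /** Text emitted when redaction fails (an assumption; any safe placeholder would do). */
    static final String PLACEHOLDER = "";

    /** The text to emit and whether the redaction actually succeeded. */
    static final class Outcome {
        final String text;
        final boolean processed;

        Outcome(String text, boolean processed) {
            this.text = text;
            this.processed = processed;
        }
    }

    private SafeRedaction() {}

    static Outcome redactOrFallback(UnaryOperator<String> redacter, String text) {
        String redacted;
        try {
            redacted = redacter.apply(text);
        } catch (RuntimeException e) {
            // Treat a failing back-end like a null result.
            redacted = null;
        }
        if (redacted == null) {
            // Never emit the original text here: it may still contain PII.
            return new Outcome(PLACEHOLDER, false);
        }
        return new Outcome(redacted, true);
    }
}
```

The caller could then record the flag in the metadata and emit the tuple as usual, so that a back-end hiccup does not cause the tuple to be replayed.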

external/presidio/README.md

external/presidio/pom.xml

rzo1 · 2025-11-17T20:38:29Z

external/presidio/src/main/java/org/apache/stormcrawler/pii/PresidioRedacter.java

+
+	private List<String> supportedLanguages = Arrays.asList("en", "fr", "de", "xx");
+
+	private OkHttpClient httpClient = new OkHttpClient();


Should we make some of these properties configurable, or reuse existing configuration (e.g. user agent, timeouts, or the like)?

For the user agent, I don't think it is needed: the client is only used for calls to the Presidio REST API.
If you see a timeout property that I could reuse, please let me know.

See https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/protocol/okhttp/HttpProtocol.java#L140
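
A configurable timeout could be read along these lines. This is a sketch under assumptions: the key `presidio.timeout.ms` and the 10-second default are made up for illustration and are not existing StormCrawler settings. The resulting `Duration` could then be handed to the `OkHttpClient.Builder` timeout setters that accept a `Duration` in recent OkHttp versions.

```java
import java.time.Duration;
import java.util.Map;

/** Hypothetical settings holder for the HTTP client that talks to the PII back-end. */
final class PresidioClientSettings {
    static final String TIMEOUT_KEY = "presidio.timeout.ms"; // hypothetical key
    static final long DEFAULT_TIMEOUT_MS = 10_000L; // arbitrary default for this sketch

    private PresidioClientSettings() {}

    /** Reads the timeout from the configuration, accepting numbers or numeric strings. */
    static Duration timeout(Map<String, Object> conf) {
        Object value = conf.get(TIMEOUT_KEY);
        if (value == null) {
            return Duration.ofMillis(DEFAULT_TIMEOUT_MS);
        }
        long millis = (value instanceof Number)
                ? ((Number) value).longValue()
                : Long.parseLong(value.toString().trim());
        if (millis <= 0) {
            throw new IllegalArgumentException(TIMEOUT_KEY + " must be positive, got " + millis);
        }
        return Duration.ofMillis(millis);
    }
}
```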

rzo1 · 2025-11-17T20:44:07Z

In addition, see build output.

klockla · 2025-11-18T08:59:43Z

> Hi @klockla,
>
> Thanks for the PR and for proposing an abstraction for PII removal during crawls. I’ve added a few comments.
>
> I also have a couple of questions:
>
> What was the reasoning behind choosing a bolt instead of a (parse) filter for the redaction step?
>
> Since this is a larger contribution, we’ll likely need an ICLA before we can accept it.

Hi @rzo1

I didn't really think about implementing it as a parse filter, but since the processing (text analysis by the NLP engine in Presidio) is quite costly, I think it may be better to keep it in a separate bolt to get better measurements of tuple processing time and latency.

I will fix the points related to your other comments and will need to have a look at this ICLA thing.

Signed-off-by: Laurent Klock <[email protected]>

jnioche · 2025-11-20T08:39:03Z

> I didn't really think about implementing it as a parse filter, but since the processing (text analysis by the NLP engine in Presidio) is quite costly, I think it may be better to keep it in a separate bolt to get better measurements of tuple processing time and latency.

I haven't looked at the details yet but I agree with @rzo1 that this feels like it should be a ParseFilter.

We log processing times in ParseFilters, but we could extend that to send proper metrics.

rzo1 · 2025-11-20T09:07:59Z

@jzonthemtn Could you take a look, particularly regarding how well the abstraction in "core" interoperates with other PII-redaction engines (such as Philter)?

jnioche · 2025-11-25T06:49:32Z

> @jzonthemtn Could you take a look, particularly regarding how well the abstraction in "core" interoperates with other PII-redaction engines (such as Philter)?

https://github.com/philterd/philter

rzo1 · 2025-11-25T14:34:18Z

I reached out to @jzonthemtn via ASF Slack. The essence of the reply:

> I think it could work [...]

So it would be possible to integrate another implementation in the long run.

rzo1 · 2025-11-25T14:41:47Z

> I haven't looked at the details yet but I agree with @rzo1 that this feels like it should be a ParseFilter.

I thought a bit about it (over the last weekend) and I think implementing this as a separate bolt would be ok, as it allows the NLP processing to run as an independent step, enabling better isolation, scaling, and more precise measurement of tuple processing time without impacting the parsing bolt (at all). This also allows redaction to be treated as an isolated use case, separate from the core parsing logic. So I would be ok with a bolt in that case.
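
To illustrate the measurement aspect, the sketch below times a single processing step and hands the elapsed milliseconds to a sink. `StepTimer` and its `timed` method are hypothetical names; in a topology the sink would presumably be whatever metric the bolt registers, which is not shown here.

```java
import java.util.function.LongConsumer;
import java.util.function.Supplier;

/** Hypothetical helper that measures how long one processing step takes. */
final class StepTimer {
    private StepTimer() {}

    /** Runs the step and reports the elapsed time in milliseconds, even if the step throws. */
    static <T> T timed(Supplier<T> step, LongConsumer elapsedMillisSink) {
        long start = System.nanoTime();
        try {
            return step.get();
        } finally {
            elapsedMillisSink.accept((System.nanoTime() - start) / 1_000_000L);
        }
    }
}
```

Because the redaction runs in its own bolt, such a measurement covers only the PII step and is not mixed with parsing time.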

klockla marked this pull request as ready for review November 17, 2025 16:43

rzo1 requested changes Nov 17, 2025

rzo1 requested review from jnioche and sigee November 17, 2025 20:41

rzo1 added this to the 3.5.1 milestone Nov 17, 2025

rzo1 added the enhancement, external, and java labels Nov 17, 2025

Added PiiBolt and PresidioRedacter implementation

3965c11

Signed-off-by: Laurent Klock <[email protected]>

klockla force-pushed the presidio branch from 093d5c0 to 3965c11 Compare November 18, 2025 16:11


3 participants