-
Notifications
You must be signed in to change notification settings - Fork 268
Added PiiBolt and PresidioRedacter implementation #1728
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
rzo1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @klockla,
Thanks for the PR and for proposing an abstraction for PII removal during crawls. I’ve added a few comments.
I also have a couple of questions:
What was the reasoning behind choosing a bolt instead of a (parse) filter for the redaction step?
Since this is a larger contribution, we’ll likely need an ICLA
before we can accept it.
| // Default value for language metadata field | ||
| private String languageFieldName = "parse.lang"; | ||
|
|
||
| OutputCollector _collector; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why package private?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and why in core?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and why in core?
My idea was to have the abstraction layer in core and the specific implementation(s) in external. Please let me know where to move it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why package private?
This is the default value of the metadata field that may contain the language. It can be configured through "pii.language.field" in the topology's configuration.
I see this as being at the same level as for instance queueMode in SimpleFetcherBolt or emitOutlinks in JSoupParserBolt
so why it shouldn't be private ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was looking at OutputCollector _collector; :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, will change it to protected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@klockla it can stay in core actually, since it will be extended by classes in modules
| LOG.info("No text to process for URL: {}", url); | ||
| metadata.addValue("pii.processed", "false"); | ||
| // Force the binary content to a dummy content | ||
| emitTuple(input, url, REDACTED_BYTES, metadata, ""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unsure if we should default to a redacted html here, if the original content was empty. Why not just return (similar to pii is disabled). Any reason for this?
| piiRedacter.redact(text); | ||
|
|
||
| if (redacted == null) { | ||
| throw new Exception("PII Redacter returned null"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed? Shouldn't we fallback or just default to something instead of raising a hard exception here; triggering a re-try in the topology?
external/presidio/src/main/java/org/apache/stormcrawler/pii/PresidioRedacter.java
Outdated
Show resolved
Hide resolved
|
|
||
| private List<String> supportedLanguages = Arrays.asList("en", "fr", "de", "xx"); | ||
|
|
||
| private OkHttpClient httpClient = new OkHttpClient(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make some of the properties configurable or re-use existing configuration, i.e. user-agent, timeouts or a like?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For user-agent I don't think it is needed, this is just used for the calls to the Presidio REST API
If you see any timeout property, I could reuse please let me know
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
In addition, see build output. |
Hi @rzo1 I didn't really think about implementing it as a parse filter but as the process (text analysis by the NLP engine in Presidio) is quite consuming, I think it may be better to have this in a separate bolt to have better measures about tuple processing time/latency. I will fix the points related to your other comments and will need to have a look at this ICLA thing. |
Signed-off-by: Laurent Klock <[email protected]>
I haven't looked at the details yet but I agree with @rzo1 that this feels like it should be a ParseFilter. We log processing times in ParseFilters but could extend that to send proper metrics |
|
@jzonthemtn Could you take a look, particularly regarding how well the abstraction in "core" interoperates with other PII-redaction engines (such as Philter)? |
|
|
I reached out to @jzonthemtn via ASF Slack with the essence:
so there would be the possibility to integrate another implementation on the long run. |
I thought a bit about it (over the last weekend) and I think implementing this as a separate bolt would be ok, as it allows the NLP processing to run as an independent step, enabling better isolation, scaling, and more precise measurement of tuple processing time without impacting the parsing bolt (at all). This also allows redaction to be treated as an isolated use case, separate from the core parsing logic. So I would be ok with a bolt in that case. |
This PR adds a bolt which allows removal of Personally Identifiable Information (PII).
The PiiBolt is to be used with a class implementing the PiiInterface and which will provide the actual implementation of PII.
This PR implements also the PresidioRedactor class which uses Microsoft Presidio
( https://microsoft.github.io/presidio/ ) as a PII back-end.
It can be configured for different PII entities (names, phones, location, etc...) and different languages according to how you deployed the back-end.