Description
Relates to elastic/crawler#144
Elastic has a few web crawlers (App Search Crawler, Elastic Web Crawler, Open Crawler). The Elastic Web Crawler has a feature to store full HTML, and we'll likely be adding the same feature to the Open Crawler at some point in the future.
The feature request is to make it easier for a user to work with HTML content stored in Elasticsearch fields. Currently, if I wanted to write a script query or use the script processor on an HTML field, I'd need to parse the content with a regex, which is a fragile way to deal with HTML. Instead, it would be nice to expose a Java or Groovy library for handling HTML as an object. Use case examples:
- finding the text value of a specific element
- counting the number of times a class/element is present on the page
- stripping headers/footers from the page
- removing embedded JavaScript
- obfuscating PII that might be embedded in the HTML in certain elements
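
To make the idea concrete, here is a rough sketch of parser-based handling for three of the use cases above (counting classes, stripping headers/footers, removing embedded JavaScript). It uses Python's standard-library `html.parser` purely for illustration and runs outside Elasticsearch; it is not an existing scripting API, and the names `PageSummary` and `summarize` are hypothetical.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Hypothetical helper: collects text outside unwanted elements
    and counts how often each CSS class appears on the page."""

    def __init__(self, skip_tags=("script", "style", "header", "footer")):
        super().__init__(convert_charrefs=True)
        self.skip_tags = set(skip_tags)
        self.skip_depth = 0      # > 0 while inside a skipped element
        self.class_counts = {}   # class name -> number of occurrences
        self.text_parts = []     # text chunks found outside skipped elements

    def handle_starttag(self, tag, attrs):
        # Count every class on every element, including skipped ones.
        for name in (dict(attrs).get("class") or "").split():
            self.class_counts[name] = self.class_counts.get(name, 0) + 1
        if tag in self.skip_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def summarize(html):
    """Return (text without skipped elements, class counts) for an HTML string."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.class_counts
```

Calling `summarize(page_html)` returns the page text without script, style, header, or footer content, plus a dictionary of class counts. The point is only that a real parser tracks element nesting, which a regex cannot do reliably; a Java library exposed to scripts would presumably offer a richer, selector-based interface.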