HTML parsing libs in Painless

### Description

Relates to https://github.com/elastic/crawler/issues/144

Elastic has a few web crawlers (App Search Crawler, Elastic Web Crawler, Open Crawler). The Elastic Web Crawler has a feature to [store full HTML](https://www.elastic.co/guide/en/enterprise-search/current/crawler-managing.html#crawler-managing-html-storage), and we'll likely be adding the same feature to the Open Crawler at some point in the future.

The feature request is to make it easier for users to work with HTML content stored in Elasticsearch fields. Currently, if I wanted to write a script query or use the ScriptProcessor on an HTML field, I'd need to parse the content with a regex, which is a fragile way to deal with HTML. Instead, it would be nice to expose a Java or Groovy library for handling HTML as an object. Example use cases:

- finding the text value of a specific element
- counting the number of times a class/element is present on the page
- stripping headers/footers from the page
- removing embedded javascript
- obfuscating PII that might be embedded in the HTML in certain elements
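To make the pain point concrete, here is a minimal sketch of what the regex-only approach looks like today for three of the use cases above. It is written in plain Java (which Painless syntax closely resembles) using only `java.util.regex`; the class name `RegexHtmlSketch`, its method names, and the sample HTML are hypothetical and purely illustrative, not part of any Elastic API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating regex-based HTML handling and where it breaks.
class RegexHtmlSketch {

    // Returns the raw inner content of the first <tag>...</tag> pair, or null if none is found.
    static String firstElementText(String html, String tag) {
        String quoted = Pattern.quote(tag);
        Pattern p = Pattern.compile(
            "<" + quoted + "\\b[^>]*>(.*?)</" + quoted + ">",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    // Counts double-quoted class attributes that list the given class name.
    static int countClass(String html, String cssClass) {
        Pattern p = Pattern.compile("class\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        int count = 0;
        while (m.find()) {
            for (String name : m.group(1).trim().split("\\s+")) {
                if (name.equals(cssClass)) {
                    count++;
                    break;
                }
            }
        }
        return count;
    }

    // Removes <script>...</script> blocks, including their contents.
    static String stripScripts(String html) {
        return html.replaceAll("(?is)<script\\b[^>]*>.*?</script>", "");
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Hello</title><script>var x = 1;</script></head>"
            + "<body><p class=\"note big\">a</p><p class='note'>b</p></body></html>";

        System.out.println(firstElementText(html, "title")); // Hello
        // Only 1 is reported: the single-quoted class='note' is silently missed.
        System.out.println(countClass(html, "note"));
        System.out.println(stripScripts(html).contains("var x")); // false
        // Nested elements break the lazy match: the result is "<div>inner".
        System.out.println(firstElementText("<div><div>inner</div>outer</div>", "div"));
    }
}
```

Even in this small sample the regexes miss a single-quoted attribute and mishandle nested elements, which is the kind of breakage a real HTML object model would avoid.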

HTML parsing libs in Painless #113132