HTML parsing libs in Painless #113132

Open
@seanstory

Description

Relates to elastic/crawler#144

Elastic has a few web crawlers (App Search Crawler, Elastic Web Crawler, Open Crawler). The Elastic Web Crawler has a feature to store full HTML, and we'll likely add the same feature to the Open Crawler in the future.

The feature request is to make it easier for a user to work with HTML content stored in Elasticsearch fields. Currently, if I wanted to write a script query or use the ScriptProcessor on an HTML field, I'd need to parse the content with a regex, which is not a great way to deal with HTML. Instead, it would be nice to expose a Java or Groovy library in Painless for dealing with HTML as an object. Use case examples:

  • finding the text value of a specific element
  • counting the number of times a class/element is present on the page
  • stripping headers/footers from the page
  • removing embedded JavaScript
  • obfuscating PII that might be embedded in the HTML in certain elements
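
To make a few of these use cases concrete, here is a rough sketch of what "HTML as an object" could look like. It is written in Python with the standard library's `html.parser` purely for illustration. It is not Painless, and nothing here reflects an API that Painless exposes today. The `PageStats` class, the `analyze` helper, and the sample markup are all hypothetical names invented for this example.

```python
from html.parser import HTMLParser

# Tags whose content should be dropped (embedded scripts, page chrome).
SKIP_TAGS = {"script", "style", "header", "footer"}
# Void elements never get a closing tag, so they are not tracked on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PageStats(HTMLParser):
    """Hypothetical helper: counts a class, grabs one element's text,
    and collects page text with scripts, headers, and footers removed."""

    def __init__(self, target_tag, target_class):
        super().__init__()
        self.target_tag = target_tag
        self.target_class = target_class
        self.class_count = 0   # elements carrying target_class
        self.target_text = []  # text found inside target_tag
        self.kept_text = []    # all text outside SKIP_TAGS
        self._stack = []       # currently open tags

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._stack.append(tag)
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self.class_count += 1

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self._stack:
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        if SKIP_TAGS.intersection(self._stack):
            return
        text = data.strip()
        if not text:
            return
        self.kept_text.append(text)
        if self.target_tag in self._stack:
            self.target_text.append(text)


def analyze(html, target_tag, target_class):
    parser = PageStats(target_tag, target_class)
    parser.feed(html)
    parser.close()
    return parser


SAMPLE_HTML = """<html><head><script>var x = 1;</script></head>
<body><header>Site nav</header>
<h1 class="title">Plans</h1>
<p class="plan">Basic</p><p class="plan featured">Pro</p>
<footer>Copyright</footer></body></html>"""

stats = analyze(SAMPLE_HTML, "h1", "plan")
print(stats.class_count, stats.target_text, stats.kept_text)
```

None of this needs a regex: the parser tracks element boundaries, so the same pass handles class counting, text extraction, and dropping scripts or page chrome. A library exposed in Painless would presumably offer a richer, selector-based interface, but that design is outside what this issue specifies.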
