HTML pipeline processor #113133

Relates to https://github.com/elastic/crawler/issues/144
Relates to https://github.com/elastic/elasticsearch/issues/113132

Elastic has a few web crawlers (App Search Crawler, Elastic Web Crawler, Open Crawler). The Elastic Web Crawler has a feature to [store full HTML](https://www.elastic.co/guide/en/enterprise-search/current/crawler-managing.html#crawler-managing-html-storage), and we'll likely be adding the same feature to the Open Crawler at some point in the future.

The feature request is to make it easier for a user to use HTML content in Elasticsearch fields without having to write code to parse HTML. Currently, if I wanted to do some processing of an HTML field in an ingest pipeline, I'd need to use a script processor on that field and parse it with regexes. https://github.com/elastic/elasticsearch/issues/113132 would make it easier to do this in a script processor, but some users would prefer not to get so far into the weeds for simpler HTML processing tasks. Common use cases might include:

- removing specific elements and their children (for dropping headers, footers, and ads)
- pulling specific element text into other fields (like getting `<h1>` or `<title>` into a `title` field)
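
To make the two use cases above concrete, here is a minimal Python sketch of the behavior being asked for. It is not an Elasticsearch API or a proposed processor configuration: the `HtmlFieldExtractor` class, the `process_html` function, and the choice of tags are all hypothetical, and it relies only on the standard-library `html.parser` module.

```python
from html.parser import HTMLParser


class HtmlFieldExtractor(HTMLParser):
    """Hypothetical helper: drops some elements, captures the text of others."""

    def __init__(self, remove_tags, extract_tags):
        super().__init__(convert_charrefs=True)
        self.remove_tags = set(remove_tags)
        self.extract_tags = set(extract_tags)
        self.skip_depth = 0      # > 0 while inside a removed element
        self.capture_stack = []  # extract tags that are currently open
        self.body_parts = []     # text that survives removal
        self.extracted = {}      # tag name -> list of text chunks

    def handle_starttag(self, tag, attrs):
        if tag in self.remove_tags:
            self.skip_depth += 1
        elif not self.skip_depth and tag in self.extract_tags:
            self.capture_stack.append(tag)
            self.extracted.setdefault(tag, [])

    def handle_endtag(self, tag):
        if tag in self.remove_tags and self.skip_depth:
            self.skip_depth -= 1
        elif self.capture_stack and self.capture_stack[-1] == tag:
            self.capture_stack.pop()

    def handle_data(self, data):
        if self.skip_depth:
            return  # text inside a removed element (or its children) is dropped
        self.body_parts.append(data)
        if self.capture_stack:
            self.extracted[self.capture_stack[-1]].append(data)


def process_html(html, remove_tags=("header", "footer"), extract_tags=("title", "h1")):
    """Return (remaining_text, fields) for an HTML string.

    If an extracted tag occurs more than once, its text is concatenated.
    """
    parser = HtmlFieldExtractor(remove_tags, extract_tags)
    parser.feed(html)
    parser.close()
    fields = {tag: "".join(chunks).strip() for tag, chunks in parser.extracted.items()}
    text = " ".join(" ".join(parser.body_parts).split())  # collapse whitespace
    return text, fields


sample = (
    "<html><head><title>Pricing</title></head>"
    "<body><header><nav>Home</nav></header>"
    "<h1>Plans &amp; pricing</h1><p>Pick a plan.</p>"
    "<footer>Copyright</footer></body></html>"
)
text, fields = process_html(sample)
print(fields)
print(text)
```

Here the `<header>` and `<footer>` subtrees are dropped from the remaining text, while the `<title>` and `<h1>` text is collected into separate fields. A dedicated processor would let users configure this kind of behavior declaratively instead of writing it by hand.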
