HTML pipeline processor #113133

Open
@seanstory

Description

Relates to elastic/crawler#144
Relates to #113132

Elastic has a few web crawlers (App Search Crawler, Elastic Web Crawler, Open Crawler). The Elastic Web Crawler has a feature to store full HTML, and we'll likely be adding the same feature to the Open Crawler at some point in the future.

The feature request is to make it easier for a user to use HTML content in Elasticsearch fields without having to write code to parse HTML. Currently, if I wanted to process an HTML field in an ingest pipeline, I'd need to run a script processor on that field and parse it with regexes. #113132 would make it easier to do this in a script processor. But some users would prefer not to get so far into the weeds for simpler HTML processing tasks. Common use cases might include:

  • removing specific elements and their children (for dropping headers, footers, and ads)
  • pulling specific element text into other fields (like getting <h1> or <title> into a title field)
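
To make the two use cases above concrete, here is a minimal sketch of the behavior being asked for, written in Python with only the standard library. It is not Elasticsearch code and not a proposed API: `HtmlFieldExtractor`, `process_html`, and the `remove_tags` / `extract_tags` parameters are hypothetical names chosen for illustration, and the sketch assumes reasonably well-formed HTML (every non-void element has a closing tag).

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class HtmlFieldExtractor(HTMLParser):
    """Hypothetical helper: drops some elements, captures the text of others."""

    def __init__(self, remove_tags, extract_tags):
        super().__init__(convert_charrefs=True)
        self.remove_tags = set(remove_tags)
        self.extract_tags = set(extract_tags)
        self.skip_depth = 0    # > 0 while inside a removed element
        self.capturing = []    # stack of extract tags that are currently open
        self.extracted = {}    # tag name -> text of its first occurrence
        self.body_parts = []   # text that was neither removed nor extracted

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth or tag in self.remove_tags:
            # Inside (or entering) a removed element: track nesting only.
            self.skip_depth += 1
        elif tag in self.extract_tags and tag not in self.extracted:
            self.capturing.append(tag)
            self.extracted[tag] = ""

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
        elif self.capturing and self.capturing[-1] == tag:
            self.capturing.pop()

    def handle_data(self, data):
        if self.skip_depth:
            return  # text belongs to a removed element or one of its children
        if self.capturing:
            self.extracted[self.capturing[-1]] += data
        else:
            self.body_parts.append(data)


def process_html(html, remove_tags=("header", "footer"),
                 extract_tags=("title", "h1")):
    """Return a dict with one key per extracted tag plus a cleaned 'body'."""
    parser = HtmlFieldExtractor(remove_tags, extract_tags)
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the fields are easy to compare and index.
    fields = {tag: " ".join(text.split())
              for tag, text in parser.extracted.items()}
    fields["body"] = " ".join(" ".join(parser.body_parts).split())
    return fields


if __name__ == "__main__":
    page = ("<html><head><title>My Page</title></head><body>"
            "<header><nav>Home</nav></header><h1>Welcome</h1>"
            "<p>Real content.</p><footer>Copyright</footer></body></html>")
    print(process_html(page))
```

In this sketch, extracted text is moved out of the body rather than copied, and only the first occurrence of each extracted tag is captured. Both are arbitrary choices for the example; a real processor would presumably make them configurable.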
