
Separate extraction+selection from cleaning, serializing and other features. #16

@kevincox

Description

While there is a lot of public API, it seems that the core method is readability::extractor::extract, with scrape as a small wrapper around it. However, this function makes a lot of assumptions. Currently it:

  1. Extracts candidates.
  2. Ranks candidates.
  3. Cleans the DOM.
  4. Fixes up img links.
  5. Extracts a title.
  6. Generates a plain-text version.

1 and 2 seem pretty core to the library, but the others are more peripheral or could be done separately. Based on my reading, 4 is basically free because it happens while iterating the DOM, but 3, 5 and 6 are just run on the output, so they don't need to be bundled. For example, in my use case I am not using the text, I have my own, more reliable process for fixing up links, and I don't need the title, so all of that is wasted work.
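For reference, here is roughly what a caller looks like today (a sketch only; the markup, URL and error handling are made up, and the signature is the one from the snippet further down):

use std::io::Cursor;
use url::Url;
use readability::extractor;

fn main() {
  let page = "<html><head><title>Post</title></head><body><article><p>Hello world.</p></article></body></html>";
  let url = Url::parse("https://example.com/post").unwrap();

  // This single call runs all six steps above: candidate extraction and
  // ranking, DOM cleaning, img-link fix-ups, title extraction and the
  // plain-text rendering -- even if the caller only wants the ranked HTML.
  let product = extractor::extract(&mut Cursor::new(page), &url).unwrap();
  println!("{}", product.title);
}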

I wonder if a better API would be something like:

  • extractv2() that returns some sort of opaque result object Extracted.
  • Extracted#html with a settings object that controls sanitizing, simplifying and URL rewriting.
  • Extracted#text and Extracted#title for the other outputs (rough sketch below).
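A rough sketch of what that could look like; Extracted, HtmlOptions and extractv2 are placeholder names rather than existing APIs, and Error stands for the crate's existing error type:

use std::io::Read;
use url::Url;

// Placeholder settings object controlling the optional post-processing.
#[derive(Default)]
pub struct HtmlOptions {
  pub clean_dom: bool,        // step 3: sanitize/simplify the DOM
  pub rewrite_img_urls: bool, // step 4: absolutize img links against the page URL
}

// Opaque result holding only the extracted and ranked candidate subtree;
// nothing else has been computed yet.
pub struct Extracted { /* ranked DOM + page URL */ }

impl Extracted {
  // Serialize the winning subtree, applying only the requested fix-ups.
  pub fn html(&self, opts: HtmlOptions) -> String { unimplemented!() }
  // Plain-text rendering, computed only when asked for (step 6).
  pub fn text(&self) -> String { unimplemented!() }
  // Title extraction, computed only when asked for (step 5).
  pub fn title(&self) -> String { unimplemented!() }
}

// Steps 1 and 2 only: extract and rank candidates.
pub fn extractv2<R: Read>(input: &mut R, url: &Url) -> Result<Extracted, Error> {
  unimplemented!()
}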

The current extract would then be implemented as something like:

fn extract<R: Read>(input: &mut R, url: &Url) -> Result<Product, Error> {
  let extracted = extractv2(input, url)?;
  // All three outputs are still produced eagerly, preserving today's behaviour.
  Ok(Product {
    title: extracted.title(),
    html: extracted.html(Default::default()),
    text: extracted.text(),
  })
}

But importantly, each of these pieces could then be run independently for efficiency, and the different output formats could each be parametrized naturally with their own output settings.
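To make that concrete, a caller like mine would only pay for the pieces it uses (again using the placeholder names from the sketch above):

fn my_use_case<R: Read>(input: &mut R, url: &Url) -> Result<String, Error> {
  let extracted = extractv2(input, url)?;
  // Only steps 1 and 2 plus serialization run here: no title extraction,
  // no plain-text pass, and img-link rewriting is left to my own pipeline.
  Ok(extracted.html(HtmlOptions {
    clean_dom: true,
    rewrite_img_urls: false,
  }))
}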
