
Separate extraction+selection from cleaning, serializing and other features. #16

@kevincox

Description

While there is a lot of public API, it seems that the core method is readability::extractor::extract, with scrape as a small wrapper around it. However, this function makes a lot of assumptions. Currently it:

  1. Extracts candidates.
  2. Ranks candidates.
  3. Cleans the DOM.
  4. Fixes up img links.
  5. Extracts a title.
  6. Generates a plain-text version.

1 and 2 seem pretty core to the library, but the others are more peripheral or could be done separately. Based on my reading, 4 is basically free because it happens while iterating the DOM, but 3, 5 and 6 are just run on the output, so they don't need to be bundled. For example, in my use case I am not using the text, I have my own, more reliable process for fixing up links, and I don't need the title, so all of that is wasted work.
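For reference, here is roughly what a caller looks like today (a sketch only; the markup, URL and error handling are made up, and the signature is the one from the snippet further down):

use std::io::Cursor;
use url::Url;
use readability::extractor;

fn main() {
  let page = "<html><head><title>Post</title></head><body><article><p>Hello world.</p></article></body></html>";
  let url = Url::parse("https://example.com/post").unwrap();

  // This single call runs all six steps above: candidate extraction and
  // ranking, DOM cleaning, img-link fix-ups, title extraction and the
  // plain-text rendering -- even if the caller only wants the ranked HTML.
  let product = extractor::extract(&mut Cursor::new(page), &url).unwrap();
  println!("{}", product.title);
}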

I wonder if a better API would be something like:

  • extractv2() that returns some sort of opaque result object Extracted.
  • Extracted#html with a settings object that controls sanitizing, simplifying and URL rewriting.
  • Extracted#text and Extracted#title for the other outputs (rough sketch below).
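A rough sketch of what that could look like; Extracted, HtmlOptions and extractv2 are placeholder names rather than existing APIs, and Error stands for the crate's existing error type:

use std::io::Read;
use url::Url;

// Placeholder settings object controlling the optional post-processing.
#[derive(Default)]
pub struct HtmlOptions {
  pub clean_dom: bool,        // step 3: sanitize/simplify the DOM
  pub rewrite_img_urls: bool, // step 4: absolutize img links against the page URL
}

// Opaque result holding only the extracted and ranked candidate subtree;
// nothing else has been computed yet.
pub struct Extracted { /* ranked DOM + page URL */ }

impl Extracted {
  // Serialize the winning subtree, applying only the requested fix-ups.
  pub fn html(&self, opts: HtmlOptions) -> String { unimplemented!() }
  // Plain-text rendering, computed only when asked for (step 6).
  pub fn text(&self) -> String { unimplemented!() }
  // Title extraction, computed only when asked for (step 5).
  pub fn title(&self) -> String { unimplemented!() }
}

// Steps 1 and 2 only: extract and rank candidates.
pub fn extractv2<R: Read>(input: &mut R, url: &Url) -> Result<Extracted, Error> {
  unimplemented!()
}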

The current extract would then be implemented as something like:

fn extract<R: Read>(input: &mut R, url: &Url) -> Result<Product, Error> {
  let extracted = extractv2(input, url)?;
  // All three outputs are still produced eagerly, preserving today's behaviour.
  Ok(Product {
    title: extracted.title(),
    html: extracted.html(Default::default()),
    text: extracted.text(),
  })
}

But importantly, each of these pieces could then be run independently for efficiency, and the different output formats could each be parametrized naturally with their own output settings.
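To make that concrete, a caller like mine would only pay for the pieces it uses (again using the placeholder names from the sketch above):

fn my_use_case<R: Read>(input: &mut R, url: &Url) -> Result<String, Error> {
  let extracted = extractv2(input, url)?;
  // Only steps 1 and 2 plus serialization run here: no title extraction,
  // no plain-text pass, and img-link rewriting is left to my own pipeline.
  Ok(extracted.html(HtmlOptions {
    clean_dom: true,
    rewrite_img_urls: false,
  }))
}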
