Description
While there is a lot of public API, it seems that the core method is `readability::extractor::extract`, with `scrape` as a small wrapper. However, this function makes a lot of assumptions. Currently it does the following:
1. Extracts candidates.
2. Ranks candidates.
3. Cleans the DOM.
4. Fixes up `img` links.
5. Extracts a title.
6. Generates a plain-text version.
Steps 1 and 2 seem pretty core to the library. However, the others all seem to be more on the side, or could be done separately. Based on my reading, step 4 is basically free because it is done while iterating the DOM, but steps 3, 5 and 6 are just run on the output, so they don't need to be bundled. For example, in my use case I am not using the text, I have my own process to fix up links more reliably, and I don't need the title, so that is all wasted work.
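For concreteness, here is a rough sketch of what a caller that only wants the HTML looks like today (assuming the `extract` signature quoted further down and the `url` crate's `Url`; the `Product` field name is an assumption on my part):

```rust
use std::fs::File;
use url::Url;
use readability::extractor;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let url = Url::parse("https://example.com/article")?;
    let mut input = File::open("article.html")?;

    // extract() runs all six steps unconditionally. If the caller only
    // wants the cleaned HTML, the title extraction, the plain-text pass
    // and the built-in link fix-up are wasted work.
    let product = extractor::extract(&mut input, &url)?;
    println!("{}", product.content); // field name assumed; check the Product struct
    Ok(())
}
```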
I wonder if a better API would be something like:
- `extractv2()` that returns some sort of opaque result object `Extracted`.
- `Extracted#html` with a settings object that controls sanitizing, simplifying and URL rewriting.
- `Extracted#text`, `Extracted#title`
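As a rough sketch only (all of these names, including `HtmlSettings`, are hypothetical placeholders rather than existing crate items), the shape could be something like:

```rust
use std::io::Read;
use url::Url;

/// Hypothetical settings object controlling how the HTML output is produced.
#[derive(Default)]
pub struct HtmlSettings {
    pub sanitize: bool,
    pub simplify: bool,
    pub rewrite_urls: bool,
}

/// Opaque result of candidate extraction and ranking; nothing else has run yet.
pub struct Extracted {
    // ranked DOM, base URL, ...
}

impl Extracted {
    /// Serialize the content, applying only the clean-up the caller asks for.
    pub fn html(&self, _settings: HtmlSettings) -> String { unimplemented!() }
    /// Generate the plain-text rendering on demand.
    pub fn text(&self) -> String { unimplemented!() }
    /// Extract the title on demand.
    pub fn title(&self) -> String { unimplemented!() }
}

/// Steps 1 and 2 only: parse the input, extract candidates and rank them.
pub fn extractv2<R: Read>(_input: &mut R, _url: &Url) -> std::io::Result<Extracted> {
    unimplemented!()
}
```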
The current `extract` would then be implemented as something like:
```rust
fn extract<R: Read>(input: &mut R, url: &Url) -> Result<Product, Error> {
    let extracted = extractv2(input, url)?;
    Ok(Product {
        title: extracted.title(),
        html: extracted.html(Default::default()),
        text: extracted.text(),
    })
}
```

But importantly, all of the bits can be done independently for efficiency, and the different output formats can be naturally parametrized for output settings.
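A caller that only wants the HTML could then skip the rest entirely, again using the hypothetical names from the sketch above:

```rust
use std::io::Read;
use url::Url;

fn html_only<R: Read>(input: &mut R, url: &Url) -> std::io::Result<String> {
    let extracted = extractv2(input, url)?;
    // No title extraction, no plain-text pass, no built-in link rewriting:
    // only the serialization work this caller actually asked for.
    Ok(extracted.html(HtmlSettings { sanitize: true, ..Default::default() }))
}
```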