Use Readability to crawl webpages #13

@PierreMesure

Hi @Jontpan, we started using Intric a few weeks ago at Ekonomistyrningsverket. I came back from parental leave last week and I now have an account.

Where would be a good place to send you feature requests? Can I do it by creating issues here?
I'll start with a first one:

I recently created a first assistant using a website as a source and the crawling worked great. That being said, I noticed that the text extracted from webpages contains a lot of useless content: headers, footers, menus and sidebars.

I would like to suggest using a package such as Mozilla's Readability (Python version) to extract the article or the main text on the page, when it exists.

I had a look at the code and I think it could be used here instead of html2text. Nothing wrong with Aaron's 13-year-old package, but website menus aren't useful content here.
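
To illustrate the kind of filtering I mean, here is a minimal sketch using only Python's standard library. It is not Readability's algorithm and not Intric's crawler code: the tag list, the `MainTextExtractor` class and the `extract_main_text` helper are hypothetical names I made up for this example. It simply drops text inside elements that usually hold page chrome. A package such as Readability uses more elaborate content scoring, so treat this only as a picture of the expected result.

```python
from html.parser import HTMLParser

# Tags whose content is usually navigation or page chrome rather than article
# text. This list is an illustrative assumption, not Readability's heuristic.
BOILERPLATE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect text that sits outside any boilerplate element."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a boilerplate element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text found outside boilerplate elements.
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def extract_main_text(html: str) -> str:
    """Return the page text with header, footer, menus and sidebars removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# A made-up page with a header, a menu, an article and a footer.
SAMPLE = """
<html><body>
<header><a href="/">Home</a></header>
<nav><ul><li>Menu item</li></ul></nav>
<main><h1>Title</h1><p>Article body.</p></main>
<footer>Contact us</footer>
</body></html>
"""

print(extract_main_text(SAMPLE))
```

On this sample, only the title and the article paragraph are kept, while the header link, the menu item and the footer text are dropped.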
