Use Readability to crawl webpages

Hi @Jontpan, we started using Intric a few weeks ago at Ekonomistyrningsverket. I came back from parental leave last week and I now have an account.

Where would be a good place to send you feature requests? Can I do it by creating issues here?
I'll start with a first one:

I recently created a first assistant using a website as a source, and the crawling worked great. That being said, I noticed that the text extracted from webpages contains a lot of content that is useless as a source: headers, footers, menus and sidebars.

I would like to suggest using a package such as Mozilla's [readability](https://github.com/mozilla/readability) ([python version](https://github.com/alan-turing-institute/ReadabiliPy)) to extract the article or the main text of the page, when it exists.

I had a look at the code and I think it could be used [here](https://github.com/inooLabs/intric-release/blob/57a2a28f1f41574275b78de6c3bf7cb698d3b073/backend/src/intric/crawler/parse_html.py#L25) instead of html2text. There is nothing wrong with Aaron's 13-year-old package, but website menus aren't useful content here.
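To make the idea concrete, here is a rough standard-library sketch of the kind of filtering I mean. It is not Readability's algorithm and not Intric's current code: it simply drops the text inside a hard-coded set of tags (`BOILERPLATE_TAGS`, my own assumption) before collecting the rest. The names `MainTextExtractor` and `extract_main_text` are hypothetical; a real change would presumably call Readability or ReadabiliPy instead.

```python
from html.parser import HTMLParser

# Tags treated as page chrome. This list is an assumption for illustration;
# Readability uses scoring heuristics rather than a fixed tag list.
BOILERPLATE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside a boilerplate tag."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside any boilerplate element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside boilerplate elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def extract_main_text(html: str) -> str:
    """Hypothetical helper: return page text without header/footer/menu text."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


SAMPLE = """<html><body>
<header><a href="/">Home</a></header>
<nav><ul><li>Menu item</li></ul></nav>
<article><h1>Title</h1><p>Main text of the page.</p></article>
<aside>Related links</aside>
<footer>Contact us</footer>
</body></html>"""

if __name__ == "__main__":
    print(extract_main_text(SAMPLE))
```

On the sample page only the article's title and paragraph survive. Many real sites mark up menus with plain `div`s, which is why a heuristic extractor like Readability is likely a better fit than a fixed tag list.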

Use Readability to crawl webpages #13

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Use Readability to crawl webpages #13

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions