Replace BeautifulSoup with Scrapling's parser #4106

@D4Vinci

Hi, I like your project, and I use it, but I see you are still using BeautifulSoup. Well, actually, there's a better, faster new parser now: Scrapling.

Scrapling is up to 1735 times faster than BS4 with html5lib and up to 698 times faster than BS4 with lxml, as the benchmarks below show, while using significantly less memory. While profiling another library I use with memray, I noticed that BeautifulSoup consumes more than 7 MB during import alone.

Image

Scrapling, by contrast, uses 1.5 MB for the same task, import included. Scrapling's parser is built on lxml.
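
For anyone who wants to sanity-check import-time memory without memray, here is a minimal sketch using only the standard library. It assumes `tracemalloc` is an acceptable stand-in. `tracemalloc` only sees allocations made through Python's own allocator, so native allocations (for example inside lxml's C code) are not counted and the figures will not match memray's. The helper name `import_peak_bytes` is made up for illustration.

```python
import subprocess
import sys


def import_peak_bytes(module_name: str) -> int:
    """Return the peak traced allocation (in bytes) while importing a module.

    The import runs in a fresh interpreter so that modules already cached
    in the current process do not hide the real cost.
    """
    snippet = (
        "import importlib, tracemalloc\n"
        "tracemalloc.start()\n"
        f"importlib.import_module({module_name!r})\n"
        # get_traced_memory() returns (current, peak); report the peak.
        "print(tracemalloc.get_traced_memory()[1])\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", snippet],
        check=True,
        capture_output=True,
        text=True,
    )
    # Use the last line in case the imported module prints something itself.
    return int(result.stdout.strip().splitlines()[-1])


if __name__ == "__main__":
    # Standard-library modules are used here so the sketch runs anywhere.
    for name in ("json", "html.parser"):
        print(f"{name}: {import_peak_bytes(name) / 1e6:.2f} MB peak")
```

To compare the two parsers, pass `"bs4"` and `"scrapling"` instead, provided both packages are installed in the interpreter that runs the script.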

Scrapling also does more than select elements with CSS/XPath selectors. It offers new options, such as selecting elements by their text content (using literal search or regular expressions), and a find function similar to BS's, but more powerful and much faster.

Also, Scrapling automatically handles invalid/incomplete HTML through lxml.
Here's an article that shows every function in BeautifulSoup with its equivalent in Scrapling to make the migration easy: https://scrapling.readthedocs.io/en/latest/tutorials/migrating_from_beautifulsoup/

Also, if that interests you, it provides a way to build self-healing spiders that adapt to website design changes without AI, and a method for finding elements similar to one you have already found, as the AutoScraper library does, only much faster.

Here are two benchmarks from the documentation that compare it to other Python libraries:

Text Extraction Speed Test (5000 nested elements)

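| # | Library | Time (ms) | vs Scrapling |
|---|---------|-----------|--------------|
| 1 | Scrapling | 1.92 | 1.0x |
| 2 | Parsel/Scrapy | 1.99 | 1.036x |
| 3 | Raw Lxml | 2.33 | 1.214x |
| 4 | PyQuery | 20.61 | ~11x |
| 5 | Selectolax | 80.65 | ~42x |
| 6 | BS4 with Lxml | 1283.21 | ~698x |
| 7 | MechanicalSoup | 1304.57 | ~679x |
| 8 | BS4 with html5lib | 3331.96 | ~1735x |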

Element Similarity & Text Search Performance

| Library | Time (ms) | vs Scrapling |
|---------|-----------|--------------|
| Scrapling | 1.87 | 1.0x |
| AutoScraper | 10.24 | 5.476x |

It would be a solid addition to unstructured. What do you think?

I'm the author of Scrapling, so you might think I'm biased towards my library, but that's not true; you can see for yourself.
If you need any changes to make this happen, please let me know.
