Description
Hi, I like your project and I use it, but I see you are still using BeautifulSoup. There is now a better, faster parser: Scrapling.
Scrapling is up to 1735 times faster than BS4 with html5lib and up to 668 times faster than BS4 with lxml, as the benchmarks below show, while using significantly less memory. While analyzing memory usage with memray for another library I use, I noticed that BeautifulSoup consumes more than 7 MB during import alone, whereas Scrapling uses 1.5 MB for the same task, including the import.
Scrapling's parser is built on lxml.
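For anyone who wants to sanity-check import-time memory figures like these, here is a minimal sketch using Python's standard `tracemalloc` module. It is not the memray measurement quoted above: `tracemalloc` only sees allocations made through Python's allocator, so the numbers will differ from memray's. The helper name `import_peak_bytes` is hypothetical.

```python
import importlib
import tracemalloc


def import_peak_bytes(module_name: str) -> int:
    """Return the peak bytes allocated (as seen by tracemalloc) while importing a module.

    Caveats: native allocations made inside C extensions are not counted, and
    a module that is already imported is a cache hit that allocates almost
    nothing, so run this in a fresh interpreter for a meaningful number.
    """
    tracemalloc.start()
    try:
        importlib.import_module(module_name)
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


# Example (assumes the package is installed):
# print(import_peak_bytes("bs4") / 1_000_000, "MB")
```

Running it once per library, each in its own fresh interpreter, gives a rough like-for-like comparison.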
Scrapling doesn't only select elements with CSS/XPath selectors; it also offers new options, such as selecting elements by their text content using literal search or regular expressions, and a find function similar to BS's but more powerful and much faster.
Also, Scrapling automatically handles invalid/incomplete HTML through lxml.
Here's an article that shows every function in BeautifulSoup with its equivalent in Scrapling to make the migration easy: https://scrapling.readthedocs.io/en/latest/tutorials/migrating_from_beautifulsoup/
If it interests you, Scrapling also provides a way to build self-healing spiders that adapt to website design changes without AI, and a method for finding elements similar to ones already found, as the AutoScraper library does, but much faster and better at it.
Here are two benchmarks from the documentation that compare it with other Python parsing libraries:
Text Extraction Speed Test (5000 nested elements)
| # | Library | Time (ms) | vs Scrapling |
|---|---|---|---|
| 1 | Scrapling | 1.92 | 1.0x |
| 2 | Parsel/Scrapy | 1.99 | 1.036x |
| 3 | Raw Lxml | 2.33 | 1.214x |
| 4 | PyQuery | 20.61 | ~11x |
| 5 | Selectolax | 80.65 | ~42x |
| 6 | BS4 with Lxml | 1283.21 | ~668x |
| 7 | MechanicalSoup | 1304.57 | ~679x |
| 8 | BS4 with html5lib | 3331.96 | ~1735x |
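
The "vs Scrapling" column is each library's time divided by Scrapling's time. A minimal sketch for recomputing it from the times in the table (the function name `slowdown_vs` is hypothetical):

```python
# Times in milliseconds, copied from the table above.
times_ms = {
    "Scrapling": 1.92,
    "Parsel/Scrapy": 1.99,
    "Raw Lxml": 2.33,
    "PyQuery": 20.61,
    "Selectolax": 80.65,
    "BS4 with Lxml": 1283.21,
    "MechanicalSoup": 1304.57,
    "BS4 with html5lib": 3331.96,
}


def slowdown_vs(baseline: str, times: dict[str, float]) -> dict[str, float]:
    """Return each library's time as a multiple of the baseline's time."""
    base = times[baseline]
    return {name: t / base for name, t in times.items()}


ratios = slowdown_vs("Scrapling", times_ms)
# Print fastest first, one library per line.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name:<20} {ratio:8.1f}x")
```
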
Element Similarity & Text Search Performance
| Library | Time (ms) | vs Scrapling |
|---|---|---|
| Scrapling | 1.87 | 1.0x |
| AutoScraper | 10.24 | 5.476x |
It would be a solid addition to unstructured. What do you think?
I'm the author of Scrapling, so you might think I'm biased towards my library, but that's not true; you can see for yourself.
If you need any changes to make this happen, please let me know.