Replace BeautifulSoup with Scrapling's parser #4106

Hi, I like your project and use it, but I see it still relies on BeautifulSoup. There is now a faster parser available: [Scrapling](https://github.com/D4Vinci/Scrapling).

Scrapling is up to 1735 times faster than `BS4 with html5lib` and up to 698 times faster than `BS4 with Lxml`, as the benchmarks below show, while using significantly less memory. While analyzing memory usage with [memray](https://github.com/bloomberg/memray) for another library I use, I noticed that BeautifulSoup consumes more than 7 MB during import alone.

<img width="2539" height="346" alt="Image" src="https://github.com/user-attachments/assets/fb33e186-f22f-40cf-b972-d2194d7d37de" />

Scrapling, by contrast, uses 1.5 MB for the same task, import included. Scrapling's parser is built on [lxml](https://lxml.de/).
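
If you want a rough cross-check of import cost without installing memray, the standard library's `tracemalloc` can give a ballpark figure. This is only a minimal sketch under stated assumptions: `import_memory_bytes` is a hypothetical helper name, and `tracemalloc` only sees allocations made through Python's allocator, so it can miss memory allocated directly by C extensions. The numbers it reports will therefore not match memray's.

```python
import importlib
import tracemalloc


def import_memory_bytes(module_name: str) -> tuple[int, int]:
    """Return (retained, peak) bytes traced while importing `module_name`.

    Hypothetical helper for illustration. Only Python-level allocations
    are counted, and a module that is already imported costs almost nothing,
    so run this in a fresh interpreter for a meaningful figure.
    """
    tracemalloc.start()
    try:
        before, _ = tracemalloc.get_traced_memory()
        importlib.import_module(module_name)
        after, peak = tracemalloc.get_traced_memory()
    finally:
        # Always stop tracing so later code is not slowed down.
        tracemalloc.stop()
    return after - before, peak - before
```

For example, calling `import_memory_bytes("bs4")` in one fresh interpreter and `import_memory_bytes("scrapling")` in another (assuming both packages are installed) gives two figures that can be compared with each other, though not with the memray screenshot above.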

Scrapling is not limited to CSS/XPath selectors. It also offers new options, such as selecting elements by their text content using lateral search or regular expressions, and a `find` function similar to BeautifulSoup's but more powerful and much faster.

Also, `Scrapling` automatically handles invalid/incomplete HTML through `lxml`.
Here's an article that maps every `BeautifulSoup` function to its `Scrapling` equivalent to make the migration easy: https://scrapling.readthedocs.io/en/latest/tutorials/migrating_from_beautifulsoup/

If that interests you, it also lets you build self-healing spiders that adapt to website design changes without AI, and it can find elements similar to ones you have already found, as the `AutoScraper` library does, only much faster (see the second benchmark below).

Here are two benchmarks from the [documentation](https://scrapling.readthedocs.io/en/latest/benchmarks/) that compare it with other popular Python libraries:

### Text Extraction Speed Test (5000 nested elements)

| # |      Library      | Time (ms) | vs Scrapling | 
|---|:-----------------:|:---------:|:------------:|
| 1 |     Scrapling     |   1.92    |     1.0x     |
| 2 |   Parsel/Scrapy   |   1.99    |    1.036x    |
| 3 |     Raw Lxml      |   2.33    |    1.214x    |
| 4 |      PyQuery      |   20.61   |     ~11x     |
| 5 |    Selectolax     |   80.65   |     ~42x     |
| 6 |   BS4 with Lxml   |  1283.21  |    ~698x     |
| 7 |  MechanicalSoup   |  1304.57  |    ~679x     |
| 8 | BS4 with html5lib |  3331.96  |    ~1735x    |

### Element Similarity & Text Search Performance

|   Library   | Time (ms) | vs Scrapling |
|-------------|:---------:|:------------:|
|  Scrapling  |   1.87    |     1.0x     |
| AutoScraper |   10.24   |    5.476x    |


It would be a solid addition to `unstructured`. What do you think?

I'm the author of Scrapling, so you might think I'm biased towards my library, but that's not the case; you can check the numbers for yourself.
If you need any changes to make this happen, please let me know.
