Explore substituting external lxml parser for builtin html.parser

The largest non-dev dependencies currently in my local venv (above 1k, excluding the small dist-info directories) are:
* 13344 lxml
* 10660 pip
* 10648 cryptography
*  2780 pytz
*  2224 setuptools
*  1820 urwid
*  1588 chardet (via requests)

Of the top 3, `pip` and `cryptography` are essential for the application. On recently reviewing the `bs4` (856) documentation, though, I wondered whether we could use one of its other backends, particularly since our input is fairly well constrained: it is Zulip's rendered markdown. A potentially suitable backend is the HTML parser built into the standard library (`html.parser`), which costs nothing in terms of space and would remove the need for this largest dependency.

Another reason to consider removal: `lxml` is fast because it relies on a C library, but the downside is that installing or building it can require extra C-library build and runtime dependencies (e.g. libxml2-dev, libxslt1-dev) in some cases (where wheels are not available?), including:
* PyPy
* Alpine Dockerfiles

I've drafted a WIP implementation of this at `2021-02-27-lxml-to-html.parser` on my fork and it seems to behave OK and be "fast enough", though I'm interested to hear feedback!

One option we could consider is having lxml be an optional extra, albeit with added runtime complexity.
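
As a rough sketch of what the optional-extra approach might look like (the helper names `pick_bs4_backend` and `make_soup` are hypothetical and not taken from the WIP branch), the backend name passed to `BeautifulSoup` could be chosen at runtime depending on whether `lxml` is importable:

```python
import importlib.util


def pick_bs4_backend(prefer_lxml: bool = True) -> str:
    """Return the parser name to hand to BeautifulSoup.

    Hypothetical helper: uses "lxml" only if it is installed and
    preferred, otherwise falls back to the stdlib "html.parser".
    """
    if prefer_lxml and importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"


def make_soup(markup: str):
    """Parse rendered-markdown HTML with whichever backend is available."""
    # Imported lazily so the backend choice above can be used
    # (and tested) without bs4 itself being installed.
    from bs4 import BeautifulSoup

    return BeautifulSoup(markup, pick_bs4_backend())
```

This keeps the runtime complexity down to a single decision point, though the two backends can build slightly different trees for malformed input, so any code relying on `lxml`-specific behavior would still need checking.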

This was briefly discussed on chat.zulip.org in #**zulip-terminal>lxml vs html.parser**.

I've tagged this with `area: optimization`, which here is in terms of space/simplicity; I've not profiled how this impacts speed.

Explore substituting external lxml parser for builtin html.parser #1036

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Explore substituting external lxml parser for builtin html.parser #1036

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions