Skip to content

Explore substituting external lxml parser for builtin html.parser #1036

Open
@neiljp

Description

@neiljp

The largest current non-dev dependencies (above 1k) in my local venv are: (excluding small dist-info directories)

  • 13344 lxml
  • 10660 pip
  • 10648 cryptography
  • 2780 pytz
  • 2224 setuptools
  • 1820 urwid
  • 1588 chardet (via requests)

Of the top 3, pip and cryptography are essential for the application, but on recently reviewing the bs4 (856) documentation I wondered if we could use the other backends, particularly since our input is fairly well constrained via the zulip rendered markdown. A potentially suitable backend would be the built-in html parser in the standard library, which comes for free in terms of space and would remove the need for this largest dependency.

Another reason to consider removal is that lxml is fast by virtue of its reliance on a C library; the downsides are this requires extra C-library build and runtime dependencies (eg. libxml2-dev libxslt1-dev) for installation/build in some cases (where wheels are not available?), including:

  • PyPy
  • Alpine dockfiles

I've drafted a WIP implementation of this at 2021-02-27-lxml-to-html.parser on my fork and it seems to behave OK and be "fast enough", though I'm interested to hear feedback!

One option we could consider is having lxml be an optional extra, albeit with added runtime complexity.

This was briefly discussed on chat.zulip.org in #zulip-terminal>lxml vs html.parser.

I've tagged this with area: optimization, which here is in terms of space/simplicity; I've not profiled how this impacts speed.

Activity

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions