Proposal: Limited mutability API #311
Description
Based on the discussion in #295. Both Sigil and GitHub need to make small local modifications to the parse tree before reserializing it. This is currently very difficult because of the number of pointers that must be kept in sync, the possibility of introducing memory leaks by failing to update them, and the need to pass a GumboParser around for the allocator.
Concrete proposal
- Remove the ability to set custom allocators on GumboOptions. Use the system malloc for all memory.
- Expose create_node, destroy_node, get_attribute, set_attribute, set_attribute_value, and the vector modification functions (add, remove, remove_at, insert_at) to the public API.
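To make the second bullet concrete, here is a minimal sketch of what malloc-backed vector modification functions could look like. Everything in it is hypothetical: `SketchVector` and the `sketch_vector_*` names are stand-ins invented for illustration, not Gumbo's actual types or signatures. The point is only that once the system allocator is used everywhere, none of these functions needs a GumboParser argument.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a growable pointer vector; not Gumbo's type. */
typedef struct {
  void** data;
  unsigned int length;
  unsigned int capacity;
} SketchVector;

void sketch_vector_init(SketchVector* v) {
  v->data = NULL;
  v->length = 0;
  v->capacity = 0;
}

/* Ensures room for one more element. Returns 0 if allocation fails. */
int sketch_vector_reserve(SketchVector* v) {
  if (v->length < v->capacity) return 1;
  unsigned int new_capacity = v->capacity ? v->capacity * 2 : 4;
  void** grown = realloc(v->data, new_capacity * sizeof(void*));
  if (!grown) return 0;
  v->data = grown;
  v->capacity = new_capacity;
  return 1;
}

/* Inserts item before position index, shifting later elements right. */
int sketch_vector_insert_at(SketchVector* v, void* item, unsigned int index) {
  assert(index <= v->length);
  if (!sketch_vector_reserve(v)) return 0;
  memmove(&v->data[index + 1], &v->data[index],
          (v->length - index) * sizeof(void*));
  v->data[index] = item;
  v->length++;
  return 1;
}

/* Appends item at the end. */
int sketch_vector_add(SketchVector* v, void* item) {
  return sketch_vector_insert_at(v, item, v->length);
}

/* Removes and returns the element at index, shifting later elements left. */
void* sketch_vector_remove_at(SketchVector* v, unsigned int index) {
  assert(index < v->length);
  void* removed = v->data[index];
  memmove(&v->data[index], &v->data[index + 1],
          (v->length - index - 1) * sizeof(void*));
  v->length--;
  return removed;
}

/* Removes the first occurrence of item. Returns 1 if found, 0 otherwise. */
int sketch_vector_remove(SketchVector* v, const void* item) {
  for (unsigned int i = 0; i < v->length; ++i) {
    if (v->data[i] == item) {
      sketch_vector_remove_at(v, i);
      return 1;
    }
  }
  return 0;
}

/* Frees the backing array only; the caller still owns the elements. */
void sketch_vector_destroy(SketchVector* v) {
  free(v->data);
  sketch_vector_init(v);
}
```

Note that the vector does not own its elements in this sketch, so removing a node from a child list and destroying that node remain separate steps.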
Current workaround
We currently recommend that people who want mutation wrap the whole parse tree in an API of their choice, mutate that, and then serialize it out. Gumbo's API is simple enough that a tree-walker can be written in a page or so of code, and tree traversal time is negligible compared to parse time (~1%). Several outside bindings already have DOM APIs, e.g. lua-gumbo, gumbo-libxml, and the html5lib and BeautifulSoup adaptors that come with the main distribution.
Benefits
If this is useful to you, you'll probably know it immediately. :-) But enumerating them:
- No need to use & learn an outside library just to do mutation.
- Mutation can work in terms of the GumboNodes you already have; if you're doing querying or traversal already, there's no need to adapt that code to work on a different DOM representation.
- Well-suited to small local mutations, where it feels like overkill to have to reserialize the whole parse tree just to change one node.
- Possibly marginally faster, since there's no need to traverse & allocate for a new parse tree, although empirically this effect has been negligible.
- Simplified API in some cases, since some functions that previously needed a GumboParser/GumboOptions argument no longer do (notably gumbo_destroy_output).
There is a partial branch demonstrating some of these changes at vmg/development.
Drawbacks
- Incompatible with the existing allocator machinery.
- Incompatible with the arena change in Arena #309.
- Backwards-incompatible; at a minimum, this change results in signature changes for GumboOptions and gumbo_destroy_output, and exposes a half dozen or so new functions.
- Possibly more API surface for third-party bindings to wrap. External bindings are under no obligation to offer the full feature set of Gumbo, but if this goes in, there will likely be pressure from users to expand their feature sets.
- More API surface for new users of the library to learn.
- Many of the existing helpers that would be exposed by this proposal are not designed for efficiency or for this usage. gumbo_get_attribute, for example, takes linear time, and gumbo_create_node wouldn't know where to insert the node in the list of next/prev pointers.
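To illustrate the efficiency concern in the last bullet, here is a rough sketch of a setter layered on a linear-scan lookup. The types and the `sketch_*` names are hypothetical stand-ins, not Gumbo's definitions, and the sketch assumes attribute names are already normalized, so it compares them with plain `strcmp`. Every set pays for a full scan of the attribute list, which is fine for a handful of attributes but is not a design tuned for heavy mutation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for an attribute and an attribute list. */
typedef struct {
  char* name;
  char* value;
} SketchAttribute;

typedef struct {
  SketchAttribute** items;
  size_t length;
  size_t capacity;
} SketchAttributeList;

/* Portable replacement for strdup, which is not in older C standards. */
static char* sketch_strdup(const char* s) {
  size_t n = strlen(s) + 1;
  char* copy = malloc(n);
  if (copy) memcpy(copy, s, n);
  return copy;
}

/* Linear scan: O(n) in the number of attributes. Returns NULL if absent. */
SketchAttribute* sketch_get_attribute(const SketchAttributeList* list,
                                      const char* name) {
  assert(list != NULL && name != NULL);
  for (size_t i = 0; i < list->length; ++i) {
    if (strcmp(list->items[i]->name, name) == 0) return list->items[i];
  }
  return NULL;
}

/* Replaces the value if name exists, otherwise appends a new attribute.
 * Returns 1 on success and 0 on allocation failure. */
int sketch_set_attribute(SketchAttributeList* list, const char* name,
                         const char* value) {
  char* value_copy = sketch_strdup(value);
  if (!value_copy) return 0;

  SketchAttribute* attr = sketch_get_attribute(list, name);
  if (attr) {
    free(attr->value); /* forgetting this free is the leak the issue warns of */
    attr->value = value_copy;
    return 1;
  }

  if (list->length == list->capacity) {
    size_t new_capacity = list->capacity ? list->capacity * 2 : 4;
    SketchAttribute** grown =
        realloc(list->items, new_capacity * sizeof(*grown));
    if (!grown) {
      free(value_copy);
      return 0;
    }
    list->items = grown;
    list->capacity = new_capacity;
  }

  attr = malloc(sizeof(*attr));
  char* name_copy = sketch_strdup(name);
  if (!attr || !name_copy) {
    free(attr);
    free(name_copy);
    free(value_copy);
    return 0;
  }
  attr->name = name_copy;
  attr->value = value_copy;
  list->items[list->length++] = attr;
  return 1;
}

/* Frees every attribute, its strings, and the backing array. */
void sketch_destroy_attributes(SketchAttributeList* list) {
  for (size_t i = 0; i < list->length; ++i) {
    free(list->items[i]->name);
    free(list->items[i]->value);
    free(list->items[i]);
  }
  free(list->items);
  list->items = NULL;
  list->length = 0;
  list->capacity = 0;
}
```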
Compromise solutions
- Replace the custom allocators in GumboOptions with global gumbo_set_allocator/gumbo_set_deallocator functions. This restores custom allocators, but still prevents different instances of gumbo_parse (e.g. in a multithreaded program) from running with separate heaps. A per-parse arena, for example, would then require locking that destroys many of the speed benefits of an arena.
- Have functions take an optional first parameter, perhaps a GumboOptions with the allocator/deallocator functions or a GumboArena, and use that if provided. If it is NULL, the function falls back to the system malloc. This gets all the functionality and doesn't compromise any design options, but it still results in an ugly API, and it's very easy to make a mistake and forget which allocator you used.
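The second compromise can be sketched roughly as follows. `SketchAllocator`, `sketch_alloc`, `sketch_free`, and the counting heap are hypothetical names used only for illustration; the shape of the callbacks (a userdata pointer plus a size or pointer) is an assumption modeled on the allocator/deallocator pair mentioned above. A NULL allocator means "use the system malloc".

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bundle of allocator callbacks; NULL means system malloc. */
typedef struct {
  void* (*allocate)(void* userdata, size_t size);
  void (*deallocate)(void* userdata, void* ptr);
  void* userdata;
} SketchAllocator;

/* Allocates through the custom allocator if given, else through malloc. */
void* sketch_alloc(const SketchAllocator* allocator, size_t size) {
  return allocator ? allocator->allocate(allocator->userdata, size)
                   : malloc(size);
}

/* Must be called with the same allocator that produced ptr. */
void sketch_free(const SketchAllocator* allocator, void* ptr) {
  if (allocator) {
    allocator->deallocate(allocator->userdata, ptr);
  } else {
    free(ptr);
  }
}

/* Example custom allocator: wraps malloc/free and counts live blocks. */
typedef struct {
  int live_blocks;
} CountingHeap;

void* counting_allocate(void* userdata, size_t size) {
  CountingHeap* heap = userdata;
  void* ptr = malloc(size);
  if (ptr) heap->live_blocks++;
  return ptr;
}

void counting_deallocate(void* userdata, void* ptr) {
  CountingHeap* heap = userdata;
  if (ptr) {
    assert(heap->live_blocks > 0);
    heap->live_blocks--;
  }
  free(ptr);
}
```

The hazard described in the bullet shows up directly in this shape: nothing stops a caller from allocating a block with a custom allocator and later passing NULL to `sketch_free`. With the counting heap that only corrupts the count, but with a real arena it would hand an arena pointer to the system `free`.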
Comment with a +1 or -1, or any additional comments or considerations.