Work on asdf-format/libasdf-gwcs#6 turned up quite a few bugs in `asdf_write_to_mem`, which I hadn't thought about in a while. It's used quite a bit in the test suite, but only in trivial cases where it happens to just work.
But there are actually a few logic errors and unhandled TODOs in the code for it that come up as soon as you try to write something less trivial:
- The existing logic writes the YAML tree first -- it doesn't presuppose the size in bytes of the tree, and the memory stream logic in `stream.c` handles `realloc` calls while writing -- this works OK. However, `asdf_write_to_mem` has an optimization where, after it has output the YAML and before it outputs the binary blocks, it checks how much space is needed for the blocks and, if necessary, does a single `realloc` call to reserve space for them. This makes basic sense, but the implementation has two serious problems:
  - After the `realloc` it calls `asdf_emitter_set_output_malloc` with the new buffer returned by `realloc`. This, however, has the effect of resetting the emitter state. All of the `asdf_emitter_set_output_*` functions have this bug. The intention was that the emitter state should not move to `ASDF_EMITTER_STATE_INITIAL` until an output stream has been set on the emitter, but there was also the intention that it should be possible to change the stream at any point between emitter states, and this fails on that intent. The `asdf_emitter_set_output_*` functions are also kinda copypasta of each other -- they are all basically the same, just wrapping different `asdf_stream` constructors. That's fine for convenience, I guess, but I think we could simplify this and fix the bug. Need tests for this.
  - The other issue stems from this comment:

    > In order to reasonably estimate the size to reserve for each
    > block we also need to know its compressed size if compression is
    > enabled on that block. We haven't implemented compressed writing
    > yet so fix that later. Probably we'll need to compress each block
    > individually first in order to do that; lots of refactoring
    > involved...
    At the time I wrote that it was true, but compressed writing has since been implemented and I forgot to update the comment / revisit this issue. I don't think that much refactoring is really needed; in fact, thinking harder about it, I don't think it's that complicated. The compressed writing already writes the compressed data to a temporary buffer that's copied/written to the stream. That implementation is not ideal either, because sometimes even the compressed data can take up a lot of space in memory. It was done this way in part as a quick solution, and in part because the Python asdf library already does the same thing. For now we don't have to fix that.

    The fact remains that writing compressed data to a memory stream will generally necessitate a `realloc` per compressed block written. But looking at the code, it should just work as-is. It only means we can't use the aforementioned pre-`realloc` optimization on any compressed blocks (we could still use it for non-compressed blocks). This could still be optimized heuristically later using known typical compression ratios.
- There is also a TODO:

  > // TODO: We should also take into account the block index here; ideally have
  > // a separate routine for just pre-rendering the block index to a buffer

  This is probably less critical, but would also be worth doing to avoid an otherwise entirely avoidable `realloc`.
- I also forgot at the time of writing this issue that it does already reserve one `_SC_PAGESIZE` for the block index. There are some pathological cases where that might not be enough, but I'm not going to optimize for that right now.
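To make the `asdf_emitter_set_output_*` state bug concrete, here's a minimal sketch of the intended behavior. The `emitter`/`emitter_state` types and the `emitter_set_stream` helper below are hypothetical stand-ins, not the real libasdf API: the idea is that all the `set_output_*` wrappers funnel through one internal helper, and only the *first* stream assignment moves the state out of its no-output value, so later swaps (like the one after the `realloc`) leave the emitter state alone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified mirror of the emitter state machine; the real
 * libasdf names and fields will differ. */
typedef enum {
    EMITTER_STATE_NO_OUTPUT,  /* no output stream has been set yet */
    EMITTER_STATE_INITIAL,    /* stream set, nothing emitted */
    EMITTER_STATE_YAML_DONE,  /* YAML tree emitted, blocks pending */
} emitter_state;

typedef struct {
    void *stream;  /* stand-in for an asdf_stream pointer */
    emitter_state state;
} emitter;

/* Single internal helper shared by all the set_output_* wrappers: swap the
 * stream, and only bump the state out of NO_OUTPUT on the first assignment.
 * Later swaps leave the state untouched, which is what lets the caller
 * change streams between emitter states without resetting the emitter. */
static void emitter_set_stream(emitter *em, void *stream) {
    em->stream = stream;
    if (em->state == EMITTER_STATE_NO_OUTPUT)
        em->state = EMITTER_STATE_INITIAL;
}
```

With this shape, the per-stream-type `asdf_emitter_set_output_*` functions reduce to thin wrappers that construct their respective `asdf_stream` and call the one helper, removing the copypasta along with the bug.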
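The compressed-block point suggests the pre-`realloc` reservation should simply skip compressed blocks, since their on-disk size is unknown until the codec has run, and let those fall back to realloc-as-you-go in the memory stream. A rough sketch of that size calculation, with a made-up `block` struct rather than the real libasdf block type:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block descriptor; field names are illustrative only. */
typedef struct {
    size_t data_size;  /* uncompressed payload size in bytes */
    int compressed;    /* nonzero if a compression codec is set */
} block;

/* Sum only the sizes knowable up front: uncompressed blocks.  Compressed
 * blocks are skipped, so they will trigger their own reallocs when written,
 * which the memory stream already handles. */
static size_t reserve_size_for_blocks(const block *blocks, size_t n) {
    size_t total = 0;
    for (size_t i = 0; i < n; i++) {
        if (!blocks[i].compressed)
            total += blocks[i].data_size;  /* + per-block header in the real code */
    }
    return total;
}
```

The heuristic optimization mentioned above would then just be a matter of adding an estimated `data_size * typical_ratio` term for compressed blocks instead of skipping them.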
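For the block-index TODO, pre-rendering to a buffer could look roughly like the sketch below. The function name and signature are hypothetical; the output format follows the ASDF standard's block index (a `#ASDF BLOCK INDEX` header, then a YAML document listing byte offsets). Once the exact rendered size is known, it can be folded into the single reservation `realloc` instead of relying on the one-page guess.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Render the block index to a malloc'd buffer so its exact size is known
 * before reserving space in the memory stream.  Returns the buffer (caller
 * frees) and stores the rendered length in *out_len; NULL on failure. */
static char *render_block_index(const size_t *offsets, size_t n, size_t *out_len) {
    /* Generous upper bound: fixed header/footer plus one line per offset
     * (at most 20 digits for a 64-bit value, plus "- " and "\n"). */
    size_t cap = 64 + 24 * n;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;
    size_t len = (size_t)snprintf(buf, cap, "#ASDF BLOCK INDEX\n%%YAML 1.1\n---\n");
    for (size_t i = 0; i < n; i++)
        len += (size_t)snprintf(buf + len, cap - len, "- %zu\n", offsets[i]);
    len += (size_t)snprintf(buf + len, cap - len, "...\n");
    *out_len = len;
    return buf;
}
```

This also sidesteps the pathological `_SC_PAGESIZE` cases, since the reservation is computed from the actual index rather than a fixed page.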