enhancement: optimize cells to html #444
Merged
8% (0.08x) speedup for `cells_to_html` in `unstructured_inference/models/tables.py`

Runtime: 14.0 milliseconds → 13.0 milliseconds (best of 193 runs)

Explanation and details
The optimized code achieves a 7% speedup through three optimizations in the `fill_cells` function:

1. Replaced NumPy with native Python data structures:
   - Removed `np.zeros()` for creating a boolean grid and `np.where()` for finding empty cells.
   - Used a `set()` to track filled positions, with `filled.add((row, col))` instead of `filled[row, col] = True`.
2. Optimized header row detection:
   - Replaced the set comprehension `{row for cell in cells if cell["column header"] for row in cell["row_nums"]}` with an explicit loop and `set.update()`.
3. Direct iteration instead of NumPy indexing:
   - Replaced `zip(not_filled_idx[0], not_filled_idx[1])` with nested `for row in range()` loops.

The optimizations are particularly effective for small to medium tables (in the test results, single cells see a 40-56% speedup). For large dense tables (20x20), performance is roughly equivalent, showing the optimizations don't hurt scalability while providing significant gains for typical table sizes.
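The set-based approach described above can be sketched roughly as follows. This is an illustrative reconstruction, not the code merged in this PR: the function name `fill_missing_cells` and the `"column_nums"` and `"cell text"` keys are assumptions, while `"row_nums"` and `"column header"` come from the description. It also assumes every cell lists at least one row and one column.

```python
from typing import Any, Dict, List


def fill_missing_cells(cells: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return the input cells plus empty placeholders for uncovered grid positions.

    Hypothetical sketch: key names other than "row_nums" and "column header"
    are assumed for illustration.
    """
    if not cells:
        return []

    filled = set()       # (row, col) positions covered by some cell
    header_rows = set()  # rows touched by at least one header cell
    max_row = max_col = -1

    for cell in cells:
        rows, cols = cell["row_nums"], cell["column_nums"]
        max_row = max(max_row, max(rows))
        max_col = max(max_col, max(cols))
        if cell["column header"]:
            # Explicit loop plus set.update() instead of a set comprehension.
            header_rows.update(rows)
        for row in rows:
            for col in cols:
                filled.add((row, col))

    result = list(cells)
    # Nested range() loops replace np.where() on a boolean grid.
    for row in range(max_row + 1):
        for col in range(max_col + 1):
            if (row, col) not in filled:
                result.append(
                    {
                        "row_nums": [row],
                        "column_nums": [col],
                        "cell text": "",
                        "column header": row in header_rows,
                    }
                )
    return result


example = [
    {"row_nums": [0], "column_nums": [0], "cell text": "Name", "column header": True},
    {"row_nums": [1], "column_nums": [1], "cell text": "42", "column header": False},
]
completed = fill_missing_cells(example)
```

For this 2x2 example, two placeholder cells are appended: one at row 0, column 1 (marked as a header because row 0 contains a header cell) and one at row 1, column 0. Set membership tests avoid allocating an array, which is plausibly where the gain for small tables comes from.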
Correctness verification report:

Existing Unit Tests and Runtime
`models/test_tables.py::test_cells_to_html`

Generated Regression Tests and Runtime

Replay Tests and Runtime
`test_pytest_test_unstructured_inference__replay_test_0.py::test_unstructured_inference_models_tables_cells_to_html`

To edit these changes, `git checkout codeflash/optimize-cells_to_html-metc0l2u` and push.

Note
Optimize table HTML generation by replacing NumPy grid logic with native sets/loops and minor sorting/header handling tweaks; update version and changelog.

- In `unstructured_inference/models/tables.py`:
  - `fill_cells`: Replace NumPy grid/where with native `set` tracking, explicit header row accumulation, and nested loops to append missing cells.
  - `cells_to_html`: Precompute `cells_filled` and `cells_sorted`; adjust header detection/`thead` creation; iterate over sorted cells for row building.
- Update `__version__` to `1.0.8-dev1` in `unstructured_inference/__version__.py`.
- Update `CHANGELOG.md` with enhancement note for optimized `cells_to_html`.

Written by Cursor Bugbot for commit 640b75c. This will update automatically on new commits.