Optimize memory management in streaming parsers and encoding #2
Merged
Streaming parsers (StreamingParser, GeneralStreamingParser, GeneralStreamingParserNewlines) had three memory leak patterns:

1. compact_buffer() used Vec::drain(), which preserves peak allocation capacity even after removing most data. Added shrink_excess() to reclaim memory when capacity exceeds 4x length.
2. take_rows() used drain().collect(), leaving complete_rows at peak capacity. Now shrinks after draining.
3. finalize() left the internal buffer allocated after extracting the final rows. Now releases buffer memory since parsing is complete.

Also:
- Reduce rayon thread pool stack from 8 MiB to 2 MiB per thread (saves ~48 MiB of virtual memory across 8 persistent threads)
- Remove unnecessary field.clone() in the parallel encoder's encoding path
- Add ExactSizeIterator impls for RowIter, RowFieldIter, FieldIter

All 95 tests pass.

https://claude.ai/code/session_01QdJE1Gks1uipLWVupAwrbe
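The drain-then-shrink pattern described above can be sketched as follows. The helper's exact signature in streaming.rs is not shown in this PR page, so this is an illustrative reconstruction that also folds in the later review fix (a byte-based 1 KiB floor computed via size_of::<T>()):

```rust
use std::mem::size_of;

// Sketch of a shrink_excess-style helper (assumed signature): give back
// capacity once it exceeds 4x the live length, but never shrink below a
// 1 KiB floor measured in bytes, so small buffers are left alone.
fn shrink_excess<T>(v: &mut Vec<T>) {
    // Number of elements that fit in 1 KiB (at least 1 for very large T).
    let floor = (1024 / size_of::<T>().max(1)).max(1);
    if v.capacity() > floor && v.capacity() > v.len().saturating_mul(4) {
        // shrink_to keeps at least `len`; clamp to the floor to avoid
        // thrashing when the buffer refills on the next feed().
        v.shrink_to(v.len().max(floor));
    }
}

fn main() {
    // Simulate the take_rows() pattern: drain everything, then shrink.
    let mut complete_rows: Vec<u64> = (0..100_000).collect();
    let taken: Vec<u64> = complete_rows.drain(..).collect();

    // drain() alone leaves the peak allocation in place...
    assert!(complete_rows.capacity() >= 100_000);

    // ...so the parser shrinks afterwards to return the memory.
    shrink_excess(&mut complete_rows);
    assert!(complete_rows.capacity() < 100_000);
    assert_eq!(taken.len(), 100_000);
    println!("capacity after shrink: {}", complete_rows.capacity());
}
```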
- Add thiserror for BufferOverflow: implements the Display and Error traits required for idiomatic Rust library error types
- Add #[must_use] to key types: StructuralIndex, RowEnd, Newlines, BufferOverflow, StreamingParser, GeneralStreamingParser, GeneralStreamingParserNewlines, GeneralFieldBound, StreamingParserResource
- Add #[must_use] to getter methods: available_rows(), has_partial(), buffer_size(), row_count(), max_pattern_len()
- Fix import ordering in general.rs to pass cargo fmt
- All quality gates pass: cargo fmt, clippy -D warnings, 95 tests

https://claude.ai/code/session_01QdJE1Gks1uipLWVupAwrbe
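What the thiserror derive buys is the Display and Error impls; the equivalent is written out by hand below so the sketch compiles without the crate. The needed/capacity fields are purely illustrative assumptions, not the library's actual BufferOverflow layout:

```rust
use std::error::Error;
use std::fmt;

// Hand-written equivalent of `#[derive(thiserror::Error)]` plus an
// `#[error("...")]` message attribute. Field names are illustrative only.
#[derive(Debug)]
#[must_use]
pub struct BufferOverflow {
    pub needed: usize,
    pub capacity: usize,
}

impl fmt::Display for BufferOverflow {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "buffer overflow: needed {} bytes but capacity is {}",
            self.needed, self.capacity
        )
    }
}

// An empty impl suffices: Error only requires Debug + Display.
impl Error for BufferOverflow {}

fn main() {
    let e = BufferOverflow { needed: 4096, capacity: 1024 };
    println!("{e}");
    // Usable anywhere a trait-object error is expected.
    let boxed: Box<dyn Error> = Box::new(e);
    assert!(boxed.to_string().contains("4096"));
}
```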
Review fixes for PR #2:
- Fix RowIter::next(): increment row_idx in the trailing-row branch so ExactSizeIterator::len() returns 0 after exhaustion (was returning 1)
- Fix RowFieldIter::next(): same trailing-row row_idx fix
- Fix FieldIter::size_hint(): check the done flag so len() returns 0 after the last field is consumed (was returning 1)
- Fix shrink_excess(): use a byte-based 1 KiB floor via size_of::<T>() instead of an element count of 1024 (the doc said bytes, the code used elements)
- Add ExactSizeIterator tests for RowIter, RowFieldIter, FieldIter covering trailing rows and multi-field exhaustion
- Add shrink_excess tests: threshold, floor, ratio, large-element types
- Add finalize/reset memory release tests
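The len()-after-exhaustion bug class fixed above looks like this in miniature. This is a simplified stand-in for RowIter, not the real struct from simd_index.rs:

```rust
// Simplified stand-in for RowIter over (start, end) byte ranges.
struct Rows<'a> {
    bounds: &'a [(usize, usize)],
    idx: usize,
}

impl<'a> Iterator for Rows<'a> {
    type Item = (usize, usize);

    fn next(&mut self) -> Option<(usize, usize)> {
        let item = self.bounds.get(self.idx).copied();
        if item.is_some() {
            // The review fix in essence: every branch that yields an item
            // must also advance idx, or size_hint() sticks at 1 forever.
            self.idx += 1;
        }
        item
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        let remaining = self.bounds.len() - self.idx;
        (remaining, Some(remaining))
    }
}

// ExactSizeIterator just promises size_hint() is exact; len() is provided.
impl<'a> ExactSizeIterator for Rows<'a> {}

fn main() {
    let bounds = [(0, 4), (5, 9), (10, 12)];
    let mut rows = Rows { bounds: &bounds, idx: 0 };
    assert_eq!(rows.len(), 3);
    rows.next();
    assert_eq!(rows.len(), 2);
    while rows.next().is_some() {}
    assert_eq!(rows.len(), 0); // must reach 0, not get stuck at 1
    println!("ok");
}
```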
Summary
This PR improves memory efficiency across the CSV parsing and encoding pipeline by implementing proactive memory reclamation, reducing thread stack overhead, and optimizing buffer allocation patterns.
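For context on the thread-stack reduction: rayon's ThreadPoolBuilder::stack_size() takes the per-thread stack in bytes, and this PR lowers the pool's configured value from 8 MiB to 2 MiB. The same tradeoff can be demonstrated with std::thread, used here so the sketch carries no crate dependency:

```rust
use std::thread;

// 2 MiB per worker instead of the 8 MiB the pool previously requested.
// Safe when the work has shallow call stacks, as CSV field extraction does.
const WORKER_STACK: usize = 2 * 1024 * 1024;

fn main() {
    let handle = thread::Builder::new()
        .stack_size(WORKER_STACK)
        .spawn(|| (0..1_000u64).sum::<u64>())
        .expect("failed to spawn worker");

    let total = handle.join().expect("worker panicked");
    assert_eq!(total, 499_500);
    println!("total = {total}");
}
```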
Key Changes
Memory Management Improvements
- Added a shrink_excess() helper function in streaming.rs that reclaims excess vector capacity when it exceeds 4x the current length (with a 1 KiB floor). This prevents long-lived streaming parsers from monotonically growing to peak memory usage and never returning it to the OS.
- Applied shrink_excess() consistently across all three streaming parser implementations (StreamingParser, GeneralStreamingParser, GeneralStreamingParserNewlines) in both the feed() and take_rows() methods to prevent unbounded memory growth.
- Updated the finish() method in all streaming parsers to explicitly release buffers using Vec::new() instead of relying on Vec::clear(), which preserves allocations. Also properly reset partial_row_start and scan_pos.
- Updated the reset() method in StreamingParser to use Vec::new() instead of .clear() for actual memory release.

Error Handling
- Converted the BufferOverflow error type to the thiserror::Error derive macro with a proper error message, making it idiomatic for library code.
- Added #[must_use] attributes to BufferOverflow and all streaming parser structs to encourage proper error handling and prevent accidental ignoring of parser instances.

Performance Optimizations
- Reduced the rayon thread pool stack size from 8 MiB to 2 MiB per thread in parallel.rs. CSV field extraction has shallow call stacks, and the larger size wastes ~48 MiB of virtual memory across 8 persistent threads.
- Refactored encode_string_parallel() in lib.rs to eliminate unnecessary intermediate vector allocations by writing directly to the output buffer based on encoding/quoting requirements, reducing memory pressure during parallel encoding.

Iterator Improvements
- Implemented ExactSizeIterator for RowIter, RowFieldIter, and FieldIter in simd_index.rs to provide size hints and enable more efficient iteration patterns.
- Added #[must_use] attributes to query methods (available_rows(), has_partial(), buffer_size(), row_count(), max_pattern_len()) and data structures to prevent accidental ignoring of important information.

Dependencies
- Added thiserror = "2" for idiomatic error type derivation.

Implementation Details
The memory optimization strategy focuses on three areas:
- The shrink_excess() function uses a conservative threshold (4x) to avoid thrashing on small buffers while still reclaiming significant excess capacity.
- Using Vec::new() instead of .clear() ensures memory is actually returned to the OS, which is critical for long-running streaming parsers.

https://claude.ai/code/session_01QdJE1Gks1uipLWVupAwrbe
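The clear()-versus-reassignment distinction is easy to verify directly:

```rust
fn main() {
    let mut buffer: Vec<u8> = Vec::with_capacity(1 << 20); // 1 MiB peak
    buffer.extend_from_slice(b"partial row data");

    // clear() drops the elements but keeps the full allocation alive.
    buffer.clear();
    assert_eq!(buffer.len(), 0);
    assert!(buffer.capacity() >= 1 << 20);

    // Reassigning to Vec::new() frees the old allocation immediately;
    // this is the behavior finalize()/reset() want once parsing is done.
    buffer = Vec::new();
    assert_eq!(buffer.capacity(), 0);
    println!("capacity after reset: {}", buffer.capacity());
}
```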