@jimhester
Collaborator

Summary

  • Implements direct buffer access for numeric parsing, avoiding intermediate vroom::string allocation
  • Adds field_span struct to represent byte ranges in the source buffer
  • Updates vroom_dbl and vroom_int to use direct parsing when available
  • Falls back to string-based parsing for multi-file collections or connections

Changes

  • Add field_span struct in vroom.h representing byte boundaries for a field
  • Add get_field_span() and get_buffer() methods to index interface
  • Implement these methods in delimited_index, fixed_width_index, and index_collection
  • Add parse_value_direct() template in vroom_vec.h for direct buffer parsing
  • Update vroom_dbl and vroom_int collectors to use direct parsing when buffer is available
  • Store index pointer in vroom_vec_info to enable buffer access
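As a rough illustration of the direct-parsing idea described above, here is a minimal sketch. The member names of `field_span`, the signature of `parse_value_direct()`, and the use of `std::optional` for failures are assumptions made for this example and are not taken from the PR's actual code. It covers only the integer case via `std::from_chars`; a double path would need a floating-point parser with equivalent bounded-range semantics.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <optional>
#include <system_error>

// Hypothetical layout: byte offsets of one field within the source buffer.
struct field_span {
  std::size_t begin;  // offset of the first byte of the field
  std::size_t end;    // offset one past the last byte of the field
};

// Parse an integer straight from the buffer, without building an
// intermediate string. Returns std::nullopt if the field is empty,
// malformed, has trailing characters, or is out of range for T.
template <typename T>
std::optional<T> parse_value_direct(const char* buffer, field_span span) {
  assert(span.begin <= span.end);
  const char* first = buffer + span.begin;
  const char* last = buffer + span.end;
  T value{};
  auto [ptr, ec] = std::from_chars(first, last, value);
  if (ec != std::errc() || ptr != last) {
    return std::nullopt;
  }
  return value;
}
```

Because the parser receives only a pointer and two offsets, no per-field allocation is needed; the caller decides how a missing value should be represented (for example as `NA`).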

Test plan

  • All existing tests pass (1163 passed, 4 skipped)
  • Direct parsing path is enabled when single file is read from memory-mapped source
  • Fallback path works when buffer is not available (connections, multi-file)
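The choice between the two paths could look roughly like the sketch below. Everything here is hypothetical: `index_like`, `single_buffer_index`, `chunked_index`, `get_string()`, and `read_double()` are stand-ins invented for the example, and the assumption that `get_buffer()` returns `nullptr` when no contiguous buffer exists is a guess at the convention, not something the PR states. The direct path copies the field into a small stack buffer so that `std::strtod` sees a terminated string without a heap allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

struct field_span { std::size_t begin, end; };

// Hypothetical stand-in for the index interface.
struct index_like {
  virtual ~index_like() = default;
  virtual const char* get_buffer() const = 0;  // nullptr: no contiguous buffer
  virtual field_span get_field_span(std::size_t i) const = 0;
  virtual std::string get_string(std::size_t i) const = 0;
};

// Models a single file held in one contiguous buffer.
struct single_buffer_index : index_like {
  std::string data;
  std::vector<field_span> spans;
  single_buffer_index(std::string d, std::vector<field_span> s)
      : data(std::move(d)), spans(std::move(s)) {}
  const char* get_buffer() const override { return data.data(); }
  field_span get_field_span(std::size_t i) const override { return spans[i]; }
  std::string get_string(std::size_t i) const override {
    return data.substr(spans[i].begin, spans[i].end - spans[i].begin);
  }
};

// Models a connection or multi-file source with no single buffer.
struct chunked_index : index_like {
  std::vector<std::string> fields;
  explicit chunked_index(std::vector<std::string> f) : fields(std::move(f)) {}
  const char* get_buffer() const override { return nullptr; }
  field_span get_field_span(std::size_t) const override { return {0, 0}; }
  std::string get_string(std::size_t i) const override { return fields[i]; }
};

double read_double(const index_like& idx, std::size_t i) {
  if (const char* buf = idx.get_buffer()) {
    field_span s = idx.get_field_span(i);
    assert(s.begin <= s.end);
    char tmp[64];
    std::size_t n = s.end - s.begin;
    if (n < sizeof tmp) {  // direct path: bounded stack copy, no heap use
      std::memcpy(tmp, buf + s.begin, n);
      tmp[n] = '\0';
      return std::strtod(tmp, nullptr);
    }
  }
  // Fallback path: materialize the field as a string first.
  return std::strtod(idx.get_string(i).c_str(), nullptr);
}
```

Keeping the fallback in the same function means callers do not need to know which kind of source they are reading from.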

Closes #1

Implements direct buffer access for numeric parsing, avoiding the
intermediate vroom::string allocation. This provides more efficient
parsing for numeric types (double, int) when reading from memory-mapped
files.

Key changes:
- Add field_span struct to represent byte ranges in the buffer
- Add get_field_span() and get_buffer() methods to index interface
- Add parse_value_direct() template for direct buffer parsing
- Update vroom_dbl and vroom_int to use direct parsing when available
- Store index pointer in vroom_vec_info for buffer access

Falls back to string-based parsing when direct buffer access is not
available (e.g., for multi-file index collections or connections).

Closes #1

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@jennybc
Member

jennybc commented Jan 16, 2026

Closing an O.G. issue!

ahiiid

@jimhester
Collaborator Author

jimhester commented Jan 16, 2026

CC got a little overeager about which repo it was supposed to open the PR in, so this can probably be ignored for now. If this experiment works out, I will have some bigger PRs in the hopefully near future :)

@jimhester jimhester closed this Jan 16, 2026
@jennybc
Member

jennybc commented Jan 16, 2026

Yeah I noticed you working in the repo previously known as simdcsv, which is mentioned in the maintenance notes here.

I have returned to some vroom work after a period of relative neglect. So it might be good to chat soon, just so I know the shape of what might be coming in these bigger PRs.

In fact, I could probably just use a good consult re: vroom and readr design and history, now that I've grown more active on it again. For example, I'm seriously considering inlining vroom into readr (i.e. all of the compiled code, in particular) and putting vroom as a standalone package into some sort of frozen/bare minimum maintenance state (with an eventual plan of archival). I have many reasons for this, which I'd be happy to discuss and kick around with you.

Development

Successfully merging this pull request may close these issues.

Consider using threads for index parsing