fix: ensure csv.parse generates a single shared array per file+options #5503
Conversation
joanlopez
left a comment
Great job @oleiade!
Note that none of the comments I left are blocking, but I'd really appreciate it if we could "merge" the two "duplicated" methods 🙏🏻
I've moved this to the 1.6.0 milestone, as we're in the release process.
I've made a pass on the underlying SharedArray code as per your concerns @codebien and found that it's already pretty well tested. However, I've added a bunch of further tests to the data package, in order to limit the risk of breaking anything with our internal changes in this PR. Hope that's satisfying 🙇🏻
codebien
left a comment
Hey @oleiade,
thanks for letting me know.
If I understand the changes correctly, the ones in the data module are orthogonal to the csv part and not really required for the fix of the specific related issue.
The data module is very sensitive, so I'd appreciate it if you could split the pull requests by their domains, to keep things clean and easy to review in the future without too much effort. I mean having one pull request for the csv fix and one for the data module.
Happy to split the PR. However, most of the changes in the data module are dependencies of the CSV fix. The standalone parts (mostly increased test coverage in data_test.go and share_test.go to safeguard SharedArray) are isolated in my last commit. An alternative to the current approach I've been tinkering with might be to extract the 'shared memory' logic into a dedicated package that provides the shared data structures. Plus, we could, if we so choose, migrate the data module to the new package down the road too. I think it could make sense; what do you folks think @codebien @joanlopez, any other alternatives?
Replace map-based JSON marshaling with anonymous structs in buildSharedArrayName to guarantee deterministic field ordering across all Go versions. While encoding/json currently sorts map keys alphabetically, this behavior is an implementation detail not guaranteed by the Go specification. Using structs with explicit field ordering ensures consistent SHA256 hashes for identical file+options combinations, preventing potential SharedArray duplication across VUs in future Go versions.
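The struct-based approach described above can be sketched as follows. This is a minimal illustration, not the exact k6 code: the helper name `buildSharedArrayName`, the JSON field tags, and the option set are assumptions based on the options listed in this PR (delimiter, skipFirstLine, fromLine, toLine, asObjects).

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// buildSharedArrayName derives a deterministic SharedArray name from the
// file path and parser options. An anonymous struct guarantees the field
// order in the JSON output; a map's key ordering is only an implementation
// detail of encoding/json and not guaranteed by the language spec.
func buildSharedArrayName(
	filePath, delimiter string,
	skipFirstLine bool,
	fromLine, toLine int64,
	asObjects bool,
) (string, error) {
	payload := struct {
		FilePath      string `json:"filePath"`
		Delimiter     string `json:"delimiter"`
		SkipFirstLine bool   `json:"skipFirstLine"`
		FromLine      int64  `json:"fromLine"`
		ToLine        int64  `json:"toLine"`
		AsObjects     bool   `json:"asObjects"`
	}{filePath, delimiter, skipFirstLine, fromLine, toLine, asObjects}

	b, err := json.Marshal(payload)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return "csv-" + hex.EncodeToString(sum[:]), nil
}

func main() {
	a, _ := buildSharedArrayName("data.csv", ",", true, 0, -1, false)
	b, _ := buildSharedArrayName("data.csv", ",", true, 0, -1, false)
	fmt.Println(a == b) // identical file+options yield the identical name
}
```

With this scheme, two VUs parsing the same file with the same options compute the same name and therefore resolve to the same SharedArray, while any change to an option changes the hash.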
Force-pushed from a372774 to e7e434b
Hey folks 👋🏻 Thanks a lot for the constructive feedback. I went back to a less intrusive set of changes, leveraging the mutex as opposed to introducing a new package. I took the liberty to rewrite the history for clarity too, as I think it serves the review in this specific case. Also removed the commit adding test coverage to the data module.
The csv.parse function was creating a new SharedArray with a random name
(based on time.Now().Nanosecond()) for each VU, causing memory usage to
scale linearly with VU count instead of sharing data across VUs.
For a 40MB CSV file with 100 VUs, this resulted in ~6.5GB memory usage
instead of the expected ~200MB.
This commit fixes the issue by:
1. Replacing random SharedArray name generation with deterministic
hash-based naming. The name is now derived from the file path and
parser options (delimiter, skipFirstLine, fromLine, toLine,
asObjects) using SHA256, ensuring the same file+options combination
always produces the same SharedArray name.
2. Refactoring the internal sharedArrays structure to use a slot-based
system with sync.Once guarantees. This ensures the underlying data
is initialized exactly once per unique name, even when multiple VUs
call csv.parse concurrently.
3. Adding loadOrStore method that uses double-checked locking to
minimize contention while ensuring thread-safe single initialization
of shared arrays.
Closes #5493
Force-pushed from e7e434b to dd5943b
codebien
left a comment
LGTM 🚀 I've just one request.
mstoykov
left a comment
LGTM, apart from what @codebien pointed out
Co-authored-by: Ivan <[email protected]>
What
The csv.parse function was creating a new SharedArray with a random name (based on time.Now().Nanosecond()) for each VU, causing memory usage to scale linearly with VU count instead of sharing data across VUs. For a 40MB CSV file with 100 VUs, this resulted in ~6.5GB memory usage instead of the expected ~200MB.
This commit fixes the issue by replacing the random SharedArray name generation with deterministic hash-based naming, and by refactoring the internal sharedArrays structure so the data is initialized exactly once per unique name (see the commit message above for details).
Closes #5493
Demo
Given a 40MB csv file, and the following script:
Before this PR
After this PR
Checklist
make check) and all pass.
Checklist: Documentation (only for k6 maintainers and if relevant)
Please do not merge this PR until the following items are filled out.
Related PR(s)/Issue(s)