-
Notifications
You must be signed in to change notification settings - Fork 4k
[python-package] do not copy column-major numpy arrays when creating Dataset #6721
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
This could also be done for the predict portion, i.e. LightGBM/python-package/lightgbm/basic.py Lines 1268 to 1307 in 5151fe8
Please let me know if it's ok to include it here or in a separate PR. |
StrikerRUS
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thank you for very useful improvement!
I just left a couple of non-blocking comments below for your consideration.
| if mat.flags["F_CONTIGUOUS"]: | ||
| order = "F" | ||
| else: | ||
| order = "C" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just want to highlight that arrays can be C-contiguous and F-contiguous simultaneously.
https://stackoverflow.com/q/60230562
Arrays can be both C-style and Fortran-style contiguous simultaneously. This is clear for 1-dimensional arrays, but can also be true for higher dimensional arrays.
https://numpy.org/doc/stable/reference/generated/numpy.ndarray.flags.html
Seems it's correctly handled here.
I believe it would be better to have a follow-up PR for prediction after merging this one. |
jameslamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great to me! I agree with all of @StrikerRUS 's comments, so won't approve until those are addressed. But I don't have any additional comments.
Thanks for working on this, and really nice tests!
| assert ndcg_score_no_bagging_by_query >= ndcg_score - 0.1 | ||
|
|
||
|
|
||
| def test_equal_datasets_from_row_major_and_col_major_data(tmp_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this test now doesn't include train(), I think it should go to test_basic.py file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry. Moved in 38c6786
Would you like to take a look after the recent commits? |
jameslamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
Sure, I'll take a look. |
|
This pull request has been automatically locked since there has not been any recent activity since it was closed. |
Changes the logic to flatten the data array to consider its layout:
I ran some quick tests with an array of (100k, 50) and the timings to construct the dataset from a C-contiguous and an F-contiguous arrays are roughly the same, so this doesn't introduce extra latency, just avoids the copy.