Notes about manual one-hot encoding and random state generators. #39
Replies: 3 comments 1 reply
-
Thanks for sharing! That's a very nice alternative implementation of the one-hot encoding in NumPy. As a side note, if you want to implement the one-hot encoding in PyTorch, probably the most efficient way would be

>>> labels = torch.tensor([1, 2, 1, 1, 2, 3])
>>> num_classes = 5
>>> labels_onehot = torch.zeros(labels.shape[0], num_classes)
>>> labels_onehot.scatter_(1, labels.unsqueeze(1), 1.0)
tensor([[0., 1., 0., 0., 0.],
        [0., 0., 1., 0., 0.],
        [0., 1., 0., 0., 0.],
        [0., 1., 0., 0., 0.],
        [0., 0., 1., 0., 0.],
        [0., 0., 0., 1., 0.]])

PS: Hah, yeah, I read about the new random number generators in NumPy, but I somehow always fall back to the old ones ... Really have to force myself to adopt them sometime.
-
Great, thank you! You gave me the idea of how to simplify the NumPy implementation by passing a constant value instead of an array filled with ones, so the improved version is the following (almost identical to your PyTorch version):
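(The snippet itself is missing from this reply; presumably it is the same put_along_axis call with a scalar in place of the ones array. A minimal sketch, with an illustrative function name:)

import numpy as np

def int_to_onehot_vectorized(y, num_labels):
    arr = np.zeros((y.shape[0], num_labels))
    # The constant 1.0 is broadcast to the shape of the index column,
    # so no separate array of ones is needed (mirrors scatter_ in PyTorch)
    np.put_along_axis(arr, y.reshape(-1, 1), 1.0, axis=1)
    return arr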
-
Hi Sebastian, I remembered that there is an even easier and faster method of one-hot encoding:
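(The snippet is not reproduced here, so the following is only a guess at what was meant; one common shortcut that matches the description is indexing into an identity matrix.)

import numpy as np

def int_to_onehot_eye(y, num_labels):
    # Row y[i] of the identity matrix is exactly the one-hot vector for class y[i]
    return np.eye(num_labels)[y]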
100 loops, best of 5: 2.79 ms per loop
1000 loops, best of 5: 93.1 µs per loop
10000 loops, best of 5: 76.8 µs per loop

The torch version will probably be similarly efficient. Thank you.
-
Hi Sebastian,
While reading the book, I decided to share some thoughts about manual one-hot encoding and random state generators, which I've combined into one use case that will probably be useful.
There is an implementation of one-hot encoding using NumPy on page 347:
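(The snippet did not survive here; roughly, it is a loop-based version along the following lines, paraphrased rather than quoted from the book.)

import numpy as np

def int_to_onehot(y, num_labels):
    # One row per training example, one column per class label
    ary = np.zeros((y.shape[0], num_labels))
    # Set the column of each example's class label to 1, one row at a time
    for i, val in enumerate(y):
        ary[i, val] = 1
    return ary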
Although there are implementations in sklearn (sklearn.preprocessing.OneHotEncoder) and in torch (torch.nn.functional.one_hot), I recently found a more efficient vectorized solution for NumPy (if one is needed):
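(The original snippet is missing here as well; below is a minimal sketch of the approach described in the next paragraph, with illustrative names.)

import numpy as np

def int_to_onehot_vectorized(y, num_labels):
    # Array of zeros: one row per example, one column per class
    arr = np.zeros((y.shape[0], num_labels))
    # Column vector of class indices: for every row, the position that should become 1
    indices = y.reshape(-1, 1)
    # The values to place: ones with the same shape as the indices
    values = np.ones_like(indices, dtype=arr.dtype)
    # Scatter the ones into arr in place, along the column axis
    np.put_along_axis(arr, indices, values, axis=1)
    return arr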
So we prepare an array filled with zeros (arr), a column vector based on the target y that is used as indices (indices) telling each row where to put the specific values, and the specific values themselves, which are an array of ones (values) with the same shape as indices.
After that we can use these arrays to modify arr in place with the NumPy function put_along_axis.
We can compare the performance of both methods for a small enough array with 10,000 rows and 3 classes.
Here the np.random.default_rng function is used, which is the recommended method of random number generation (instead of np.random.RandomState, which is also used in the book) for NumPy >= 1.17.
This article contains an excellent explanation and general best practices for correct usage of RNGs:
https://albertcthomas.github.io/good-practices-random-number-generators/
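(The benchmark cell itself is not reproduced above; a sketch of what such a comparison could look like, reusing the illustrative function names from the sketches above and an arbitrary seed:)

import numpy as np

num_labels = 3
rng = np.random.default_rng(seed=1)  # new-style Generator, recommended for NumPy >= 1.17
y = rng.integers(low=0, high=num_labels, size=10_000)

# In a Jupyter/IPython session:
# %timeit int_to_onehot(y, num_labels)             # loop-based version
# %timeit int_to_onehot_vectorized(y, num_labels)  # put_along_axis version

# Verify that both implementations produce identical outputs
print(np.array_equal(int_to_onehot(y, num_labels),
                     int_to_onehot_vectorized(y, num_labels)))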
If we run both one-hot implementations and measure the time, we can see that even for this array size the performance difference is very significant (more than a 25x improvement).
We can also check that both implementations output the same values.
4.55 ms ± 475 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
158 µs ± 26.1 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
True
Thank you.