Notes about manual one hot enconding and random state generators. #39

labdmitriy · 2022-03-23T08:25:22Z

labdmitriy
Mar 23, 2022

Hi Sebastian,

While reading the book, I decided to share some thoughts about manual one hot enconding and random state generators, which I've combined in one use case and probably will be useful.

There is an implementation of one hot encoding using numpy on the page 347:

import numpy as np

def int_to_onehot(y, num_labels):
    ary = np.zeros((y.shape[0], num_labels))
    for i, val in enumerate(y):
    ary[i, val] = 1
    return ary

Although there are implementations in sklearn (sklearn.preprocessing.OneHotEncoder) and in torch (torch.nn.functional.one_hot), recently I found more effective vectorized solution for numpy (if it is required):

def int_to_onehot_vectorized(y, num_labels):
    arr = np.zeros((y.shape[0], num_labels))
    indices = y.reshape(-1, 1)
    values = np.ones_like(indices)
    np.put_along_axis(arr, indices, values, axis=1)
    return arr

So we prepare array filled with zeros (arr), column vector based on target y which will be used as indices (indices) in each row where to put specific values, and these specific values will be array with ones (values) with the same shape as indices.
After that we can use these arrays to modify array arr in-place, using numpy function put_along_axis.

We can compare the performance of both methods for small enough array with 10000 rows and 3 classes.

RANDOM_STATE = 42
rng = np.random.default_rng(RANDOM_STATE)
 
n = 10000
num_labels = 3
 
y = rng.integers(num_labels, size=n)

Here np.random.default_rng function is used, which is recommended method for random numbers generation (instead of np.random.RandomState which is also used in the book) for numpy >=1.17.
This article contains excellent explanation and general best practices for correct usage of RNG:
https://albertcthomas.github.io/good-practices-random-number-generators/

If we run both one hot implementations and measure time, we can see that even for this array size performance difference is very significant (more than 25x improvement).
Also we can check, that both implementations output the same values.

%%timeit
arr = int_to_onehot(y, num_labels)

4.55 ms ± 475 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
arr_vectorized = int_to_onehot_vectorized(y, num_labels)

158 µs ± 26.1 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

arr = int_to_onehot(y, num_labels)
arr_vectorized = int_to_onehot_vectorized(y, num_labels)
 
np.allclose(arr, arr_vectorized)

True

Thank you.

rasbt · 2022-03-23T14:27:04Z

rasbt
Mar 23, 2022
Maintainer

Thanks for sharing! That's a very nice alternative implementation of the one-hot encoding in NumPy

As a side note, if you want to implement the one-hot encoding in PyTorch, the probably most efficient way would be

>>> labels = torch.tensor([1, 2, 1, 1, 2, 3])
>>> num_classes = 5
>>> labels_onehot = torch.zeros(labels.shape[0], num_classes)
>>> labels_onehot.scatter_(1, labels.unsqueeze(1), 1.0)
tensor([[0., 1., 0., 0., 0.],
        [0., 0., 1., 0., 0.],
        [0., 1., 0., 0., 0.],
        [0., 1., 0., 0., 0.],
        [0., 0., 1., 0., 0.],
        [0., 0., 0., 1., 0.]])

PS: Hah, yeah, I read about the new random number generators in NumPy, but I somehow always fall back to the old ones ... Really have to force myself to adopt it some time.

0 replies

labdmitriy · 2022-03-23T14:28:31Z

labdmitriy
Mar 23, 2022
Author

Great, thank you!

You give the idea how to simplify numpy implementation by passing constant value to set, not array filled with ones, so improved version will be the following (almost equal to your pytorch version):

def int_to_onehot_vectorized(y, num_labels):
    arr = np.zeros((y.shape[0], num_labels))
    indices = y.reshape(-1, 1)
    np.put_along_axis(arr, indices, 1, axis=1)
    return arr

0 replies

labdmitriy · 2022-03-26T10:21:54Z

labdmitriy
Mar 26, 2022
Author

Hi Sebastian,

I remember that we have even more easy and fast method of one hot encoding:

def int_to_onehot_indexing(y, num_labels):
    n = len(y)
    arr = np.zeros((n, num_labels))
    arr[np.arange(n), y] = 1
    return arr

%%timeit
arr = int_to_onehot(labels_arr, num_classes)

100 loops, best of 5: 2.79 ms per loop

%%timeit
arr = int_to_onehot_vectorized(labels_arr, num_classes)

1000 loops, best of 5: 93.1 µs per loop

%%timeit
arr = int_to_onehot_indexing(labels_arr, num_classes)

10000 loops, best of 5: 76.8 µs per loop

Probably torch version will be also effective.

Thank you.

1 reply

rasbt Mar 31, 2022
Maintainer

Nice, thanks for posting, I will keep this as a reference for future coding projects!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Notes about manual one hot enconding and random state generators. #39

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 3 comments 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Notes about manual one hot enconding and random state generators. #39

Uh oh!

Uh oh!

labdmitriy Mar 23, 2022

Replies: 3 comments · 1 reply

Uh oh!

rasbt Mar 23, 2022 Maintainer

Uh oh!

Uh oh!

labdmitriy Mar 23, 2022 Author

Uh oh!

labdmitriy Mar 26, 2022 Author

Uh oh!

rasbt Mar 31, 2022 Maintainer

labdmitriy
Mar 23, 2022

Replies: 3 comments 1 reply

rasbt
Mar 23, 2022
Maintainer

labdmitriy
Mar 23, 2022
Author

labdmitriy
Mar 26, 2022
Author

rasbt Mar 31, 2022
Maintainer