
[WIP] Allow pure numpy array (not dask array) as inputs #90


Status: Open · wants to merge 4 commits into main
8 changes: 4 additions & 4 deletions dask_glm/algorithms.py
@@ -11,7 +11,7 @@
from scipy.optimize import fmin_l_bfgs_b


-from dask_glm.utils import dot, normalize, scatter_array, get_distributed_client
+from dask_glm.utils import dot, normalize, scatter_array, get_distributed_client, safe_zeros_like
Member:

Where is safe_zeros_like coming from? I suppose you wanted from dask.array.utils import zeros_like_safe instead; see https://github.com/dask/dask/blob/48a4d4a5c5769f6b78881adeb1b3973a950e5f43/dask/array/utils.py#L350
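For reference, a minimal sketch of that suggestion (assuming the zeros_like_safe helper in dask.array.utils at the commit linked above, which takes a reference array and a shape, and falls back to np.zeros for array libraries whose zeros_like does not accept a shape= keyword):

```python
import numpy as np
from dask.array.utils import zeros_like_safe

ref = np.empty((100, 5))
beta = zeros_like_safe(ref, shape=5)  # zeros with ref's dtype and array type
```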

from dask_glm.families import Logistic
from dask_glm.regularizers import Regularizer

@@ -97,7 +97,7 @@ def gradient_descent(X, y, max_iter=100, tol=1e-14, family=Logistic, **kwargs):
    stepSize = 1.0
    recalcRate = 10
    backtrackMult = firstBacktrackMult
-    beta = np.zeros_like(X._meta, shape=p)
+    beta = safe_zeros_like(X, shape=p)

    for k in range(max_iter):
        # how necessary is this recalculation?
@@ -161,7 +161,7 @@ def newton(X, y, max_iter=50, tol=1e-8, family=Logistic, **kwargs):
"""
gradient, hessian = family.gradient, family.hessian
n, p = X.shape
beta = np.zeros_like(X._meta, shape=p)
beta = safe_zeros_like(X, shape=p)
Xbeta = dot(X, beta)

iter_count = 0
@@ -387,7 +387,7 @@ def proximal_grad(X, y, regularizer='l1', lamduh=0.1, family=Logistic,
    stepSize = 1.0
    recalcRate = 10
    backtrackMult = firstBacktrackMult
-    beta = np.zeros_like(X._meta, shape=p)
+    beta = safe_zeros_like(X, shape=p)
    regularizer = Regularizer.get(regularizer)

    for k in range(max_iter):
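The rationale for routing these call sites through a helper: plain NumPy arrays have no _meta attribute, so the previous np.zeros_like(X._meta, shape=p) pattern only works when X is a Dask array. A minimal illustration of the failure mode:

```python
import numpy as np
import dask.array as da

X_dask = da.ones((100, 5), chunks=10)
np.zeros_like(X_dask._meta, shape=5)  # fine: _meta is an empty numpy.ndarray

X_np = np.ones((100, 5))
np.zeros_like(X_np._meta, shape=5)    # AttributeError: no attribute '_meta'
```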
9 changes: 7 additions & 2 deletions dask_glm/tests/test_estimators.py
@@ -44,9 +44,14 @@ def test_pr_init(solver):


@pytest.mark.parametrize('fit_intercept', [True, False])
-@pytest.mark.parametrize('is_sparse', [True, False])
-def test_fit(fit_intercept, is_sparse):
+@pytest.mark.parametrize('is_sparse,is_numpy', [
+    (True, False),
+    (False, False),
+    (False, True)])
+def test_fit(fit_intercept, is_sparse, is_numpy):
    X, y = make_classification(n_samples=100, n_features=5, chunksize=10, is_sparse=is_sparse)
+    if is_numpy:
+        X, y = dask.compute(X, y)
    lr = LogisticRegression(fit_intercept=fit_intercept)
    lr.fit(X, y)
    lr.predict(X)
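For context, dask.compute materializes its arguments, so after that branch X and y are concrete NumPy (or sparse) arrays rather than Dask arrays:

```python
import dask
import dask.array as da

X = da.ones((100, 5), chunks=10)
y = da.zeros(100, chunks=10)
X, y = dask.compute(X, y)  # now plain numpy.ndarray objects
```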
Member:

I don't think I understand this test. When is is_numpy the case in a real-world example? In other words, will you ever have X and y be pure NumPy arrays that are worth testing with LogisticRegression? I assumed you'd only have Dask arrays (backed by sparse arrays or not).

Contributor Author:

That's exactly what I tried to do: both X and y are pure NumPy/CuPy arrays. Is that a feature we want? The current dask-glm only accepts Dask arrays.

Member:

I don't think that's a feature we need to support explicitly; I believe anybody using dask-glm would want to use Dask arrays rather than pure NumPy/CuPy ones.

Contributor Author:

Thank you! I'll prioritize #89 then.

19 changes: 15 additions & 4 deletions dask_glm/utils.py
@@ -23,7 +23,7 @@ def normalize_inputs(X, y, *args, **kwargs):
            raise ValueError('Multiple constant columns detected!')
        mean[intercept_idx] = 0
        std[intercept_idx] = 1
-        mean = mean if len(intercept_idx[0]) else np.zeros_like(X._meta, shape=mean.shape)
+        mean = mean if len(intercept_idx[0]) else safe_zeros_like(X, shape=mean.shape)
        Xn = (X - mean) / std
        out = algo(Xn, y, *args, **kwargs).copy()
        i_adj = np.sum(out * mean / std)
@@ -41,7 +41,7 @@ def sigmoid(x):

@dispatch(object)
def exp(A):
-    return A.exp()
+    return np.exp(A)


@dispatch(float)
@@ -91,7 +91,7 @@ def sign(A):

@dispatch(object)
def log1p(A):
-    return A.log1p()
+    return np.log1p(A)


@dispatch(np.ndarray)
@@ -121,7 +121,7 @@ def is_dask_array_sparse(X):
"""
Check using _meta if a dask array contains sparse arrays
"""
return isinstance(X._meta, sparse.SparseArray)
return isinstance(X, da.Array) and isinstance(X._meta, sparse.SparseArray)


@dispatch(np.ndarray)
@@ -149,6 +149,11 @@ def add_intercept(X):
    return X_i


+@dispatch(object)
+def add_intercept(X):
+    return np.concatenate([X, np.ones_like(X, shape=(X.shape[0], 1))], axis=1)
Member:

Suggested change:
-    return np.concatenate([X, np.ones_like(X, shape=(X.shape[0], 1))], axis=1)
+    return np.concatenate([X, ones_like_safe(X, shape=(X.shape[0], 1))], axis=1)

Member:

Also needs from dask.array.utils import ones_like_safe.
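Putting both suggestions together, the new dispatch would read roughly as follows (a sketch; ones_like_safe is assumed to behave like zeros_like_safe, returning an array of the reference's type):

```python
import numpy as np
from multipledispatch import dispatch
from dask.array.utils import ones_like_safe

@dispatch(object)
def add_intercept(X):
    # Append a column of ones, preserving the input's array type (NumPy/CuPy)
    return np.concatenate([X, ones_like_safe(X, shape=(X.shape[0], 1))], axis=1)
```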



def make_y(X, beta=np.array([1.5, -3]), chunks=2):
    n, p = X.shape
    z0 = X.dot(beta)
@@ -205,3 +210,9 @@ def get_distributed_client():
        return get_client()
    except ValueError:
        return None


+def safe_zeros_like(X, shape):
+    if isinstance(X, da.Array):
+        return np.zeros_like(X._meta, shape=shape)
+    return np.zeros_like(X, shape=shape)
Comment on lines +216 to +218
Member:

Suggested change:
-    if isinstance(X, da.Array):
-        return np.zeros_like(X._meta, shape=shape)
-    return np.zeros_like(X, shape=shape)
+    return zeros_like_safe(meta_from_array(X))

Member:

You'll also need to from dask.array.utils import meta_from_array at the top.

Contributor Author (@daxiongshu, Nov 11, 2020):

Sorry for the late reply, I think I might have misunderstood our other conversation. #89 (comment)

This PR intends to enable dask-glm to deal with pure NumPy arrays. Please let me know if that's not the case and dask-glm should only accept Dask arrays.

beta = np.zeros_like(X._meta, shape=p)

Let's say the input X is a pure NumPy or CuPy array, not a Dask array. Then beta = np.zeros_like(X._meta) raises an error, because plain arrays have no _meta attribute. The safe_zeros_like (bad naming) I implemented checks whether X is a pure NumPy/CuPy array or a Dask array and returns a pure NumPy/CuPy array either way. In contrast, da.utils.zeros_like_safe returns a Dask array. In this case beta should be a pure NumPy/CuPy array.
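To make the two paths concrete, a small sketch using the helper as written in this PR:

```python
import numpy as np
import dask.array as da
from dask_glm.utils import safe_zeros_like  # added in this PR

beta = safe_zeros_like(np.ones((100, 5)), shape=5)             # numpy.ndarray
beta = safe_zeros_like(da.ones((100, 5), chunks=10), shape=5)  # also numpy.ndarray, built from _meta
```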

Let me know if this clears things up. Thank you!

Member:

> The safe_zeros_like (bad naming) I implemented will check if X is a pure numpy/cupy array or a dask array and return a pure numpy/cupy array.

That's exactly what meta_from_array does. It will return an array of the type _meta has (i.e., the chunk type): if the input is a NumPy array or a Dask array backed by NumPy, the result is an empty numpy.ndarray; if the input is a CuPy array or a Dask array backed by CuPy, the result is an empty cupy.ndarray.

> In contrast, da.utils.zeros_like_safe returns a dask array.

That isn't necessarily true; it will only return a Dask array if the reference array is a Dask array. Because we're getting the underlying chunk type with meta_from_array, the resulting array will be either a NumPy or a CuPy array.
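A sketch of the combination being described, assuming meta_from_array and zeros_like_safe from dask.array.utils (note: the shape argument below is added for illustration, since zeros_like_safe expects one; the suggestion above omits it):

```python
import numpy as np
import dask.array as da
from dask.array.utils import meta_from_array, zeros_like_safe

X_np = np.ones((100, 5))
X_da = da.ones((100, 5), chunks=10)

type(meta_from_array(X_np))  # numpy.ndarray (empty)
type(meta_from_array(X_da))  # numpy.ndarray too: the underlying chunk type

beta = zeros_like_safe(meta_from_array(X_da), shape=5)  # concrete numpy.ndarray of zeros
```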

Contributor Author:

Aha, that works! I will make the changes.