-
-
Notifications
You must be signed in to change notification settings - Fork 262
Populate make_classification_df date functionality with random dates #851
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 4 commits
4b55f9a
424fca4
b740b27
24461fb
3a7c9d1
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -381,10 +381,11 @@ def make_classification( | |
return X, y | ||
|
||
|
||
def random_date(start, end): | ||
def random_date(start, end, random_state=None): | ||
rng_random_date = dask_ml.utils.check_random_state(random_state) | ||
delta = end - start | ||
int_delta = (delta.days * 24 * 60 * 60) + delta.seconds | ||
random_second = np.random.randint(int_delta) | ||
random_second = rng_random_date.randint(int_delta).compute().item() | ||
return start + timedelta(seconds=random_second) | ||
|
||
|
||
|
@@ -430,6 +431,13 @@ def make_classification_df( | |
The output values. | ||
|
||
""" | ||
if ( | ||
random_state is not None | ||
or not isinstance(random_state, np.random.RandomState) | ||
or not isinstance(random_state, int) | ||
): | ||
random_state = None | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why is the block necessary? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This was to address the issue you mentioned earlier, "what if random_state is not an integer or any of the accepted values" |
||
|
||
X_array, y_array = make_classification( | ||
n_samples=n_samples, | ||
flip_y=(1 - predictability), | ||
|
@@ -451,8 +459,13 @@ def make_classification_df( | |
[ | ||
X_df, | ||
dd.from_array( | ||
np.array([random_date(*dates)] * len(X_df)), | ||
chunksize=chunks, | ||
np.array( | ||
[ | ||
random_date(*dates, random_state + i) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What happens when There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Okay, I will have to raise a ValueError exception there, on it. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why not use this code? rng = check_random_state(random_state)
dates = [random_date(*dates, rng) for i in range(len(X_df))]
... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The code above will produce the same random number since the seed( There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My main point: I think [ins] In [193]: def random_dates(random_state):
...: return random_state.randint(100)
...:
[ins] In [194]: rng = np.random.RandomState(42)
[ins] In [196]: [random_dates(rng) for _ in range(20)]
Out[196]: [51, 92, 14, 71, 60, 20, 82, 86, 74, 74, 87, 99, 23, 2, 21, 52, 1, 87, 29, 37] There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I can maybe just check if random_state is one of the accepted values like this and accordingly proceed - There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That runs counter to the use of If There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. random_state : int, RandomState instance or None, optional (default=None), these are values accepted by Scikit-Learn's There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Scikit-learn's It takes those values and produces the correct random seed generator. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @stsievert the accepted types of |
||
for i in range(len(X_df)) | ||
] | ||
), | ||
chunksize=n_samples, | ||
columns=["date"], | ||
), | ||
], | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -73,10 +73,22 @@ def test_make_classification_df(): | |
dates=(date(2014, 1, 1), date(2015, 1, 1)), | ||
) | ||
|
||
X_df1, y_series1 = dask_ml.datasets.make_classification_df( | ||
n_samples=100, | ||
n_features=5, | ||
random_state=123, | ||
chunks=100, | ||
dates=(date(2014, 1, 1), date(2015, 1, 1)), | ||
) | ||
check_randomness = np.unique((X_df["date"] == X_df1["date"]).compute()) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good, this checks the random state. Shouldn't this also check that there's more than one unique value? That's what #845 is focused on. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, the code I've written checks for repeatability, on account of the seed. Since the the numpy's There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The same I think There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Okay, that sounds good. Now I've to figure out, what would be a good threshold. Will There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh yeah, |
||
|
||
assert X_df is not None | ||
assert y_series is not None | ||
assert "date" in X_df.columns | ||
assert len(X_df.columns) == 6 | ||
assert len(X_df) == 100 | ||
assert len(y_series) == 100 | ||
assert isinstance(y_series, dask.dataframe.core.Series) | ||
assert check_randomness.size == 1 | ||
assert check_randomness[0] is True | ||
assert np.unique(X_df["date"]).size >= 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
That way the
.compute()
can be avoided (especially relevant on repeated calls.).