Chapter 2: Creating a test set, Stratify #689

I would appreciate clarification on a few points in Chapter 2.

1. Why do we need to set the random seed? If the goal is to get the same train/test split on every run, why do we need multiple runs at all?

2. If the hash function keeps the test set consistent, can new instances still end up in the test set whenever the hash of their identifier satisfies the condition crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32?

3. What is the point of using stratified sampling in the first place?

4. Why can't we just use the regular train_test_split function instead of StratifiedShuffleSplit?
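
For reference, here is a minimal sketch of the condition from point 2, assuming Python's zlib.crc32 and NumPy. The helper name is_id_in_test_set, the 0.2 ratio, and the integer ids are illustrative choices, not necessarily the chapter's exact code:

```python
from zlib import crc32

import numpy as np


def is_id_in_test_set(identifier, test_ratio):
    # Hash the identifier as a 64-bit integer, keep the low 32 bits, and
    # check whether the hash falls in the first test_ratio fraction of
    # the 32-bit range.
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32


# Hypothetical ids for an initial dataset of 1000 instances.
ids = np.arange(1000)
in_test = np.array([is_id_in_test_set(i, 0.2) for i in ids])

# The same check after 200 hypothetical new instances were appended.
more_ids = np.arange(1200)
in_test_later = np.array([is_id_in_test_set(i, 0.2) for i in more_ids])

# The decision for an id depends only on that id, so earlier
# assignments are unchanged by the new instances.
print((in_test == in_test_later[:1000]).all())
```

In this sketch the assignment of the first 1000 ids is identical before and after the new ids are added, and each new id is assigned only according to its own hash.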

Thank you for your kindness and your time.
