Description
`get_molnet_dataset` declares a `seed=777` argument in its signature:
chainer-chemistry/chainer_chemistry/datasets/molnet/molnet.py
Lines 24 to 28 in 56e83de
```python
def get_molnet_dataset(dataset_name, preprocessor=None, labels=None,
                       split=None, frac_train=.8, frac_valid=.1,
                       frac_test=.1, seed=777, return_smiles=False,
                       return_pdb_id=False, target_index=None, task_index=0,
                       **kwargs):
```
However, `seed` is never passed to any splitter. Because it is a named parameter, it is not part of `**kwargs` either, so in the same function the splitter is called without a seed argument:
chainer-chemistry/chainer_chemistry/datasets/molnet/molnet.py
Lines 104 to 130 in 56e83de
```python
    if dataset_config['dataset_type'] == 'one_file_csv':
        split = dataset_config['split'] if split is None else split
        if isinstance(split, str):
            splitter = split_method_dict[split]()
        elif isinstance(split, BaseSplitter):
            splitter = split
        else:
            raise TypeError("split must be None, str or instance of"
                            " BaseSplitter, but got {}".format(type(split)))
        if isinstance(splitter, ScaffoldSplitter):
            get_smiles = True
        else:
            get_smiles = return_smiles
        result = parser.parse(get_molnet_filepath(dataset_name),
                              return_smiles=get_smiles,
                              target_index=target_index, **kwargs)
        dataset = result['dataset']
        smiles = result['smiles']
        train_ind, valid_ind, test_ind = \
            splitter.train_valid_test_split(dataset, smiles_list=smiles,
                                            task_index=task_index,
                                            frac_train=frac_train,
                                            frac_valid=frac_valid,
                                            frac_test=frac_test, **kwargs)
```
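The underlying Python behavior can be reproduced without chainer-chemistry. The sketch below is only an illustration: `get_dataset` and `train_valid_test_split` are hypothetical stand-ins for the real functions, reduced to the argument passing that matters here.

```python
def train_valid_test_split(dataset, seed=None, **kwargs):
    # Hypothetical stand-in for a splitter method: report the seed it received.
    return seed


def get_dataset(dataset, seed=777, **kwargs):
    # `seed` is bound to the named parameter, so it is NOT inside `kwargs`.
    # Forwarding only **kwargs leaves the splitter's seed at its default.
    return train_valid_test_split(dataset, **kwargs)


received_default = get_dataset([1, 2, 3])
received_explicit = get_dataset([1, 2, 3], seed=42)
print(received_default, received_explicit)  # prints: None None
```

Neither the default `777` nor an explicit `seed=42` reaches the inner function.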
As a result, inside the splitter (here `ScaffoldSplitter`), `seed` keeps its default value of `None`:
chainer-chemistry/chainer_chemistry/dataset/splitters/scaffold_splitter.py
Lines 62 to 65 in 56e83de
```python
    def train_valid_test_split(self, dataset, smiles_list, frac_train=0.8,
                               frac_valid=0.1, frac_test=0.1, converter=None,
                               return_index=True, seed=None,
                               include_chirality=False, **kwargs):
```
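In the implementation, this `seed=None` is passed to `numpy.random.RandomState`. According to the NumPy docs, the generator is then initialized by reading data from `/dev/urandom` (or seeded from the clock if that is unavailable), so every run gets a different random state.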
chainer-chemistry/chainer_chemistry/dataset/splitters/scaffold_splitter.py
Lines 23 to 35 in 56e83de
```python
    def _split(self, dataset, frac_train=0.8, frac_valid=0.1, frac_test=0.1,
               **kwargs):
        numpy.testing.assert_almost_equal(frac_train + frac_valid + frac_test,
                                          1.)
        seed = kwargs.get('seed', None)
        smiles_list = kwargs.get('smiles_list')
        include_chirality = kwargs.get('include_chirality')
        if len(dataset) != len(smiles_list):
            raise ValueError("The lengths of dataset and smiles_list are "
                             "different")
        rng = numpy.random.RandomState(seed)
```
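The effect of `seed=None` on `numpy.random.RandomState` can be checked in isolation. This is a small sketch rather than the splitter's actual logic: `permutation` is used here only as an example of a seed-dependent operation.

```python
import numpy

# Same integer seed: the two generators produce identical permutations.
perm_a = numpy.random.RandomState(777).permutation(10)
perm_b = numpy.random.RandomState(777).permutation(10)
same_seed_matches = bool(numpy.array_equal(perm_a, perm_b))

# seed=None: each generator is seeded from fresh OS entropy, so two
# permutations of 1000 elements will (for all practical purposes) differ.
unseeded_a = numpy.random.RandomState(None).permutation(1000)
unseeded_b = numpy.random.RandomState(None).permutation(1000)
unseeded_matches = bool(numpy.array_equal(unseeded_a, unseeded_b))

print(same_seed_matches, unseeded_matches)
```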
This bug makes the data split inconsistent across different models and different runs, even when the same seed is specified explicitly. The default seed of 777 has no effect.
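One possible fix is to forward `seed` explicitly when calling the splitter. The sketch below shows the idea with hypothetical stand-ins (`get_dataset`, `train_valid_test_split`). It uses the standard-library `random.Random` instead of `numpy.random.RandomState` purely to stay dependency-free, and it is not the project's actual patch.

```python
import random


def train_valid_test_split(dataset, seed=None, **kwargs):
    # Hypothetical stand-in splitter: shuffle the indices with a seeded RNG.
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)
    return indices


def get_dataset(dataset, seed=777, **kwargs):
    # Possible fix: pass `seed` on explicitly instead of relying on **kwargs.
    return train_valid_test_split(dataset, seed=seed, **kwargs)


data = list("abcdefghij")
run_1 = get_dataset(data)           # uses the default seed 777
run_2 = get_dataset(data)           # same seed, so the same split
run_3 = get_dataset(data, seed=42)  # an explicit seed is now honored
print(run_1 == run_2)  # prints: True
```

With the seed forwarded, repeated calls with the same seed give the same index order, which is what the `seed=777` default in the signature suggests should happen.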
PS: I used the PyCharm debugger to verify the behavior described above.