
Random seed is useless in get_molnet_dataset function #418

@Minys233

Description


The get_molnet_dataset function accepts a seed=777 argument in its signature:

def get_molnet_dataset(dataset_name, preprocessor=None, labels=None,
                       split=None, frac_train=.8, frac_valid=.1,
                       frac_test=.1, seed=777, return_smiles=False,
                       return_pdb_id=False, target_index=None, task_index=0,
                       **kwargs):

However, the seed is never passed to any splitter. In the same function, the splitter is called without a seed argument:

if dataset_config['dataset_type'] == 'one_file_csv':
    split = dataset_config['split'] if split is None else split
    if isinstance(split, str):
        splitter = split_method_dict[split]()
    elif isinstance(split, BaseSplitter):
        splitter = split
    else:
        raise TypeError("split must be None, str or instance of"
                        " BaseSplitter, but got {}".format(type(split)))
    if isinstance(splitter, ScaffoldSplitter):
        get_smiles = True
    else:
        get_smiles = return_smiles
    result = parser.parse(get_molnet_filepath(dataset_name),
                          return_smiles=get_smiles,
                          target_index=target_index, **kwargs)
    dataset = result['dataset']
    smiles = result['smiles']
    train_ind, valid_ind, test_ind = \
        splitter.train_valid_test_split(dataset, smiles_list=smiles,
                                        task_index=task_index,
                                        frac_train=frac_train,
                                        frac_valid=frac_valid,
                                        frac_test=frac_test, **kwargs)

As a result, in the splitter (here the ScaffoldSplitter), the seed argument keeps its default of None:

def train_valid_test_split(self, dataset, smiles_list, frac_train=0.8,
                           frac_valid=0.1, frac_test=0.1, converter=None,
                           return_index=True, seed=None,
                           include_chirality=False, **kwargs):

In the implementation, seed=None means the random state is initialized from OS entropy (e.g. /dev/urandom), according to the NumPy docs:

def _split(self, dataset, frac_train=0.8, frac_valid=0.1, frac_test=0.1,
           **kwargs):
    numpy.testing.assert_almost_equal(frac_train + frac_valid + frac_test,
                                      1.)
    seed = kwargs.get('seed', None)
    smiles_list = kwargs.get('smiles_list')
    include_chirality = kwargs.get('include_chirality')
    if len(dataset) != len(smiles_list):
        raise ValueError("The lengths of dataset and smiles_list are "
                         "different")
    rng = numpy.random.RandomState(seed)

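This seed=None fallback is easy to confirm with NumPy alone: generators built with the same explicit seed agree, while generators built with seed=None draw their state from OS entropy on every construction (a minimal sketch, independent of chainer-chemistry):

```python
import numpy

# Two generators built with the same explicit seed produce identical orderings.
a = numpy.random.RandomState(777).permutation(10)
b = numpy.random.RandomState(777).permutation(10)
assert (a == b).all()

# With seed=None, the state comes from OS entropy, so each RandomState
# (and each program run) produces a different split order.
c = numpy.random.RandomState(None).permutation(1000)
d = numpy.random.RandomState(None).permutation(1000)
assert not (c == d).all()  # identical only with probability 1/1000!
```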
This bug makes the data split inconsistent across different models and different runs, even when we explicitly specify the same seed, and the default seed=777 has no effect.
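The fix would be to forward the seed at the call site (e.g. passing seed=seed into train_valid_test_split, from where **kwargs carries it into _split). The toy sketch below (toy_split is a hypothetical stand-in, not chainer-chemistry code) mimics the kwargs fallback in _split and shows the behaviour with and without forwarding:

```python
import numpy

def toy_split(n, frac_train=0.8, frac_valid=0.1, **kwargs):
    # Like the real _split, read seed from kwargs with a None fallback.
    seed = kwargs.get('seed', None)
    rng = numpy.random.RandomState(seed)
    perm = rng.permutation(n)
    n_train = int(n * frac_train)
    n_valid = int(n * frac_valid)
    return (perm[:n_train], perm[n_train:n_train + n_valid],
            perm[n_train + n_valid:])

# Current behaviour: the caller never forwards seed, so splits differ per call.
a, _, _ = toy_split(1000)
b, _, _ = toy_split(1000)
assert not (a == b).all()

# With the fix, the caller forwards seed, restoring determinism.
c, _, _ = toy_split(1000, seed=777)
d, _, _ = toy_split(1000, seed=777)
assert (c == d).all()
```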

PS: I used the PyCharm debugger to verify the above call chain.
