
Random seed is useless in get_molnet_dataset function #418

@Minys233

Description


The get_molnet_dataset function accepts a seed=777 argument in its signature:

def get_molnet_dataset(dataset_name, preprocessor=None, labels=None,
                       split=None, frac_train=.8, frac_valid=.1,
                       frac_test=.1, seed=777, return_smiles=False,
                       return_pdb_id=False, target_index=None, task_index=0,
                       **kwargs):

However, the seed is never passed to any splitter. In the same function, the splitter is called without a seed argument:

if dataset_config['dataset_type'] == 'one_file_csv':
    split = dataset_config['split'] if split is None else split
    if isinstance(split, str):
        splitter = split_method_dict[split]()
    elif isinstance(split, BaseSplitter):
        splitter = split
    else:
        raise TypeError("split must be None, str or instance of"
                        " BaseSplitter, but got {}".format(type(split)))
    if isinstance(splitter, ScaffoldSplitter):
        get_smiles = True
    else:
        get_smiles = return_smiles
    result = parser.parse(get_molnet_filepath(dataset_name),
                          return_smiles=get_smiles,
                          target_index=target_index, **kwargs)
    dataset = result['dataset']
    smiles = result['smiles']
    train_ind, valid_ind, test_ind = \
        splitter.train_valid_test_split(dataset, smiles_list=smiles,
                                        task_index=task_index,
                                        frac_train=frac_train,
                                        frac_valid=frac_valid,
                                        frac_test=frac_test, **kwargs)

As a result, in the splitter (here the ScaffoldSplitter), the seed argument keeps its default of None:

def train_valid_test_split(self, dataset, smiles_list, frac_train=0.8,
                           frac_valid=0.1, frac_test=0.1, converter=None,
                           return_index=True, seed=None,
                           include_chirality=False, **kwargs):

In the implementation, seed=None means the random state is initialized from OS entropy (e.g. /dev/urandom), according to the NumPy docs:

def _split(self, dataset, frac_train=0.8, frac_valid=0.1, frac_test=0.1,
           **kwargs):
    numpy.testing.assert_almost_equal(frac_train + frac_valid + frac_test,
                                      1.)
    seed = kwargs.get('seed', None)
    smiles_list = kwargs.get('smiles_list')
    include_chirality = kwargs.get('include_chirality')
    if len(dataset) != len(smiles_list):
        raise ValueError("The lengths of dataset and smiles_list are "
                         "different")
    rng = numpy.random.RandomState(seed)

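This seed=None fallback is easy to confirm with NumPy alone: generators built with the same explicit seed agree, while generators built with seed=None draw their state from OS entropy on every construction (a minimal sketch, independent of chainer-chemistry):

```python
import numpy

# Two generators built with the same explicit seed produce identical orderings.
a = numpy.random.RandomState(777).permutation(10)
b = numpy.random.RandomState(777).permutation(10)
assert (a == b).all()

# With seed=None, the state comes from OS entropy, so each RandomState
# (and each program run) produces a different split order.
c = numpy.random.RandomState(None).permutation(1000)
d = numpy.random.RandomState(None).permutation(1000)
assert not (c == d).all()  # identical only with probability 1/1000!
```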
This bug makes the data split inconsistent across different models and different runs, even when we explicitly specify the same seed, and the default seed=777 has no effect.
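The fix would be to forward the seed at the call site (e.g. passing seed=seed into train_valid_test_split, from where **kwargs carries it into _split). The toy sketch below (toy_split is a hypothetical stand-in, not chainer-chemistry code) mimics the kwargs fallback in _split and shows the behaviour with and without forwarding:

```python
import numpy

def toy_split(n, frac_train=0.8, frac_valid=0.1, **kwargs):
    # Like the real _split, read seed from kwargs with a None fallback.
    seed = kwargs.get('seed', None)
    rng = numpy.random.RandomState(seed)
    perm = rng.permutation(n)
    n_train = int(n * frac_train)
    n_valid = int(n * frac_valid)
    return (perm[:n_train], perm[n_train:n_train + n_valid],
            perm[n_train + n_valid:])

# Current behaviour: the caller never forwards seed, so splits differ per call.
a, _, _ = toy_split(1000)
b, _, _ = toy_split(1000)
assert not (a == b).all()

# With the fix, the caller forwards seed, restoring determinism.
c, _, _ = toy_split(1000, seed=777)
d, _, _ = toy_split(1000, seed=777)
assert (c == d).all()
```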

PS: I used the PyCharm debugger to verify the above call chain.
