Description
@wiederm mentioned being interested in a smaller ANI2x dataset (larger than our testing set) for examining training.
@jchodera suggested limiting it to molecules containing only C, H, and O, which I think is a good idea. This would allow us to compare more directly with PhAlkEthOH.
PhAlkEthOH has 12,271 unique molecules; ANI2x has 16,514. I'm not sure how many ANI2x molecules contain only C, H, and O, but if that number is smaller than the PhAlkEthOH count, we can subset PhAlkEthOH to match.
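As a rough sketch of how the C, H, O count could be obtained, the snippet below filters molecules by their atomic numbers. It assumes each molecule is available as a sequence of atomic numbers keyed by some identifier; the `records` mapping and the function names are hypothetical and do not reflect the actual ANI2x file layout.

```python
# Atomic numbers allowed in the reduced dataset: H, C, O.
ALLOWED_ELEMENTS = {1, 6, 8}


def is_cho_only(atomic_numbers):
    """Return True if every atom in the molecule is H, C, or O."""
    return {int(z) for z in atomic_numbers} <= ALLOWED_ELEMENTS


def filter_cho(records):
    """Keep only molecules made exclusively of H, C, and O.

    `records` is a hypothetical mapping of molecule name -> atomic numbers.
    """
    return {name: z for name, z in records.items() if is_cho_only(z)}


if __name__ == "__main__":
    # Toy input standing in for the real dataset (made-up molecules).
    records = {
        "methane": [6, 1, 1, 1, 1],
        "water": [8, 1, 1],
        "ammonia": [7, 1, 1, 1],
    }
    subset = filter_cho(records)
    print(f"{len(subset)} of {len(records)} molecules contain only C, H, O")
```

The length of the filtered mapping gives the number to compare against the 12,271 molecules in PhAlkEthOH.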
It might be interesting to see the overlap of these datasets. The ANI2x dataset does not contain SMILES strings for the molecules, but we could probably still make other relevant comparisons. I think something as simple as looking at the overlap of molecular weight (since we are limited to C, H, and O) would probably be good. We could also do this as two plots: one for molecules with O and one for molecules without O.
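A minimal sketch of the molecular weight comparison, assuming each molecule is given as a list of atomic numbers. The function names are hypothetical, the masses are standard atomic weights rounded to three decimals, and histogram intersection is just one possible overlap measure, not something agreed on in this issue.

```python
# Standard atomic weights (g/mol), rounded, for H, C, O only.
ATOMIC_MASS = {1: 1.008, 6: 12.011, 8: 15.999}


def molecular_weight(atomic_numbers):
    """Sum atomic masses; raises KeyError for elements other than H, C, O."""
    return sum(ATOMIC_MASS[int(z)] for z in atomic_numbers)


def contains_oxygen(atomic_numbers):
    """Used to split molecules into the 'with O' and 'without O' groups."""
    return any(int(z) == 8 for z in atomic_numbers)


def histogram_overlap(weights_a, weights_b, n_bins=50):
    """Histogram intersection of two weight distributions, in [0, 1].

    Both sets are binned on a shared range and normalized to sum to 1;
    the overlap is the sum of the bin-wise minima.
    """
    lo = min(min(weights_a), min(weights_b))
    hi = max(max(weights_a), max(weights_b))
    width = (hi - lo) / n_bins or 1.0  # avoid zero width if all weights match

    def normalized_counts(weights):
        counts = [0] * n_bins
        for w in weights:
            # Clamp so the maximum value falls in the last bin.
            idx = min(int((w - lo) / width), n_bins - 1)
            counts[idx] += 1
        return [c / len(weights) for c in counts]

    pa = normalized_counts(weights_a)
    pb = normalized_counts(weights_b)
    return sum(min(a, b) for a, b in zip(pa, pb))
```

To produce the two comparisons, one could split each dataset with `contains_oxygen`, compute `molecular_weight` for every molecule in each group, and call `histogram_overlap` once for the with-O groups and once for the without-O groups. The same binned weights could feed the two plots.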