[Dataset] Add BigSolDB solubility dataset #10698

Open
levakrasnovs wants to merge 4 commits into pyg-team:master from levakrasnovs:add-bigsoldb-dataset

@levakrasnovs

Description

Adds the BigSolDB dataset (torch_geometric.datasets.BigSolDB), a large-scale experimental solubility database covering organic compounds in diverse solvents across a wide temperature range.

Paper: BigSolDB 2.0, Scientific Data 2025
Zenodo: https://doi.org/10.5281/zenodo.18552681

Stats

  • 112,465 experimental solubility values
  • 1,525 unique organic compounds
  • 218 solvents (water + organic)
  • Temperature range: 243-425 K
  • Extracted from 1,687 peer-reviewed articles

Why BigSolDB?

BigSolDB is roughly an order of magnitude larger than the existing AQSOL dataset
(112,465 vs. 9,800 datapoints) and more diverse: it covers 218 solvents, whereas
AQSOL is aqueous-only. This makes it suitable for multi-solvent solubility
prediction, an important task in drug formulation, synthesis, and crystallization.

New files

  • torch_geometric/datasets/bigsoldb.py — dataset class
  • test/datasets/test_bigsoldb.py — unit tests (8 tests)

Notes

  • No official train/val/test split is provided (unlike AQSOL). Users are encouraged
    to apply scaffold-based or cold-solvent splits depending on their evaluation protocol.
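  • The solvent graph is stored in the attributes x_solvent, edge_index_solvent, and edge_attr_solvent.
  • Pass follow_batch=['x_solvent'] to DataLoader so that each batch also carries an
    x_solvent_batch vector mapping solvent nodes to their sample; this is needed for
    correct solvent pooling.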
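
Since no official split ships with the dataset, a cold-solvent split could be sketched roughly as below. This is only an illustration in plain Python, not part of the PR: the helper name cold_solvent_split is hypothetical, and it assumes you can obtain one solvent identifier (for example a SMILES string) per sample in dataset order. How BigSolDB exposes that identifier is not specified here.

```python
import random
from collections import defaultdict


def cold_solvent_split(solvents, frac_test=0.2, seed=0):
    """Split sample indices so that no solvent appears in both train and test.

    `solvents` holds one solvent identifier per sample, in dataset order.
    (Hypothetical helper; not part of torch_geometric.)
    """
    # Group sample indices by solvent.
    by_solvent = defaultdict(list)
    for idx, solvent in enumerate(solvents):
        by_solvent[solvent].append(idx)

    # Shuffle the unique solvents reproducibly.
    unique = sorted(by_solvent)
    random.Random(seed).shuffle(unique)

    # Hold out whole solvents until the test set reaches the target size;
    # every remaining solvent goes to the training set.
    target = frac_test * len(solvents)
    train_idx, test_idx = [], []
    for solvent in unique:
        bucket = test_idx if len(test_idx) < target else train_idx
        bucket.extend(by_solvent[solvent])
    return sorted(train_idx), sorted(test_idx)


# Toy example: six measurements in three solvents (water, ethanol, acetone).
solvents = ["O", "O", "CCO", "CCO", "CC(C)=O", "O"]
train_idx, test_idx = cold_solvent_split(solvents, frac_test=0.3)
```

Because whole solvents are held out, the test fraction is only approximate. The resulting index lists can then be used to index the dataset, e.g. dataset[train_idx] and dataset[test_idx].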

Impact

  • 3,800+ downloads on Zenodo across all versions
  • Cited in peer-reviewed works across top venues:
  1. Attia et al., Nature Communications 2025 (https://doi.org/10.1038/s41467-025-62717-7)
  2. Al Ibrahim et al., JACS 2025 (https://doi.org/10.1021/jacs.5c13746)
  3. Ramani et al., JCTC 2024 (https://doi.org/10.1021/acs.jctc.4c00382)
  4. Krzyżanowski et al., Digital Discovery 2025 (https://doi.org/10.1039/D5DD00134J)
  5. Guo et al., JCIM 2025 (https://doi.org/10.1021/acs.jcim.5c00781)
  6. Gopichand et al., ACS Omega 2025 (https://doi.org/10.1021/acsomega.5c13630)
  7. Broadbent et al., ICLR 2026 Workshop (https://openreview.net/forum?id=znY51c4s04)
  8. Pimonova et al., Digital Discovery 2026 (https://doi.org/10.1039/D5DD00443H)
  9. Fan et al., Digital Discovery 2026 (https://doi.org/10.1039/D5DD00407A)
  10. Da Vià et al., Org. Process Res. Dev. 2025 (https://doi.org/10.1021/acs.oprd.4c00384)
  11. Gomes et al., JCIM 2026 (https://doi.org/10.1021/acs.jcim.6c00606)
  12. Wang et al., Digital Discovery 2026 (https://doi.org/10.1039/D5DD00456J)
