diff --git a/README.md b/README.md
Lines changed: 2 additions & 2 deletions
diff --git a/docs/source/installation.md b/docs/source/installation.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/source/quickstart.md b/docs/source/quickstart.md
Lines changed: 6 additions & 6 deletions
diff --git a/docs/source/weight.rst b/docs/source/weight.rst
Lines changed: 1 addition & 1 deletion
diff --git a/requirements.txt b/requirements.txt
Lines changed: 5 additions & 5 deletions
diff --git a/setup.py b/setup.py
Lines changed: 6 additions & 6 deletions
diff --git a/tests/conftest.py b/tests/conftest.py
Lines changed: 27 additions & 0 deletions
diff --git a/tests/test_classification.py b/tests/test_classification.py
Lines changed: 58 additions & 0 deletions
diff --git a/tests/test_conformer.py b/tests/test_conformer.py
Lines changed: 39 additions & 0 deletions
diff --git a/tests/test_datareader.py b/tests/test_datareader.py
Lines changed: 47 additions & 0 deletions
@@ -53,13 +53,13 @@ python setup.py install
 
 The UniMol pretrained models can be found at [dptech/Uni-Mol-Models](https://huggingface.co/dptech/Uni-Mol-Models/tree/main).
 
-If the download is slow, you can use other mirrors, such as:
+If the download is slow, you can use a mirror, such as:
 
 ```bash
 export HF_ENDPOINT=https://hf-mirror.com
 ```
 
-Setting the `HF_ENDPOINT` environment variable specifies the mirror address for the Hugging Face Hub to use when downloading models.
+By default `unimol_tools` first tries the official Hugging Face endpoint. If that fails and `HF_ENDPOINT` is not set, it automatically retries using `https://hf-mirror.com`. Set `HF_ENDPOINT` yourself if you want to explicitly choose a mirror or the official site.
 
 ### Modify the default directory for weights
 
 
@@ -42,7 +42,7 @@ If the download is slow, you can use other mirrors, such as:
 export HF_ENDPOINT=https://hf-mirror.com
 ```
 
-Setting the `HF_ENDPOINT` environment variable specifies the mirror address for the Hugging Face Hub to use when downloading models.
+By default `unimol_tools` first tries the official Hugging Face endpoint. If that fails and `HF_ENDPOINT` is not set, it automatically retries using `https://hf-mirror.com`. Set `HF_ENDPOINT` to use a specific endpoint.
 
 ## Bohrium notebook
 
 
@@ -187,11 +187,11 @@ export MASTER_PORT='19198'
 Currently unimol_tools supports five types of fine-tuning tasks: `classification`, `regression`, `multiclass`, `multilabel_classification`, `multilabel_regression`.
 
 The datasets used in the examples are all open source and available, including
-- Ames mutagenicity. The dataset includes 6512 compounds and corresponding binary labels from Ames Mutagenicity results.
-- ESOL (delaney) is a standard regression dataset containing structures and water solubility data for 1128 compounds.
-- Tox21 Data Challenge 2014 is designed to help scientists understand the potential of the chemicals and compounds being tested through the Toxicology in the 21st Century initiative to disrupt biological pathways in ways that may result in toxic effects, which includes 12 date sets. The official web site is https://tripod.nih.gov/tox21/challenge/
-- Solvation free energy (FreeSolv). SMILES are provided.
-- Vector-QM24 (VQM24) dataset. Quantum chemistry dataset of ~836 thousand small organic and inorganic molecules.
+- Ames mutagenicity. The dataset includes 6512 compounds and corresponding binary labels from Ames Mutagenicity results. The dataset is available at https://weilab.math.msu.edu/DataLibrary/2D/.
+- ESOL (delaney) is a standard regression dataset containing structures and water solubility data for 1128 compounds. The dataset is available at https://weilab.math.msu.edu/DataLibrary/2D/ and https://huggingface.co/datasets/HR-machine/ESol.
+- Tox21 Data Challenge 2014 is designed to help scientists understand the potential of the chemicals and compounds being tested through the Toxicology in the 21st Century initiative to disrupt biological pathways in ways that may result in toxic effects; it includes 12 data sets. The official website is https://tripod.nih.gov/tox21/challenge/. The datasets are available at https://moleculenet.org/datasets-1 and https://www.kaggle.com/datasets/maksiamiogan/tox21-dataset.
+- Solvation free energy (FreeSolv). SMILES are provided. The dataset is available at https://weilab.math.msu.edu/DataLibrary/2D/.
+- Vector-QM24 (VQM24) dataset. Quantum chemistry dataset of ~836 thousand small organic and inorganic molecules. The dataset is available at https://zenodo.org/records/15442257.
 
 ### Example of classification
 You can use a dictionary as input. The default smiles column name is **'SMILES'** and the target column name is **'target'**. You can also customize it with `smiles_col` and `target_cols`.
@@ -411,7 +411,7 @@ predictor = MolPredict(load_model='./exp')
 pred = predictor.predict(test_df_dict['smiles'])
 ```
 
-It also supports directly using the sdf file path as input.
+It also supports directly using the sdf file path as input. The following example instead reads the file in advance so that missing values can be preprocessed.
 
 ```python
 from unimol_tools import MolTrain, MolPredict
 
@@ -22,7 +22,7 @@ If the download is slow, you can use other mirrors, such as:
 
     export HF_ENDPOINT=https://hf-mirror.com
 
-Setting the ``HF_ENDPOINT`` environment variable specifies the mirror address for the Hugging Face Hub to use when downloading models.
+By default ``unimol_tools`` first tries the official Hugging Face endpoint. If that fails and ``HF_ENDPOINT`` is not set, it automatically retries with ``https://hf-mirror.com``. Set the variable yourself to choose a specific endpoint.
 
 `unimol_tools.weights.weight_hub.py <https://github.com/deepmodeling/unimol_tools/blob/main/unimol_tools/weights/weighthub.py>`_ control the logger.
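
The endpoint fallback described in the updated docs can be pictured roughly as follows. This is only a minimal sketch, not the code in `weighthub.py`: `download_with_fallback` and the injected `fetch` callable are hypothetical names, and the sketch assumes that a user-set `HF_ENDPOINT` is used as-is with no automatic retry.

```python
import os
from typing import Callable, Mapping, Optional

OFFICIAL_ENDPOINT = "https://huggingface.co"
MIRROR_ENDPOINT = "https://hf-mirror.com"


def download_with_fallback(
    fetch: Callable[[str], str],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Hypothetical helper: try the official endpoint, then the mirror.

    `fetch` stands in for whatever performs the real download; it takes
    an endpoint URL and either returns a result or raises on failure.
    """
    env = os.environ if env is None else env
    user_endpoint = env.get("HF_ENDPOINT")
    if user_endpoint:
        # Assumption: an explicit choice is respected, with no fallback.
        return fetch(user_endpoint)
    try:
        return fetch(OFFICIAL_ENDPOINT)
    except Exception:
        # HF_ENDPOINT is unset, so retry once against the mirror.
        return fetch(MIRROR_ENDPOINT)
```

Passing the download step in as a callable keeps the decision logic testable without network access.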
 
 
@@ -1,9 +1,9 @@
-numpy==1.22.4
-pandas==1.4.0
-scikit-learn==1.5.0
-torch
+numpy>=2.0.0
+pandas>=2.2.2
+scikit-learn>=1.5.0
+torch>=2.4.0
 joblib
-rdkit
+rdkit>=2024.3.4
 pyyaml
 addict
 tqdm
@@ -23,18 +23,18 @@
         ],
     ),
     install_requires=[
-        "numpy<2.0.0,>=1.22.4",
-        "pandas<2.0.0",
-        "torch",
+        "numpy<2.3.0,>=2.0.0",
+        "pandas>=2.2.2",
+        "torch>=2.4.0",
         "joblib",
-        "rdkit",
+        "rdkit>=2024.3.4",
         "pyyaml",
         "addict",
-        "scikit-learn",
+        "scikit-learn>=1.5.0",
         "numba",
         "tqdm",
     ],
-    python_requires=">=3.6",
+    python_requires=">=3.9",
     include_package_data=True,
     classifiers=[
         "Development Status :: 5 - Production/Stable",
 
@@ -0,0 +1,27 @@
+import os
+import pytest
+
+@pytest.fixture(scope="session", autouse=True)
+def set_unimol_weight_dir(tmp_path_factory):
+    """Ensure UNIMOL_WEIGHT_DIR is set to a temporary directory for tests."""
+    weight_dir = tmp_path_factory.mktemp("weights")
+    original = os.environ.get("UNIMOL_WEIGHT_DIR")
+    os.environ["UNIMOL_WEIGHT_DIR"] = str(weight_dir)
+    yield
+    if original is None:
+        os.environ.pop("UNIMOL_WEIGHT_DIR", None)
+    else:
+        os.environ["UNIMOL_WEIGHT_DIR"] = original
+
+
+def pytest_addoption(parser):
+    parser.addoption("--run-network", action="store_true", help="run tests that need network")
+
+
+def pytest_collection_modifyitems(config, items):
+    if config.getoption("--run-network"):
+        return
+    skip_marker = pytest.mark.skip(reason="need --run-network to run")
+    for item in items:
+        if "network" in item.keywords:
+            item.add_marker(skip_marker)
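
The tests in this change use `@pytest.mark.network`, but no hunk shown here registers that marker. Unless it is declared elsewhere (for example in `pytest.ini` or `pyproject.toml`, which this diff does not show), pytest emits a `PytestUnknownMarkWarning` and fails under `--strict-markers`. One possible addition to `conftest.py`, given as a sketch with an assumed description string:

```python
def pytest_configure(config):
    """Register the custom `network` marker so pytest recognizes it."""
    config.addinivalue_line(
        "markers",
        # The wording of this description is an assumption, not from the PR.
        "network: test needs network access (enable with --run-network)",
    )
```

With the marker registered, `pytest --markers` lists it alongside the built-in ones.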
@@ -0,0 +1,58 @@
+import os
+import zipfile
+import pandas as pd
+import pytest
+from utils_net import download_for_test
+
+from unimol_tools import MolTrain, MolPredict
+
+DATA_URL = 'https://weilab.math.msu.edu/DataLibrary/2D/Downloads/Ames_smi.zip'
+
+
+@pytest.mark.network
+def test_classification_train_predict(tmp_path):
+    # ensure any pretrained weights are written to a temporary directory
+    os.environ.setdefault('UNIMOL_WEIGHT_DIR', str(tmp_path / 'weights'))
+    zip_path = tmp_path / 'Ames_smi.zip'
+    download_for_test(
+        DATA_URL,
+        zip_path,
+        timeout=(5, 60),
+        max_retries=5,
+        backoff_factor=0.5,
+        allow_resume=True,
+        skip_on_failure=True,
+    )
+    with zipfile.ZipFile(zip_path, 'r') as zf:
+        zf.extractall(tmp_path)
+    csv_path = tmp_path / 'Ames.csv'
+    if not csv_path.exists():
+        pytest.skip('Dataset missing after extraction')
+    df = pd.read_csv(csv_path)
+    df = df.drop(columns=['CAS_NO']).rename(columns={'Activity': 'target'})
+    # take 100 samples for testing
+    df = df.sample(n=100, random_state=42)
+    train_df = df.sample(frac=0.8, random_state=42)
+    test_df = df.drop(train_df.index)
+    train_data = train_df.to_dict(orient='list')
+    test_smiles = test_df['Canonical_Smiles'].tolist()
+
+    exp_dir = tmp_path / 'exp'
+    clf = MolTrain(
+        task='classification',
+        data_type='molecule',
+        epochs=1,
+        batch_size=2,
+        kfold=2,
+        metrics='auc',
+        smiles_col='Canonical_Smiles',
+        save_path=str(exp_dir),
+    )
+    try:
+        clf.fit(train_data)
+    except Exception as e:
+        pytest.skip(f"Training failed: {e}")
+
+    predictor = MolPredict(load_model=str(exp_dir))
+    preds = predictor.predict(test_smiles)
+    assert len(preds) == len(test_smiles)
@@ -0,0 +1,39 @@
+import numpy as np
+from unimol_tools.data.conformer import (
+    inner_coords,
+    coords2unimol,
+    inner_smi2coords,
+    create_mol_from_atoms_and_coords,
+)
+from unimol_tools.data.dictionary import Dictionary
+
+
+def test_inner_coords_and_coords2unimol():
+    atoms = ['C', 'H', 'O']
+    coords = [[0, 0, 0], [0, 0, 1], [1, 0, 0]]
+    no_h_atoms, no_h_coords = inner_coords(atoms, coords, remove_hs=True)
+    assert 'H' not in no_h_atoms
+    d = Dictionary()
+    for a in ['C', 'O']:
+        if a not in d:
+            d.add_symbol(a)
+    feat = coords2unimol(no_h_atoms, no_h_coords, d)
+    assert feat['src_tokens'].dtype == int
+    assert feat['src_coord'].shape[1] == 3
+
+
+def test_inner_smi2coords_returns_mol():
+    mol = inner_smi2coords('CC', return_mol=True)
+    from rdkit.Chem import Mol
+
+    assert isinstance(mol, Mol)
+
+
+def test_create_mol_from_atoms_and_coords():
+    atoms = ['C', 'O']
+    coords = [[0, 0, 0], [1, 0, 0]]
+    mol = create_mol_from_atoms_and_coords(atoms, coords)
+    from rdkit.Chem import Mol
+
+    assert isinstance(mol, Mol)
+    assert mol.GetNumAtoms() == 2
@@ -0,0 +1,47 @@
+import pandas as pd
+import numpy as np
+import pytest
+from unimol_tools.data.datareader import MolDataReader
+
+
+def test_read_data_from_smiles_list():
+    smiles = ["CCO", "C"]
+    reader = MolDataReader()
+    result = reader.read_data(smiles)
+    assert result["smiles"] == smiles
+    assert len(result["scaffolds"]) == len(smiles)
+    assert result["raw_data"].shape[0] == len(smiles)
+
+
+def test_check_smiles_behavior():
+    reader = MolDataReader()
+    # invalid SMILES should return False during training when not strict
+    assert reader.check_smiles("invalid", is_train=True, smi_strict=False) is False
+    # invalid SMILES should raise in strict mode
+    with pytest.raises(ValueError):
+        reader.check_smiles("invalid", is_train=True, smi_strict=True)
+
+
+def test_convert_numeric_columns():
+    from rdkit import Chem
+    df = pd.DataFrame({
+        "ROMol": [Chem.MolFromSmiles("CCO")],
+        "num": ["1"],
+        "alpha": ["a"],
+    })
+    reader = MolDataReader()
+    out = reader._convert_numeric_columns(df.copy())
+    assert pd.api.types.is_numeric_dtype(out["num"])
+    assert not pd.api.types.is_numeric_dtype(out["alpha"])
+    assert out["ROMol"].iloc[0] == df["ROMol"].iloc[0]
+
+
+def test_anomaly_clean_regression():
+    df = pd.DataFrame({
+        "SMILES": ["C"] * 11,
+        "TARGET": [1] * 10 + [100],
+    })
+    reader = MolDataReader()
+    cleaned = reader.anomaly_clean_regression(df, ["TARGET"])
+    assert 100 not in cleaned["TARGET"].values
+    assert len(cleaned) == 10
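
The expectation in `test_anomaly_clean_regression` is consistent with a three-sigma outlier filter, although the diff does not show how `anomaly_clean_regression` is implemented. The sketch below assumes that rule and uses a hypothetical `drop_outliers` helper on a plain list to show why the value 100 is dropped.

```python
import statistics


def drop_outliers(values, n_sigma=3.0):
    """Hypothetical helper: keep values within n_sigma sample std devs of the mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (ddof=1)
    return [v for v in values if abs(v - mean) <= n_sigma * std]


# Same targets as the test: ten 1s and a single 100.
cleaned = drop_outliers([1] * 10 + [100])
```

Here the mean is 10 and the sample standard deviation is about 29.85, so the threshold is about 89.5. The value 100 deviates by 90 and is removed, while each 1 deviates by only 9 and is kept.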