
Commit c211dac

Merge pull request #452 from datamol-io/baselines

Added ultralarge baselines

2 parents ce45285 + 6eb7906, commit c211dac

1 file changed: +58 −15 lines changed

docs/baseline.md

Lines changed: 58 additions & 15 deletions
@@ -7,19 +7,19 @@ One can observe that the smaller datasets (`Zinc12k` and `Tox21`) beneficiate fr

| Dataset | Model | MAE ↓ | Pearson ↑ | R² ↑ | MAE ↓ | Pearson ↑ | R² ↑ |
|-----------|-------|-----------|-----------|-----------|---------|-----------|---------|
| | <th colspan="3" style="text-align: center;">Single-Task Model</th> <th colspan="3" style="text-align: center;">Multi-Task Model</th> |
|
| **QM9** | GCN | 0.102 ± 0.0003 | 0.958 ± 0.0007 | 0.920 ± 0.002 | 0.119 ± 0.01 | 0.955 ± 0.001 | 0.915 ± 0.001 |
| | GIN | 0.0976 ± 0.0006 | **0.959 ± 0.0002** | **0.922 ± 0.0004** | 0.117 ± 0.01 | 0.950 ± 0.002 | 0.908 ± 0.003 |
| | GINE | **0.0959 ± 0.0002** | 0.955 ± 0.002 | 0.918 ± 0.004 | 0.102 ± 0.01 | 0.956 ± 0.0009 | 0.918 ± 0.002 |
|
| **Zinc12k** | GCN | 0.348 ± 0.02 | 0.941 ± 0.002 | 0.863 ± 0.01 | 0.226 ± 0.004 | 0.973 ± 0.0005 | 0.940 ± 0.003 |
| | GIN | 0.303 ± 0.007 | 0.950 ± 0.003 | 0.889 ± 0.003 | 0.189 ± 0.004 | 0.978 ± 0.006 | 0.953 ± 0.002 |
| | GINE | 0.266 ± 0.02 | 0.961 ± 0.003 | 0.915 ± 0.01 | **0.147 ± 0.009** | **0.987 ± 0.001** | **0.971 ± 0.003** |

| | | BCE ↓ | AUROC ↑ | AP ↑ | BCE ↓ | AUROC ↑ | AP ↑ |
|-----------|-------|-----------|-----------|-----------|---------|-----------|---------|
| | <th colspan="3" style="text-align: center;">Single-Task Model</th> <th colspan="3" style="text-align: center;">Multi-Task Model</th> |
|
| **Tox21** | GCN | 0.202 ± 0.005 | 0.773 ± 0.006 | 0.334 ± 0.03 | **0.176 ± 0.001** | **0.850 ± 0.006** | 0.446 ± 0.01 |
| | GIN | 0.200 ± 0.002 | 0.789 ± 0.009 | 0.350 ± 0.01 | 0.176 ± 0.001 | 0.841 ± 0.005 | 0.454 ± 0.009 |
| | GINE | 0.201 ± 0.007 | 0.783 ± 0.007 | 0.345 ± 0.02 | 0.177 ± 0.0008 | 0.836 ± 0.004 | **0.455 ± 0.008** |
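As a reference for the column headers above, the per-task metrics could be computed roughly as follows. This is a minimal sketch built on `scipy`/`scikit-learn` rather than the repository's own metric wrappers, and the function names are purely illustrative; for a multi-label task such as Tox21 the scores would typically also be averaged over labels.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import (
    average_precision_score,
    log_loss,
    mean_absolute_error,
    r2_score,
    roc_auc_score,
)

def regression_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """MAE (lower is better), Pearson and R² (higher is better)."""
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "Pearson": pearsonr(y_true, y_pred)[0],
        "R2": r2_score(y_true, y_pred),
    }

def classification_metrics(y_true: np.ndarray, y_prob: np.ndarray) -> dict:
    """BCE (lower is better), AUROC and average precision (higher is better)."""
    return {
        "BCE": log_loss(y_true, y_prob),
        "AUROC": roc_auc_score(y_true, y_prob),
        "AP": average_precision_score(y_true, y_prob),
    }
```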
@@ -36,11 +36,11 @@ While `PCQM4M_G25` has no noticeable changes, the node predictions of `PCQM4M_N4

| Dataset | Model | MAE ↓ | Pearson ↑ | R² ↑ | MAE ↓ | Pearson ↑ | R² ↑ |
|-----------|-------|-----------|-----------|-----------|---------|-----------|---------|
| | <th colspan="3" style="text-align: center;">Single-Task Model</th> <th colspan="3" style="text-align: center;">Multi-Task Model</th> |
|
| **Pcqm4m_g25** | GCN | 0.2362 ± 0.0003 | 0.8781 ± 0.0005 | 0.7803 ± 0.0006 | 0.2458 ± 0.0007 | 0.8701 ± 0.0002 | 0.8189 ± 0.0004 |
| | GIN | 0.2270 ± 0.0003 | 0.8854 ± 0.0004 | 0.7912 ± 0.0006 | 0.2352 ± 0.0006 | 0.8802 ± 0.0007 | 0.7827 ± 0.0005 |
| | GINE | **0.2223 ± 0.0007** | **0.8874 ± 0.0003** | **0.7949 ± 0.0001** | 0.2315 ± 0.0002 | 0.8823 ± 0.0002 | 0.7864 ± 0.0008 |
| **Pcqm4m_n4** | GCN | 0.2080 ± 0.0003 | 0.5497 ± 0.0010 | 0.2942 ± 0.0007 | 0.2040 ± 0.0001 | 0.4796 ± 0.0006 | 0.2185 ± 0.0002 |
| | GIN | 0.1912 ± 0.0027 | **0.6138 ± 0.0088** | **0.3688 ± 0.0116** | 0.1966 ± 0.0003 | 0.5198 ± 0.0008 | 0.2602 ± 0.0012 |
| | GINE | **0.1910 ± 0.0001** | 0.6127 ± 0.0003 | 0.3666 ± 0.0008 | 0.1941 ± 0.0003 | 0.5303 ± 0.0023 | 0.2701 ± 0.0034 |
@@ -49,13 +49,13 @@ While `PCQM4M_G25` has no noticeable changes, the node predictions of `PCQM4M_N4

|-----------|-------|-----------|-----------|-----------|---------|-----------|---------|
| | <th colspan="3" style="text-align: center;">Single-Task Model</th> <th colspan="3" style="text-align: center;">Multi-Task Model</th> |
| <hi> | <hi> | <hi> | <hi> | <hi> | <hi> | <hi> | <hi> |
| **Pcba\_1328** | GCN | **0.0316 ± 0.0000** | **0.7960 ± 0.0020** | **0.3368 ± 0.0027** | 0.0349 ± 0.0002 | 0.7661 ± 0.0031 | 0.2527 ± 0.0041 |
| | GIN | 0.0324 ± 0.0000 | 0.7941 ± 0.0018 | 0.3328 ± 0.0019 | 0.0342 ± 0.0001 | 0.7747 ± 0.0025 | 0.2650 ± 0.0020 |
| | GINE | 0.0320 ± 0.0001 | 0.7944 ± 0.0023 | 0.3337 ± 0.0027 | 0.0341 ± 0.0001 | 0.7737 ± 0.0007 | 0.2611 ± 0.0043 |
| **L1000\_vcap** | GCN | 0.1900 ± 0.0002 | 0.5788 ± 0.0034 | 0.3708 ± 0.0007 | 0.1872 ± 0.0020 | 0.6362 ± 0.0012 | 0.4022 ± 0.0008 |
| | GIN | 0.1909 ± 0.0005 | 0.5734 ± 0.0029 | 0.3731 ± 0.0014 | 0.1870 ± 0.0010 | 0.6351 ± 0.0014 | 0.4062 ± 0.0001 |
| | GINE | 0.1907 ± 0.0006 | 0.5708 ± 0.0079 | 0.3705 ± 0.0015 | **0.1862 ± 0.0007** | **0.6398 ± 0.0043** | **0.4068 ± 0.0023** |
| **L1000\_mcf7** | GCN | 0.1869 ± 0.0003 | 0.6123 ± 0.0051 | 0.3866 ± 0.0010 | 0.1863 ± 0.0011 | **0.6401 ± 0.0021** | 0.4194 ± 0.0004 |
| | GIN | 0.1862 ± 0.0003 | 0.6202 ± 0.0091 | 0.3876 ± 0.0017 | 0.1874 ± 0.0013 | 0.6367 ± 0.0066 | **0.4198 ± 0.0036** |
| | GINE | **0.1856 ± 0.0005** | 0.6166 ± 0.0017 | 0.3892 ± 0.0035 | 0.1873 ± 0.0009 | 0.6347 ± 0.0048 | 0.4177 ± 0.0024 |
@@ -67,27 +67,70 @@ This is not surprising as they contain two orders of magnitude more datapoints a

| | | CE or MSE loss in single-task $\downarrow$ | CE or MSE loss in multi-task $\downarrow$ |
|------------|-------|-----------------------------------------|-----------------------------------------|
|
| **Pcqm4m\_g25** | GCN | **0.2660 ± 0.0005** | 0.2767 ± 0.0015 |
| | GIN | **0.2439 ± 0.0004** | 0.2595 ± 0.0016 |
| | GINE | **0.2424 ± 0.0007** | 0.2568 ± 0.0012 |
|
| **Pcqm4m\_n4** | GCN | **0.2515 ± 0.0002** | 0.2613 ± 0.0008 |
| | GIN | **0.2317 ± 0.0003** | 0.2512 ± 0.0008 |
| | GINE | **0.2272 ± 0.0001** | 0.2483 ± 0.0004 |
|
| **Pcba\_1328** | GCN | **0.0284 ± 0.0010** | 0.0382 ± 0.0005 |
| | GIN | **0.0249 ± 0.0017** | 0.0359 ± 0.0011 |
| | GINE | **0.0258 ± 0.0017** | 0.0361 ± 0.0008 |
|
| **L1000\_vcap** | GCN | 0.1906 ± 0.0036 | **0.1854 ± 0.0148** |
| | GIN | 0.1854 ± 0.0030 | **0.1833 ± 0.0185** |
| | GINE | **0.1860 ± 0.0025** | 0.1887 ± 0.0200 |
|
| **L1000\_mcf7** | GCN | 0.1902 ± 0.0038 | **0.1829 ± 0.0095** |
| | GIN | 0.1873 ± 0.0033 | **0.1701 ± 0.0142** |
| | GINE | 0.1883 ± 0.0039 | **0.1771 ± 0.0010** |
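The single-task and multi-task columns above compare the same per-task objectives (cross-entropy or mean-squared error) trained in isolation versus jointly. A multi-task objective of this kind could be assembled as in the sketch below; the task names, the per-task loss choices, the plain summation, and the NaN-masking of missing labels are assumptions for illustration, not the repository's actual implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative mapping from task head to loss type; the actual per-task
# objectives used for these baselines are not specified here.
TASK_LOSS = {
    "pcqm4m_g25": "mse",   # graph-level regression labels
    "pcqm4m_n4": "mse",    # node-level regression labels
    "pcba_1328": "bce",    # binary assay labels
}

def multitask_loss(preds: dict[str, torch.Tensor],
                   targets: dict[str, torch.Tensor]) -> torch.Tensor:
    """Sum the per-task CE/MSE losses over all task heads in the batch.

    Missing labels are assumed to be encoded as NaN and are masked out.
    """
    per_task = []
    for task, y_hat in preds.items():
        y = targets[task]
        mask = ~torch.isnan(y)  # ignore sparse / missing labels
        if TASK_LOSS[task] == "mse":
            per_task.append(F.mse_loss(y_hat[mask], y[mask]))
        else:
            per_task.append(F.binary_cross_entropy_with_logits(y_hat[mask], y[mask]))
    return torch.stack(per_task).sum()
```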
# UltraLarge Baseline

## UltraLarge test set metrics

For `UltraLarge`, we provide results for the same GNN baselines as for `LargeMix`. Each model is trained for 50 epochs and results are averaged over 3 seeds. The remaining setup is the same as for `ToyMix` (Section E.1): we report the same performance metrics for the single-dataset and multi-dataset settings, and we use models of the same size as for `LargeMix`.
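The "± " values in the tables below are the spread over those 3 seeds; a minimal aggregation helper could look like the following sketch (the example seed results, the metric, and the use of a sample standard deviation are assumptions, not taken from the repository).

```python
import numpy as np

def mean_pm_std(per_seed_values: list[float]) -> str:
    """Format per-seed results as the 'mean ± std' strings used in the tables.

    Whether the reported spread is a sample or population std is an assumption here.
    """
    vals = np.asarray(per_seed_values, dtype=float)
    return f"{vals.mean():.4f} ± {vals.std(ddof=1):.4f}"

# e.g. a hypothetical test MAE from 3 training seeds
print(mean_pm_std([0.2601, 0.2612, 0.2605]))   # -> "0.2606 ± 0.0006"
```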
For now, we report results only for a subset representing 5% of the total dataset due to computational constraints, but we aim to provide the full results soon.

**Results discussion.** `UltraLarge` results can be found in Table 6. Interestingly, on both graph- and node-level tasks we observe no advantage of multi-tasking in terms of performance. We expect that significantly larger models are needed to successfully leverage the multi-task setup on this ultra-large dataset; this could be attributed to underfitting, as already demonstrated for `LargeMix`. Nonetheless, our baselines set the stage for large-scale pre-training on `UltraLarge`.

The results presented used approximately 500 GPU hours of compute, with more compute used for development and hyperparameter search.

We further note that the graph-level task results are very strong. The node-level tasks, in contrast, are expected to underperform in the low-parameter regime due to clear signs of underfitting, the very large number of labels to learn, and the susceptibility of traditional GNNs to over-smoothing.
| Dataset | Model | MAE ↓ | Pearson ↑ | R² ↑ | MAE ↓ | Pearson ↑ | R² ↑ |
|------------------|-------|-------------------|-------------------|-------------------|-------------------|-------------------|-------------------|
| | <th colspan="3" style="text-align: center;">Single-Task Model</th> <th colspan="3" style="text-align: center;">Multi-Task Model</th> |
| <hi> | <hi> | <hi> | <hi> | <hi> | <hi> | <hi> | <hi> |
| **Pm6_83m_g62** | GCN | 0.2606 ± 0.0011 | 0.9004 ± 0.0003 | 0.7997 ± 0.0009 | 0.2625 ± 0.0011 | 0.8896 ± 0.0001 | 0.7982 ± 0.0001 |
| | GIN | 0.2546 ± 0.0021 | 0.9051 ± 0.0019 | 0.8064 ± 0.0037 | 0.2562 ± 0.0000 | 0.8901 ± 0.0000 | 0.806 ± 0.0000 |
| | GINE | **0.2538 ± 0.0006** | **0.9059 ± 0.0010** | **0.8082 ± 0.0015** | 0.258 ± 0.0011 | 0.904 ± 0.0000 | 0.8048 ± 0.0001 |
|
| **Pm6_83m_n7** | GCN | 0.5803 ± 0.0001 | 0.3372 ± 0.0004 | 0.1191 ± 0.0002 | 0.5971 ± 0.0002 | 0.3164 ± 0.0001 | 0.1019 ± 0.0011 |
| | GIN | 0.573 ± 0.0002 | 0.3478 ± 0.0001 | **0.1269 ± 0.0002** | 0.5831 ± 0.0001 | 0.3315 ± 0.0005 | 0.1141 ± 0.0000 |
| | GINE | **0.572 ± 0.0004** | **0.3487 ± 0.0002** | 0.1266 ± 0.0001 | 0.5839 ± 0.0004 | 0.3294 ± 0.0002 | 0.1104 ± 0.0000 |
## UltraLarge training set loss

In the table below, we observe that the multi-task model underfits only slightly compared to the single-task model, indicating that parameters can be shared efficiently between the node-level and graph-level tasks. We further note that the training loss and the test MAE are almost equal for all tasks, suggesting further benefits as we scale both the model and the data.
| | | **MAE loss in single-task ↓** | **MAE loss in multi-task ↓** |
|------------------|-------|---------------------------|--------------------------|
|
| **Pm6_83m_g62** | GCN | **0.2679 ± 0.0020** | 0.2713 ± 0.0017 |
| | GIN | **0.2582 ± 0.0018** | 0.2636 ± 0.0014 |
| | GINE | **0.2567 ± 0.0036** | 0.2603 ± 0.0021 |
|
| **Pm6_83m_n7** | GCN | **0.5818 ± 0.0021** | 0.5955 ± 0.0023 |
| | GIN | **0.5707 ± 0.0019** | 0.5851 ± 0.0038 |
| | GINE | **0.5724 ± 0.0015** | 0.5832 ± 0.0027 |