
Commit c211dac

Merge pull request #452 from datamol-io/baselines

Added ultralarge baselines

2 parents ce45285 + 6eb7906, commit c211dac

1 file changed: +58 −15 lines changed

docs/baseline.md

Lines changed: 58 additions & 15 deletions
@@ -7,19 +7,19 @@ One can observe that the smaller datasets (`Zinc12k` and `Tox21`) beneficiate fr

| Dataset | Model | MAE ↓ | Pearson ↑ | R² ↑ | MAE ↓ | Pearson ↑ | R² ↑ |
|-----------|-------|-----------|-----------|-----------|---------|-----------|---------|
| | <th colspan="3" style="text-align: center;">Single-Task Model</th> <th colspan="3" style="text-align: center;">Multi-Task Model</th> |
|
| **QM9** | GCN | 0.102 ± 0.0003 | 0.958 ± 0.0007 | 0.920 ± 0.002 | 0.119 ± 0.01 | 0.955 ± 0.001 | 0.915 ± 0.001 |
| | GIN | 0.0976 ± 0.0006 | **0.959 ± 0.0002** | **0.922 ± 0.0004** | 0.117 ± 0.01 | 0.950 ± 0.002 | 0.908 ± 0.003 |
| | GINE | **0.0959 ± 0.0002** | 0.955 ± 0.002 | 0.918 ± 0.004 | 0.102 ± 0.01 | 0.956 ± 0.0009 | 0.918 ± 0.002 |
|
| **Zinc12k** | GCN | 0.348 ± 0.02 | 0.941 ± 0.002 | 0.863 ± 0.01 | 0.226 ± 0.004 | 0.973 ± 0.0005 | 0.940 ± 0.003 |
| | GIN | 0.303 ± 0.007 | 0.950 ± 0.003 | 0.889 ± 0.003 | 0.189 ± 0.004 | 0.978 ± 0.006 | 0.953 ± 0.002 |
| | GINE | 0.266 ± 0.02 | 0.961 ± 0.003 | 0.915 ± 0.01 | **0.147 ± 0.009** | **0.987 ± 0.001** | **0.971 ± 0.003** |

| | | BCE ↓ | AUROC ↑ | AP ↑ | BCE ↓ | AUROC ↑ | AP ↑ |
|-----------|-------|-----------|-----------|-----------|---------|-----------|---------|
| | <th colspan="3" style="text-align: center;">Single-Task Model</th> <th colspan="3" style="text-align: center;">Multi-Task Model</th> |
|
| **Tox21** | GCN | 0.202 ± 0.005 | 0.773 ± 0.006 | 0.334 ± 0.03 | **0.176 ± 0.001** | **0.850 ± 0.006** | 0.446 ± 0.01 |
| | GIN | 0.200 ± 0.002 | 0.789 ± 0.009 | 0.350 ± 0.01 | 0.176 ± 0.001 | 0.841 ± 0.005 | 0.454 ± 0.009 |
| | GINE | 0.201 ± 0.007 | 0.783 ± 0.007 | 0.345 ± 0.02 | 0.177 ± 0.0008 | 0.836 ± 0.004 | **0.455 ± 0.008** |
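As a reference for the column headers above, the per-task metrics could be computed roughly as follows. This is a minimal sketch built on `scipy`/`scikit-learn` rather than the repository's own metric wrappers, and the function names are purely illustrative; for a multi-label task such as Tox21 the scores would typically also be averaged over labels.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import (
    average_precision_score,
    log_loss,
    mean_absolute_error,
    r2_score,
    roc_auc_score,
)

def regression_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """MAE (lower is better), Pearson and R² (higher is better)."""
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "Pearson": pearsonr(y_true, y_pred)[0],
        "R2": r2_score(y_true, y_pred),
    }

def classification_metrics(y_true: np.ndarray, y_prob: np.ndarray) -> dict:
    """BCE (lower is better), AUROC and average precision (higher is better)."""
    return {
        "BCE": log_loss(y_true, y_prob),
        "AUROC": roc_auc_score(y_true, y_prob),
        "AP": average_precision_score(y_true, y_prob),
    }
```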
@@ -36,11 +36,11 @@ While `PCQM4M_G25` has no noticeable changes, the node predictions of `PCQM4M_N4

| Dataset | Model | MAE ↓ | Pearson ↑ | R² ↑ | MAE ↓ | Pearson ↑ | R² ↑ |
|-----------|-------|-----------|-----------|-----------|---------|-----------|---------|
| | <th colspan="3" style="text-align: center;">Single-Task Model</th> <th colspan="3" style="text-align: center;">Multi-Task Model</th> |
|
| **Pcqm4m_g25** | GCN | 0.2362 ± 0.0003 | 0.8781 ± 0.0005 | 0.7803 ± 0.0006 | 0.2458 ± 0.0007 | 0.8701 ± 0.0002 | 0.8189 ± 0.0004 |
| | GIN | 0.2270 ± 0.0003 | 0.8854 ± 0.0004 | 0.7912 ± 0.0006 | 0.2352 ± 0.0006 | 0.8802 ± 0.0007 | 0.7827 ± 0.0005 |
| | GINE | **0.2223 ± 0.0007** | **0.8874 ± 0.0003** | **0.7949 ± 0.0001** | 0.2315 ± 0.0002 | 0.8823 ± 0.0002 | 0.7864 ± 0.0008 |
| **Pcqm4m_n4** | GCN | 0.2080 ± 0.0003 | 0.5497 ± 0.0010 | 0.2942 ± 0.0007 | 0.2040 ± 0.0001 | 0.4796 ± 0.0006 | 0.2185 ± 0.0002 |
| | GIN | 0.1912 ± 0.0027 | **0.6138 ± 0.0088** | **0.3688 ± 0.0116** | 0.1966 ± 0.0003 | 0.5198 ± 0.0008 | 0.2602 ± 0.0012 |
| | GINE | **0.1910 ± 0.0001** | 0.6127 ± 0.0003 | 0.3666 ± 0.0008 | 0.1941 ± 0.0003 | 0.5303 ± 0.0023 | 0.2701 ± 0.0034 |
@@ -49,13 +49,13 @@ While `PCQM4M_G25` has no noticeable changes, the node predictions of `PCQM4M_N4

|-----------|-------|-----------|-----------|-----------|---------|-----------|---------|
| | <th colspan="3" style="text-align: center;">Single-Task Model</th> <th colspan="3" style="text-align: center;">Multi-Task Model</th> |
| <hi> | <hi> | <hi> | <hi> | <hi> | <hi> | <hi> | <hi> |
| **Pcba\_1328** | GCN | **0.0316 ± 0.0000** | **0.7960 ± 0.0020** | **0.3368 ± 0.0027** | 0.0349 ± 0.0002 | 0.7661 ± 0.0031 | 0.2527 ± 0.0041 |
| | GIN | 0.0324 ± 0.0000 | 0.7941 ± 0.0018 | 0.3328 ± 0.0019 | 0.0342 ± 0.0001 | 0.7747 ± 0.0025 | 0.2650 ± 0.0020 |
| | GINE | 0.0320 ± 0.0001 | 0.7944 ± 0.0023 | 0.3337 ± 0.0027 | 0.0341 ± 0.0001 | 0.7737 ± 0.0007 | 0.2611 ± 0.0043 |
| **L1000\_vcap** | GCN | 0.1900 ± 0.0002 | 0.5788 ± 0.0034 | 0.3708 ± 0.0007 | 0.1872 ± 0.0020 | 0.6362 ± 0.0012 | 0.4022 ± 0.0008 |
| | GIN | 0.1909 ± 0.0005 | 0.5734 ± 0.0029 | 0.3731 ± 0.0014 | 0.1870 ± 0.0010 | 0.6351 ± 0.0014 | 0.4062 ± 0.0001 |
| | GINE | 0.1907 ± 0.0006 | 0.5708 ± 0.0079 | 0.3705 ± 0.0015 | **0.1862 ± 0.0007** | **0.6398 ± 0.0043** | **0.4068 ± 0.0023** |
| **L1000\_mcf7** | GCN | 0.1869 ± 0.0003 | 0.6123 ± 0.0051 | 0.3866 ± 0.0010 | 0.1863 ± 0.0011 | **0.6401 ± 0.0021** | 0.4194 ± 0.0004 |
| | GIN | 0.1862 ± 0.0003 | 0.6202 ± 0.0091 | 0.3876 ± 0.0017 | 0.1874 ± 0.0013 | 0.6367 ± 0.0066 | **0.4198 ± 0.0036** |
| | GINE | **0.1856 ± 0.0005** | 0.6166 ± 0.0017 | 0.3892 ± 0.0035 | 0.1873 ± 0.0009 | 0.6347 ± 0.0048 | 0.4177 ± 0.0024 |
@@ -67,27 +67,70 @@ This is not surprising as they contain two orders of magnitude more datapoints a

| | | CE or MSE loss in single-task $\downarrow$ | CE or MSE loss in multi-task $\downarrow$ |
|------------|-------|-----------------------------------------|-----------------------------------------|
|
| **Pcqm4m\_g25** | GCN | **0.2660 ± 0.0005** | 0.2767 ± 0.0015 |
| | GIN | **0.2439 ± 0.0004** | 0.2595 ± 0.0016 |
| | GINE | **0.2424 ± 0.0007** | 0.2568 ± 0.0012 |
|
| **Pcqm4m\_n4** | GCN | **0.2515 ± 0.0002** | 0.2613 ± 0.0008 |
| | GIN | **0.2317 ± 0.0003** | 0.2512 ± 0.0008 |
| | GINE | **0.2272 ± 0.0001** | 0.2483 ± 0.0004 |
|
| **Pcba\_1328** | GCN | **0.0284 ± 0.0010** | 0.0382 ± 0.0005 |
| | GIN | **0.0249 ± 0.0017** | 0.0359 ± 0.0011 |
| | GINE | **0.0258 ± 0.0017** | 0.0361 ± 0.0008 |
|
| **L1000\_vcap** | GCN | 0.1906 ± 0.0036 | **0.1854 ± 0.0148** |
| | GIN | 0.1854 ± 0.0030 | **0.1833 ± 0.0185** |
| | GINE | **0.1860 ± 0.0025** | 0.1887 ± 0.0200 |
|
| **L1000\_mcf7** | GCN | 0.1902 ± 0.0038 | **0.1829 ± 0.0095** |
| | GIN | 0.1873 ± 0.0033 | **0.1701 ± 0.0142** |
| | GINE | 0.1883 ± 0.0039 | **0.1771 ± 0.0010** |
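The single-task and multi-task columns above compare the same per-task objectives (cross-entropy or mean-squared error) trained in isolation versus jointly. A multi-task objective of this kind could be assembled as in the sketch below; the task names, the per-task loss choices, the plain summation, and the NaN-masking of missing labels are assumptions for illustration, not the repository's actual implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative mapping from task head to loss type; the actual per-task
# objectives used for these baselines are not specified here.
TASK_LOSS = {
    "pcqm4m_g25": "mse",   # graph-level regression labels
    "pcqm4m_n4": "mse",    # node-level regression labels
    "pcba_1328": "bce",    # binary assay labels
}

def multitask_loss(preds: dict[str, torch.Tensor],
                   targets: dict[str, torch.Tensor]) -> torch.Tensor:
    """Sum the per-task CE/MSE losses over all task heads in the batch.

    Missing labels are assumed to be encoded as NaN and are masked out.
    """
    per_task = []
    for task, y_hat in preds.items():
        y = targets[task]
        mask = ~torch.isnan(y)  # ignore sparse / missing labels
        if TASK_LOSS[task] == "mse":
            per_task.append(F.mse_loss(y_hat[mask], y[mask]))
        else:
            per_task.append(F.binary_cross_entropy_with_logits(y_hat[mask], y[mask]))
    return torch.stack(per_task).sum()
```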
# UltraLarge Baseline

## UltraLarge test set metrics

For `UltraLarge`, we provide results for the same GNN baselines as for `LargeMix`. Each model is trained for 50 epochs and results are averaged over 3 seeds. The remaining setup is the same as for `ToyMix` (Section E.1): we report the same performance metrics for the single-dataset and multi-dataset settings, and we use models of the same size as for `LargeMix`.
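The "± " values in the tables below are the spread over those 3 seeds; a minimal aggregation helper could look like the following sketch (the example seed results, the metric, and the use of a sample standard deviation are assumptions, not taken from the repository).

```python
import numpy as np

def mean_pm_std(per_seed_values: list[float]) -> str:
    """Format per-seed results as the 'mean ± std' strings used in the tables.

    Whether the reported spread is a sample or population std is an assumption here.
    """
    vals = np.asarray(per_seed_values, dtype=float)
    return f"{vals.mean():.4f} ± {vals.std(ddof=1):.4f}"

# e.g. a hypothetical test MAE from 3 training seeds
print(mean_pm_std([0.2601, 0.2612, 0.2605]))   # -> "0.2606 ± 0.0006"
```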
For now, we report results only for a subset representing 5% of the total dataset due to computational constraints, but we aim to provide the full results soon.

**Results discussion.** `UltraLarge` results can be found in Table 6. Interestingly, on both graph- and node-level tasks we observe no advantage of multi-tasking in terms of performance. We expect that significantly larger models are needed to successfully leverage the multi-task setup on this ultra-large dataset; this could be attributed to underfitting, as already demonstrated for `LargeMix`. Nonetheless, our baselines set the stage for large-scale pre-training on `UltraLarge`.

The results presented used approximately 500 GPU hours of compute, with more compute used for development and hyperparameter search.

We further note that the graph-level task results are very strong. The node-level tasks, in contrast, are expected to underperform in the low-parameter regime due to clear signs of underfitting, the very large number of labels to learn, and the susceptibility of traditional GNNs to over-smoothing.
| Dataset | Model | MAE ↓ | Pearson ↑ | R² ↑ | MAE ↓ | Pearson ↑ | R² ↑ |
|------------------|-------|-------------------|-------------------|-------------------|-------------------|-------------------|-------------------|
| | <th colspan="3" style="text-align: center;">Single-Task Model</th> <th colspan="3" style="text-align: center;">Multi-Task Model</th> |
| <hi> | <hi> | <hi> | <hi> | <hi> | <hi> | <hi> | <hi> |
| **Pm6_83m_g62** | GCN | 0.2606 ± 0.0011 | 0.9004 ± 0.0003 | 0.7997 ± 0.0009 | 0.2625 ± 0.0011 | 0.8896 ± 0.0001 | 0.7982 ± 0.0001 |
| | GIN | 0.2546 ± 0.0021 | 0.9051 ± 0.0019 | 0.8064 ± 0.0037 | 0.2562 ± 0.0000 | 0.8901 ± 0.0000 | 0.806 ± 0.0000 |
| | GINE | **0.2538 ± 0.0006** | **0.9059 ± 0.0010** | **0.8082 ± 0.0015** | 0.258 ± 0.0011 | 0.904 ± 0.0000 | 0.8048 ± 0.0001 |
|
| **Pm6_83m_n7** | GCN | 0.5803 ± 0.0001 | 0.3372 ± 0.0004 | 0.1191 ± 0.0002 | 0.5971 ± 0.0002 | 0.3164 ± 0.0001 | 0.1019 ± 0.0011 |
| | GIN | 0.573 ± 0.0002 | 0.3478 ± 0.0001 | **0.1269 ± 0.0002** | 0.5831 ± 0.0001 | 0.3315 ± 0.0005 | 0.1141 ± 0.0000 |
| | GINE | **0.572 ± 0.0004** | **0.3487 ± 0.0002** | 0.1266 ± 0.0001 | 0.5839 ± 0.0004 | 0.3294 ± 0.0002 | 0.1104 ± 0.0000 |
## UltraLarge training set loss

In the table below, we observe that the multi-task model underfits only slightly compared to the single-task model, indicating that parameters can be shared efficiently between the node-level and graph-level tasks. We further note that the training loss and the test MAE are almost equal for all tasks, suggesting further benefits as we scale both the model and the data.
| | | **MAE loss in single-task ↓** | **MAE loss in multi-task ↓** |
|------------------|-------|---------------------------|--------------------------|
|
| **Pm6_83m_g62** | GCN | **0.2679 ± 0.0020** | 0.2713 ± 0.0017 |
| | GIN | **0.2582 ± 0.0018** | 0.2636 ± 0.0014 |
| | GINE | **0.2567 ± 0.0036** | 0.2603 ± 0.0021 |
|
| **Pm6_83m_n7** | GCN | **0.5818 ± 0.0021** | 0.5955 ± 0.0023 |
| | GIN | **0.5707 ± 0.0019** | 0.5851 ± 0.0038 |
| | GINE | **0.5724 ± 0.0015** | 0.5832 ± 0.0027 |