Update README.md

yashaektefaie · web-flow · commit f743a1afcf3c · 2025-03-07T10:56:00.000-05:00
diff --git a/README.md b/README.md
@@ -201,7 +201,7 @@ If there are any other tutorials of interest feel free to raise an issue!
 
 ## Background
 
-SPECTRA is from a preprint, for more information on the preprint, the method behind SPECTRA, and the initials studies conducted with SPECTRA, check out the paper folder. 
+SPECTRA is [published](https://rdcu.be/d2D0z) in Nature Machine Intelligence. For the code behind the SPECTRA method and the initial studies conducted with SPECTRA, check out the paper folder.
 
 ## Discussion and Development
 
@@ -223,15 +223,15 @@ All development discussions take place on GitHub in this repo in the issue track
 
 2. *I have a foundation model that is pre-trained on a large amount of data. It is not feasible to do pairwise calculations of SPECTRA properties. How can I use SPECTRA?*
 
-    It is still possible to run SPECTRA on the foundation model (FM) and the evaluation dataset. You would use SPECTRA on the evaluation dataset then train and evaluate the foundation model on each SPECTRA split (either through linear probing, fine-tuning, or any other strategy) to calculate the AUSPC. Then you would determine the cross-split overlap between the pre-training dataset and the evaluation dataset. You would repeat this for multiple evaluation datasets, until you could plot FM AUSPC versus cross-split overlap to the evaluation dataset. For more details on what this would look like check out the [publication](https://www.biorxiv.org/content/10.1101/2024.02.25.581982v1), specifically section 5 of the results section. If there is large interest in this FAQ I can release a tutorial on this, just raise an issue! 
+    It is still possible to run SPECTRA on the foundation model (FM) and the evaluation dataset. You would use SPECTRA on the evaluation dataset, then train and evaluate the foundation model on each SPECTRA split (through linear probing, fine-tuning, or any other strategy) to calculate the AUSPC. Then you would determine the cross-split overlap between the pre-training dataset and the evaluation dataset. You would repeat this for multiple evaluation datasets until you could plot FM AUSPC versus cross-split overlap to the evaluation dataset. For more details on what this would look like, check out the [publication](https://rdcu.be/d2D0z), specifically section 5 of the results section. If there is large interest in this FAQ, I can release a tutorial on this; just raise an issue!
 
 3. *I have a foundation model that is pre-trained on a large amount of data and **I do not have access to the pre-training data**. How can I use SPECTRA?*
 
     This is a bit trickier, but there are [recent publications](https://arxiv.org/abs/2402.03563) showing that these foundation models can represent uncertainty in the hidden representations they produce, and that a model can be trained to predict uncertainty from these representations. This uncertainty could represent the spectral property comparison between the pre-training and evaluation datasets. Though more work needs to be done, porting this work over would allow the application of SPECTRA in these settings. Again, if there is large interest in this FAQ, I can release a tutorial on this; just raise an issue!
 
 4. *SPECTRA takes a long time to run. Is it worth it?*
 
-    The pairwise spectral property comparison is computationally expensive, but only needs to be done once. Generated SPECTRA splits are important resources that should be released to the public so others can utlilize them without spending resources. For more details on the runtime of the method check out the [publication](https://www.biorxiv.org/content/10.1101/2024.02.25.581982v1), specifically section 6 of the results section. The computation can be sped up with cpu cores, which is a feature that will be released.
+    The pairwise spectral property comparison is computationally expensive, but only needs to be done once. Generated SPECTRA splits are important resources that should be released to the public so others can utilize them without spending resources. For more details on the runtime of the method, check out the [publication](https://rdcu.be/d2D0z), specifically section 6 of the results section. The computation can be sped up with multiple CPU cores, a feature that will be released.
 
 If there are any other questions please raise them in the issues and I can address them. I'll keep adding to the FAQ as common questions begin to surface.
 
@@ -244,15 +244,20 @@ SPECTRA is under the MIT license found in the LICENSE file in this GitHub reposi
 Please cite this paper when referring to SPECTRA.
 
 ```
-@article {spectra,
-	author = {Yasha Ektefaie and Andrew Shen and Daria Bykova and Maximillian Marin and Marinka Zitnik and Maha R Farhat},
-	title = {Evaluating generalizability of artificial intelligence models for molecular datasets},
-	elocation-id = {2024.02.25.581982},
-	year = {2024},
-	doi = {10.1101/2024.02.25.581982},
-	URL = {https://www.biorxiv.org/content/early/2024/02/28/2024.02.25.581982},
-	eprint = {https://www.biorxiv.org/content/early/2024/02/28/2024.02.25.581982.full.pdf},
-	journal = {bioRxiv}
+@ARTICLE{Ektefaie2024,
+  title     = "Evaluating generalizability of artificial intelligence models
+               for molecular datasets",
+  author    = "Ektefaie, Yasha and Shen, Andrew and Bykova, Daria and Marin,
+               Maximillian G and Zitnik, Marinka and Farhat, Maha",
+  journal   = "Nat. Mach. Intell.",
+  publisher = "Springer Science and Business Media LLC",
+  volume    =  6,
+  number    =  12,
+  pages     = "1512--1524",
+  month     =  dec,
+  year      =  2024,
+  copyright = "https://www.springernature.com/gp/researchers/text-and-data-mining",
+  language  = "en"
 }
 ```