README.md: 67 additions & 29 deletions
@@ -44,6 +44,34 @@ Or alternatively run this command:

Please note there is another package called spectra which is not related to this tool. Spectrae (which stands for spectral evaluation) implements the spectral framework for model evaluation.

+## Definition of terms
+
+This work and GitHub repository use terms related to the **spectral framework for model evaluation**. Below is a quick refresher on these key concepts.
+
+### **Spectral Property**
+Every dataset has an underlying property that, as it changes, causes model performance to decrease. This is referred to as the **spectral property**.
+
+However, **not every property qualifies as a spectral property**.
+For example:
+- When predicting protein structure, the performance of a protein folding model does **not** change based on the number of **M** amino acids in a sequence.
+- Instead, model performance **does** change based on **structural similarity**; this is an example of a **spectral property**.
+
+### **Spectral Property Graph (SPG)**
+For a given dataset, a **spectral property graph (SPG)** is defined as:
+- **Nodes**: Samples in the dataset.
+- **Edges**: Connections between samples that share a spectral property.
+
+Every SPG is stored as a flattened adjacency matrix, which saves memory and allows SPECTRA to use GPUs to speed up computation.
+
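To make the idea of a flattened adjacency matrix concrete, here is a minimal sketch of one common layout: keeping only the upper triangle of a symmetric matrix as a 1-D array. The helper names `flatten_adjacency` and `pair_index` are hypothetical, and this is an assumption for illustration, not necessarily the layout SPECTRA uses internally.

```python
import numpy as np


def flatten_adjacency(dense):
    """Return the strict upper triangle of a symmetric matrix as a 1-D array."""
    n = dense.shape[0]
    rows, cols = np.triu_indices(n, k=1)  # all (i, j) with i < j, row-major order
    return dense[rows, cols]


def pair_index(i, j, n):
    """Map an unordered sample pair (i, j) to its position in the flat array."""
    if i == j:
        raise ValueError("self-pairs are not stored")
    if i > j:
        i, j = j, i
    return i * n - i * (i + 1) // 2 + (j - i - 1)


# Toy 4-sample graph with edges 0-1, 0-3 and 1-2 (illustrative values only).
dense = np.array(
    [[0, 1, 0, 1],
     [1, 0, 1, 0],
     [0, 1, 0, 0],
     [1, 0, 0, 0]],
    dtype=np.uint8,
)
flat = flatten_adjacency(dense)
print(flat)                       # 6 stored entries instead of 16
print(flat[pair_index(3, 0, 4)])  # looks up the 0-3 edge
```

For `n` samples this layout stores `n * (n - 1) / 2` entries rather than `n * n`, and a contiguous 1-D array is straightforward to move to a GPU.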
+### **Spectral Parameter**
+The **spectral parameter** can be thought of as a **sparsification probability**.
+
+When SPECTRA runs on an SPG:
+1. It selects a random node.
+2. It decides whether to **delete edges** with a certain probability; this probability is the **spectral parameter**.
+3. The closer the spectral parameter is to **1**, the **stricter** the splits generated by SPECTRA will be.
+
+
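The effect of a sparsification probability can be illustrated with a toy sketch. It rests on an assumption that goes beyond the description above: after a random sample is picked, each of its still-available neighbours is dropped with probability equal to the spectral parameter. The function name `sparsified_subset` and the graph are hypothetical; this is not SPECTRA's implementation.

```python
import random


def sparsified_subset(adjacency, spectral_parameter, seed=0):
    """Toy sparsification: keep a random node, then drop each of its
    remaining neighbours with probability `spectral_parameter`."""
    rng = random.Random(seed)
    remaining = set(adjacency)
    kept = []
    while remaining:
        node = rng.choice(sorted(remaining))  # step 1: select a random node
        kept.append(node)
        remaining.discard(node)
        for neighbour in adjacency[node]:
            # step 2: sparsify with probability equal to the spectral parameter
            if neighbour in remaining and rng.random() < spectral_parameter:
                remaining.discard(neighbour)
    return kept


# Hypothetical SPG: a triangle (0, 1, 2) plus a separate edge (3, 4).
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4], 4: [3]}

strict = sparsified_subset(graph, 1.0)  # no two kept samples share an edge
loose = sparsified_subset(graph, 0.0)   # nothing is dropped
print(len(strict), len(loose))
```

Under this assumption, a parameter of `1` leaves only samples that share no spectral property with each other, which is one way to read "stricter" splits, while a parameter of `0` keeps every sample.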
## How to use spectra

### Step 1: Define the spectral property, cross-split overlap, and the spectra dataset wrapper
@@ -86,7 +114,7 @@ class [Name]_Dataset(SpectraDataset):
pass
```

-Spectra implements the user definition of the spectra property and cross split overlap.
+Spectra implements the user definition of the spectral property.


```python
@@ -103,52 +131,62 @@ class [Name]_spectra(spectra):
'''
return similarity

-def cross_split_overlap(self, train, test):
-'''
-Define this function to return the overlap between a list of train and test samples.
+```
+### Step 2: Initialize SPECTRA and calculate the flattened adjacency matrix

-Example: Average pairwise similarity between train and test set protein sequences.
+1. **Initialize SPECTRA**
+   - Initially, pass in no spectral property graph.

-'''
-
+2. **Pass SPECTRA and the dataset into the `Spectra_Property_Graph_Constructor`**
+   - Additional arguments:
+     - **`num_chunks`**: If your dataset is very large, you can split the construction into chunks so that multiple jobs can compute similarity. This parameter controls the number of chunks.
+     - **`binary`**: If `True`, the similarity function returns either `0` or `1`; otherwise, it returns a floating-point number.

-return cross_split_overlap
-```
-### Step 2: Initialize SPECTRA and precalculate pairwise spectral properties
+3. **Call `create_adjacency_matrix`**
+   - This function takes in the **chunk number** to calculate:
+     - If `num_chunks = 0`, the pairwise similarity is calculated in one go, so the input to `create_adjacency_matrix` should be `0`.
+     - If `num_chunks = 10`, the input should be the chunk number you want to calculate (one of `0` to `9`).
+
+4. **Combine the adjacency matrices**
+   - Call `combine_adjacency_matrices()` in the graph constructor to combine all the adjacency matrices into a single matrix.

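The chunking idea in Step 2 can be sketched in plain Python. The names `similarity`, `compute_chunk` and `combine_chunks` below are hypothetical stand-ins, not the `Spectra_Property_Graph_Constructor` API; the sketch only assumes, as described above, that `num_chunks = 0` means everything is computed in one go and that chunk results are concatenated afterwards.

```python
from itertools import combinations


def similarity(a, b):
    # Hypothetical binary spectral property: same first letter.
    return 1 if a[0] == b[0] else 0


def compute_chunk(samples, num_chunks, chunk_index):
    """Compute pairwise similarities for one chunk of the sample pairs."""
    pairs = list(combinations(range(len(samples)), 2))
    if num_chunks == 0:
        selected = pairs  # no chunking: every pair in one go
    else:
        size = -(-len(pairs) // num_chunks)  # ceiling division
        selected = pairs[chunk_index * size:(chunk_index + 1) * size]
    return [similarity(samples[i], samples[j]) for i, j in selected]


def combine_chunks(chunks):
    """Concatenate per-chunk results into one flattened list."""
    return [value for chunk in chunks for value in chunk]


samples = ["apple", "avocado", "banana", "blueberry", "cherry"]

one_go = compute_chunk(samples, 0, 0)
chunked = combine_chunks(compute_chunk(samples, 3, c) for c in range(3))
print(one_go == chunked)
```

Because each chunk depends only on its own slice of pairs, the chunks could be computed by separate jobs and combined once all of them have finished.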
-Initialize SPECTRA, passing in True or False to the binary argument if the spectral property returns a binary or continuous value. Then precalculate the pairwise spectral properties.
### Step 3: Initialize SPECTRA and precalculate pairwise spectral properties

-Generate SPECTRA splits. The ```generate_spectra_splits``` function takes in 4 important parameters:
-1. ```number_repeats```: the number of times to rerun SPECTRA for the same spectral parameter, the number of repeats must equal the number of seeds as each rerun uses a different seed.
-2. ```random_seed```: the random seeds used by each SPECTRA rerun, [42, 44] indicates two reruns the first of which will use a random seed of 42, the second will use 44.
-3. ```spectra_parameters```: the spectral parameters to run on, they must range from 0 to 1 and be string formatted to the correct number of significant figures to avoid float formatting errors.
-4. ```force_reconstruct```: True to force the model to regenerate SPECTRA splits even if they have already been generated.

+### Step 3: Generate SPECTRA Splits

-```python
-spectra_parameters = {'number_repeats': 3,
-'random_seed': [42, 44, 46],
-'spectral_parameters': ["{:.2f}".format(i) for i in np.arange(0, 1.05, 0.05)],
-'force_reconstruct': True,
-}
+1. **Initialize the Spectral Property Graph**
+   - Pass the flattened adjacency matrix you just generated to `Spectral_Property_Graph` to create the spectral property graph.
After SPECTRA has completed, the user should investigate the generated splits. Specifically ensuring that on average the cross-split overlap decreases as the spectral parameter increases. This can be achieved by using ```return_all_split_stats``` to gather the cross_split_overlap, train size, and test size of each generated split. Example outputs can be seen in the tutorials.
+After SPECTRA has completed, the user should investigate the generated splits. Specifically, check that on average the cross-split overlap decreases as the spectral parameter increases. This can be done by using ```return_all_split_stats``` to gather the cross_split_overlap, train size, and test size of each generated split. Example outputs can be seen in the tutorials. The `path_to_save` should be the same path you used in the previous step.
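The check described above can be sketched as follows. The record layout (dictionaries with `spectral_parameter` and `cross_split_overlap` keys) is an assumption for illustration, since the exact output format of `return_all_split_stats` is not shown here, and the numbers are made up.

```python
from collections import defaultdict
from statistics import mean


def mean_overlap_by_parameter(split_stats):
    """Average the cross-split overlap over repeats for each spectral parameter."""
    grouped = defaultdict(list)
    for row in split_stats:
        grouped[row["spectral_parameter"]].append(row["cross_split_overlap"])
    return {param: mean(values) for param, values in sorted(grouped.items())}


def overlap_decreases(mean_overlap, tolerance=0.0):
    """True if the mean overlap never rises as the spectral parameter grows."""
    values = list(mean_overlap.values())
    return all(later <= earlier + tolerance
               for earlier, later in zip(values, values[1:]))


# Made-up statistics: two repeats per spectral parameter (illustrative only).
example_stats = [
    {"spectral_parameter": 0.0, "cross_split_overlap": 0.90},
    {"spectral_parameter": 0.0, "cross_split_overlap": 0.80},
    {"spectral_parameter": 0.5, "cross_split_overlap": 0.50},
    {"spectral_parameter": 0.5, "cross_split_overlap": 0.40},
    {"spectral_parameter": 1.0, "cross_split_overlap": 0.10},
    {"spectral_parameter": 1.0, "cross_split_overlap": 0.05},
]

averages = mean_overlap_by_parameter(example_stats)
print(averages)
print(overlap_decreases(averages))
```

The `tolerance` argument allows for small fluctuations between neighbouring spectral parameters, since the expected decrease holds on average rather than for every individual split.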