Requires module networkx. Tested with version 1.9.1.
Requires module pickle
Requires modules sys, time, and math, which should be installed with python by default.
Requires module numpy
See http://bib.oxfordjournals.org/content/13/5/569.full for metric definitions.
This function takes in the file name for a GO ontology file (obo format). The file must be of a specific format:
! comments
[Term]
id: GO_term
...
is_a: GO_term
is_a: GO_term
[Term]
id: GO_term
....
[Typedef]
...
Note: The [Typedef] tag signals the end of GO terms, and is required (otherwise, the parser will fail to record the final GO term in the ontology file)
Please see example_go.obo for a full example file.
parse_go_file returns two objects as a tuple:
- go_graph: A networkx DiGraph object. go_graph represents the ontology as a DiGraph, where each is_a relationship is represented as an edge.
- alt_ids: GO ontology files provide alternate IDs for some terms (represented by alt_id: lines in the ontology file). alt_ids is a mapping from alternate IDs to the IDs stored in go_graph. alt_ids is a python dictionary, where keys are GO IDs not in go_graph, and values are the corresponding GO IDs in go_graph.
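The file format and return values described above can be illustrated with a small sketch. This is not the module's parse_go_file: it uses only the standard library, stores is_a relationships in a plain dictionary (child ID to list of parent IDs) instead of a networkx DiGraph, and does not depend on the closing [Typedef] tag. The name parse_go_lines and the GO IDs in the example are placeholders chosen for illustration.

```python
def parse_go_lines(lines):
    """Collect is_a parents and alt_id mappings from OBO-style lines."""
    parents = {}   # GO ID -> list of is_a parent IDs
    alt_ids = {}   # alternate ID -> primary ID
    current = None
    in_term = False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("!"):
            continue  # skip blank lines and comments
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are recorded.
            in_term = (line == "[Term]")
            current = None
            continue
        if not in_term:
            continue
        key, _, value = line.partition(": ")
        value = value.split(" ! ")[0].strip()  # drop trailing comments
        if key == "id":
            current = value
            parents.setdefault(current, [])
        elif key == "is_a" and current is not None:
            parents[current].append(value)
        elif key == "alt_id" and current is not None:
            alt_ids[value] = current
    return parents, alt_ids


example = """! comments
[Term]
id: GO:0000001
alt_id: GO:0000009
[Term]
id: GO:0000002
is_a: GO:0000001 ! parent term
[Typedef]
id: part_of
""".splitlines()

parents, alt_ids = parse_go_lines(example)
```

Here parents plays the role of go_graph (one entry per is_a edge), and alt_ids maps each alternate ID to the ID under which the term is stored.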
This function takes in a file name for a pre-processed annotation corpus file of a specific format:
-
protein_name
GO_term
GO_term
GO_term
-
...
-
Note: File must both start and end with a - line. Please see example_corpus.stripped for a full example of a pre-processed annotation corpus file.
parse_annotation_corpus returns two objects in a tuple:
- prot_to_gos: This is a python dictionary mapping protein names to GO terms. Keys are protein names, values are python lists of GO terms associated with the key (from the annotation corpus).
- go_to_prots: This is a python dictionary mapping GO terms to protein names. Keys are GO terms, and values are python lists of protein names labeled with the key (from the annotation corpus).
If alt_ids is provided, then any keys in alt_ids that appear in the annotation corpus will be stored as their associated values in alt_ids.
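As a rough illustration of the corpus format and the two returned dictionaries, here is a minimal standard-library sketch. It is not the module's parse_annotation_corpus: the name parse_corpus_lines, the protein names, and the choice to skip duplicate entries are assumptions made for the example.

```python
def parse_corpus_lines(lines, alt_ids=None):
    """Build protein->GO and GO->protein dictionaries from stripped-corpus lines."""
    alt_ids = alt_ids or {}
    prot_to_gos = {}
    go_to_prots = {}
    protein = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "-":
            protein = None  # the next non-separator line names a protein
            continue
        if protein is None:
            protein = line
            prot_to_gos.setdefault(protein, [])
            continue
        term = alt_ids.get(line, line)  # store alternate IDs under their primary ID
        if term not in prot_to_gos[protein]:
            prot_to_gos[protein].append(term)
        go_to_prots.setdefault(term, [])
        if protein not in go_to_prots[term]:
            go_to_prots[term].append(protein)
    return prot_to_gos, go_to_prots


corpus = ["-", "PROT_A", "GO:0000002", "GO:0000009",
          "-", "PROT_B", "GO:0000002", "-"]
prot_to_gos, go_to_prots = parse_corpus_lines(corpus, {"GO:0000009": "GO:0000001"})
```

In this example GO:0000009 is an alternate ID, so PROT_A ends up labeled with GO:0000001 instead.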
This function takes in a file path to a pickled SemSimCalculator object.
It returns an unpickled SemSimCalculator object
The SemSimCalculator class takes an ontology and an annotation corpus. It parses and uses these to calculate various semantic similarity metrics between terms, groups of terms, and proteins.
All class variables are technically public, but should be treated as private. Use the getter functions explained below to access them. Class variables:
- _go_graph: networkx DiGraph representing the GO ontology, as parsed/returned by parse_go_file
- _alt_list: python dictionary representing alternate GO term IDs, as parsed/returned by parse_go_file
- _prot_to_gos: python dictionary mapping protein names to their GO term labels, as parsed/returned by parse_annotation_corpus
- _go_to_prots: python dictionary mapping GO terms to the proteins which they label, as parsed/returned by parse_annotation_corpus
- _proteins: python list of protein names. Contains names of all proteins that have labels
- _num_proteins: integer. Size of _proteins
- _ic_vals: dictionary mapping GO term to its IC (information content) value. Initialized empty. Used for memoization
- _go_terms: list of all GO terms in the graph of the ontology
- _mica_store: reference to a MicaStore instance. Initialized as None, must be set manually
Creates new instance. Call semsimcalc.SemSimCalculator(file_name, file_name) to use. Will return a SemSimCalculator object.
Takes in file names for GO ontology file (obo format) and annotation corpus file (pre-processed file of the same format that parse_annotation_corpus takes, as explained above).
Initializes go_graph, alt_list, prot_to_gos, go_to_prots, proteins, and num_proteins.
Creates ic_vals as an empty dictionary.
Saves a reference to a MicaStore instance
Removes _mica_store reference (sets to None)
Pickles and saves self to filepath
Return copy of _go_graph
Return copy of _alt_list
Return copy of _prot_to_gos
Return copy of _go_to_prots
Return copy of _ic_vals
Note: get_ic_vals does not inherently calculate IC values. Use precompute_ic_vals first if you need all IC values.
Return copy of _go_terms
Return reference to _mica_store
Takes in a GO term as a string. Calculates and returns the probability of that term or any of that term's descendants (in the GO DiGraph) occurring in the annotation corpus. That is: [number of proteins labeled with term or a descendant of term] / [number of labeled proteins in annotation corpus]
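The ratio above can be sketched with plain dictionaries. This is an illustration only, not the class method: the helper names, the children mapping (parent term to list of child terms), and the toy data are assumptions, and the real implementation walks a networkx DiGraph.

```python
def descendants(term, children):
    """Return every term reachable from `term` by following child links."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def term_probability(term, children, go_to_prots, num_proteins):
    """Fraction of labeled proteins annotated with `term` or any descendant."""
    terms = descendants(term, children) | {term}
    labeled = set()
    for t in terms:
        labeled.update(go_to_prots.get(t, []))
    return len(labeled) / float(num_proteins)


# Toy ontology: root -> a -> c, root -> b
children = {"root": ["a", "b"], "a": ["c"]}
go_to_prots = {"a": ["P1"], "b": ["P2"], "c": ["P1", "P3"]}
num_proteins = 4  # P4 is labeled only with a term outside this toy ontology

p_root = term_probability("root", children, go_to_prots, num_proteins)
p_a = term_probability("a", children, go_to_prots, num_proteins)
```

Proteins are collected in a set so that a protein labeled with both a term and one of its descendants (P1 here) is counted once.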
Takes in a GO term as a string. Calculates and returns the information content of that term. Information content is defined (within this implementation) as:
-ln(prob(term))
Where prob(term) is the same as the result of calling calc_term_probability(term)
Note: Once an IC is calculated, it is stored in _ic_vals. Subsequent calls for the IC of the same term only look up the recorded value.
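The memoization described in the note can be sketched as follows. The function name information_content, the prob callable, and the toy probabilities are assumptions for illustration; in the class, the probability comes from calc_term_probability and the cache is _ic_vals.

```python
import math


def information_content(term, prob, ic_vals):
    """Return -ln(prob(term)), computing it only on the first request."""
    if term not in ic_vals:
        ic_vals[term] = -math.log(prob(term))
    return ic_vals[term]


probs = {"root": 1.0, "a": 0.5}
calls = []  # records how often the probability is actually computed


def prob(term):
    calls.append(term)
    return probs[term]


ic_vals = {}
first = information_content("a", prob, ic_vals)
second = information_content("a", prob, ic_vals)  # served from ic_vals
```

After both calls, calls contains a single entry, since the second request is a dictionary lookup. Note that math.log raises ValueError for a probability of 0; how the class handles terms with no annotations is not covered by this sketch.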
Fills the _ic_vals dictionary used for information content memoization.
Runs IC on all terms in the GO ontology.
Takes in two GO terms as strings. Order doesn't matter. Calculates and returns the Maximum Informative Common Ancestor. (Returns a GO term as a string)
The MICA of two terms is the common ancestor of both terms with the highest information content value.
Note: For this implementation, if left and right are the same, they are included in the list of "common ancestors."
If a MicaStore instance is linked (through link_mica_store), MICA first queries the MicaStore instance. Only if the MicaStore instance does not return a GO term does MICA calculate a result from the GO graph and annotation corpus.
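A minimal sketch of the MICA selection, assuming a parents dictionary (term to list of is_a parents) and a precomputed ic dictionary; both, along with the function names, are stand-ins for the class's graph and IC cache. It follows the note above literally: a term is added to its own ancestor set only when left and right are the same. The MicaStore shortcut is left out.

```python
def ancestors(term, parents):
    """Return every term reachable from `term` by following is_a links."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def mica(left, right, parents, ic):
    """Return the common ancestor with the highest IC, or None if there is none."""
    left_set = ancestors(left, parents)
    right_set = ancestors(right, parents)
    if left == right:
        left_set.add(left)
        right_set.add(right)
    common = left_set & right_set
    if not common:
        return None
    # sorted() only makes tie-breaking deterministic
    return max(sorted(common), key=lambda t: ic[t])


parents = {"a": ["root"], "b": ["root"], "c": ["a"], "d": ["a", "b"]}
ic = {"root": 0.0, "a": 0.7, "b": 1.4, "c": 1.4, "d": 2.0}

result = mica("c", "d", parents, ic)
```

Here c and d share the ancestors a and root, and a has the higher information content.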
Takes in two GO terms as strings. Order doesn't matter. Calculates and returns the Resnik score of the two terms. (Returns a float)
simRes is defined as the information content of the MICA of two terms. See here for more details.
Takes in two GO terms as strings. Order doesn't matter. Calculates and returns the Lin score for the two terms. (Returns a float)
simLin is defined as the simRes of two terms divided by the sum of the information contents for each term (left and right). See here for more details.
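The two definitions above translate directly into code. The sketch below assumes a mica callable and an ic dictionary are already available; the function names and the toy values are illustrative, and the Lin score is computed exactly as worded here (simRes divided by the sum of the two information contents).

```python
def sim_res(left, right, mica, ic):
    """Resnik score: information content of the MICA of the two terms."""
    return ic[mica(left, right)]


def sim_lin(left, right, mica, ic):
    """Lin score as described above: simRes over IC(left) + IC(right)."""
    return sim_res(left, right, mica, ic) / (ic[left] + ic[right])


# Toy data: the MICA of c and d is a
ic = {"a": 0.5, "c": 1.0, "d": 1.5}
mica_table = {frozenset(("c", "d")): "a"}


def mica(left, right):
    # frozenset keys make the lookup independent of argument order
    return mica_table[frozenset((left, right))]


res_score = sim_res("c", "d", mica, ic)
lin_score = sim_lin("c", "d", mica, ic)
```

With these values the Resnik score is 0.5 and the Lin score is 0.5 / 2.5. The sketch does not guard against a zero denominator, which would occur if both terms had an information content of 0.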
Takes in two GO terms as strings. Order doesn't matter. Calculates and returns the Jiang-Conrath score for the two terms. The Jiang-Conrath score is defined as:
1 - IC(left) + IC(right) - 2 * simRes(left, right)
See here for more details.
Note: Currently untested
Example for proper function calls:
Assume calc is a SemSimCalculator instance:
calc.pairwise_average_term_comp(left_terms, right_terms, calc.simRes)
Will calculate and return the average of all pairwise Resnik scores for the given lists of GO terms, left_terms and right_terms.
Takes in two python lists of GO terms (lefts, rights) and a comparison metric (ex. any function from the "Comparison Metrics" section). metric must take in two ontology terms and return a numeric score.
Returns the average of all pairwise term comparisons, using metric.
Takes in two python lists of GO terms (lefts, rights) and a comparison metric (ex. any function from the "Comparison Metrics" section). metric must take in two ontology terms and return a numeric score.
Returns the max of all pairwise term comparisons, using metric.
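The average and max aggregations can be sketched as below. The function names and the toy metric are placeholders; any callable that takes two terms and returns a number can be passed as metric.

```python
from itertools import product


def pairwise_scores(lefts, rights, metric):
    """Score every (left, right) combination of the two term lists."""
    return [metric(l, r) for l, r in product(lefts, rights)]


def pairwise_average(lefts, rights, metric):
    scores = pairwise_scores(lefts, rights, metric)
    return sum(scores) / float(len(scores))


def pairwise_max(lefts, rights, metric):
    return max(pairwise_scores(lefts, rights, metric))


def toy_metric(left, right):
    # Stand-in for a real metric such as simRes: 1.0 for identical terms
    return 1.0 if left == right else 0.0


avg = pairwise_average(["a", "b"], ["a", "c"], toy_metric)
best = pairwise_max(["a", "b"], ["a", "c"], toy_metric)
```

Four pairs are scored here, and only (a, a) scores 1.0. Both functions in this sketch raise an error on an empty list; the class's behavior in that case is not specified above.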
Example for proper function calls:
Assume calc is a SemSimCalculator instance:
calc.average_protein_comp(left_prot, right_prot, calc.simRes)
Will calculate and return the average of all pairwise Resnik scores for the GO terms associated with the two protein names, left_prot and right_prot.
Takes in two protein names as strings (left_prot, right_prot) and a reference to a comparison metric (ex. any function from the "Comparison Metrics" section). metric must take in two ontology terms and return a numeric score.
Returns the average of all pairwise term comparisons for the GO terms associated with the proteins left_prot and right_prot using metric.
Takes in two protein names as strings (left_prot, right_prot) and a reference to a comparison metric (ex. any function from the "Comparison Metrics" section). metric must take in two ontology terms and return a numeric score.
Returns the max of all pairwise term comparisons for the GO terms associated with the proteins left_prot and right_prot using metric.
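The protein-level comparisons add one step to the term-level ones: each protein name is first mapped to its GO terms. A minimal sketch, assuming a prot_to_gos dictionary is passed in explicitly (the class keeps it internally) and using a placeholder metric and hypothetical function names:

```python
def protein_scores(left_prot, right_prot, prot_to_gos, metric):
    """Score every pair of GO terms drawn from the two proteins' labels."""
    lefts = prot_to_gos[left_prot]
    rights = prot_to_gos[right_prot]
    return [metric(l, r) for l in lefts for r in rights]


def average_protein_score(left_prot, right_prot, prot_to_gos, metric):
    scores = protein_scores(left_prot, right_prot, prot_to_gos, metric)
    return sum(scores) / float(len(scores))


def max_protein_score(left_prot, right_prot, prot_to_gos, metric):
    return max(protein_scores(left_prot, right_prot, prot_to_gos, metric))


def toy_metric(left, right):
    # Stand-in for a real metric such as simRes
    return 1.0 if left == right else 0.0


prot_to_gos = {"P1": ["a", "b"], "P2": ["b"]}
avg = average_protein_score("P1", "P2", prot_to_gos, toy_metric)
best = max_protein_score("P1", "P2", prot_to_gos, toy_metric)
```

A protein name missing from prot_to_gos raises KeyError in this sketch.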
Wrapper class for a numpy matrix of MICA values.
Takes a numpy matrix of MICA values and an ordering of GO terms (indices in the matrix).
Class variables:
-
- _micas: numpy matrix of MICA values
- _go_to_index: dictionary mapping GO terms to indices in the matrix (matrix must be symmetrical)
Loads numpy matrix from matrix_filename into _micas.
Parses ordered list of GO terms from ordering_filename (one term per line).
Returns reference to numpy array _micas
Returns copy of _go_to_index dictionary
If term is in _go_to_index, return _go_to_index[term], which corresponds to term's index in _micas
If term is not in _go_to_index, return None
Attempts to look up a MICA value from _micas.
If MICA value cannot be found (or left or right are not in _go_to_index), returns None
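The index and lookup behavior can be illustrated with a toy stand-in. This sketch is not the MicaStore class: it uses a nested python list in place of a numpy matrix, builds the ordering from a list instead of a file, and assumes each cell holds the MICA as a GO term string. The class and method names are hypothetical.

```python
class ToyMicaStore(object):
    """Toy stand-in for a MICA matrix plus its GO term ordering."""

    def __init__(self, micas, ordering):
        self._micas = micas  # square, symmetrical nested list
        self._go_to_index = {term: i for i, term in enumerate(ordering)}

    def get_index(self, term):
        # dict.get returns None for unknown terms
        return self._go_to_index.get(term)

    def mica_lookup(self, left, right):
        i = self.get_index(left)
        j = self.get_index(right)
        if i is None or j is None:
            return None
        return self._micas[i][j]


ordering = ["root", "a", "b"]
micas = [
    ["root", "root", "root"],
    ["root", "a", "root"],
    ["root", "root", "b"],
]
store = ToyMicaStore(micas, ordering)
found = store.mica_lookup("a", "b")
missing = store.mica_lookup("a", "not_a_term")
```

Returning None for unknown terms is what lets a caller fall back to computing the MICA from the graph.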
Standalone script to strip down a Swiss-Prot text file (".dat"). See http://www.uniprot.org/downloads for download location.
Only tested on Swiss-Prot currently.
To run:
python strip_ac.py -i [filename] -o [filename]
Where:
- -i takes the file name for the original Swiss-Prot .dat file
- -o takes the file name for the output (stripped) file
The output of this script is compatible with semsimcalc.parse_annotation_corpus
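To show the kind of transformation involved, here is a rough sketch that reduces Swiss-Prot flat-file lines to the stripped corpus format. It is not strip_ac.py. It assumes the protein name is the first accession on the AC line (the script may use a different field), takes GO terms from "DR   GO;" lines, treats "//" as the end of an entry, and drops entries without GO terms. The accessions in the example are placeholders.

```python
def strip_swissprot(lines):
    """Reduce Swiss-Prot flat-file lines to '-', name, GO terms, '-' records."""
    out = ["-"]
    name = None
    terms = []
    for line in lines:
        if line.startswith("AC   ") and name is None:
            # Assumption: the first accession serves as the protein name
            name = line[5:].split(";")[0].strip()
        elif line.startswith("DR   GO;"):
            terms.append(line.split(";")[1].strip())
        elif line.startswith("//"):
            if name is not None and terms:
                out.append(name)
                out.extend(terms)
                out.append("-")
            name, terms = None, []
    return out


dat_lines = [
    "ID   EXAMPLE_ONE   Reviewed;   100 AA.",
    "AC   ACC0001; ACC0002;",
    "DR   GO; GO:0005737; C:cytoplasm; IDA:UniProtKB.",
    "DR   GO; GO:0005634; C:nucleus; IEA:Example.",
    "//",
    "ID   EXAMPLE_TWO   Reviewed;   50 AA.",
    "AC   ACC0003;",
    "//",
]
stripped = strip_swissprot(dat_lines)
```

The result starts and ends with a - line, as required by the corpus format.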
These files are examples of format. example_corpus.dat corresponds with example_corpus.stripped, but not with example_go.obo