Find sets

Matthew Reyna edited this page Jan 2, 2018 · 2 revisions

We use the find_sets.py script to search for sets M of genes with significantly exclusive, co-occurring, or other patterns of mutations, given the processed mutation data and mutation probabilities. The script enumerates gene sets of size k and, using the saddlepoint approximation for a weighted row (WR) statistic, computes the p-value, runtime, and FDR for each set.
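
To give a sense of the enumeration step, the following minimal sketch lists every candidate set of size k drawn from a small gene list. The gene names and the use of itertools.combinations are assumptions made for illustration; find_sets.py may enumerate sets differently internally.

```python
from itertools import combinations
from math import comb

# Hypothetical gene names, for illustration only.
genes = ['TP53', 'KRAS', 'EGFR', 'BRAF']
k = 2

# Every unordered set of k distinct genes is a candidate set M.
candidate_sets = list(combinations(genes, k))

# The number of candidates is "n choose k", so it grows quickly with k.
assert len(candidate_sets) == comb(len(genes), k)

for gene_set in candidate_sets:
    print(gene_set)
```

Because the number of candidate sets grows combinatorially in k, restricting the gene list before enumeration keeps the search tractable.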

We implemented the following statistics, which you can choose using the -s/--statistic argument of the find_sets.py script.

  1. Exclusivity (exclusivity): the number of samples with mutually exclusive mutations (exactly one mutated gene in a gene set M)
  2. Any co-occurrence (any-co-occurrence): the number of samples with any co-occurring mutations (at least two mutated genes in a gene set M)
  3. All co-occurrence (all-co-occurrence): the number of samples with all co-occurring mutations (exactly k mutated genes in a gene set M of size k)
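
As a rough illustration of the three definitions above, the sketch below counts the qualifying samples in a toy mutation matrix. The data and the helper count_samples are hypothetical and not part of WExT; the script turns such counts into p-values, which this sketch does not attempt.

```python
# Rows are samples; columns are the k genes in M (1 = mutated, 0 = not mutated).
# Toy data, for illustration only.
mutations = [
    [1, 0, 0],  # exactly one mutated gene
    [0, 1, 0],  # exactly one mutated gene
    [1, 1, 0],  # two mutated genes
    [1, 1, 1],  # all three genes mutated
    [0, 0, 0],  # no mutated genes
]
k = 3


def count_samples(matrix, predicate):
    """Count the rows whose number of mutated genes satisfies the predicate."""
    return sum(1 for row in matrix if predicate(sum(row)))


exclusivity = count_samples(mutations, lambda n: n == 1)
any_co_occurrence = count_samples(mutations, lambda n: n >= 2)
all_co_occurrence = count_samples(mutations, lambda n: n == k)

print(exclusivity, any_co_occurrence, all_co_occurrence)  # prints: 2 2 1
```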

We defined these statistics in the check_condition function of the wext/saddlepoint.py script. You can add your own statistics by changing this function. The condition variable is a string that names the statistic, and the state variable is a binary vector of length k that indicates, for a single sample, the presence or absence of a mutation in each gene of a gene set M of size k. For example, the following block of code tests for mutual exclusivity:

    if condition=='exclusivity':
        if sum(state)==1:
            return True
        else:
            return False
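
A self-contained sketch of how such a function might cover all three statistics, plus one custom statistic, is shown here. The argument order (state, condition), the error handling, and the 'at-most-one' statistic are assumptions for illustration; consult wext/saddlepoint.py for the actual definition.

```python
def check_condition(state, condition):
    """Return True if one sample's mutation state satisfies the named statistic.

    state: binary sequence of length k (1 = gene mutated in this sample).
    condition: string naming the statistic.
    """
    num_mutated = sum(state)
    if condition == 'exclusivity':
        return num_mutated == 1
    elif condition == 'any-co-occurrence':
        return num_mutated >= 2
    elif condition == 'all-co-occurrence':
        return num_mutated == len(state)
    elif condition == 'at-most-one':
        # Hypothetical custom statistic, not part of WExT.
        return num_mutated <= 1
    else:
        raise NotImplementedError('Unknown statistic: {}'.format(condition))


print(check_condition([1, 0, 0], 'exclusivity'))        # prints: True
print(check_condition([1, 1, 0], 'all-co-occurrence'))  # prints: False
```

If you add a statistic this way, you may also need to allow its name wherever the -s/--statistic argument is validated; that is an assumption to verify against the source.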

Arguments

The find_sets.py script accepts the following arguments, some of which are required:

    python find_sets.py [GENERAL_ARGUMENTS]

| Argument | Required (Default) | Description |
| --- | --- | --- |
| -mf/--mutation_files | True | Path to mutation file(s) generated by process_mutations.py. |
| -wf/--weights_file | True | Path to mutation probabilities (weights) file generated by compute_mutation_probabilities.py. |
| -k/--gene_set_size | True | Gene set size. |
| -s/--statistic | True | Statistic to use: exclusivity, any-co-occurrence, or all-co-occurrence. |
| -o/--output_prefix | True | Path to output prefix. |
| -f/--min_frequency | False (1) | Genes mutated in fewer than the given number of samples will be excluded. |
| -c/--num_cores | False (1) | Number of cores to utilize using multiprocessing. |
| -r/--report_invalids | False | Report sets with p-values computed as NaN, less than -1E-3, or greater than 1 + 1E-3 to stderr. |
| -h/--help | False | Display usage instructions. |
| -v/--verbose | False (0) | Choices: 0, 1, 2, 3, 4, 5. Higher values correspond to more verbose output. |
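
As a usage illustration, the sketch below assembles a find_sets.py command line from the arguments listed above. The file paths, their extensions, and the chosen parameter values are hypothetical placeholders; only the flag names come from this page.

```python
import shlex

# Hypothetical placeholder paths; substitute the files produced by
# process_mutations.py and compute_mutation_probabilities.py.
mutation_file = 'mutation_data.json'
weights_file = 'mutation_weights.npy'
output_prefix = 'results/exclusive_pairs'

command = [
    'python', 'find_sets.py',
    '-mf', mutation_file,
    '-wf', weights_file,
    '-k', '2',             # search gene sets of size 2
    '-s', 'exclusivity',   # test for mutual exclusivity
    '-o', output_prefix,
    '-f', '5',             # exclude genes mutated in fewer than 5 samples
    '-c', '4',             # use 4 cores
]

# Print a shell-safe version of the command for inspection.
print(' '.join(shlex.quote(arg) for arg in command))
```

To run the command from Python rather than a shell, you could pass the list to subprocess.run(command, check=True) from the directory that contains find_sets.py.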

Last modified: 1:43 PM Tuesday, Jan 2, 2018 (EST)
