Find sets
We search for gene sets M with significantly exclusive, co-occurring, or other patterns of mutations, given the processed mutation data and mutation probabilities, using the find_sets.py script. This script enumerates sets of size k and, using the saddlepoint approximation for a weighted row (WR) statistic, computes each set's p-value and false discovery rate (FDR) and records its runtime.
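As a rough illustration of the enumeration step only, the sketch below lists every candidate gene set of size k with itertools.combinations. The gene names and the enumerate_gene_sets helper are hypothetical, the p-value computation is not shown, and find_sets.py may organize this step differently.

```python
from itertools import combinations


def enumerate_gene_sets(genes, k):
    """Return an iterator over all sets of k distinct genes (hypothetical helper).

    Genes are sorted first so that the enumeration order is reproducible.
    """
    return combinations(sorted(genes), k)


genes = ["TP53", "KRAS", "EGFR", "BRAF"]  # illustrative gene names only
gene_sets = list(enumerate_gene_sets(genes, 2))
print(len(gene_sets))  # 4 choose 2 = 6 candidate sets
```

The number of candidate sets grows as n choose k, which is one reason the gene set size and the minimum mutation frequency matter for the running time.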
We implemented the following statistics, which you can choose using the -s/--statistic argument of the find_sets.py script.
- Exclusivity (exclusivity): the number of samples with mutually exclusive mutations (exactly one mutated gene in a gene set M)
- Any co-occurrence (any-co-occurrence): the number of samples with any co-occurring mutations (at least two mutated genes in a gene set M)
- All co-occurrence (all-co-occurrence): the number of samples with all co-occurring mutations (exactly k mutated genes in a gene set M of size k)
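To make the three definitions concrete, here is a minimal sketch that counts each statistic on a toy data set. The count_statistics helper and the input layout (one binary tuple of length k per sample) are assumptions made for illustration, not part of the project's code.

```python
def count_statistics(samples):
    """Count the three statistics over per-sample binary states.

    samples: list of tuples, one per sample, each of length k, where 1 marks
    a mutated gene of the set M in that sample (hypothetical input layout).
    """
    k = len(samples[0])
    return {
        # exactly one gene of M is mutated in the sample
        "exclusivity": sum(1 for s in samples if sum(s) == 1),
        # at least two genes of M are mutated in the sample
        "any-co-occurrence": sum(1 for s in samples if sum(s) >= 2),
        # all k genes of M are mutated in the sample
        "all-co-occurrence": sum(1 for s in samples if sum(s) == k),
    }


# Five samples for a gene set of size k = 3.
samples = [(1, 0, 0), (0, 1, 0), (1, 1, 0), (1, 1, 1), (0, 0, 0)]
stats = count_statistics(samples)
print(stats)
```

Here two samples are exclusive, two have any co-occurrence, and one has all three genes mutated; the sample with no mutations counts toward none of the statistics.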
We defined these statistics in the check_condition function of the wext/saddlepoint.py script. You can add your own statistics by changing this function. In particular, the condition variable is a string that describes the statistic and the state variable is a binary vector of length k that indicates the presence or absence of mutations in a gene set M of size k for each sample, so the following block of code tests for mutual exclusivity:
    if condition=='exclusivity':
        if sum(state)==1:
            return True
        else:
            return False
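Extending this pattern to a custom statistic might look like the sketch below. The function name check_condition_sketch, its argument order, and the at-least-one statistic are all hypothetical; the actual check_condition in wext/saddlepoint.py may have a different signature, so treat this as a template rather than a drop-in replacement.

```python
def check_condition_sketch(state, condition):
    """Return True if one sample's binary state satisfies the named statistic.

    state: binary sequence of length k (1 = gene mutated in this sample).
    condition: string naming the statistic.
    Assumed signature; not the project's actual function.
    """
    mutated = sum(state)
    if condition == 'exclusivity':
        return mutated == 1
    elif condition == 'any-co-occurrence':
        return mutated >= 2
    elif condition == 'all-co-occurrence':
        return mutated == len(state)
    elif condition == 'at-least-one':  # hypothetical custom statistic
        return mutated >= 1
    raise ValueError('Unknown condition: {}'.format(condition))


print(check_condition_sketch([0, 1, 0], 'exclusivity'))
print(check_condition_sketch([1, 1, 0], 'at-least-one'))
```

Raising an error for unrecognized condition strings is a design choice of this sketch: it keeps a misspelled statistic name from silently evaluating to False for every sample.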
The find_sets.py script requires several arguments:
python find_sets.py [GENERAL_ARGUMENTS]
| Argument | Required (Default) | Description |
|---|---|---|
| -mf/--mutation_files | True | Path to mutation file(s), generated by process_mutations.py. |
| -wf/--weights_file | True | Path to mutation probabilities (weights) file generated by compute_mutation_probabilities.py. |
| -k/--gene_set_size | True | Gene set size. |
| -s/--statistic | True | Statistic to compute: exclusivity, any-co-occurrence, or all-co-occurrence. |
| -o/--output_prefix | True | Path to output prefix. |
| -f/--min_frequency | False (1) | Genes mutated in fewer than the given number of samples will be excluded. |
| -c/--num_cores | False (1) | Number of cores to use for multiprocessing. |
| -r/--report_invalids | False | Report to stderr any sets whose computed p-values are NaN, less than -1E-3, or greater than 1 + 1E-3. |
| -h/--help | False | Display usage instructions. |
| -v/--verbose | False (0) | Choices: 0, 1, 2, 3, 4, 5. Higher values correspond to more verbose output. |
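As an example of how the arguments in the table fit together, the sketch below assembles one possible invocation as an argument list and prints it as a shell command. Only the flags come from the table; every path and value is a placeholder chosen for illustration, not a file shipped with the project.

```python
import shlex

# All paths and values below are placeholders (assumptions), not real project files.
argv = [
    "python", "find_sets.py",
    "-mf", "data/mutations.json",     # output of process_mutations.py
    "-wf", "data/weights.npy",        # output of compute_mutation_probabilities.py
    "-k", "3",                        # gene set size
    "-s", "exclusivity",              # statistic
    "-o", "output/exclusivity_k3",    # output prefix
    "-f", "5",                        # skip genes mutated in fewer than 5 samples
    "-c", "4",                        # number of cores
]

# Quote each argument so the printed line is safe to paste into a shell.
command = " ".join(shlex.quote(arg) for arg in argv)
print(command)
```

To run the script from Python rather than printing the command, the same list could be passed to subprocess.run(argv, check=True).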
Last modified: 1:43 PM Tuesday, Jan 2, 2017 (EST)