Commit 0c33196

Merge pull request #59 from matsengrp/process-chen-pa-dms
Process Chen et al. PA DMS data and integrate into pipeline
2 parents c31d26d + c8f5e79 commit 0c33196

4 files changed: +1177 -100 lines changed

README.md

Lines changed: 11 additions & 1 deletion
@@ -237,6 +237,13 @@ Processes raw deep mutational scanning (DMS) data from external sources into sta
 2. Aligns the DMS NEP sequence (PR8 strain) to the reference using MUSCLE; uses actual DMS site numbers for coordinate mapping to handle non-consecutive site numbering
 3. Outputs `results/dms_data/Teo_NEP/processed_dms_data.csv` — NEP DMS data with tree reference site numbering

+**`process_dms_data_chen_pa.ipynb`** (Chen et al., PA):
+1. Parses the AA mutation column to extract wildtype AA, DMS site, and mutant AA; excludes indel and stop codon mutations
+2. Averages the two no-drug fitness replicates (P1NO-1-fit, P1NO-2-fit) and groups by AA mutation (multiple nucleotide changes can produce the same AA change)
+3. Computes log-scale DMS effects: `log(mean fitness without drug)`
+4. Aligns the DMS sequence (first 240 AA of PA) to the PA tree reference using MUSCLE to establish site numbering correspondence
+5. Outputs `results/dms_data/Chen_PA/processed_dms_data.csv` — PA DMS data with tree reference site numbering
+
 ### Step 11: Analyze Site-Specific Rates

 Executes an analysis notebook that:
@@ -450,6 +457,7 @@ Located in the `results/` root directory:
 - `Li_PB1/processed_dms_data.csv` - PB1 DMS fitness effects (Li et al. 2023) with tree reference site numbering
 - `Hom_M1/processed_dms_data.csv` - M1 DMS fitness effects (Hom et al. 2019) with tree reference site numbering
 - `Teo_NEP/processed_dms_data.csv` - NEP DMS fitness effects (Teo et al. 2024) with tree reference site numbering
+- `Chen_PA/processed_dms_data.csv` - PA DMS fitness effects (Chen et al. 2024) for first 240 AA with tree reference site numbering

 ### Output Structure Example

@@ -494,7 +502,9 @@ results/
 │   │   └── processed_dms_data.csv
 │   ├── Hom_M1/
 │   │   └── processed_dms_data.csv
-│   └── Teo_NEP/
+│   ├── Teo_NEP/
+│   │   └── processed_dms_data.csv
+│   └── Chen_PA/
 │       └── processed_dms_data.csv
 ├── .process_dms_data_soh_pb2.done
 ├── .summarize_filter_logs.done
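
The notebook steps listed in the README hunk for `process_dms_data_chen_pa.ipynb` (parse, average, log-transform, map sites) can be sketched in plain Python. This is not the notebook's code, which this commit view does not show. It is a minimal illustration under stated assumptions: mutation strings are assumed to look like `A37T`, nucleotide variants that yield the same AA change are assumed to be combined by a simple mean, `log` is assumed to be the natural log, non-positive mean fitness is skipped, and the aligned sequences are passed in as strings rather than produced by running MUSCLE. The names `process_rows` and `map_sites` are hypothetical.

```python
import math
import re
from collections import defaultdict

# Assumed format: wildtype AA, site number, mutant AA (e.g. "A37T").
# Indels and stop codons do not match this pattern and are dropped.
AA = "ACDEFGHIKLMNPQRSTVWY"
AA_MUTATION = re.compile(rf"^([{AA}])(\d+)([{AA}])$")


def process_rows(rows):
    """rows: iterable of (aa_mutation, fit_rep1, fit_rep2); hypothetical layout."""
    per_mutation = defaultdict(list)
    for aa_mutation, rep1, rep2 in rows:
        if AA_MUTATION.match(aa_mutation) is None:
            continue  # exclude indels and stop codons
        # Average the two no-drug replicates for this nucleotide variant.
        per_mutation[aa_mutation].append((rep1 + rep2) / 2)

    records = []
    for aa_mutation, values in per_mutation.items():
        wildtype, site, mutant = AA_MUTATION.match(aa_mutation).groups()
        # Assumption: variants with the same AA change are combined by a mean.
        mean_fitness = sum(values) / len(values)
        if mean_fitness <= 0:
            continue  # log is undefined here; the real handling is not stated
        records.append({
            "wildtype": wildtype,
            "dms_site": int(site),
            "mutant": mutant,
            "effect": math.log(mean_fitness),  # assumed natural log
        })
    return records


def map_sites(aligned_dms, aligned_ref):
    """Map 1-based DMS sites to 1-based reference sites from a pairwise alignment."""
    mapping = {}
    dms_site = ref_site = 0
    for dms_char, ref_char in zip(aligned_dms, aligned_ref):
        if dms_char != "-":
            dms_site += 1
        if ref_char != "-":
            ref_site += 1
        if dms_char != "-" and ref_char != "-":
            mapping[dms_site] = ref_site  # both sequences have a residue here
    return mapping
```

A DMS site that aligns to a gap in the reference gets no entry in the mapping, so such rows would need to be dropped or flagged before writing `processed_dms_data.csv`.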

Snakefile

Lines changed: 25 additions & 1 deletion
@@ -116,6 +116,7 @@ final_outputs.extend([
     f"{config['output_dir']}/dms_data/Li_PB1/processed_dms_data.csv",
     f"{config['output_dir']}/dms_data/Hom_M1/processed_dms_data.csv",
     f"{config['output_dir']}/dms_data/Teo_NEP/processed_dms_data.csv",
+    f"{config['output_dir']}/dms_data/Chen_PA/processed_dms_data.csv",
 ])

 # Add analyze_fitness_effects to final targets
@@ -487,6 +488,28 @@ rule process_dms_data_teo_nep:
             -p data_dir $data_dir &> ../{log}
         """

+# Process PA DMS data from Chen et al.
+rule process_dms_data_chen_pa:
+    input:
+        notebook="notebooks/process_dms_data_chen_pa.ipynb",
+        pa_data="data/dms_data/Chen_PA/fitness calculation.xlsx"
+    output:
+        pa_dms="{output_dir}/dms_data/Chen_PA/processed_dms_data.csv"
+    params:
+        data_dir=config["data_dir"]
+    log:
+        "{output_dir}/logs/process_dms_data_chen_pa.log"
+    shell:
+        """
+        data_dir=$(realpath {params.data_dir}) && \
+        cd notebooks && \
+        papermill \
+            process_dms_data_chen_pa.ipynb \
+            process_dms_data_chen_pa.ipynb \
+            -p data_dir $data_dir &> ../{log}
+        """
+
+
 # Analyze fitness effects and compare to DMS data
 rule analyze_fitness_effects:
     input:
@@ -499,7 +522,8 @@ rule analyze_fitness_effects:
         na_dms="{output_dir}/dms_data/Wang_NA/processed_dms_data.csv",
         pb1_dms="{output_dir}/dms_data/Li_PB1/processed_dms_data.csv",
         m1_dms="{output_dir}/dms_data/Hom_M1/processed_dms_data.csv",
-        nep_dms="{output_dir}/dms_data/Teo_NEP/processed_dms_data.csv"
+        nep_dms="{output_dir}/dms_data/Teo_NEP/processed_dms_data.csv",
+        pa_dms="{output_dir}/dms_data/Chen_PA/processed_dms_data.csv"
     output:
         touch("{output_dir}/.analyze_fitness_effects.done")
     log:
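
The Snakefile changes connect in two places: the new entry in `final_outputs` is a concrete path built from `config['output_dir']`, while the `process_dms_data_chen_pa` rule declares its output with an `{output_dir}` wildcard. The sketch below illustrates, in simplified form, how such a concrete target can be matched against a wildcard pattern. It is not Snakemake's implementation; `match_output` is a hypothetical helper, the `config` value is assumed, and each wildcard is treated as matching one or more characters (Snakemake's default wildcard regex, `.+`). The helper also assumes each wildcard name appears only once in the pattern.

```python
import re


def match_output(pattern, target):
    """Return wildcard values if `target` fits `pattern`, else None (simplified)."""
    # re.split with a capturing group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2:
            regex += f"(?P<{part}>.+)"  # wildcard: one or more characters
        else:
            regex += re.escape(part)  # literal path fragment
    match = re.fullmatch(regex, target)
    return match.groupdict() if match else None


config = {"output_dir": "results"}  # assumed value, for illustration only

# Concrete target, as built in final_outputs.
target = f"{config['output_dir']}/dms_data/Chen_PA/processed_dms_data.csv"

# Output pattern, as declared in rule process_dms_data_chen_pa.
pattern = "{output_dir}/dms_data/Chen_PA/processed_dms_data.csv"

wildcards = match_output(pattern, target)
```

With the assumed config, `wildcards` holds the value that would be substituted into the rule's `log` path, and a target for a different dataset directory (for example `Teo_NEP`) does not match this rule's pattern.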

notebooks/analyze_fitness_effects.ipynb

Lines changed: 288 additions & 98 deletions
Large diffs are not rendered by default.
