-I am a PhD candidate in **Molecular Biology** at the **University of Vermont** working at the intersection of **regulatory genomics, RNA biology, and machine learning**.
+I am a PhD candidate in **Molecular Biology** at the **University of Vermont** working at the intersection of **RNA biology, regulatory genomics, and machine learning**.

-My work focuses on building **sequence-to-function models** and **computational pipelines for RNA-seq and single-cell genomics** to study alternative splicing, gene regulation, and regulatory sequence function, with a primary focus on **immune cell systems** including **T cells, B cells, and macrophages**.
+My work centers on building **sequence-to-function models** and **computational pipelines for RNA-seq and single-cell genomics** to study alternative splicing, regulatory sequence function, and immune cell type–specific gene regulation.

 I am primarily targeting roles in **computational biology, machine learning for genomics, bioinformatics, and scientific software for biology**.

 ---

-## Selected Work
+## What I Work On

-**Sequence-to-Function Modeling for Splicing**
-Developing and fine-tuning deep learning models, including genomic foundation models, to predict splicing behavior and regulatory activity from sequence.
+**Sequence-to-Function Modeling**
+Developing and fine-tuning deep learning models, including genomic foundation models, to predict splicing behavior and regulatory activity directly from sequence.

 **RNA-seq Splicing Data Systems**
 Building end-to-end pipelines integrating STAR, StringTie, and rMATS to generate reproducible splicing labels, PSI estimates, and training-ready features from heterogeneous datasets.

 **Single-cell and Multi-omic Analysis**
-Designing analysis workflows for Drop-seq, 10x, and CITE-seq data, including QC, normalization, clustering, and differential analysis, informed by direct hands-on experience with experimental workflows.
+Designing analysis workflows for Drop-seq, 10x, and CITE-seq data, including QC, normalization, clustering, and differential analysis, informed by direct hands-on familiarity with experimental workflows.

 **Model Interpretation and Motif Discovery**
 Using attribution methods, TF-MoDISco, and downstream motif analysis to connect model predictions back to testable regulatory hypotheses.
@@ -36,13 +36,13 @@ Using attribution methods, TF-MoDISco, and downstream motif analysis to connect
 My dissertation, **“Programmed Delayed Splicing and Immune Cell Type–Specific Alternative Splicing,”** integrates experimental RNA biology with deep learning to identify regulatory elements controlling splicing kinetics and cell-type specificity.

-My first-author manuscript on **programmed splice delays in inflammatory gene regulation** is currently **in review at eLife**.
+My first-author manuscript on **programmed delayed splicing in inflammatory gene regulation** is currently **in review at eLife**.

 Recent published collaborative work includes contributions to *Cell Reports* (2025) and *ACS Nano* (2024).
ml.md (9 additions & 7 deletions)
@@ -5,20 +5,21 @@ permalink: /engineering/
 author_profile: true
 ---

-Engineering systems and training workflows behind my ML work, for readers interested in how models are built, trained, and interpreted—not just what they predict.
+This page focuses on the technical side of my work: model design, training workflows, interpretability tooling, and the engineering systems that support large-scale genomics analysis.

 Tooling for sequence-to-function modeling is available in [**`sequence_to_function_model_tools`**](https://github.com/jsdearbo/sequence_to_function_model_tools).

 ---

-## Model Architecture
+## Modeling Focus

-I build sequence-to-function models that map raw genomic sequence to splicing outcomes such as PSI and intron retention, including cell-type–aware prediction heads.
+I build sequence-to-function models that map raw genomic sequence to splicing-related outcomes such as PSI and intron retention.

-Current architecture work:
+Current areas of model development:
 - Transformer and CNN-hybrid architectures for long-range sequence context
 - Task-specific fine-tuning of foundation models such as **Borzoi** via **LoRA/PEFT**
 - Single-task and multitask prediction across B cell, T cell, and macrophage contexts
+- Evaluation of model behavior through attribution, perturbation, and embedding-space analyses

 ---
@@ -35,26 +36,27 @@ Training pipelines designed for HPC use and iterative model development:
 ---

-## Interpretability Pipeline
+## Interpretability and Analysis

 Model predictions are most useful when they support testable biological hypotheses.

 - **DeepLIFT / DeepSHAP** via Captum for per-nucleotide attribution
 - **TF-MoDISco** motif discovery from attribution maps
 - **SEA / FIMO** (MEME Suite) for motif scanning and validation
 - **In silico mutagenesis** — motif scrambling/deletion with batch prediction on perturbed sequences
-- Embedding-space visualization with **UMAP** for exploratory analysis and interpretation
+- Embedding-space visualization with **UMAP** for exploratory analysis and representation-level interpretation
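
The in silico mutagenesis item in the list above (motif scrambling/deletion followed by batch prediction) can be illustrated with a minimal sketch. It is not the repository's implementation: the function names, the motif coordinates, and the `toy_predict` scorer are hypothetical stand-ins, and a real workflow would pass a trained model's batch prediction function instead.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Motif = Tuple[int, int]  # half-open [start, end) coordinates within the sequence


def scramble(seq: str, start: int, end: int, rng: random.Random) -> str:
    """Shuffle the bases inside [start, end), leaving flanks and length intact."""
    window = list(seq[start:end])
    rng.shuffle(window)
    return seq[:start] + "".join(window) + seq[end:]


def delete(seq: str, start: int, end: int) -> str:
    """Remove the bases inside [start, end); the sequence gets shorter."""
    return seq[:start] + seq[end:]


def perturbation_effects(
    seq: str,
    motifs: Sequence[Motif],
    predict: Callable[[List[str]], List[float]],
    seed: int = 0,
) -> List[Dict[str, object]]:
    """Score the reference and every perturbed variant in one batch call."""
    rng = random.Random(seed)
    variants = [seq]  # index 0 is the unperturbed reference
    for start, end in motifs:
        variants.append(scramble(seq, start, end, rng))
        variants.append(delete(seq, start, end))
    scores = predict(variants)
    reference = scores[0]
    effects = []
    for i, motif in enumerate(motifs):
        effects.append({
            "motif": motif,
            "scramble_delta": scores[1 + 2 * i] - reference,
            "delete_delta": scores[2 + 2 * i] - reference,
        })
    return effects


def toy_predict(seqs: List[str]) -> List[float]:
    """Hypothetical stand-in for a model: counts a fixed 6-mer per sequence."""
    return [float(s.count("GTAAGT")) for s in seqs]


if __name__ == "__main__":
    sequence = "AAAA" + "GTAAGT" + "CCCC"
    for effect in perturbation_effects(sequence, [(4, 10)], toy_predict):
        print(effect)
```

Scoring all variants in a single `predict` call mirrors the batch-prediction framing in the list; a fixed seed keeps the scrambled variants reproducible between runs.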

 ---

-## Code Examples
+## Code and Workflow Examples

 > *Annotated notebooks — coming soon*

 Planned examples:
 - Fine-tuning Borzoi for PSI prediction: LoRA setup, data loading, and evaluation
 - Running TF-MoDISco from Captum attribution outputs
 - CoSI calculation from kinetic RNA-seq
+- Config-driven experiment setup for reproducible model training
portfolio.md (6 additions & 7 deletions)
@@ -5,7 +5,7 @@ permalink: /portfolio/
 author_profile: true
 ---

-Selected case studies linking the biological question to the computational system built to answer it. Each project is driven by a real scientific problem and designed for reproducibility, interpretability, and scale.
+Selected case studies linking biological questions to the computational systems built to answer them. Each project is grounded in a concrete scientific problem and emphasizes reproducibility, interpretability, and analytical rigor.

 ---
@@ -24,13 +24,13 @@ An end-to-end Python pipeline to quantify intron excision dynamics from kinetic
 - Custom interval engineering for intron-level quantification
 - Adapted **Completed Splicing Index (CoSI)** to quantify splicing completion per intron and timepoint
 - Python-based aggregation, normalization, and visualization across replicates and conditions
-- Fine-tuning of a genomic foundation model to prioritize candidate regulatory sequence features associated with delayed splicing
+- Follow-up sequence modeling to prioritize candidate regulatory features associated with delayed splicing
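
The CoSI item in the list above can be illustrated with a minimal sketch. It is not the pipeline's code: it assumes a simplified intron-level form, spliced junction reads divided by spliced reads plus the mean of the unspliced reads at the 5' and 3' intron boundaries, and the adapted definition used in the project may differ. The `JunctionCounts` record, its field names, and the `min_reads` threshold are hypothetical.

```python
from typing import Dict, Iterable, NamedTuple, Optional, Tuple


class JunctionCounts(NamedTuple):
    """Hypothetical per-intron, per-timepoint read counts."""
    intron_id: str
    timepoint: float      # e.g. minutes after stimulation (assumed unit)
    spliced: int          # reads spanning the exon-exon junction
    unspliced_5ss: int    # reads crossing the 5' exon-intron boundary
    unspliced_3ss: int    # reads crossing the 3' intron-exon boundary


def cosi(spliced: int, unspliced_5ss: int, unspliced_3ss: int,
         min_reads: float = 10) -> Optional[float]:
    """Fraction of informative reads that are spliced (assumed simplified form).

    Returns None when coverage is below min_reads, so low-count introns
    are not reported as confidently spliced or unspliced.
    """
    unspliced = (unspliced_5ss + unspliced_3ss) / 2
    total = spliced + unspliced
    if total == 0 or total < min_reads:
        return None
    return spliced / total


def cosi_table(records: Iterable[JunctionCounts]
               ) -> Dict[Tuple[str, float], Optional[float]]:
    """Map (intron_id, timepoint) to its CoSI value."""
    return {
        (r.intron_id, r.timepoint): cosi(r.spliced, r.unspliced_5ss, r.unspliced_3ss)
        for r in records
    }


if __name__ == "__main__":
    demo = [
        JunctionCounts("intron_A", 0.0, 5, 40, 30),
        JunctionCounts("intron_A", 30.0, 30, 10, 10),
    ]
    for key, value in cosi_table(demo).items():
        print(key, value)
```

Under this assumed definition, an intron whose value rises slowly across timepoints would be a candidate for delayed splicing; replicate aggregation and normalization are left out of the sketch.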

 **The Engineering Challenge**
 Extracting accurate intron-level kinetics requires precise interval handling, isoform-aware quantification, and robust normalization across temporal datasets at a resolution standard RNA-seq workflows are not designed to support out of the box.

 **Outcome**
-Identified a class of "bottleneck introns" that delay inflammatory gene expression. Minigene assays experimentally validated that weak 5' splice donors drive this delay in a subset of targets, while model interpretation highlighted additional putative non-canonical regulatory motifs for future follow-up. *(First-author manuscript currently in review at eLife.)*
+Identified a class of "bottleneck introns" that delay inflammatory gene expression. Minigene assays experimentally validated that weak 5' splice donors drive this delay in a subset of targets, while downstream model interpretation highlighted additional putative non-canonical regulatory motifs for future follow-up. *(First-author manuscript currently in review at eLife.)*

 > *[Annotated notebook — coming soon]*
@@ -51,7 +51,6 @@ A unified data engineering, modeling, and interpretation framework for immune ce
 - PSI extraction from rMATS and integration of exon inclusion / intron retention labels into genome-scale training targets
 - Automated labeling of exons, introns, and intergenic regions with reproducible YAML-driven configuration
 - Fine-tuning of Borzoi for splicing prediction using both single-task and multitask training strategies
-- PyTorch Lightning–based training workflows with mixed precision, checkpointing, and HPC execution
 - Attribution-based interpretation with DeepSHAP / TF-MoDISco to identify **cis-regulatory motifs** associated with lineage-specific splicing behavior
 - In silico perturbation analyses to test the functional importance of discovered sequence elements
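
The PSI extraction item in the list above can be illustrated with a minimal sketch of a length-normalized PSI calculation from inclusion and skipping junction counts, in the spirit of rMATS inclusion levels. The function name, argument names, and the `min_reads` filter are assumptions for illustration, not the project's code.

```python
from typing import Optional


def psi_from_counts(inc_reads: int, skip_reads: int,
                    inc_len: int, skip_len: int,
                    min_reads: int = 10) -> Optional[float]:
    """Length-normalized percent spliced in (PSI), as a fraction in [0, 1].

    inc_len and skip_len are the effective lengths of the inclusion and
    skipping isoform forms. Returns None when total coverage is below
    min_reads, so the event can be masked out of training targets.
    """
    if inc_len <= 0 or skip_len <= 0:
        raise ValueError("effective lengths must be positive")
    total = inc_reads + skip_reads
    if total == 0 or total < min_reads:
        return None
    inc_density = inc_reads / inc_len      # reads per unit effective length
    skip_density = skip_reads / skip_len
    return inc_density / (inc_density + skip_density)


if __name__ == "__main__":
    # 30 inclusion reads over twice the effective length of the skipping form.
    print(psi_from_counts(30, 10, inc_len=2, skip_len=1))
```

Returning `None` for low-coverage events, rather than a noisy ratio, lets the labeling step decide explicitly whether to drop or mask them.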
@@ -65,17 +64,17 @@ Produced a scalable training dataset and modeling framework for immune cell type
 ---

-## Technical Systems: Splicing Data Infrastructure & HPC Orchestration
+## Supporting Systems: Reproducible Splicing Data Infrastructure

 **What it does**
-Provides the reproducible computational backbone supporting large-scale splicing analysis and model training across projects.
+Provides the computational backbone supporting large-scale splicing analysis and model training across projects.

 **How it's built**
 - SLURM-based orchestration with multi-GPU and CPU fallback strategies
 - Conda environment isolation per project
 - YAML/JSON-driven configuration for reproducible pipeline execution
 - BigWig-based data handling and custom Python utilities for genomic interval processing
-- Checkpointed model training and restartable analysis workflows for long-running jobs
+- Checkpointed training and restartable analysis workflows for long-running jobs
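
The configuration-driven item in the list above can be illustrated with a minimal sketch using only the Python standard library and JSON (a YAML variant would need a third-party parser). The `RunConfig` fields, `load_config`, and `config_digest` are hypothetical names chosen for illustration, not the project's actual schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical pipeline settings; frozen so a run cannot mutate them."""
    run_name: str
    genome_fasta: str
    learning_rate: float = 1e-4
    seed: int = 0


def load_config(text: str) -> RunConfig:
    """Parse JSON text into a RunConfig, rejecting unrecognized keys."""
    raw = json.loads(text)
    known = {f.name for f in fields(RunConfig)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return RunConfig(**raw)


def config_digest(cfg: RunConfig) -> str:
    """Short hash of the canonical config, usable as a run or output tag."""
    canonical = json.dumps(asdict(cfg), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


if __name__ == "__main__":
    cfg = load_config('{"run_name": "demo", "genome_fasta": "genome.fa"}')
    print(cfg, config_digest(cfg))
```

Rejecting unknown keys catches typos in a config file before a long job starts, and tagging outputs with the digest makes it possible to tell which settings produced a given checkpoint.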

 **Why it matters**
 Genomics workflows are computationally heavy, I/O-bound, and often difficult to reproduce across environments. Treating infrastructure as a first-class part of the work makes the biological conclusions more auditable, portable, and scalable.