Commit 59f59f7

minor content edits
1 parent 96d1769 commit 59f59f7

3 files changed

Lines changed: 24 additions & 23 deletions

index.md

Lines changed: 9 additions & 9 deletions
@@ -6,26 +6,26 @@ author_profile: true
 ---

 # Jake Dearborn, PhD Candidate (Defending June 2026)
-**Machine Learning · Regulatory Genomics · RNA Biology**
+**Computational Biology · Regulatory Genomics · Machine Learning**

-I am a PhD candidate in **Molecular Biology** at the **University of Vermont** working at the intersection of **regulatory genomics, RNA biology, and machine learning**.
+I am a PhD candidate in **Molecular Biology** at the **University of Vermont** working at the intersection of **RNA biology, regulatory genomics, and machine learning**.

-My work focuses on building **sequence-to-function models** and **computational pipelines for RNA-seq and single-cell genomics** to study alternative splicing, gene regulation, and regulatory sequence function, with a primary focus on **immune cell systems** including **T cells, B cells, and macrophages**.
+My work centers on building **sequence-to-function models** and **computational pipelines for RNA-seq and single-cell genomics** to study alternative splicing, regulatory sequence function, and immune cell type–specific gene regulation.

 I am primarily targeting roles in **computational biology, machine learning for genomics, bioinformatics, and scientific software for biology**.

 ---

-## Selected Work
+## What I Work On

-**Sequence-to-Function Modeling for Splicing**
-Developing and fine-tuning deep learning models, including genomic foundation models, to predict splicing behavior and regulatory activity from sequence.
+**Sequence-to-Function Modeling**
+Developing and fine-tuning deep learning models, including genomic foundation models, to predict splicing behavior and regulatory activity directly from sequence.

 **RNA-seq Splicing Data Systems**
 Building end-to-end pipelines integrating STAR, StringTie, and rMATS to generate reproducible splicing labels, PSI estimates, and training-ready features from heterogeneous datasets.

 **Single-cell and Multi-omic Analysis**
-Designing analysis workflows for Drop-seq, 10x, and CITE-seq data, including QC, normalization, clustering, and differential analysis, informed by direct hands-on experience with experimental workflows.
+Designing analysis workflows for Drop-seq, 10x, and CITE-seq data, including QC, normalization, clustering, and differential analysis, informed by direct hands-on familiarity with experimental workflows.

 **Model Interpretation and Motif Discovery**
 Using attribution methods, TF-MoDISco, and downstream motif analysis to connect model predictions back to testable regulatory hypotheses.
@@ -36,13 +36,13 @@ Using attribution methods, TF-MoDISco, and downstream motif analysis to connect

 My dissertation, **“Programmed Delayed Splicing and Immune Cell Type–Specific Alternative Splicing,”** integrates experimental RNA biology with deep learning to identify regulatory elements controlling splicing kinetics and cell-type specificity.

-My first-author manuscript on **programmed splice delays in inflammatory gene regulation** is currently **in review at eLife**.
+My first-author manuscript on **programmed delayed splicing in inflammatory gene regulation** is currently **in review at eLife**.

 Recent published collaborative work includes contributions to *Cell Reports* (2025) and *ACS Nano* (2024).

 ---

-## Technical Focus
+## Technical Areas

 **Machine Learning**
 - PyTorch, PyTorch Lightning

ml.md

Lines changed: 9 additions & 7 deletions
@@ -5,20 +5,21 @@ permalink: /engineering/
 author_profile: true
 ---

-Engineering systems and training workflows behind my ML work, for readers interested in how models are built, trained, and interpreted—not just what they predict.
+This page focuses on the technical side of my work: model design, training workflows, interpretability tooling, and the engineering systems that support large-scale genomics analysis.

 Tooling for sequence-to-function modeling is available in [**`sequence_to_function_model_tools`**](https://github.com/jsdearbo/sequence_to_function_model_tools).

 ---

-## Model Architecture
+## Modeling Focus

-I build sequence-to-function models that map raw genomic sequence to splicing outcomes such as PSI and intron retention, including cell-type–aware prediction heads.
+I build sequence-to-function models that map raw genomic sequence to splicing-related outcomes such as PSI and intron retention.

-Current architecture work:
+Current areas of model development:
 - Transformer and CNN-hybrid architectures for long-range sequence context
 - Task-specific fine-tuning of foundation models such as **Borzoi** via **LoRA/PEFT**
 - Single-task and multitask prediction across B cell, T cell, and macrophage contexts
+- Evaluation of model behavior through attribution, perturbation, and embedding-space analyses

 ---

@@ -35,26 +36,27 @@ Training pipelines designed for HPC use and iterative model development:

 ---

-## Interpretability Pipeline
+## Interpretability and Analysis

 Model predictions are most useful when they support testable biological hypotheses.

 - **DeepLIFT / DeepSHAP** via Captum for per-nucleotide attribution
 - **TF-MoDISco** motif discovery from attribution maps
 - **SEA / FIMO** (MEME Suite) for motif scanning and validation
 - **In silico mutagenesis** — motif scrambling/deletion with batch prediction on perturbed sequences
-- Embedding-space visualization with **UMAP** for exploratory analysis and interpretation
+- Embedding-space visualization with **UMAP** for exploratory analysis and representation-level interpretation

 ---

-## Code Examples
+## Code and Workflow Examples

 > *Annotated notebooks — coming soon*

 Planned examples:
 - Fine-tuning Borzoi for PSI prediction: LoRA setup, data loading, and evaluation
 - Running TF-MoDISco from Captum attribution outputs
 - CoSI calculation from kinetic RNA-seq
+- Config-driven experiment setup for reproducible model training

 ---

portfolio.md

Lines changed: 6 additions & 7 deletions
@@ -5,7 +5,7 @@ permalink: /portfolio/
 author_profile: true
 ---

-Selected case studies linking the biological question to the computational system built to answer it. Each project is driven by a real scientific problem and designed for reproducibility, interpretability, and scale.
+Selected case studies linking biological questions to the computational systems built to answer them. Each project is grounded in a concrete scientific problem and emphasizes reproducibility, interpretability, and analytical rigor.

 ---

@@ -24,13 +24,13 @@ An end-to-end Python pipeline to quantify intron excision dynamics from kinetic
 - Custom interval engineering for intron-level quantification
 - Adapted **Completed Splicing Index (CoSI)** to quantify splicing completion per intron and timepoint
 - Python-based aggregation, normalization, and visualization across replicates and conditions
-- Fine-tuning of a genomic foundation model to prioritize candidate regulatory sequence features associated with delayed splicing
+- Follow-up sequence modeling to prioritize candidate regulatory features associated with delayed splicing

 **The Engineering Challenge**
 Extracting accurate intron-level kinetics requires precise interval handling, isoform-aware quantification, and robust normalization across temporal datasets at a resolution standard RNA-seq workflows are not designed to support out of the box.

 **Outcome**
-Identified a class of "bottleneck introns" that delay inflammatory gene expression. Minigene assays experimentally validated that weak 5' splice donors drive this delay in a subset of targets, while model interpretation highlighted additional putative non-canonical regulatory motifs for future follow-up. *(First-author manuscript currently in review at eLife.)*
+Identified a class of "bottleneck introns" that delay inflammatory gene expression. Minigene assays experimentally validated that weak 5' splice donors drive this delay in a subset of targets, while downstream model interpretation highlighted additional putative non-canonical regulatory motifs for future follow-up. *(First-author manuscript currently in review at eLife.)*

 > *[Annotated notebook — coming soon]*

@@ -51,7 +51,6 @@ A unified data engineering, modeling, and interpretation framework for immune ce
 - PSI extraction from rMATS and integration of exon inclusion / intron retention labels into genome-scale training targets
 - Automated labeling of exons, introns, and intergenic regions with reproducible YAML-driven configuration
 - Fine-tuning of Borzoi for splicing prediction using both single-task and multitask training strategies
-- PyTorch Lightning–based training workflows with mixed precision, checkpointing, and HPC execution
 - Attribution-based interpretation with DeepSHAP / TF-MoDISco to identify **cis-regulatory motifs** associated with lineage-specific splicing behavior
 - In silico perturbation analyses to test the functional importance of discovered sequence elements

@@ -65,17 +64,17 @@ Produced a scalable training dataset and modeling framework for immune cell type

 ---

-## Technical Systems: Splicing Data Infrastructure & HPC Orchestration
+## Supporting Systems: Reproducible Splicing Data Infrastructure

 **What it does**
-Provides the reproducible computational backbone supporting large-scale splicing analysis and model training across projects.
+Provides the computational backbone supporting large-scale splicing analysis and model training across projects.

 **How it's built**
 - SLURM-based orchestration with multi-GPU and CPU fallback strategies
 - Conda environment isolation per project
 - YAML/JSON-driven configuration for reproducible pipeline execution
 - BigWig-based data handling and custom Python utilities for genomic interval processing
-- Checkpointed model training and restartable analysis workflows for long-running jobs
+- Checkpointed training and restartable analysis workflows for long-running jobs

 **Why it matters**
 Genomics workflows are computationally heavy, I/O-bound, and often difficult to reproduce across environments. Treating infrastructure as a first-class part of the work makes the biological conclusions more auditable, portable, and scalable.
