minor content edits

jsdearbo · jsdearbo · commit 59f59f7b2450 · 2026-03-29T21:28:57.000-04:00
diff --git a/index.md b/index.md
@@ -6,26 +6,26 @@ author_profile: true
 ---
 
 # Jake Dearborn, PhD Candidate (Defending June 2026)
-**Machine Learning · Regulatory Genomics · RNA Biology**
+**Computational Biology · Regulatory Genomics · Machine Learning**
 
-I am a PhD candidate in **Molecular Biology** at the **University of Vermont** working at the intersection of **regulatory genomics, RNA biology, and machine learning**.
+I am a PhD candidate in **Molecular Biology** at the **University of Vermont** working at the intersection of **RNA biology, regulatory genomics, and machine learning**.
 
-My work focuses on building **sequence-to-function models** and **computational pipelines for RNA-seq and single-cell genomics** to study alternative splicing, gene regulation, and regulatory sequence function, with a primary focus on **immune cell systems** including **T cells, B cells, and macrophages**.
+My work centers on building **sequence-to-function models** and **computational pipelines for RNA-seq and single-cell genomics** to study alternative splicing, regulatory sequence function, and immune cell type–specific gene regulation.
 
 I am primarily targeting roles in **computational biology, machine learning for genomics, bioinformatics, and scientific software for biology**.
 
 ---
 
-## Selected Work
+## What I Work On
 
-**Sequence-to-Function Modeling for Splicing**  
-Developing and fine-tuning deep learning models, including genomic foundation models, to predict splicing behavior and regulatory activity from sequence.
+**Sequence-to-Function Modeling**  
+Developing and fine-tuning deep learning models, including genomic foundation models, to predict splicing behavior and regulatory activity directly from sequence.
 
 **RNA-seq Splicing Data Systems**  
 Building end-to-end pipelines integrating STAR, StringTie, and rMATS to generate reproducible splicing labels, PSI estimates, and training-ready features from heterogeneous datasets.
 
 **Single-cell and Multi-omic Analysis**  
-Designing analysis workflows for Drop-seq, 10x, and CITE-seq data, including QC, normalization, clustering, and differential analysis, informed by direct hands-on experience with experimental workflows.
+Designing analysis workflows for Drop-seq, 10x, and CITE-seq data, including QC, normalization, clustering, and differential analysis, informed by hands-on familiarity with experimental workflows.
 
 **Model Interpretation and Motif Discovery**  
 Using attribution methods, TF-MoDISco, and downstream motif analysis to connect model predictions back to testable regulatory hypotheses.
@@ -36,13 +36,13 @@ Using attribution methods, TF-MoDISco, and downstream motif analysis to connect
 
 My dissertation, **“Programmed Delayed Splicing and Immune Cell Type–Specific Alternative Splicing,”** integrates experimental RNA biology with deep learning to identify regulatory elements controlling splicing kinetics and cell-type specificity.
 
-My first-author manuscript on **programmed splice delays in inflammatory gene regulation** is currently **in review at eLife**.
+My first-author manuscript on **programmed delayed splicing in inflammatory gene regulation** is currently **in review at eLife**.
 
 Recent published collaborative work includes contributions to *Cell Reports* (2025) and *ACS Nano* (2024).
 
 ---
 
-## Technical Focus
+## Technical Areas
 
 **Machine Learning**
 - PyTorch, PyTorch Lightning
diff --git a/ml.md b/ml.md
@@ -5,20 +5,21 @@ permalink: /engineering/
 author_profile: true
 ---
 
-Engineering systems and training workflows behind my ML work, for readers interested in how models are built, trained, and interpreted—not just what they predict.
+This page focuses on the technical side of my work: model design, training workflows, interpretability tooling, and the engineering systems that support large-scale genomics analysis.
 
 Tooling for sequence-to-function modeling is available in [**`sequence_to_function_model_tools`**](https://github.com/jsdearbo/sequence_to_function_model_tools).
 
 ---
 
-## Model Architecture
+## Modeling Focus
 
-I build sequence-to-function models that map raw genomic sequence to splicing outcomes such as PSI and intron retention, including cell-type–aware prediction heads.
+I build sequence-to-function models that map raw genomic sequence to splicing-related outcomes such as PSI and intron retention.
 
-Current architecture work:
+Current areas of model development:
 - Transformer and CNN-hybrid architectures for long-range sequence context
 - Task-specific fine-tuning of foundation models such as **Borzoi** via **LoRA/PEFT**
 - Single-task and multitask prediction across B cell, T cell, and macrophage contexts
+- Evaluation of model behavior through attribution, perturbation, and embedding-space analyses
 
 ---
 
@@ -35,26 +36,27 @@ Training pipelines designed for HPC use and iterative model development:
 
 ---
 
-## Interpretability Pipeline
+## Interpretability and Analysis
 
 Model predictions are most useful when they support testable biological hypotheses.
 
 - **DeepLIFT / DeepSHAP** via Captum for per-nucleotide attribution
 - **TF-MoDISco** motif discovery from attribution maps
 - **SEA / FIMO** (MEME Suite) for motif scanning and validation
 - **In silico mutagenesis** — motif scrambling/deletion with batch prediction on perturbed sequences
-- Embedding-space visualization with **UMAP** for exploratory analysis and interpretation
+- Embedding-space visualization with **UMAP** for exploratory analysis and representation-level interpretation
 
 ---
 
-## Code Examples
+## Code and Workflow Examples
 
 > *Annotated notebooks — coming soon*
 
 Planned examples:
 - Fine-tuning Borzoi for PSI prediction: LoRA setup, data loading, and evaluation
 - Running TF-MoDISco from Captum attribution outputs
 - CoSI calculation from kinetic RNA-seq
+- Config-driven experiment setup for reproducible model training
 
 ---
 
diff --git a/portfolio.md b/portfolio.md
@@ -5,7 +5,7 @@ permalink: /portfolio/
 author_profile: true
 ---
 
-Selected case studies linking the biological question to the computational system built to answer it. Each project is driven by a real scientific problem and designed for reproducibility, interpretability, and scale.
+Selected case studies linking biological questions to the computational systems built to answer them. Each project is grounded in a concrete scientific problem and emphasizes reproducibility, interpretability, and analytical rigor.
 
 ---
 
@@ -24,13 +24,13 @@ An end-to-end Python pipeline to quantify intron excision dynamics from kinetic
 - Custom interval engineering for intron-level quantification
 - Adapted **Completed Splicing Index (CoSI)** to quantify splicing completion per intron and timepoint
 - Python-based aggregation, normalization, and visualization across replicates and conditions
-- Fine-tuning of a genomic foundation model to prioritize candidate regulatory sequence features associated with delayed splicing
+- Follow-up sequence modeling to prioritize candidate regulatory features associated with delayed splicing
 
 **The Engineering Challenge**  
 Extracting accurate intron-level kinetics requires precise interval handling, isoform-aware quantification, and robust normalization across temporal datasets at a resolution standard RNA-seq workflows are not designed to support out of the box.
 
 **Outcome**  
-Identified a class of "bottleneck introns" that delay inflammatory gene expression. Minigene assays experimentally validated that weak 5' splice donors drive this delay in a subset of targets, while model interpretation highlighted additional putative non-canonical regulatory motifs for future follow-up. *(First-author manuscript currently in review at eLife.)*
+Identified a class of "bottleneck introns" that delay inflammatory gene expression. Minigene assays experimentally validated that weak 5' splice donors drive this delay in a subset of targets, while downstream model interpretation highlighted additional putative non-canonical regulatory motifs for future follow-up. *(First-author manuscript currently in review at eLife.)*
 
 > *[Annotated notebook — coming soon]*
 
@@ -51,7 +51,6 @@ A unified data engineering, modeling, and interpretation framework for immune ce
 - PSI extraction from rMATS and integration of exon inclusion / intron retention labels into genome-scale training targets
 - Automated labeling of exons, introns, and intergenic regions with reproducible YAML-driven configuration
 - Fine-tuning of Borzoi for splicing prediction using both single-task and multitask training strategies
-- PyTorch Lightning–based training workflows with mixed precision, checkpointing, and HPC execution
 - Attribution-based interpretation with DeepSHAP / TF-MoDISco to identify **cis-regulatory motifs** associated with lineage-specific splicing behavior
 - In silico perturbation analyses to test the functional importance of discovered sequence elements
 
@@ -65,17 +64,17 @@ Produced a scalable training dataset and modeling framework for immune cell type
 
 ---
 
-## Technical Systems: Splicing Data Infrastructure & HPC Orchestration
+## Supporting Systems: Reproducible Splicing Data Infrastructure
 
 **What it does**  
-Provides the reproducible computational backbone supporting large-scale splicing analysis and model training across projects.
+Provides the computational backbone supporting large-scale splicing analysis and model training across projects.
 
 **How it's built**
 - SLURM-based orchestration with multi-GPU and CPU fallback strategies
 - Conda environment isolation per project
 - YAML/JSON-driven configuration for reproducible pipeline execution
 - BigWig-based data handling and custom Python utilities for genomic interval processing
-- Checkpointed model training and restartable analysis workflows for long-running jobs
+- Checkpointed training and restartable analysis workflows for long-running jobs
 
 **Why it matters**  
 Genomics workflows are computationally heavy, I/O-bound, and often difficult to reproduce across environments. Treating infrastructure as a first-class part of the work makes the biological conclusions more auditable, portable, and scalable.