Before multi-omics integration, data must be harmonized to ensure compatibility across data types. This includes normalization, batch correction, feature alignment, and handling missing values.
R: `BiocManager::install(c('MultiAssayExperiment', 'sva', 'limma'))`

Python: `pip install combat numpy pandas`

Tell your AI agent what you want to do:
- "Prepare my RNA-seq and proteomics data for MOFA2 integration"
- "Batch correct and normalize my multi-omics experiment"
- "Normalize my RNA-seq counts with VST and proteomics intensities with log2 median centering"
- "Apply appropriate normalization to each of my omics layers before integration"
- "Remove batch effects from my expression data using ComBat before integrating with proteomics"
- "Apply limma removeBatchEffect to correct for sequencing batch across samples"
- "Match samples across my RNA, protein, and methylation datasets by sample ID"
- "Identify which samples have complete data across all my omics assays"
- "Map protein IDs to gene symbols to align with my RNA-seq data"
- "Aggregate protein-level measurements to gene-level for integration"
- "Impute missing values in my proteomics data using MinProb before integration"
- "Filter features with more than 30% missing values across samples"
- Assess data quality per omics layer
- Apply assay-specific normalization
- Detect and correct batch effects
- Align feature identifiers across omics
- Handle missing values (filter or impute)
- Scale features for integration
- Create a MultiAssayExperiment object if using R
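
The sample-matching and feature-alignment steps in the checklist above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the nested-dict layout (`sample_id -> {feature_id -> value}`), the helper names `shared_samples` and `aggregate_to_genes`, the toy values, and the choice of the median as the aggregation function are all assumptions made for the example. In practice a curated ID mapping and a pandas or MultiAssayExperiment workflow would take their place.

```python
from collections import defaultdict
from statistics import median


def shared_samples(*assays):
    """Return the sample IDs present in every assay, sorted for reproducibility."""
    common = set(assays[0])
    for assay in assays[1:]:
        common &= set(assay)
    return sorted(common)


def aggregate_to_genes(protein_values, protein_to_gene):
    """Collapse protein-level values to gene level using the median.

    Proteins without an entry in the mapping are dropped.
    """
    grouped = defaultdict(list)
    for protein_id, value in protein_values.items():
        gene = protein_to_gene.get(protein_id)
        if gene is not None:
            grouped[gene].append(value)
    return {gene: median(values) for gene, values in grouped.items()}


# Hypothetical toy data: sample_id -> {feature_id -> value}
rna = {"S1": {"TP53": 8.1}, "S2": {"TP53": 7.4}, "S3": {"TP53": 9.0}}
protein = {
    "S1": {"P04637": 20.0, "P04637-2": 22.0, "UNMAPPED_1": 15.0},
    "S3": {"P04637": 19.0, "P04637-2": 21.0, "UNMAPPED_1": 14.0},
}
# Illustrative mapping only; a real one would come from an annotation resource.
protein_to_gene = {"P04637": "TP53", "P04637-2": "TP53"}

# Samples measured in both assays (S2 has no proteomics data).
complete = shared_samples(rna, protein)

# Gene-level proteomics restricted to the shared samples.
protein_gene_level = {
    sample: aggregate_to_genes(protein[sample], protein_to_gene)
    for sample in complete
}
```

Restricting to the intersection suits methods that need complete overlap. For methods that tolerate missing views, the full set of samples can be kept and the intersection used only for reporting.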
| Data Type | Recommended Method |
|---|---|
| RNA-seq counts | VST, rlog, or TMM |
| Proteomics intensity | Log2 + median centering |
| Methylation beta | M-value transform |
| Metabolomics | Log transform + Pareto scaling |
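
Three of the transforms in the table are simple enough to write out directly. The sketch below uses only the Python standard library. The function names, the `eps` clamp for beta values, and the use of the sample standard deviation are assumptions for illustration. The RNA-seq methods (VST, rlog, TMM) are left to the dedicated packages that implement them.

```python
import math
from statistics import mean, median, stdev


def log2_median_center(intensities):
    """Log2-transform one sample's intensities, then subtract that sample's median."""
    logged = [math.log2(x) for x in intensities]
    center = median(logged)
    return [value - center for value in logged]


def beta_to_m(beta, eps=1e-6):
    """Convert a methylation beta value to an M-value: log2(beta / (1 - beta)).

    Beta is clamped to [eps, 1 - eps] so that 0 and 1 do not produce infinities.
    """
    clamped = min(max(beta, eps), 1 - eps)
    return math.log2(clamped / (1 - clamped))


def pareto_scale(values):
    """Mean-center one feature and divide by the square root of its standard deviation."""
    center = mean(values)
    spread = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(value - center) / math.sqrt(spread) for value in values]


def log_pareto(values):
    """Natural-log transform a metabolite's abundances, then Pareto-scale them."""
    return pareto_scale([math.log(value) for value in values])


# Hypothetical toy inputs
centered = log2_median_center([2.0, 4.0, 8.0])  # log2 values 1, 2, 3 centered on 2
m_value = beta_to_m(0.5)                        # beta of 0.5 maps to an M-value of 0
scaled = pareto_scale([1.0, 2.0, 3.0])
```

Pareto scaling divides by the square root of the standard deviation rather than the standard deviation itself, so large-variance features are shrunk less aggressively than under z-scoring.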
- Always normalize within assay before scaling across assays
- Use ComBat for known batches; SVA for unknown confounders
- Map features to common namespace (gene symbols) when possible
- Filter out features with more than 30-50% missing values before imputation
- Methods like MOFA2 tolerate missing views; SNF requires complete sample overlap across data types
- Z-score scaling (mean=0, sd=1) is standard before most integration methods
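
The missing-value filter and z-score scaling from the tips above might look like the following in plain Python. This is a sketch under stated assumptions: missing values are represented as `None`, each feature is a list of per-sample values, the helper names and the 0.3 default threshold are illustrative, and the sample standard deviation (n - 1 denominator) is used. Imputation itself is not shown.

```python
from statistics import mean, stdev


def filter_by_missingness(features, max_missing=0.3):
    """Keep features whose fraction of missing (None) values is at most max_missing."""
    kept = {}
    for name, values in features.items():
        fraction_missing = sum(value is None for value in values) / len(values)
        if fraction_missing <= max_missing:
            kept[name] = values
    return kept


def zscore(values):
    """Scale observed values to mean 0 and sd 1; missing entries stay None.

    Needs at least two observed values. Constant features map to 0.0.
    """
    observed = [value for value in values if value is not None]
    center, spread = mean(observed), stdev(observed)
    if spread == 0:
        return [None if value is None else 0.0 for value in values]
    return [None if value is None else (value - center) / spread for value in values]


# Hypothetical toy matrix: feature -> values across four samples
features = {
    "geneA": [1.0, 2.0, 3.0, None],   # 25% missing, kept
    "geneB": [None, None, 5.0, 6.0],  # 50% missing, dropped at the 0.3 threshold
    "geneC": [4.0, 4.0, 6.0, 6.0],    # complete
}

kept = filter_by_missingness(features, max_missing=0.3)
scaled = {name: zscore(values) for name, values in kept.items()}
```

Filtering before scaling keeps sparsely observed features from contributing unstable means and standard deviations to the scaled matrix.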
- MultiAssayExperiment: doi:10.1158/0008-5472.CAN-17-0344
- ComBat: doi:10.1093/biostatistics/kxj037