c1au6i0/align-condathis-targets

align-condathis-targets

This repo showcases the use of targets and condathis to build a pure R pipeline that stays reproducible while also using non-R command-line interface (CLI) tools. The targets package is “a Make-like pipeline tool for statistics and data science in R.” The condathis package, available on CRAN, lets you run CLI tools from R in isolated environments, which keeps results reproducible across systems. With condathis, users do not need to install the CLI tools on their systems manually.

The pipeline in this example processes RNA-seq FASTQ files by running quality control (QC), trimming adapters, aligning the sequences, and ultimately producing sorted BAM files.

Motivation

As bioinformaticians working with omics data, we benefit from R's rich package ecosystem. However, most tools for the initial phases of omics analysis are not R-based. The targets package is well suited to building R pipelines. By combining targets with condathis, we can create pipelines that integrate both R packages and other CLI tools within an R environment while still maintaining reproducibility.

How condathis is used

The package condathis is used to:

  1. Create environments with specific versions of the CLI tools needed. See example below.
## Create an environment with gsutil v5.35
condathis::create_env("gsutil==5.35", env_name = "gsutil-env")
  2. Interact with those environments and launch the CLI commands. See example below.
url_cloud_storage <- "gs://gcp-public-data--broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta"

# Same as processx::run, but the environment name must also be given
condathis::run(
  "gsutil", "-m", "cp", url_cloud_storage, here::here("data", "outputs"), # where to save the data
  env_name = "gsutil-env"
)

R wrapper functions built around specific condathis CLI commands return file paths that can be used by tar_files.
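
To illustrate that pattern, here is a minimal sketch of one such wrapper. It is not the repository's actual code: the function names sorted_bam_path() and sort_bam() are hypothetical. The samtools sort flags and the sorted_ file prefix follow the @PG lines of the BAM header shown under Check output, and the environment name samtools-env comes from the same section.

```r
# Hypothetical helper: derive the output path for a sorted BAM file.
sorted_bam_path <- function(bam, out_dir) {
  file.path(out_dir, paste0("sorted_", basename(bam)))
}

# Hypothetical wrapper: sort one BAM file inside the samtools environment
# and return the path of the file it wrote, so that targets can track it.
sort_bam <- function(bam, out_dir, tmp_dir = tempdir(), threads = 2) {
  out <- sorted_bam_path(bam, out_dir)
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  condathis::run(
    "samtools", "sort",
    "-T", tmp_dir,               # location for temporary files
    "-@", as.character(threads), # additional threads
    "-o", out,
    bam,
    env_name = "samtools-env"    # assumed to have been created beforehand
  )
  out
}

# The path helper can be checked without running samtools:
sorted_bam_path("data/outputs/bams/subj11.bam", "data/outputs/sorted_bams")
#> [1] "data/outputs/sorted_bams/sorted_subj11.bam"
```

Returning the output path, rather than the result object of condathis::run, is what lets a file-tracking target notice when the file changes.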

Folders and files

  • All the wrapper functions are stored in code/targets_functions.R.
  • config/: Contains the mapping.csv that maps subjects with corresponding files.
  • data/raw/: Contains some example FASTQ files (note that the files have been trimmed to contain only a few reads for easy processing).
  • data/outputs/: This is where the outputs of the pipeline are created.

Pipeline Overview

  • fastp: Quality of FASTQ files is checked and adapters are trimmed.
  • gsutil: Reference genome is downloaded.
  • minimap2: Aligns the files to the reference.
  • samtools: Converts SAM files to BAM files and sorts them.

To restore all the project dependencies, run: renv::restore().

All outputs will be generated in data/outputs.

Run Pipeline with targets

library(targets)
# tar_dir()
targets::tar_make()
#> Loading required package: parallelly
#> here() starts at /Users/luciorq/workspaces/temp/align-condathis-targets
#> + samtools_env dispatched
#> ✔ samtools_env completed [7s, 65 B]
#> + minimap_env dispatched
#> ✔ minimap_env completed [3s, 64 B]
#> + fastp_env dispatched
#> ✔ fastp_env completed [2.9s, 62 B]
#> + gsutil_env dispatched
#> ✔ gsutil_env completed [10.5s, 63 B]
#> + wget_env dispatched
#> ✔ wget_env completed [2.9s, 61 B]
#> + trimmed_fastq_qc declared [2 branches]
#> ✔ trimmed_fastq_qc completed [441ms, 936.66 kB]
#> + mapped_sams declared [2 branches]
#> ✔ mapped_sams completed [11s, 5.33 kB]
#> + bam_files declared [2 branches]
#> + sorted_bams declared [2 branches]
#> ✔ ended pipeline [42.6s, 9 completed, 10 skipped]
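
The target names in this log (samtools_env, bam_files, sorted_bams) and the "2 branches" messages suggest environment targets followed by dynamically branched file targets. A heavily simplified _targets.R in that spirit might look like the sketch below. It is an assumption about the structure, not the repository's actual file. The input paths are placeholders. sort_bam() is a hypothetical wrapper, assumed to be defined in code/targets_functions.R, to take a BAM path and an output directory, and to return the path of the file it writes.

```r
# _targets.R (illustrative sketch only)
library(targets)

# Load the wrapper functions; sort_bam() is assumed to live in this file.
tar_source("code/targets_functions.R")

list(
  # Create the conda environment once as its own target.
  tar_target(
    samtools_env,
    condathis::create_env("samtools", env_name = "samtools-env")
  ),
  # Placeholder input paths, tracked as files.
  tar_target(
    bam_files,
    c("data/outputs/bams/subj11.bam", "data/outputs/bams/another_sample.bam"),
    format = "file"
  ),
  # One branch per BAM file; each branch returns the path it wrote.
  tar_target(
    sorted_bams,
    {
      samtools_env # referenced so that the environment is built first
      sort_bam(bam_files, "data/outputs/sorted_bams")
    },
    pattern = map(bam_files),
    format = "file"
  )
)
```

Mentioning samtools_env inside the command is one way to make the dependency on the environment explicit, since targets infers the order of execution from the symbols each command uses.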

Check output

sorted_bams_path <- targets::tar_read(sorted_bams)

bam_header <- condathis::run(
  "samtools", "view", "-H", sorted_bams_path[1],
  env_name = "samtools-env",
  verbose = "silent"
)

cat(bam_header$stdout)
#> @HD  VN:1.6  SO:coordinate
#> @SQ  SN:1    LN:249250621
#> @SQ  SN:2    LN:243199373
#> @SQ  SN:3    LN:198022430
#> @SQ  SN:4    LN:191154276
#> @SQ  SN:5    LN:180915260
#> @SQ  SN:6    LN:171115067
#> @SQ  SN:7    LN:159138663
#> @SQ  SN:8    LN:146364022
#> @SQ  SN:9    LN:141213431
#> @SQ  SN:10   LN:135534747
#> @SQ  SN:11   LN:135006516
#> @SQ  SN:12   LN:133851895
#> @SQ  SN:13   LN:115169878
#> @SQ  SN:14   LN:107349540
#> @SQ  SN:15   LN:102531392
#> @SQ  SN:16   LN:90354753
#> @SQ  SN:17   LN:81195210
#> @SQ  SN:18   LN:78077248
#> @SQ  SN:19   LN:59128983
#> @SQ  SN:20   LN:63025520
#> @SQ  SN:21   LN:48129895
#> @SQ  SN:22   LN:51304566
#> @SQ  SN:X    LN:155270560
#> @SQ  SN:Y    LN:59373566
#> @SQ  SN:MT   LN:16569
#> @SQ  SN:GL000207.1   LN:4262
#> @SQ  SN:GL000226.1   LN:15008
#> @SQ  SN:GL000229.1   LN:19913
#> @SQ  SN:GL000231.1   LN:27386
#> @SQ  SN:GL000210.1   LN:27682
#> @SQ  SN:GL000239.1   LN:33824
#> @SQ  SN:GL000235.1   LN:34474
#> @SQ  SN:GL000201.1   LN:36148
#> @SQ  SN:GL000247.1   LN:36422
#> @SQ  SN:GL000245.1   LN:36651
#> @SQ  SN:GL000197.1   LN:37175
#> @SQ  SN:GL000203.1   LN:37498
#> @SQ  SN:GL000246.1   LN:38154
#> @SQ  SN:GL000249.1   LN:38502
#> @SQ  SN:GL000196.1   LN:38914
#> @SQ  SN:GL000248.1   LN:39786
#> @SQ  SN:GL000244.1   LN:39929
#> @SQ  SN:GL000238.1   LN:39939
#> @SQ  SN:GL000202.1   LN:40103
#> @SQ  SN:GL000234.1   LN:40531
#> @SQ  SN:GL000232.1   LN:40652
#> @SQ  SN:GL000206.1   LN:41001
#> @SQ  SN:GL000240.1   LN:41933
#> @SQ  SN:GL000236.1   LN:41934
#> @SQ  SN:GL000241.1   LN:42152
#> @SQ  SN:GL000243.1   LN:43341
#> @SQ  SN:GL000242.1   LN:43523
#> @SQ  SN:GL000230.1   LN:43691
#> @SQ  SN:GL000237.1   LN:45867
#> @SQ  SN:GL000233.1   LN:45941
#> @SQ  SN:GL000204.1   LN:81310
#> @SQ  SN:GL000198.1   LN:90085
#> @SQ  SN:GL000208.1   LN:92689
#> @SQ  SN:GL000191.1   LN:106433
#> @SQ  SN:GL000227.1   LN:128374
#> @SQ  SN:GL000228.1   LN:129120
#> @SQ  SN:GL000214.1   LN:137718
#> @SQ  SN:GL000221.1   LN:155397
#> @SQ  SN:GL000209.1   LN:159169
#> @SQ  SN:GL000218.1   LN:161147
#> @SQ  SN:GL000220.1   LN:161802
#> @SQ  SN:GL000213.1   LN:164239
#> @SQ  SN:GL000211.1   LN:166566
#> @SQ  SN:GL000199.1   LN:169874
#> @SQ  SN:GL000217.1   LN:172149
#> @SQ  SN:GL000216.1   LN:172294
#> @SQ  SN:GL000215.1   LN:172545
#> @SQ  SN:GL000205.1   LN:174588
#> @SQ  SN:GL000219.1   LN:179198
#> @SQ  SN:GL000224.1   LN:179693
#> @SQ  SN:GL000223.1   LN:180455
#> @SQ  SN:GL000195.1   LN:182896
#> @SQ  SN:GL000212.1   LN:186858
#> @SQ  SN:GL000222.1   LN:186861
#> @SQ  SN:GL000200.1   LN:187035
#> @SQ  SN:GL000193.1   LN:189789
#> @SQ  SN:GL000194.1   LN:191469
#> @SQ  SN:GL000225.1   LN:211173
#> @SQ  SN:GL000192.1   LN:547496
#> @SQ  SN:NC_007605    LN:171823
#> @SQ  SN:hs37d5   LN:35477943
#> @PG  ID:minimap2 PN:minimap2 VN:2.30-r1287   CL:minimap2 -ax sr -t 2 /Users/luciorq/workspaces/temp/align-condathis-targets/data/outputs/reference/Homo_sapiens_assembly19_1000genomes_decoy.mmi /Users/luciorq/workspaces/temp/align-condathis-targets/data/outputs/trimmed/subj1_L001_R1_001_trimmed.fastq.gz /Users/luciorq/workspaces/temp/align-condathis-targets/data/outputs/trimmed/subj1_L001_R2_001_trimmed.fastq.gz
#> @PG  ID:samtools PN:samtools PP:minimap2 VN:1.22.1   CL:samtools view -hb -@ 2 /Users/luciorq/workspaces/temp/align-condathis-targets/data/outputs/aligned_sam/subj11.sam
#> @PG  ID:samtools.1   PN:samtools PP:samtools VN:1.22.1   CL:samtools sort -T /Users/luciorq/workspaces/temp/align-condathis-targets/data/tmp -@ 2 -o /Users/luciorq/workspaces/temp/align-condathis-targets/data/outputs/sorted_bams/sorted_subj11.bam /Users/luciorq/workspaces/temp/align-condathis-targets/data/outputs/bams/subj11.bam
#> @PG  ID:samtools.2   PN:samtools PP:samtools.1   VN:1.22.1   CL:samtools view -H /Users/luciorq/workspaces/temp/align-condathis-targets/data/outputs/sorted_bams/sorted_subj11.bam
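
A header of this length is easier to check programmatically than by eye. The sketch below is a hypothetical base R helper, summarise_header(), which is not part of condathis or of this repository. It assumes the header text is tab-separated, as the SAM format specifies, and extracts the sort order together with the ID and version of each @PG record.

```r
# Parse the text of a SAM/BAM header (fields are tab-separated) and return
# the sort order plus one row per @PG record.
summarise_header <- function(header_text) {
  lines <- strsplit(header_text, "\n", fixed = TRUE)[[1]]
  lines <- lines[nzchar(lines)]

  # Extract the value of a TAG:value field from one header line (NA if absent).
  get_tag <- function(line, tag) {
    fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
    hit <- fields[startsWith(fields, paste0(tag, ":"))]
    if (length(hit) == 0) NA_character_ else substring(hit[1], nchar(tag) + 2)
  }

  hd <- lines[startsWith(lines, "@HD")]
  pg <- lines[startsWith(lines, "@PG")]

  list(
    sort_order = if (length(hd) == 0) NA_character_ else get_tag(hd[1], "SO"),
    programs = data.frame(
      id = vapply(pg, get_tag, character(1), tag = "ID", USE.NAMES = FALSE),
      version = vapply(pg, get_tag, character(1), tag = "VN", USE.NAMES = FALSE),
      stringsAsFactors = FALSE
    )
  )
}

# Small example built from lines of the header shown above (CL fields omitted).
example_header <- paste(
  "@HD\tVN:1.6\tSO:coordinate",
  "@SQ\tSN:1\tLN:249250621",
  "@PG\tID:minimap2\tPN:minimap2\tVN:2.30-r1287",
  "@PG\tID:samtools\tPN:samtools\tPP:minimap2\tVN:1.22.1",
  sep = "\n"
)

info <- summarise_header(example_header)
info$sort_order
#> [1] "coordinate"
info$programs
```

With the real output, the call would be summarise_header(bam_header$stdout).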
