Description
This bug has two sides - one of which was reported as #92 - but there seems to be more to this, so I'm opening a new issue.
Bug description
1) Creation of List columns
Currently the reduce_ranges
function implicitly creates list-columns when reducing any metadata column without any other functions (reduce_ranges(col1 = col1)
).
If one tries to specifically create list columns (reduce_ranges(col1 = list(col1))
), as one would do in i.e. tidyverse functions, then each entry of the list will contain the whole metadata column. This is also the case even when using group_by
.
Based on the tutorial referred to in #92 I assume that this current behavior has not always been the case and is unintended or a bug. Given that plyranges::summarise does create list columns this way when used after group_by, but not when used alone, there is at least some inconsistency here, even if the current behavior is intended.
2) Trying to create list columns crashes when no ranges are reduced
Since reduce_ranges(col1 = col1)
can create list-columns I tried to use it that way. However, this leads to an error when nothing would be reduced but metadata columns need to be handled somehow.
Reproducible code example
suppressMessages(library(plyranges))
gr <- data.frame(seqnames = "chr1",
start = c(1, 5, 25, 45, 60, 65),
end = c(10, 20, 35, 55, 70, 75),
strand = c("*", "*", "*", "*", "*", "*"),
gene=c("geneA", "geneB", "geneC", "geneD", "geneE", "geneF"),
genetype = c('coding', 'coding', 'coding',
'coding', 'coding', 'non-coding')) %>%
as_granges()
## This creates list columns from the overlapping genes
gr %>% reduce_ranges(gene = gene)
#> GRanges object with 4 ranges and 1 metadata column:
#> seqnames ranges strand | gene
#> <Rle> <IRanges> <Rle> | <CharacterList>
#> [1] chr1 1-20 * | geneA,geneB
#> [2] chr1 25-35 * | geneC
#> [3] chr1 45-55 * | geneD
#> [4] chr1 60-75 * | geneE,geneF
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
## This puts all genes into each list entry
gr %>% reduce_ranges(gene = list(gene))
#> GRanges object with 4 ranges and 1 metadata column:
#> seqnames ranges strand | gene
#> <Rle> <IRanges> <Rle> | <List>
#> [1] chr1 1-20 * | geneA,geneB,geneC,geneD,...
#> [2] chr1 25-35 * | geneA,geneB,geneC,geneD,...
#> [3] chr1 45-55 * | geneA,geneB,geneC,geneD,...
#> [4] chr1 60-75 * | geneA,geneB,geneC,geneD,...
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
## Nothing overlaps anymore: error
gr[2:5] %>% reduce_ranges(gene = gene)
#> Error in summarize_rng(.data, dots): length(ans[[i]]) == nr is not TRUE
## without metadata columns this does not happen
gr[2:5] %>% reduce_ranges()
#> GRanges object with 4 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr1 5-20 *
#> [2] chr1 25-35 *
#> [3] chr1 45-55 *
#> [4] chr1 60-70 *
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
## group_by works
gr %>% group_by(genetype) %>% reduce_ranges(gene = gene)
#> GRanges object with 5 ranges and 2 metadata columns:
#> seqnames ranges strand | genetype gene
#> <Rle> <IRanges> <Rle> | <character> <CharacterList>
#> [1] chr1 1-20 * | coding geneA,geneB
#> [2] chr1 25-35 * | coding geneC
#> [3] chr1 45-55 * | coding geneD
#> [4] chr1 60-70 * | coding geneE
#> [5] chr1 65-75 * | non-coding geneF
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
## but has the same issue when explicitly creating lists
gr %>% group_by(genetype) %>% reduce_ranges(gene = list(gene))
#> GRanges object with 5 ranges and 2 metadata columns:
#> seqnames ranges strand | genetype gene
#> <Rle> <IRanges> <Rle> | <character> <List>
#> [1] chr1 1-20 * | coding geneA,geneB,geneC,geneD,...
#> [2] chr1 25-35 * | coding geneA,geneB,geneC,geneD,...
#> [3] chr1 45-55 * | coding geneA,geneB,geneC,geneD,...
#> [4] chr1 60-70 * | coding geneA,geneB,geneC,geneD,...
#> [5] chr1 65-75 * | non-coding geneA,geneB,geneC,geneD,...
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
## if nothing is reduced due to groups, the error is still there
gr[2:6] %>% group_by(genetype) %>% reduce_ranges(gene = gene)
#> Error in summarize_rng(.data, dots): length(ans[[i]]) == nr is not TRUE
## without metadata columns again no error
gr[2:6] %>% group_by(genetype) %>% reduce_ranges()
#> GRanges object with 5 ranges and 1 metadata column:
#> seqnames ranges strand | genetype
#> <Rle> <IRanges> <Rle> | <character>
#> [1] chr1 5-20 * | coding
#> [2] chr1 25-35 * | coding
#> [3] chr1 45-55 * | coding
#> [4] chr1 60-70 * | coding
#> [5] chr1 65-75 * | non-coding
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
Created on 2023-03-01 with reprex v2.0.2
R session information
suppressMessages(library(plyranges))
options(width = 120)
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.1.3 (2022-03-10)
#> os Ubuntu 22.04.2 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz Europe/Berlin
#> date 2023-03-01
#> pandoc 2.19.2 @ /home/vonkunic_c/miniconda3/envs/cnv-pipeline/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.3)
#> Biobase 2.54.0 2021-10-26 [1] Bioconductor
#> BiocGenerics * 0.40.0 2021-10-26 [1] Bioconductor
#> BiocIO 1.4.0 2021-10-26 [1] Bioconductor
#> BiocParallel 1.28.3 2021-12-09 [1] Bioconductor
#> Biostrings 2.62.0 2021-10-26 [1] Bioconductor
#> bitops 1.0-7 2021-04-24 [1] CRAN (R 4.1.3)
#> cli 3.6.0 2023-01-09 [1] CRAN (R 4.1.3)
#> crayon 1.5.2 2022-09-29 [1] CRAN (R 4.1.3)
#> DBI 1.1.3 2022-06-18 [1] CRAN (R 4.1.3)
#> DelayedArray 0.20.0 2021-10-26 [1] Bioconductor
#> digest 0.6.31 2022-12-11 [1] CRAN (R 4.1.3)
#> dplyr 1.0.10 2022-09-01 [1] CRAN (R 4.1.3)
#> evaluate 0.20 2023-01-17 [1] CRAN (R 4.1.3)
#> fansi 1.0.4 2023-01-22 [1] CRAN (R 4.1.3)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.3)
#> fs 1.6.0 2023-01-23 [1] CRAN (R 4.1.3)
#> generics 0.1.3 2022-07-05 [1] CRAN (R 4.1.3)
#> GenomeInfoDb * 1.30.1 2022-01-30 [1] Bioconductor
#> GenomeInfoDbData 1.2.7 2023-01-26 [1] Bioconductor
#> GenomicAlignments 1.30.0 2021-10-26 [1] Bioconductor
#> GenomicRanges * 1.46.1 2021-11-18 [1] Bioconductor
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.1.3)
#> htmltools 0.5.4 2022-12-07 [1] CRAN (R 4.1.3)
#> IRanges * 2.28.0 2021-10-26 [1] Bioconductor
#> knitr 1.42 2023-01-25 [1] CRAN (R 4.1.3)
#> lattice 0.20-45 2021-09-22 [1] CRAN (R 4.1.3)
#> lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.1.3)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.1.3)
#> Matrix 1.5-3 2022-11-11 [1] CRAN (R 4.1.3)
#> MatrixGenerics 1.6.0 2021-10-26 [1] Bioconductor
#> matrixStats 0.63.0 2022-11-18 [1] CRAN (R 4.1.3)
#> pillar 1.8.1 2022-08-19 [1] CRAN (R 4.1.3)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.3)
#> plyranges * 1.14.0 2021-10-26 [1] Bioconductor
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.3)
#> RCurl 1.98-1.9 2022-10-03 [1] CRAN (R 4.1.3)
#> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.1.3)
#> restfulr 0.0.15 2022-06-16 [1] CRAN (R 4.1.3)
#> rjson 0.2.21 2022-01-09 [1] CRAN (R 4.1.3)
#> rlang 1.0.6 2022-09-24 [1] CRAN (R 4.1.3)
#> rmarkdown 2.20 2023-01-19 [1] CRAN (R 4.1.3)
#> Rsamtools 2.10.0 2021-10-26 [1] Bioconductor
#> rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.1.3)
#> rtracklayer 1.54.0 2021-10-26 [1] Bioconductor
#> S4Vectors * 0.32.4 2022-03-24 [1] Bioconductor
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.1.3)
#> SummarizedExperiment 1.24.0 2021-10-26 [1] Bioconductor
#> tibble 3.1.8 2022-07-22 [1] CRAN (R 4.1.3)
#> tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.1.3)
#> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.3)
#> vctrs 0.5.2 2023-01-23 [1] CRAN (R 4.1.3)
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.1.3)
#> xfun 0.36 2022-12-21 [1] CRAN (R 4.1.3)
#> XML 3.99-0.13 2022-12-04 [1] CRAN (R 4.1.3)
#> XVector 0.34.0 2021-10-26 [1] Bioconductor
#> yaml 2.3.7 2023-01-23 [1] CRAN (R 4.1.3)
#> zlibbioc 1.40.0 2021-10-26 [1] Bioconductor
#>
#> [1] /home/vonkunic_c/miniconda3/envs/cnv-pipeline/lib/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Created on 2023-03-01 with reprex v2.0.2