Skip to content

[BUG] reduce_ranges silently creates list columns and fails to do so so if nothing is reduced #101

Open
@Nicolai-vKuegelgen

Description

@Nicolai-vKuegelgen

This bug has two sides - one of which was reported as #92 - but there seems to be more to this, so I'm opening a new issue.

Bug description

1) Creation of List columns

Currently the reduce_ranges function implicitly creates list-columns when reducing any metadata column without any other functions (reduce_ranges(col1 = col1)).
If one tries to specifically create list columns (reduce_ranges(col1 = list(col1))), as one would do in i.e. tidyverse functions, then each entry of the list will contain the whole metadata column. This is also the case even when using group_by.

Based on the tutorial referred to in #92 I assume that this current behavior has not always been the case and is unintended or a bug. Given that plyranges::summarise does create list columns this way when used after group_by, but not when used alone, there is at least some inconsistency here, even if the current behavior is intended.

2) Trying to create list columns crashes when no ranges are reduced

Since reduce_ranges(col1 = col1) can create list-columns I tried to use it that way. However, this leads to an error when nothing would be reduced but metadata columns need to be handled somehow.

Reproducible code example

suppressMessages(library(plyranges))

gr <- data.frame(seqnames = "chr1",
                 start = c(1, 5, 25,  45, 60, 65),
                 end = c(10, 20, 35, 55, 70, 75),
                 strand = c("*", "*", "*", "*", "*", "*"),
                 gene=c("geneA", "geneB", "geneC", "geneD", "geneE", "geneF"),
                 genetype = c('coding', 'coding', 'coding', 
                             'coding', 'coding', 'non-coding')) %>%
    as_granges()

## This creates list columns from the overlapping genes
gr %>% reduce_ranges(gene = gene)
#> GRanges object with 4 ranges and 1 metadata column:
#>       seqnames    ranges strand |            gene
#>          <Rle> <IRanges>  <Rle> | <CharacterList>
#>   [1]     chr1      1-20      * |     geneA,geneB
#>   [2]     chr1     25-35      * |           geneC
#>   [3]     chr1     45-55      * |           geneD
#>   [4]     chr1     60-75      * |     geneE,geneF
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

## This puts all genes into each list entry
gr %>% reduce_ranges(gene = list(gene))
#> GRanges object with 4 ranges and 1 metadata column:
#>       seqnames    ranges strand |                        gene
#>          <Rle> <IRanges>  <Rle> |                      <List>
#>   [1]     chr1      1-20      * | geneA,geneB,geneC,geneD,...
#>   [2]     chr1     25-35      * | geneA,geneB,geneC,geneD,...
#>   [3]     chr1     45-55      * | geneA,geneB,geneC,geneD,...
#>   [4]     chr1     60-75      * | geneA,geneB,geneC,geneD,...
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

## Nothing overlaps anymore: error
gr[2:5] %>% reduce_ranges(gene = gene)
#> Error in summarize_rng(.data, dots): length(ans[[i]]) == nr is not TRUE

## without metadata columns this does not happen
gr[2:5] %>% reduce_ranges()
#> GRanges object with 4 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1      5-20      *
#>   [2]     chr1     25-35      *
#>   [3]     chr1     45-55      *
#>   [4]     chr1     60-70      *
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

## group_by works 
gr %>% group_by(genetype) %>% reduce_ranges(gene = gene)
#> GRanges object with 5 ranges and 2 metadata columns:
#>       seqnames    ranges strand |    genetype            gene
#>          <Rle> <IRanges>  <Rle> | <character> <CharacterList>
#>   [1]     chr1      1-20      * |      coding     geneA,geneB
#>   [2]     chr1     25-35      * |      coding           geneC
#>   [3]     chr1     45-55      * |      coding           geneD
#>   [4]     chr1     60-70      * |      coding           geneE
#>   [5]     chr1     65-75      * |  non-coding           geneF
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

## but has the same issue when explicitly creating lists
gr %>% group_by(genetype) %>% reduce_ranges(gene = list(gene))
#> GRanges object with 5 ranges and 2 metadata columns:
#>       seqnames    ranges strand |    genetype                        gene
#>          <Rle> <IRanges>  <Rle> | <character>                      <List>
#>   [1]     chr1      1-20      * |      coding geneA,geneB,geneC,geneD,...
#>   [2]     chr1     25-35      * |      coding geneA,geneB,geneC,geneD,...
#>   [3]     chr1     45-55      * |      coding geneA,geneB,geneC,geneD,...
#>   [4]     chr1     60-70      * |      coding geneA,geneB,geneC,geneD,...
#>   [5]     chr1     65-75      * |  non-coding geneA,geneB,geneC,geneD,...
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

## if nothing is reduced due to groups, the error is still there
gr[2:6] %>% group_by(genetype) %>% reduce_ranges(gene = gene)
#> Error in summarize_rng(.data, dots): length(ans[[i]]) == nr is not TRUE

## without metadata columns again no error
gr[2:6] %>% group_by(genetype) %>% reduce_ranges()
#> GRanges object with 5 ranges and 1 metadata column:
#>       seqnames    ranges strand |    genetype
#>          <Rle> <IRanges>  <Rle> | <character>
#>   [1]     chr1      5-20      * |      coding
#>   [2]     chr1     25-35      * |      coding
#>   [3]     chr1     45-55      * |      coding
#>   [4]     chr1     60-70      * |      coding
#>   [5]     chr1     65-75      * |  non-coding
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Created on 2023-03-01 with reprex v2.0.2

R session information

suppressMessages(library(plyranges))
options(width = 120)
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.1.3 (2022-03-10)
#>  os       Ubuntu 22.04.2 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Berlin
#>  date     2023-03-01
#>  pandoc   2.19.2 @ /home/vonkunic_c/miniconda3/envs/cnv-pipeline/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#>  package              * version   date (UTC) lib source
#>  assertthat             0.2.1     2019-03-21 [1] CRAN (R 4.1.3)
#>  Biobase                2.54.0    2021-10-26 [1] Bioconductor
#>  BiocGenerics         * 0.40.0    2021-10-26 [1] Bioconductor
#>  BiocIO                 1.4.0     2021-10-26 [1] Bioconductor
#>  BiocParallel           1.28.3    2021-12-09 [1] Bioconductor
#>  Biostrings             2.62.0    2021-10-26 [1] Bioconductor
#>  bitops                 1.0-7     2021-04-24 [1] CRAN (R 4.1.3)
#>  cli                    3.6.0     2023-01-09 [1] CRAN (R 4.1.3)
#>  crayon                 1.5.2     2022-09-29 [1] CRAN (R 4.1.3)
#>  DBI                    1.1.3     2022-06-18 [1] CRAN (R 4.1.3)
#>  DelayedArray           0.20.0    2021-10-26 [1] Bioconductor
#>  digest                 0.6.31    2022-12-11 [1] CRAN (R 4.1.3)
#>  dplyr                  1.0.10    2022-09-01 [1] CRAN (R 4.1.3)
#>  evaluate               0.20      2023-01-17 [1] CRAN (R 4.1.3)
#>  fansi                  1.0.4     2023-01-22 [1] CRAN (R 4.1.3)
#>  fastmap                1.1.0     2021-01-25 [1] CRAN (R 4.1.3)
#>  fs                     1.6.0     2023-01-23 [1] CRAN (R 4.1.3)
#>  generics               0.1.3     2022-07-05 [1] CRAN (R 4.1.3)
#>  GenomeInfoDb         * 1.30.1    2022-01-30 [1] Bioconductor
#>  GenomeInfoDbData       1.2.7     2023-01-26 [1] Bioconductor
#>  GenomicAlignments      1.30.0    2021-10-26 [1] Bioconductor
#>  GenomicRanges        * 1.46.1    2021-11-18 [1] Bioconductor
#>  glue                   1.6.2     2022-02-24 [1] CRAN (R 4.1.3)
#>  htmltools              0.5.4     2022-12-07 [1] CRAN (R 4.1.3)
#>  IRanges              * 2.28.0    2021-10-26 [1] Bioconductor
#>  knitr                  1.42      2023-01-25 [1] CRAN (R 4.1.3)
#>  lattice                0.20-45   2021-09-22 [1] CRAN (R 4.1.3)
#>  lifecycle              1.0.3     2022-10-07 [1] CRAN (R 4.1.3)
#>  magrittr               2.0.3     2022-03-30 [1] CRAN (R 4.1.3)
#>  Matrix                 1.5-3     2022-11-11 [1] CRAN (R 4.1.3)
#>  MatrixGenerics         1.6.0     2021-10-26 [1] Bioconductor
#>  matrixStats            0.63.0    2022-11-18 [1] CRAN (R 4.1.3)
#>  pillar                 1.8.1     2022-08-19 [1] CRAN (R 4.1.3)
#>  pkgconfig              2.0.3     2019-09-22 [1] CRAN (R 4.1.3)
#>  plyranges            * 1.14.0    2021-10-26 [1] Bioconductor
#>  R6                     2.5.1     2021-08-19 [1] CRAN (R 4.1.3)
#>  RCurl                  1.98-1.9  2022-10-03 [1] CRAN (R 4.1.3)
#>  reprex                 2.0.2     2022-08-17 [1] CRAN (R 4.1.3)
#>  restfulr               0.0.15    2022-06-16 [1] CRAN (R 4.1.3)
#>  rjson                  0.2.21    2022-01-09 [1] CRAN (R 4.1.3)
#>  rlang                  1.0.6     2022-09-24 [1] CRAN (R 4.1.3)
#>  rmarkdown              2.20      2023-01-19 [1] CRAN (R 4.1.3)
#>  Rsamtools              2.10.0    2021-10-26 [1] Bioconductor
#>  rstudioapi             0.14      2022-08-22 [1] CRAN (R 4.1.3)
#>  rtracklayer            1.54.0    2021-10-26 [1] Bioconductor
#>  S4Vectors            * 0.32.4    2022-03-24 [1] Bioconductor
#>  sessioninfo            1.2.2     2021-12-06 [1] CRAN (R 4.1.3)
#>  SummarizedExperiment   1.24.0    2021-10-26 [1] Bioconductor
#>  tibble                 3.1.8     2022-07-22 [1] CRAN (R 4.1.3)
#>  tidyselect             1.2.0     2022-10-10 [1] CRAN (R 4.1.3)
#>  utf8                   1.2.2     2021-07-24 [1] CRAN (R 4.1.3)
#>  vctrs                  0.5.2     2023-01-23 [1] CRAN (R 4.1.3)
#>  withr                  2.5.0     2022-03-03 [1] CRAN (R 4.1.3)
#>  xfun                   0.36      2022-12-21 [1] CRAN (R 4.1.3)
#>  XML                    3.99-0.13 2022-12-04 [1] CRAN (R 4.1.3)
#>  XVector                0.34.0    2021-10-26 [1] Bioconductor
#>  yaml                   2.3.7     2023-01-23 [1] CRAN (R 4.1.3)
#>  zlibbioc               1.40.0    2021-10-26 [1] Bioconductor
#> 
#>  [1] /home/vonkunic_c/miniconda3/envs/cnv-pipeline/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Created on 2023-03-01 with reprex v2.0.2

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions