separate_wider_delim renames original column when cols_remove=FALSE and names= not specified #1499

Open
@tszberkowitz

When splitting a delimited character variable using the newer separate_wider_delim() function from the tidyr package (v 1.3.0), if you:

  • specify the names_sep= argument,
  • do NOT specify the names= argument, and
  • specify cols_remove=FALSE,

then the original variable is retained in the output data set (as expected) but:

  1. the original variable is renamed by prefixing it with its own name and the value of the names_sep= argument, so that, e.g., names_sep='_' with cols=varname produces a variable named varname_varname in the output data, and
  2. the original variable is located after the new separated columns, which is different from how the older separate() function behaves (placing the original column before the new columns).

Note that the first point above (variable renaming) is the major issue. The second point is just something that I was not expecting.
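To check quickly whether an installed tidyr version shows the renaming from point 1, something like the sketch below may help. The object names `out` and `renamed` are illustrative only, and the check relies solely on the column names reported in the reprex in this issue.

```r
library(tidyr)

# Same test data as in the reprex
test <- tibble::tibble(v = c('a;b', 'c', NA, 'd;e;f', 'g;h'))

out <- separate_wider_delim(
  data = test,
  cols = v,
  delim = ';',
  names_sep = '_',
  too_few = 'align_start',
  cols_remove = FALSE
)

# TRUE if the original column came back as `v_v` instead of `v`
renamed <- 'v_v' %in% names(out) && !('v' %in% names(out))
renamed
```

If `renamed` is `TRUE`, the installed version shows the behavior described here.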

library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(reprex)

# Create test data set
## 1 character variable (`v`):
##  * semicolon-delimited values,
##  * includes NA,
##  * inconsistent/unpredictable number of delimiters per value
test <- tibble(
  v = c('a;b', 'c', NA, 'd;e;f', 'g;h')
)

# specifying `names` (not `names_sep`)
# `cols_remove` is FALSE => behaves as expected (original column name unchanged)
separate_wider_delim(
  data = test,
  cols = v,
  delim = ';',
  names = c('v_1', 'v_2', 'v_3'),
  too_few = 'align_start',
  cols_remove = FALSE
)
#> # A tibble: 5 × 4
#>   v_1   v_2   v_3   v    
#>   <chr> <chr> <chr> <chr>
#> 1 a     b     <NA>  a;b  
#> 2 c     <NA>  <NA>  c    
#> 3 <NA>  <NA>  <NA>  <NA> 
#> 4 d     e     f     d;e;f
#> 5 g     h     <NA>  g;h

# specifying `names_sep` only
# `cols_remove` is TRUE (default) => behaves as expected
separate_wider_delim(
  data = test,
  cols = v,
  delim = ';',
  names_sep = '_',
  too_few = 'align_start',
  cols_remove = TRUE
)
#> # A tibble: 5 × 3
#>   v_1   v_2   v_3  
#>   <chr> <chr> <chr>
#> 1 a     b     <NA> 
#> 2 c     <NA>  <NA> 
#> 3 <NA>  <NA>  <NA> 
#> 4 d     e     f    
#> 5 g     h     <NA>

# specifying `names_sep` only
# `cols_remove` is FALSE => **unexpected renaming of original variable**
separate_wider_delim(
  data = test,
  cols = v,
  delim = ';',
  names_sep = '_',
  too_few = 'align_start',
  cols_remove = FALSE
)
#> # A tibble: 5 × 4
#>   v_1   v_2   v_3   v_v  
#>   <chr> <chr> <chr> <chr>
#> 1 a     b     <NA>  a;b  
#> 2 c     <NA>  <NA>  c    
#> 3 <NA>  <NA>  <NA>  <NA> 
#> 4 d     e     f     d;e;f
#> 5 g     h     <NA>  g;h

## Expected output from previous code chunk:
##  * note original column name unchanged
# # A tibble: 5 × 4
#   v_1   v_2   v_3   v    
#   <chr> <chr> <chr> <chr>
# 1 a     b     <NA>  a;b  
# 2 c     <NA>  <NA>  c    
# 3 <NA>  <NA>  <NA>  <NA> 
# 4 d     e     f     d;e;f 
# 5 g     h     <NA>  g;h   


# old behavior (with `separate()`)
# * original variable located before new `separate()`d columns
separate(
  data = test,
  col = v,
  into = c('v_1', 'v_2', 'v_3'),
  sep = ';',
  remove = FALSE,
  fill = 'right'
)
#> # A tibble: 5 × 4
#>   v     v_1   v_2   v_3  
#>   <chr> <chr> <chr> <chr>
#> 1 a;b   a     b     <NA> 
#> 2 c     c     <NA>  <NA> 
#> 3 <NA>  <NA>  <NA>  <NA> 
#> 4 d;e;f d     e     f    
#> 5 g;h   g     h     <NA>

Created on 2023-05-14 with reprex v2.0.2
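A possible workaround until this is resolved, offered as a sketch rather than a fix: restore the original name with `dplyr::rename()` and move the column with `dplyr::relocate()`. It assumes the renamed column is called `v_v`, as in the output above. Using `any_of()` makes the rename a no-op when no such column exists, so the same code should also run on versions without the renaming. The object name `fixed` is illustrative.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

# Same test data as in the reprex
test <- tibble(v = c('a;b', 'c', NA, 'd;e;f', 'g;h'))

fixed <- test |>
  separate_wider_delim(
    cols = v,
    delim = ';',
    names_sep = '_',
    too_few = 'align_start',
    cols_remove = FALSE
  ) |>
  # Rename `v_v` back to `v`, but only if `v_v` is present
  rename(any_of(c(v = 'v_v'))) |>
  # Place the original column before the new ones, as `separate()` does
  relocate(v, .before = v_1)

names(fixed)
```

The result should have the original column `v` first, followed by `v_1`, `v_2`, and `v_3`, matching the layout that `separate()` produces with `remove = FALSE`.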

Session info
sessionInfo()
#> R version 4.3.0 (2023-04-21 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 11 x64 (build 22621)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.utf8 
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> time zone: America/New_York
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] reprex_2.0.2 tidyr_1.3.0  dplyr_1.1.2 
#> 
#> loaded via a namespace (and not attached):
#>  [1] vctrs_0.6.2       cli_3.6.1         knitr_1.42        rlang_1.1.1      
#>  [5] xfun_0.39         stringi_1.7.12    purrr_1.0.1       styler_1.9.1     
#>  [9] generics_0.1.3    glue_1.6.2        htmltools_0.5.5   fansi_1.0.4      
#> [13] rmarkdown_2.21    R.cache_0.16.0    tibble_3.2.1      evaluate_0.21    
#> [17] fastmap_1.1.1     yaml_2.3.7        lifecycle_1.0.3   stringr_1.5.0    
#> [21] compiler_4.3.0    fs_1.6.2          pkgconfig_2.0.3   rstudioapi_0.14  
#> [25] R.oo_1.25.0       R.utils_2.12.2    digest_0.6.31     R6_2.5.1         
#> [29] tidyselect_1.2.0  utf8_1.2.3        pillar_1.9.0      magrittr_2.0.3   
#> [33] R.methodsS3_1.8.2 tools_4.3.0       withr_2.5.0

Labels: bug (an unexpected problem or unintended behavior), strings 🎻