@DavisVaughan DavisVaughan commented Dec 18, 2025

Part of #2129

Foundational method that takes a vector meeting vctrs's newly written-up native storage requirements and strips away all extraneous attributes not natively handled by vctrs methods.

Not being used in place of vec_data() quite yet, but that is the goal. We will then soft-deprecate vec_data() and start to move away from it in favor of this here in vctrs and in dplyr/tidyr.

It will also be used in vec_proxy() on the output of a user's proxy method. This ensures that:

  • The proxy is stripped of any extraneous attributes (in particular, classes that could interfere with S3 dispatch)
  • The proxy is of the correct native storage type. @lionel- this is a side benefit I had not thought of until now, but this would help ensure that a developer's proxy method returns objects of the right storage type.

It also seems likely that there is room for both vctrs::vec_unstructure() and rlang::unstructure().

vctrs::vec_unstructure() rules:

  • Atomic vectors retain names
  • Lists retain names
  • Arrays retain dim and dimnames[[1]] (note, only row names)
  • Data frames are "native" types, and retain names, row.names, and a class of "data.frame"
  • All other types result in an error
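The rules above can be approximated in base R. This is only a rough sketch for illustration: `sketch_vec_unstructure()` is a hypothetical name, and the real method in this PR is implemented in C and may differ in details (for example, in how it validates storage types).

```r
# Hypothetical base R approximation of the vec_unstructure() rules.
# NOT the vctrs implementation, which is written in C.
sketch_vec_unstructure <- function(x) {
  native_types <- c(
    "logical", "integer", "double", "complex", "character", "raw", "list"
  )
  if (!typeof(x) %in% native_types) {
    stop("`x` must be an atomic vector, list, array, or data frame.")
  }

  if (is.data.frame(x)) {
    # Data frames keep names, row.names, and a bare "data.frame" class
    keep <- attributes(x)[c("names", "row.names")]
    attributes(x) <- c(keep, list(class = "data.frame"))
    return(x)
  }

  dim <- attr(x, "dim", exact = TRUE)
  if (!is.null(dim)) {
    # Arrays keep dim and only the first element of dimnames (row names)
    dimnames <- attr(x, "dimnames", exact = TRUE)
    row_names <- if (is.null(dimnames)) NULL else dimnames[[1]]
    attributes(x) <- list(dim = dim)
    if (!is.null(row_names)) {
      new_dimnames <- vector("list", length(dim))
      new_dimnames[1] <- list(row_names)
      attr(x, "dimnames") <- new_dimnames
    }
    return(x)
  }

  # Plain atomic vectors and lists keep only names
  nms <- attr(x, "names", exact = TRUE)
  attributes(x) <- if (is.null(nms)) NULL else list(names = nms)
  x
}

x <- structure(1:3, names = c("a", "b", "c"), foo = "bar", class = "myclass")
sketch_vec_unstructure(x)
```

Anything that is not an atomic vector, list, array, or data frame (for example an environment or `NULL`) hits the `stop()` branch in this sketch.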

rlang::unstructure() rules:

  • Atomic vectors retain names
  • Lists retain names
  • Expression vectors retain names
  • Arrays retain dim and dimnames (note, all dimnames)
  • All other types retain 0 attributes
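For comparison, here is a rough base R sketch of the rlang::unstructure() rules. `sketch_unstructure()` is a hypothetical name used only for illustration; it is not an existing rlang function, and a real implementation may make different choices.

```r
# Hypothetical base R approximation of the proposed unstructure() rules.
sketch_unstructure <- function(x) {
  if (is.null(x)) {
    # NULL is allowed and has no attributes to strip
    return(NULL)
  }

  vector_types <- c(
    "logical", "integer", "double", "complex", "character", "raw",
    "list", "expression"
  )

  keep <- NULL
  if (typeof(x) %in% vector_types) {
    attrs <- attributes(x)
    if (!is.null(attrs[["dim"]])) {
      # Arrays keep dim and all of dimnames
      keep <- attrs[c("dim", "dimnames")]
    } else {
      # Atomic vectors, lists, and expression vectors keep names
      keep <- attrs["names"]
    }
    # Drop entries for attributes that were not present
    keep <- keep[!vapply(keep, is.null, logical(1))]
  }

  # All other types keep 0 attributes. Note that environments are
  # reference objects, so clearing their attributes affects the original.
  attributes(x) <- keep
  x
}

# A data frame is treated like a list: only names survive
df <- data.frame(x = 1:2, y = c("a", "b"))
sketch_unstructure(df)
```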

Notable differences between the two:

  • Only the row names part of dimnames are kept in vec_unstructure(), but all of dimnames are kept in unstructure(), because base R operations propagate all of dimnames
  • Data frames are native in vec_unstructure() but are treated like lists in unstructure()
  • NULL is allowed in unstructure() but not vec_unstructure()
  • environment and all other types are allowed in unstructure() but not vec_unstructure(). Rationale for allowing them in unstructure() is that in structure() you can pass in an environment and add attributes to it, so there should be a way to remove them as well. But no attributes on an environment are ever "critical", so you just clear them.
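The dimnames point is easy to check in base R alone. The small example below (no vctrs or rlang involved) shows that arithmetic on a matrix carries both row and column names through to the result, which is why unstructure() would keep all of dimnames.

```r
# Base R arithmetic propagates every element of dimnames, not just row names
m <- matrix(
  1:4,
  nrow = 2,
  dimnames = list(c("r1", "r2"), c("c1", "c2"))
)

out <- m + 1L

identical(dimnames(out), dimnames(m))
```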

For practical usage of rlang::unstructure():

  • In rray, implementing +, where you'd want to strip off the rray class but retain all of dimnames before delegating to base R's own + method, where dimnames are propagated
  • dplyr:::dplyr_new_list() and tidyr:::tidyr_new_list(), where I often pass in a data frame and expect this to unstructure that into a named list with no extra attributes
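To make the rray use case concrete, here is a minimal sketch with a toy S3 class. The `toy_array` class, `new_toy_array()`, and `strip_toy_array()` are all hypothetical names invented for this example; they are not part of rray, rlang, or vctrs. The manual stripping step stands in for what unstructure() would do.

```r
# Hypothetical toy class standing in for rray
new_toy_array <- function(x) {
  structure(x, class = "toy_array")
}

# Stand-in for unstructure(): drop every attribute except dim and dimnames
strip_toy_array <- function(x) {
  if (!inherits(x, "toy_array")) {
    return(x)
  }
  dim <- attr(x, "dim", exact = TRUE)
  dimnames <- attr(x, "dimnames", exact = TRUE)
  attributes(x) <- NULL
  dim(x) <- dim
  dimnames(x) <- dimnames
  x
}

# `+` method: strip the class, delegate to base R's `+` (which propagates
# dimnames), then restore the class on the result
`+.toy_array` <- function(e1, e2) {
  out <- strip_toy_array(e1) + strip_toy_array(e2)
  new_toy_array(out)
}

a <- new_toy_array(
  matrix(1:4, nrow = 2, dimnames = list(c("r1", "r2"), c("c1", "c2")))
)
res <- a + 1L
dimnames(res)
```

Stripping the class before delegating avoids re-dispatching to `+.toy_array`, while keeping dim and dimnames lets base R do the propagation work.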

It is quite fast, so we might be able to get away without vec_proxy_unsafe(). Not sure yet.

Notably, this uses R's ALTREP wrapper types to avoid copying large objects (since only attributes are being manipulated).

# No attributes to strip
x <- 1:5
bench::mark(.Call(ffi_vec_unstructure, x), iterations = 1000000)
#> # A tibble: 1 × 6
#>   expression                         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 .Call(ffi_vec_unstructure, x)        0   41.2ns 15303728.        0B     107.

# Some attributes to strip
x <- structure(1:5, foo = "bar", class = "myclass")
bench::mark(.Call(ffi_vec_unstructure, x), iterations = 1000000)
#> # A tibble: 1 × 6
#>   expression                         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 .Call(ffi_vec_unstructure, x)    533ns    656ns  1402683.        0B     44.9

# Does not copy the vector!! R's ALTREP wrapper class is being used.
x <- structure(1:1e7, foo = "bar", class = "myclass")
bench::mark(.Call(ffi_vec_unstructure, x), iterations = 1000000)
#> # A tibble: 1 × 6
#>   expression                         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 .Call(ffi_vec_unstructure, x)    533ns    656ns  1368452.        0B     45.2

# A tibble -> data.frame case
x <- tibble::tibble(x = 1, y = 2, z = 3)
bench::mark(.Call(ffi_vec_unstructure, x), iterations = 1000000)
#> # A tibble: 1 × 6
#>   expression                         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 .Call(ffi_vec_unstructure, x)    656ns    779ns  1194149.        0B     45.4

But proxy methods were already quite fast, so maybe we will need vec_proxy_unsafe() after all.

x <- 1:5
bench::mark(vec_proxy(x), iterations = 1000000)
#> # A tibble: 1 × 6
#>   expression        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 vec_proxy(x)    369ns    451ns  1989766.        0B     37.8

x <- tibble::tibble(x = 1, y = 2, z = 3)
bench::mark(vec_proxy(x), iterations = 1000000)
#> # A tibble: 1 × 6
#>   expression        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 vec_proxy(x)    533ns    656ns  1462016.        0B     27.8

I imagine that in something like vec_c() we would use vec_proxy() on the out object we create (because we want to vec_restore() at the end). On the elements, we'd use vec_proxy_unsafe() before copying them over, because we don't care about their extraneous attributes; we just want the C-compatible form that we can copy from.
