
Commit b5382a0

refactor: 🔥 remove convert_register() (#275)
# Description

Since we want to utilise the parallel workers in the targets pipeline, we don't really need this function for converting multiple SAS files. By removing this, we also have less functionality to maintain and keep aligned with any changes in the targets pipeline. Needs a quick review.

## Checklist

- [X] Ran `just run-all`
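With the wrapper removed, converting several SAS files from one register comes down to looping `convert_file()` over the paths, which is the pattern the updated tests now use. A minimal sketch of that pattern, assuming the package's exported `convert_file()` and `list_sas_files()`; the paths are placeholders:

``` r
# Convert each SAS file of a register on its own, mirroring the pattern the
# tests use in place of the removed convert_register() wrapper.
library(purrr)

sas_files <- list_sas_files("path/to/sas_register/")
walk(sas_files, \(path) {
  convert_file(path, output_dir = "path/to/output_dir/")
  gc() # free memory between files, as convert_register() previously did
})
```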
1 parent 04da470 commit b5382a0

9 files changed

Lines changed: 39 additions & 330 deletions


NAMESPACE

Lines changed: 0 additions & 1 deletion
@@ -1,7 +1,6 @@
 # Generated by roxygen2: do not edit by hand
 
 export(convert_file)
-export(convert_register)
 export(list_sas_files)
 export(read_parquet_file)
 export(read_parquet_partition)

R/convert.R

Lines changed: 4 additions & 75 deletions
@@ -1,77 +1,3 @@
-#' Convert register SAS file(s) and save to Parquet format
-#'
-#' @description
-#' This function reads one or more SAS files for a given register, and saves the
-#' data in Parquet format. It expects the input SAS files to come from the same
-#' register, e.g., different years of the same register. The function checks
-#' that all files belong to the same register by comparing the alphabetic
-#' characters in the file name(s).
-#'
-#' The function looks for a year (1900-2099) in the file
-#' names in `path` to use the year as partition, see `vignette("design")`
-#' for more information about the partitioning.
-#'
-#' If a year is found, the data is saved as a partition by year in the output
-#' directory, e.g., `output_dir/register_name/year=2020/part-ad5b.parquet`
-#' (the ending being a UUID). If no year is found in the file name, the data
-#' is saved in a
-#' `year=__HIVE_DEFAULT_PARTITION__` partition, which is the standard Hive
-#' convention for missing partition values.
-#'
-#' Two columns are added to the output: `source_file` (the original SAS file
-#' path) and `year` (extracted from the file name, used as partition key).
-#'
-#' To be able to handle larger-than-memory SAS files, this function uses
-#' `convert_file()` internally and only converts one file at a time in chunks.
-#' As a result, identical rows are not deduplicated.
-#'
-#' @param path Paths to SAS files for one register. See [list_sas_files()].
-#' @param output_dir Directory to save the Parquet output to. Must not include
-#'   the register name as this will be extracted from `path` to create the
-#'   register folder.
-#' @param chunk_size Number of rows to read and convert at a time.
-#'
-#' @returns `output_dir`, invisibly.
-#'
-#' @export
-#' @examples
-#' sas_file_directory <- fs::path_package("fastreg", "extdata")
-#' convert_register(
-#'   path = list_sas_files(sas_file_directory),
-#'   output_dir = fs::path_temp("path/to/output/register/")
-#' )
-convert_register <- function(
-  path,
-  output_dir,
-  chunk_size = 10000000L
-) {
-  # Check that register dir is empty (if exists) to avoid duplicating data
-  # since parts are named with UUIDs.
-  # Get register name checks that only one register is in `path`.
-  register_dir <- fs::path(output_dir, get_register_name(path))
-  if (fs::dir_exists(register_dir) && length(fs::dir_ls(register_dir)) > 0) {
-    cli::cli_abort(c(
-      "Output directory is not empty: {.path {register_dir}}",
-      "i" = "Delete the directory manually before re-running."
-    ))
-  }
-
-  # Convert files.
-  purrr::walk(path, \(p) {
-    convert_file(p, output_dir, chunk_size)
-    gc()
-  })
-
-  # Success message.
-  cli::cli_alert_success("Successfully converted {length(path)} file{?s}.")
-  cli::cli_bullets(c(
-    "*" = "Input: {.val {fs::path_file(path)}}",
-    "*" = "Output: Register files in {.path {fs::path(output_dir, get_register_name(path))}}"
-  ))
-
-  invisible(output_dir)
-}
-
 #' Convert a single register SAS file to Parquet
 #'
 #' To be able to handle larger-than-memory files, the SAS file is converted in
@@ -80,7 +6,10 @@ convert_register <- function(
 #' exists in the directory, since files are saved with UUIDs in their names.
 #'
 #' @param path Path to a single SAS file.
-#' @inheritParams convert_register
+#' @param output_dir Directory to save the Parquet output to. Must not include
+#'   the register name as this will be extracted from `path` to create the
+#'   register folder.
+#' @param chunk_size Number of rows to read and convert at a time.
 #'
 #' @returns `output_dir`, invisibly.
 #'

README.md

Lines changed: 0 additions & 11 deletions
@@ -79,17 +79,6 @@ convert_file(
 )
 ```
 
-Use `convert_register()` to convert several SAS files from the same
-register into a Hive partitioned Parquet dataset. To list all SAS files
-in a directory, you can use the helper function `list_sas_files()`:
-
-``` r
-convert_register(
-  path = list_sas_files("path/to/sas_register/"),
-  output_dir = "path/to/output_dir/"
-)
-```
-
 Use `use_targets_template()` to copy a
 [targets](https://books.ropensci.org/targets/) template that converts
 multiple registers in parallel into your project:

README.qmd

Lines changed: 0 additions & 11 deletions
@@ -79,17 +79,6 @@ convert_file(
 )
 ```
 
-Use `convert_register()` to convert several SAS files from the same
-register into a Hive partitioned Parquet dataset. To list all SAS files
-in a directory, you can use the helper function `list_sas_files()`:
-
-```{r, eval = FALSE}
-convert_register(
-  path = list_sas_files("path/to/sas_register/"),
-  output_dir = "path/to/output_dir/"
-)
-```
-
 Use `use_targets_template()` to copy a
 [targets](https://books.ropensci.org/targets/) template that converts
 multiple registers in parallel into your project:

man/convert_register.Rd

Lines changed: 0 additions & 52 deletions
This file was deleted.

tests/testthat/test-convert.R

Lines changed: 0 additions & 103 deletions
@@ -126,106 +126,3 @@ test_that("convert_file() creates expected n parts when chunk_size < nrow", {
   ))
   expect_equal(n_actual, n_expected)
 })
-
-# Test convert_register() ------------------------------------------------------
-
-# Setup: Convert register
-register_path <- fs::path_temp("parquet_register")
-register_output <- convert_register(
-  path = sas_bef,
-  output_dir = register_path
-)
-
-test_that("convert_register() returns output_dir", {
-  expect_equal(register_output, register_path)
-})
-
-test_that("convert_register() partitions by year based on file names", {
-  expected <- fs::path(
-    register_output,
-    register_name,
-    c("year=__HIVE_DEFAULT_PARTITION__", "year=1999", "year=2020")
-  )
-
-  expect_all_true(fs::dir_exists(expected))
-  # Same number of created files as input files.
-  expect_length(
-    fs::dir_ls(expected),
-    length(sas_bef)
-  )
-})
-
-test_that("convert_register() errors when paths are from different registers", {
-  temp_different_register <- fs::path_temp("other_2020.sas7bdat")
-  suppressWarnings(haven::write_sas(
-    bef_list[[1]],
-    temp_different_register
-  ))
-  expect_error(
-    convert_register(
-      path = c(sas_bef, temp_different_register),
-      output_dir = fs::path_temp("register_different")
-    ),
-    regexp = "Multiple register names"
-  )
-})
-
-test_that("convert_register() errors when output directory is not empty", {
-  output_dir <- fs::path_temp("register_nonempty")
-  convert_register(path = sas_bef, output_dir = output_dir)
-  expect_error(
-    convert_register(
-      path = sas_bef,
-      output_dir = output_dir
-    ),
-    regexp = "not empty"
-  )
-})
-
-test_that("convert_register() converts larger files with chunking", {
-  skip_on_cran()
-
-  # n = 1.1 million to test chunking with chunk_size = 1 million.
-  bef_list_large <- simulate_register(
-    "bef",
-    c("1999", "2020"),
-    n = 1100000
-  )
-  sas_path_large <- fs::path_temp("sas_bef_large")
-  save_as_sas(bef_list_large, sas_path_large)
-  sas_bef_large <- fs::dir_ls(sas_path_large)
-  output_dir_large <- fs::path_temp("parquet_path_large")
-  chunk_size_large <- 1000000L
-
-  convert_register(
-    path = sas_bef_large,
-    output_dir = output_dir_large,
-    chunk_size = chunk_size_large
-  )
-
-  n_expected <- sum(ceiling(
-    purrr::map_int(bef_list_large, nrow) / chunk_size_large
-  ))
-  n_actual <- length(fs::dir_ls(
-    output_dir_large,
-    recurse = TRUE,
-    type = "file"
-  ))
-  expect_equal(n_actual, n_expected)
-})
-
-test_that("convert_register() doesn't error with incompatible schemas", {
-  # Create a bef file where numeric columns are changed to character, so
-  # the schema is incompatible with the other bef files.
-  incompatible_data <- bef_list[[1]] |>
-    dplyr::mutate(dplyr::across(where(is.numeric), as.character))
-
-  incompatible_sas_path <- fs::path_temp("sas_schema_incompatible")
-  save_as_sas(list(bef2099 = incompatible_data), incompatible_sas_path)
-  sas_incompatible <- c(sas_bef, fs::dir_ls(incompatible_sas_path))
-
-  expect_no_error(convert_register(
-    path = sas_incompatible,
-    output_dir = fs::path_temp("incompatible_schemas")
-  ))
-})

tests/testthat/test-read.R

Lines changed: 13 additions & 4 deletions
@@ -6,8 +6,10 @@ save_as_sas(bef_list, sas_path)
 sas_bef <- fs::dir_ls(sas_path)
 output_dir <- fs::path_temp("output_dir")
 
-# Use convert_register() for conversion
-convert_register(path = sas_bef, output_dir = output_dir)
+# Convert files.
+purrr::walk(sas_bef, \(path) {
+  convert_file(path, output_dir)
+})
 
 # Test read_register() ---------------------------------------------------------
 
@@ -116,7 +118,11 @@ test_that("read_register() reads files with different columns", {
   sas_diff_cols <- c(sas_bef, fs::dir_ls(lmdb_sas_path))
 
   diff_cols_output <- fs::path_temp("diff_cols")
-  convert_register(path = sas_diff_cols, output_dir = diff_cols_output)
+
+  # Convert files.
+  purrr::walk(sas_diff_cols, \(path) {
+    convert_file(path, diff_cols_output)
+  })
 
   # Define expected columns.
   expected <- purrr::map(c("bef", "lmdb"), \(x) {
@@ -144,7 +150,10 @@ test_that("read_register() errors with incompatible schemas", {
   sas_incompatible <- c(sas_bef, fs::dir_ls(incompatible_sas_path))
 
   incompatible_output <- fs::path_temp("incompatible")
-  convert_register(path = sas_incompatible, output_dir = incompatible_output)
+  # Convert files.
+  purrr::walk(sas_incompatible, \(path) {
+    convert_file(path, incompatible_output)
+  })
 
   expect_error(read_register(incompatible_output), "incompatible")
 })

vignettes/design.qmd

Lines changed: 16 additions & 16 deletions
@@ -51,24 +51,24 @@ For a list of all the public functions, see the
 page.
 :::
 
-### Converting SAS files from a single register
+### Converting one SAS file
 
 ```{mermaid}
 %%| label: fig-flow
-%%| fig-cap: "Expected workflow for converting SAS files from a single register using `convert_register()`."
+%%| fig-cap: "Expected workflow for converting one SAS file using `convert_file()`."
 %%| fig-alt: "A flowchart showing the expected flow of converting register SAS files to Parquet files."
 flowchart TD
   identify_paths("Identify register path(s)<br>with list_sas_files(path)")
  path[/"path<br>[Character vector]"/]
  output_dir[/"output_dir<br>[Character scalar]"/]
  chunk_size[/"chunk_size<br>[Integer scalar]"/]
-  convert_register("convert_register()")
+  convert_file("convert_file()")
  output[/"Parquet file(s)<br>written to output_dir"/]
 
  %% Edges
-  identify_paths -.-> path --> convert_register
-  output_dir & chunk_size --> convert_register
-  convert_register --> output
+  identify_paths -.-> path --> convert_file
+  output_dir & chunk_size --> convert_file
+  convert_file --> output
 
  %% Style
  style identify_paths fill:#FFFFFF, color:#000000, stroke-dasharray: 5 5
@@ -95,16 +95,16 @@ flowchart TD
 
 ::: callout-warning
 `convert_file()`, the core function behind converting SAS files to
-Parquet and used within `convert_register()` and the targets template,
-creates an Arrow schema with data types based on the first file chunk.
-This means that data type schemas are defined *within* files only. As a
-result, if there's a drift in data types across SAS files in the same
-register, this may not be identified in the conversion process, but will
-become evident when attempting to read the register.
-
-We use this design to ensure that subsequent chunks follow the same schema
-as the first, as we don't want to have different data types across chunks
-of the same partition (e.g. `part-*.parquet`).
+Parquet used within the targets template, creates an Arrow schema with
+data types based on the first file chunk. This means that data type
+schemas are defined *within* files only. As a result, if there's a drift
+in data types across SAS files in the same register, this may not be
+identified in the conversion process, but will become evident when
+attempting to read the register.
+
+We use this design to ensure that subsequent chunks follow the same
+schema as the first, as we don't want to have different data types
+across chunks of the same partition (e.g. `part-*.parquet`).
 :::
 
 ### Reading a Parquet register

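The commit message points to the targets pipeline's parallel workers as the replacement for batch conversion. A hypothetical `_targets.R` fragment showing how per-file conversion parallelises with dynamic branching; target names and paths are illustrative, and only `convert_file()` and `list_sas_files()` come from this package:

``` r
# Hypothetical targets pipeline sketch: one dynamic branch per SAS file, so
# parallel workers can convert files concurrently. Names are illustrative.
library(targets)

list(
  tar_target(sas_files, list_sas_files("path/to/sas_register/")),
  tar_target(
    converted,
    convert_file(sas_files, output_dir = "path/to/output_dir/"),
    pattern = map(sas_files) # one branch per file
  )
)
```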