
Commit b5382a0

refactor: 🔥 remove convert_register() (#275)
# Description

Since we want to utilise the parallel workers in the targets pipeline, we don't really need this function for converting multiple SAS files. By removing this, we also have less functionality to maintain and keep aligned with any changes in the targets pipeline. Needs a quick review.

## Checklist

- [X] Ran `just run-all`
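With the wrapper removed, converting several SAS files from one register comes down to looping `convert_file()` over the paths, which is the pattern the updated tests now use. A minimal sketch of that pattern, assuming the package's exported `convert_file()` and `list_sas_files()`; the paths are placeholders:

``` r
# Convert each SAS file of a register on its own, mirroring the pattern the
# tests use in place of the removed convert_register() wrapper.
library(purrr)

sas_files <- list_sas_files("path/to/sas_register/")
walk(sas_files, \(path) {
  convert_file(path, output_dir = "path/to/output_dir/")
  gc() # free memory between files, as convert_register() previously did
})
```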
1 parent 04da470 commit b5382a0

9 files changed

Lines changed: 39 additions & 330 deletions


NAMESPACE

Lines changed: 0 additions & 1 deletion
@@ -1,7 +1,6 @@
 # Generated by roxygen2: do not edit by hand
 
 export(convert_file)
-export(convert_register)
 export(list_sas_files)
 export(read_parquet_file)
 export(read_parquet_partition)

R/convert.R

Lines changed: 4 additions & 75 deletions
@@ -1,77 +1,3 @@
-#' Convert register SAS file(s) and save to Parquet format
-#'
-#' @description
-#' This function reads one or more SAS files for a given register, and saves the
-#' data in Parquet format. It expects the input SAS files to come from the same
-#' register, e.g., different years of the same register. The function checks
-#' that all files belong to the same register by comparing the alphabetic
-#' characters in the file name(s).
-#'
-#' The function looks for a year (1900-2099) in the file
-#' names in `path` to use the year as partition, see `vignette("design")`
-#' for more information about the partitioning.
-#'
-#' If a year is found, the data is saved as a partition by year in the output
-#' directory, e.g., `output_dir/register_name/year=2020/part-ad5b.parquet`
-#' (the ending being a UUID). If no year is found in the file name, the data
-#' is saved in a
-#' `year=__HIVE_DEFAULT_PARTITION__` partition, which is the standard Hive
-#' convention for missing partition values.
-#'
-#' Two columns are added to the output: `source_file` (the original SAS file
-#' path) and `year` (extracted from the file name, used as partition key).
-#'
-#' To be able to handle larger-than-memory SAS files, this function uses
-#' `convert_file()` internally and only converts one file at a time in chunks.
-#' As a result, identical rows are not deduplicated.
-#'
-#' @param path Paths to SAS files for one register. See [list_sas_files()].
-#' @param output_dir Directory to save the Parquet output to. Must not include
-#'   the register name as this will be extracted from `path` to create the
-#'   register folder.
-#' @param chunk_size Number of rows to read and convert at a time.
-#'
-#' @returns `output_dir`, invisibly.
-#'
-#' @export
-#' @examples
-#' sas_file_directory <- fs::path_package("fastreg", "extdata")
-#' convert_register(
-#'   path = list_sas_files(sas_file_directory),
-#'   output_dir = fs::path_temp("path/to/output/register/")
-#' )
-convert_register <- function(
-  path,
-  output_dir,
-  chunk_size = 10000000L
-) {
-  # Check that register dir is empty (if exists) to avoid duplicating data
-  # since parts are named with UUIDs.
-  # Get register name checks that only one register is in `path`.
-  register_dir <- fs::path(output_dir, get_register_name(path))
-  if (fs::dir_exists(register_dir) && length(fs::dir_ls(register_dir)) > 0) {
-    cli::cli_abort(c(
-      "Output directory is not empty: {.path {register_dir}}",
-      "i" = "Delete the directory manually before re-running."
-    ))
-  }
-
-  # Convert files.
-  purrr::walk(path, \(p) {
-    convert_file(p, output_dir, chunk_size)
-    gc()
-  })
-
-  # Success message.
-  cli::cli_alert_success("Successfully converted {length(path)} file{?s}.")
-  cli::cli_bullets(c(
-    "*" = "Input: {.val {fs::path_file(path)}}",
-    "*" = "Output: Register files in {.path {fs::path(output_dir, get_register_name(path))}}"
-  ))
-
-  invisible(output_dir)
-}
-
 #' Convert a single register SAS file to Parquet
 #'
 #' To be able to handle larger-than-memory files, the SAS file is converted in
@@ -80,7 +6,10 @@ convert_register <- function(
 #' exists in the directory, since files are saved with UUIDs in their names.
 #'
 #' @param path Path to a single SAS file.
-#' @inheritParams convert_register
+#' @param output_dir Directory to save the Parquet output to. Must not include
+#'   the register name as this will be extracted from `path` to create the
+#'   register folder.
+#' @param chunk_size Number of rows to read and convert at a time.
 #'
 #' @returns `output_dir`, invisibly.
 #'

README.md

Lines changed: 0 additions & 11 deletions
@@ -79,17 +79,6 @@ convert_file(
 )
 ```
 
-Use `convert_register()` to convert several SAS files from the same
-register into a Hive partitioned Parquet dataset. To list all SAS files
-in a directory, you can use the helper function `list_sas_files()`:
-
-``` r
-convert_register(
-  path = list_sas_files("path/to/sas_register/"),
-  output_dir = "path/to/output_dir/"
-)
-```
-
 Use `use_targets_template()` to copy a
 [targets](https://books.ropensci.org/targets/) template that converts
 multiple registers in parallel into your project:

README.qmd

Lines changed: 0 additions & 11 deletions
@@ -79,17 +79,6 @@ convert_file(
 )
 ```
 
-Use `convert_register()` to convert several SAS files from the same
-register into a Hive partitioned Parquet dataset. To list all SAS files
-in a directory, you can use the helper function `list_sas_files()`:
-
-```{r, eval = FALSE}
-convert_register(
-  path = list_sas_files("path/to/sas_register/"),
-  output_dir = "path/to/output_dir/"
-)
-```
-
 Use `use_targets_template()` to copy a
 [targets](https://books.ropensci.org/targets/) template that converts
 multiple registers in parallel into your project:

man/convert_register.Rd

Lines changed: 0 additions & 52 deletions
This file was deleted.

tests/testthat/test-convert.R

Lines changed: 0 additions & 103 deletions
@@ -126,106 +126,3 @@ test_that("convert_file() creates expected n parts when chunk_size < nrow", {
   ))
   expect_equal(n_actual, n_expected)
 })
-
-# Test convert_register() ------------------------------------------------------
-
-# Setup: Convert register
-register_path <- fs::path_temp("parquet_register")
-register_output <- convert_register(
-  path = sas_bef,
-  output_dir = register_path
-)
-
-test_that("convert_register() returns output_dir", {
-  expect_equal(register_output, register_path)
-})
-
-test_that("convert_register() partitions by year based on file names", {
-  expected <- fs::path(
-    register_output,
-    register_name,
-    c("year=__HIVE_DEFAULT_PARTITION__", "year=1999", "year=2020")
-  )
-
-  expect_all_true(fs::dir_exists(expected))
-  # Same number of created files as input files.
-  expect_length(
-    fs::dir_ls(expected),
-    length(sas_bef)
-  )
-})
-
-test_that("convert_register() errors when paths are from different registers", {
-  temp_different_register <- fs::path_temp("other_2020.sas7bdat")
-  suppressWarnings(haven::write_sas(
-    bef_list[[1]],
-    temp_different_register
-  ))
-  expect_error(
-    convert_register(
-      path = c(sas_bef, temp_different_register),
-      output_dir = fs::path_temp("register_different")
-    ),
-    regexp = "Multiple register names"
-  )
-})
-
-test_that("convert_register() errors when output directory is not empty", {
-  output_dir <- fs::path_temp("register_nonempty")
-  convert_register(path = sas_bef, output_dir = output_dir)
-  expect_error(
-    convert_register(
-      path = sas_bef,
-      output_dir = output_dir
-    ),
-    regexp = "not empty"
-  )
-})
-
-test_that("convert_register() converts larger files with chunking", {
-  skip_on_cran()
-
-  # n = 1.1 million to test chunking with chunk_size = 1 million.
-  bef_list_large <- simulate_register(
-    "bef",
-    c("1999", "2020"),
-    n = 1100000
-  )
-  sas_path_large <- fs::path_temp("sas_bef_large")
-  save_as_sas(bef_list_large, sas_path_large)
-  sas_bef_large <- fs::dir_ls(sas_path_large)
-  output_dir_large <- fs::path_temp("parquet_path_large")
-  chunk_size_large <- 1000000L
-
-  convert_register(
-    path = sas_bef_large,
-    output_dir = output_dir_large,
-    chunk_size = chunk_size_large
-  )
-
-  n_expected <- sum(ceiling(
-    purrr::map_int(bef_list_large, nrow) / chunk_size_large
-  ))
-  n_actual <- length(fs::dir_ls(
-    output_dir_large,
-    recurse = TRUE,
-    type = "file"
-  ))
-  expect_equal(n_actual, n_expected)
-})
-
-test_that("convert_register() doesn't error with incompatible schemas", {
-  # Create a bef file where numeric columns are changed to character, so
-  # the schema is incompatible with the other bef files.
-  incompatible_data <- bef_list[[1]] |>
-    dplyr::mutate(dplyr::across(where(is.numeric), as.character))
-
-  incompatible_sas_path <- fs::path_temp("sas_schema_incompatible")
-  save_as_sas(list(bef2099 = incompatible_data), incompatible_sas_path)
-  sas_incompatible <- c(sas_bef, fs::dir_ls(incompatible_sas_path))
-
-  expect_no_error(convert_register(
-    path = sas_incompatible,
-    output_dir = fs::path_temp("incompatible_schemas")
-  ))
-})

tests/testthat/test-read.R

Lines changed: 13 additions & 4 deletions
@@ -6,8 +6,10 @@ save_as_sas(bef_list, sas_path)
 sas_bef <- fs::dir_ls(sas_path)
 output_dir <- fs::path_temp("output_dir")
 
-# Use convert_register() for conversion
-convert_register(path = sas_bef, output_dir = output_dir)
+# Convert files.
+purrr::walk(sas_bef, \(path) {
+  convert_file(path, output_dir)
+})
 
 # Test read_register() ---------------------------------------------------------
 
@@ -116,7 +118,11 @@ test_that("read_register() reads files with different columns", {
   sas_diff_cols <- c(sas_bef, fs::dir_ls(lmdb_sas_path))
 
   diff_cols_output <- fs::path_temp("diff_cols")
-  convert_register(path = sas_diff_cols, output_dir = diff_cols_output)
+
+  # Convert files.
+  purrr::walk(sas_diff_cols, \(path) {
+    convert_file(path, diff_cols_output)
+  })
 
   # Define expected columns.
   expected <- purrr::map(c("bef", "lmdb"), \(x) {
@@ -144,7 +150,10 @@ test_that("read_register() errors with incompatible schemas", {
   sas_incompatible <- c(sas_bef, fs::dir_ls(incompatible_sas_path))
 
   incompatible_output <- fs::path_temp("incompatible")
-  convert_register(path = sas_incompatible, output_dir = incompatible_output)
+  # Convert files.
+  purrr::walk(sas_incompatible, \(path) {
+    convert_file(path, incompatible_output)
+  })
 
   expect_error(read_register(incompatible_output), "incompatible")
 })

vignettes/design.qmd

Lines changed: 16 additions & 16 deletions
@@ -51,24 +51,24 @@ For a list of all the public functions, see the
 page.
 :::
 
-### Converting SAS files from a single register
+### Converting one SAS file
 
 ```{mermaid}
 %%| label: fig-flow
-%%| fig-cap: "Expected workflow for converting SAS files from a single register using `convert_register()`."
+%%| fig-cap: "Expected workflow for converting one SAS file using `convert_file()`."
 %%| fig-alt: "A flowchart showing the expected flow of converting register SAS files to Parquet files."
 flowchart TD
   identify_paths("Identify register path(s)<br>with list_sas_files(path)")
  path[/"path<br>[Character vector]"/]
  output_dir[/"output_dir<br>[Character scalar]"/]
  chunk_size[/"chunk_size<br>[Integer scalar]"/]
-  convert_register("convert_register()")
+  convert_file("convert_file()")
  output[/"Parquet file(s)<br>written to output_dir"/]
 
  %% Edges
-  identify_paths -.-> path --> convert_register
-  output_dir & chunk_size --> convert_register
-  convert_register --> output
+  identify_paths -.-> path --> convert_file
+  output_dir & chunk_size --> convert_file
+  convert_file --> output
 
  %% Style
  style identify_paths fill:#FFFFFF, color:#000000, stroke-dasharray: 5 5
@@ -95,16 +95,16 @@ flowchart TD
 
 ::: callout-warning
 `convert_file()`, the core function behind converting SAS files to
-Parquet and used within `convert_register()` and the targets template,
-creates an Arrow schema with data types based on the first file chunk.
-This means that data type schemas are defined *within* files only. As a
-result, if there's a drift in data types across SAS files in the same
-register, this may not be identified in the conversion process, but will
-become evident when attempting to read the register.
-
-We use this design to ensure that subsequent chunks follow the same schema
-as the first, as we don't want to have different data types across chunks
-of the same partition (e.g. `part-*.parquet`).
+Parquet used within the targets template, creates an Arrow schema with
+data types based on the first file chunk. This means that data type
+schemas are defined *within* files only. As a result, if there's a drift
+in data types across SAS files in the same register, this may not be
+identified in the conversion process, but will become evident when
+attempting to read the register.
+
+We use this design to ensure that subsequent chunks follow the same
+schema as the first, as we don't want to have different data types
+across chunks of the same partition (e.g. `part-*.parquet`).
 :::
 
 ### Reading a Parquet register

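The commit message points to the targets pipeline's parallel workers as the replacement for batch conversion. A hypothetical `_targets.R` fragment showing how per-file conversion parallelises with dynamic branching; target names and paths are illustrative, and only `convert_file()` and `list_sas_files()` come from this package:

``` r
# Hypothetical targets pipeline sketch: one dynamic branch per SAS file, so
# parallel workers can convert files concurrently. Names are illustrative.
library(targets)

list(
  tar_target(sas_files, list_sas_files("path/to/sas_register/")),
  tar_target(
    converted,
    convert_file(sas_files, output_dir = "path/to/output_dir/"),
    pattern = map(sas_files) # one branch per file
  )
)
```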