Commit 7f78919

Merge branch 'main' into docs/conversion-report
2 parents 2201a85 + 63d7bd7 commit 7f78919

15 files changed

Lines changed: 275 additions & 343 deletions

.cz.toml

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 [tool.commitizen]
-version = "0.9.0"
+version = "0.10.0"
 bump_message = "build(version): :bookmark: update version from $current_version to $new_version"
 version_schema = "semver"
 version_files = [

DESCRIPTION

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 Package: fastreg
 Title: Fast Conversion and Querying of Danish Registers with 'Parquet'
-Version: 0.9.0
+Version: 0.10.0
 Authors@R: c(
     person("Signe Kirk", "Brødbæk", , "signekb@clin.au.dk", role = c("aut", "cre"),
            comment = c(ORCID = "0009-0000-2208-7088")),

NAMESPACE

Lines changed: 0 additions & 1 deletion
@@ -1,7 +1,6 @@
 # Generated by roxygen2: do not edit by hand
 
 export(convert_file)
-export(convert_register)
 export(list_sas_files)
 export(read_parquet_file)
 export(read_parquet_partition)

NEWS.md

Lines changed: 12 additions & 0 deletions
@@ -16,6 +16,18 @@ individual release will not have many changes within it. Below is a list
 of the releases we've made so far, along with what was changed within
 each release.
 
+## 0.10.0 (2026-04-23)
+
+### Feat
+
+- :sparkles: helper `get_*()` for project IDs and directories (#251)
+
+## 0.9.1 (2026-04-22)
+
+### Refactor
+
+- 🔥 remove `convert_register()` (#275)
+
 ## 0.9.0 (2026-04-20)
 
 ### Feat

R/convert.R

Lines changed: 4 additions & 75 deletions
@@ -1,77 +1,3 @@
-#' Convert register SAS file(s) and save to Parquet format
-#'
-#' @description
-#' This function reads one or more SAS files for a given register, and saves the
-#' data in Parquet format. It expects the input SAS files to come from the same
-#' register, e.g., different years of the same register. The function checks
-#' that all files belong to the same register by comparing the alphabetic
-#' characters in the file name(s).
-#'
-#' The function looks for a year (1900-2099) in the file
-#' names in `path` to use the year as partition, see `vignette("design")`
-#' for more information about the partitioning.
-#'
-#' If a year is found, the data is saved as a partition by year in the output
-#' directory, e.g., `output_dir/register_name/year=2020/part-ad5b.parquet`
-#' (the ending being a UUID). If no year is found in the file name, the data
-#' is saved in a
-#' `year=__HIVE_DEFAULT_PARTITION__` partition, which is the standard Hive
-#' convention for missing partition values.
-#'
-#' Two columns are added to the output: `source_file` (the original SAS file
-#' path) and `year` (extracted from the file name, used as partition key).
-#'
-#' To be able to handle larger-than-memory SAS files, this function uses
-#' `convert_file()` internally and only converts one file at a time in chunks.
-#' As a result, identical rows are not deduplicated.
-#'
-#' @param path Paths to SAS files for one register. See [list_sas_files()].
-#' @param output_dir Directory to save the Parquet output to. Must not include
-#' the register name as this will be extracted from `path` to create the
-#' register folder.
-#' @param chunk_size Number of rows to read and convert at a time.
-#'
-#' @returns `output_dir`, invisibly.
-#'
-#' @export
-#' @examples
-#' sas_file_directory <- fs::path_package("fastreg", "extdata")
-#' convert_register(
-#' path = list_sas_files(sas_file_directory),
-#' output_dir = fs::path_temp("path/to/output/register/")
-#' )
-convert_register <- function(
-  path,
-  output_dir,
-  chunk_size = 10000000L
-) {
-  # Check that register dir is empty (if exists) to avoid duplicating data
-  # since parts are named with UUIDs.
-  # Get register name checks that only one register is in `path`.
-  register_dir <- fs::path(output_dir, get_register_name(path))
-  if (fs::dir_exists(register_dir) && length(fs::dir_ls(register_dir)) > 0) {
-    cli::cli_abort(c(
-      "Output directory is not empty: {.path {register_dir}}",
-      "i" = "Delete the directory manually before re-running."
-    ))
-  }
-
-  # Convert files.
-  purrr::walk(path, \(p) {
-    convert_file(p, output_dir, chunk_size)
-    gc()
-  })
-
-  # Success message.
-  cli::cli_alert_success("Successfully converted {length(path)} file{?s}.")
-  cli::cli_bullets(c(
-    "*" = "Input: {.val {fs::path_file(path)}}",
-    "*" = "Output: Register files in {.path {fs::path(output_dir, get_register_name(path))}}"
-  ))
-
-  invisible(output_dir)
-}
-
 #' Convert a single register SAS file to Parquet
 #'
 #' To be able to handle larger-than-memory files, the SAS file is converted in
@@ -80,7 +6,10 @@ convert_register <- function(
 #' exists in the directory, since files are saved with UUIDs in their names.
 #'
 #' @param path Path to a single SAS file.
-#' @inheritParams convert_register
+#' @param output_dir Directory to save the Parquet output to. Must not include
+#' the register name as this will be extracted from `path` to create the
+#' register folder.
+#' @param chunk_size Number of rows to read and convert at a time.
 #'
 #' @returns `output_dir`, invisibly.
 #'
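
With `convert_register()` removed, callers that relied on it need to loop over `convert_file()` themselves. A minimal sketch of that loop, mirroring the removed function body (one file at a time, with `gc()` between files), might look like the following. The name `convert_files()` and its `convert` argument are hypothetical and not part of fastreg, and the sketch leaves out the removed function's check that the register directory is empty before converting.

```r
# Hypothetical helper (not part of fastreg): convert several SAS files one at
# a time, as the removed `convert_register()` did internally.
# `convert` defaults to `fastreg::convert_file`; it is an argument only so the
# loop can be tried with a stand-in function.
convert_files <- function(
  paths,
  output_dir,
  chunk_size = 10000000L,
  convert = fastreg::convert_file
) {
  for (p in paths) {
    # Same positional call as in the removed function body.
    convert(p, output_dir, chunk_size)
    # Free memory between files, as the removed function did.
    gc()
  }
  invisible(output_dir)
}
```

Assuming fastreg is installed, a call such as `convert_files(list_sas_files("path/to/sas_register/"), "path/to/output_dir/")` would then stand in for the README example that this commit deletes.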

R/get.R

Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
+#' Get the project ID from the current working directory path
+#'
+#' Gets a numeric project ID from the current working directory path by looking
+#' for a folder name with only digits. Errors if a project ID with an unexpected
+#' length was found.
+#'
+#' @returns A 6-digit character string, or `NA` if no project ID is found in the
+#' path.
+#' @noRd
+get_project_id <- function() {
+  id <- fs::path_wd() |>
+    stringr::str_extract("/[0-9]+/") |>
+    stringr::str_remove_all("/")
+
+  if (is.na(id) || id == "") {
+    cli::cli_warn(
+      c(
+        "No project ID could be found in the path of the current working directory, so outputting `NA`.",
+        "i" = "Your path is {fs::path_wd()}. Maybe change to a working directory within a project?"
+      )
+    )
+  }
+
+  if (stringr::str_length(id) != 6 && !is.na(id)) {
+    cli::cli_abort(
+      c(
+        "Found an ID, but it was too long or too short to be a project ID.",
+        "i" = "The ID found was {id}. Project IDs are expected to be 6 digits long."
+      )
+    )
+  }
+  id
+}
+
+#' Get the path to the rawdata or workdata directory for the current project
+#'
+#' Looks in the [options()] for `fastreg.project_rawdata_dir` and
+#' `fastreg.project_workdata_dir` first, and if not found, constructs a path
+#' based on the project ID using `get_project_id()`. The constructed path is
+#' `E:/<project_id>/rawdata/` for raw data and `E:/<project_id>/workdata/` for
+#' work data.
+#'
+#' @returns A path object.
+#' @noRd
+get_project_rawdata_dir <- function() {
+  rawdata_path <- getOption("fastreg.project_rawdata_dir")
+  if (!is.null(rawdata_path)) {
+    return(fs::path(rawdata_path))
+  }
+
+  id <- get_project_id()
+  if (is.na(id) || id == "") {
+    cli::cli_abort(
+      c(
+        "Can't set the {.path rawdata/} path without a project ID.",
+        "i" = "Use {.code options(fastreg.project_rawdata_dir = '<path>')} or change into a directory within a project."
+      )
+    )
+  }
+
+  glue::glue("E:/{id}/rawdata/") |>
+    fs::path()
+}
+
+#' @describeIn get_project_rawdata_dir Gets the project workdata directory.
+#' @noRd
+get_project_workdata_dir <- function() {
+  workdata_path <- getOption("fastreg.project_workdata_dir")
+  if (!is.null(workdata_path)) {
+    return(fs::path(workdata_path))
+  }
+
+  id <- get_project_id()
+  if (is.na(id) || id == "") {
+    cli::cli_abort(
+      c(
+        "Can't set the {.path workdata/} path without a project ID.",
+        "i" = "Use {.code options(fastreg.project_workdata_dir = '<path>')} or change into a working directory within a project."
+      )
+    )
+  }
+  glue::glue("E:/{id}/workdata/") |>
+    fs::path()
+}
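
The lookup order described in the roxygen comments above (option first, then a path built from the project ID) can be sketched in base R with the directory passed in as an argument, which makes the behaviour easy to try on example paths. The function names `extract_project_id()` and `resolve_rawdata_dir()` are hypothetical, base R regex functions stand in for the `stringr` and `fs` calls, and plain `stop()` replaces `cli::cli_abort()`. Only the option name, the `/[0-9]+/` pattern, the 6-digit rule, and the `E:/<project_id>/rawdata/` layout come from the diff.

```r
# Hypothetical stand-in for `get_project_id()`: takes the path as an argument
# instead of reading the working directory.
extract_project_id <- function(path) {
  # First run of digits enclosed by slashes, e.g. "/123456/".
  match <- regmatches(path, regexpr("/[0-9]+/", path))
  if (length(match) == 0) {
    return(NA_character_)
  }
  id <- gsub("/", "", match, fixed = TRUE)
  if (nchar(id) != 6) {
    stop("Found ID '", id, "', but project IDs are expected to be 6 digits long.")
  }
  id
}

# Hypothetical stand-in for `get_project_rawdata_dir()`.
resolve_rawdata_dir <- function(path = getwd()) {
  # An explicitly set option takes precedence over the directory path.
  from_option <- getOption("fastreg.project_rawdata_dir")
  if (!is.null(from_option)) {
    return(from_option)
  }
  id <- extract_project_id(path)
  if (is.na(id)) {
    stop("Can't build the rawdata path without a project ID.")
  }
  paste0("E:/", id, "/rawdata/")
}
```

For example, `extract_project_id("E:/123456/workdata/analysis")` gives `"123456"`, while a path that ends directly in the ID, such as `"E:/123456"`, gives `NA` in this sketch because the pattern requires a slash on both sides of the digits.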

README.md

Lines changed: 3 additions & 12 deletions
@@ -4,6 +4,8 @@
 
 <!-- badges: start -->
 
+[![CRAN
+status](https://www.r-pkg.org/badges/version/fastreg.png)](https://CRAN.R-project.org/package=fastreg)
 [![GitHub
 Release](https://img.shields.io/github/v/release/dp-next/fastreg.svg)](https://github.com/dp-next/fastreg/releases/latest)
 [![Build](https://github.com/dp-next/fastreg/actions/workflows/build.yml/badge.svg)](https://github.com/dp-next/fastreg/actions/workflows/build.yml)
@@ -13,7 +15,7 @@ status](https://results.pre-commit.ci/badge/github/dp-next/fastreg/main.svg)](ht
 [![Project Status: Active – The project has reached a stable, usable
 state and is being actively
 developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
-[![CRAN status](https://www.r-pkg.org/badges/version/fastreg)](https://CRAN.R-project.org/package=fastreg)
+
 <!-- badges: end -->
 
 ## Overview
@@ -77,17 +79,6 @@ convert_file(
 )
 ```
 
-Use `convert_register()` to convert several SAS files from the same
-register into a Hive partitioned Parquet dataset. To list all SAS files
-in a directory, you can use the helper function `list_sas_files()`:
-
-``` r
-convert_register(
-  path = list_sas_files("path/to/sas_register/"),
-  output_dir = "path/to/output_dir/"
-)
-```
-
 Use `use_targets_template()` to copy a
 [targets](https://books.ropensci.org/targets/) template that converts
 multiple registers in parallel into your project:

README.qmd

Lines changed: 0 additions & 11 deletions
@@ -79,17 +79,6 @@ convert_file(
 )
 ```
 
-Use `convert_register()` to convert several SAS files from the same
-register into a Hive partitioned Parquet dataset. To list all SAS files
-in a directory, you can use the helper function `list_sas_files()`:
-
-```{r, eval = FALSE}
-convert_register(
-  path = list_sas_files("path/to/sas_register/"),
-  output_dir = "path/to/output_dir/"
-)
-```
-
 Use `use_targets_template()` to copy a
 [targets](https://books.ropensci.org/targets/) template that converts
 multiple registers in parallel into your project:

man/convert_register.Rd

Lines changed: 0 additions & 52 deletions
This file was deleted.
