Commit bbd7279

docs: 📝 return chunk_info from convert() and add Quarto report
1 parent 7f78919 commit bbd7279

1 file changed

Lines changed: 65 additions & 20 deletions

vignettes/design.qmd

@@ -48,7 +48,7 @@ The actions are:
4848
- `convert`: Convert a register SAS file (or multiple) to Parquet.
4949
- `list`: List files in a directory, e.g., SAS or Parquet files.
5050
- `read`: Read a Parquet register into R as a DuckDB table.
51-
- `use`: Use a template in the current project.
51+
- `use`: Set up `_targets.R` and a Quarto report template.
5252
- `get`: Get or guess some information, e.g., the project ID, workdata
5353
directory, or rawdata directory from the current working directory.
5454

@@ -144,28 +144,73 @@ within the same register contain exactly the same columns and data
144144
types, the conversion report helps identify any differences between
145145
these files.
146146

147-
It addresses the following question:
148-
149-
- Are the column names, number of columns, and data types the same
150-
across all Parquet files within a register?
151-
152-
The report is produced by the following functions with distinct
153-
responsibilities:
154-
155-
- `get_parquet_info()`: Reads the Parquet parts for a single file and
156-
returns the column names and data types.
157-
- `create_report()`: Assembles the report by taking the outputs of
158-
`get_parquet_info()` for all files in a register and returns a
159-
structured list with a comparison and basic check of the files. It
160-
reports the findings of the check for the register as either OK (all
161-
column names, data types, and number of columns are the same across
162-
all Parquet files) or DIFF (one or more files have different column
163-
names, number of columns, or data types).
164-
165147
::: callout-note
166148
Discrepancies (different columns or incompatible data types) between
167149
files within the same register do not stop the conversion, but will be
168150
noted in the report.
169151
:::
170152

171-
The results are written to a txt file to allow for later inspection.
153+
`convert_file()` returns a metadata tibble with one row per written
154+
chunk. This can be queried with `dplyr` directly or rendered into a
155+
Quarto report.
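
As a rough illustration of querying this tibble with `dplyr`, the sketch below summarises chunks per source file and flattens the nested `columns` tibble into one row per column per chunk. The `chunk_info` object, its paths, and its types are made-up example data following the column layout described in this section; they are not output from the package.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Made-up chunk info: two chunks from one source file, one from another.
chunk_info <- tibble(
  input_path = c("a.sas7bdat", "a.sas7bdat", "b.sas7bdat"),
  output_path = c("a/part-0.parquet", "a/part-1.parquet", "b/part-0.parquet"),
  row_count = c(100L, 80L, 50L),
  column_count = c(2L, 2L, 2L),
  columns = list(
    tibble(name = c("id", "value"), type = c("character", "numeric")),
    tibble(name = c("id", "value"), type = c("character", "numeric")),
    tibble(name = c("id", "value"), type = c("character", "character"))
  )
)

# Number of chunks and total rows written per source file.
per_file <- chunk_info |>
  group_by(input_path) |>
  summarise(chunks = n(), rows = sum(row_count))

# One row per column per chunk, convenient for filtering or comparing types.
schema <- chunk_info |>
  select(output_path, columns) |>
  unnest(columns)
```

Keeping `columns` nested leaves the main tibble at one row per chunk, while `tidyr::unnest()` gives the long form when column-level detail is needed.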
156+
157+
### Return value of `convert_file()`
158+
159+
The returned tibble contains the following columns:
160+
161+
| Column | Description |
162+
|----------------|----------------------------------------------|
163+
| `input_path` | Path to the source SAS file |
164+
| `output_path` | Path to the written Parquet part file |
165+
| `row_count` | Number of rows in the chunk |
166+
| `column_count` | Number of columns in the chunk |
167+
| `columns` | Nested tibble with column `name` and `type` |
168+
169+
The information is derived from the chunk already in memory, not by
170+
reading the Parquet file back.
171+
172+
```r
173+
# Before repeat loop.
174+
chunk_info_list <- list()
175+
176+
# Inside the repeat loop, after writing.
177+
chunk_info_list <- c(chunk_info_list, list(tibble::tibble(
178+
input_path = path,
179+
output_path = fs::path(file_path),
180+
row_count = nrow(chunk),
181+
column_count = ncol(chunk),
182+
columns = list(tibble::tibble(
183+
name = colnames(chunk),
184+
type = purrr::map_chr(chunk, \(col) class(col)[1])
185+
))
186+
)))
187+
188+
# After the loop, bind all chunk tibbles.
189+
dplyr::bind_rows(chunk_info_list)
190+
```
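
The snippet above depends on variables from the surrounding loop (`path`, `file_path`, `chunk`). As a self-contained sketch, the same per-chunk summary could be wrapped in a helper. `describe_chunk()` is a hypothetical name used only for illustration and is not part of the package. It records only the first class of each column, on the assumption that one string per column is wanted: some columns, such as date-times, carry more than one class, and `purrr::map_chr()` expects exactly one string per element.

```r
library(tibble)
library(purrr)

# Hypothetical helper, not part of the package: summarise one in-memory chunk.
describe_chunk <- function(chunk, input_path, output_path) {
  tibble(
    input_path = input_path,
    output_path = output_path,
    row_count = nrow(chunk),
    column_count = ncol(chunk),
    columns = list(tibble(
      name = colnames(chunk),
      # Keep only the first class so every column yields exactly one string.
      type = map_chr(chunk, \(col) class(col)[1])
    ))
  )
}

# Made-up chunk with a date-time column, which has classes POSIXct and POSIXt.
chunk <- tibble(
  id = c("a", "b"),
  seen = as.POSIXct(c("2020-01-01", "2020-01-02"), tz = "UTC")
)

info <- describe_chunk(chunk, "example.sas7bdat", "example/part-0.parquet")
info$columns[[1]]
```

Whether the first class is the right summary is a design choice; joining all classes with `paste(class(col), collapse = "/")` would keep more detail at the cost of longer type strings.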
191+
192+
### Quarto report template
193+
194+
`use_fastreg_template()` copies both `_targets.R` and
195+
`conversion_report.qmd` into the current working directory. The Quarto
196+
doc reads `chunk_info` via `targets::tar_read()` and produces an HTML or
197+
PDF report for review.
198+
199+
<!-- TODO: What should the default format be? -->
200+
201+
```r
202+
chunk_info <- targets::tar_read(chunk_info)
203+
204+
# Nice overview of the info + schema comparison within registers.
205+
...
206+
```
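
The schema comparison itself is left open above. One possible shape, sketched here under stated assumptions, flags each register as OK when all of its chunks share the same column names and types, and as DIFF otherwise. The example data is made up, and deriving the register name from the parent directory of `output_path` is an assumption, not something the package is known to do.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Made-up chunk info; in the report this would come from targets::tar_read().
chunk_info <- tibble(
  output_path = c(
    "reg_a/part-0.parquet",
    "reg_a/part-1.parquet",
    "reg_b/part-0.parquet"
  ),
  columns = list(
    tibble(name = c("id", "value"), type = c("character", "numeric")),
    tibble(name = c("id", "value"), type = c("character", "character")),
    tibble(name = "id", type = "character")
  )
)

schema_check <- chunk_info |>
  # Assumption: the parent directory of each part file names the register.
  mutate(register = basename(dirname(output_path))) |>
  unnest(columns) |>
  # Collapse each chunk's schema into one comparable string.
  group_by(register, output_path) |>
  summarise(
    schema = paste(name, type, sep = ":", collapse = ", "),
    .groups = "drop"
  ) |>
  # A register is OK only if every chunk has the same schema string.
  group_by(register) |>
  summarise(
    n_chunks = n(),
    status = if_else(n_distinct(schema) == 1, "OK", "DIFF")
  )
```

Because the schema string preserves column order, two chunks with the same columns in a different order would also be reported as DIFF; sorting by `name` before collapsing would relax that.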
207+
208+
The report is added to the targets pipeline as the last target:
209+
210+
```r
211+
tar_target(
212+
name = report,
213+
command = quarto::quarto_render("conversion_report.qmd"),
214+
deployment = "main"
215+
)
216+
```
