Commit 1f60a20

write motivation + add cran installation instruction
1 parent d9c0a2d commit 1f60a20

2 files changed

Lines changed: 90 additions & 49 deletions

README.Rmd

Lines changed: 30 additions & 12 deletions
@@ -20,15 +20,33 @@ knitr::opts_chunk$set(
 [![CRAN status](https://www.r-pkg.org/badges/version/tdarec)](https://CRAN.R-project.org/package=tdarec)
 <!-- badges: end -->

-The goal of {tdarec} is to provide [{recipes}](https://cran.r-project.org/package=recipes)-style preprocessing steps to compute persistent homology (PH) and calculate vectorizations of persistence diagrams (PDs), and to provide [{dials}](https://cran.r-project.org/package=dials)-style hyperparameter tuners to optimize these steps in ML workflows.
+{tdarec} provides

-You can install the development version of tdarec from [GitHub](https://github.com/) with:
+* [{recipes}](https://cran.r-project.org/package=recipes)-style preprocessing steps to compute persistent homology and calculate vectorizations of persistence diagrams and
+* [{dials}](https://cran.r-project.org/package=dials)-style hyperparameter tuners to optimize these steps in machine learning workflows.
+
+The most recent release can be installed from [CRAN](https://cran.r-project.org/package=tdarec):
+
+``` r
+install.packages("tdarec")
+```
+
+You can also install the development version from [GitHub](https://github.com/):

 ``` r
 # install.packages("pak")
 pak::pak("tdaverse/tdarec")
 ```

+## Motivation
+
+The rich theory of persistent homology (PH) has inspired a great volume and diversity of applications to domains beyond mathematics.
+Many, possibly most, publications on this front are curated at the [Database of Original & Non-theoretical Uses of Topology](https://donut.topology.rocks/), which can be searched specifically for statistical inference or machine learning, for example.
+
+Some of these applications are highly specialized, but others require only that a few new topological tools be coupled with conventional statistical designs.
+As researchers implement and test new specialized tools, they should also establish conditions and develop standards for their use.
+The goal of {tdarec} is to make this more standardized work more efficient, transparent, and reproducible, including by providing additional steps and dials as methods mature.
+
 ## Design

 ### Recipe steps
@@ -38,10 +56,10 @@ The current version provides two engines to compute PH (more will be implemented
 * **Vietoris--Rips** filtrations of point clouds (distance matrices or coordinate matrices) using [{ripserr}](https://github.com/tdaverse/ripserr)
 * **cubical** filtrations of rasters (pixelated or voxelated data) using {ripserr}

-Also included are a pre-processing step to introduce **Gaussian blur** to rasters and a post-processing step to select PDs for **specific homological degrees**.
+Also included are a pre-processing step to introduce **Gaussian blur** to rasters and a post-processing step to select persistence diagrams (PDs) for **specific homological degrees**.

-Finally, this version provides steps that deploy the highly efficient **vectorizations** implemented in [{TDAvec}](https://github.com/uislambekov/TDAvec).
-These were written with {Rcpp} specifically for ML applications.
+Finally, this version provides steps that deploy the highly efficient **vectorizations** for PDs implemented in [{TDAvec}](https://github.com/uislambekov/TDAvec).
+These were written with {Rcpp} specifically for machine learning applications.

 ### Tunable parameters

@@ -52,7 +70,7 @@ An implementation is underway.

 ### Data formats and sets

-While the most common {recipes} are designed for structured tabular data, i.e. columns with numeric or categorical entries, almost all data subjected to machine learning with persistent homology has been in forms like point clouds or greyscale images that must be stored in list-columns.
+While the most common {recipes} are designed for structured tabular data, i.e. columns with numeric or categorical entries, almost all data subjected to machine learning with PH has been in forms like point clouds or greyscale images that must be stored in list-columns.
 All {tdarec} examples use data in this form, and the data installed with the package is pre-processed for such use.

 ## Example
@@ -62,10 +80,10 @@ Note also that [{glmnet}](https://cran.r-project.org/package=glmnet) and [{tdaun

 ### Setup

-While not required, we attach Tidyverse and Tidymodels for convenience (with messages suppressed):
+While not required, we attach tidyverse and tidymodels for convenience (with messages suppressed):

 ```{r example packages, message=FALSE}
-# prepare a Tidymodels session and attach {tdarec}
+# prepare a tidymodels session and attach {tdarec}
 library(tidyverse)
 library(tidymodels)
 library(tdarec)
@@ -98,13 +116,13 @@ klein_test <- testing(klein_split)
 klein_folds <- vfold_cv(klein_train, v = 3L)
 ```

-In this example, we adopt a common transformation of persistence diagrams, Euler characteristic curves.
+In this example, we adopt a common transformation of PDs, Euler characteristic curves.
 For their vectorization, we need a scale sequence that spans the birth and death times of any persistent features, and for this we choose a round number larger than the diameters of both point clouds (based on the sampler documentation) as an upper bound.
 Rather than choose _a priori_ to use homology up to degree 0, 1, 2, or 3, we prepare to tune the maximum degree during optimization.

 ### Specifications

-To prevent the model from using the data set column as a predictor, we assign it a new role, which is preserved by the persistent homology step and ignored by the vectorization step (which outputs new predictor columns).
+To prevent the model from using the data set column as a predictor, we assign it a new role, which is preserved by the PH step and ignored by the vectorization step (which outputs new predictor columns).

 ```{r example recipe}
 # specify a pre-processing recipe
@@ -116,7 +134,7 @@ recipe(embedding ~ sample, data = klein_train) |>
 print() -> klein_rec
 ```

-For simplicity, we choose a common model for ML classification, penalized logistic regression.
+For simplicity, we choose a common model for classification, penalized logistic regression.
 We fix the mixture coefficient to use LASSO rather than ridge regression but prepare the penalty parameter for tuning.

 ```{r example model}
@@ -196,7 +214,7 @@ Please note that the tdarec project is released with a [Contributor Code of Cond

 ### Generated code

-Much of the code exposing {TDAvec} tools to Tidymodels is generated by elaborate scripts rather than written manually.
+Much of the code exposing {TDAvec} tools to tidymodels is generated by elaborate scripts rather than written manually.
 While maintenance of these scripts takes effort, it prevents (or at least flags) errors arising from cascading implications of changes to the original functions, and it allows simple and rapid package-wide adjustments. If you see an issue with generated code, please raise an issue to discuss it before submitting a pull request.

 ### Acknowledgments

README.md

Lines changed: 60 additions & 37 deletions
@@ -11,21 +11,47 @@ experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](h
 status](https://www.r-pkg.org/badges/version/tdarec)](https://CRAN.R-project.org/package=tdarec)
 <!-- badges: end -->

-The goal of {tdarec} is to provide
-[{recipes}](https://cran.r-project.org/package=recipes)-style
-preprocessing steps to compute persistent homology (PH) and calculate
-vectorizations of persistence diagrams (PDs), and to provide
-[{dials}](https://cran.r-project.org/package=dials)-style hyperparameter
-tuners to optimize these steps in ML workflows.
+{tdarec} provides

-You can install the development version of tdarec from
-[GitHub](https://github.com/) with:
+- [{recipes}](https://cran.r-project.org/package=recipes)-style
+preprocessing steps to compute persistent homology and calculate
+vectorizations of persistence diagrams and
+- [{dials}](https://cran.r-project.org/package=dials)-style
+hyperparameter tuners to optimize these steps in machine learning
+workflows.
+
+The most recent release can be installed from
+[CRAN](https://cran.r-project.org/package=tdarec):
+
+``` r
+install.packages("tdarec")
+```
+
+You can also install the development version from
+[GitHub](https://github.com/):

 ``` r
 # install.packages("pak")
 pak::pak("tdaverse/tdarec")
 ```

+## Motivation
+
+The rich theory of persistent homology (PH) has inspired a great volume
+and diversity of applications to domains beyond mathematics. Many,
+possibly most, publications on this front are curated at the [Database
+of Original & Non-theoretical Uses of
+Topology](https://donut.topology.rocks/), which can be searched
+specifically for statistical inference or machine learning, for example.
+
+Some of these applications are highly specialized, but others require
+only that a few new topological tools be coupled with conventional
+statistical designs. As researchers implement and test new specialized
+tools, they should also establish conditions and develop standards for
+their use. The goal of {tdarec} is to make this more standardized work
+more efficient, transparent, and reproducible, including by providing
+additional steps and dials as methods mature.
+
 ## Design

 ### Recipe steps
@@ -41,13 +67,13 @@ issue](https://github.com/tdaverse/tdarec/issues/2) for plans):
 {ripserr}

 Also included are a pre-processing step to introduce **Gaussian blur**
-to rasters and a post-processing step to select PDs for **specific
-homological degrees**.
+to rasters and a post-processing step to select persistence diagrams
+(PDs) for **specific homological degrees**.

 Finally, this version provides steps that deploy the highly efficient
-**vectorizations** implemented in
+**vectorizations** for PDs implemented in
 [{TDAvec}](https://github.com/uislambekov/TDAvec). These were written
-with {Rcpp} specifically for ML applications.
+with {Rcpp} specifically for machine learning applications.

 ### Tunable parameters

@@ -63,10 +89,10 @@ is vectorized. An implementation is underway.

 While the most common {recipes} are designed for structured tabular
 data, i.e. columns with numeric or categorical entries, almost all data
-subjected to machine learning with persistent homology has been in forms
-like point clouds or greyscale images that must be stored in
-list-columns. All {tdarec} examples use data in this form, and the data
-installed with the package is pre-processed for such use.
+subjected to machine learning with PH has been in forms like point
+clouds or greyscale images that must be stored in list-columns. All
+{tdarec} examples use data in this form, and the data installed with the
+package is pre-processed for such use.

 ## Example

@@ -79,11 +105,11 @@ installed.

 ### Setup

-While not required, we attach Tidyverse and Tidymodels for convenience
+While not required, we attach tidyverse and tidymodels for convenience
 (with messages suppressed):

 ``` r
-# prepare a Tidymodels session and attach {tdarec}
+# prepare a tidymodels session and attach {tdarec}
 library(tidyverse)
 library(tidymodels)
 library(tdarec)
@@ -132,21 +158,19 @@ klein_test <- testing(klein_split)
 klein_folds <- vfold_cv(klein_train, v = 3L)
 ```

-In this example, we adopt a common transformation of persistence
-diagrams, Euler characteristic curves. For their vectorization, we need
-a scale sequence that spans the birth and death times of any persistent
-features, and for this we choose a round number larger than the
-diameters of both point clouds (based on the sampler documentation) as
-an upper bound. Rather than choose *a priori* to use homology up to
-degree 0, 1, 2, or 3, we prepare to tune the maximum degree during
-optimization.
+In this example, we adopt a common transformation of PDs, Euler
+characteristic curves. For their vectorization, we need a scale sequence
+that spans the birth and death times of any persistent features, and for
+this we choose a round number larger than the diameters of both point
+clouds (based on the sampler documentation) as an upper bound. Rather
+than choose *a priori* to use homology up to degree 0, 1, 2, or 3, we
+prepare to tune the maximum degree during optimization.

 ### Specifications

 To prevent the model from using the data set column as a predictor, we
-assign it a new role, which is preserved by the persistent homology step
-and ignored by the vectorization step (which outputs new predictor
-columns).
+assign it a new role, which is preserved by the PH step and ignored by
+the vectorization step (which outputs new predictor columns).

 ``` r
 # specify a pre-processing recipe
@@ -169,10 +193,9 @@ recipe(embedding ~ sample, data = klein_train) |>
 #> • Euler characteristic curve of: sample
 ```

-For simplicity, we choose a common model for ML classification,
-penalized logistic regression. We fix the mixture coefficient to use
-LASSO rather than ridge regression but prepare the penalty parameter for
-tuning.
+For simplicity, we choose a common model for classification, penalized
+logistic regression. We fix the mixture coefficient to use LASSO rather
+than ridge regression but prepare the penalty parameter for tuning.

 ``` r
 # specify a classification model
@@ -228,9 +251,9 @@ klein_res |>
 select_best(metric = "roc_auc") |>
 print() -> klein_best
 #> # A tibble: 1 × 3
-#> penalty vr_degree .config
-#> <dbl> <int> <chr>
-#> 1 0.0000000001 1 Preprocessor1_Model1
+#> penalty vr_degree .config
+#> <dbl> <int> <chr>
+#> 1 0.0000000001 1 pre1_mod1_post0
 ```

 ### Evaluation
@@ -274,7 +297,7 @@ By contributing to this project, you agree to abide by its terms.

 ### Generated code

-Much of the code exposing {TDAvec} tools to Tidymodels is generated by
+Much of the code exposing {TDAvec} tools to tidymodels is generated by
 elaborate scripts rather than written manually. While maintenance of
 these scripts takes effort, it prevents (or at least flags) errors
 arising from cascading implications of changes to the original

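The list-column layout described in the "Data formats and sets" hunks can be sketched in base R. This is a minimal illustration under stated assumptions, not code from the commit: the `circle()` helper, the `clouds` data frame, and its column names are hypothetical, and no {tdarec} function is called.

``` r
# Hypothetical helper (not part of {tdarec}): sample n points from the unit circle
circle <- function(n) {
  theta <- runif(n, min = 0, max = 2 * pi)
  cbind(x = cos(theta), y = sin(theta))
}

set.seed(1)

# One row per observation; the label is an ordinary (categorical) column
clouds <- data.frame(label = factor(c("sparse", "dense")))

# The point clouds are matrices of different sizes, so each is stored as one
# element of a list-column rather than spread across numeric columns
clouds$sample <- list(circle(20), circle(50))

# Count the points in each stored point cloud
vapply(clouds$sample, nrow, integer(1))
```

Each row keeps its whole point cloud in a single cell, which is the shape the README says the package's example data already has.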