-The goal of {tdarec} is to provide [{recipes}](https://cran.r-project.org/package=recipes)-style preprocessing steps to compute persistent homology (PH) and calculate vectorizations of persistence diagrams (PDs), and to provide [{dials}](https://cran.r-project.org/package=dials)-style hyperparameter tuners to optimize these steps in ML workflows.
+{tdarec} provides

-You can install the development version of tdarec from [GitHub](https://github.com/) with:
+* [{recipes}](https://cran.r-project.org/package=recipes)-style preprocessing steps to compute persistent homology and calculate vectorizations of persistence diagrams and
+* [{dials}](https://cran.r-project.org/package=dials)-style hyperparameter tuners to optimize these steps in machine learning workflows.
+
+The most recent release can be installed from [CRAN](https://cran.r-project.org/package=tdarec):
+
+```r
+install.packages("tdarec")
+```
+
+You can also install the development version from [GitHub](https://github.com/):

 ```r
 # install.packages("pak")
 pak::pak("tdaverse/tdarec")
 ```

+## Motivation
+
+The rich theory of persistent homology (PH) has inspired a great volume and diversity of applications to domains beyond mathematics.
+Many, possibly most, publications on this front are curated at the [Database of Original & Non-theoretical Uses of Topology](https://donut.topology.rocks/), which can be searched specifically for statistical inference or machine learning, for example.
+
+Some of these applications are highly specialized, but others require only that a few new topological tools be coupled with conventional statistical designs.
+As researchers implement and test new specialized tools, they should also establish conditions and develop standards for their use.
+The goal of {tdarec} is to make this more standardized work more efficient, transparent, and reproducible, including by providing additional steps and dials as methods mature.
+
 ## Design

 ### Recipe steps
@@ -38,10 +56,10 @@ The current version provides two engines to compute PH (more will be implemented
 * **Vietoris--Rips** filtrations of point clouds (distance matrices or coordinate matrices) using [{ripserr}](https://github.com/tdaverse/ripserr)
 * **cubical** filtrations of rasters (pixelated or voxelated data) using {ripserr}

-Also included are a pre-processing step to introduce **Gaussian blur** to rasters and a post-processing step to select PDs for **specific homological degrees**.
+Also included are a pre-processing step to introduce **Gaussian blur** to rasters and a post-processing step to select persistence diagrams (PDs) for **specific homological degrees**.

-Finally, this version provides steps that deploy the highly efficient **vectorizations** implemented in [{TDAvec}](https://github.com/uislambekov/TDAvec).
-These were written with {Rcpp} specifically for ML applications.
+Finally, this version provides steps that deploy the highly efficient **vectorizations** for PDs implemented in [{TDAvec}](https://github.com/uislambekov/TDAvec).
+These were written with {Rcpp} specifically for machine learning applications.
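
As a rough illustration of what a vectorization of a PD does, the sketch below evaluates one common choice, the Euler characteristic curve, on a grid of scales in base R. The helper `euler_curve()` and the toy diagram `pd` are hypothetical and written only for this example; they are not part of {tdarec} or {TDAvec}, whose compiled implementations may differ in detail (for instance, in how values are aggregated over the grid).

```r
# Hypothetical helper, not part of {tdarec} or {TDAvec}.
# `pd` is a 3-column matrix: homological degree, birth, death.
# At each scale t, count the features alive at t (birth <= t < death),
# with sign +1 for even degrees and -1 for odd degrees.
euler_curve <- function(pd, scale_seq) {
  vapply(scale_seq, function(t) {
    alive <- pd[, 2] <= t & t < pd[, 3]
    sum((-1)^pd[alive, 1])
  }, numeric(1))
}

# Toy diagram (made up for illustration): two degree-0 features,
# one of which never dies, and one degree-1 feature.
pd <- rbind(
  c(0, 0.0, Inf),
  c(0, 0.0, 0.5),
  c(1, 0.8, 1.2)
)
colnames(pd) <- c("degree", "birth", "death")

# The scale sequence should span all finite birth and death values.
scale_seq <- seq(0, 1.5, by = 0.25)
ecc <- euler_curve(pd, scale_seq)
print(ecc)
```

The result is a fixed-length numeric vector, one entry per scale, which is the kind of output a model can consume as ordinary predictor columns.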

 ### Tunable parameters

@@ -52,7 +70,7 @@ An implementation is underway.

 ### Data formats and sets

-While the most common {recipes} are designed for structured tabular data, i.e. columns with numeric or categorical entries, almost all data subjected to machine learning with persistent homology has been in forms like point clouds or greyscale images that must be stored in list-columns.
+While the most common {recipes} are designed for structured tabular data, i.e. columns with numeric or categorical entries, almost all data subjected to machine learning with PH has been in forms like point clouds or greyscale images that must be stored in list-columns.
 All {tdarec} examples use data in this form, and the data installed with the package is pre-processed for such use.
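
To make the list-column layout concrete, here is a minimal base R sketch. The data frame `toy` and its columns `shape` and `sample` are hypothetical names chosen for this example and do not refer to the data sets installed with the package.

```r
# Two small point clouds, each a coordinate matrix with one row per point.
set.seed(1)
angles <- seq(0, 2 * pi, length.out = 12)
circle <- cbind(x = cos(angles), y = sin(angles))
blob <- matrix(rnorm(20), ncol = 2, dimnames = list(NULL, c("x", "y")))

# `shape` is an ordinary atomic column; `sample` is a list-column that
# holds one matrix per row, so the clouds may differ in size.
toy <- data.frame(shape = c("circle", "blob"))
toy$sample <- list(circle, blob)

# Each entry keeps its own dimensions.
sizes <- vapply(toy$sample, nrow, integer(1))
print(sizes)
```

A tibble would hold the same structure and print it more compactly; base R is used here only to keep the sketch free of dependencies.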

 ## Example
@@ -62,10 +80,10 @@ Note also that [{glmnet}](https://cran.r-project.org/package=glmnet) and [{tdaun

 ### Setup

-While not required, we attach Tidyverse and Tidymodels for convenience (with messages suppressed):
+While not required, we attach tidyverse and tidymodels for convenience (with messages suppressed):

 ```{r example packages, message=FALSE}
-# prepare a Tidymodels session and attach {tdarec}
+# prepare a tidymodels session and attach {tdarec}
-In this example, we adopt a common transformation of persistence diagrams, Euler characteristic curves.
+In this example, we adopt a common transformation of PDs, Euler characteristic curves.
 For their vectorization, we need a scale sequence that spans the birth and death times of any persistent features, and for this we choose a round number larger than the diameters of both point clouds (based on the sampler documentation) as an upper bound.
 Rather than choose _a priori_ to use homology up to degree 0, 1, 2, or 3, we prepare to tune the maximum degree during optimization.

 ### Specifications

-To prevent the model from using the data set column as a predictor, we assign it a new role, which is preserved by the persistent homology step and ignored by the vectorization step (which outputs new predictor columns).
+To prevent the model from using the data set column as a predictor, we assign it a new role, which is preserved by the PH step and ignored by the vectorization step (which outputs new predictor columns).
-For simplicity, we choose a common model for ML classification, penalized logistic regression.
+For simplicity, we choose a common model for classification, penalized logistic regression.
 We fix the mixture coefficient to use LASSO rather than ridge regression but prepare the penalty parameter for tuning.

 ```{r example model}
@@ -196,7 +214,7 @@ Please note that the tdarec project is released with a [Contributor Code of Cond

 ### Generated code

-Much of the code exposing {TDAvec} tools to Tidymodels is generated by elaborate scripts rather than written manually.
+Much of the code exposing {TDAvec} tools to tidymodels is generated by elaborate scripts rather than written manually.
 While maintenance of these scripts takes effort, it prevents (or at least flags) errors arising from cascading implications of changes to the original functions, and it allows simple and rapid package-wide adjustments. If you see an issue with generated code, please raise an issue to discuss it before submitting a pull request.