Skip to content

Commit f357d75

Browse files
Removed dontrun gs examples for CRAN; fixed issues #12, #13, #14
1 parent 316e977 commit f357d75

9 files changed

Lines changed: 71 additions & 128 deletions

File tree

R/grid_search_cv.R

Lines changed: 3 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -192,39 +192,17 @@
192192
#' # 3. A dense Gaussian noise component
193193
#' data <- sim_data()
194194
#' #### -------Tiny grid search-------####
195-
#' # Here is a tiny grid search just to test the function.
195+
#' # Here is a tiny grid search just to test the function quickly.
196196
#' # In practice we would recommend a larger grid search.
197+
#' # For examples of larger searches, see the vignettes.
197198
#' gs <- grid_search_cv(
198199
#' data$D,
199200
#' rrmc,
200-
#' data.frame("eta" = 0.3),
201+
#' data.frame("eta" = 0.35),
201202
#' r = 3,
202203
#' num_runs = 2
203204
#' )
204205
#' gs$summary_stats
205-
#' #### -------Small grid search-------####
206-
#' # Normally we would conduct grid search to tune eta. But, to keep the example
207-
#' # short, we will just use best parameters from the below grid search example:
208-
#' \dontrun{
209-
#' eta_0 <- get_pcp_defaults(data$D)$eta
210-
#' eta_grid <- data.frame("eta" = sort(c(0.1 * eta_0, eta_0 * seq(1, 10, 2))), "r" = 7)
211-
#' gs <- grid_search_cv(data$D, rrmc, eta_grid)
212-
#' gs$summary_stats
213-
#' }
214-
#' # The gs found the best rank to be 3, and the best eta to be 0.3 or 0.4, so
215-
#' # we will split the difference and use an eta of 0.35
216-
#' pcp_model <- rrmc(data$D, r = 3, eta = 0.35)
217-
#' data.frame(
218-
#' "Observed_relative_error" = norm(data$L - data$D, "F") / norm(data$L, "F"),
219-
#' "PCA_error" = norm(data$L - proj_rank_r(data$D, r = 3), "F") / norm(data$L, "F"),
220-
#' "PCP_L_error" = norm(data$L - pcp_model$L, "F") / norm(data$L, "F"),
221-
#' "PCP_S_error" = norm(data$S - pcp_model$S, "F") / norm(data$S, "F")
222-
#' )
223-
#' # Results:
224-
#' # The grid search correctly found the rank (3) of the ground truth L matrix!
225-
#' # PCP outperformed PCA in it's recovery of the L matrix (even though we let
226-
#' # PCA "cheat" by telling PCA it was looking for a rank 3 solution)!
227-
#' # PCP successfully isolated the outlying event in S!
228206
#' @export
229207
#' @importFrom magrittr %>%
230208
#' @importFrom rlang .data

R/root_pcp.R

Lines changed: 2 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -175,20 +175,8 @@
175175
#' # 2. A ground truth sparse component S w/outliers along the diagonal; and
176176
#' # 3. A dense Gaussian noise component
177177
#' data <- sim_data(r = 2, sigma = 0.1)
178-
#' # Normally we would conduct grid search to tune lambda and mu. But, to keep
179-
#' # the example short, we will just use best parameters found in the below grid
180-
#' # search example:
181-
#' \dontrun{
182-
#' lambda_0 <- get_pcp_defaults(data$D)$lambda
183-
#' mu_0 <- get_pcp_defaults(data$D)$mu
184-
#' lambdas <- lambda_0 + seq(-0.05, 0.2, 0.025)
185-
#' mus <- mu_0 + seq(-1, 1, 0.3)
186-
#' params <- expand.grid(lambdas, mus)
187-
#' names(params) <- c("lambda", "mu")
188-
#' gs <- grid_search_cv(data$D, root_pcp, params)
189-
#' dplyr::arrange(gs$summary_stats, rel_err)
190-
#' }
191-
#' # The gs found the best parameters to be lambda = 0.225 and mu = 3.04
178+
#' # Best practice is to conduct a grid search with grid_search_cv() function,
179+
#' # but we skip that here for brevity.
192180
#' pcp_model <- root_pcp(data$D, lambda = 0.225, mu = 3.04)
193181
#' data.frame(
194182
#' "Estimated_L_rank" = matrix_rank(pcp_model$L, 5e-2),
@@ -197,11 +185,6 @@
197185
#' "PCP_L_error" = norm(data$L - pcp_model$L, "F") / norm(data$L, "F"),
198186
#' "PCP_S_error" = norm(data$S - pcp_model$S, "F") / norm(data$S, "F")
199187
#' )
200-
#' # Results:
201-
#' # PCP found a rank 2 solution!
202-
#' # PCP outperformed PCA in it's recovery of the L matrix (even though we let
203-
#' # PCA "cheat" by telling PCA it was looking for a rank 2 solution)!
204-
#' # PCP successfully isolated the outlying events in S!
205188
#' @references Zhang, Junhui, Jingkai Yan, and John Wright.
206189
#' "Square root principal component pursuit: tuning-free noisy robust matrix
207190
#' recovery." Advances in Neural Information Processing Systems 34 (2021):

R/rrmc.R

Lines changed: 2 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -164,28 +164,15 @@
164164
#' # 2. A ground truth sparse component S w/outliers along the diagonal; and
165165
#' # 3. A dense Gaussian noise component
166166
#' data <- sim_data()
167-
#' # Normally we would conduct grid search to tune eta. But, to keep the example
168-
#' # short, we will just use best parameters from the below grid search example:
169-
#' \dontrun{
170-
#' eta_0 <- get_pcp_defaults(data$D)$eta
171-
#' eta_grid <- data.frame("eta" = sort(c(0.1 * eta_0, eta_0 * seq(1, 10, 2))), "r" = 7)
172-
#' gs <- grid_search_cv(data$D, rrmc, eta_grid)
173-
#' dplyr::arrange(gs$summary_stats, rel_err)
174-
#' }
175-
#' # The gs found the best rank to be 3, and the best eta to be 0.3 or 0.4, so
176-
#' # we will split the difference and use an eta of 0.35
167+
#' # Best practice is to conduct a grid search with grid_search_cv() function,
168+
#' # but we skip that here for brevity.
177169
#' pcp_model <- rrmc(data$D, r = 3, eta = 0.35)
178170
#' data.frame(
179171
#' "Observed_relative_error" = norm(data$L - data$D, "F") / norm(data$L, "F"),
180172
#' "PCA_error" = norm(data$L - proj_rank_r(data$D, r = 3), "F") / norm(data$L, "F"),
181173
#' "PCP_L_error" = norm(data$L - pcp_model$L, "F") / norm(data$L, "F"),
182174
#' "PCP_S_error" = norm(data$S - pcp_model$S, "F") / norm(data$S, "F")
183175
#' )
184-
#' # Results:
185-
#' # The grid search correctly found the rank (3) of the ground truth L matrix!
186-
#' # PCP outperformed PCA in it's recovery of the L matrix (even though we let
187-
#' # PCA "cheat" by telling PCA it was looking for a rank 3 solution)!
188-
#' # PCP successfully isolated the outlying event in S!
189176
#' @references Cherapanamjeri, Yeshwanth, Kartik Gupta, and Prateek Jain.
190177
#' "Nearly optimal robust matrix completion."
191178
#' International Conference on Machine Learning. PMLR, 2017. [available

README.Rmd

Lines changed: 19 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -125,6 +125,11 @@ Special thanks to Sophie Calhoun for designing `pcpr`'s logo!
125125

126126
## Usage
127127
```{r usage}
128+
# In the below example, we simulate a simple mixtures model and run PCP,
129+
# comparing it's performance to that of PCA. For an in depth example with
130+
# simulated data, see vignette("pcp-quickstart"). For more realistic
131+
# PCP usage, check out vignette("pcp-applied").
132+
128133
# Simulate an environmental mixture
129134
data <- sim_data(
130135
n = 100, p = 10, r = 3,
@@ -138,12 +143,22 @@ Z_0 <- data$Z # Ground truth noise matrix
138143
139144
# Simulate a limit of detection for each chemical in mixture
140145
lod_info <- sim_lod(D, q = 0.1)
146+
D_lod <- lod_info$D_tilde
141147
lod <- lod_info$lod
142148
143149
# Simulate missing observations
144-
corrupted_data <- sim_na(D, perc = 0.05)
150+
corrupted_data <- sim_na(D_lod, perc = 0.05)
145151
D_tilde <- corrupted_data$D_tilde
146152
153+
# Finish simulating LOD by imputing values < LOD with LOD/sqrt(2)
154+
lod_root2 <- matrix(
155+
lod / sqrt(2),
156+
nrow = nrow(D_tilde),
157+
ncol = ncol(D_tilde), byrow = TRUE
158+
)
159+
lod_idxs <- which(lod_info$tilde_mask == 1)
160+
D_tilde[lod_idxs] <- lod_root2[lod_idxs]
161+
147162
# Run grid search to obtain optimal r, eta parameters
148163
# (Not shown here to save space, see vignette("pcp-quickstart")
149164
# for full example which obtains r = 3, eta = 0.224)
@@ -153,6 +168,9 @@ eta_star <- 0.224
153168
# Run non-convex PCP to estimate L, S from D_tilde
154169
pcp_model <- rrmc(D_tilde, r = r_star, eta = eta_star, LOD = lod)
155170
171+
# Clean up sparse matrix
172+
pcp_model$S <- hard_threshold(pcp_model$S, thresh = 0.4)
173+
156174
# Benchmark with PCA's attempt at recovering L
157175
D_imputed <- impute_matrix(D_tilde, apply(D_tilde, 2, mean, na.rm = TRUE))
158176
L_pca <- proj_rank_r(D_imputed, r = r_star)

README.md

Lines changed: 21 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -136,6 +136,11 @@ Special thanks to Sophie Calhoun for designing `pcpr`’s logo!
136136
## Usage
137137

138138
``` r
139+
# In the below example, we simulate a simple mixtures model and run PCP,
140+
# comparing it's performance to that of PCA. For an in depth example with
141+
# simulated data, see vignette("pcp-quickstart"). For more realistic
142+
# PCP usage, check out vignette("pcp-applied").
143+
139144
# Simulate an environmental mixture
140145
data <- sim_data(
141146
n = 100, p = 10, r = 3,
@@ -149,12 +154,22 @@ Z_0 <- data$Z # Ground truth noise matrix
149154

150155
# Simulate a limit of detection for each chemical in mixture
151156
lod_info <- sim_lod(D, q = 0.1)
157+
D_lod <- lod_info$D_tilde
152158
lod <- lod_info$lod
153159

154160
# Simulate missing observations
155-
corrupted_data <- sim_na(D, perc = 0.05)
161+
corrupted_data <- sim_na(D_lod, perc = 0.05)
156162
D_tilde <- corrupted_data$D_tilde
157163

164+
# Finish simulating LOD by imputing values < LOD with LOD/sqrt(2)
165+
lod_root2 <- matrix(
166+
lod / sqrt(2),
167+
nrow = nrow(D_tilde),
168+
ncol = ncol(D_tilde), byrow = TRUE
169+
)
170+
lod_idxs <- which(lod_info$tilde_mask == 1)
171+
D_tilde[lod_idxs] <- lod_root2[lod_idxs]
172+
158173
# Run grid search to obtain optimal r, eta parameters
159174
# (Not shown here to save space, see vignette("pcp-quickstart")
160175
# for full example which obtains r = 3, eta = 0.224)
@@ -164,6 +179,9 @@ eta_star <- 0.224
164179
# Run non-convex PCP to estimate L, S from D_tilde
165180
pcp_model <- rrmc(D_tilde, r = r_star, eta = eta_star, LOD = lod)
166181

182+
# Clean up sparse matrix
183+
pcp_model$S <- hard_threshold(pcp_model$S, thresh = 0.4)
184+
167185
# Benchmark with PCA's attempt at recovering L
168186
D_imputed <- impute_matrix(D_tilde, apply(D_tilde, 2, mean, na.rm = TRUE))
169187
L_pca <- proj_rank_r(D_imputed, r = r_star)
@@ -178,9 +196,9 @@ data.frame(
178196
"PCP_S_sparsity" = sparsity(pcp_model$S)
179197
)
180198
#> Obs_rel_err PCA_L_rel_err PCP_L_rel_err PCP_S_rel_err PCP_L_rank
181-
#> 1 0.1496416 0.08674625 0.05215485 0.3600219 3
199+
#> 1 0.1440249 0.08096932 0.05847706 0.232115 3
182200
#> PCP_S_sparsity
183-
#> 1 0.964
201+
#> 1 0.989
184202
```
185203

186204
## References

man/grid_search_cv.Rd

Lines changed: 3 additions & 25 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

man/root_pcp.Rd

Lines changed: 2 additions & 19 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

man/rrmc.Rd

Lines changed: 2 additions & 15 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

vignettes/pcp-quickstart.Rmd

Lines changed: 17 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -126,7 +126,8 @@ lod_root2 <- matrix(
126126
nrow = nrow(D_tilde),
127127
ncol = ncol(D_tilde), byrow = TRUE
128128
)
129-
D_tilde[which(lod_info$tilde_mask == 1)] <- lod_root2[which(lod_info$tilde_mask == 1)]
129+
lod_idxs <- which(lod_info$tilde_mask == 1)
130+
D_tilde[lod_idxs] <- lod_root2[lod_idxs]
130131
plot_matrix(D_tilde)
131132
```
132133

@@ -154,9 +155,11 @@ indicative of complex underlying patterns and a relatively large degree of noise
154155
Most EH data can be described this way. `root_pcp()` is best for data characterized
155156
by rapidly decaying singular values, indicative of very well-defined latent patterns.
156157

157-
For a simple example like the above, both PCP models are perfectly suitable.
158+
The singular values plotted above decay quickly from the first to the second, but very gradually
159+
from the second onward. For this simple simulated dataset, both PCP models are perfectly suitable.
158160
We will use `rrmc()`, as this is the model environmental health researchers will
159-
likely employ most frequently.
161+
likely employ most frequently. The `vignette("pcp-applied")` contains an exemplary
162+
mixtures matrix singular value plot with slowly decaying singular values.
160163

161164
## Grid search for parameter tuning
162165

@@ -188,7 +191,7 @@ passed `etas` as the `grid` argument to search and sent $r = 5$ as a constant
188191
parameter common to all models in the search. Since `length(etas) = 6` and $r = 5$, we
189192
searched through 30 different PCP models. The `num_runs` argument determines how many (random)
190193
tests should be performed for each unique model setting. By default, `num_runs = 100`,
191-
so our grid search tuned `r` and `eta` by measuring the performance of 300 different PCP models.
194+
so our grid search tuned `r` and `eta` by measuring the performance of 3000 different PCP models.
192195
We passed the simulated `lod` vector as another constant to the grid search,
193196
equipping each `rrmc()` run with the same LOD information.
194197

@@ -205,7 +208,13 @@ gs$summary_stats
205208
Inspecting the `summary_stats` table from the output grid search provides the mean-aggregated
206209
statistics for each of the 30 distinct parameter settings we tested.
207210
The grid search correctly identified the rank `r r_star` solution as the best
208-
(lowest relative error rate). The corresponding `eta` = `r eta_star`.
211+
(lowest relative error `rel_err` rate). The corresponding `eta` = `r eta_star`. The top three parameter
212+
settings also seem to have reasonable `S_sparsity` levels as well (all are above `0.95`). The next three
213+
parameter settings seem to under-regularize the sparse `S` matrix by quite a bit, as 80% of entries are non-zero.
214+
We will take the top parameters identified by the grid search in this instance. Had the very top parameters
215+
yielded a sparsity of e.g. `0.7`, we likely then would have preferred the second set of parameters with sparisities in the
216+
`0.9`s. This decision would have been grounded in prior assumptions about the amount of outliers to expect in the mixtuere.
217+
For more on the interpreation of grid search results, consult the documentation for the `grid_search_cv()` function.
209218

210219
## Running PCP
211220

@@ -286,6 +295,8 @@ PCP's sparse matrix estimate was only off from the ground truth `S_0` by
286295

287296
We can now pair our estimated `L` matrix with any matrix factorization method of our
288297
choice (e.g. PCA, factor analysis, or non-negative matrix factorization) to extract
289-
the latent chemical exposure patterns. These patterns, along with the isolated outlying
298+
the latent chemical exposure patterns (an example of what this looks like is
299+
in `vignette("pcp-applied")`, where non-negative matrix factorization is used to extract
300+
patterns from PCP's `L` matrix). These patterns, along with the isolated outlying
290301
exposure events in `S`, can then be analyzed with any outcomes of interest in
291302
downstream epidemiological analyses.

0 commit comments

Comments
 (0)