Removed dontrun gs examples for CRAN; fixed issues #12, #13, #14

lawrence-chillrud · lawrence-chillrud · commit f357d7562254 · 2025-03-26T13:35:46.000-05:00
diff --git a/R/grid_search_cv.R b/R/grid_search_cv.R
@@ -192,39 +192,17 @@
 #' # 3. A dense Gaussian noise component
 #' data <- sim_data()
 #' #### -------Tiny grid search-------####
-#' # Here is a tiny grid search just to test the function.
+#' # Here is a tiny grid search just to test the function quickly.
 #' # In practice we would recommend a larger grid search.
+#' # For examples of larger searches, see the vignettes.
 #' gs <- grid_search_cv(
 #'   data$D,
 #'   rrmc,
-#'   data.frame("eta" = 0.3),
+#'   data.frame("eta" = 0.35),
 #'   r = 3,
 #'   num_runs = 2
 #' )
 #' gs$summary_stats
-#' #### -------Small grid search-------####
-#' # Normally we would conduct grid search to tune eta. But, to keep the example
-#' # short, we will just use best parameters from the below grid search example:
-#' \dontrun{
-#' eta_0 <- get_pcp_defaults(data$D)$eta
-#' eta_grid <- data.frame("eta" = sort(c(0.1 * eta_0, eta_0 * seq(1, 10, 2))), "r" = 7)
-#' gs <- grid_search_cv(data$D, rrmc, eta_grid)
-#' gs$summary_stats
-#' }
-#' # The gs found the best rank to be 3, and the best eta to be 0.3 or 0.4, so
-#' # we will split the difference and use an eta of 0.35
-#' pcp_model <- rrmc(data$D, r = 3, eta = 0.35)
-#' data.frame(
-#'   "Observed_relative_error" = norm(data$L - data$D, "F") / norm(data$L, "F"),
-#'   "PCA_error" = norm(data$L - proj_rank_r(data$D, r = 3), "F") / norm(data$L, "F"),
-#'   "PCP_L_error" = norm(data$L - pcp_model$L, "F") / norm(data$L, "F"),
-#'   "PCP_S_error" = norm(data$S - pcp_model$S, "F") / norm(data$S, "F")
-#' )
-#' # Results:
-#' # The grid search correctly found the rank (3) of the ground truth L matrix!
-#' # PCP outperformed PCA in it's recovery of the L matrix (even though we let
-#' # PCA "cheat" by telling PCA it was looking for a rank 3 solution)!
-#' # PCP successfully isolated the outlying event in S!
 #' @export
 #' @importFrom magrittr %>%
 #' @importFrom rlang .data
diff --git a/R/root_pcp.R b/R/root_pcp.R
@@ -175,20 +175,8 @@
 #' # 2. A ground truth sparse component S w/outliers along the diagonal; and
 #' # 3. A dense Gaussian noise component
 #' data <- sim_data(r = 2, sigma = 0.1)
-#' # Normally we would conduct grid search to tune lambda and mu. But, to keep
-#' # the example short, we will just use best parameters found in the below grid
-#' # search example:
-#' \dontrun{
-#' lambda_0 <- get_pcp_defaults(data$D)$lambda
-#' mu_0 <- get_pcp_defaults(data$D)$mu
-#' lambdas <- lambda_0 + seq(-0.05, 0.2, 0.025)
-#' mus <- mu_0 + seq(-1, 1, 0.3)
-#' params <- expand.grid(lambdas, mus)
-#' names(params) <- c("lambda", "mu")
-#' gs <- grid_search_cv(data$D, root_pcp, params)
-#' dplyr::arrange(gs$summary_stats, rel_err)
-#' }
-#' # The gs found the best parameters to be lambda = 0.225 and mu = 3.04
+#' # Best practice is to conduct a grid search with grid_search_cv(),
+#' # but we skip that here for brevity.
 #' pcp_model <- root_pcp(data$D, lambda = 0.225, mu = 3.04)
 #' data.frame(
 #'   "Estimated_L_rank" = matrix_rank(pcp_model$L, 5e-2),
@@ -197,11 +185,6 @@
 #'   "PCP_L_error" = norm(data$L - pcp_model$L, "F") / norm(data$L, "F"),
 #'   "PCP_S_error" = norm(data$S - pcp_model$S, "F") / norm(data$S, "F")
 #' )
-#' # Results:
-#' # PCP found a rank 2 solution!
-#' # PCP outperformed PCA in it's recovery of the L matrix (even though we let
-#' # PCA "cheat" by telling PCA it was looking for a rank 2 solution)!
-#' # PCP successfully isolated the outlying events in S!
 #' @references Zhang, Junhui, Jingkai Yan, and John Wright.
 #'   "Square root principal component pursuit: tuning-free noisy robust matrix
 #'   recovery." Advances in Neural Information Processing Systems 34 (2021):
diff --git a/R/rrmc.R b/R/rrmc.R
@@ -164,28 +164,15 @@
 #' # 2. A ground truth sparse component S w/outliers along the diagonal; and
 #' # 3. A dense Gaussian noise component
 #' data <- sim_data()
-#' # Normally we would conduct grid search to tune eta. But, to keep the example
-#' # short, we will just use best parameters from the below grid search example:
-#' \dontrun{
-#' eta_0 <- get_pcp_defaults(data$D)$eta
-#' eta_grid <- data.frame("eta" = sort(c(0.1 * eta_0, eta_0 * seq(1, 10, 2))), "r" = 7)
-#' gs <- grid_search_cv(data$D, rrmc, eta_grid)
-#' dplyr::arrange(gs$summary_stats, rel_err)
-#' }
-#' # The gs found the best rank to be 3, and the best eta to be 0.3 or 0.4, so
-#' # we will split the difference and use an eta of 0.35
+#' # Best practice is to conduct a grid search with grid_search_cv(),
+#' # but we skip that here for brevity.
 #' pcp_model <- rrmc(data$D, r = 3, eta = 0.35)
 #' data.frame(
 #'   "Observed_relative_error" = norm(data$L - data$D, "F") / norm(data$L, "F"),
 #'   "PCA_error" = norm(data$L - proj_rank_r(data$D, r = 3), "F") / norm(data$L, "F"),
 #'   "PCP_L_error" = norm(data$L - pcp_model$L, "F") / norm(data$L, "F"),
 #'   "PCP_S_error" = norm(data$S - pcp_model$S, "F") / norm(data$S, "F")
 #' )
-#' # Results:
-#' # The grid search correctly found the rank (3) of the ground truth L matrix!
-#' # PCP outperformed PCA in it's recovery of the L matrix (even though we let
-#' # PCA "cheat" by telling PCA it was looking for a rank 3 solution)!
-#' # PCP successfully isolated the outlying event in S!
 #' @references Cherapanamjeri, Yeshwanth, Kartik Gupta, and Prateek Jain.
 #'   "Nearly optimal robust matrix completion."
 #'   International Conference on Machine Learning. PMLR, 2017. [available
diff --git a/README.Rmd b/README.Rmd
@@ -125,6 +125,11 @@ Special thanks to Sophie Calhoun for designing `pcpr`'s logo!
 
 ## Usage
 ```{r usage}
+# In the example below, we simulate a simple mixtures model and run PCP,
+# comparing its performance to that of PCA. For an in-depth example with
+# simulated data, see vignette("pcp-quickstart"). For more realistic
+# PCP usage, check out vignette("pcp-applied").
+
 # Simulate an environmental mixture
 data <- sim_data(
   n = 100, p = 10, r = 3,
@@ -138,12 +143,22 @@ Z_0 <- data$Z # Ground truth noise matrix
 
 # Simulate a limit of detection for each chemical in mixture
 lod_info <- sim_lod(D, q = 0.1)
+D_lod <- lod_info$D_tilde
 lod <- lod_info$lod
 
 # Simulate missing observations
-corrupted_data <- sim_na(D, perc = 0.05)
+corrupted_data <- sim_na(D_lod, perc = 0.05)
 D_tilde <- corrupted_data$D_tilde
 
+# Finish simulating LOD by imputing values < LOD with LOD/sqrt(2)
+lod_root2 <- matrix(
+  lod / sqrt(2),
+  nrow = nrow(D_tilde),
+  ncol = ncol(D_tilde), byrow = TRUE
+)
+lod_idxs <- which(lod_info$tilde_mask == 1)
+D_tilde[lod_idxs] <- lod_root2[lod_idxs]
+
 # Run grid search to obtain optimal r, eta parameters
 # (Not shown here to save space, see vignette("pcp-quickstart")
 # for full example which obtains r = 3, eta = 0.224)
@@ -153,6 +168,9 @@ eta_star <- 0.224
 # Run non-convex PCP to estimate L, S from D_tilde
 pcp_model <- rrmc(D_tilde, r = r_star, eta = eta_star, LOD = lod)
 
+# Clean up sparse matrix
+pcp_model$S <- hard_threshold(pcp_model$S, thresh = 0.4)
+
 # Benchmark with PCA's attempt at recovering L
 D_imputed <- impute_matrix(D_tilde, apply(D_tilde, 2, mean, na.rm = TRUE))
 L_pca <- proj_rank_r(D_imputed, r = r_star)
diff --git a/README.md b/README.md
@@ -136,6 +136,11 @@ Special thanks to Sophie Calhoun for designing `pcpr`’s logo!
 ## Usage
 
 ``` r
+# In the example below, we simulate a simple mixtures model and run PCP,
+# comparing its performance to that of PCA. For an in-depth example with
+# simulated data, see vignette("pcp-quickstart"). For more realistic
+# PCP usage, check out vignette("pcp-applied").
+
 # Simulate an environmental mixture
 data <- sim_data(
   n = 100, p = 10, r = 3,
@@ -149,12 +154,22 @@ Z_0 <- data$Z # Ground truth noise matrix
 
 # Simulate a limit of detection for each chemical in mixture
 lod_info <- sim_lod(D, q = 0.1)
+D_lod <- lod_info$D_tilde
 lod <- lod_info$lod
 
 # Simulate missing observations
-corrupted_data <- sim_na(D, perc = 0.05)
+corrupted_data <- sim_na(D_lod, perc = 0.05)
 D_tilde <- corrupted_data$D_tilde
 
+# Finish simulating LOD by imputing values < LOD with LOD/sqrt(2)
+lod_root2 <- matrix(
+  lod / sqrt(2),
+  nrow = nrow(D_tilde),
+  ncol = ncol(D_tilde), byrow = TRUE
+)
+lod_idxs <- which(lod_info$tilde_mask == 1)
+D_tilde[lod_idxs] <- lod_root2[lod_idxs]
+
 # Run grid search to obtain optimal r, eta parameters
 # (Not shown here to save space, see vignette("pcp-quickstart")
 # for full example which obtains r = 3, eta = 0.224)
@@ -164,6 +179,9 @@ eta_star <- 0.224
 # Run non-convex PCP to estimate L, S from D_tilde
 pcp_model <- rrmc(D_tilde, r = r_star, eta = eta_star, LOD = lod)
 
+# Clean up sparse matrix
+pcp_model$S <- hard_threshold(pcp_model$S, thresh = 0.4)
+
 # Benchmark with PCA's attempt at recovering L
 D_imputed <- impute_matrix(D_tilde, apply(D_tilde, 2, mean, na.rm = TRUE))
 L_pca <- proj_rank_r(D_imputed, r = r_star)
@@ -178,9 +196,9 @@ data.frame(
   "PCP_S_sparsity" = sparsity(pcp_model$S)
 )
 #>   Obs_rel_err PCA_L_rel_err PCP_L_rel_err PCP_S_rel_err PCP_L_rank
-#> 1   0.1496416    0.08674625    0.05215485     0.3600219          3
+#> 1   0.1440249    0.08096932    0.05847706      0.232115          3
 #>   PCP_S_sparsity
-#> 1          0.964
+#> 1          0.989
 ```
 
 ## References
diff --git a/man/grid_search_cv.Rd b/man/grid_search_cv.Rd
diff --git a/man/root_pcp.Rd b/man/root_pcp.Rd
diff --git a/man/rrmc.Rd b/man/rrmc.Rd
diff --git a/vignettes/pcp-quickstart.Rmd b/vignettes/pcp-quickstart.Rmd
@@ -126,7 +126,8 @@ lod_root2 <- matrix(
   nrow = nrow(D_tilde),
   ncol = ncol(D_tilde), byrow = TRUE
 )
-D_tilde[which(lod_info$tilde_mask == 1)] <- lod_root2[which(lod_info$tilde_mask == 1)]
+lod_idxs <- which(lod_info$tilde_mask == 1)
+D_tilde[lod_idxs] <- lod_root2[lod_idxs]
 plot_matrix(D_tilde)
 ```
 
@@ -154,9 +155,11 @@ indicative of complex underlying patterns and a relatively large degree of noise
 Most EH data can be described this way. `root_pcp()` is best for data characterized
 by rapidly decaying singular values, indicative of very well-defined latent patterns.
 
-For a simple example like the above, both PCP models are perfectly suitable. 
+The singular values plotted above decay quickly from the first to the second, but very gradually
+from the second onward. For this simple simulated dataset, both PCP models are perfectly suitable.
 We will use `rrmc()`, as this is the model environmental health researchers will
-likely employ most frequently.
+likely employ most frequently. See `vignette("pcp-applied")` for a singular value
+plot of a mixtures matrix whose singular values decay slowly.
 
 ## Grid search for parameter tuning
 
@@ -188,7 +191,7 @@ passed `etas` as the `grid` argument to search and sent $r = 5$ as a constant
 parameter common to all models in the search. Since `length(etas) = 6` and $r = 5$, we
 searched through 30 different PCP models. The `num_runs` argument determines how many (random)
 tests should be performed for each unique model setting. By default, `num_runs = 100`,
-so our grid search tuned `r` and `eta` by measuring the performance of 300 different PCP models.
+so our grid search tuned `r` and `eta` by measuring the performance of 3000 different PCP models.
 We passed the simulated `lod` vector as another constant to the grid search,
 equipping each `rrmc()` run with the same LOD information.
 
@@ -205,7 +208,13 @@ gs$summary_stats
 Inspecting the `summary_stats` table from the output grid search provides the mean-aggregated
 statistics for each of the 30 distinct parameter settings we tested.
 The grid search correctly identified the rank `r r_star` solution as the best
-(lowest relative error rate). The corresponding `eta` = `r eta_star`.
+(lowest relative error, `rel_err`). The corresponding `eta` = `r eta_star`. The top three parameter
+settings also have reasonable `S_sparsity` levels (all are above `0.95`). The next three
+parameter settings seem to under-regularize the sparse `S` matrix by quite a bit, as 80% of entries are non-zero.
+We will take the top parameters identified by the grid search in this instance. Had the very top parameters
+yielded a sparsity of, e.g., `0.7`, we would likely have preferred the second set of parameters, with sparsities in the
+`0.9`s. This decision would have been grounded in prior assumptions about the number of outliers to expect in the mixture.
+For more on the interpretation of grid search results, consult the documentation for the `grid_search_cv()` function.
 
 ## Running PCP
 
@@ -286,6 +295,8 @@ PCP's sparse matrix estimate was only off from the ground truth `S_0` by
 
 We can now pair our estimated `L` matrix with any matrix factorization method of our
 choice (e.g. PCA, factor analysis, or non-negative matrix factorization) to extract
-the latent chemical exposure patterns. These patterns, along with the isolated outlying
+the latent chemical exposure patterns (an example of what this looks like is
+in `vignette("pcp-applied")`, where non-negative matrix factorization is used to extract
+patterns from PCP's `L` matrix). These patterns, along with the isolated outlying
 exposure events in `S`, can then be analyzed with any outcomes of interest in
 downstream epidemiological analyses.
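
Note on the LOD imputation hunks above (README.Rmd, README.md, vignettes/pcp-quickstart.Rmd): the step depends on `byrow = TRUE` lining each column's LOD up with that column, and on `which()` returning column-major linear indices that are valid for both matrices. The base-R sketch below reproduces the idea without calling `pcpr`. The toy matrix, the 10th-percentile LOD, and the name `below_mask` are illustrative assumptions standing in for the output of `sim_lod()`; this is not the package's implementation.

```r
# Base-R sketch of the "impute values < LOD with LOD/sqrt(2)" step.
# All data here are hypothetical; no pcpr functions are used.
set.seed(1)
n <- 6
p <- 3
D <- matrix(abs(rnorm(n * p)), nrow = n, ncol = p)

# Assumed per-column limit of detection: each column's 10th percentile
lod <- apply(D, 2, quantile, probs = 0.1)

# 1 where an entry falls below its own column's LOD, 0 elsewhere
below_mask <- 1 * sweep(D, 2, lod, "<")

# Matrix of LOD/sqrt(2) values; byrow = TRUE places lod[j] / sqrt(2)
# in every row of column j
lod_root2 <- matrix(lod / sqrt(2), nrow = n, ncol = p, byrow = TRUE)

# Linear indices of the masked entries, reused for both matrices
lod_idxs <- which(below_mask == 1)
D_imputed <- D
D_imputed[lod_idxs] <- lod_root2[lod_idxs]
```

Computing `lod_idxs` once, as the commit does, avoids evaluating the same `which()` call twice and keeps the assignment line short.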