kuenm2/vignettes/prepare_data.Rmd at 4dd5971b391847059b8a64831c2aab5f36a23e55 · wevertonbio/kuenm2 · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
---
title: "2. Prepare Data for Model Calibration"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{2. Prepare Data for Model Calibration}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 6,
  dpi=100,
  out.width = "95%"
)
```

# Summary

- [Description](#description)
- [Getting ready](#getting-ready)
- [Preparing data](#preparing-data)
  - [Example data](#example-data)
  - [First steps in preparing data](#first-steps-in-preparing-data)
  - [Using pre-processed data](#using-pre-processed-data)
- [Exploring prepared data](#exploring-prepared-data)
  - [Comparative histograms](#comparative-histograms)
  - [Distribution of data for models](#distribution-of-data-for-models)
  - [Similarity assessment in partitions](#similarity-assessment-in-partitions)
- [More options to prepare data](#more-options-to-prepare-data)
  - [Custom formulas](#custom-formulas)
  - [Using a bias file](#using-a-bias-file)
  - [PCA for variables](#pca-for-variables)
    - [Internal PCA](#internal-pca)
    - [External PCA](#external-pca)
  - [Using external data partitions](#using-external-data-partitions)
    - [ENMeval partitions](#enmeval-partitions)
    - [flexsdm partitions](#flexsdm-partitions)
- [Saving prepared data](#saving-prepared-data)

<hr>

# Description

Before starting the ENM process, data must be formatted in a specific structure required by functions in `kuenm2`. This vignette guides users through the steps necessary to prepare occurrence data and environmental predictors using built-in tools. It covers the use of the main functions, `prepare_data()` and `prepare_user_data()`, to generate standardized objects that are essential for model calibration. The guide also demonstrates options to compute principal components from variables (PCA), incorporating sampling bias, integrating data partitioning schemes from external methods, exploring prepared data, and saving the data object for later use.

<br>

# Getting ready

If `kuenm2`  has not been installed yet, please do so. See the [Main guide](index.html) for installation instructions. See also the [basic data cleaning guide](basic_data_cleaning.html) for some data cleaning steps.

Use the following lines of code to load `kuenm2`  and any other required packages, and define a working directory (if needed). In general, setting a working directory in R is considered good practice, as it provides better control over where files are read from or saved to. If users are not working within an R project, we recommend setting a working directory, since at least one file will be saved at later stages of this guide.

Note: functions from other packages (i.e., not from base R or `kuenm2` ) used in this guide will be displayed as `package::function()`.

```{r, results='hide', message=FALSE, warning=FALSE}
# Load packages
library(kuenm2)
library(terra)

# Current directory
getwd()

# Define new directory
#setwd("YOUR/DIRECTORY")  # uncomment and modify if setting a new directory

# Saving original plotting parameters
original_par <- par(no.readonly = TRUE)
```

<br>

# Preparing data

## Example data

We will use occurrence records provided within the `kuenm2`  package. Most example data in the package is derived from [Trindade & Marques (2024)](https://doi.org/10.1111/ddi.13931). The `occ_data` object contains 51 occurrences of *Myrcia hatschbachii*, a tree endemic to Southern Brazil. Although this example data set has three columns (species, x, and y), only two numeric columns with longitude and latitude coordinates are required.

```{r Import occurrence data}
# Import occurrences
data(occ_data, package = "kuenm2")

# Check data structure
str(occ_data)
```

<br>

As predictor variables, we will use another dataset included in the package. This dataset comprises four bioclimatic variables from  [WorldClim 2.1](https://worldclim.org/data/bioclim.html) at 10 arc-minute resolution, and a categorical variable (SoilType) from [SoilGrids](https://soilgrids.org/) aggregated to 10 arc-minutes. All variables have been masked using a polygon that delimits the area for model calibration. This polygon was generated by drawing a minimum convex polygon around the records, with a 300 km buffer.

Note: At later steps, `kuenm2` helps users automate model transfers to multiple future and/or past scenarios. If using bioclimatic variables from WroldClim, part of that process involves renaming variables, and renaming options are limited. Try to limit variable names to the following formats: `"bio_1", "bio_12"`; `"bio_01", "bio_12"`; `"Bio_1", "Bio_12"`; or `"Bio_01", "Bio_12"`. If variable names have other formats, using automated steps are still possible but require a few extra steps.

```{r Load variables, dpi = 80, out.width = "70%"}
# Import raster layers
var <- terra::rast(system.file("extdata", "Current_variables.tif",
                               package = "kuenm2"))

# Check variables
terra::plot(var)
```

<br>

Visualize occurrences records in geography:

```{r, dpi = 80, out.width = "70%"}
# Visualize occurrences on one variable
terra::plot(var[["bio_1"]], main = "Bio 1")

points(occ_data[, c("x", "y")], col = "black")
```

<br>

## First steps in preparing data

The functions `prepare_data()` and `prepare_user_data()` are central to getting data ready for model calibration. They handles several key steps:

* **Defining the algorithm**: Users can choose between `maxnet` or `glm`.
* **Generating background points**: Background points are sampled from raster layers (`prepare_data()`), unless provided by the user (`prepare_user_data()`). Background points serve as a reference to contrast presence records. The default number of points is 1000; we encourage users to define appropriate numbers considering the calibration area extension, the number of presence records, and the number of pixels available (thic can change depending on layer resolution).
* **Principal component analysis (PCA)**: An optional step that can be done with the variables provided.
* **Preparing calibration data**: Presence records and background points are associate with predictor values and put together in a `data.frame` to be used in the ENM. Calibration data is provided by the user in `prepare_user_data()`.
* **Data partitioning**: The function divides your data to prepare training and testing sets via a cross-validation process. The partitioning methods directly available include `kfolds`, `subsample`, and `bootstrap`.
* **Defining grid of model parameters**: This helps setting up combinations of feature classes (FCs), regularization multiplier (RM) values (for Maxnet), and sets of predictor variables. An explanation of the roles of RMs and FCs in Maxent models see [Merow et al. 2013](https://doi.org/10.1111/j.1600-0587.2013.07872.x).

As with any function, we recommend checking the documentation with `help(prepare_data)` for more detailed explanations. Now, let's prepare the data for model calibration, using 4 k-folds to partition training and testing datasets:

```{r simple prepare data}
# Prepare data for maxnet model
d <- prepare_data(algorithm = "maxnet",
                  occ = occ_data,
                  x = "x", y = "y",
                  raster_variables = var,
                  species = "Myrcia hatschbachii",
                  categorical_variables = "SoilType",
                  partition_method = "kfolds",
                  n_partitions = 4,
                  n_background = 1000,
                  features = c("l", "q", "lq", "lqp"),
                  r_multiplier = c(0.1, 1, 2))
```

<br>

The `prepare_data()` function returns a `prepared_data` object, a list that contains several essential components for model calibration. Below is an example of the object’s printed output, which provides a summary of its contents.

```{r print prepared data}
print(d)
```

<br>

The parts of the `prepared_data` object can be explored in further detail by indexing them as in the following example.

```{r explore some data}
# Check the algorithm selected
d$algorithm

# See first rows of calibration data
head(d$calibration_data)

# See first rows of formula grid
head(d$formula_grid)
```

<br>

The algorithms that can be selected are `"maxnet"` or `"glm"`. When using GLMs, regularization multipliers are not used.

Now, let's run an example using `glm`, this time using the `subsample` partitioning method, with a total of 10 partitions, and 70% of the dataset used for training in every iteration.

```{r prepare glm}
# Prepare data selecting GLM as the algorithm
d_glm <- prepare_data(algorithm = "glm",
                      occ = occ_data,
                      x = "x", y = "y",
                      raster_variables = var,
                      species = "Myrcia hatschbachii",
                      categorical_variables = "SoilType",
                      partition_method = "subsample",
                      n_partitions = 10,
                      train_proportion = 0.7,
                      n_background = 300,
                      features = c("l", "q", "p", "lq", "lqp"),
                      r_multiplier = NULL)  # Not necessary with GLMs

# Print object
d_glm
```

<br>

## Using pre-processed data

In some cases, users already have data that has been prepared for model calibration (e.g., when data is prepared for time-specific applications). When that is the case, the function `prepare_user_data()` can take the pre-processed data and get them ready for the next analyses. User-prepared data must be a `data.frame` that includes a column with zeros and ones, indicating **presence (1)** and **background (0)** records, along with columns with values for each of the **variables**. The package includes an example of such a `data.frame` for reference (see below).

```{r import user data}
data("user_data", package = "kuenm2")

head(user_data)
```

<br>

The `prepare_user_data()` function operates similarly to `prepare_data()`, but with key differences. The main difference is that instead of requiring a table with coordinates and a `SpatRaster` of variables, it takes the already prepared `data.frame`. See full documentation with `help(prepare_user_data)`.

```{r prepare user data}
# Prepare data for maxnet model
data_user <- prepare_user_data(algorithm = "maxnet",
                               user_data = user_data,  # user-prepared data.frame
                               pr_bg = "pr_bg",
                               species = "Myrcia hatschbachii",
                               categorical_variables = "SoilType",
                               partition_method = "bootstrap",
                               features = c("l", "q", "p", "lq", "lqp"),
                               r_multiplier = c(0.1, 1, 2, 3, 5))

data_user
```

<br>

This function also allows users to provide a list identifying which points should be used for testing in each cross-validation iteration. This can be useful to keep data partitions stable among distinct model calibration routines. If `user_folds` is `NULL` (the default), the function partitions the data according to `partition_method`, `n_partitions`, and `train_proportion`.

<br>

# Exploring prepared data

In the following examples, we'll use the object `d`, prepared using the `maxnet` algorithm. The same can be done with `prepared_data` using `glm` as de algorithm.

<br>

## Comparative histograms

Users can visualize the distribution of predictor values for occurrence records, background points, and the entire calibration area using histograms. An example is presented below. See full documentation with `help(explore_calibration_hist)` and `help(plot_explore_calibration)`.

```{r explore histogram, out.width = "70%", fig.width = 7, fig.height = 4.5}
# Prepare histogram data
calib_hist <- explore_calibration_hist(data = d, raster_variables = var,
                                       include_m = TRUE)

# Plot histograms
plot_calibration_hist(explore_calibration = calib_hist)
```

<br>

In the previous plot, gray represents values across the entire calibration area, blue background values, and green values at presence records (magnified by a factor of 2 to enhance visualization). Both the colors and the magnification factor can be customized.

If `raster_variables` are not available, exclude that argument and `include_m` when running the function `explore_calibration_hist()`. This could happen when users have pre-processed data and run `prepare_user_data()`.

<br>

## Distribution of data for models

Additionally, users can explore the geographic distribution of occurrences and background, as well as how they were partitioned. See full documentation with `help(explore_partition_geo)`.

```{r explore spatial, out.width = "75%"}
# Explore spatial distribution
pbg <- explore_partition_geo(data = d, raster_variables = var[[1]])

# Plot exploration results in geography
terra::plot(pbg)
```

<br>

Note that, by default, background points are selected randomly within the calibration area. However, users can influence this selection by providing a bias file (see section [More options to prepare data](#more-options-to-prepare-data)).

<br>

## Similarity assessment in partitions

When partitioning data, some testing points may fall into environments that are different from those in which training points are. This forces the model to evaluate under extrapolation, testing its predictions on conditions outside its training range.

The `explore_partition_extrapolation()` function calculates dissimilarity and detects non-analogous conditions in testing points after comparing them to the training data for each partition. Dissimilarity tests are performed using the mobility-oriented parity metric (MOP; [Owens et al. 2013](https://doi.org/10.1016/j.ecolmodel.2013.04.011)) as in [Cobos et al. (2024)](https://biogeography.pensoft.net/article/132916/). This analysis only requires a `prepared_data` object.

By default, the function `explore_partition_extrapolation()` uses all training data (presences and backgrounds) to define the environmental space of reference, and computes MOP for testing records. If computing MOP for test background points is needed, set `include_test_background = TRUE`.

```{r}
# Run extrapolation risk analysis
mop_partition <- explore_partition_extrapolation(data = d,
                                                 include_test_background = TRUE)
```

<br>

The previous run returns a list in which the main outcome is `Mop_results`. This is a `data.frame`, in which each row is a testing record; the columns are:

- **mop_distance**: MOP distances;
- **inside_range** an indicator of whether environmental conditions at each testing record fall within the training range;
- **n_var_out**: number of variables outside the training range;
- **towards_low** and **towards_high **: names of variables with values lower or higher than the training range;
- **SoilType**: because the `prepared_data` object includes a categorical variable, it also contains a column indicating which categories in testing data were not present in training data.

```{r mop_results}
# Check some of the results
head(mop_partition$Mop_results)

# Number of testing records with values outside training ranges
nrow(mop_partition$Mop_results[mop_partition$Mop_results$n_var_out > 1, ])
```

<br>

Now we can use the function `plot_explore_partition()` to visualize the records from each partition in both geographic and environmental spaces. As comparisons were performed in environmental space, to visualize results in geographic space the plotting functions uses a simplified map of the world, but another spatial object can be defined in `calibration_area` if needed.

The type MOP result to plot can be specified as: "simple" to show records in a partition within or out of envrionmental range of the other partitions; or "distance" to display the distance of each record to the nearest set of conditions in the other partitions.

```{r, fig.width = 8, fig.height = 3.5, out.width = "75%"}
# Simple plot in geographic space
plot_explore_partition(explore_partition = mop_partition, space = "G",
                       type_of_plot = "simple")

# Distance plot in geographic space
plot_explore_partition(explore_partition = mop_partition, space = "G",
                       type_of_plot = "distance",
                       lwd_legend = 4)

# Simple plot in environmental space
plot_explore_partition(explore_partition = mop_partition, space = "E",
                       type_of_plot = "simple",
                       variables = c("bio_7", "bio_15"))

# Distance plot in environmental space
plot_explore_partition(explore_partition = mop_partition, space = "E",
                       type_of_plot = "distance",
                       variables = c("bio_7", "bio_15"),
                       lwd_legend = 4)

```

<br>

The data used in this example was partitioned using k-folds which reduces the chances of finding novel conditions when comparing testing vs training sets of data. However, that might not be the case when using data partitioning methods such as spatial blocks (see [Using external data partitions](#using-external-data-partitions)).

<br>

# More options to prepare data

The examples of data preparation above show a few of the options that can be used to get data ready to start the ENM process. The next sections demonstrate other options available for data preparation, including: (1) customizing the list of model formulas to test in model calibration; (2) principal component analysis for variables; and (3) using external data partitioning methods.

<br>

## Custom formulas

By default, `kuenm2` builds the formula grid automatically using the variables provided and the feature classes defined.

For instance, if `raster_variables` or `user_data` contain *bio_1* and *bio_12*, and you set the features to `lq` (linear + quadratic), the formula will include linear and quadratic terms for each variable. In this example, the resulting formula would be:

```{r formula example}
"~ bio_1 + bio_12 + I(bio_1^2) + I(bio_12^2)"
```

<br>

Instead of letting the functions build formulas based on selected features, users can provide custom formulas. This is useful when full control over which terms are included is needed (for example, including the quadratic version of specific variables):

```{r user_formula}
# Set custom formulas
my_formulas <- c("~ bio_1 + bio_12 + I(bio_1^2) + I(bio_12^2)",
                 "~ bio_1 + bio_12 + I(bio_1^2)",
                 "~ bio_1 + bio_12 + I(bio_12^2)",
                 "~ bio_1 + I(bio_1^2) + I(bio_12^2)")

# Prepare data using custom formulas
d_custom_formula <- prepare_data(algorithm = "maxnet",
                                 occ = occ_data,
                                 x = "x", y = "y",
                                 raster_variables = var,
                                 species = "Myrcia hatschbachii",
                                 categorical_variables = "SoilType",
                                 partition_method = "kfolds",
                                 n_partitions = 4,
                                 n_background = 1000,
                                 user_formulas = my_formulas,  # Custom formulas
                                 r_multiplier = c(0.1, 1, 2))

# Check formula grid
d_custom_formula$formula_grid
```

<br>

## Using a bias file

A bias file is a `SpatRaster` object that contains values that influence the selection of background points within the calibration area. This can be particularly useful for mitigating sampling bias. For instance, if we want the selection of background points to be informed by how historical sampling has been, a layer of the density of records from a target group can be used (see [Ponder et al. 2001](https://doi.org/10.1046/j.1523-1739.2001.015003648.x), [Anderson et al. 2003](https://doi.org/10.1046/j.1365-2699.2003.00867.x), and [Barber et al. 2020](https://doi.org/10.1111/ddi.13442)). The bias file must have the same extent, resolution, and number of cells as your raster variables.

Let's illustrate this process with an example bias file included in the package. This `SpatRaster` has lower values in the center and higher values towards the borders of the area:

```{r import bias file, out.width = "70%"}
# Import a bias file
bias <- terra::rast(system.file("extdata", "bias_file.tif", package = "kuenm2"))

# Check the bias values
terra::plot(bias)
```

<br>

This bias layer will be used to prepare two new datasets: one with a "direct" bias effect (with higher probability of selecting background points in regions with higher bias values) and another with an "inverse" effect (the oposite).

```{r bias data, results='hide'}
# Using a direct bias effect in sampling
d_bias_direct <- prepare_data(algorithm = "maxnet",
                              occ = occ_data,
                              x = "x", y = "y",
                              raster_variables = var,
                              species = "Myrcia hatschbachii",
                              categorical_variables = "SoilType",
                              n_background = 1000,
                              partition_method = "kfolds",
                              bias_file = bias, bias_effect = "direct",  # bias parameters
                              features = c("l", "q", "p", "lq", "lqp"),
                              r_multiplier = c(0.1, 1, 2, 3, 5))

# Using an indirect bias effect
d_bias_inverse <- prepare_data(algorithm = "maxnet",
                               occ = occ_data,
                               x = "x", y = "y",
                               raster_variables = var,
                               species = "Myrcia hatschbachii",
                               categorical_variables = "SoilType",
                               n_background = 1000,
                               partition_method = "kfolds",
                               bias_file = bias, bias_effect = "inverse",  # bias parameters
                               features = c("l", "q", "p", "lq", "lqp"),
                               r_multiplier = c(0.1, 1, 2, 3, 5))
```

<br>

Let's use the `explore_partition_geo` function to see the effect of using a bias file.

```{r exp_part_geobias}
# Explore spatial distribution of points
## No bias
geo_dist <- explore_partition_geo(data = d, raster_variables = var)

## Direct bias effect
geo_dist_bias <- explore_partition_geo(data = d_bias_direct,
                                       raster_variables = var)

## Inverse bias effect
geo_dist_bias_inv <- explore_partition_geo(data = d_bias_inverse,
                                           raster_variables = var)

## Adjusting plotting grid
par(mfrow = c(2, 2))

## The plots to show sampling bias effects
terra::plot(bias, main = "Bias file")
terra::plot(geo_dist$Background, main = "Random Background",
            plg = list(cex = 0.75))  # Decrease size of legend text)
terra::plot(geo_dist_bias$Background, main = "Direct Bias Effect",
            plg = list(cex = 0.75))  # Decrease size of legend text)
terra::plot(geo_dist_bias_inv$Background, main = "Inverse Bias Effect",
            plg = list(cex = 0.75))  # Decrease size of legend text)
```

<br>

## PCA for variables

A common approach in ENM involves summarizing the information from a set of variables into a smaller set of orthogonal variables using Principal Component Analysis (PCA) (see [Trindade et al. 2025](https://onlinelibrary.wiley.com/doi/epdf/10.1111/jbi.70103) for an example). `kuenm2` has options to perform a PCA internally or use variables that have been externally prepared as PCs.

<br>

### Internal PCA

`kuenm2` can perform all PCA transformations internally, which eliminates the need of transforming raw variables into PCs when producing projections later on. This is particularly advantageous when projecting model results across multiple scenarios (e.g., various Global Climate Models for different future periods). By performing PCA internally, you only need to provide the raw environmental variables (e.g., `bio_1`, `bio_2`, etc.) when making projections; helper functions will handle the PCA transformation internally.

Let's explore how to implement this:

```{r prepare data pca}
# Prepare data for maxnet models using PCA parameters
d_pca <- prepare_data(algorithm = "maxnet",
                      occ = occ_data,
                      x = "x", y = "y",
                      raster_variables = var,
                      do_pca = TRUE, center = TRUE, scale = TRUE,  # PCA parameters
                      species = "Myrcia hatschbachii",
                      categorical_variables = "SoilType",
                      n_background = 1000,
                      partition_method = "kfolds",
                      features = c("l", "q", "p", "lq", "lqp"),
                      r_multiplier = c(0.1, 1, 2, 3, 5))

print(d_pca)
```

<br>

The elements `calibration_data` and `formula_grid` have now been generated considering the principal components (PCs). By default, all continuous variables were included in the PCA, while categorical variables (e.g., "SoilType") were excluded. The default settings to define the number of PCs to retain keeps as many PCs as needed to collectively explain 95% of the total variance, and then further filter these, keeping only those axes that individually explain at least 5% of the variance. These parameters can be changed using the arguments in the function `prepare_data()`.

```{r explore hist, out.width = "70%", fig.width = 7, fig.height = 4.5}
# Check calibration data
head(d_pca$calibration_data)

# Check formula grid
head(d_pca$formula_grid)

# Explore variables distribution
calib_hist_pca <- explore_calibration_hist(data = d_pca, raster_variables = var,
                                           include_m = TRUE, breaks = 7)

plot_calibration_hist(explore_calibration = calib_hist_pca)
```

<br>

### External PCA

Users can choose to perform a PCA with their data by using the `perform_pca()` function, or one of their preference. Se an example with `perform_pca()` below:

```{r do PCA, out.width = "70%"}
# PCA with raw raster variables
pca_var <- perform_pca(raster_variables = var, exclude_from_pca = "SoilType",
                       center = TRUE, scale = TRUE)

# Plot
terra::plot(pca_var$env)
```

<br>

Now, let's use the PCs generated by `perform_pca()` to prepare the data:

```{r prepare data pca external}
# Prepare data for maxnet model using PCA variables
d_pca_extern <- prepare_data(algorithm = "maxnet",
                             occ = occ_data,
                             x = "x", y = "y",
                             raster_variables = pca_var$env,  # Output of perform_pca()
                             do_pca = FALSE,  # Set to FALSE because variables are PCs
                             species = "Myrcia hatschbachii",
                             categorical_variables = "SoilType",
                             n_background = 1000,
                             partition_method = "kfolds",
                             features = c("l", "q", "p", "lq", "lqp"),
                             r_multiplier = c(0.1, 1, 2, 3, 5))

# Check the object
print(d_pca_extern)

# Check formula grid
head(d_pca_extern$formula_grid)
```

<br>

Note that since PCA was performed  externally, `do_pca = FALSE`  is set in `prepare_data()`. This is crucial because setting it to `TRUE` would incorrectly apply PCA to variables that are *already* PCs. The `prepared_data` object in this scenario does not store any PCA-related information. Therefore, users **must provide PCs** instead of raw raster variables when projecting models.

<br>

## Using external data partitions

The functions `prepare_data()` and `prepare_user_data()` in the `kuenm2` package include four built-in methods for data partitioning:

- **"kfolds"**: Splits the dataset into *K* subsets (folds) of approximately equal size. In each partition, one fold is used as the test set, while the remaining folds are combined to form the training set.

- **"bootstrap"**: Creates the training dataset by sampling observations from the original dataset *with replacement* (i.e., the same observation can be selected multiple times). The test set consists of the observations that were not selected in that specific replicate.

- **"subsample"**: Similar to bootstrap, but the training set is created by sampling *without replacement* (i.e., each observation is selected once). The test set includes the observations not selected for training.

Other methods for data partitioning are also available, including those based on spatial rules ([Radosavljevic and Anderson, 2014](https://onlinelibrary.wiley.com/doi/pdf/10.1111/jbi.12227)). Although `kuenm2` does not currently implement spatial partitioning techniques, it is possible to use the ones implemented in other R packages. After partitioning data using other packages, those results can be used to replace the `part_data` section in the `prepared_data` object from `kuenm2`.

<br>

### ENMeval partitions

The ENMeval package ([Kass et al. 2021](https://besjournals.onlinelibrary.wiley.com/doi/pdfdirect/10.1111/2041-210X.13628)) provides three spatial partitioning methods:

- **Spatial block**: Divides occurrence data into four groups based on latitude and longitude, aiming for groups of roughly equal number of occurrences.

- **Checkerboard 1 (basic)**: Generates a checkerboard grid over the study area and assigns points to groups based on their location in the grid.

- **Checkerboard 2 (hierarchical)**: Aggregates the input raster at two hierarchical spatial scales using specified aggregation factors. Points are then grouped based on their position in the resulting hierarchical checkerboard.

Let's use the **spatial block** method as an example. We will use the object `d`, `prepared_data` created in previous steps.

```{r import enmeval_block, echo=FALSE}
data(enmeval_block, package = "kuenm2")
```

```{r spatial block enmeval, eval=FALSE}
# Install ENMeval if not already installed
if(!require("ENMeval")){
  install.packages("ENMeval")
}

# Extract calibration data from the prepared_data object and
# separate presence and background records
calib_occ <- d$data_xy[d$calibration_data$pr_bg == 1, ]  # Presences
calib_bg <- d$data_xy[d$calibration_data$pr_bg == 0, ]  # Background

# Apply spatial block partitioning using ENMeval
enmeval_block <- ENMeval::get.block(occs = calib_occ, bg = calib_bg)

# Inspect the structure of the output
str(enmeval_block)
#> List of 2
#>  $ occs.grp: num [1:51] 1 1 2 2 1 4 2 2 4 2 ...
#>  $ bg.grp  : num [1:957] 2 4 4 1 3 3 3 1 1 3 ...
```

<br>

The output of `get.block()` is a list with two elements: `occs.grp` and `bg.grp`. The `occs.grp` vector is for occurrence records and `bg.grp` for background points. Both vectors contain numeric values from 1 to 4, indicating the spatial group.

`kuenm2` stores partitioned data as a **list of vectors**, in which each vector is a partition, containing the **indices of points left out for testing**. The indices include both presence and background points.

```{r index part kuenm2}
str(d$part_data)
```

<br>

We can convert the spatial group information stored in `enmeval_block` into a list compatible with `kuenm2`:

```{r transform enmeval_block}
# Identify unique spatial blocks
id_blocks <- sort(unique(unlist(enmeval_block)))

# Create a list of test indices for each spatial block
new_part_data <- lapply(id_blocks, function(i) {
  # Indices of occurrence records in group i
  rep_i_presence <- which(enmeval_block$occs.grp == i)

  # Indices of background records in group i
  rep_i_bg <- which(enmeval_block$bg.grp == i)

  # To get the right indices for background,
  # we need to sum the total number of records to background indices
  rep_i_bg <- rep_i_bg + nrow(occ_data)

  # Combine presence and background indices for the test set
  c(rep_i_presence, rep_i_bg)
})

# Assign names to each partition
names(new_part_data) <- paste0("Partition_", id_blocks)

# Inspect the structure of the new partitioned data
str(new_part_data)
```

<br>

We now have a list containing four vectors, each storing the indices of test data defined using the `get.block()` function from the `ENMeval` package. The final step is to replace the original `part_data` in the `prepared_data` object with `new_part_data`. We should also update the partitioning method to reflect this change.

```{r replace part_data}
# Replace the original partition data with the new spatial blocks
d_spatial_block <- d
d_spatial_block$part_data <- new_part_data

# Update the partitioning method to reflect the new approach
d_spatial_block$partition_method <- "Spatial block (ENMeval)"  # Any name works

# Print the updated prepared_data object
print(d_spatial_block)
```

<br>

Let's check the spatial distribution of the partitioned data:

```{r plot spatial block, fig.width = 9, fig.height = 4}
# Explore data partitioning in geography
geo_block <- explore_partition_geo(d_spatial_block, raster_variables = var[[1]])

# Plot data partition in geography
terra::plot(geo_block[[c("Presence", "Background")]])
```

<br>

Because environmental conditions often vary by region, using spatial blocks increases the chances of having testing records outside training environmental ranges. Let's explore this effect using the `prepared_data` object, partitioned with `ENMeval` spatial blocks.

```{r, fig.width = 8, fig.height = 3.5, out.width = "75%"}
# Run extrapolation risk analysis
mop_blocks <- explore_partition_extrapolation(data = d_spatial_block,
                                              include_test_background = TRUE)

# Check some testing records with values outside training ranges
head(mop_blocks$Mop_results[mop_blocks$Mop_results$n_var_out > 1, ])

# Check simple extrapolation in geographic space
plot_explore_partition(explore_partition = mop_blocks, space = "G",
                       type_of_plot = "simple")

# Now in environmental space
plot_explore_partition(explore_partition = mop_blocks, space = "E",
                       type_of_plot = "simple",
                       variables = c("bio_7", "bio_15"))
```

<br>

Note that in partition 1, some occurrence records fall outside the envrionmental range of the other partitions (the same happens with many background records).

<br>

### flexsdm partitions

The package `flexsdm` ([Velazco et al. 2022](https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.13874)) also offers multiple partitioning methods. We will use the method spatial block cross-validation. In this method, the number and size of blocks is optimized internally considering spatial autocorrelation, environmental similarity, and the number of presence and background records within each partition. For more details, see [Data partitioning at the package website](https://sjevelazco.github.io/flexsdm/articles/v01_pre_modeling.html#data-partitioning).

```{r import flexsdm_block, echo=FALSE}
data(flexsdm_block, package = "kuenm2")
```

```{r spatial block flexsdm, eval=FALSE}
# Install flexsdm if not already installed
if (!require("flexsdm")) {
  if (!require("remotes")) {
    install.packages("remotes")
  }

  remotes::install_github("sjevelazco/flexsdm")  # needs compilation tools to work
}

# Combine calibration data with spatial coordinates
calib_data <- cbind(d$data_xy, d$calibration_data)

# Split data in test and train using flexsdm
flexsdm_block <- flexsdm::part_sblock(env_layer = var, data = calib_data,
                                      x = "x", y = "y", pr_ab = "pr_bg",
                                      min_res_mult = 1, max_res_mult = 500,
                                      num_grids = 30, n_part = 4,
                                      prop = 0.5)
#> The following grid cell sizes will be tested:
#> 0.17 | 3.03 | 5.9 | 8.77 | 11.64 | 14.51 | 17.37 | 20.24 | 23.11 | 25.98 |
#> 28.84 | 31.71 | 34.58 | 37.45 | 40.32 | 43.18 | 46.05 | 48.92 | 51.79 |
#> 54.66 | 57.52 | 60.39 | 63.26 | 66.13 | 68.99 | 71.86 | 74.73 | 77.6 |
#> 80.47 | 83.33
#>
#> Creating basic raster mask...
#>
#> Searching for the optimal grid size...

# See the structure of the object
str(flexsdm_block$part)
#> Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	1008 obs. of  4 variables:
#>  $ x    : num  -51.3 -50.6 -49.3 -49.8 -50.2 ...
#>  $ y    : num  -29 -27.6 -27.8 -26.9 -28.2 ...
#>  $ pr_ab: num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ .part: int  3 1 3 2 3 3 1 4 4 3 ...
```

<br>

The output from `flexsdm` differs from that of `ENMeval`. Instead of returning a list with separate vectors of spatial group IDs, `flexsdm` appends the group assignments as a new column in the `part` element of its output.

As with the ENMeval example, we can transform the spatial group information stored in `flexsdm_block` into a format compatible with `kuenm2`:

```{r transform flexsdm_block}
# Identify unique spatial blocks from flexsdm output
id_blocks <- sort(unique(flexsdm_block$part$.part))

# Create a list of test indices for each spatial block
new_part_data_flexsdm <- lapply(id_blocks, function(i) {
  # Indices of occurrences and background points in group i
  which(flexsdm_block$part$.part == i)
})

# Assign names to each partition partition
names(new_part_data_flexsdm) <- paste0("Partition_", id_blocks)

# Inspect the structure of the new partitioned data
str(new_part_data_flexsdm)

# Replace the partition data in the prepared_data object
d_block_flexsdm <- d
d_block_flexsdm$part_data <- new_part_data_flexsdm

# Update the partition method description
d_block_flexsdm$partition_method <- "Spatial block (flexsdm)"  # any name works

# Display the updated prepared_data object
print(d_block_flexsdm)
```

<br>

Let's check the spatial distribution of the partitioned data:

```{r plot spatial block flexsdm, fig.width = 9, fig.height = 4}
# Explore data partitioning in geography
geo_block_flexsdm <- explore_partition_geo(d_block_flexsdm,
                                           raster_variables = var[[1]])

# Plot data partition in geography
terra::plot(geo_block_flexsdm[[c("Presence", "Background")]])
```

<br>

```{r par_reset}
# Reset plotting parameters
par(original_par)
```

<br>

# Saving prepared data

The `prepared_data` object is crucial for the next step in the ENM workflow in `kuenm2`, model calibration. As this object is essentially a list, users can save it to a local directory using `saveRDS()`. Saving the object facilitates loading it back into your R session later using `readRDS()`. See an example below:

```{r save data, eval=FALSE}
# Set directory to save (here, in a temporary directory)
dir_to_save <- file.path(tempdir())

# Save the data
saveRDS(d, file.path(dir_to_save, "Data.rds"))

# Import data
d <- readRDS(file.path(dir_to_save, "Data.rds"))
```