essd/essd_ms.Rmd
_Database and dataset structure_
The database is structured as a collection of independent contributed datasets, all of which have been standardized to a common structure and units.
Each dataset is given a reference name (internal to COSORE) that links its constituent tables and provides a point of reference in reports.
Each constituent dataset normally has a series of separate data tables:
* _description_ (**Table 2**) describes site and dataset characteristics;
* _contributors_ (**Table 3**) lists individuals who contributed to the measurement, analysis, curation, and/or submission of the dataset;
* _ports_ (**Table 4**) gives the different _ports_ (generally equivalent to separate measurement chambers) in use, and what each is measuring: flux, species, and treatment, as well as characteristics of the measurement collar;
* _data_ (**Table 5**), the central table of the dataset, records flux observations;
* _columns_ (**Supplementary Table S2**) maps raw data columns to standard COSORE columns, providing a record for reproducibility; and
* _diagnostics_ (**Supplementary Table S3**) provides statistics on the data import process: errors, columns and rows dropped, etc.
The common key linking these dataset tables is the CSR_DATASET field, which records the unique name assigned to the dataset. In addition, a CSR_PORT key field links the _ports_ and _data_ tables. These links make it straightforward to extract datasets that have measured particular fluxes in
certain ecosystem types, or isolate only non-treatment (control) chamber fluxes, for example.
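As a brief illustration (a sketch, not code from the COSORE package itself), the example below joins the _ports_ and _data_ tables on these keys and keeps only control-chamber fluxes. It assumes the two tables have already been loaded as data frames, for example via the `csr_table()` accessor described under "Data access and use" below, and that the treatment field in _ports_ is named CSR_TREATMENT with control chambers coded as "None"; these names are assumptions and should be checked against the table metadata.

```
# Sketch: isolate control (non-treatment) chamber fluxes by joining the
# ports and data tables on their shared keys.
# Assumptions, not guaranteed by this paper: `ports` and `fluxes` hold the
# _ports_ and _data_ tables (e.g. ports <- csr_table("ports");
# fluxes <- csr_table("data")), and the treatment column is named
# CSR_TREATMENT with control chambers coded as "None".
control_ports <- ports[ports$CSR_TREATMENT == "None", c("CSR_DATASET", "CSR_PORT")]
control_fluxes <- merge(fluxes, control_ports, by = c("CSR_DATASET", "CSR_PORT"))
```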
_Versioning and archiving_
COSORE uses semantic versioning (https://semver.org/), meaning that its version numbers
generally follow an "x.y.z" format, where _x_ is the major version number (changing only when there are major changes to the database or package structure and/or function, in a manner that may break existing scripts using the data); _y_ is the minor version number (typically changing with significant data updates); and _z_ the patch number (bug fixes, documentation upgrades, or other changes that are completely backwards compatible).
Following each official (major) release, a DOI will be issued and the data permanently archived by Zenodo (https://zenodo.org/).
All changes to the data or codebase are immediately available through the GitHub repository, but only official releases will be issued a DOI; we anticipate this happening on an approximately annual basis.
_Data license and citation_
The database license is CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/); see the “LICENSE” file in the repository. This is identical to that used by, e.g., FLUXNET Tier 1 and ICOS R1.
In general, this license provides that users may copy and redistribute the database and R package code in any medium or format, adapting and building upon them for any scientific or commercial purpose, as long as appropriate credit is given.
We request that users cite this article and strongly encourage them to (i) cite all constituent dataset primary publications, and (ii) involve data contributors as co-authors whenever possible, as is commonly done for other global databases such as FLUXNET (Baldocchi et al., 2001; Knox et al., 2019).
In addition, users should reference the specific version of the dataset they used (e.g., v0.6.0), access date, and ideally the specific Git commit number.
This supports reproducibility of any analyses.
**Data access and use**
Major COSORE data releases are available via Zenodo (as noted above), as well as the GitHub “Releases” page at https://github.com/bpbond/cosore/releases; we anticipate that institutional repositories such as ESS-DIVE (Environmental Systems Science Data Infrastructure for a Virtual Ecosystem, https://ess-dive.lbl.gov/) may host releases at some point in the future.
Downloads via this page are flat-file CSV (comma-separated values), and readable by any modern computing system. Missing values are encoded by a blank (i.e. two successive commas in the CSV format).
A release download is fully self-contained, with full data, metadata, and documentation; a file manifest; a copy of the data license; an introductory vignette; a summary report on the entire database; and an explanatory README with links to this publication.
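For example (a minimal sketch; the file name below is a placeholder, not necessarily a file shipped in a release), a table from a flat-file release can be read with base R while treating blank fields as missing:

```
# Sketch: read one CSV table from a flat-file release, treating blank fields
# (two successive commas) as missing values. "data.csv" is a placeholder file
# name; consult the release manifest for the actual file names.
flux_data <- read.csv("data.csv", na.strings = "", stringsAsFactors = FALSE)
```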
An alternative way to access COSORE data, including minor updates between major releases, is to install and use the _cosore_ R (R Core Team, 2019) package.
This provides a robust framework, including dedicated access functions, dataset and database report generation, and quality assurance and checking (see below).
Because the flux data are currently included in the repository itself, the latter is quite large (compared to most Git repositories), ~`r db_disksize` MB.
(Note that the data are stored in R’s compressed RDS file format; when loaded into memory, the entire database is significantly larger, ~`r round(db_memsize, 0)` MB.)
It thus cannot easily be hosted on CRAN (the Comprehensive R Archive Network), the canonical source for R packages.
Installing directly from GitHub is, however, straightforward using the _devtools_ or _remotes_ packages:
```
devtools::install_github("bpbond/cosore")
library(cosore)
```
Four primary user-facing functions are available:
* *csr_database()* summarizes the entire database in a single convenient data frame, with one row per dataset, and is intended as a high-level overview. It returns a selection of variables summarized in **Tables 2-8** below, including dataset name, longitude, latitude, elevation, IGBP code, number of records, dates, and variables measured;
* *csr_dataset()* returns a single dataset: an R list structure, each element of which is a table (_description_, _contributors_, etc., as described above);
* *csr_table()* collects, into a single data frame, one of the tables of the database, for any or all datasets;
* *csr_metadata()* provides metadata information about all fields in all tables.
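A minimal usage sketch follows; the dataset name is a placeholder, and the exact argument forms are assumptions that should be checked against the package help.

```
# Sketch of the primary accessors. "d20200101_EXAMPLE" is a placeholder
# dataset name, and the argument forms shown are assumptions; consult the
# package help (?csr_dataset, etc.) for the exact signatures.
db_overview  <- csr_database()                    # one row per dataset
one_dataset  <- csr_dataset("d20200101_EXAMPLE")  # list of tables for one dataset
descriptions <- csr_table("description")          # one table across all datasets
field_info   <- csr_metadata()                    # metadata for all fields
```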
Two additional reporting functions may also be useful to users:
* *csr_report_database()* generates an HTML report on the entire database: number of datasets, locations, number of observations, distribution of flux values, etc.;
* *csr_report_dataset()* generates an HTML report on a single dataset, including tabular and graphical summaries of location, flux data, and diagnostics.
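For example (again a sketch, with a placeholder dataset name and an assumed argument form; see the package documentation for the exact signatures):

```
# Sketch: generate HTML reports. The single-dataset call assumes the function
# takes a dataset name; "d20200101_EXAMPLE" is a placeholder.
csr_report_database()
csr_report_dataset("d20200101_EXAMPLE")
```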
Finally, a number of functions are targeted at developers, and include functionality to ingest contributed data, standardize data, and prepare a new release. See the package documentation for more details on these.
_Documentation_
The primary documentation for the COSORE database is this manuscript.
Both the flat-file releases and the `cosore` R package include extensive documentation, notably an in-depth vignette available both within the package and online (https://rpubs.com/bpbond/502069).
The R package includes documentation available via R's standard help system.
_Data quality and testing_
When contributed data are imported into COSORE, the package code performs a number of quality assurance checks. These include:
* Timestamp errors, for example illegal dates and times for the specified time zone;
* Bad email addresses or ORCID identifiers;
* Records with no flux value;
* Records for which the analyzer recorded an error condition.
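To give a flavor of such a check (an illustrative sketch only, not the package's actual implementation), an invalid timestamp can be detected by attempting to parse it in the dataset's declared time zone:

```
# Sketch: timestamps that cannot be parsed in the declared time zone come
# back as NA and can be flagged. This illustrates the idea only; it is not
# the COSORE package's actual QA code.
ts <- c("2019-06-01 12:30:00", "2019-02-30 12:30:00")  # second date is illegal
parsed <- as.POSIXct(ts, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
bad_timestamp <- is.na(parsed)  # TRUE where the timestamp is invalid
```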
```{r errors, cache=TRUE}
# Calculate what percent of observations are removed across all datasets
```
The interval between measurements ranges from `r round(min(intervals$Interval, na.rm = TRUE), 0)` to `r round(max(intervals$Interval, na.rm = TRUE), 0)` minutes, with 25%-50%-75% quantile values of `r q[2]`, `r q[3]`, and `r q[4]` minutes, respectively. A one-hour interval between measurements is thus by far the most common choice. Currently, `r round(subdaily_ds_pct, 0)`% of the datasets, and `r round(subdaily_ds_N, 3)`% of the data, provide sub-daily temporal resolution.