Commit 56d5848

Update text to match Google Doc
1 parent d3c4d35 commit 56d5848

3 files changed

+42
-50
lines changed

essd/essd_ms.Rmd

Lines changed: 42 additions & 50 deletions
@@ -92,20 +92,18 @@ Its development started in April 2019, and as of this writing (`r Sys.Date()`) t
9292

9393
_Database and dataset structure_
9494

95-
The database is structured as a collection of independent contributed _datasets_, all of which have been standardized to a common structure, units, etc.
96-
Each dataset is given a reference name that links its constituent tables, provides a point of reference in reports, and is used when calling the R package accessor functions (see below).
95+
The database is structured as a collection of independent contributed datasets, all of which have been standardized to a common structure and units.
96+
Each dataset is given a reference name (internal to COSORE) that links its constituent tables, and provides a point of reference in reports.
97+
Each constituent dataset normally has a series of separate data tables:
9798

98-
Each constituent dataset normally has a series of separate data tables that are linked by keys.
99-
These tables include:
100-
101-
* _description_ (**Table 2**) describing site and dataset characteristics;
102-
* _contributors_ (**Table 3**) listing individuals who contributed to the measurement, analysis,
99+
* _description_ (**Table 2**) describes site and dataset characteristics;
100+
* _contributors_ (**Table 3**) lists individuals who contributed to the measurement, analysis,
103101
curation, and/or submission of the dataset;
104-
* _ports_ (**Table 4**) which gives the different _ports_ (generally equivalent to separate measurement chambers) in use, and what each is measuring: flux, species, and treatment, as well as characteristics of the measurement collar;
105-
* _data_ (**Table 5**), the central table of the dataset, which records flux observations;
106-
* _ancillary_ (**Supplementary Table S1**) summarizing site-level ancillary measurements;
107-
* _columns_ (**Supplementary Table S2**), mapping raw data columns to standard COSORE columns, providing a record for reproducibility; and
108-
* _diagnostics_ (**Supplementary Table S3**), which provides statistics on the data import process: errors, columns and rows dropped, etc.
102+
* _ports_ (**Table 4**) gives the different _ports_ (generally equivalent to separate measurement chambers) in use, and what each is measuring: flux, species, and treatment, as well as characteristics of the measurement collar;
103+
* _data_ (**Table 5**), the central table of the dataset, records flux observations;
104+
* _ancillary_ (**Supplementary Table S1**) summarizes site-level ancillary measurements;
105+
* _columns_ (**Supplementary Table S2**) maps raw data columns to standard COSORE columns, providing a record for reproducibility; and
106+
* _diagnostics_ (**Supplementary Table S3**) provides statistics on the data import process: errors, columns and rows dropped, etc.
109107

110108
The common key linking these dataset tables is the CSR_DATASET field, which records the unique name assigned to the dataset. In addition, a CSR_PORT key field links the _ports_ and _data_ tables. These links make it straightforward to extract datasets that have measured particular fluxes in
111109
certain ecosystem types, or isolate only non-treatment (control) chamber fluxes, for example.
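
As a rough illustration of how these keys can be used, the following sketch joins toy stand-ins for the _ports_ and _data_ tables and then keeps only control chambers. Only the `CSR_DATASET` and `CSR_PORT` key names come from the text; the dataset name and the `TREATMENT` and `FLUX` columns are hypothetical placeholders, not the actual COSORE field names.

```r
# Toy stand-ins for the ports and data tables of one dataset.
# Only CSR_DATASET and CSR_PORT are taken from the text; the other
# column names and all values are hypothetical.
ports <- data.frame(
  CSR_DATASET = "d_example",
  CSR_PORT = c(1, 2),
  TREATMENT = c("None", "Warming"),
  stringsAsFactors = FALSE
)
flux <- data.frame(
  CSR_DATASET = "d_example",
  CSR_PORT = c(1, 1, 2, 2),
  FLUX = c(1.2, 1.4, 2.1, 2.3),
  stringsAsFactors = FALSE
)

# Attach port metadata to each flux observation using both keys
joined <- merge(flux, ports, by = c("CSR_DATASET", "CSR_PORT"))

# Keep only non-treatment (control) chamber fluxes
control <- joined[joined$TREATMENT == "None", ]
```

Joining on both keys, rather than on `CSR_PORT` alone, keeps the operation safe when tables from several datasets have been combined, since port numbers are presumably reused across datasets.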
@@ -114,33 +112,26 @@ _Versioning and archiving_
114112

115113
COSORE uses semantic versioning (https://semver.org/), meaning that its version numbers
116114
generally follow an "x.y.z" format, where _x_ is the major version number (changing only when there are major changes to the database or package structure and/or function, in a manner that may break existing scripts using the data); _y_ is the minor version number (typically changing with significant data updates); and _z_ the patch number (bug fixes, documentation upgrades, or other changes that are completely backwards compatible).
117-
Following each official (major) release a DOI will be issued and the data archived by Zenodo (https://zenodo.org/).
118-
All changes to the data or codebase are immediately available through the GitHub repository, but only official releases will be issued a DOI.
115+
Following each official (major) release, a DOI will be issued and the data permanently archived by Zenodo (https://zenodo.org/).
116+
All changes to the data or codebase are immediately available through the GitHub repository, but only official releases will be issued a DOI; we anticipate this happening on an approximately annual basis.
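
To make the "x.y.z" convention concrete, here is a minimal sketch that classifies the difference between two version strings. The helper `version_change()` is hypothetical and is not part of the _cosore_ package; it only illustrates the major/minor/patch distinction described above.

```r
# Hypothetical helper (not part of cosore): report which component of an
# "x.y.z" version string changed first, reading left to right.
version_change <- function(old, new) {
  o <- as.integer(strsplit(old, ".", fixed = TRUE)[[1]])
  n <- as.integer(strsplit(new, ".", fixed = TRUE)[[1]])
  stopifnot(length(o) == 3, length(n) == 3)
  if (n[1] != o[1]) return("major")  # may break existing scripts
  if (n[2] != o[2]) return("minor")  # typically a significant data update
  if (n[3] != o[3]) return("patch")  # backwards-compatible fix
  "none"
}

version_change("0.6.0", "0.6.1")  # "patch"
version_change("0.6.0", "1.0.0")  # "major"
```

A script that depends on the database could use a check of this kind to warn the user when the installed version differs from the one it was written against in its major component.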
119117

120118
_Data license and citation_
121119

122-
The database license is CC-BY-4 (https://creativecommons.org/licenses/by/4.0/); see the “LICENSE” file in the repository.
123-
This is identical to that used by e.g. FLUXNET Tier 1 and ICOS R1.
124-
In general, this license provides that users may copy and redistribute the database and R package code in any medium or format, adapting and building upon them for any scientific or commercial purpose, as long as appropriate credit is given.
125-
We request that users cite this article and strongly encourage them to (i) cite all constituent dataset primary publications, and (ii) involve data contributors as co-authors whenever possible, as is commonly done for other global databases such as FLUXNET (Baldocchi et al., 2001; Knox et al., 2019).
126-
In addition, users should also reference the specific version of the dataset they used (e.g., v0.6.0), access date, and ideally the specific Git commit number.
120+
The database license is CC-BY-4 (https://creativecommons.org/licenses/by/4.0/); see the “LICENSE” file in the repository. This is identical to that used by e.g. FLUXNET Tier 1 and ICOS R1.
121+
In general, this license provides that users may copy and redistribute the database and R package code in any medium or format, adapting and building upon them for any scientific or commercial purpose, as long as appropriate credit is given.
122+
We request that users cite this article and strongly encourage them to (i) cite all constituent dataset primary publications, and (ii) involve data contributors as co-authors whenever possible, as is commonly done for other global databases such as FLUXNET (Baldocchi et al., 2001; Knox et al., 2019).
123+
In addition, users should also reference the specific version of the dataset they used (e.g., v0.6.0), access date, and ideally the specific Git commit number.
127124
This supports reproducibility of any analyses.
128125

129-
Papers or other research products using COSORE should cite this publication.
130-
In addition, users should also reference the specific version of the dataset they used (e.g. `r db_vers`), access date, and ideally the specific Git commit number.
131-
This provides full reproducibility of any analyses.
132-
As noted above, we encourage data users to cite the primary publication for each dataset
133-
they use in analyses as well.
134-
135126
**Data access and use**
136127

137-
COSORE data releases are currently available via the GitHub "Releases" page at https://github.com/bpbond/cosore/releases, although we anticipate that institutional repositories such as ESS-DIVE (https://ess-dive.lbl.gov/) may host releases at some point in the future.
138-
Downloads via this page are flat-file CSV (comma-separated value), and readable by any modern computing system.
139-
Missing values are encoded by a blank (i.e. two successive commas in the CSV format).
140-
A release download is fully self-contained, with full data, metadata, and documentation; a file manifest; a copy of the data license; an introductory vignette; a summary report on the entire database; and an explanatory README with links to this publication.
128+
Major COSORE data releases are available via Zenodo (as noted above), as well as the GitHub “Releases” page at https://github.com/bpbond/cosore/releases; we anticipate that institutional repositories such as ESS-DIVE (Environmental Systems Science Data Infrastructure for a Virtual Ecosystem, https://ess-dive.lbl.gov/) may host releases at some point in the future.
129+
Downloads via this page are flat-file CSV (comma-separated values), and readable by any modern computing system. Missing values are encoded by a blank (i.e. two successive commas in the CSV format).
130+
A release download is fully self-contained, with full data, metadata, and documentation; a file manifest; a copy of the data license; an introductory vignette; a summary report on the entire database; and an explanatory README with links to this publication.
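
Because missing values are encoded as blanks, it may help to tell the CSV reader so explicitly. The sketch below uses a small inline CSV with hypothetical column names (`FLUX` and `NOTE` are placeholders, not actual release columns) in place of a real release file.

```r
# A tiny inline CSV standing in for a release file; column names other
# than CSR_PORT are hypothetical. The second record has two blank fields.
csv_text <- "CSR_PORT,FLUX,NOTE\n1,1.2,ok\n2,,\n"

# na.strings = "" makes blank fields NA in character columns as well;
# numeric columns already treat blanks as NA by default.
dat <- read.csv(text = csv_text, na.strings = "", stringsAsFactors = FALSE)

is.na(dat$FLUX)
is.na(dat$NOTE)
```

Without `na.strings = ""`, the blank `NOTE` field would be read as an empty string rather than `NA`, which is easy to overlook when filtering for missing values.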
141131

142132
```{r dbsize, include=FALSE, cache=TRUE}
143-
db_memsize <- sum(vapply(db$CSR_DATASET, function(x) object.size(csr_dataset(x)$data),
133+
db_memsize <- sum(vapply(db$CSR_DATASET,
134+
FUN = function(x) object.size(csr_dataset(x)$data),
144135
FUN.VALUE = numeric(1))) / 1e6
145136
db_disksize <- system2("git",
146137
args = c("count-objects", "-vH"),
@@ -149,11 +140,12 @@ db_disksize <- db_disksize[grepl("size:",db_disksize)]
149140
db_disksize <- gsub("size: ", "", db_disksize)
150141
```
151142

152-
An alternative way to access COSORE data is to install and use the _cosore_ R (TODO CITATION) package.
153-
This provides a robust framework, including dedicated access functions, dataset and database report generation, and QA/QC (see below).
154-
Because currently the flux data are included in the repository itself, the latter is quite large (compared to most Git repositories) to download, ~`r db_disksize`.
155-
(Note that the data are stored in R's compressed RDS file format; when loaded into memory, the entire database is significantly larger, ~`r round(db_memsize, 0)` MB.)
156-
It thus cannot easily be hosted on CRAN (the Comprehensive R Archive Network), the canonical source for R packages. Installing directly from GitHub is however straightforward using the _devtools_ or _remotes_ packages:
143+
An alternative way to access COSORE data, including minor updates between major releases, is to install and use the _cosore_ R (R Core Team, 2019) package.
144+
This provides a robust framework, including dedicated access functions, dataset and database report generation, and quality assurance and checking (see below).
145+
Because the flux data are currently included in the repository itself, the latter is quite large (compared to most Git repositories), ~`r db_disksize` MB.
146+
(Note that the data are stored in R’s compressed RDS file format; when loaded into memory, the entire database is significantly larger, ~`r round(db_memsize, 0)` MB.)
147+
It thus cannot easily be hosted on CRAN (the Comprehensive R Archive Network), the canonical source for R packages.
148+
Installing directly from GitHub is however straightforward using the devtools or remotes packages:
157149

158150
```
159151
devtools::install_github("bpbond/cosore")
@@ -162,32 +154,32 @@ library(cosore)
162154

163155
Four primary user-facing functions are available:
164156

165-
* *csr_database()* summarizes the entire database in a single convenient data frame, with one row per dataset, and is intended as a high-level overview. It returns a selection of variables summarized in **Tables 2-8** below, including dataset name, longitude, latitude, elevation, IGBP code, number of records, dates, and variables measured.
166-
* *csr_dataset()* returns a single dataset: an R list structure, each element of which is a table (_description_, _contributors_, etc., as described above).
167-
* *csr_table()* collects, into a single data frame, one of the tables of the database, for any or all datasets.
157+
* *csr_database()* summarizes the entire database in a single convenient data frame, with one row per dataset, and is intended as a high-level overview. It returns a selection of variables summarized in **Tables 2-8** below, including dataset name, longitude, latitude, elevation, IGBP code, number of records, dates, and variables measured;
158+
* *csr_dataset()* returns a single dataset: an R list structure, each element of which is a table (_description_, _contributors_, etc., as described above);
159+
* *csr_table()* collects, into a single data frame, one of the tables of the database, for any or all datasets;
168160
* *csr_metadata()* provides metadata information about all fields in all tables.
169161

170162
Two additional reporting functions may also be useful to users:
171163

172-
* *csr_report_database()* generates an HTML report on the entire database: number of datasets, locations, number of observations, distribution of flux values, etc.
173-
* *csr_report_dataset()* generates an HTML report on a single dataset, including tabular and graphical summaries of location, flux data, diagnostics, etc.
164+
* *csr_report_database()* generates an HTML report on the entire database: number of datasets, locations, number of observations, distribution of flux values, etc.;
165+
* *csr_report_dataset()* generates an HTML report on a single dataset, including tabular and graphical summaries of location, flux data, and diagnostics.
174166

175167
Finally, a number of functions are targeted at developers, and include functionality to ingest contributed data, standardize data, and prepare a new release. See the package documentation for more details on these.
176168

177169
_Documentation_
178170

179-
The primary documentation for the COSORE database is this paper.
171+
The primary documentation for the COSORE database is this manuscript.
180172
Both the flat-file releases and `cosore` R package include extensive documentation, including an in-depth vignette included both in the package and online (https://rpubs.com/bpbond/502069).
181173
The R package includes documentation available via R's standard help system.
182174

183175
_Data quality and testing_
184176

185177
When contributed data are imported into COSORE, the package code performs a number of quality assurance checks. These include:
186178

187-
* Timestamp errors, for example illegal dates and times for the specified time zone
188-
* Bad email addresses or ORCID identifiers
189-
* Records with no flux value
190-
* Records for which the analyzer recorded an error condition
179+
* Timestamp errors, for example illegal dates and times for the specified time zone;
180+
* Bad email addresses or ORCID identifiers;
181+
* Records with no flux value;
182+
* Records for which the analyzer recorded an error condition.
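
The checks listed above could be sketched roughly as follows. This is an illustration on toy data under stated assumptions, not the package's actual import code: the column names, the regular expressions, and the choice of UTC are all hypothetical, and the ORCID test checks only the format, not the checksum digit.

```r
# Toy records; column names and values are hypothetical
records <- data.frame(
  TIMESTAMP = c("2019-04-01 12:00:00", "2019-02-30 12:00:00",
                "2019-04-01 13:00:00"),
  FLUX = c(1.5, 2.0, NA),
  stringsAsFactors = FALSE
)

# Timestamp errors: illegal dates (here, 30 February) fail to parse
parsed <- as.POSIXct(records$TIMESTAMP, format = "%Y-%m-%d %H:%M:%S",
                     tz = "UTC")
bad_time <- is.na(parsed)

# Records with no flux value
no_flux <- is.na(records$FLUX)

# Format-only checks for ORCID identifiers and email addresses
orcid_ok <- function(x) {
  grepl("^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]$", x)
}
email_ok <- function(x) {
  grepl("^[^@[:space:]]+@[^@[:space:]]+\\.[^@[:space:]]+$", x)
}

orcid_ok(c("0000-0002-1825-0097", "1234-5678"))
email_ok(c("someone@example.org", "not-an-email"))
```

Each check returns a logical vector, so flagged records can be counted for a diagnostics summary or dropped with ordinary subsetting.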
191183

192184
```{r errors, cache=TRUE}
193185
# Calculate what percent of observations are removed across all datasets
@@ -230,7 +222,7 @@ subdaily_ds_pct <- sum(subdailies, na.rm = TRUE) / nrow(intervals) * 100
230222
subdaily_ds_N <- sum(intervals$N[subdailies], na.rm = TRUE) / sum(intervals$N, na.rm = TRUE) * 100
231223
```
232224

233-
The interval between measurements ranges from `r round(min(intervals$Interval, na.rm = TRUE), 0)` to `r round(max(intervals$Interval, na.rm = TRUE), 0)` minutes, with 25%-50%-75% quantile values of `r q[2]`, `r q[3]`, and `r q[4]` minutes respectively. A one-hour interval between measurements is thus by far the most common choice; `r round(subdaily_ds_pct, 0)`% of the datasets, and `r round(subdaily_ds_N, 3)`% of the data, provide sub-daily temporal resolution.
225+
The interval between measurements ranges from `r round(min(intervals$Interval, na.rm = TRUE), 0)` to `r round(max(intervals$Interval, na.rm = TRUE), 0)` minutes, with 25%-50%-75% quantile values of `r q[2]`, `r q[3]`, and `r q[4]` minutes respectively. A one-hour interval between measurements is thus by far the most common choice. Currently `r round(subdaily_ds_pct, 0)`% of the datasets, and `r round(subdaily_ds_N, 3)`% of the data, provide sub-daily temporal resolution.
234226

235227

236228
# Tables
@@ -309,7 +301,7 @@ make_table(db_fields, "columns")
309301
make_table(db_fields, "diagnostics")
310302
```
311303

312-
**Figure 1.** World satellite image with COSORE data submission sites shown as blue diamonds. Areas with multiple submissions are shown as darker.
304+
**Figure 2.** World satellite image with COSORE data submission sites shown as blue diamonds. Areas with multiple submissions are shown as darker.
313305

314306
```{r worldmap, message = FALSE, echo = FALSE, warning = FALSE}
315307
bbox <- make_bbox(lon = c(-160, 150), lat = c(-50, 70)) # make a coordinate box
@@ -324,10 +316,10 @@ p <- ggmap(map) +
324316
print(p)
325317
326318
ggsave_quiet <- function(...) suppressMessages(ggsave(...))
327-
ggsave_quiet("figures/figure1-map.png")
319+
ggsave_quiet("figures/figure2-map.png")
328320
```
329321

330-
**Figure 2.** Climate space figure.
322+
**Figure 3.** Climate space figure.
331323

332324
```{r worldmap-pkgs, include=FALSE}
333325
# Note raster masks 'tidyr::extract' and 'dplyr::select'
@@ -393,7 +385,7 @@ p <- ggplot() +
393385
theme_minimal() +
394386
labs(x = "MAT (°C)", y = "MAP (mm)")
395387
print(p)
396-
ggsave_quiet("figures/figure2-climate.png")
388+
ggsave_quiet("figures/figure3-climate.png")
397389
```
398390

399391
```{r waffle-prep, echo=FALSE}
