essd/essd_ms.Rmd
_Database and dataset structure_
The database is structured as a collection of independent contributed datasets, all of which have been standardized to a common structure and units.
Each dataset is given a reference name (internal to COSORE) that links its constituent tables and provides a point of reference in reports.
Each constituent dataset normally has a series of separate data tables:
* _description_ (**Table 2**) describes site and dataset characteristics;
* _contributors_ (**Table 3**) lists individuals who contributed to the measurement, analysis, curation, and/or submission of the dataset;
* _ports_ (**Table 4**) gives the different _ports_ (generally equivalent to separate measurement chambers) in use, and what each is measuring: flux, species, and treatment, as well as characteristics of the measurement collar;
* _data_ (**Table 5**), the central table of the dataset, records flux observations;
* _columns_ (**Supplementary Table S2**) maps raw data columns to standard COSORE columns, providing a record for reproducibility; and
* _diagnostics_ (**Supplementary Table S3**) provides statistics on the data import process: errors, columns and rows dropped, etc.
The common key linking these dataset tables is the CSR_DATASET field, which records the unique name assigned to the dataset. In addition, a CSR_PORT key field links the _ports_ and _data_ tables. These links make it straightforward to extract datasets that have measured particular fluxes in
certain ecosystem types, or isolate only non-treatment (control) chamber fluxes, for example.
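As a brief illustration (a sketch, not code from the COSORE package itself), the example below joins the _ports_ and _data_ tables on these keys and keeps only control-chamber fluxes. It assumes the two tables have already been loaded as data frames, for example via the `csr_table()` accessor described under "Data access and use" below, and that the treatment field in _ports_ is named CSR_TREATMENT with control chambers coded as "None"; these names are assumptions and should be checked against the table metadata.

```
# Sketch: isolate control (non-treatment) chamber fluxes by joining the
# ports and data tables on their shared keys.
# Assumptions, not guaranteed by this paper: `ports` and `fluxes` hold the
# _ports_ and _data_ tables (e.g. ports <- csr_table("ports");
# fluxes <- csr_table("data")), and the treatment column is named
# CSR_TREATMENT with control chambers coded as "None".
control_ports <- ports[ports$CSR_TREATMENT == "None", c("CSR_DATASET", "CSR_PORT")]
control_fluxes <- merge(fluxes, control_ports, by = c("CSR_DATASET", "CSR_PORT"))
```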
_Versioning and archiving_
COSORE uses semantic versioning (https://semver.org/), meaning that its version numbers
generally follow an "x.y.z" format, where _x_ is the major version number (changing only when there are major changes to the database or package structure and/or function, in a manner that may break existing scripts using the data); _y_ is the minor version number (typically changing with significant data updates); and _z_ the patch number (bug fixes, documentation upgrades, or other changes that are completely backwards compatible).
Following each official (major) release, a DOI will be issued and the data permanently archived by Zenodo (https://zenodo.org/).
All changes to the data or codebase are immediately available through the GitHub repository, but only official releases will be issued a DOI; we anticipate this happening on an approximately annual basis.
_Data license and citation_
The database license is CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/); see the “LICENSE” file in the repository. This is identical to that used by, e.g., FLUXNET Tier 1 and ICOS R1.
In general, this license provides that users may copy and redistribute the database and R package code in any medium or format, adapting and building upon them for any scientific or commercial purpose, as long as appropriate credit is given.
We request that users cite this article and strongly encourage them to (i) cite all constituent dataset primary publications, and (ii) involve data contributors as co-authors whenever possible, as is commonly done for other global databases such as FLUXNET (Baldocchi et al., 2001; Knox et al., 2019).
In addition, users should reference the specific version of the dataset they used (e.g., v0.6.0), access date, and ideally the specific Git commit number.
This supports reproducibility of any analyses.
**Data access and use**
Major COSORE data releases are available via Zenodo (as noted above), as well as the GitHub “Releases” page at https://github.com/bpbond/cosore/releases; we anticipate that institutional repositories such as ESS-DIVE (Environmental Systems Science Data Infrastructure for a Virtual Ecosystem, https://ess-dive.lbl.gov/) may host releases at some point in the future.
Downloads via this page are flat-file CSV (comma-separated values), and readable by any modern computing system. Missing values are encoded by a blank (i.e. two successive commas in the CSV format).
A release download is fully self-contained, with full data, metadata, and documentation; a file manifest; a copy of the data license; an introductory vignette; a summary report on the entire database; and an explanatory README with links to this publication.
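For example (a minimal sketch; the file name below is a placeholder, not necessarily a file shipped in a release), a table from a flat-file release can be read with base R while treating blank fields as missing:

```
# Sketch: read one CSV table from a flat-file release, treating blank fields
# (two successive commas) as missing values. "data.csv" is a placeholder file
# name; consult the release manifest for the actual file names.
flux_data <- read.csv("data.csv", na.strings = "", stringsAsFactors = FALSE)
```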
An alternative way to access COSORE data, including minor updates between major releases, is to install and use the _cosore_ R (R Core Team, 2019) package.
This provides a robust framework, including dedicated access functions, dataset and database report generation, and quality assurance and checking (see below).
Because the flux data are currently included in the repository itself, the latter is quite large (compared to most Git repositories), ~`r db_disksize` MB.
(Note that the data are stored in R’s compressed RDS file format; when loaded into memory, the entire database is significantly larger, ~`r round(db_memsize, 0)` MB.)
It thus cannot easily be hosted on CRAN (the Comprehensive R Archive Network), the canonical source for R packages.
Installing directly from GitHub is, however, straightforward using the _devtools_ or _remotes_ packages:
```
devtools::install_github("bpbond/cosore")
library(cosore)
```
Four primary user-facing functions are available:
* *csr_database()* summarizes the entire database in a single convenient data frame, with one row per dataset, and is intended as a high-level overview. It returns a selection of variables summarized in **Tables 2-8** below, including dataset name, longitude, latitude, elevation, IGBP code, number of records, dates, and variables measured;
* *csr_dataset()* returns a single dataset: an R list structure, each element of which is a table (_description_, _contributors_, etc., as described above);
* *csr_table()* collects, into a single data frame, one of the tables of the database, for any or all datasets;
* *csr_metadata()* provides metadata information about all fields in all tables.
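A minimal usage sketch follows; the dataset name is a placeholder, and the exact argument forms are assumptions that should be checked against the package help.

```
# Sketch of the primary accessors. "d20200101_EXAMPLE" is a placeholder
# dataset name, and the argument forms shown are assumptions; consult the
# package help (?csr_dataset, etc.) for the exact signatures.
db_overview  <- csr_database()                    # one row per dataset
one_dataset  <- csr_dataset("d20200101_EXAMPLE")  # list of tables for one dataset
descriptions <- csr_table("description")          # one table across all datasets
field_info   <- csr_metadata()                    # metadata for all fields
```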
Two additional reporting functions may also be useful to users:
* *csr_report_database()* generates an HTML report on the entire database: number of datasets, locations, number of observations, distribution of flux values, etc.;
* *csr_report_dataset()* generates an HTML report on a single dataset, including tabular and graphical summaries of location, flux data, and diagnostics.
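For example (again a sketch, with a placeholder dataset name and an assumed argument form; see the package documentation for the exact signatures):

```
# Sketch: generate HTML reports. The single-dataset call assumes the function
# takes a dataset name; "d20200101_EXAMPLE" is a placeholder.
csr_report_database()
csr_report_dataset("d20200101_EXAMPLE")
```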
Finally, a number of functions are targeted at developers, and include functionality to ingest contributed data, standardize data, and prepare a new release. See the package documentation for more details on these.
_Documentation_
The primary documentation for the COSORE database is this manuscript.
Both the flat-file releases and the `cosore` R package include extensive documentation, notably an in-depth vignette available both within the package and online (https://rpubs.com/bpbond/502069).
The R package includes documentation available via R's standard help system.
_Data quality and testing_
When contributed data are imported into COSORE, the package code performs a number of quality assurance checks. These include:
* Timestamp errors, for example illegal dates and times for the specified time zone;
* Bad email addresses or ORCID identifiers;
* Records with no flux value;
* Records for which the analyzer recorded an error condition.
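To give a flavor of such a check (an illustrative sketch only, not the package's actual implementation), an invalid timestamp can be detected by attempting to parse it in the dataset's declared time zone:

```
# Sketch: timestamps that cannot be parsed in the declared time zone come
# back as NA and can be flagged. This illustrates the idea only; it is not
# the COSORE package's actual QA code.
ts <- c("2019-06-01 12:30:00", "2019-02-30 12:30:00")  # second date is illegal
parsed <- as.POSIXct(ts, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
bad_timestamp <- is.na(parsed)  # TRUE where the timestamp is invalid
```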
```{r errors, cache=TRUE}
# Calculate what percent of observations are removed across all datasets
```
The interval between measurements ranges from `r round(min(intervals$Interval, na.rm = TRUE), 0)` to `r round(max(intervals$Interval, na.rm = TRUE), 0)` minutes, with 25%-50%-75% quantile values of `r q[2]`, `r q[3]`, and `r q[4]` minutes, respectively. A one-hour interval between measurements is thus by far the most common choice. Currently, `r round(subdaily_ds_pct, 0)`% of the datasets, and `r round(subdaily_ds_N, 3)`% of the data, provide sub-daily temporal resolution.