
Commit 24b60ad

anayeaye and smohiudd authored

Docs: promoting to production (#162)

Update dataset ingest links and add a new overview of publishing data to production.

Co-authored-by: smohiudd <[email protected]>

1 parent 1a8b557 · commit 24b60ad

5 files changed (+74 −444 lines)

.markdownlint-cli2.jsonc (+4 −1)

```diff
@@ -1,5 +1,8 @@
 {
   "config": {
-    "MD013": false // disable line length checks
+    "MD013": false, // disable line length checks
+    "MD033": {
+      "allowed_elements": [ "a", "b", "details", "summary" ]
+    }
   }
 }
```

README.md (+70 −28)

```diff
@@ -2,25 +2,23 @@
 
 [![GitHub Workflow Status (with event)](https://img.shields.io/github/actions/workflow/status/nasa-impact/veda-data/ci.yaml?style=for-the-badge&label=CI)](https://github.com/NASA-IMPACT/veda-data/actions/workflows/ci.yaml)
 
-This repository houses data used to define a VEDA dataset to load into the [VEDA catalog](https://nasa-impact.github.io/veda-docs/services/apis.html). Inclusion in the VEDA catalog is a prerequisite for displaying the dataset in the [VEDA Dashboard](https://www.earthdata.nasa.gov/dashboard/).
+This repository houses config data used to load datasets into the [VEDA catalog](https://nasa-impact.github.io/veda-docs/services/apis.html). Inclusion in the VEDA catalog is a prerequisite for displaying datasets in the [VEDA Dashboard](https://www.earthdata.nasa.gov/dashboard/).
 
-The data provided here gets processed in the ingestion system [veda-data-airflow](https://github.com/NASA-IMPACT/veda-data-airflow), to which this repository is directly linked (as a Git submodule).
+The config data provided here gets processed in the [veda-data-airflow](https://github.com/NASA-IMPACT/veda-data-airflow) ingestion system. See [Dataset submission process](#dataset-submission-process) for details about submitting work to the ingestion system.
 
-## Dataset Submission Process
+## Dataset submission process
 
-The VEDA user docs explain the full [dataset submission process](https://nasa-impact.github.io/veda-docs/contributing/dataset-ingestion/).
+![veda-data-publication][veda-data-publication]
 
-Ultimately, submission to the VEDA catalog requires that you [open an issue with the "new dataset" template](https://github.com/NASA-IMPACT/veda-data/issues/new?assignees=&labels=dataset&projects=&template=new-dataset.yaml&title=New+Dataset%3A+%3Cdataset+title%3E). This template will require, at minimum:
+To add data to VEDA you will:
 
-1. a description of the dataset
-2. the location of the data (in S3, CMR, etc.), and
-3. a point of contact for the VEDA team to collaborate with.
+1. **Stage your files:** Upload files to the staging bucket `s3://veda-data-store-staging` (which you can do with a VEDA JupyterHub account; request access [here](https://nasa-impact.github.io/veda-docs/services/jupyterhub.html)) or to a self-hosted S3 bucket that shares read access with the VEDA service.
 
-One or more notebooks showing how the data should be processed would be appreciated.
+2. **Generate STAC metadata in the staging catalog:** Metadata must first be added to the staging catalog [staging.openveda.cloud/api/stac](https://staging.openveda.cloud/api/stac). You will need to create a dataset config file and submit it to the `/workflows/dataset/publish` endpoint to generate STAC Collection metadata and Item records for the files you uploaded in Step 1. See the detailed steps for the [dataset submission process](https://nasa-impact.github.io/veda-docs/contributing/dataset-ingestion/) in the contributing section of [veda-docs](https://nasa-impact.github.io/veda-docs), where you can also find a full ingestion workflow example in the [geoglam ingest notebook](https://nasa-impact.github.io/veda-docs/contributing/dataset-ingestion/example-template/example-geoglam-ingest.html).
 
-## Ingestion Data Structure
+3. **Acceptance testing\*:** Perform acceptance testing appropriate for your data. \*In most cases this means opening a dataset PR in [veda-config](https://github.com/NASA-IMPACT/veda-config) to generate a dashboard preview of the data. See [veda-docs/contributing/dashboard-configuration](https://nasa-impact.github.io/veda-docs/contributing/dashboard-configuration/dataset-configuration.html) for instructions on generating a dashboard preview.
 
-When submitting STAC records to ingest, a pull request can be made with the data structured as described below.
+4. **Promote to production!** Open a PR in the [veda-data](https://github.com/NASA-IMPACT/veda-data) repo with the dataset config metadata you used to add your data to the staging catalog in Step 2. Add your config to `ingestion-data/production/dataset-config`. When your PR is approved, this configuration will be used to generate records in the production VEDA catalog!
 
-### `collections/`
+5. **[Optional] Share your data:** Share your data in the [VEDA Dashboard](https://www.earthdata.nasa.gov/dashboard/) by submitting a PR to [veda-config](https://github.com/NASA-IMPACT/veda-config) ([see veda-docs/contributing/dashboard-configuration](https://nasa-impact.github.io/veda-docs/contributing/dashboard-configuration/dataset-configuration.html)) and add JupyterHub-hosted usage examples to [veda-docs/contributing/docs-and-notebooks](https://nasa-impact.github.io/veda-docs/contributing/docs-and-notebooks.html).
 
```
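For Step 1 above, a minimal staging sketch, assuming AWS credentials with write access to `s3://veda-data-store-staging`; the local filename and key prefix here are hypothetical examples, not a required convention:

```python
# Minimal sketch of staging a file for ingest (Step 1). Assumes AWS
# credentials with write access to the staging bucket; the file name
# and key prefix below are hypothetical.
import boto3

s3 = boto3.client("s3")

collection_id = "my-collection"               # hypothetical collection id
local_file = "my-collection_2020-01-01.tif"   # hypothetical COG to stage

# Stage the file under a collection-specific prefix in the staging bucket.
s3.upload_file(
    Filename=local_file,
    Bucket="veda-data-store-staging",
    Key=f"{collection_id}/{local_file}",
)
```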

````diff
@@ -27,7 +25,19 @@
-The `ingestion-data/collections/` directory holds json files representing the data for VEDA collection metadata (STAC).
+## Project ingestion data structure
+
+When submitting STAC records for ingestion, a pull request can be made with the data structured as described below. The `ingestion-data/` directory contains artifacts of the ingestion configuration used to publish to the staging and production catalogs.
+
+> **Note**
+> Various ingestion workflows are supported and documented below, but only the configuration metadata used to publish to the VEDA catalog is stored in this repo. It is not expected that every ingestion will follow exactly the same pattern, nor will each ingested collection have all types of configuration metadata here. The primary method used to ingest is [**`dataset-config`**](#stagedataset-config).
+
+### `<stage>/collections/`
+
+The `ingestion-data/collections/` directory holds json files representing the data for VEDA collection metadata (STAC). STAC Collection metadata can be generated from an id, title, and description using Pystac. See this [veda-docs/contributing notebook example](https://nasa-impact.github.io/veda-docs/notebooks/veda-operations/stac-collection-creation.html) to get started.
 
 Should follow the following format:
 
+<details>
+<summary><b>/collections/collection_id.json</b></summary>
+
 ```json
 {
   "id": "<collection-id>",
````
"id": "<collection-id>",
````diff
@@ -105,34 +115,30 @@ Should follow the following format:
 
 ```
 
-### `discovery-items/`
+</details>
+
+### `<stage>/discovery-items/`
 
-The `ingestion-data/discovery-items/` directory holds json files representing the step function inputs for initiating the discovery, ingest and publication workflows.
+The `ingestion-data/discovery-items/` directory holds json files representing the inputs for initiating the discovery, ingest and publication workflows.
 Can either be a single input event or a list of input events.
 
 Should follow the following format:
 
+<details>
+<summary><b>/discovery-items/collection_id.json</b></summary>
+
 ```json
 {
   "collection": "<collection-id>",
-  "discovery": "<s3/cmr>",
 
   ## for s3 discovery
   "prefix": "<s3-key-prefix>",
   "bucket": "<s3-bucket>",
   "filename_regex": "<filename-regex>",
   "datetime_range": "<month/day/year>",
 
-  ## for cmr discovery
-  "version": "<collection-version>",
-  "temporal": ["<start-date>", "<end-date>"],
-  "bounding_box": ["<bounding-box-as-comma-separated-LBRT>"],
-  "include": "<filename-pattern>",
-
   ### misc
-  "cogify": "<true/false>",
-  "upload": "<true/false>",
-  "dry_run": "<true/false>",
+  "dry_run": "<true/false>"
 }
 ```
 
````
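For concreteness, a hypothetical s3-discovery input following the schema above, expressed here as a Python dict; every value is illustrative only:

```python
# Hypothetical s3-discovery input matching the schema above.
# All values are illustrative, not a real ingestion config.
discovery_item = {
    "collection": "my-collection",            # illustrative collection id
    "prefix": "my-collection/",               # key prefix of staged files
    "bucket": "veda-data-store-staging",      # staging bucket from Step 1
    "filename_regex": "^(.*)\\.tif$",         # match staged GeoTIFFs
    "datetime_range": "month",                # one of month/day/year
    "dry_run": "false",
}
```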

````diff
@@ -139,6 +145,11 @@
-### `dataset-config/`
+</details>
 
-The `ingestion-data/dataset-config/` directory holds json files that can be used with the `dataset/publish` stac ingestor endpoint, combining both collection metadata and discovery items. For an example of this ingestion workflow, see this [jupyter notebook](./transformation-scripts/example-template/example-geoglam-ingest.ipynb).
+### `<stage>/dataset-config/`
+
+The `ingestion-data/dataset-config/` directory holds json files that can be used with the `dataset/publish` workflows endpoint, combining both collection metadata and discovery items. For an example of this ingestion workflow, see the [geoglam ingest notebook in nasa-impact.github.io/veda-docs/contributing/dataset-ingestion](https://nasa-impact.github.io/veda-docs/contributing/dataset-ingestion/example-template/example-geoglam-ingest.html).
+
+<details>
+<summary><b>/dataset-config/collection_id.json</b></summary>
 
 ```json
 {
````
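A rough sketch of submitting such a dataset-config file to the workflows publish endpoint (Step 2 of the submission process). The base URL and bearer-token auth shown here are assumptions; the linked geoglam notebook documents the actual authentication flow:

```python
# Rough sketch of submitting a dataset-config file to the workflows
# publish endpoint (Step 2). The base URL and bearer-token header are
# assumptions; follow the linked veda-docs notebook for the real flow.
import json

import requests

with open("ingestion-data/staging/dataset-config/my-collection.json") as f:
    dataset_config = json.load(f)

response = requests.post(
    "https://staging.openveda.cloud/api/workflows/dataset/publish",  # assumed URL
    headers={"Authorization": "Bearer <token>"},  # token acquisition not shown
    json=dataset_config,
    timeout=30,
)
response.raise_for_status()
print(response.json())
```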
````diff
@@ -170,5 +181,34 @@ The `ingestion-data/dataset-config/` directory holds json files that can be used
     }
   ]
 }
+
+```
+
+</details>
+
+### `production/transfer-config`
+
+This directory contains the configuration needed to execute a stand-alone airflow DAG that copies data from a specified staging bucket and prefix to a permanent location in `s3://veda-data-store`, using the collection_id as a prefix.
+
+Should follow the following format:
+
+<details>
+<summary><b>/production/transfer-config/collection_id.json</b></summary>
+
+```json
+{
+  "collection": "<collection-id>",
+
+  ## the location of the staged files
+  "origin_bucket": "<s3-bucket>",
+  "origin_prefix": "<s3-key-prefix>",
+  "bucket": "<s3-bucket>",
+  "filename_regex": "<filename-regex>",
+
+  ### misc
+  "dry_run": "<true/false>"
+}
 ```
 
+</details>
+
````
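To illustrate what a transfer-config describes, a dry-run-style sketch that lists the staged objects such a config would match and the destination keys the airflow DAG would copy them to. All values are placeholders; the real copy logic lives in veda-data-airflow:

```python
# Dry-run-style sketch of a transfer-config: objects under
# origin_bucket/origin_prefix matching filename_regex get copied to
# s3://veda-data-store/<collection-id>/. Values are placeholders; the
# actual copy is performed by the DAG in veda-data-airflow.
import re

import boto3

config = {
    "collection": "my-collection",
    "origin_bucket": "veda-data-store-staging",
    "origin_prefix": "my-collection/",
    "bucket": "veda-data-store",
    "filename_regex": r".*\.tif$",
    "dry_run": "true",
}

s3 = boto3.client("s3")
pattern = re.compile(config["filename_regex"])

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(
    Bucket=config["origin_bucket"], Prefix=config["origin_prefix"]
):
    for obj in page.get("Contents", []):
        if pattern.match(obj["Key"]):
            filename = obj["Key"].split("/")[-1]
            # The DAG would copy this object to its permanent location.
            print(
                f"s3://{config['origin_bucket']}/{obj['Key']}"
                f" -> s3://{config['bucket']}/{config['collection']}/{filename}"
            )
```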
175215
````diff
@@ -175,6 +215,6 @@
 ## Validation
 
-This repository provides a script for validating all collections.
+This repository provides a script for validating all collections in the ingestion-data directory.
 First, install the requirements (preferably in a virtual environment):
 
 ```shell
@@ -212,3 +252,5 @@ pip-compile
 ```
 
 This will update `requirements.txt` with a complete, realized set of Python dependencies.
+
+[veda-data-publication]: ./docs/publishing-data.excalidraw.png
````
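The validation script itself is not part of this diff; as a rough stand-in, assuming the `pystac[validation]` extra is installed and the `<stage>/collections/` layout described above, collection JSON can be checked like this:

```python
# Rough stand-in for the repo's validation script (not shown in this
# diff): validate every collection JSON under ingestion-data/ against
# the STAC spec. Requires the pystac[validation] extra.
import json
from pathlib import Path

from pystac.validation import validate_dict

for path in Path("ingestion-data").glob("*/collections/*.json"):
    with open(path) as f:
        stac_dict = json.load(f)
    # Raises pystac.errors.STACValidationError on invalid metadata.
    validate_dict(stac_dict)
    print(f"OK: {path}")
```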

docs/publishing-data.excalidraw.png (58.2 KB)

Binary file not shown.
