Description
As a researcher from outside CLIMAS who requires climate simulation data for my analysis, I want a columnar dataset of geo-located climate simulation results so that I can confidently use this information to further my own research.
Problem
It is difficult for researchers outside of the climate field to make use of the sophisticated simulations that have been produced. There are many climate simulation models, their results are expressed in units that will be unfamiliar to researchers outside of climate research, and the NetCDF file format they are distributed in is inconvenient for many researchers.
It would be of great value to the Illinois campus and the research world beyond if the experts in the CLIMAS department could select and curate a public dataset that can be applied to these other problem spaces.
Description
A data product is a curated, documented, and maintained dataset that has been thoughtfully prepared to be immediately useful for its intended audience. Unlike raw data, a data product incorporates consistent quality controls, clear documentation, and reliable update schedules, treating the data with the same rigor we apply to software products.
For researchers, a well-designed data product means spending less time on data wrangling and more time on actual research. By maintaining downscaled climate simulation data as a data product, we ensure that every researcher has access to consistently formatted, validated, and documented data that can be seamlessly integrated into their analysis pipelines. This approach eliminates the common frustrations of dealing with inconsistent formats, missing documentation, or unclear provenance.
Just as importantly, when data is treated as a product, it creates a feedback loop between data providers and users, allowing for continuous improvements based on real research needs. This means the data product can evolve to better serve the climate research community while maintaining backward compatibility and version control, essential features for reproducible science.
Approach
Using the downscaled climate datasets identified by Maile Sasaki in her Climate Map Repo, the department will attempt to reach consensus on how best to interpret this data for common downstream uses. Particular priority will be given to the Illinois Department of Public Health application; however, other uses will be considered.
For development purposes, the first product will have a monthly temporal dimension. A second product with a daily frequency will presumably follow.
The product will be produced by a Dagster workflow that calls a Python script to summarize the raw datasets and produce a collection of GeoParquet files stored in a public Open Storage Network bucket. The workflow can use the Dask Gateway to scale up the transformation, with the Dask job launched and monitored by Dagster.
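As a rough illustration, a minimal Dagster asset along these lines might look like the sketch below. Everything in it is an assumption for discussion: the raw-variable names (tasmax, tasmin, pr), the raw/*.nc input glob, the osn-bucket output location, and the worker count are placeholders, and unit conversion and GeoParquet geometry handling are omitted.

```python
# A hypothetical Dagster asset; names, paths, and variable ids are placeholders.
import dagster as dg
import xarray as xr
from dask_gateway import Gateway


@dg.asset
def monthly_climate_geoparquet() -> None:
    """Summarize raw downscaled NetCDF into monthly Parquet files."""
    # Dagster launches and monitors the Dask job, as described above.
    gateway = Gateway()
    cluster = gateway.new_cluster()
    cluster.scale(8)  # placeholder worker count
    client = cluster.get_client()
    try:
        # Hypothetical location and variable names for the raw simulation output.
        ds = xr.open_mfdataset("raw/*.nc", parallel=True)
        # Conversion from the models' native units is omitted here.
        monthly = xr.Dataset(
            {
                "tmax_c": ds["tasmax"].resample(time="1MS").max(),
                "tmin_c": ds["tasmin"].resample(time="1MS").min(),
                "precip_mm": ds["pr"].resample(time="1MS").sum(),
            }
        )
        df = monthly.to_dataframe().reset_index()
        # One file per month, matching the naming convention in the spec below.
        for month, group in df.groupby(df["time"].dt.strftime("%Y-%m")):
            group.to_parquet(f"s3://osn-bucket/climate_data_{month}.parquet")
    finally:
        client.close()
        cluster.shutdown()
```

A production version would presumably write true GeoParquet (point geometries attached via geopandas) rather than plain Parquet, and would configure the OSN endpoint explicitly.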
The resulting dataset will be assigned a version along with a DOI. This will encourage periodic improvements of the dataset as experience is gained from working with other researchers. Furthermore, as more raw data is published, the data product can be extended to include more dates or reflect new versions of the models.
The data product will also include metadata explaining how the data was produced and documenting its provenance.
Downscaled Climate Simulation Data Product Specification
This is an AI-generated first draft of what the data product specification might look like. It should at least be a good jumping-off point for discussion.
Overview
This specification defines the structure and format of downscaled climate simulation data provided by the University of Illinois Department of Climate, Meteorology & Atmospheric Sciences (CLIMAS). The data product provides monthly climate metrics at specified geographical coordinates.
Data Product Details
Owner
- Organization: University of Illinois
- Department: Climate, Meteorology & Atmospheric Sciences
- Contact: [To be specified]
Data Format
- File Format: GeoParquet
- Temporal Resolution: Monthly
- Spatial Resolution: [To be specified] degrees
Data Dimensions
- Temporal
  - Granularity: Monthly
  - Format: YYYY-MM
  - Time Zone: UTC
- Spatial
  - Latitude
    - Type: Float
    - Range: -90.0 to 90.0
    - Resolution: [To be specified] degrees
  - Longitude
    - Type: Float
    - Range: -180.0 to 180.0
    - Resolution: [To be specified] degrees
Measures
- Maximum Surface Air Temperature
  - Description: Highest surface air temperature simulated for the month
  - Unit: Degrees Celsius
  - Data Type: Float
  - Precision: 0.1°C
  - Missing Value Indicator: null
- Minimum Surface Air Temperature
  - Description: Lowest surface air temperature simulated for the month
  - Unit: Degrees Celsius
  - Data Type: Float
  - Precision: 0.1°C
  - Missing Value Indicator: null
- Precipitation
  - Description: Total monthly precipitation
  - Unit: Millimeters
  - Data Type: Float
  - Precision: 0.1 mm
  - Missing Value Indicator: null
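For concreteness, the dimensions and measures above could map onto a pyarrow schema like the following. The column names and the float32 width are illustrative assumptions, not part of the spec.

```python
import pyarrow as pa

# Illustrative column layout; field names and float32 are assumptions.
climate_schema = pa.schema(
    [
        pa.field("time", pa.string()),        # YYYY-MM, UTC
        pa.field("latitude", pa.float32()),   # -90.0 to 90.0
        pa.field("longitude", pa.float32()),  # -180.0 to 180.0
        pa.field("tmax_c", pa.float32()),     # monthly max surface air temp, °C
        pa.field("tmin_c", pa.float32()),     # monthly min surface air temp, °C
        pa.field("precip_mm", pa.float32()),  # total monthly precipitation, mm
    ]
)
```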
File Structure
- File Naming Convention: climate_data_[YYYY-MM].parquet
- Partition Strategy: Monthly files
Data Quality Requirements
- Completeness
  - No missing values for spatial coordinates
  - Missing measure values must be explicitly marked as null
  - Complete coverage of the specified geographical area
- Validity
  - All values must be within physically possible ranges (see the validation sketch after this list):
    - Temperature: -90.0°C to 60.0°C
    - Precipitation: 0.0 mm to 2000.0 mm per month
- Consistency
  - Minimum temperature must not exceed maximum temperature
  - No negative precipitation values
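These checks are mechanical enough to sketch. Assuming the illustrative column names from the schema above, a minimal pandas validator might be:

```python
import pandas as pd


def validate(df: pd.DataFrame) -> None:
    """Raise AssertionError if a monthly file violates the quality requirements."""
    # Completeness: spatial coordinates may never be missing.
    assert df[["latitude", "longitude"]].notna().all().all()
    # Validity: physically possible ranges (nulls are permitted and skipped).
    # The precipitation check also enforces non-negativity.
    assert df["tmax_c"].dropna().between(-90.0, 60.0).all()
    assert df["tmin_c"].dropna().between(-90.0, 60.0).all()
    assert df["precip_mm"].dropna().between(0.0, 2000.0).all()
    # Consistency: minimum must not exceed maximum temperature.
    both = df.dropna(subset=["tmin_c", "tmax_c"])
    assert (both["tmin_c"] <= both["tmax_c"]).all()
```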
Metadata Requirements
The GeoParquet file must include the following metadata (one way to embed these keys is sketched after the list):
- Data generation timestamp
- Source model identifier
- Downscaling method used
- Original resolution
- Version number
- Quality control flags
- Processing history
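Assuming the key names shown here (the spec does not fix them), the required fields could be merged into the Parquet key-value metadata alongside the GeoParquet "geo" key:

```python
import json
import pyarrow.parquet as pq

table = pq.read_table("climate_data_2020-01.parquet")  # hypothetical file
product_meta = {
    "generated_at": "2025-01-01T00:00:00Z",  # data generation timestamp
    "source_model": "[source model id]",
    "downscaling_method": "[method]",
    "original_resolution": "[resolution]",
    "version": "1.0",
    "quality_control": "[QC flags]",
    "processing_history": json.dumps(["summarize", "validate"]),
}
# Preserve existing metadata (including the GeoParquet "geo" key) and merge ours.
merged = {**(table.schema.metadata or {}),
          **{k.encode(): v.encode() for k, v in product_meta.items()}}
pq.write_table(table.replace_schema_metadata(merged), "climate_data_2020-01.parquet")
```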
Distribution
- Access Method
  - S3 bucket on the Open Storage Network
  - Public access
  - Anonymous authentication allowed (see the read example after this list)
- Update Frequency
  - [To be specified: e.g., Monthly, Quarterly]
  - Update notification mechanism
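From a consumer's point of view, anonymous access could be as simple as the following; the bucket name and OSN endpoint URL are placeholders:

```python
import pandas as pd

df = pd.read_parquet(
    "s3://osn-bucket/climate_data_2020-01.parquet",  # placeholder bucket/path
    storage_options={
        "anon": True,  # anonymous access, per the distribution terms above
        "client_kwargs": {"endpoint_url": "https://osn-endpoint.example.org"},  # placeholder
    },
)
```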
Version History
- Version: 1.0
- Date: [Current Date]
- Status: Draft
Supporting Documentation Requirements
- Methodology documentation
  - Description of downscaling approach
  - Validation methods
  - Uncertainty quantification
- Usage guidelines
  - Recommended applications
  - Known limitations
  - Best practices for data handling