Andy edits #16
Open · wants to merge 17 commits into main
1 change: 1 addition & 0 deletions _quarto.yml
@@ -9,5 +9,6 @@ manuscript:
format:
  agu-html:
    keep-tex: false
    number-depth: 1
  agu-pdf:
    keep-tex: true
31 changes: 17 additions & 14 deletions paper.qmd
@@ -71,7 +71,7 @@ author:


abstract: |
The Hierarchical Data Format (HDF) is a common archival format for n-dimensional scientific data; it has been used to store valuable information from astrophysics to earth sciences and everything in between. As flexible and powerful as HDF can be, it comes with significant trade-offs when files on remote storage systems are accessed over HTTPS (Hypertext Transfer Protocol Secure), mainly because the file format and the client I/O libraries were designed for local and supercomputing workflows. As scientific data and workflows migrate to the cloud, efficient access to data stored in HDF format is a key factor that will accelerate "science in the cloud" across all disciplines.
We present an implementation of recently available features in the HDF5 stack that results in performant access to HDF data from remote cloud storage. This performance is on par with modern cloud-native formats like Zarr, but with the advantage of not having to reformat data or generate metadata sidecar files (DMR++, Kerchunk). Our benchmarks also show potential cost savings for data producers if their data are processed using cloud-optimized strategies.

keywords: ["cloud-native","cloud", "HDF5", "NASA", "ICESat-2"]
@@ -86,43 +86,46 @@ keep-tex: true
date: last-modified
---

## The Problem

Scientific data from NASA and other agencies are increasingly being archived in and distributed from the commercial cloud. Cloud storage enables large-scale workflows and should reduce local storage costs. It also allows the use of scalable, on-demand cloud computing resources. However, the majority of these data are stored in a format that was not designed for the cloud: the Hierarchical Data Format (HDF).

The most recent version of the Hierarchical Data Format is HDF5, a common archival format for n-dimensional scientific data. HDF5 is used to store data from astrophysics to earth sciences and everything in between. However, as flexible and powerful as HDF5 can be, it comes with significant trade-offs when it is accessed from remote storage systems.

HDF5 is a complex file format; we can think of it as a file system within a file, organized as a tree-like structure with multiple data types and native data structures. Because of this complexity, the most reliable way to access data stored in this format is through the HDF5 C API^[Application Programming Interface]. Regardless of access pattern, nearly all tools ultimately rely on the HDF5 C library, and this brings two issues that affect the efficiency of accessing this format over the network: metadata fragmentation and a global API lock.
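
Before turning to those issues, the following minimal `h5py` sketch illustrates the tree-like layout described above. The file, group, and dataset names are illustrative only, loosely modeled on ATL03's structure, and are not part of any real product.

```python
import h5py
import numpy as np

# Groups behave like directories; datasets behave like files in the tree.
with h5py.File("example.h5", "w") as f:
    heights = f.create_group("gt1l/heights")            # nested groups
    heights.create_dataset("h_ph", data=np.random.rand(1000))
    heights.attrs["description"] = "photon heights"     # attributes hang off any node

# Walk the tree the same way you would list a directory structure.
with h5py.File("example.h5", "r") as f:
    f.visit(print)   # gt1l, gt1l/heights, gt1l/heights/h_ph
```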

---

##### _Metadata fragmentation_

When working with large datasets, especially those with numerous variables and nested groups, the storage of file-level metadata can become a challenge. In HDF5, by default, metadata associated with each dataset is stored in chunks of 4 kilobytes (KB). This chunking mechanism was originally intended to optimize storage efficiency and access speed on disks, given the hardware resources available when HDF was designed in the 1990s. In datasets with many variables and/or complex hierarchical structures, these 4 KB chunks can lead to significant fragmentation.

Fragmentation occurs when this metadata is spread across multiple non-contiguous blocks within the file. This makes accessing or modifying data inefficient because the HDF5 library has to read from multiple scattered locations in the file (@fig-1). Over time, as a dataset grows and evolves, this fragmentation can compound, leading to degraded performance and increased storage overhead. In particular, operations that involve reading or writing metadata, such as opening a file, checking attributes, or modifying variables, become slower and more resource-intensive. When cloud-hosted files are accessed over a network using HTTPS, each additional read increases the time needed to access the data.
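
As an illustration (a sketch, not the paper's benchmark code), the snippet below wraps a remote file object so that the reads h5py issues while merely walking the file-level metadata can be counted; the S3 object key is hypothetical. On a fragmented file, this traversal alone triggers a large number of small ranged requests.

```python
import h5py
import s3fs

class CountingFile:
    """Wrap a file-like object and count how many reads h5py issues."""
    def __init__(self, fileobj):
        self._fileobj = fileobj
        self.reads = 0
    def read(self, *args):
        self.reads += 1
        return self._fileobj.read(*args)
    def readinto(self, b):                     # h5py may prefer readinto()
        data = self.read(len(b))
        b[: len(data)] = data
        return len(data)
    def __getattr__(self, name):               # delegate seek(), tell(), etc.
        return getattr(self._fileobj, name)

fs = s3fs.S3FileSystem(anon=True)
key = "its-live-data/test-space/cloud-experiments/h5cloud/ATL03_example.h5"  # hypothetical key

raw = fs.open(key, "rb", cache_type="none")    # disable readahead so each read maps to one GET
counted = CountingFile(raw)
with h5py.File(counted, "r") as f:
    f.visititems(lambda name, obj: None)       # touch file-level metadata only, no data
print(f"read() calls to open and walk the metadata: {counted.reads}")
```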

##### _Global API Lock_

Because of the historical complexity of operations on the HDF5 format [@The_HDF_Group_Hierarchical_Data_Format], the library needed to be made thread-safe, and, much as in the Python language, the simplest mechanism to implement this is a global API lock. This global lock is not a major issue when reading data from local disk, but it becomes a major bottleneck when reading data over the network, because each read has to be sequential and latency in the cloud is orders of magnitude higher than for local access [@Mozilla-latency-2024; @scott-2020].
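
The sketch below illustrates the consequence (the file and dataset paths are illustrative, and this is not the paper's benchmark code): even when several datasets are requested from separate threads, h5py's global lock serializes the underlying HDF5 calls, so the elapsed time approaches the sum of the individual read times rather than the time of the slowest read.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import h5py

def read_variable(path):
    # Each worker opens the file independently, but every h5py/HDF5 call
    # still funnels through a single global lock, so reads do not overlap.
    with h5py.File("ATL03_example.h5", "r") as f:
        return f[path][...]

paths = [f"gt{n}l/heights/h_ph" for n in (1, 2, 3)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(paths)) as pool:
    results = list(pool.map(read_variable, paths))
print(f"elapsed: {time.perf_counter() - start:.2f} s")
```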

---

::: {#fig-1 fig-env="figure*"}

![](figures/figure-1.png)

A cartoon showing the distribution of fragmented file-level metadata and data in an HDF5 file, and the reads (Rn) required to access the file-level metadata. In the first read, R0, the HDF5 library verifies the file signature in the superblock. Subsequent reads, R1, R2, ..., Rn, read file-level metadata 4 KB at a time. The information in these metadata reads allows the actual data stored in the file to be located and accessed.

:::

##### _Background and data selection_
Products from NASA's Ice, Cloud, and Land Elevation Satellite 2 (ICESat-2) mission were some of the first NASA datasets to be migrated from on-premises storage at the National Snow and Ice Data Center Distributed Active Archive Center (NSIDC-DAAC) to Amazon Web Services (AWS) Simple Storage Service (S3). The ICESat-2 observatory carries the Advanced Topographic Laser Altimeter System (ATLAS), a photon-counting lidar altimeter that measures the round-trip travel time of low-energy green laser light to retrieve the heights of ice sheets, sea ice, land surfaces, vegetation, oceans, and clouds [@NEUMANN2019111325]. The geolocated photon height product (ATL03), the basic science product from which other higher-level, surface-specific products are derived, provides the latitude, longitude, and ellipsoid height of photons every 70 cm along six ground tracks as the satellite orbits the Earth. All ICESat-2 data products are in HDF5 format. The ICESat-2 mission produces # terabytes of data each day. ATL03 files range between # GB and # TB, with an average file size of # GB. Other surface-specific products tend to have smaller file sizes. Individual file sizes and the volume of data required for most science goals make traditional download-and-analyze workflows time consuming and demanding of local storage. Cloud-based workflows, in which ICESat-2 file objects in AWS S3 buckets are accessed directly from cloud compute instances in the same AWS region as the data, avoid the download step and can scale on demand; however, they cannot realize these advantages fully because of the large latency caused by metadata fragmentation and the global API lock.

NSIDC started the Cloud Optimized Format Investigation (COFI) project to improve access to HDF5 data from the ICESat-2 mission (and other missions) in response to feedback from the ICESat-2 community and the experiences of participants at ICESat-2 hack weeks organized by the University of Washington eScience Institute. The initial COFI effort was made at the 2023 hack week by developers from NASA DAACs and private companies, together with scientists and students [@h5cloud2023]. This collaborative effort has continued, led by NSIDC with engineers from the HDF Group and NASA.

## Methodology

We targeted the ATL03 geolocated photon height product because of its complexity and large size, and because it is the input to all standard surface-specific data products generated by NASA as well as the starting point for research and analysis using ICESat-2 products. ATL03 HDF5 files contain 1003 variables^[In HDF5, variables are called datasets.] organized in 6 data groups^[In HDF5, groups are container objects, analogous to directories in a file system, that hold datasets and other groups.]. The most relevant variables in ATL03 are the geolocated photon heights from the ICESat-2 ATLAS instrument. Although our research focused on this dataset, most of our findings apply to any dataset stored in HDF5 or NetCDF4.

We tested access times to original and cloud-optimized configurations of HDF5 [ATL03 files](https://its-live-data.s3.amazonaws.com/index.html#test-space/cloud-experiments/h5cloud/) stored in AWS S3 buckets in region us-west-2, the region hosting NASA's Earthdata Cloud archives. Files were accessed using Python tools commonly used by Earth scientists: h5py and xarray [@Hoyer2017-su]. h5py is a Python wrapper around the HDF5 C API, and xarray^[Xarray reads HDF5 files through backends built on `h5py`.] is a widely used Python package for working with n-dimensional data. We also tested access times using h5coro, a Python package optimized for reading HDF5 files from S3 buckets, and kerchunk, a tool that creates an efficient lookup table of file chunks to allow performant partial reads of files.
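
The sketch below shows the two most common access patterns we compared (h5coro and kerchunk were exercised through their own readers and are not shown). The object key, group, and variable names are illustrative, and the benchmark harness adds timing and repetition around calls like these.

```python
import h5py
import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(anon=True)
key = "its-live-data/test-space/cloud-experiments/h5cloud/ATL03_example.h5"  # hypothetical key

# 1. h5py on top of an s3fs file-like object
with fs.open(key, "rb") as remote, h5py.File(remote, "r") as f:
    h_ph = f["gt1l/heights/h_ph"][:]

# 2. xarray reading one beam group through the h5netcdf backend
with fs.open(key, "rb") as remote:
    ds = xr.open_dataset(remote, group="gt1l/heights", engine="h5netcdf")
    h_ph = ds["h_ph"].values
```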

The test files were cloud optimized by "repacking" them using a relatively new feature in the HDF5 C API called "paged aggregation". Paged aggregation does two things: first, it collects file-level metadata from datasets and stores it in dedicated metadata blocks at the front of the file; second, it forces the library to write both data and metadata in fixed-size pages. Aggregation allows client libraries to read file metadata with only a few requests, using the page size as a fixed request size and overriding the one-request-per-chunk behavior.
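
A minimal sketch of producing and reading a page-aggregated copy with h5py follows, assuming h5py ≥ 3.3 built against HDF5 ≥ 1.10.1; the 8 MiB page size is illustrative, not the value used for the benchmark files. Repacking can equivalently be done with the `h5repack` command-line tool's file-space strategy and page-size options.

```python
import h5py

PAGE = 8 * 1024 * 1024  # 8 MiB pages, illustrative

# Write a page-aggregated copy: metadata is consolidated into dedicated
# pages at the front of the file and all allocations use fixed-size pages.
with h5py.File("ATL03_original.h5", "r") as src, \
     h5py.File("ATL03_paged.h5", "w",
               fs_strategy="page", fs_page_size=PAGE) as dst:
    for name in src:
        src.copy(name, dst)            # copies groups and datasets recursively
    for key, value in src.attrs.items():
        dst.attrs[key] = value         # carry over root-level attributes

# On the read side, a page buffer lets the library read metadata a page at a
# time instead of one 4 KB block at a time.
with h5py.File("ATL03_paged.h5", "r", page_buf_size=PAGE) as f:
    f.visititems(lambda name, obj: None)
```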
