Introduce new organizational schema format for storing Cholla data#427

Merged
evaneschneider merged 3 commits into cholla-hydro:dev from mabruzzo:py-snap-builder on May 7, 2025

Conversation

@mabruzzo
Collaborator

@mabruzzo mabruzzo commented Apr 25, 2025

Overview

This Pull Request introduces a new data format for storing Cholla's concatenated field data in snapshots. It makes 3 changes: specifying a new format, modifying the python scripts, and updating the documentation.

Background

As we all know, a Cholla simulation partitions spatial data into blocks.

  • each MPI-process evolves a separate block
  • The blocks are conceptually organized on a 3D "block-location" grid of shape (BLx, BLy, BLz). When a simulation starts, this shape is always chosen to match (n_proc_x, n_proc_y, n_proc_z).
  • Conceptually, all fields live on a global Domain-grid composed of (nDx,nDy,nDz) cells. In practice, the underlying field data is partitioned among the blocks. A given block is responsible for tracking data on (nBx,nBy,nBz) cells, where (nDx,nDy,nDz) = (BLx*nBx, BLy*nBy, BLz*nBz).

When Cholla writes a snapshot of the field data, each block is written to a distinct file.

By default, Cholla uses HDF5 files. Cholla's standard "schema" (strategy for organizing data) is "flat" (i.e. there is no hierarchy). We represent this schema down below:

/                                      # root group
 ├── HEADER-ATTRS  (REQUIRED)
 ├── <dataset-0>
 ├── <dataset-1>
 └── ...

For convenience, we typically concatenate files as a post-processing step. Historically, this procedure has always produced an HDF5 file with the "Flat Format," where the data is stitched together in a way that roughly approximates the file that would be produced by a simulation run with a single block spanning the entire domain.

Motivation for a new schema

The classic organizational schema works quite well. The simplicity is extremely useful! However, we start to encounter issues when we get to really large data files.

The main impetus for making this PR relates to issues with the existing strategy and yt1 for large datasets:

  1. specifically, yt is designed to break calculations up block-by-block. When we have a single monolithic block:
    • we can't leverage yt's ability to operate on large datasets where the size of the dataset exceeds the available ram.
    • we can't make use of yt's parallelism
  2. Most importantly, yt starts to break for simulations of size ~2048^3 (presumably due to 32-bit index overflows)

Theoretically, we can configure yt to operate on distributed files (see yt-project/yt#4702). But, this is not an optimal general solution for a few reasons:

  • file-system accesses are slow on parallel file systems (it becomes especially noticeable when you have thousands of blocks or are using Lustre file systems)
  • the previous point is compounded by the fact that the logic in the above link requires opening every single file just to infer the relative spatial locations of the blocks. This is solvable by including more information within the files, but that effectively changes the structure of the files (and if we are going to change it, we might as well make a bigger change)
  • Frankly, it seems a little silly that the standard post-processing step would make data-analysis less efficient

More generally (whether you use yt or not), the current approach is sub-optimal for any sort of analysis of a spatial sub-region. At the moment, even the fastest way to access the data requires iterating over all values along the z-axis. This is pretty inefficient if you want to access just a spatial subregion (say you just want to consider the region around a galactic disk).

Change 1: Description of the New Format:

NOTE: it may be helpful to look at the newly proposed documentation (linked down below).

The proposed schema looks like the following:

/                                  # root group
 ├── HEADER-ATTRS  (REQUIRED)
 ├── domain/       (REQUIRED)
 │    ├── blockid_location_arr     # shape: (BLx,BLy,BLz)
 │    └── stored_blockid_list      # shape: (nBStored,)
 └── field/
      ├── <field-0>     # 4D shape: (nBStored,nBx,nBy,nBz) or (nBStored, ...)
      ├── <field-1>
      └── ...

In the above diagram:

  • we follow numpy conventions for describing arrays with C-contiguous layouts.
    In other words, the fastest index is last axis
  • BLx,BLy,BLz refer to the number of blocks per axis.
  • nBx,nBy,nBz refer to the number of cells per block. This is the shape of a cell-centered field.
  • nBStored is the number of blocks in the file. It should nominally be 1 or
    BLx*BLy*BLz
  • "domain/blockid_location_arr" specifies the relative locations of the
    blocks
  • the data at {field/<field-0>}[i, ...] corresponds to the data of the
    block with blockid specified by {domain/stored_blockid_list}[i]

Note

Files in this format ALWAYS provide the "dims" attribute (in HEADER-ATTRS) and the "domain" group. Importantly, "dims" specifies (nDx,nDy,nDz), the number of cells on the conceptual global Domain-grid, and the shape of "domain/blockid_location_arr" is (BLx,BLy,BLz). Thus, you can always infer (nBx,nBy,nBz) = (nDx/BLx, nDy/BLy, nDz/BLz). Consequently, you can determine whether a field/<field-0> is cell-centered, face-centered, etc. by looking at the field's shape.
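
For illustration, here is a minimal Python sketch of that inference (using h5py; this is not part of the PR, and the file name and the "density" field are just placeholders):

    import h5py

    with h5py.File("1.h5", "r") as f:
        # global Domain-grid cell counts (nDx, nDy, nDz) from HEADER-ATTRS
        nDx, nDy, nDz = f.attrs["dims"]
        # block-location grid shape (BLx, BLy, BLz)
        BLx, BLy, BLz = f["domain/blockid_location_arr"].shape
        # per-block cell counts
        nBx, nBy, nBz = nDx // BLx, nDy // BLy, nDz // BLz

        shape = f["field/density"].shape   # (nBStored, ...)
        if shape[1:] == (nBx, nBy, nBz):
            print("density is cell-centered")
        elif shape[1:] == (nBx + 1, nBy, nBz):
            print("density looks x-face-centered")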

The above format is intended to be forward-compatible with a schema that also stores particle-data and gravity-data in the same file. We illustrate what this could look like down below (in the collapsible section)

ASIDE: Preview of more general schema

Here we sketch a more-general hypothetical schema that can also include particle and gravity data:

/                                      # root group
 ├── HEADER-ATTRS  (REQUIRED)
 ├── domain/       (REQUIRED)
 │    ├── blockid_location_arr        # shape: (BLx,BLy,BLz)
 │    └── stored_blockid_list         # shape: (nBStored,)
 ├── field/
 │    ├── <field-0>        # 4D shape: (nBStored,nBx,nBy,nBz) or (nBStored, ...)
 │    ├── <field-1>        # 4D shape: (nBStored,nBx,nBy,nBz) or (nBStored, ...)
 │    └── ...              # 4D shape: (nBStored,nBx,nBy,nBz) or (nBStored, ...)
 ├── particle/
 │    ├── ATTR:total_particle_count    # i64
 │    ├── stop_particle_idx            # 1D shape: (nBStored,)
 │    ├── <particle-prop-0>            # 1D shape: (stop_particle_idx[-1],)
 │    ├── <particle-prop-1>            # 1D shape: (stop_particle_idx[-1],)
 │    └── ...                          # 1D shape: (stop_particle_idx[-1],)
 └── gravity/
      └── gravity          # 4D shape: (nBStored,nBx,nBy,nBz)

A few notes about particle group in this hypothetical extension:

  • particle/ATTR:total_particle_count specifies the total number of particles in the ENTIRE simulation.
  • particle/stop_particle_idx holds monotonically non-decreasing values.
    • when nBStored == BLx*BLy*BLz, then {particle/stop_particle_idx}[-1] == {particle/ATTR:total_particle_count}
    • in other cases, {particle/stop_particle_idx}[-1] <= {particle/ATTR:total_particle_count}
  • The values that describe particles for the blockid specified by {domain/stored_blockid_list}[i] are given by {particle/<particle-prop-0>}[slc], where slc is (a short sketch follows this list):
    • 0:{particle/stop_particle_idx}[0], when i is 0
    • {particle/stop_particle_idx}[i-1]:{particle/stop_particle_idx}[i], in all other cases
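
A hypothetical sketch of how a reader might apply that slicing rule (the particle property name "pos_x", the file name, and the chosen blockid are placeholders):

    import h5py
    import numpy as np

    with h5py.File("1_particles.h5", "r") as f:
        stored_ids = f["domain/stored_blockid_list"][:]
        stop_idx   = f["particle/stop_particle_idx"][:]
        some_blockid = 7                   # an arbitrary blockid of interest
        # position of that blockid within the stored list
        i = int(np.nonzero(stored_ids == some_blockid)[0][0])
        start = 0 if i == 0 else stop_idx[i - 1]
        pos_x = f["particle/pos_x"][start:stop_idx[i]]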

Change 2: Modifications to the Python Scripts:

This is the only set of changes that concretely correspond to modifications in the repository.

  1. The largest change is that we introduce snaprepack.py to create new files with the hierarchical schema based on previously concatenated files.
    • invoking this tool from the command line looks something like:
      ./snaprepack.py -s PATH/TO/SOURCE/FILE.h5 -o PATH/TO/OUTPUT/DIRECTORY
    • if the input file is missing the "nprocs" header attribute, you would specify it with the --missing-nprocs-triple flag. For example, if "nprocs" should hold (4,2,2), then you could invoke
      ./snaprepack.py -s PATH/TO/SOURCE/FILE.h5 -o PATH/TO/OUTPUT/DIRECTORY --missing-nprocs-triple 4 2 2
  2. I modified the contents of concat_3d_data.py such that the resulting files now have the hierarchical format.

Important

This PR has no impact on the concatenation of particles or 2D datasets (e.g. slices, projections)

While snaprepack.py is a little long, a lot of the contents are related to documenting the code. The code is written in a way that will be easy to extend to also handle particle-data as well as gravity-data. Once we finish that, I think I could probably simplify the logic a little more. But for now, I think the code is "good enough."

Change 3: Proposed Changes to the documentation:

If/When this PR is merged, we will need to modify the Outputs page of the Wiki. I already drafted the changes to this page while I was trying to figure out how best to describe the file format.

  • You can find the draft of the changes here, in a free-floating wiki page
  • When this PR is merged, we will obviously need to copy the changes from the free-floating wiki page to the Output page (and delete the free-floating page).
  • You can use this page to highlight all of my proposed changes.

Other thoughts

Potential improvements to the proposed scheme:

  • When using the hierarchical format for files that only store data for a single block, it's worth considering whether we want the domain group to only be present in the files for block 0, or in all files (we can't currently make these files, but it would be worth deciding this sooner or later).
  • Theoretically, we could reduce the number of datasets in the "field" group by using 5D datasets instead of 4D datasets. In this picture, we would have a separate dataset for all cell-centered fields, all fields on x-faces (namely "magnetic_x"), all fields on y-faces (namely "magnetic_y"), etc. Theoretically, this would give us the flexibility to store the fields for a given block in closer proximity to each other (which could be useful for analysis). In practice, I am skeptical that this additional complexity is warranted.

In the future:

  • As already noted, it would be useful to start putting gravity-data and particle-data into this format.
  • It would be straightforward to let Cholla start reading from files in the hierarchical format
  • It would be useful (and straightforward) to have Cholla start writing snapshot data in hierarchical format
  • we might consider storing derived data in hierarchical format. This could theoretically let us speed up slice-concatenation (by letting us skip unneeded files). It could also let us store multiple kinds of derived data in a single file. In practice, it may not be worth the effort.

Footnotes

  1. As we're all aware, yt is not necessarily the most efficient tool in the world (i.e. it isn't hard to write a script/python module to perform a particular task more efficiently than yt), but it is a useful "toolbox" for data exploration and basic analysis.

@mabruzzo
Collaborator Author

This was a bear to write up. I definitely got a little lazy at the end, so let me know if something doesn't make sense.

@brantr
Collaborator

brantr commented Apr 25, 2025 via email

@mabruzzo mabruzzo mentioned this pull request Apr 25, 2025
@evaneschneider
Collaborator

@brantr Thanks! First, I definitely agree that it would be great to meet and discuss this further. I'll send a poll around later today to see if we can find a good meeting time.
Second, I think it's worth noting that this PR does not change anything about how cholla outputs and stores data, so any existing analysis scripts you have would not be impacted by this change, unless you have been using the concatenation scripts in the existing python_scripts repository that this PR touches. (My impression has been that folks at UCSC had mostly been using versions of the scripts Bruno uploaded and/or other scripts that don't exist in this repository at all, so definitely let me know if you've been using the upgraded versions of the concatenation scripts here.)

@mabruzzo
Collaborator Author

mabruzzo commented Apr 25, 2025

Maybe there is a time people can find to meet and discuss?

I'm happy to meet and discuss. But I'll do my best to answer your questions here.

Changes to the Snapshot Format

The snapshot format has been changing a lot in recent times ...

Would you mind reminding us of some of these changes? Since I've been involved with Cholla (~20 months), I don't think there were any backwards incompatible changes to snapshot formats.

Am I forgetting something super obvious?

Compatibility

... , and these new changes to accommodate use with yt appear to require (at least eventually) substantial reworking of the analysis methods those of us who don’t use yt have invested in.

This is a fair point. There are 3 parts to this that I want to address.

I. A Solution: Virtual Datasets

I had actually come up with a "solution" to this problem, but didn't implement the solution before making the Pull Request. (In reality, this "solution" still breaks backwards compatibility, but it only requires a tiny tweak to get your existing analysis scripts to work again.)

I just now pushed a commit that adds the "solution" to the Pull Request. This solution uses HDF5's virtual datasets.1 As you may know, a virtual dataset is a type of dataset that acts as an interface layer for mapping together regions of other HDF5 datasets. (Lots of documentation about virtual datasets advertises the ability to map to HDF5 datasets in other files, but we do not use that feature.)

To make use of this feature, you should pass the optional --legacy-field argument to the concat_3d_data.py script or to the snaprepack.py script. When you pass this argument, the scripts will create the "field_legacy" HDF5-group in your output files. The "field_legacy" group contains a virtual dataset named for each field. Each dataset looks and acts just like the datasets that have historically been created by Cholla's concatenation scripts. But under the hood, it remaps accesses to the corresponding dataset in the "field" group.
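
For readers who haven't used the feature, here is a rough, hypothetical sketch (with h5py; this is not the implementation in the scripts, and the "density" field plus the file name are placeholders) of how a same-file virtual dataset can remap the 4D block-indexed data back into the flat layout:

    import h5py
    import numpy as np

    with h5py.File("1.h5", "r+") as f:
        nDx, nDy, nDz = f.attrs["dims"]
        loc = f["domain/blockid_location_arr"][:]          # (BLx, BLy, BLz)
        stored = list(f["domain/stored_blockid_list"][:])
        nBx, nBy, nBz = (nDx // loc.shape[0], nDy // loc.shape[1],
                         nDz // loc.shape[2])

        dset = f["field/density"]
        layout = h5py.VirtualLayout(shape=(nDx, nDy, nDz), dtype=dset.dtype)
        # "." tells HDF5 that the source dataset lives in this same file
        src = h5py.VirtualSource(".", "field/density", shape=dset.shape)
        for (bx, by, bz), bid in np.ndenumerate(loc):
            i = stored.index(bid)
            layout[bx*nBx:(bx+1)*nBx,
                   by*nBy:(by+1)*nBy,
                   bz*nBz:(bz+1)*nBz] = src[i]
        f.require_group("field_legacy").create_virtual_dataset("density", layout)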

To be concrete, if your analysis script currently expects to access the "density" and "Energy" datasets, you would only need to modify the script to access the "field_legacy/density" and "field_legacy/Energy" datasets instead.
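
In other words, the change to an existing analysis script would look something like this (minimal sketch; the file name is illustrative):

    import h5py

    with h5py.File("1.h5", "r") as f:
        # old flat-format access:
        #   rho = f["density"][...]
        # access through the legacy-compatible virtual datasets:
        rho = f["field_legacy/density"][...]
        E   = f["field_legacy/Energy"][...]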

II. This is arguably better for analyzing spatial subregions of large-scale datasets (whether you use yt or not)

As I tried (probably unsuccessfully) to argue in the main PR description, the proposed data concatenation strategy is probably better for any sort of analysis that involves analyzing a subregion of the simulation domain. If you imagine a spatial rectangular box (smaller than the entire domain), the values of a field that lie within that box will generally be stored closer together on disk.

I can elaborate more.
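
To give a flavor of what I mean, here is a rough sketch (names follow the schema diagram; the "density" field, the file name, and the chosen slab are placeholders) of reading only the blocks that overlap a subregion of interest, instead of streaming the whole flat dataset:

    import h5py

    with h5py.File("1.h5", "r") as f:
        loc = f["domain/blockid_location_arr"][:]       # shape (BLx, BLy, BLz)
        stored = list(f["domain/stored_blockid_list"][:])

        # e.g. a thin slab around a disk: only blocks with x block-location 0 or 1
        for bid in loc[0:2, :, :].ravel():
            i = stored.index(bid)
            block = f["field/density"][i]               # one (nBx, nBy, nBz) block
            # ... analyze this block ...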

III. More generally: Is this a sign that we should be providing a python module to help read Cholla data?

While I personally have no plans to modify Cholla's field-data snapshot format ever again, this may be an indication that we should all be using a python module to read in Cholla data? This would make analysis scripts far less vulnerable to these sorts of changes... I know that Athena++ is an example of a codebase that provides this sort of logic. (Personally, I've historically used yt for this particular purpose -- reading in the data.)

Plans for particle and gravity data

Some of the comments about plans for the particle and gravity data are unclear to me, so I’m interested in understanding better what potential future changes you have in mind.

I suspect that the schema I sketched out for the particle and gravity data may not have appeared in email-form since it was in a collapsible section (denoted by HTML tags). I reproduced that section at the end of this message.

The big picture idea is to make things composable: there should be a clear distinction in the "kind" of data such that you can either store different kinds of data in different files OR you can store different kinds of data in a single file. This should be done so that you can use the same logic in both cases.

Let's go through the different kinds of data:

  • field-data: To recap, this PR already proposes putting the "field" datasets inside of a "field" HDF5 group (and yes, there is a little bit of reorganization of concatenated data)
  • particle-data: I primarily plan to simply put all datasets inside of a "particle" group
    • existing scripts that access a "<particle-prop>" dataset would instead need to access "particle/<particle-prop>" (which is clearly an extremely minor tweak)
    • I also plan to add the "particle/stop_particle_idx" dataset to make it easy to identify the block that a particle belongs to (this is described at the very end of this message)
    • currently the particle datafiles also store a "density" field, which holds the deposited particle density2. Since that is purely a derived quantity (that isn't even used for restarts), my inclination is to make it possible for us to drop that information during concatenation.
  • gravity-data: I primarily plan to move the "gravity" dataset:
    • the easiest thing to do is to move the "gravity" dataset into a "gravity" group (the dataset would be "gravity/gravity"). A compelling argument could be made for putting it in the "field" group instead, but I don't think I care enough to undertake that work
    • I was planning to follow the same strategy for concatenation that we use for fluid fields.3 But more generally, any concatenation strategy will need to reformat the structure of the gravity data.4

Below is the section sketching out a hypothetical schema that also includes particle and gravity data:

Here we sketch a more-general hypothetical schema that can also include particle and gravity data:

/                                      # root group
 ├── HEADER-ATTRS  (REQUIRED)
 ├── domain/       (REQUIRED)
 │    ├── blockid_location_arr        # shape: (BLx,BLy,BLz)
 │    └── stored_blockid_list         # shape: (nBStored,)
 ├── field/
 │    ├── <field-0>        # 4D shape: (nBStored,nBx,nBy,nBz) or (nBStored, ...)
 │    ├── <field-1>        # 4D shape: (nBStored,nBx,nBy,nBz) or (nBStored, ...)
 │    └── ...              # 4D shape: (nBStored,nBx,nBy,nBz) or (nBStored, ...)
 ├── particle/
 │    ├── ATTR:total_particle_count    # i64
 │    ├── stop_particle_idx            # 1D shape: (nBStored,)
 │    ├── <particle-prop-0>            # 1D shape: (stop_particle_idx[-1],)
 │    ├── <particle-prop-1>            # 1D shape: (stop_particle_idx[-1],)
 │    └── ...                          # 1D shape: (stop_particle_idx[-1],)
 └── gravity/
      └── gravity          # 4D shape: (nBStored,nBx,nBy,nBz)

A few notes about particle group in this hypothetical extension:

  • particle/ATTR:total_particle_count specifies the total number of particles in the ENTIRE simulation.
  • particle/stop_particle_idx holds monotonically non-decreasing values.
    • when nBStored == BLx*BLy*BLz, then {particle/stop_particle_idx}[-1] == {particle/ATTR:total_particle_count}
    • in other cases, {particle/stop_particle_idx}[-1] <= {particle/ATTR:total_particle_count}
  • The values that describe particles for the blockid specified by {domain/stored_blockid_list}[i] are given by {particle/<particle-prop-0>}[slc], where slc is:
    • 0:{particle/stop_particle_idx}[0], when i is 0
    • {particle/stop_particle_idx}[i-1]:{particle/stop_particle_idx}[i], in all other cases

Footnotes

  1. Virtual datasets were introduced in HDF5 1.10.0 (released over 9 years ago).

  2. Off the top of my head, I can't remember if it includes gas density contributions.

  3. There currently isn't an official way to concatenate gravity files -- so we are free to concatenate data however we want.

  4. Currently, we write the gravity-field as a 1d dataset that includes ghost zones and is ordered such that the x-axis is the fast-access axis. In contrast, all hydro fields are written as 3d datasets that don't include ghost zones and where the z-axis is the fast-access axis.

@brantr
Collaborator

brantr commented Apr 25, 2025 via email

@mabruzzo
Collaborator Author

Regarding changes, I’m talking about #367. I was personally not working much for ~6 of those 12 months since, so Feb 2024 seems maybe more recent for me than for you.

Gotcha. I hadn't really considered it a significant backwards-incompatible change since I went out of my way in that PR to try to make it possible to easily continue using the legacy behavior.

I’m happy to discuss, I’ve filled out the form. Regarding Evan’s comment, I think the PR description confused me because the first sentence says "This Pull Request introduces a new data format for storing Cholla's field data in snapshots,” which sounds like it changes how cholla stores data. So I’ll have to review in more detail to understand.

Yeah, that was totally my fault.

  • I just edited the first line to reflect that this PR primarily changes the format for concatenated field data. As Evan notes, the only concrete change made by the PR is to the concatenation script.
  • The PR does present the format a little more broadly. I mostly meant that we could use it in the future for Cholla outputs (ideally, it would be nice to have a single format that self-documents the number of blocks that are used). But actually going through with that is a topic for another PR. I'm not sure just how clear that was.

I am not a fan of combining the gas, particle, and gravity data into a single file. It ~>triples the data volume required to analyze just the gas or particles or gravity field for the whole box. Happy to discuss this all at the meeting and catch up.

That's totally fair, and I largely agree for truly massive simulations where a single snapshot takes up lots of space. Part of the idea was to make storing everything in a single file optional: if all the data is formatted in a composable manner, the distinction between using 1 or 3 files is like 3 lines of code.

@evaneschneider evaneschneider merged commit c468043 into cholla-hydro:dev May 7, 2025
11 checks passed
@mabruzzo mabruzzo deleted the py-snap-builder branch August 18, 2025 12:51