Introduce new organizational schema format for storing Cholla data #427
Conversation
This was a bear to write up. I definitely got a little lazy at the end, so definitely let me know if something doesn't make sense.
Maybe there is a time people can find to meet and discuss? The snapshot format has been changing a lot in recent times, and these new changes to accommodate use with yt appear to require (at least eventually) substantial reworking of the analysis methods those of us who don't use yt have invested in. Some of the comments about plans for the particle and gravity data are unclear to me, so I'm interested in understanding better what potential future changes you have in mind. Thanks!
@brantr Thanks! First, I definitely agree that it would be great to meet and discuss this further. I'll send a poll around later today to see if we can find a good meeting time.
I'm happy to meet and discuss, but I'll do my best to answer your questions here.

Changes to the Snapshot Format

"The snapshot format has been changing a lot in recent times ..."

Would you mind reminding us of some of these changes? In the time I've been involved with Cholla (~20 months), I don't think there have been any backwards-incompatible changes to the snapshot formats. The closest things I can recall are:

- #341, where I added a "cholla" attribute
- #367, where I changed the default structure for organizing output files to use directories by default (as part of that PR, we made it possible to keep using the original flat structure)

Am I forgetting something super obvious?

Compatibility

"..., and these new changes to accommodate use with yt appear to require (at least eventually) substantial reworking of the analysis methods those of us who don't use yt have invested in."

This is a fair point. There are 3 parts to this that I want to address.

I. A Solution: Virtual Datasets

I had actually come up with a "solution" to this problem, but didn't implement it before making the Pull Request. (In reality, this "solution" still breaks backwards compatibility, but it only requires a tiny tweak to get your existing analysis scripts working again.)

I just pushed a commit that adds the "solution" to the Pull Request. It uses HDF5's virtual datasets (https://support.hdfgroup.org/documentation/hdf5/latest/d4/d79/_v_d_s_t_n.html).[1] As you may know, a virtual dataset is a dataset that acts as an interface layer mapping together regions of other HDF5 datasets. (Lots of the documentation about virtual datasets advertises the ability to map to datasets in other files, but we do not use that feature.)

To make use of this feature, pass the optional --legacy-field argument to the concat_3d_data.py script or to the snaprepack.py script. When you pass this argument, the scripts create a "field_legacy" HDF5 group in your output files. The "field_legacy" group holds a virtual dataset named for each field. Each dataset looks and acts just like the datasets that have historically been created by Cholla's concatenation scripts, but under the hood it remaps accesses to the corresponding dataset in the "field" group.

To be concrete: if your analysis script currently accesses the "density" and "Energy" datasets, you would only need to modify it to access the "field_legacy/density" and "field_legacy/Energy" datasets.

II. This is arguably better for analyzing spatial subregions of large-scale datasets (whether you use yt or not)

As I tried (probably unsuccessfully) to argue in the main PR description, the proposed data-concatenation strategy is probably better for any sort of analysis that involves a subregion of the simulation domain. If you imagine a rectangular spatial box (smaller than the entire domain), the values of a field that lie within that box will generally be stored closer together on disk. I can elaborate more.

III. More generally: is this a sign that we should provide a python module to help read Cholla data?

While I personally have no plans to ever modify Cholla's field-data snapshot format again, this may be an indication that we should all be using a python module to read in Cholla data. That would make analysis scripts far less vulnerable to these sorts of changes. I know that Athena++ is an example of a codebase that provides this sort of logic. (Personally, I've historically used yt for this particular purpose of reading in the data.)

Plans for particle and gravity data

"Some of the comments about plans for the particle and gravity data are unclear to me, so I'm interested in understanding better what potential future changes you have in mind."

I suspect that the schema I sketched out for the particle and gravity data may not have appeared in email form, since it was in a collapsible section (denoted by HTML tags). I have reproduced that section at the end of this message.

The big-picture idea is to make things composable: there should be a clear enough distinction between the "kinds" of data that you can either store different kinds of data in different files OR store them all in a single file, and you can use the same logic in both cases. Let's go through the different kinds of data:

- field-data: to recap, this PR already proposes putting the field datasets inside a "field" HDF5 group (and yes, there is a little bit of reorganization of concatenated data)
- particle-data: I primarily plan to simply put all datasets inside a "particle" group
  - existing scripts that access a particle dataset directly would instead need to access it inside the "particle" group (which is clearly an extremely minor tweak)
  - I also plan to add the "particle/stop_particle_idx" dataset to make it easy to identify the block that a particle belongs to (this is described at the very end of this message)
  - currently the particle datafiles also store a "density" field holding the deposited particle density.[2] Since that is purely a derived quantity (that isn't even used for restarts), my inclination is to make it possible to drop that information during concatenation
- gravity-data:
  - the easiest thing to do is to move the "gravity" dataset into a "gravity" group (the dataset would be "gravity/gravity"). A compelling argument could be made for putting it in the "field" group, but I don't think I care enough to undertake that work
  - I was planning to follow the same concatenation strategy that we use for fluid fields.[3] But more generally, any concatenation strategy will need to reformat the structure of the gravity data[4]

Below is the section sketching out the hypothetical schema that also includes particle and gravity data:

/                                  # root group
├── HEADER-ATTRS                   (REQUIRED)
├── domain/                        (REQUIRED)
│   ├── blockid_location_arr       # shape: (BLx,BLy,BLz)
│   └── stored_blockid_list       # shape: (nBStored,)
├── field/
│   ├── <field-0>                  # 4D shape: (nBStored,nBx,nBy,nBz) or (nBStored, ...)
│   ├── <field-1>                  # 4D shape: (nBStored,nBx,nBy,nBz) or (nBStored, ...)
│   └── ...                        # 4D shape: (nBStored,nBx,nBy,nBz) or (nBStored, ...)
├── particle/
│   ├── ATTR:total_particle_count  # i64
│   ├── stop_particle_idx          # 1D shape: (nBStored,)
│   ├── <particle-prop-0>          # 1D shape: (stop_particle_idx[-1],)
│   ├── <particle-prop-1>          # 1D shape: (stop_particle_idx[-1],)
│   └── ...                        # 1D shape: (stop_particle_idx[-1],)
└── gravity/
    └── gravity                    # 4D shape: (nBStored,nBx,nBy,nBz)

A few notes about the particle group in this hypothetical extension:

- particle/ATTR:total_particle_count specifies the total number of particles in the ENTIRE simulation
- particle/stop_particle_idx holds monotonically non-decreasing values
  - when nBStored == BLx*BLy*BLz, then {particle/stop_particle_idx}[-1] == {particle/ATTR:total_particle_count}
  - in other cases, {particle/stop_particle_idx}[-1] <= {particle/ATTR:total_particle_count}
- the values that describe particles for the blockid specified by {domain/stored_blockid_list}[i] are given by {particle/<particle-prop-0>}[slc], where slc is:
  - 0:{particle/stop_particle_idx}[0], when i is 0
  - {particle/stop_particle_idx}[i-1]:{particle/stop_particle_idx}[i], in all other cases

Footnotes

1. Virtual datasets were introduced in HDF5 1.10.0, released over 9 years ago (https://www.hdfgroup.org/2016/03/31/welcome-hdf5-1-10-0/).
2. Off the top of my head, I can't remember if it includes gas density contributions.
3. There currently isn't an official way to concatenate gravity files -- so we are free to concatenate data however we want.
4. Currently, we write the gravity field as a 1D dataset that includes ghost zones and is ordered such that the x-axis is the fast-access axis. In contrast, all hydro fields are written as 3D datasets that don't include ghost zones and where the z-axis is the fast-access axis.
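For readers curious what the "field_legacy" remapping amounts to in code, here is a minimal sketch using h5py's virtual-dataset API. This is an illustration, not the PR's actual implementation: the function name is mine, and it assumes the simple single-block case where each field/<name> dataset has shape (1, nBx, nBy, nBz).

```python
# Sketch (hypothetical helper, not the PR's code) of aliasing the new
# "field" group through a legacy-shaped virtual dataset.
import h5py

def add_legacy_field_aliases(path, fields=("density", "Energy")):
    with h5py.File(path, "r+") as f:
        grp = f.require_group("field_legacy")
        for name in fields:
            src = f["field"][name]
            legacy_shape = src.shape[1:]  # drop the leading block axis
            layout = h5py.VirtualLayout(shape=legacy_shape, dtype=src.dtype)
            # a path of "." makes the virtual source refer to the same file
            source = h5py.VirtualSource(".", f"field/{name}", shape=src.shape)
            layout[...] = source[0]  # map block 0 onto the 3D legacy view
            grp.create_virtual_dataset(name, layout)
```

With aliases like these in place, a legacy script only needs to swap "density" for "field_legacy/density"; reads pass through to the underlying "field" group.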
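As an aside for anyone writing a reader: the stop_particle_idx bookkeeping sketched in the hypothetical schema (each entry marks where a stored block's particles end) reduces to a tiny helper. This is an illustrative sketch with a made-up helper name, not code from the PR.

```python
import numpy as np

def particle_slice(stop_particle_idx, i):
    """Slice of the particle-property arrays belonging to the i-th stored
    block: [stop_particle_idx[i-1], stop_particle_idx[i]), with an implicit
    lower bound of 0 when i == 0. (Helper name is mine, not from the PR.)"""
    start = 0 if i == 0 else int(stop_particle_idx[i - 1])
    return slice(start, int(stop_particle_idx[i]))

# e.g. 3 stored blocks holding 2, 0, and 3 particles respectively:
stops = np.array([2, 2, 5])
mass = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # a hypothetical particle property
# mass[particle_slice(stops, 2)] -> the last block's particles: [3.0, 4.0, 5.0]
```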
Regarding changes, I'm talking about #367. I was personally not working much for ~6 of those 12 months since, so Feb 2024 seems maybe more recent for me than for you.
I'm happy to discuss, and I've filled out the form. Regarding Evan's comment, I think the PR description confused me because the first sentence says "This Pull Request introduces a new data format for storing Cholla's field data in snapshots," which sounds like it changes how Cholla stores data. So I'll have to review in more detail to understand.
I am not a fan of combining the gas, particle, and gravity data into a single file. It roughly triples (or more) the data volume required to analyze just the gas, or just the particles, or just the gravity field for the whole box.
Happy to discuss this all at the meeting and catch up.
-Brant
----------------------
Brant Robertson
Professor of Astronomy and Astrophysics
University of California, Santa Cruz
Gotcha. I hadn't really considered it a significant backwards-incompatible change since I went out of my way in that PR to try to make it possible to easily continue using the legacy behavior.
Yeah, that was totally my fault.
That's totally fair, and I largely agree for truly massive simulations where a single snapshot takes up lots of space. Part of the idea was to make it optional to store it all in a single file: if all the data is formatted in a composable manner, the distinction between using 1 file or 3 files is only about 3 lines of code.
Overview
This Pull Request introduces a new data format for storing Cholla's concatenated field data in snapshots. We introduce 3 changes: specifying a new format, modifying the python scripts, and updating the documentation.
Background
As we all know, a Cholla simulation partitions spatial data into blocks. The blocks are arranged in a grid of shape (BLx, BLy, BLz). When a simulation starts, this shape is always chosen to match (n_proc_x, n_proc_y, n_proc_z). Conceptually, the global simulation domain holds (nDx,nDy,nDz) cells. In practice, the underlying field data is partitioned among the blocks: a given block is responsible for tracking data on (nBx,nBy,nBz) cells, where (nDx,nDy,nDz) = (BLx*nBx, BLy*nBy, BLz*nBz). When Cholla writes a snapshot of the field data, each block is written to a distinct file.
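To make this bookkeeping concrete, here is a minimal sketch of the relationship between the domain shape, the block grid, and the per-block cell counts (the function name is mine, not Cholla's):

```python
def block_cell_shape(domain_cells, block_grid):
    """Per-block cell counts (nBx,nBy,nBz) from the global cell counts
    (nDx,nDy,nDz) and the block-grid shape (BLx,BLy,BLz).
    (Function name is hypothetical, not from the Cholla codebase.)"""
    assert all(nD % BL == 0 for nD, BL in zip(domain_cells, block_grid))
    return tuple(nD // BL for nD, BL in zip(domain_cells, block_grid))

# e.g. a 256^3 domain split across a (4, 2, 2) block grid:
# block_cell_shape((256, 256, 256), (4, 2, 2)) -> (64, 128, 128)
```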
By default, Cholla uses HDF5 files. Cholla's standard "schema" (strategy for organizing data) is "flat" (i.e. there is no hierarchy): each field is stored as a dataset directly in the root group of the file. We represent this schema down below:

/                # root group (holds the header attributes)
├── <field-0>
├── <field-1>
└── ...
For convenience, we typically concatenate files as a post-processing step. Historically, this procedure always produces an HDF5 file with the "Flat Format," where the data is stitched together in a way that roughly approximates the file that would be produced by a simulation run with a single block that spanned the entire domain.
Motivation for a new schema
The classic organizational schema works quite well, and its simplicity is extremely useful! However, we start to encounter issues with really large data files.
The main impetus for making this PR relates to issues with the existing strategy and yt[1] for large datasets:
Theoretically, we can configure yt to operate on distributed files (see yt-project/yt#4702). But this is not an optimal general solution for a few reasons:
More generally (whether you use yt or not), the current approach is sub-optimal for any sort of analysis of a spatial sub-region. At the moment, the fastest way to access the data is to iterate over all values along the z-axis, which is pretty inefficient if you want to access just a spatial subregion (say, the region around a galactic disk).
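As a toy illustration of the locality argument (sizes and helper names are hypothetical, not from the PR), compare how many contiguous chunks a small box touches under a single flat layout versus a block-grouped layout:

```python
from math import prod

def contiguous_runs_flat(box_lo, box_hi):
    """Contiguous reads needed to pull a box out of one flat, C-ordered
    (z-fastest) 3D array: one run per (x, y) pair in the box."""
    return (box_hi[0] - box_lo[0]) * (box_hi[1] - box_lo[1])

def blocks_touched(block_shape, box_lo, box_hi):
    """Blocks a box overlaps when data is grouped by block; each block's
    cells are stored together, so this bounds the number of chunks read."""
    lo = [box_lo[i] // block_shape[i] for i in range(3)]
    hi = [(box_hi[i] - 1) // block_shape[i] for i in range(3)]
    return prod(hi[i] - lo[i] + 1 for i in range(3))

# e.g. an 8x8x8 box at the origin: 64 separate runs from a flat array,
# but a single contiguous block when blocks are 8x8x8
```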
Change 1: Description of New Format:
NOTE: it may be helpful to look at the newly proposed documentation (linked down below).
The proposed schema looks like the following:

/                             # root group
├── HEADER-ATTRS              (REQUIRED)
├── domain/                   (REQUIRED)
│   ├── blockid_location_arr  # shape: (BLx,BLy,BLz)
│   └── stored_blockid_list   # shape: (nBStored,)
└── field/
    ├── <field-0>             # 4D shape: (nBStored,nBx,nBy,nBz) or (nBStored, ...)
    ├── <field-1>             # 4D shape: (nBStored,nBx,nBy,nBz) or (nBStored, ...)
    └── ...                   # 4D shape: (nBStored,nBx,nBy,nBz) or (nBStored, ...)

In the above diagram:

- datasets are written so that the fastest index is the last axis
- BLx, BLy, BLz refer to the number of blocks per axis
- nBx, nBy, nBz refer to the number of cells per block (this is the shape of a cell-centered field)
- nBStored is the number of blocks stored in the file; it should nominally be 1 or BLx*BLy*BLz
- "domain/blockid_location_arr" specifies the relative locations of the blocks
- {field/<field-0>}[i, ...] corresponds to the data of the block with the blockid specified by {domain/stored_blockid_list}[i]

Note: files in this format ALWAYS provide the "dims" attribute (in HEADER-ATTRS) and the "domain" group. Importantly, "dims" specifies (nDx,nDy,nDz), the number of cells on the conceptual global Domain-grid, and the shape of "domain/blockid_location_arr" is (BLx,BLy,BLz). Thus, you can always infer (nBx,nBy,nBz) = (nDx/BLx, nDy/BLy, nDz/BLz). Consequently, you can determine whether a field/<field-0> dataset is cell-centered, face-centered, etc. by looking at the field's shape.

The above format is intended to be forward-compatible with a schema that also stores particle-data and gravity-data in the same file. We illustrate what this could look like down below (in the collapsible section).
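A reader could implement the inference described in that note along these lines. This is an illustrative sketch: the function names are mine, and the centering convention (face-centered fields carrying one extra value along their axis) is an assumption on my part, not something this PR specifies.

```python
import numpy as np

def infer_block_shape(dims, blockid_location_shape):
    """(nBx,nBy,nBz) = (nDx/BLx, nDy/BLy, nDz/BLz), from the "dims"
    attribute and the shape of domain/blockid_location_arr."""
    nD, BL = np.asarray(dims), np.asarray(blockid_location_shape)
    assert np.all(nD % BL == 0)
    return tuple(int(v) for v in nD // BL)

def centering(field_shape, nB):
    """Guess a 4D field dataset's centering from its trailing three axes,
    assuming face-centered fields carry one extra value along their axis
    (an assumed convention, not one specified by the PR)."""
    per_block = tuple(field_shape[1:])
    if per_block == tuple(nB):
        return "cell-centered"
    for ax, name in enumerate(("x", "y", "z")):
        expected = tuple(n + 1 if a == ax else n for a, n in enumerate(nB))
        if per_block == expected:
            return f"{name}-face-centered"
    return "unknown"
```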
ASIDE: Preview of more general schema
Here we sketch a more-general hypothetical schema that can also include particle and gravity data:

/                                  # root group
├── HEADER-ATTRS                   (REQUIRED)
├── domain/                        (REQUIRED)
│   ├── blockid_location_arr       # shape: (BLx,BLy,BLz)
│   └── stored_blockid_list        # shape: (nBStored,)
├── field/
│   ├── <field-0>                  # 4D shape: (nBStored,nBx,nBy,nBz) or (nBStored, ...)
│   ├── <field-1>                  # 4D shape: (nBStored,nBx,nBy,nBz) or (nBStored, ...)
│   └── ...                        # 4D shape: (nBStored,nBx,nBy,nBz) or (nBStored, ...)
├── particle/
│   ├── ATTR:total_particle_count  # i64
│   ├── stop_particle_idx          # 1D shape: (nBStored,)
│   ├── <particle-prop-0>          # 1D shape: (stop_particle_idx[-1],)
│   ├── <particle-prop-1>          # 1D shape: (stop_particle_idx[-1],)
│   └── ...                        # 1D shape: (stop_particle_idx[-1],)
└── gravity/
    └── gravity                    # 4D shape: (nBStored,nBx,nBy,nBz)

A few notes about the particle group in this hypothetical extension:

- particle/ATTR:total_particle_count specifies the total number of particles in the ENTIRE simulation
- particle/stop_particle_idx holds monotonically non-decreasing values
  - when nBStored == BLx*BLy*BLz, then {particle/stop_particle_idx}[-1] == {particle/ATTR:total_particle_count}
  - in other cases, {particle/stop_particle_idx}[-1] <= {particle/ATTR:total_particle_count}
- the values that describe particles for the blockid specified by {domain/stored_blockid_list}[i] are given by {particle/<particle-prop-0>}[slc], where slc is:
  - 0:{particle/stop_particle_idx}[0], when i is 0
  - {particle/stop_particle_idx}[i-1]:{particle/stop_particle_idx}[i], in all other cases

Change 2: Modifications to the Python Scripts:
This is the only set of changes that concretely correspond to modifications in the repository.
- We introduce snaprepack.py to create new files with the hierarchical schema based on previously concatenated files.
  - If a previously concatenated file is missing the "nprocs" header attribute, you would specify it with the --missing-nprocs-triple flag. For example, if "nprocs" should hold (4,2,2), then you could pass that triple via the flag.
- We also modify concat_3d_data.py such that resulting files now have the hierarchical format.

Important
This PR has no impact on the concatenation of particles or 2D datasets (e.g. slices, projections).
While snaprepack.py is a little long, a lot of the contents are related to documenting the code. The code is written in a way that will be easy to extend to also handle particle-data as well as gravity-data. Once we finish that, I think I could probably simplify the logic a little more. But for now, I think the code is "good enough."

Change 3: Proposed Changes to the documentation:
If/when this PR is merged, we will need to modify the Outputs page of the Wiki. I have already drafted the changes to this page while trying to figure out how best to describe the file format.
Other thoughts
Potential improvements to the proposed scheme:

- We could further group fields by their centering: e.g. all fields on x-faces (namely "magnetic_x"), all fields on y-faces (namely "magnetic_y"), etc. Theoretically, this would give us the opportunity to store the fields for a given block in closer proximity to each other (which could be useful for analysis). In practice, I am skeptical that this additional complexity is warranted.

In the future:
Footnotes
1. As we're all aware, yt is not necessarily the most efficient tool in the world (i.e. it isn't hard to write a script/python module to perform a particular task more efficiently than yt), but it is a useful "toolbox" for data exploration and basic analysis.