Description
(We could call this Project: NENTI)
As evidenced recently in PRs #5250, #5224, #5211, and some others, not everything needs the index. In fact, the reason some of those things currently require the index is not that they actually need it, but that one of the most pernicious Enzo-isms in yt is a lingering dependency on the index for certain operations.
The Problems
There are a few overlapping problems here. The first is that constructing the index conflates two things: parsing the index files to extract all the necessary information, and constructing the in-memory representation of things like grids.
The other big problem is that there are a handful of peculiarities (not necessarily problems, to be clear -- I am absolutely not disparaging Enzo here) in how Enzo stores some specific pieces of information. The two pieces in question are the maximum level of the grids and the fields available in a given dataset.
The former is addressed in PR #5211; the latter comes up in #5250 and #5224.
What Fields Are There
Enzo's fields are stored in a hierarchy of HDF5 leaves, for instance:
/Grid000001/Density
/Grid000001/Temperature
/Grid000001/particle_position_x
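Enumerating these fields means opening the data files and walking every grid group -- the "census" discussed below. A minimal sketch of that walk, assuming h5py (the function name is ours, not yt's actual detection code):

import h5py

def census_fields(filename):
    fields = set()
    with h5py.File(filename, "r") as f:
        for name, group in f.items():
            if not name.startswith("Grid"):
                continue  # skip non-grid top-level entries
            fields.update(group.keys())  # the HDF5 leaves are field names
    return sorted(fields)

For a real hierarchy this walk has to be repeated over every data file, which is exactly what makes it expensive.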
There is secondary information in the "parameter file" that, in many cases, spares us from doing a census of the entire set of HDF5 leaves. These are lines like:
DataLabel[0] = Density
DataUnits[0] = none
#DataCGSConversionFactor[0] = 2.76112e-30
DataLabel[1] = x-velocity
DataUnits[1] = none
#DataCGSConversionFactor[1] = 1.32477e+06
DataLabel[2] = y-velocity
DataUnits[2] = none
#DataCGSConversionFactor[2] = 1.32477e+06
DataLabel[3] = z-velocity
DataUnits[3] = none
#DataCGSConversionFactor[3] = 1.32477e+06
DataLabel[4] = TotalEnergy
DataUnits[4] = none
DataLabel[5] = Bx
DataUnits[5] = none
DataLabel[6] = By
DataUnits[6] = none
DataLabel[7] = Bz
DataUnits[7] = none
(There are many more lines like these.) Unfortunately, there are two reasonably important cases that the DataLabel lines don't cover: derived fields (such as Dark_Matter_Density) and particle fields -- particularly "Active Particle" fields, though not limited to those.
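For concreteness, the cheap path is simple; a minimal sketch of reading the DataLabel entries (the function name is ours):

import re

_LABEL_RE = re.compile(r"^DataLabel\[(\d+)\]\s*=\s*(\S+)")

def read_data_labels(parameter_file):
    labels = {}
    with open(parameter_file) as f:
        for line in f:
            m = _LABEL_RE.match(line.strip())
            if m:
                labels[int(m.group(1))] = m.group(2)
    return [labels[i] for i in sorted(labels)]

But, per the caveat above, this can never yield a complete field list on its own.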
So, if we assume that we do not want to make multiple modifications to the field_info object, then for Enzo specifically we need to do a full census to ensure we have every single field. And a census is expensive, so we defer it to the other time we do an expensive operation: constructing the index.
(If we relaxed this requirement, we would not need to do this. But generating the derived field list can be expensive, so if we have a derived field that requires fields A, B, and C, and only A and B are identified pre-census, then by the time C gets added we no longer know we need to enable that field, and we'd have to run the whole field derivation process again. It gets even stickier if, for instance, yt can generate C as a fallback -- as it can for temperature, dark matter density, etc.)
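A toy illustration of that ordering problem (the field names are invented):

derived_field_deps = {"D": {"A", "B", "C"}}  # D needs A, B, and C

def detect(available):
    return {f for f, deps in derived_field_deps.items() if deps <= available}

available = {"A", "B"}    # known before the census
print(detect(available))  # set(): D looks unavailable
available.add("C")        # the census later turns up C
print(detect(available))  # {'D'}: detection must rerun to notice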
Maybe It's Not So Bad
Well, perhaps it's quite -- no, it's exactly that bad. Sorry.
What If We...
Maybe we should just do this differently. Maybe we should move all the field handling back into the Dataset, and for frontends like Enzo, make them require the index. (This would be the reverse of what we're doing now -- Enzo would no longer be the default behavior.) We'd likely still want derived fields to be generated later, since they can be expensive.
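Sketched as a class attribute, that inversion might look like the following; the flag and method names are hypothetical, not yt's API:

class Dataset:
    _fields_require_index = False  # proposed default: no index for fields

    def _build_index(self):  # stand-in for expensive index parsing
        self.index = object()

    def _read_field_list(self):  # stand-in for cheap field detection
        return []

    def _detect_fields(self):
        if self._fields_require_index:
            self._build_index()  # only opted-in frontends pay this cost
        self.field_list = self._read_field_list()

class EnzoDataset(Dataset):
    _fields_require_index = True  # Enzo still needs the full HDF5 census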
And, for things like the max level and whatnot, in many cases we don't actually require the index be parsed to get that information; Enzo is a bit of a special case. (And I think it's not impossible to get this information in RAMSES without fully parsing the index, but I might be wrong.) So let's just move that around, too.
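For frontends whose parameter file records the level directly, the no-index path can be as simple as the sketch below. (A hedged illustration: note that for Enzo, MaximumRefinementLevel is the permitted maximum, not necessarily the deepest level actually populated, which is part of what makes it a special case.)

def read_max_level(parameter_file):
    with open(parameter_file) as f:
        for line in f:
            if line.startswith("MaximumRefinementLevel"):
                return int(line.split("=")[1])
    return None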
Moving Forward
Even if this gets resolved, the biggest remaining problem may be that the construction of derived fields is slow. The reason is that we try to generate everything and use failures to mark fields as "not available," which is the only way I could see to get around the fallback fields mentioned above.
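That pattern, roughly (the names are illustrative, not yt's internals):

candidate_fields = {
    "density": lambda data: data["Density"],
    "temperature": lambda data: data["Temperature"],  # not on disk below
}
on_disk = {"Density": 1.0}  # stand-in for a tiny sample read
detected = {}
for name, func in candidate_fields.items():
    try:
        func(on_disk)  # attempt to evaluate the field definition
        detected[name] = func
    except KeyError:
        pass  # a dependency was missing: mark "not available"
print(sorted(detected))  # ['density']

Every candidate gets attempted, whether or not it could possibly apply.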
This has a big impact on testing, since creating a dataset for each test adds the overhead of field detection, and that really adds up.
There have been a few alternatives proposed, but the one that stands out right at this moment is field plugin validation. The problem with this is that we need a way to identify which plugins are relevant without requiring NxM adapters to connect N different plugins to M different frontends.
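One possible shape for that validation (purely speculative): each plugin declares the on-disk fields it needs, so relevance becomes a set check against the frontend's field list -- N plus M declarations rather than NxM adapters.

from dataclasses import dataclass

@dataclass(frozen=True)
class FieldPlugin:
    name: str
    requires: frozenset

plugins = [
    FieldPlugin("fluid", frozenset({"Density", "TotalEnergy"})),
    FieldPlugin("magnetic", frozenset({"Bx", "By", "Bz"})),
]

def relevant_plugins(on_disk_fields):
    return [p for p in plugins if p.requires <= on_disk_fields]

print([p.name for p in relevant_plugins({"Density", "TotalEnergy"})])  # ['fluid']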
Metrics of Success
Without generating the index, we should be able to:
- Create any selection-based data object
- Query the fields available on disk (and maybe aliased or derived fields, but let's not get too far ahead of ourselves)
We should also see how fast we can make dataset instantiation.
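In code, success might look like this (aspirational, not current behavior, and the dataset path is hypothetical):

import yt

ds = yt.load("DD0046/DD0046")       # no index construction here...
print(ds.field_list)                # ...nor here, for the on-disk fields
sp = ds.sphere("c", (10.0, "kpc"))  # a selection object, still no index
# The index would only be built when data is actually read,
# e.g. on sp["gas", "density"].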