Description
(We could call this Project: NENTI)
As evidenced recently in PRs #5250, #5224, #5211, and some others, not everything needs the index. In fact, the reason some of those things currently require the index is not that they actually need it, but that one of the most pernicious Enzo-isms in yt is a lingering dependency on the index for certain operations.
The Problems
There are a few overlapping problems here. The first is that constructing the index conflates two things: parsing the index files to extract all the necessary information, and constructing the in-memory representation of things like grids.
The other big problem is that there are a handful of peculiarities (not necessarily problems, to be clear -- I am absolutely not disparaging Enzo here) in how Enzo stores some specific pieces of information. The two pieces in question are the maximum level of the grids and the fields available in a given dataset.
The former is addressed in PR #5211; the latter comes up in #5250 and #5224.
What Fields Are There
Enzo's fields are stored in a hierarchy of HDF5 leaves, for instance:
/Grid000001/Density
/Grid000001/Temperature
/Grid000001/particle_position_x
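Enumerating these fields means opening the data files and walking every grid group -- the "census" discussed below. A minimal sketch of that walk, assuming h5py (the function name is ours, not yt's actual detection code):

import h5py

def census_fields(filename):
    fields = set()
    with h5py.File(filename, "r") as f:
        for name, group in f.items():
            if not name.startswith("Grid"):
                continue  # skip non-grid top-level entries
            fields.update(group.keys())  # the HDF5 leaves are field names
    return sorted(fields)

For a real hierarchy this walk has to be repeated over every data file, which is exactly what makes it expensive.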
There is secondary information in the "parameter file" that, in many cases, spares us from doing a census of the entire set of HDF5 leaves. These are lines like:
DataLabel[0] = Density
DataUnits[0] = none
#DataCGSConversionFactor[0] = 2.76112e-30
DataLabel[1] = x-velocity
DataUnits[1] = none
#DataCGSConversionFactor[1] = 1.32477e+06
DataLabel[2] = y-velocity
DataUnits[2] = none
#DataCGSConversionFactor[2] = 1.32477e+06
DataLabel[3] = z-velocity
DataUnits[3] = none
#DataCGSConversionFactor[3] = 1.32477e+06
DataLabel[4] = TotalEnergy
DataUnits[4] = none
DataLabel[5] = Bx
DataUnits[5] = none
DataLabel[6] = By
DataUnits[6] = none
DataLabel[7] = Bz
DataUnits[7] = none
(There are many more lines like these.) Unfortunately, there are two reasonably important cases that the DataLabel lines don't cover: derived fields (such as Dark_Matter_Density) and particle fields -- particularly "Active Particle" fields, though not limited to those.
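For concreteness, the cheap path is simple; a minimal sketch of reading the DataLabel entries (the function name is ours):

import re

_LABEL_RE = re.compile(r"^DataLabel\[(\d+)\]\s*=\s*(\S+)")

def read_data_labels(parameter_file):
    labels = {}
    with open(parameter_file) as f:
        for line in f:
            m = _LABEL_RE.match(line.strip())
            if m:
                labels[int(m.group(1))] = m.group(2)
    return [labels[i] for i in sorted(labels)]

But, per the caveat above, this can never yield a complete field list on its own.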
So, if we assume that we do not want to make multiple modifications to the field_info object, then for Enzo specifically we need to do a full census to ensure we have every single field. And a census is expensive, so we defer it to the other time we do an expensive operation: constructing the index.
(If we relaxed this requirement, we would not need to do this. But generating the derived field list can be expensive, so if we have a derived field that requires fields A, B, and C, and only A and B are identified pre-census, then by the time C gets added we no longer know we need to enable that field, and we'd have to run the whole field derivation process again. It gets even stickier if, for instance, yt can generate C as a fallback -- as it can for temperature, dark matter density, etc.)
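A toy illustration of that ordering problem (the field names are invented):

derived_field_deps = {"D": {"A", "B", "C"}}  # D needs A, B, and C

def detect(available):
    return {f for f, deps in derived_field_deps.items() if deps <= available}

available = {"A", "B"}    # known before the census
print(detect(available))  # set(): D looks unavailable
available.add("C")        # the census later turns up C
print(detect(available))  # {'D'}: detection must rerun to notice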
Maybe It's Not So Bad
Well, perhaps it's quite -- no, it's exactly that bad. Sorry.
What If We...
Maybe we should just do this differently. Maybe we should move all the field handling back into the Dataset, and for frontends like Enzo, make them require the index. (This would be the reverse of what we're doing now -- Enzo would no longer be the default behavior.) We'd likely still want derived fields to be generated later, since they can be expensive.
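Sketched as a class attribute, that inversion might look like the following; the flag and method names are hypothetical, not yt's API:

class Dataset:
    _fields_require_index = False  # proposed default: no index for fields

    def _build_index(self):  # stand-in for expensive index parsing
        self.index = object()

    def _read_field_list(self):  # stand-in for cheap field detection
        return []

    def _detect_fields(self):
        if self._fields_require_index:
            self._build_index()  # only opted-in frontends pay this cost
        self.field_list = self._read_field_list()

class EnzoDataset(Dataset):
    _fields_require_index = True  # Enzo still needs the full HDF5 census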
And, for things like the max level and whatnot, in many cases we don't actually require the index be parsed to get that information; Enzo is a bit of a special case. (And I think it's not impossible to get this information in RAMSES without fully parsing the index, but I might be wrong.) So let's just move that around, too.
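For frontends whose parameter file records the level directly, the no-index path can be as simple as the sketch below. (A hedged illustration: note that for Enzo, MaximumRefinementLevel is the permitted maximum, not necessarily the deepest level actually populated, which is part of what makes it a special case.)

def read_max_level(parameter_file):
    with open(parameter_file) as f:
        for line in f:
            if line.startswith("MaximumRefinementLevel"):
                return int(line.split("=")[1])
    return None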
Moving Forward
Even if this gets resolved, the biggest remaining problem may be that the construction of derived fields is slow. The reason is that we try to generate everything and use failures to mark fields as "not available," which is the only way I could see to get around the fallback fields mentioned above.
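That pattern, roughly (the names are illustrative, not yt's internals):

candidate_fields = {
    "density": lambda data: data["Density"],
    "temperature": lambda data: data["Temperature"],  # not on disk below
}
on_disk = {"Density": 1.0}  # stand-in for a tiny sample read
detected = {}
for name, func in candidate_fields.items():
    try:
        func(on_disk)  # attempt to evaluate the field definition
        detected[name] = func
    except KeyError:
        pass  # a dependency was missing: mark "not available"
print(sorted(detected))  # ['density']

Every candidate gets attempted, whether or not it could possibly apply.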
This has a big impact on testing, since creating a dataset for each test adds the overhead of field detection, and that really adds up.
There have been a few alternatives proposed, but the one that stands out right at this moment is field plugin validation. The problem with this is that we need a way to identify which plugins are relevant without requiring NxM adapters to connect N different plugins to M different frontends.
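One possible shape for that validation (purely speculative): each plugin declares the on-disk fields it needs, so relevance becomes a set check against the frontend's field list -- N plus M declarations rather than NxM adapters.

from dataclasses import dataclass

@dataclass(frozen=True)
class FieldPlugin:
    name: str
    requires: frozenset

plugins = [
    FieldPlugin("fluid", frozenset({"Density", "TotalEnergy"})),
    FieldPlugin("magnetic", frozenset({"Bx", "By", "Bz"})),
]

def relevant_plugins(on_disk_fields):
    return [p for p in plugins if p.requires <= on_disk_fields]

print([p.name for p in relevant_plugins({"Density", "TotalEnergy"})])  # ['fluid']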
Metrics of Success
Without generating the index, we should be able to:
- Create any selection-based data object
- Query the fields available on disk (and maybe aliased or derived fields, but let's not get too far ahead of ourselves)
We should also see how fast we can make dataset instantiation.
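In code, success might look like this (aspirational, not current behavior, and the dataset path is hypothetical):

import yt

ds = yt.load("DD0046/DD0046")       # no index construction here...
print(ds.field_list)                # ...nor here, for the on-disk fields
sp = ds.sphere("c", (10.0, "kpc"))  # a selection object, still no index
# The index would only be built when data is actually read,
# e.g. on sp["gas", "density"].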