Enforce more specific PUDL resource name constraints.#4438
Conversation
zaneselvans
left a comment
There was a problem hiding this comment.
Most of the changed lines are fixing names of test tables, both in the unit tests and in the docstring examples / tests in classes.py Substantiative changes in:
classes.py(not the docstrings)dbt_helper.pysources.py
| min_length=1, strict=True, pattern=r"^[a-z_][a-z0-9_]*(_[a-z0-9]+)*$" | ||
| min_length=1, | ||
| strict=True, | ||
| pattern=r"^[a-z_][a-z0-9_]*(_[a-z0-9]+)*$", |
There was a problem hiding this comment.
No changes here -- just forced args onto individual lines.
| PudlResourceName = Annotated[ | ||
| str, | ||
| StringConstraints( | ||
| min_length=1, | ||
| strict=True, | ||
| pattern=rf"^(_?core_|out_)({'|'.join(ALL_PUDL_SOURCES)})__([a-z0-9_]+)$", | ||
| ), | ||
| ] | ||
| """A valid PUDL resource name conforming to our naming conventions :class:`str`.""" |
There was a problem hiding this comment.
This was the lightest weight way I could see to do this. However, it means that the runtime enforcement of type conformance needs to be done with the function below. So maybe it would be better if it were a "real" Pydantic model, if it's not being used to type a field in a Pydantic model (as it is in the Resource class and everything that hangs off of that)
|
|
||
| ALL_PUDL_SOURCES = sorted(set(SOURCES.keys()).union(["epa", "eia", "ferc"])) |
There was a problem hiding this comment.
- We use the
eiaandferc"sources" for tables that are generic to the whole agency (i.e. they come from more than one data source). - However, we haven't been great about really defining what that means -- there are many tables with multiple "sources" listed which nevertheless have a single form listed in the resource name.
- The
epasource on the other hand is a rogue, used I think becauseepacamd_eiapertains to two data sources (it's glue...) and is the only data source we have that contains an underscore, which would mess up our table naming if we used it as the data source. - So maybe
epacamd_eiashouldn't actually be allowed here. Maybe we should rename that source.
| def get_data_source(table_name: str) -> str: | ||
| def get_data_source(table_name: PudlResourceName) -> str: | ||
| """Return the data source element of the table's name.""" | ||
| if not is_valid_pudl_resource_name(table_name): | ||
| raise ValueError(f"Invalid PUDL resource name: {table_name}") |
There was a problem hiding this comment.
Avoids the possibility of a malformed table name resulting in undefined behavior.
Overview
core_coreandoutlayers (since those are the only ones we write to the database or Parquet).Closes #4436
Testing
To-do list