Integrate Sylvan Energy heat rates analysis #5190
    from . import (
        allocate_gen_fuel,
        derived_plant_characteristics,
I chose this name, but it may not be the right one.
    HEAT_RATE_ANALYSIS_CONFIG_SCHEMA = {
        "final_year": Field(
            int,
            default_value=2024, # derive from dataset settings instead of hard coding?
As noted in the inline comment: should this be derived from the dataset settings instead of being hard-coded?
    ),
    "eia_epa_mapping_year": Field(
        int,
        default_value=2024, # derive from dataset settings instead of hard coding?
Again, this could be derived from the dataset settings.
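One way to avoid the hard-coded defaults, sketched under assumptions: take the latest year configured for the dataset and count back a fixed window. The `derive_year_window` helper, its `configured_years` argument, and the 3-year default window are hypothetical names chosen for illustration. They are not taken from this PR's code.

```python
def derive_year_window(configured_years, window_years=3):
    """Return (start_year, final_year) covering the latest `window_years` years.

    Hypothetical helper: `configured_years` stands in for whatever collection
    of years the dataset settings expose.
    """
    if not configured_years:
        raise ValueError("No years configured for this dataset.")
    final_year = max(configured_years)
    # Inclusive window, so a 3-year window ending in 2024 starts in 2022.
    start_year = final_year - window_years + 1
    return start_year, final_year


start_year, final_year = derive_year_window(range(2019, 2025))
```

With this shape, the schema defaults would follow the settings automatically when a new year of data is added.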
    if isinstance(core_epacems__hourly_emissions, pl.LazyFrame):
        cems_lf = core_epacems__hourly_emissions.select(cems_columns).filter(
            pl.col("year").is_between(start_year, final_year, closed="both")
        )
        if states:
            cems_lf = cems_lf.filter(pl.col("state").is_in(states))

        cems = cems_lf.collect(engine="streaming").to_pandas()
    else:
        cems = core_epacems__hourly_emissions.loc[
            core_epacems__hourly_emissions["year"].between(start_year, final_year),
            cems_columns,
        ].copy()

        if states:
            cems = cems[cems["state"].isin(states)].copy()
Materializing the asset was crashing for me when loading CEMS as a pandas DataFrame, so I had to keep it as a polars LazyFrame for as long as possible.
    """Estimate EPA CEMS unit operational characteristics."""
    heat_rate_config = _get_heat_rate_analysis_config(context)
    return (
        pl.from_pandas(
Pandera yelled at me when I tried returning this as a pandas DataFrame, so I made it a polars LazyFrame. I'm not sure what the convention is here.
    if valid_for_binning.any():
        binned = (
            cems.loc[valid_for_binning]
            .groupby(unit_cols, group_keys=False)[load_factor_col]
            .apply(lambda s: pd.cut(s, bins=10, right=True, include_lowest=False))
            .astype("object")
        )
This groupby-apply is not vectorized, but I couldn't figure out how to avoid it while still matching the output of Jaxon's script.
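For reference, a small sketch of why the binning is per unit: with an integer `bins`, `pd.cut` derives equal-width edges from the min and max of the values it is given, so each unit gets its own edges. The column names and values below are made up, and `labels=False` is used only to return integer bin positions. The PR code keeps the interval objects instead.

```python
import pandas as pd

# Toy data: two units with different load factor ranges.
cems = pd.DataFrame(
    {
        "unit_id": ["A"] * 5 + ["B"] * 5,
        "load_factor": [0.0, 0.25, 0.5, 0.75, 1.0, 0.5, 0.6, 0.7, 0.8, 0.9],
    }
)

# Bin each unit separately into 10 equal-width bins spanning that unit's range.
cems["bin_position"] = cems.groupby("unit_id")["load_factor"].transform(
    lambda s: pd.cut(s, bins=10, right=True, labels=False)
)

# Each unit's lowest value falls in its first bin and its highest in its last,
# even though the two units cover different absolute ranges.
per_unit_extremes = cems.groupby("unit_id")["bin_position"].agg(["min", "max"])
```

A single global `pd.cut` over all units would instead share one set of edges, which is presumably why a fully vectorized version changes the results.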
        return (~(same_unit & consecutive_hour & same_state)).cumsum()


    def _assign_groupwise_load_factor_bins(
This diverges slightly from the original script. It no longer loops over the units and does everything within the loop; it now does per-unit pd.cut just for bin assignment, then feeds those bins into (mostly) vectorized operations.
        return ramp_input.groupby(unit_cols).apply(summarize_unit).reset_index()


    def estimate_operational_characteristics_by_unit(
This does several things: per-unit max-load prep, binning, stable-run detection, stable-bin selection, heat-rate summarization, up/down run summarization, and final column shaping. It could probably be broken out into more pieces.
        return cems


    def _summarize_ramp_rates(
This is intentionally not fully vectorized, because all the vectorized implementations I tried led to meaningful changes in the final output compared to the script's final output.
    stable_runs = (
        binned_cems.loc[binned_cems["load_factor_bin_ordinal"] > 1]
        .groupby(
            unit_cols
            + [
                "load_factor_bin_ordinal",
                "load_factor_bin_left",
                "load_factor_bin",
                "bin_run_id",
            ],
            as_index=False,
        )
        .size()
        .rename(columns={"size": "run_length"})
    )
This identifies candidate stable runs from binned_cems, then chooses one stable bin per unit to calculate heat_rate_at_min_stable_level_mmbtu_per_mwh and min_up_time_hr.
There are two different representations of the "stable bin" here:
load_factor_bin_ordinal: a numeric ordering of bins within a unit
min_stable_bin: the actual pd.Interval object from pd.cut
The ordinal is good for sorting and picking the first qualifying stable bin. But to match the outputs of the script (for heat_rate_at_min_stable_level_mmbtu_per_mwh and min_up_time_hr), the actual interval object matters, because the original script selects rows based on exact bin membership rather than just bin order.
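To make the run detection concrete, here is a toy sketch of the cumulative-sum idiom for run IDs followed by the run-length count. The data and the integer `unit_id` are invented. Unlike the diff, which also compares state, this sketch starts a new run whenever the unit or the bin changes or the hours are not consecutive.

```python
import pandas as pd

# Toy hourly data: unit 1 has a gap between 04:00 and 06:00.
binned = pd.DataFrame(
    {
        "unit_id": [1, 1, 1, 1, 1, 1, 2, 2],
        "hour": pd.to_datetime(
            [
                "2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 02:00",
                "2024-01-01 03:00", "2024-01-01 04:00", "2024-01-01 06:00",
                "2024-01-01 00:00", "2024-01-01 01:00",
            ]
        ),
        "load_factor_bin_ordinal": [2, 2, 2, 3, 3, 2, 2, 2],
    }
)

# A row continues the previous run only if all three conditions hold.
same_unit = binned["unit_id"] == binned["unit_id"].shift()
same_bin = (
    binned["load_factor_bin_ordinal"] == binned["load_factor_bin_ordinal"].shift()
)
consecutive_hour = binned["hour"].diff() == pd.Timedelta(hours=1)

# Every break increments the counter, so rows in one run share an ID.
binned["bin_run_id"] = (~(same_unit & same_bin & consecutive_hour)).cumsum()

# Count hours per run, mirroring the groupby/size step in the diff.
stable_runs = (
    binned.groupby(
        ["unit_id", "load_factor_bin_ordinal", "bin_run_id"], as_index=False
    )
    .size()
    .rename(columns={"size": "run_length"})
)
```

The first row of the frame always starts a run, because comparing against the shifted (missing) value evaluates to False.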
        ),
        "unit": "MW",
    },
    "min_down_time_hr": {
This should probably be "min_down_time_hours"? And the same for "min_up_time_hours".
    "usage_warnings": ["estimated_values"],
    "additional_details_text": """This table summarizes several inferred
    operational characteristics for each EPA CEMS emissions unit using hourly CEMS
    gross load and fuel heat content over a configurable multi-year window.
To-do: For basically all users this will not be configurable. Should we just note that this is 3 years by default?
    if ramp.limit(1).collect().is_empty():
        return np.nan, np.nan

    # Cast to pandas to qcut bins
Consider whether to change binning method to work in polars.
Overview
Running Sylvan's script was actually quite fast for me; the slow part was reading in / downloading CEMS. Still, parallelizing it and making it available to run on all states at once seems good.
Closes #5106.
What problem does this address?
Dagsterizes Sylvan's analysis of plant characteristics and heat rate calculations.
I was able to vectorize almost all of it, except some of the binning (see comments). It runs pretty fast for me in Dagster, but I haven't tried running it on all of the states. Maybe the binning needs to be vectorized as well, but then it likely won't match Jaxon's script outputs exactly.
I haven't yet created assets for the other CSVs produced by the script (eia_860_plant_unit_summary.csv, eia_860_plant_summary.csv, eia_860_plant_gen_summary.csv), and I'm not sure whether these all need to be assets.
What did you change?
Documentation
Make sure to update relevant aspects of the documentation:
docs/data_sources/templates).
src/metadata).
Testing
How did you make sure this worked? How can a reviewer verify this?
I tested this by materializing out_epacems__yearly_operational_characteristics on just California with 3 years of data and comparing the asset to the epa_op_char_output_df.csv output from Jaxon's script. They were identical except for a 0.1 difference in heat_rate_at_min_stable_level_mmbtu_per_mwh. This could be a rounding error?
I didn't touch anything dbt, so that will need to get added in. It's currently failing the unit tests because of this.
To-do list
out of scope
Testing todos
dbt tests.
pixi run prek-run to run linters and static code analysis checks.
pixi run pytest-ci locally to ensure that the merge queue will accept your PR.
build-deploy-pudl GitHub Action manually and ensure that it succeeds.