|
| 1 | +## EIA Dataset Overview |
| 2 | + |
| 3 | +The U.S. Energy Information Administration (EIA) publishes a comprehensive set of open datasets covering the major parts of the U.S. energy system, along with international energy data. |
| 4 | +These datasets are available through the **EIA Open Data API**, which organizes data into high-level categories and numerous subroutes. |
| 5 | + |
| 6 | +### Query Builder |
| 7 | + |
| 8 | +`scripts/eia_query_builder.py` contains a query builder for requesting datasets via the EIA API. Because the API returns at most 5000 rows per call, fetching a large dataset by hand is tedious. The script takes an [API URL](https://www.eia.gov/opendata/browser) generated in the EIA API browser and polls it year by year. |
| 9 | + |
| 10 | +*Note: You need to create an account and provide your own API key to query new data!* |
| 11 | + |
| 12 | +*Also note: Unit and integration tests for the query builder are already implemented.* |
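The year-by-year polling described above can be sketched as follows. This is a minimal illustration, not the code in `scripts/eia_query_builder.py`: the helper name `yearly_urls` is hypothetical, and it assumes the generated URL accepts `start`, `end`, `offset`, `length`, and `api_key` query parameters with `YYYY-MM` period values.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

PAGE_SIZE = 5000  # row cap per API call mentioned above

# Parameters that are rewritten for every yearly request (assumed names).
WINDOW_PARAMS = {"start", "end", "offset", "length", "api_key"}


def yearly_urls(base_url, first_year, last_year, api_key="YOUR_API_KEY"):
    """Return one request URL per calendar year (hypothetical helper)."""
    parts = urlsplit(base_url)
    # Keep facets, frequency, etc. from the generated URL; drop the window.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in WINDOW_PARAMS
    ]
    urls = []
    for year in range(first_year, last_year + 1):
        window = [
            ("start", f"{year}-01"),
            ("end", f"{year}-12"),
            ("offset", "0"),
            ("length", str(PAGE_SIZE)),
            ("api_key", api_key),
        ]
        query = urlencode(kept + window)
        urls.append(urlunsplit(parts._replace(query=query)))
    return urls
```

If a single year still holds more rows than the cap, the `offset` value would additionally have to be advanced in steps of `PAGE_SIZE` until a short page comes back.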
| 13 | + |
| 14 | +### Main Categories (EIA API Root Endpoints) |
| 15 | + |
| 16 | +| Main Category | Description | |
| 17 | +|----------------|-------------| |
| 18 | +| **Electricity** | Covers generation, consumption, transmission networks, and capacity statistics. | |
| 19 | +| **Natural Gas** | Includes production, consumption, storage, and pricing data for natural gas. | |
| 20 | +| **Petroleum** | Provides detailed data on refining, product output, inventories, and utilization. | |
| 21 | +| **Crude Oil Imports** | Tracks imported crude volumes by origin, port, and transport method. | |
| 22 | +| **Coal** | Reports production, exports, prices, and stock levels for coal and derivatives. | |
| 23 | +| **Densified Biomass** | Contains data on pellet and biomass production and storage. | |
| 24 | +| **Nuclear Plant Outages** | Lists generation outages, reactor capacity, and downtime. | |
| 25 | +| **Outlook & Projections** | Provides energy market forecasts, fuel price projections, and CO₂ demand outlooks. | |
| 26 | +| **Total Energy** | Aggregated statistics on total energy production and consumption across all sectors. | |
| 27 | +| **State Energy Data System (SEDS)** | State-level energy production, consumption, and emissions breakdown. | |
| 28 | +| **CO₂ Emissions** | Tracks emissions by source type, energy carrier, and sector. | |
| 29 | +| **International Energy** | Global datasets on production, trade flows, and consumption by country. | |
| 30 | + |
| 31 | +### Context for Shell-Related Analysis |
| 32 | + |
| 33 | +Relevant main EIA categories for Shell refinery and energy forecasting: |
| 34 | + |
| 35 | +- Petroleum |
| 36 | +- Crude Oil Imports |
| 37 | +- Natural Gas |
| 38 | +- Electricity |
| 39 | +- CO₂ Emissions |
| 40 | +- Total Energy |
| 41 | + |
| 42 | +### EDA |
| 43 | + |
| 44 | +Since the whole dataset is far too big to explore in full, the following subset is analyzed: |
| 45 | + |
| 46 | +#### Refinery Yield |
| 47 | + |
| 48 | +*Dataset Structure* |
| 49 | + |
| 50 | +| # | Column | Non-Null Count | Dtype | |
| 51 | +|--:|----------------------|---------------:|:-------| |
| 52 | +| 0 | period | 120 186 | object | |
| 53 | +| 1 | duoarea | 120 186 | object | |
| 54 | +| 2 | area-name | 120 186 | object | |
| 55 | +| 3 | product | 120 186 | object | |
| 56 | +| 4 | product-name | 120 186 | object | |
| 57 | +| 5 | process | 120 186 | object | |
| 58 | +| 6 | process-name | 120 186 | object | |
| 59 | +| 7 | series | 120 186 | object | |
| 60 | +| 8 | series-description | 120 186 | object | |
| 61 | +| 9 | value | 116 668 | object | |
| 62 | +|10 | units | 120 186 | object | |
| 63 | + |
| 64 | +**Total columns:** 11 |
| 65 | +**Dtypes:** all `object` |
| 66 | + |
| 67 | +*Dataset Overview* |
| 68 | + |
| 69 | +| Column | Count | Unique | Top | Freq | |
| 70 | +|----------------------|-------:|-------:|------------------------------------------|------:| |
| 71 | +| period | 120 186 | 229 | 2010-03 | 1 408 | |
| 72 | +| duoarea | 120 186 | 16 | NUS | 9 457 | |
| 73 | +| area-name | 120 186 | 7 | NA | 70 551 | |
| 74 | +| product | 120 186 | 56 | EPJKC | 6 527 | |
| 75 | +| product-name | 120 186 | 56 | Commercial Kerosene-Type Jet Fuel | 6 527 | |
| 76 | +| process | 120 186 | 1 | YPY | 120 186 | |
| 77 | +| process-name | 120 186 | 1 | Refinery Net Production | 120 186 | |
| 78 | +| series | 120 186 | 1 575 | M_EPJKM_YPY_R50_MBBL | 227 | |
| 79 | +| series-description | 120 186 | 1 575 | West Coast (PADD 5) Refinery Net Production of… | 227 | |
| 80 | +| value | 116 668 | 13 794 | 0 | 5 951 | |
| 81 | +| units | 120 186 | 2 | MBBL | 60 909 | |
| 82 | + |
| 83 | +*Missing Values:* |
| 84 | + |
| 85 | +| Column | Missing Count | Missing % | |
| 86 | +|---------------|---------------:|-----------:| |
| 87 | +| value | 3 518 | 2.93 % | |
| 88 | + |
| 89 | +Not every month in the queried range (1992 to 2025) is present in the dataset: |
| 90 | + |
| 91 | +| Metric | Value | |
| 92 | +|------------------------|-------:| |
| 93 | +| Unique periods (months) | 229 | |
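One way to list which months are absent is to compare the observed `period` labels against a continuous monthly range. The sketch below is a hedged illustration: the function name `missing_months` is hypothetical, it assumes `period` values are `YYYY-MM` strings, and it only detects gaps between the first and last observed month.

```python
def missing_months(periods):
    """Return 'YYYY-MM' labels absent between the first and last observed month."""
    seen = set(periods)
    # Map each label to a running month index so the range is easy to walk.
    indexes = [int(p[:4]) * 12 + int(p[5:7]) - 1 for p in seen]
    expected = (
        f"{i // 12:04d}-{i % 12 + 1:02d}"
        for i in range(min(indexes), max(indexes) + 1)
    )
    return [label for label in expected if label not in seen]
```

Passing the unique values of the `period` column would return the gaps in chronological order.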
| 94 | + |
| 95 | +*Example plots* |
| 96 | + |
| 97 | + |
| 98 | + |
| 99 | +*Data cleanup is needed to assess missing values when using non-aggregated data!* |
| 100 | + |
| 101 | +*Seasonality:* |
| 102 | + |
| 103 | + |
| 104 | + |
| 105 | +| duoarea | trend_strength | seasonal_strength | n_points | |
| 106 | +|----------|----------------|------------------|-----------| |
| 107 | +| NUS | 0.886752 | 0.933713 | 228 | |
| 108 | +| R30 | 0.856845 | 0.902187 | 229 | |
| 109 | +| R3B | 0.856639 | 0.913179 | 228 | |
| 110 | + |
| 111 | + |
| 112 | + |
| 113 | +| duoarea | trend_strength | seasonal_strength | n_points | |
| 114 | +|----------|----------------|------------------|-----------| |
| 115 | +| NUS | 0.410395 | 0.796275 | 80 | |
| 116 | +| R30 | 0.342718 | 0.762639 | 80 | |
| 117 | +| R3B | 0.224469 | 0.765235 | 80 | |
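The tables do not state how `trend_strength` and `seasonal_strength` were computed. A common definition, assumed here, takes an additive decomposition of the series into trend, seasonal, and remainder components and measures how much of the variance of a component plus remainder is not explained by the remainder alone. The function name and inputs below are illustrative.

```python
from statistics import pvariance


def strength(component, remainder):
    """Strength of a component: max(0, 1 - Var(R) / Var(component + R))."""
    combined = [c + r for c, r in zip(component, remainder)]
    denominator = pvariance(combined)
    if denominator == 0:
        # A flat series carries neither trend nor seasonality.
        return 0.0
    return max(0.0, 1.0 - pvariance(remainder) / denominator)
```

Under this definition, calling `strength(trend, remainder)` gives the trend strength and `strength(seasonal, remainder)` the seasonal strength. Values near 1 mean the component dominates the remainder, and values near 0 mean it is indistinguishable from noise.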
| 118 | + |
| 119 | +- **Before 2006:** |
| 120 | +  - Significantly higher *trend_strength* and *seasonal_strength* values (≈ 0.85–0.93). |
| 121 | +  - Likely **distorted by outliers or structural changes** (e.g., unit, measurement, or reporting adjustments). |
| 122 | +  - The metrics therefore probably **reflect historical inconsistencies** rather than actual time-series behavior. |
| 123 | + |
| 124 | +- **After 2006:** |
| 125 | + - Values decrease to more realistic levels (*trend_strength* ≈ 0.2–0.4, *seasonal_strength* ≈ 0.75–0.8). |
| 126 | + - Indicates **stable, consistent annual cycles** with limited long-term drift. |
| 127 | + - Data are **more homogeneous and suitable for forecasting**. |
| 128 | + |
| 129 | +**Conclusion:** |
| 130 | +Historical data can provide broader coverage and longer time horizons, but **older records may distort time-series characteristics** due to structural changes, inconsistent measurement, or reporting differences. |
| 131 | +Such effects need to be **evaluated carefully** (e.g., variance or outlier analysis), and the **relevant modeling period should be selected based on data stability** rather than maximum historical length. |
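As one possible form of the outlier analysis mentioned above, values can be flagged with the interquartile-range rule. This is a minimal sketch rather than the method used for the analysis: the function name `iqr_outliers` and the conventional factor `k=1.5` are assumptions.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the positions of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [i for i, v in enumerate(values) if v < low or v > high]
```

Running this separately on the periods before and after a candidate cutoff would show whether flagged points cluster in the older records, which supports choosing the modeling period by stability.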
| 132 | + |
| 133 | +### Outlook |
| 134 | + |
| 135 | +Further evaluation of **different data subsets** is required to verify whether similar issues (e.g., structural shifts, outliers, or inconsistent reporting) occur in other parts of the dataset. |
| 136 | +Each subset should be **assessed individually** to ensure data stability and reliability before inclusion in forecasting models. |