Commit 910dbfa

Merge pull request #41 from amosproj/dataset_eda
Dataset eda + reorganisation of resource folder
2 parents 67ff298 + 958cf3a commit 910dbfa

31 files changed

+7981
-65
lines changed

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
1+
## EIA Dataset Overview
2+
3+
The U.S. Energy Information Administration (EIA) provides a comprehensive set of open datasets covering all major aspects of the global energy system.
4+
These datasets are available through the **EIA Open Data API**, which organizes data into high-level categories and numerous subroutes.
5+
6+
### Query Builder
7+
8+
`scripts/eia_query_builder.py` contains a query builder for requesting datasets via the EIA API. Because the API returns at most 5000 rows per call, fetching large datasets by hand is tedious. The script takes a generated [API URL](https://www.eia.gov/opendata/browser) and polls it year by year.
9+
10+
*Note: You need to create an account and provide your own API key to query new data!*
11+
12+
*Also note: Unit and integration tests for the query builder are already implemented.*
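
The year-by-year polling described above can be sketched roughly as follows. This is a minimal illustration, not the code in `scripts/eia_query_builder.py`: the function name `yearly_urls` is hypothetical, and it assumes the generated URL accepts `start`, `end`, `offset`, `length`, and `api_key` query parameters with monthly periods formatted as `YYYY-MM`.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

PAGE_SIZE = 5000  # row limit per API call, as stated above


def yearly_urls(base_url: str, first_year: int, last_year: int, api_key: str) -> list[str]:
    """Return one URL per calendar year, each restricted to that year's months."""
    parts = urlsplit(base_url)
    # Keep every parameter of the generated URL except the ones overridden below.
    overridden = {"start", "end", "offset", "length", "api_key"}
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in overridden
    ]
    urls = []
    for year in range(first_year, last_year + 1):
        params = kept + [
            ("api_key", api_key),
            ("start", f"{year}-01"),   # assumed period format: YYYY-MM
            ("end", f"{year}-12"),
            ("offset", "0"),
            ("length", str(PAGE_SIZE)),
        ]
        urls.append(urlunsplit(parts._replace(query=urlencode(params))))
    return urls
```

If a single year could return more than 5000 rows, `offset` would additionally need to be advanced in steps of `PAGE_SIZE` until a page comes back short.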
13+
14+
### Main Categories (EIA API Root Endpoints)
15+
16+
| Main Category | Description |
17+
|----------------|-------------|
18+
| **Electricity** | Covers generation, consumption, transmission networks, and capacity statistics. |
19+
| **Natural Gas** | Includes production, consumption, storage, and pricing data for natural gas. |
20+
| **Petroleum** | Provides detailed data on refining, product output, inventories, and utilization. |
21+
| **Crude Oil Imports** | Tracks imported crude volumes by origin, port, and transport method. |
22+
| **Coal** | Reports production, exports, prices, and stock levels for coal and derivatives. |
23+
| **Densified Biomass** | Contains data on pellet and biomass production and storage. |
24+
| **Nuclear Plant Outages** | Lists generation outages, reactor capacity, and downtime. |
25+
| **Outlook & Projections** | Provides energy market forecasts, fuel price projections, and CO₂ demand outlooks. |
26+
| **Total Energy** | Aggregated statistics on total energy production and consumption across all sectors. |
27+
| **State Energy Data System (SEDS)** | State-level energy production, consumption, and emissions breakdown. |
28+
| **CO₂ Emissions** | Tracks emissions by source type, energy carrier, and sector. |
29+
| **International Energy** | Global datasets on production, trade flows, and consumption by country. |
30+
31+
### Context for Shell-Related Analysis
32+
33+
Relevant main EIA categories for Shell refinery and energy forecasting:
34+
35+
- Petroleum
36+
- Crude Oil Imports
37+
- Natural Gas
38+
- Electricity
39+
- CO₂ Emissions
40+
- Total Energy
41+
42+
### EDA
43+
44+
Since the whole dataset is far too big to explore in full, the following subset is analyzed:
45+
46+
#### Refinery Yield
47+
48+
*Dataset Structure*
49+
50+
| # | Column | Non-Null Count | Dtype |
51+
|--:|----------------------|---------------:|:-------|
52+
| 0 | period | 120 186 | object |
53+
| 1 | duoarea | 120 186 | object |
54+
| 2 | area-name | 120 186 | object |
55+
| 3 | product | 120 186 | object |
56+
| 4 | product-name | 120 186 | object |
57+
| 5 | process | 120 186 | object |
58+
| 6 | process-name | 120 186 | object |
59+
| 7 | series | 120 186 | object |
60+
| 8 | series-description | 120 186 | object |
61+
| 9 | value | 116 668 | object |
62+
|10 | units | 120 186 | object |
63+
64+
**Total columns:** 11
65+
**Dtypes:** all `object`
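
Because every column arrives as `object` (string), `period` and `value` need to be converted before any numeric or time-series analysis. Below is a minimal sketch using only the standard library. It assumes `period` is formatted as `YYYY-MM` (e.g. `2010-03`) and that missing values arrive as `None` or an empty string; the helper name `parse_row` is hypothetical.

```python
from datetime import date
from typing import Optional


def parse_row(row: dict) -> dict:
    """Return a copy of `row` with `period` as a date and `value` as float or None."""
    year, month = row["period"].split("-")
    raw = row.get("value")
    value: Optional[float]
    try:
        value = float(raw) if raw not in (None, "") else None
    except ValueError:
        value = None  # non-numeric placeholder: treated as missing (assumption)
    parsed = dict(row)
    parsed["period"] = date(int(year), int(month), 1)  # first day of the month
    parsed["value"] = value
    return parsed
```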
66+
67+
*Dataset Overview*
68+
69+
| Column | Count | Unique | Top | Freq |
70+
|----------------------|-------:|-------:|------------------------------------------|------:|
71+
| period | 120 186 | 229 | 2010-03 | 1 408 |
72+
| duoarea | 120 186 | 16 | NUS | 9 457 |
73+
| area-name | 120 186 | 7 | NA | 70 551 |
74+
| product | 120 186 | 56 | EPJKC | 6 527 |
75+
| product-name | 120 186 | 56 | Commercial Kerosene-Type Jet Fuel | 6 527 |
76+
| process | 120 186 | 1 | YPY | 120 186 |
77+
| process-name | 120 186 | 1 | Refinery Net Production | 120 186 |
78+
| series | 120 186 | 1 575 | M_EPJKM_YPY_R50_MBBL | 227 |
79+
| series-description | 120 186 | 1 575 | West Coast (PADD 5) Refinery Net Production of… | 227 |
80+
| value | 116 668 | 13 794 | 0 | 5 951 |
81+
| units | 120 186 | 2 | MBBL | 60 909 |
82+
83+
*Missing Values:*
84+
85+
| Column | Missing Count | Missing % |
86+
|---------------|---------------:|-----------:|
87+
| value | 3 518 | 2.93 % |
88+
89+
Not every month in the queried range (`1992 - 2025`) is present in the actual dataset:
90+
91+
| Metric | Value |
92+
|------------------------|-------:|
93+
| Unique periods (months) | 229 |
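
A gap check of this kind can be sketched as below. It assumes periods are `YYYY-MM` strings; the function name `missing_months` is hypothetical. For reference, a complete monthly range from January 1992 through December 2025 would span up to 408 months, compared with the 229 observed.

```python
def missing_months(periods, start, end):
    """List the YYYY-MM months in [start, end] that do not appear in `periods`."""
    present = set(periods)
    year, month = map(int, start.split("-"))
    end_year, end_month = map(int, end.split("-"))
    missing = []
    while (year, month) <= (end_year, end_month):
        label = f"{year:04d}-{month:02d}"
        if label not in present:
            missing.append(label)
        # Advance by one month, rolling over into the next year after December.
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return missing
```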
94+
95+
*Example plots*
96+
97+
![alt text](data/img/r3b_epjkc.png)
98+
99+
*Data cleanup is needed to assess missing values when using non-aggregated data!*
100+
101+
*Seasonality:*
102+
103+
![alt text](data/img/monthly_full.png)
104+
105+
| duoarea | trend_strength | seasonal_strength | n_points |
106+
|----------|----------------|------------------|-----------|
107+
| NUS | 0.886752 | 0.933713 | 228 |
108+
| R30 | 0.856845 | 0.902187 | 229 |
109+
| R3B | 0.856639 | 0.913179 | 228 |
110+
111+
![alt text](data/img/monthly_2006.png)
112+
113+
| duoarea | trend_strength | seasonal_strength | n_points |
114+
|----------|----------------|------------------|-----------|
115+
| NUS | 0.410395 | 0.796275 | 80 |
116+
| R30 | 0.342718 | 0.762639 | 80 |
117+
| R3B | 0.224469 | 0.765235 | 80 |
118+
119+
- **Full period (including data before 2006):**
120+
- Significantly higher *trend_strength* and *seasonal_strength* values (≈ 0.86–0.93).
121+
- Likely **distorted by outliers or structural changes** (e.g., unit, measurement, or reporting adjustments).
122+
- Metrics **do not represent actual time-series behavior** but reflect historical inconsistencies.
123+
124+
- **After 2006:**
125+
- Values decrease to more realistic levels (*trend_strength* ≈ 0.2–0.4, *seasonal_strength* ≈ 0.75–0.8).
126+
- Indicates **stable, consistent annual cycles** with limited long-term drift.
127+
- Data are **more homogeneous and suitable for forecasting**.
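
How *trend_strength* and *seasonal_strength* were computed is not stated here. One common definition, assumed in this sketch, splits the series additively into trend `T`, seasonal `S`, and remainder `R`, and takes `max(0, 1 - Var(R) / Var(T + R))` as trend strength and `max(0, 1 - Var(R) / Var(S + R))` as seasonal strength. The code below uses a simple classical decomposition (centered moving average) rather than STL, so its numbers would not necessarily match the tables; the function name `strengths` is hypothetical.

```python
from statistics import mean, pvariance


def strengths(values, period=12):
    """Return (trend_strength, seasonal_strength) for an evenly spaced series.

    Assumes an even `period` and at least two full cycles of data.
    """
    n, half = len(values), period // 2
    # Trend: centered moving average with half weights on the two end points.
    trend = {}
    for i in range(half, n - half):
        window = values[i - half:i + half + 1]
        trend[i] = (0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]) / period
    # Seasonal: mean detrended value per position in the cycle, centered on zero.
    detrended = {i: values[i] - t for i, t in trend.items()}
    by_pos = [[d for i, d in detrended.items() if i % period == p] for p in range(period)]
    seasonal_means = [mean(group) for group in by_pos]
    offset = mean(seasonal_means)
    seasonal = [m - offset for m in seasonal_means]
    # Remainder: what is left after removing trend and seasonality.
    idx = sorted(trend)
    rem = [values[i] - trend[i] - seasonal[i % period] for i in idx]
    var_r = pvariance(rem)
    var_tr = pvariance([trend[i] + r for i, r in zip(idx, rem)])
    var_sr = pvariance([seasonal[i % period] + r for i, r in zip(idx, rem)])
    trend_strength = max(0.0, 1 - var_r / var_tr) if var_tr else 0.0
    seasonal_strength = max(0.0, 1 - var_r / var_sr) if var_sr else 0.0
    return trend_strength, seasonal_strength
```

Under this definition, outliers or level shifts in older records can change the variance of the trend and seasonal estimates relative to the remainder, which is one way such records could distort the metrics.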
128+
129+
**Conclusion:**
130+
Historical data can provide broader coverage and longer time horizons, but **older records may distort time-series characteristics** due to structural changes, inconsistent measurement, or reporting differences.
131+
Such effects need to be **evaluated carefully** (e.g., variance or outlier analysis), and the **relevant modeling period should be selected based on data stability** rather than maximum historical length.
132+
133+
### Outlook
134+
135+
Further evaluation of **different data subsets** is required to verify whether similar issues (e.g., structural shifts, outliers, or inconsistent reporting) occur in other parts of the dataset.
136+
Each subset should be **assessed individually** to ensure data stability and reliability before inclusion in forecasting models.