
Commit 11b0269

Redesigns system to store data from each AEMET API separately, to keep original variable names, and to be robust
- Cleared old standardized data: moved daily_station_historical.csv and backups out of the way
- Verified the new system works: the historical collection ran successfully with original AEMET variable names
- Launched full 4-dataset collection: Job 17923 is now running with a 6-hour time limit
1 parent dfec1ed commit 11b0269

16 files changed (+1360, -74 lines changed)

README.md

Lines changed: 103 additions & 73 deletions
Original file line number | Diff line number | Diff line change
@@ -1,53 +1,61 @@
11
# Spanish Weather Data Collection System
22

3-
Automated collection and processing of Spanish meteorological data from AEMET (Agencia Estatal de Meteorología) OpenData API.
3+
Automated collection and processing of Spanish meteorological data from AEMET (Agencia Estatal de Meteorología) OpenData API with **original variable names preserved for data integrity**.
44

5+
## Overview
6+
7+
This system collects **four distinct weather datasets** from separate AEMET APIs, maintaining original variable names to ensure data integrity and traceability:
58

6-
## Current Data Status
9+
1. **Historical Daily Stations**: Long-term daily observations from AEMET historical climatological API
10+
2. **Current Daily Stations**: Recent daily data aggregated from hourly observations (gap-filling)
11+
3. **Hourly Station Ongoing**: Real-time hourly measurements from current observation API
12+
4. **Municipal Forecasts**: 7-day forecasts for all Spanish municipalities (ongoing validation collection)
713

8-
*Last updated: 2025-08-26 21:00:46.834172 *
14+
**Key Principle**: Each dataset preserves original AEMET variable names and is kept separate to avoid data mixing issues.
915

10-
### daily station historical
11-
- **Records**: 4543
12-
- **Variables**: 37
13-
- **Last Modified**: 2025-08-26 20:56:47
14-
- **File Size**: 0.91 MB
16+
## Data Outputs
1517

16-
### daily municipal extended
17-
- **Records**: 18232
18-
- **Variables**: 26
19-
- **Last Modified**: 2025-08-26 20:56:48
20-
- **File Size**: 3.01 MB
18+
The system produces four main datasets with **original AEMET variable names**:
2119

22-
### hourly station ongoing
23-
- **Records**: 0
24-
- **Variables**: 5
25-
- **Last Modified**: 2025-08-26 20:56:48
26-
- **File Size**: 0 MB
20+
### 1. `daily_stations_historical.csv.gz`
21+
Daily weather measurements from AEMET historical climatological API.
2722

23+
**Key Variables**: `fecha`, `indicativo`, `tmed`, `tmax`, `tmin`, `prec`, `hrMedia`, `velmedia`, `presMax`, `presMin`
2824

29-
## Overview
25+
**Coverage**: 4,000+ stations across Spain
26+
**Time Range**: 2013 to T-4 days (historical API coverage)
27+
**Update**: Daily
28+
**Source**: AEMET historical climatological endpoint
3029

31-
This system collects three standardized weather datasets covering all Spanish weather stations and municipalities:
30+
### 2. `daily_stations_current.csv.gz`
31+
Recent daily weather data aggregated from hourly observations to fill the gap between historical and present.
3232

33-
- **Daily Station Data**: Historical daily measurements from 4,000+ weather stations
34-
- **Municipal Forecasts**: 7-day forecasts for all 8,000+ Spanish municipalities
35-
- **Hourly Station Data**: High-frequency measurements for recent periods
33+
**Key Variables**: Same as historical (`indicativo`, `fecha`, `ta`/`tmed`, `hr`, `vv`, `pres`, etc.)
3634

37-
Data is automatically collected, quality-controlled, and aggregated into standardized CSV files ready for analysis.
35+
**Coverage**: Same stations as hourly data
36+
**Time Range**: T-4 days to yesterday (gap period)
37+
**Update**: Daily
38+
**Source**: AEMET hourly API aggregated to daily
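
The hourly-to-daily aggregation described for `daily_stations_current.csv.gz` can be illustrated with a minimal shell sketch. This is not the project's R implementation, which is not shown in this diff. It assumes an hourly file with columns `idema`, `fint`, `ta`, assumes `fint` is an ISO-style timestamp whose first 10 characters are the date, and uses a made-up station ID and made-up values purely for illustration.

```bash
# Hypothetical hourly sample; station ID and values are illustrative only.
hourly="$(mktemp)"
cat > "$hourly" <<'EOF'
idema,fint,ta
0000X,2025-08-25T00:00:00,20.0
0000X,2025-08-25T12:00:00,30.0
0000X,2025-08-26T00:00:00,22.0
EOF

# Average ta per station and calendar day.
daily="$(LC_ALL=C awk -F',' '
  NR == 1 { next }                      # skip the header row
  {
    key = $1 "," substr($2, 1, 10)      # station ID plus the date part of fint
    sum[key] += $3
    n[key]++
  }
  END {
    for (key in sum) printf "%s,%.1f\n", key, sum[key] / n[key]
  }
' "$hourly" | LC_ALL=C sort)"

# One row per station and day: idema, date, mean ta.
echo "$daily"
rm -f "$hourly"
```

A daily mean is only one possible aggregate; maxima, minima, or precipitation totals would need their own accumulators, and the real pipeline may handle missing hours differently.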
3839

39-
## Data Outputs
40+
### 3. `hourly_station_ongoing.csv.gz`
41+
High-frequency meteorological measurements from Spanish weather stations.
4042

41-
The system produces three main datasets with standardized variable names:
43+
**Key Variables**: `fint`, `idema`, `measure`, `value` (long format) or direct variables like `ta`, `hr`, `vv`, `pres`
4244

43-
### 1. `daily_station_historical.csv`
44-
Daily weather measurements from Spanish meteorological stations.
45+
**Coverage**: 1,000+ active stations
46+
**Time Range**: Recent hourly observations
47+
**Update**: Every 6 hours
48+
**Source**: AEMET current hourly observation API
4549

46-
**Key Variables**: `date`, `station_id`, `temp_mean`, `temp_max`, `temp_min`, `precipitation`, `humidity_mean`, `wind_speed`, `pressure_max`, `pressure_min`
50+
### 4. `daily_municipal_forecast.csv.gz`
51+
Municipal weather forecasts for validation and analysis (ongoing accumulation).
4752

48-
**Coverage**: 4,000+ stations across Spain
49-
**Time Range**: Recent daily observations
50-
**Update**: Daily at 2 AM
53+
**Key Variables**: `fecha`, `municipio`, `temp_max`, `temp_min`, `temp_avg`, `humid_max`, `humid_min`, `wind_speed`
54+
55+
**Coverage**: 8,000+ Spanish municipalities
56+
**Time Range**: Ongoing collection of 7-day forecasts
57+
**Update**: Daily (accumulates for validation)
58+
**Source**: AEMET municipal forecast API
5159

5260
### 2. `daily_municipal_extended.csv`
5361
Municipal-level weather data combining forecasts with station aggregations.
@@ -59,49 +67,62 @@ Municipal-level weather data combining forecasts with station aggregations.
5967
**Update**: Daily at 2 AM
6068

6169
### 3. `hourly_station_ongoing.csv`
62-
High-frequency station measurements for detailed analysis.
70+
## Variable Names
6371

64-
**Key Variables**: `datetime`, `station_id`, `variable_type`, `value`
72+
**All datasets preserve original AEMET variable names** for data integrity. See [docs/variable_names_reference.md](docs/variable_names_reference.md) for complete variable explanations.
6573

66-
**Coverage**: Selected weather stations
67-
**Update**: Daily at 2 AM
74+
**Key Original Variables**:
75+
- `fecha` = Date
76+
- `indicativo`/`idema` = Station ID
77+
- `tmed`/`ta` = Temperature
78+
- `prec` = Precipitation
79+
- `hrMedia`/`hr` = Humidity
80+
- `velmedia`/`vv` = Wind speed
81+
- `municipio` = Municipality ID
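
As a rough way to confirm that an output file still carries the original AEMET names listed above, the header of the gzipped CSV can be checked from the shell. The sketch below builds a small hypothetical sample file instead of reading the real `data/output/daily_stations_historical.csv.gz`; the station ID and values are invented, and the list of expected columns is an assumption drawn from the key variables named in this README.

```bash
# Hypothetical sample standing in for data/output/daily_stations_historical.csv.gz.
tmpdir="$(mktemp -d)"
sample="$tmpdir/daily_stations_historical.csv.gz"
printf 'fecha,indicativo,tmed,prec,hrMedia,velmedia\n2025-08-20,0000X,24.1,0.0,55,3.2\n' \
  | gzip > "$sample"

# Column names expected in the header (assumed from the README's key variables).
expected="fecha indicativo tmed prec hrMedia velmedia"

# Read only the header line of the compressed file.
header="$(gzip -dc "$sample" | head -n 1)"

missing=""
for name in $expected; do
  # Wrap the header in commas so each name matches a whole column only.
  case ",$header," in
    *",$name,"*) ;;
    *) missing="$missing $name" ;;
  esac
done

if [ -z "$missing" ]; then
  echo "OK: all expected columns present"
else
  echo "Missing columns:$missing"
fi
rm -rf "$tmpdir"
```

Pointing `sample` at a real output file would run the same check against collected data; a standardized English name such as `temp_mean` in place of `tmed` would show up as a missing column.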
6882

6983
## Data Flow
7084

7185
```
72-
AEMET OpenData API
73-
74-
Data Collection
75-
(scripts/r/*.R)
86+
AEMET APIs (Separate Sources)
7687
77-
Quality Control
78-
& Standardization
88+
┌─────────────────────────────────────────┐
89+
│ 1. Historical API → daily_stations_ │
90+
│ historical.csv.gz │
91+
├─────────────────────────────────────────┤
92+
│ 2. Hourly API → hourly_station_ │
93+
│ (aggregated) ongoing.csv.gz │
94+
│ → daily_stations_ │
95+
│ current.csv.gz │
96+
├─────────────────────────────────────────┤
97+
│ 3. Municipal API → daily_municipal_ │
98+
│ forecast.csv.gz │
99+
└─────────────────────────────────────────┘
79100
80-
Municipal Aggregation
81-
(Station → Municipal)
101+
Quality Validation
102+
(Original names preserved)
82103
83-
Final Datasets
84-
(data/output/*.csv)
104+
Four Separate Datasets
105+
(No cross-contamination)
85106
```
86107

87108
## Technical Implementation
88109

89110
### Collection System
90111
- **Language**: R with SLURM job scheduling
91112
- **API Access**: AEMET OpenData with rate limiting
92-
- **Performance**: climaemet package provides 48x speedup for municipal forecasts
93-
- **Execution Time**: 2-4 hours total (previously 33+ hours)
113+
- **Data Integrity**: Original variable names preserved to prevent confusion
114+
- **Separation**: Each API source produces distinct datasets to avoid mixing issues
94115

95116
### Data Processing
96-
- **Variable Standardization**: English names with documented units
97-
- **Quality Control**: Temperature and precipitation validation
98-
- **Gap Management**: Automatic detection and filling of missing data
99-
- **Municipality Codes**: CUMUN format from AEMET (documented for merge compatibility)
117+
- **No Variable Renaming**: Keeps original AEMET names for traceability
118+
- **Quality Control**: Basic validation without altering source structure
119+
- **Gap Management**: Separate current daily dataset covers historical-to-present gap
120+
- **Municipality Codes**: Preserved as provided by each API (different formats noted)
100121

101122
### Automation
102-
- **Daily Collection**: 2:00 AM via SLURM scheduler
103-
- **Gap Filling**: Weekly on Sundays at 1:00 AM
104-
- **Documentation Updates**: Daily at 6:00 AM
123+
- **Daily Collection**: Runs all 4 dataset collections
124+
- **Validation Collection**: Municipal forecasts accumulated over time for model validation
125+
- **Documentation**: Auto-updated with original variable references
105126

106127
## Getting Started
107128

@@ -115,45 +136,54 @@ AEMET OpenData API
115136
2. Configure API key in `auth/keys.R`
116137
3. Install crontab automation:
117138
```bash
118-
# Add these lines to crontab -e
119-
0 2 * * * cd /path/to/project && sbatch scripts/bash/update_weather_hybrid.sh
120-
0 6 * * * cd /path/to/project && sbatch scripts/bash/update_readme_summary.sh
121-
0 1 * * 0 cd /path/to/project && sbatch scripts/bash/fill_gaps.sh
139+
# Add to crontab for daily collection of 4 separate datasets
140+
0 2 * * * cd /path/to/project && sbatch scripts/bash/update_weather_original_names.sh
122141
```
123142

124143
### Manual Execution
125144
```bash
126-
# Full data collection
127-
sbatch scripts/bash/update_weather_hybrid.sh
128-
129-
# Gap analysis and filling
130-
sbatch scripts/bash/fill_gaps.sh
131-
132-
# Update documentation
133-
sbatch scripts/bash/update_readme_summary.sh
145+
# Full collection (4 datasets with original names)
146+
Rscript scripts/r/collect_four_datasets.R
147+
148+
# Or run individual dataset collections
149+
Rscript scripts/r/aggregate_daily_stations_historical.R
150+
Rscript scripts/r/aggregate_daily_stations_current.R
151+
Rscript scripts/r/aggregate_hourly_station_ongoing.R
152+
Rscript scripts/r/aggregate_daily_municipal_forecast.R
153+
```
154+
Rscript scripts/r/aggregate_hourly_station_ongoing.R
155+
Rscript scripts/r/aggregate_daily_municipal_forecast.R
156+
157+
# Main orchestrator
158+
Rscript scripts/r/collect_all_datasets_original_names.R
134159
```
135160
136161
## File Structure
137162
138163
```
139164
scripts/
140-
├── r/ # R collection and analysis scripts
165+
├── r/ # R collection scripts (4 separate datasets)
141166
├── bash/ # SLURM job scripts
142-
└── archive/ # Archived/unused scripts
167+
└── archive/ # Legacy collection methods
143168

144169
data/
145-
├── output/ # Final standardized datasets
146-
├── backup/ # Data backups and archives
170+
├── output/ # Four main datasets with original variable names:
171+
│ ├── daily_stations_historical.csv.gz
172+
│ ├── daily_stations_current.csv.gz
173+
│ ├── hourly_station_ongoing.csv.gz
174+
│ └── daily_municipal_forecast.csv.gz
147175
└── input/ # Reference data (station lists, etc.)
148176

149-
docs/ # Technical documentation
177+
docs/ # Documentation including variable name reference
150178
auth/ # API credentials (excluded from git)
151179
logs/ # SLURM job outputs
152180
```
153181
154182
## Variable Documentation
155183
156-
All datasets use standardized English variable names. Municipality IDs use CUMUN codes from AEMET. See `docs/variable_standardization.md` for complete mapping from original AEMET variable names.
184+
All datasets preserve **original AEMET variable names** for data integrity. See [docs/variable_names_reference.md](docs/variable_names_reference.md) for complete explanations of what each original variable represents.
185+
186+
**No variable renaming or standardization** - this ensures data traceability and prevents confusion about what each measurement represents.
157187
158188
## Performance Notes
159189
