Commit 9aa789b

Adds system for generating and updating data collection status on README.md
1 parent c6de5c9 commit 9aa789b

4 files changed

+16762
-62
lines changed

README.md

Lines changed: 222 additions & 62 deletions
@@ -2,95 +2,255 @@
22

33
This repository provides scripts to download, update, and manage weather data from AEMET weather stations across Spain, producing three comprehensive datasets for analysis and research.
44

5+
6+
## 📊 Current Data Collection Status
7+
8+
*Last updated: 2025-08-25 21:23:49*
9+
10+
### Dataset 1: Daily Station Data
11+
- **Records**: 2,250 station-days
12+
- **Stations**: 838 weather stations
13+
- **Coverage**: 2025-08-17 to 2025-08-25
14+
- **Data Quality**: Coverage analysis pending
15+
- **Latest File**: `daily_station_aggregated_2025-08-25.csv.gz` (0 MB)
16+
17+
### Dataset 2: Municipal Daily Data
18+
- **Records**: 2,001 municipality-days
19+
- **Municipalities**: 724 municipalities
20+
- **Historical Data**: 1,910 records
21+
- **Forecast Data**: 91 records (7-day coverage)
22+
- **Coverage**: 2025-08-17 to 2025-08-31
23+
- **Data Quality**: Coverage analysis pending
24+
- **Latest File**: `municipal_aggregated_2025-08-25.csv` (0.3 MB)
25+
26+
### Dataset 3: Hourly Station Data
27+
- **Records**: 180,393 hourly observations
28+
- **Stations**: ~752 stations (sample estimate)
29+
- **Variables**: 7 meteorological measures
30+
- **Coverage**: 2025-08-25 to 2025-08-25
31+
- **Recent Activity**: Analysis of observations from the last 30 days is pending
32+
- **Archive Size**: 0.8 MB compressed
33+
34+
### 🔄 Collection System Status
35+
- **Collection Method**: Hybrid system using `climaemet` package + custom API calls
36+
- **Performance**: ~5.4x faster than previous approach
37+
- **Schedule**: Daily collection at 2 AM via crontab
38+
- **Last Gap Analysis**: Not available
39+
40+
### 📈 Data Growth Tracking
41+
| Dataset | Current Size | Growth Rate | Last Updated |
42+
|---------|-------------|-------------|--------------|
43+
| Station Daily | 0 MB | ~281 records/day | 2025-08-25 |
44+
| Municipal Data | 0.3 MB | ~250 records/day | 2025-08-25 |
45+
| Hourly Archive | 0.8 MB | TBD | 2025-08-25 |
46+
47+
---
48+
49+
550
## Three Output Datasets
651

7-
### 📊 Dataset 1: Daily Station Data (`daily_station_historical.csv.gz`)
8-
Daily aggregated weather data by station, combining:
9-
- Historical data from AEMET historical endpoint (2013+)
10-
- Recent data from current observations (last 4 days)
11-
- Variables: daily min/max/mean temperature, precipitation totals, etc.
52+
### Dataset 1: Daily Station Data
53+
**File**: `daily_station_aggregated_YYYY-MM-DD.csv.gz`
54+
55+
Daily aggregated weather data by station:
56+
- Data sources: AEMET daily climatological endpoint + hourly observations aggregated to daily
57+
- Variables: daily min/max/mean temperature, precipitation, wind, humidity, pressure
58+
- Coverage: Active weather stations across Spain
59+
- Quality control: Temperature range validation, realistic value bounds
60+
61+
### Dataset 2: Municipal Daily Data
62+
**File**: `municipal_aggregated_YYYY-MM-DD.csv.gz`
63+
64+
Daily weather data by municipality:
65+
- Data sources: Station data aggregated by municipality + 7-day municipal forecasts
66+
- Coverage: 8,129 Spanish municipalities
67+
- Temporal range: Historical station aggregates through 7-day forecasts
68+
- Source tracking: Distinguishes between station-derived data and forecast data
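
One way to use the source column is to resolve cases where the same municipality-day appears twice. The sketch below keeps the station-derived row when both exist. The function name `prefer_observed`, the column layout, and the rule that station data wins over forecasts are all assumptions for illustration; the actual priority order used by the aggregation scripts is not stated here.

```bash
# Hypothetical columns: municipality_id,date,tmean,source
# where source is "station" or "forecast" (illustrative labels, header in row 1).
prefer_observed() {
  awk -F',' '
    NR == 1 { print; next }                        # pass the header through
    {
      key = $1 FS $2                               # one key per municipality-day
      if (!(key in best)) order[++n] = key         # remember first-seen order
      if (!(key in best) || ($4 == "station" && src[key] != "station")) {
        best[key] = $0; src[key] = $4              # station rows replace forecast rows
      }
    }
    END { for (i = 1; i <= n; i++) print best[order[i]] }
  '
}
```

Used as a filter, for example `gzip -dc some_file.csv.gz | prefer_observed`, it emits one row per municipality-day in first-seen order.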
69+
70+
### Dataset 3: Hourly Station Data
71+
**File**: `hourly_station_ongoing.csv.gz`
1272

13-
### 🏘️ Dataset 2: Municipal Daily Data (`municipal_daily_combined.csv.gz`)
14-
Daily weather data by municipality (8,129 Spanish municipalities), combining:
15-
- Station data aggregated to municipal level (historical + recent)
16-
- 7-day municipal forecasts from AEMET
17-
- Complete temporal coverage: historical through 7-day forecast
73+
Hourly observations from AEMET stations:
74+
- Data format: Long format (measure/value pairs) for 7 core variables
75+
- Update frequency: Daily collection with continuous archiving
76+
- Purpose: Building comprehensive historical hourly archive
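
In long format, each row holds one measurement: a station, a timestamp, the name of the measure, and its value. A minimal sketch of pulling a single variable out of such a file follows. The column order and the measure codes (`ta`, `hr`) are assumptions for the example and may not match the real file.

```bash
# Hypothetical long-format columns: station_id,timestamp,measure,value
# Keeps the header plus every row whose measure column equals the requested name.
extract_measure() {
  local measure="$1"
  awk -F',' -v m="$measure" 'NR == 1 || $3 == m'
}
```

For example, `gzip -dc hourly_station_ongoing.csv.gz | extract_measure ta` would print only the rows for the measure named `ta`, if the file uses that layout.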
1877

19-
### ⏰ Dataset 3: Hourly Station Data (`hourly_station_observations.csv.gz`)
20-
Hourly observations from all AEMET stations:
21-
- Real-time collection building our own historical archive
22-
- Expanded variable set (7 safe variables)
23-
- Updated every 2 hours
78+
## Data Collection System
2479

25-
## Data Collection Workflow
80+
### Collection Methods
81+
- **Station Daily Data**: Custom API calls to AEMET climatological endpoints
82+
- **Municipal Forecasts**: `climaemet` R package for robust API interaction
83+
- **Hourly Data**: Direct API calls to AEMET observational endpoints
84+
85+
### Automation
86+
- **Schedule**: Daily collection submitted to the SLURM batch system from crontab
87+
- **Gap Detection**: Automated identification of missing data
88+
- **Gap Filling**: Weekly targeted collection for missing records
89+
- **Quality Control**: Automated validation of temperature ranges and data consistency
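
Gap detection can be thought of as a set difference between the station-days that were expected and those actually collected. The sketch below illustrates the idea with plain text files. The function name `missing_keys` and the `station_id,date` key format are assumptions; this is not the logic of `check_data_gaps.R`.

```bash
# Print keys listed in the expected file that are absent from the collected file.
# Both files hold one "station_id,date" key per line (hypothetical format).
missing_keys() {
  local expected="$1" collected="$2"
  # -F: fixed strings, -x: whole-line match, -v: invert, -f: read patterns from a file
  grep -vxFf "$collected" "$expected"
}
```

Note that `grep` exits with status 1 when it prints nothing, so a non-zero status here means no gaps were found rather than an error.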
90+
91+
### Data Processing Pipeline
2692

2793
```mermaid
2894
flowchart TD
29-
A[AEMET API] --> B[Historical Endpoint]
30-
A --> C[Current Observations]
95+
A[AEMET API] --> B[Station Daily Endpoint]
96+
A --> C[Hourly Observations]
3197
A --> D[Municipal Forecasts]
3298
33-
B --> E[get_historical_data.R]
99+
B --> E[get_station_daily_hybrid.R]
34100
C --> F[get_latest_data.R]
35-
D --> G[get_forecast_data.R]
101+
D --> G[get_forecast_data_hybrid.R]
36102
37-
E --> H[daily_station_historical.csv.gz]
38-
F --> I[hourly_station_ongoing.csv.gz]
39-
G --> J[Municipal Forecast Data]
103+
E --> H[Station Daily Data]
104+
F --> I[Hourly Archive]
105+
G --> J[Municipal Forecasts]
40106
41-
H --> K[aggregate_municipal_daily.R]
42-
J --> K
43-
K --> L[daily_municipal_extended.csv.gz]
107+
H --> K[aggregate_daily_station_data_hybrid.R]
108+
I --> K
109+
K --> L[Dataset 1: Daily Station Aggregated]
44110
45-
H --> M[Dataset 1: Daily Station]
46-
L --> N[Dataset 2: Municipal Extended]
47-
I --> O[Dataset 3: Hourly Station]
111+
L --> M[aggregate_municipal_data_hybrid.R]
112+
J --> M
113+
M --> N[Dataset 2: Municipal Aggregated]
48114
49-
M --> P[Zenodo Publication]
50-
N --> P
51-
O --> P
115+
I --> O[Dataset 3: Hourly Archive]
52116
```
53117

54-
## Dataset Temporal Coverage
118+
## File Structure
55119

56-
```mermaid
57-
gantt
58-
title Weather Data Temporal Coverage
59-
dateFormat YYYY-MM-DD
60-
section Dataset 1 - Daily Station
61-
Historical Records :done, hist1, 2013-01-01, 2025-08-17
62-
Recent Observations :active, recent1, 2025-08-17, 2025-08-21
63-
64-
section Dataset 2 - Municipal Extended
65-
Historical Period :done, hist2, 2013-01-01, 2025-08-17
66-
Recent Period :active, recent2, 2025-08-17, 2025-08-21
67-
Forecast Period :forecast, 2025-08-21, 2025-08-28
68-
69-
section Dataset 3 - Hourly Station
70-
Accumulating Archive :active, archive, 2025-08-01, 2025-08-21
120+
### Core Collection Scripts
121+
```
122+
code/
123+
├── get_station_daily_hybrid.R # Station daily data collection
124+
├── get_forecast_data_hybrid.R # Municipal forecast collection
125+
├── get_latest_data.R # Hourly data collection
126+
├── collect_all_datasets_hybrid.R # Coordinated collection of all datasets
127+
├── aggregate_daily_station_data_hybrid.R # Station data aggregation
128+
└── aggregate_municipal_data_hybrid.R # Municipal data aggregation
129+
```
130+
131+
### Data Quality & Monitoring
132+
```
133+
code/
134+
├── check_data_gaps.R # Gap detection and analysis
135+
├── fill_data_gaps.R # Targeted gap filling
136+
├── generate_data_summary.R # Dataset statistics generation
137+
└── update_readme_with_summary.R # Automated documentation updates
138+
```
139+
140+
### SLURM Integration
141+
```
142+
├── update_weather_hybrid.sh # Main collection job
143+
└── CRONTAB_LINES_TO_ADD.txt # Scheduling configuration
144+
```
145+
146+
## Setup and Usage
147+
148+
### Prerequisites
149+
```bash
150+
# Load required modules (HPC environment)
151+
module load GDAL/3.10.0-foss-2024a
152+
module load R/4.4.2-gfbf-2024a
153+
154+
# Install required R packages
155+
Rscript -e "install.packages(c('climaemet', 'tidyverse', 'data.table', 'lubridate'))"
156+
```
157+
158+
### Manual Collection
159+
```bash
160+
# Collect all three datasets
161+
sbatch update_weather_hybrid.sh
162+
163+
# Or run individual components
164+
Rscript code/get_forecast_data_hybrid.R # Municipal forecasts
165+
Rscript code/get_station_daily_hybrid.R # Station daily data
166+
Rscript code/get_latest_data.R # Hourly data
71167
```
72168

73-
## Monitoring & Dashboard Integration
169+
### Automated Schedule
170+
Add these lines to your crontab:
171+
```bash
172+
# Daily collection (2 AM)
173+
0 2 * * * cd /path/to/weather-data-collector-spain && sbatch update_weather_hybrid.sh
174+
175+
# Daily status update (6 AM)
176+
0 6 * * * cd /path/to/weather-data-collector-spain && Rscript code/update_readme_with_summary.R
177+
178+
# Weekly gap filling (Sunday 1 AM)
179+
0 1 * * 0 cd /path/to/weather-data-collector-spain && Rscript code/fill_data_gaps.R
180+
```
181+
182+
## Data Quality
183+
184+
### Coverage and Success Rates
185+
- **Station Daily**: Typical success rates of 30-50% per collection run (normal for the AEMET API)
186+
- **Municipal Forecasts**: 65-95% success rates with automatic retry logic
187+
- **Hourly Data**: >99% success rate for active stations
74188

75-
### 🖥️ **Real-time Monitoring**
76-
This project integrates with the [mosquito-alert-model-monitor](https://github.com/Mosquito-Alert/mosquito-alert-model-monitor) dashboard for real-time job monitoring.
189+
### Quality Control Measures
190+
- **Temperature validation**: Range checks (min ≤ mean ≤ max, realistic bounds)
191+
- **Duplicate handling**: Automatic deduplication with source prioritization
192+
- **Gap tracking**: Systematic identification and filling of missing data
193+
- **Source attribution**: Clear distinction between observed vs forecast data
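
As a rough illustration of the temperature range check, the sketch below filters a CSV stream with `awk`. The function name `validate_temperatures`, the column layout, and the bounds of -40 and 55 degrees C are assumptions made for the example, not the schema or thresholds used by the collection scripts.

```bash
# Hypothetical column layout: station_id,date,tmin,tmean,tmax (header in row 1).
# The bounds of -40 and 55 degrees C are illustrative only.
validate_temperatures() {
  awk -F',' '
    NR == 1 { print; next }                      # pass the header through
    $3 == "" || $4 == "" || $5 == "" { next }    # drop rows with a missing temperature
    ($3 + 0) <= ($4 + 0) && ($4 + 0) <= ($5 + 0) &&
    ($3 + 0) >= -40 && ($5 + 0) <= 55 { print }  # keep rows passing both checks
  '
}
```

Rows that fail either check are dropped silently in this sketch; a real pipeline might instead flag them or write them to a separate file for review.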
77194

78-
**Monitored Jobs:**
79-
- `weather-forecast`: Municipal forecasts (every 6 hours) - **CRITICAL PRIORITY**
80-
- `weather-hourly`: Station observations (every 2 hours) - **MEDIUM PRIORITY**
81-
- `weather-historical`: Historical data updates (daily) - **LOW PRIORITY**
82-
- `municipal-forecast-priority`: Immediate municipal data - **CRITICAL PRIORITY**
195+
## Data Access
83196

84-
**Setup Dashboard Monitoring:**
197+
### Output Location
198+
All datasets are saved in `data/output/` with date-stamped filenames:
199+
- `daily_station_aggregated_YYYY-MM-DD.csv[.gz]`
200+
- `municipal_aggregated_YYYY-MM-DD.csv[.gz]`
201+
- `hourly_station_ongoing.csv.gz`
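
Because the filenames carry an ISO date stamp, the newest file for a dataset can be found by sorting the names as text. A small sketch, assuming the naming pattern listed above; the helper name `latest_output` is hypothetical.

```bash
# Print the most recent date-stamped file for a dataset prefix, or nothing if none exist.
# ISO dates (YYYY-MM-DD) sort chronologically when sorted as plain text.
latest_output() {
  local dir="$1" prefix="$2" f
  for f in "$dir"/"$prefix"_????-??-??.csv*; do
    # An unmatched glob stays literal, so check that the file really exists
    if [ -e "$f" ]; then printf '%s\n' "$f"; fi
  done | LC_ALL=C sort | tail -n 1
}
```

For example, `latest_output data/output daily_station_aggregated` prints the path of the newest daily station file, if any are present.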
202+
203+
### File Formats
204+
- **CSV format**: Compatible with R, Python, and standard analysis tools
205+
- **Compressed versions**: `.gz` files for efficient storage
206+
- **Consistent schemas**: Standardized column names across collection runs
207+
208+
## Monitoring and Maintenance
209+
210+
### Gap Analysis
211+
```bash
212+
# Check for missing data
213+
Rscript code/check_data_gaps.R
214+
215+
# Fill identified gaps
216+
Rscript code/fill_data_gaps.R
217+
```
218+
219+
### Data Statistics
85220
```bash
86-
# Test integration
87-
./scripts/test_dashboard_integration.sh
221+
# Generate current dataset summary
222+
Rscript code/generate_data_summary.R
88223

89-
# Check dashboard at: ~/research/mosquito-alert-model-monitor/docs/index.html
224+
# Update README with latest statistics
225+
Rscript code/update_readme_with_summary.R
90226
```
91227

92-
**Status Reporting:**
93-
All SLURM scripts automatically report job status, progress, and resource usage to the monitoring dashboard.
228+
### Error Handling
229+
- **API Rate Limits**: Automatic detection and waiting
230+
- **SSL Connection Issues**: Built-in retry logic in `climaemet` package
231+
- **Server Errors**: Exponential backoff for temporary failures
232+
- **Missing Data**: Systematic gap detection and targeted re-collection
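
Exponential backoff means doubling the wait after each failed attempt. The shell sketch below shows the pattern wrapped around an arbitrary command. The function name `retry_with_backoff` and its arguments are hypothetical, and the collection scripts may implement their retries differently inside R.

```bash
# Retry a command with exponential backoff.
# Usage: retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS command [args...]
retry_with_backoff() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))       # double the wait after each failure
    attempt=$((attempt + 1))
  done
}
```

For instance, `retry_with_backoff 5 10 Rscript code/get_forecast_data_hybrid.R` would try the script up to five times, waiting 10, 20, 40, and 80 seconds between attempts.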
233+
234+
## Technical Details
235+
236+
### API Integration
237+
- **AEMET OpenData API**: Primary data source; requires a valid API key
238+
- **Rate Limiting**: Respectful API usage with automatic throttling
239+
- **Error Recovery**: Robust handling of temporary failures and connection issues
240+
241+
### Dependencies
242+
- **R Packages**: `climaemet`, `tidyverse`, `data.table`, `lubridate`, `httr`, `jsonlite`
243+
- **System Requirements**: GDAL/3.10.0, R/4.4.2
244+
- **Environment**: HPC cluster with SLURM job scheduler
245+
246+
### Performance
247+
- **Collection Time**: Typically 2-4 hours for complete daily collection
248+
- **Resource Usage**: 8 GB RAM and a single CPU core are sufficient
249+
- **Storage Growth**: Approximately 50-100 MB per day across all datasets
250+
251+
---
252+
253+
*This system provides reliable, automated collection of comprehensive weather data for Spain with built-in quality control and gap management.*
94254

95255
## Features
96256
- **Real-time Observations**: Fetches current hourly weather from all AEMET stations
