Description
What I want to do:
I'm trying to download daily weather data for two stations that share the station name Port Hardy A. Their date ranges don't overlap: station 202 runs from 1944 until 2013, while station 51319 picks up in 2013 and continues to today. I would like a single time series that switches from one station to the other where the first leaves off.
Issue?
The download creates a single data frame, but it repeats the full requested date range once per station ID. I do get the real data from each station (which is what I am asking for), but I also get rows of missing data for each station outside its observation period: every requested date appears for both stations, filled with NAs where that station has no record.
I'm not sure whether this behaviour for merging data across stations is intended or not. I could attempt to remove the duplicated dates manually, but I might have to do some quality control on that. Suggestions?
Example:
Here are the stations for Port Hardy. Notice that Port Hardy A has two station IDs with two different ranges that don't overlap.
stations_search("Port Hardy", interval = "day")
Then I download those two stations:
portHardy_pg <- weather_dl(station_ids = c(202, 51319), start = "1975-01-01", end = "2018-12-31", interval = "day", trim = TRUE, format = TRUE)
The problem shows up when we look at the temperature for station 202 at the start and end of the range:
head(portHardy_pg[portHardy_pg$station_id==202,c(1,2,11,22:24)])
# A tibble: 6 x 6
station_name station_id date max_temp max_temp_flag mean_temp
<chr> <dbl> <date> <dbl> <chr> <dbl>
1 PORT HARDY A 202 1975-01-01 3.9 "" 2
2 PORT HARDY A 202 1975-01-02 6.1 "" 3.1
3 PORT HARDY A 202 1975-01-03 3.9 "" 2
4 PORT HARDY A 202 1975-01-04 3.9 "" 2.3
5 PORT HARDY A 202 1975-01-05 5 "" 3.6
6 PORT HARDY A 202 1975-01-06 2.8 "" 0.9
tail(portHardy_pg[portHardy_pg$station_id==202,c(1,2,11,22:24)]) # here we see the duplicated NAs for station 202 at the end of the range
# A tibble: 6 x 6
station_name station_id date max_temp max_temp_flag mean_temp
<chr> <dbl> <date> <dbl> <chr> <dbl>
1 PORT HARDY A 202 2018-12-26 NA "" NA
2 PORT HARDY A 202 2018-12-27 NA "" NA
3 PORT HARDY A 202 2018-12-28 NA "" NA
4 PORT HARDY A 202 2018-12-29 NA "" NA
5 PORT HARDY A 202 2018-12-30 NA "" NA
6 PORT HARDY A 202 2018-12-31 NA "" NA
I get similar issues for station 51319, this time at the start of the range:
head(portHardy_pg[portHardy_pg$station_id==51319,c(1,2,11,22:24)]) # here we see the duplicated NAs for station 51319 at the beginning of the range
# A tibble: 6 x 6
station_name station_id date max_temp max_temp_flag mean_temp
<chr> <dbl> <date> <dbl> <chr> <dbl>
1 PORT HARDY A 51319 1975-01-01 NA "" NA
2 PORT HARDY A 51319 1975-01-02 NA "" NA
3 PORT HARDY A 51319 1975-01-03 NA "" NA
4 PORT HARDY A 51319 1975-01-04 NA "" NA
5 PORT HARDY A 51319 1975-01-05 NA "" NA
6 PORT HARDY A 51319 1975-01-06 NA "" NA
tail(portHardy_pg[portHardy_pg$station_id==51319,c(1,2,11,22:24)])
# A tibble: 6 x 6
station_name station_id date max_temp max_temp_flag mean_temp
<chr> <dbl> <date> <dbl> <chr> <dbl>
1 PORT HARDY A 51319 2018-12-26 5.5 "" 2.8
2 PORT HARDY A 51319 2018-12-27 5.1 "" 2.3
3 PORT HARDY A 51319 2018-12-28 4.4 "" 4
4 PORT HARDY A 51319 2018-12-29 10.8 "" 7.4
5 PORT HARDY A 51319 2018-12-30 7 "" 3.2
6 PORT HARDY A 51319 2018-12-31 5.2 "" 2.1
Interestingly, if I download only one station but specify a "bad range" (one that extends past the station's observation period), the download trims itself to the observation period.
For example:
new_dl <- weather_dl(station_ids = 202, start = "1975-01-01", end = "2018-12-31", interval = "day", trim = TRUE, format = TRUE)
tail(new_dl[new_dl$station_id==202,c(1,2,11,22:24)])
# A tibble: 6 x 6
station_name station_id date max_temp max_temp_flag mean_temp
<chr> <dbl> <date> <dbl> <chr> <dbl>
1 PORT HARDY A 202 2013-06-07 16.4 "" 13.1
2 PORT HARDY A 202 2013-06-08 13.1 "" 11.4
3 PORT HARDY A 202 2013-06-09 13.8 "" 10.1
4 PORT HARDY A 202 2013-06-10 15.1 "" 10.5
5 PORT HARDY A 202 2013-06-11 14.8 "" 12.3
6 PORT HARDY A 202 2013-06-12 15.5 "" 12.5
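For reference, the manual clean-up I have in mind would look roughly like the sketch below. It runs on a small toy data frame rather than the real download, so nothing here depends on weathercan. The station 202 values are copied from the output above; the station 51319 values, the dates assigned to them, and the names toy, measure_cols, combined and overlap_dates are made up for illustration. It also assumes that a row with NA in every measurement column lies outside that station's observation period, which may not hold for genuinely missing days.

```r
# Toy stand-in for the multi-station download: both stations carry the same
# requested dates, with NAs outside each station's own observation period.
dates <- as.Date("2013-06-10") + 0:5
toy <- data.frame(
  station_id = rep(c(202, 51319), each = length(dates)),
  date       = rep(dates, times = 2),
  max_temp   = c(15.1, 14.8, 15.5, NA, NA, NA,    # station 202 ends here
                 NA, NA, NA, 16.0, 17.2, 15.9),   # made-up values for 51319
  mean_temp  = c(10.5, 12.3, 12.5, NA, NA, NA,
                 NA, NA, NA, 12.0, 13.1, 11.8)
)

# Columns treated as measurements (assumption: extend to the real column set).
measure_cols <- c("max_temp", "mean_temp")

# Keep only rows where at least one measurement is present.
has_data <- rowSums(!is.na(toy[, measure_cols, drop = FALSE])) > 0
combined <- toy[has_data, ]
combined <- combined[order(combined$date), ]

# Quality control: dates that are still present for more than one station.
overlap_dates <- combined$date[duplicated(combined$date)]
```

The drawback is that this also drops days inside a station's range where every measurement happens to be missing, so the combined series could end up with gaps in its dates. Filtering each station by its first and last non-NA date instead would avoid that, at the cost of a little more code.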
My Environment
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)
Matrix products: default
locale:
[1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252 LC_MONETARY=English_Canada.1252
[4] LC_NUMERIC=C LC_TIME=English_Canada.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] weathercan_0.3.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.3 rstudioapi_0.10 magrittr_1.5 tidyselect_0.2.5 R6_2.4.1 rlang_0.4.1
[7] fansi_0.4.0 stringr_1.4.0 httr_1.4.1 dplyr_0.8.3 tools_3.6.1 packrat_0.5.0
[13] utf8_1.1.4 cli_1.1.0 ellipsis_0.3.0 assertthat_0.2.1 lifecycle_0.1.0 tibble_2.1.3
[19] crayon_1.3.4 tidyr_1.0.0 purrr_0.3.3 vctrs_0.2.0 curl_4.2 zeallot_0.1.0
[25] glue_1.3.1 stringi_1.4.3 compiler_3.6.1 pillar_1.4.2 backports_1.1.5 lubridate_1.7.4
[31] pkgconfig_2.0.3