
Conversation

@ldecicco-USGS
Collaborator

No description provided.

ldecicco-USGS and others added 30 commits August 8, 2025 09:28
Merge branch 'main' of github.com:DOI-USGS/dataRetrieval into develop

# Conflicts:
#	tutorials/images/publish.png
Merge branch 'main' of github.com:DOI-USGS/dataRetrieval

# Conflicts:
#	tutorials/basic_slides_deck.qmd
#	tutorials/changes_slides_deck.qmd
Update dependency requirement: `curl` >=7.0.0
@ldecicco-USGS
Collaborator Author

@ehinman or @jzemmels - if either of you could take a quick look - this PR adds the latest-daily service. Additionally, a roxygen update changes all the Rd files, but the differences are pretty subtle. A few other minor updates in here actually came from pre-shutdown.

@ehinman
Collaborator

ehinman commented Nov 18, 2025

I can take a look this afternoon into tomorrow.

bbox = NA,
limit = NA,
max_results = NA,
convertType = TRUE){
Collaborator

Is there a reason for the order of the input parameters in the documentation and the function itself? The orders are different.

Collaborator Author

The order in the documentation doesn't matter. The order in the function does matter: a user doesn't necessarily need to name each argument, but if they don't, they need to supply the arguments in the correct order. I'll double-check that the order matches the other functions, because what we don't want to do is change it later.
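In R, unnamed arguments are matched by position against the order in the function definition, while named arguments can appear in any order. A minimal sketch with a hypothetical function (not from dataRetrieval) mirroring the signature style above:

```r
# Hypothetical function, for illustration only.
f <- function(bbox = NA, limit = NA, max_results = NA) {
  list(bbox = bbox, limit = limit, max_results = max_results)
}

f(NA, 10)$limit                       # positional: the second argument is limit
f(max_results = 5, limit = 10)$limit  # named: order is irrelevant
```

This is why reordering arguments later can silently break callers that rely on position.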

- return_list <- data.frame(matrix(nrow = 0, ncol = length(properties)))
+ return_list <- data.frame(matrix(nrow = 0,
+                                  ncol = length(properties)))
return_list <- lapply(return_list, as.character)
Collaborator

Is this needed? Ensures that each empty column's format is character?

Collaborator Author

yeah, otherwise it comes back as logical.
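This matches base R behavior: a data frame built from an all-NA zero-row matrix defaults to logical columns, so the `lapply(..., as.character)` step forces them to character. A quick illustration:

```r
# A zero-row data frame from an empty matrix: columns default to logical.
df <- data.frame(matrix(nrow = 0, ncol = 3))
sapply(df, class)   # every column starts out "logical"

# Coerce each (empty) column to character, as in the function above.
# Note lapply() returns a plain list, not a data frame.
df <- lapply(df, as.character)
sapply(df, class)   # now every element is "character"
```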

# no service specified:
availableData <- read_waterdata_ts_meta(monitoring_location_id = "USGS-05114000")
- expect_equal(ncol(availableData), 18)
+ expect_gte(ncol(availableData), 18)
Collaborator

Well that's snazzy.


- That's a LOT of columns that come back. We won't look at them here, but let's jump over to RStudio to look through the results.
+ That's a LOT of columns that come back. We won't look at them here, but you can use `View` in RStudio to explore on your own.
Collaborator

Suggested change:
- That's a LOT of columns that come back. We won't look at them here, but you can use `View` in RStudio to explore on your own.
+ That's a LOT of columns that came back. We won't look at them here, but you can use `View` in RStudio to explore on your own.

Collaborator

Hmmm, now re-reading, I'm not sure which would be better. Could say "That's a LOT of columns returned."

pal <- colorNumeric("viridis", latest_dane_county_daily$value)
leaflet(data = latest_dane_county_daily |>
sf::st_transform(crs = leaflet_crs)) |>
Collaborator

Suggest individually defining `leaflet_crs` in case someone skips to a particular section and doesn't know where it was originally defined.

Collaborator

@ehinman left a comment

First pass looks good, just want to test out the new function briefly tomorrow.

#' @param monitoring_location_id `r get_params("latest-daily")$monitoring_location_id`
#' @param parameter_code `r get_params("latest-daily")$parameter_code`
#' @param statistic_id `r get_params("latest-daily")$statistic_id`
#' @param time `r get_params("latest-daily")$time`
Collaborator

time seems like a weird input here. I tried the following:

test = read_waterdata_latest_daily(monitoring_location_id = "USGS-12456500", time = "2018-02-12T23:20:50Z")

And it returns zero results. Is there ever a time this input would be helpful? Perhaps I used it incorrectly.

Collaborator Author

I think if you did an exact time, probably not. But you could have a scenario where you have a list of sites or a bounding box and say "give me the latest results for the last 30 days", so you'd only get back sites that have new data (instead of sites whose latest data point was 30 years ago or something).
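Assuming the service accepts ISO-8601 open-ended intervals of the form `start/..` (the notation used elsewhere in this thread), a "last 30 days" filter string could be built like this sketch:

```r
# Sketch only: build an open-ended ISO-8601 interval covering the last 30 days.
# The "start/.." interval form is assumed from the example later in this thread.
start <- format(Sys.Date() - 30, "%Y-%m-%dT00:00:00Z")
time_filter <- paste0(start, "/..")
time_filter  # something like "2025-10-25T00:00:00Z/.."
```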

#' @param time `r get_params("latest-daily")$time`
#' @param value `r get_params("latest-daily")$value`
#' @param unit_of_measure `r get_params("latest-daily")$unit_of_measure`
#' @param approval_status `r get_params("latest-daily")$approval_status`
Collaborator

Suggest adding somewhere (whether in function documentation or in examples) that the "latest" will often be "Provisional" unless the gage is currently non-functional and the latest is from a while ago.

Collaborator Author

I would prefer we petition the WaterData API folks to do that, so our documentation doesn't start to deviate.

#'
#' \donttest{
#' site <- "USGS-02238500"
#' pcode <- "00060"
Collaborator

You set the pcode here but do not use it below in the examples.

#' @param skipGeometry This option can be used to skip response geometries for
#' each feature. The returning object will be a data frame with no spatial
#' information.
#' @param convertType logical, defaults to `TRUE`. If `TRUE`, the function
Collaborator

The `last_modified` column is POSIXct regardless of whether `convertType` is TRUE or FALSE.

Collaborator

@ehinman left a comment

This looks good, Laura. Thanks for the opportunity to review. I went through the diffs, tested the latest daily function, and took a look at the built documentation page for the new function.

Left some minor comments peppered throughout. My only additional thought is about the time input. Your response makes total sense for filtering to sites that have a latest daily measurement within a certain time range. One thing I noticed when looking at this example:
test = read_waterdata_latest_daily(bbox = c(-112.249246,40.516572,-111.293435,41.247515), parameter_code = '00060', convertType = TRUE, time = "2025-11-18T00:00:00Z/..")

As expected, it returns all latest daily measurements taken yesterday, but the last_modified column is from ~6:30AM UTC (midnight-ish central time, I think?) on 11-19-2025. Presumably, this is when the "daily" calculations are made for this group of gages. I'm wondering if somewhere we should alert people that (depending on their time zone) they shouldn't expect the latest daily from the previous day to be available at midnight, but maybe one or two hours after midnight: for automated pulls and the like.

Collaborator

@jzemmels left a comment

Updates look good. Tested out the new function a few ways and didn't run into issues, other than a minor comment. Approved.

Also tested out building the documentation/pkgdown sites. Some findings, which you're free to ignore:

  • pkgdown.yml needs to have read_waterdata_latest_daily added to the reference section.
  • These lines threw an error when building vignettes. The fix was to remove the list() wrapping around function arguments.
  • Suggest adding the zoo and maps packages to the DESCRIPTION Suggests section, as they're required for running the vignettes.

Collaborator

I'm trying to remember: was there a reason why we didn't want to add parameter-codes as a possible endpoint for this function? I tested out a few CQL statements and it seemed to work.

Collaborator Author

The zoo and maps packages are only in vignettes that we added to .Rbuildignore. We want the CRAN version of dataRetrieval to be nice and streamlined, with as few imports and suggested packages as possible. So that is why they are not on the Suggests list.

And whoo-boy...that movingAverage vignette is old-school! If you can't tell from the graphs, it is the precursor to HASP. I feel like we should delete all the text and say: if you are interested in doing those calculations, see
https://doi-usgs.github.io/HASP/
(and I am very, very close to pushing a big update to HASP that will use all the waterdata functions).

Collaborator Author

The read_waterdata function was first developed before the parameter-code was an endpoint. I'll update it - it also needs field measurements and now latest-daily.

time = NA_character_,
bbox = NA,
limit = NA,
max_results = NA,
Collaborator

Perhaps I'm still misunderstanding the usage of these two arguments, but max_results doesn't seem to change the size of the output. Examples:

read_waterdata_latest_daily(monitoring_location_id = "USGS-01491000")
read_waterdata_latest_daily(monitoring_location_id = "USGS-01491000", max_results = 10)
read_waterdata_latest_daily(monitoring_location_id = "USGS-01491000", limit = 10)
read_waterdata_latest_daily(monitoring_location_id = "USGS-01491000", max_results = 5, limit = 10)

The same happens for other read_waterdata_ functions.

Collaborator Author

Fixed! The args got moved around and max_results became meaningless. Should be good to go now.

return_list$value <- as.numeric()
}

if(convertType && "contributing_drainage_area" %in% names(return_list)){
Collaborator

Just ran into a new, rather specific error.

Reprex:

dplyr::bind_rows(
    read_waterdata_monitoring_location(monitoring_location_id = "USGS-092403167282001"),
    read_waterdata_monitoring_location(monitoring_location_id = "USGS-01435000")
)

class(read_waterdata_monitoring_location(monitoring_location_id = "USGS-092403167282001")$drainage_area)
class(read_waterdata_monitoring_location(monitoring_location_id = "USGS-01435000")$drainage_area)

Only dplyr::bind_rows complains; rbind just casts the numeric drainage area to character.

I may be wrong, but I think adding an if statement for "drainage_area" here should fix it.
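The difference in strictness is easy to reproduce with toy data frames (not the actual service output): `dplyr::bind_rows()` refuses to combine a character column with a numeric one, while base `rbind()` silently coerces everything to character.

```r
# Toy stand-ins for the two sites' results: same column name, different type.
a <- data.frame(drainage_area = "120")   # character, as one site returns it
b <- data.frame(drainage_area = 120.5)   # numeric, as the other returns it

# dplyr::bind_rows(a, b)  # errors: can't combine <character> and <double>

combined <- rbind(a, b)        # base R coerces the numeric column silently
class(combined$drainage_area)  # "character"
```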

Collaborator Author

I didn't realize there were both contributing_drainage_area and drainage_area. We have contributing_drainage_area in there; I just need to add drainage_area to the list in cleanup_cols.

convertType = TRUE){

service <- "latest-daily"
output_id <- "latest_daily_id"
Collaborator

Random thought: should this just be "daily_id"?

Collaborator Author

@ldecicco-USGS Nov 24, 2025

yup! nope

@ldecicco-USGS
Collaborator Author

> This looks good, Laura. [...] I'm wondering if somewhere we should alert people that (depending on their time zone) they shouldn't expect the latest daily from the previous day to be available at midnight, but maybe one or two hours after midnight: for automated pulls and the like.

I'm going to keep this in mind for the continuous endpoint (coming next!), where we have to deal a lot more with timezones.

@ldecicco-USGS ldecicco-USGS merged commit 0d998c4 into DOI-USGS:develop Nov 24, 2025
1 check passed