It seems like our crawls have been consistently timing out on https://www.fema.gov/sites/default/files/documents/fema_national-resilience-guidance_august2024.pdf for the last few weeks. IA also seems to be failing to capture it: https://web.archive.org/web/20251001001420/https://www.fema.gov/sites/default/files/documents/fema_national-resilience-guidance_august2024.pdf
So I suspect we are getting blocked. I need to investigate whether we can change any settings (maybe the user agent string) to work around it.
- Try just hitting this URL, or just this and a handful of others. If we only make a few requests, do we still get blocked?
- Try with different user agent strings.
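The two checks above could be run by hand with something like the following. This is a rough sketch using only the Python standard library, not our actual crawler code. The user agent strings and the `build_request`, `probe`, and `run_probes` helpers are placeholders made up for illustration, so swap in whatever the crawler really sends.

```python
import http.client
import time
import urllib.error
import urllib.request

URL = (
    "https://www.fema.gov/sites/default/files/documents/"
    "fema_national-resilience-guidance_august2024.pdf"
)

# Hypothetical user agent strings to compare; replace with the crawler's real one.
USER_AGENTS = {
    "urllib-default": None,  # None leaves urllib's default User-Agent in place
    "browser-like": "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
    "crawler-placeholder": "example-crawler/0.0 (+https://example.org/contact)",
}


def build_request(url, user_agent=None):
    """Build a GET request, overriding the User-Agent header only if one is given."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return urllib.request.Request(url, headers=headers, method="GET")


def probe(url, user_agent=None, timeout=30):
    """Fetch the URL once; return (outcome string, elapsed seconds)."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as response:
            response.read(1024)  # read a little so a stalled body also shows up
            outcome = f"HTTP {response.status}"
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status (e.g. 403 or 429).
        outcome = f"HTTP {error.code}"
    except (OSError, http.client.HTTPException) as error:
        # Timeouts, refused or reset connections, DNS failures, malformed responses.
        outcome = f"failed: {error!r}"
    return outcome, time.monotonic() - started


def run_probes(url, user_agents, delay=5.0, timeout=30):
    """Probe the URL once per user agent, pausing between requests."""
    results = {}
    for label, user_agent in user_agents.items():
        results[label] = probe(url, user_agent, timeout=timeout)
        outcome, elapsed = results[label]
        print(f"{label:20s} {outcome} ({elapsed:.1f}s)")
        time.sleep(delay)
    return results
```

Calling `run_probes(URL, USER_AGENTS)` from a machine outside the crawler's network, and then again from the crawler's own host, should help separate "blocked by user agent" from "blocked by IP or request volume". If every variant times out even with only a handful of requests, the user agent is probably not the deciding factor.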
Worth noting: we monitor a bunch of URLs at https://www.fema.gov/* but this is the only one at https://www.fema.gov/sites/*, which looks like where static files are stored. Those files are probably served directly by a proxy or lower-level server without going through the application that serves HTML pages, so it may be that this URL is getting handled differently from the other FEMA URLs we monitor.