Investigate crawl failures for FEMA's National Resilience Guidance

It seems like our crawls have been consistently timing out on https://www.fema.gov/sites/default/files/documents/fema_national-resilience-guidance_august2024.pdf for the last few weeks. IA also seems to be failing to capture it: https://web.archive.org/web/20251001001420/https://www.fema.gov/sites/default/files/documents/fema_national-resilience-guidance_august2024.pdf

So I suspect we are getting blocked. I need to investigate whether we can change any settings (maybe the user agent string) to work around it.
- Try just hitting this URL, or just this and a handful of others. If we only make a few requests, do we still get blocked?
- Try with different user agent strings.

Worth noting: we monitor a bunch of URLs at `https://www.fema.gov/*` but this is the only one at `https://www.fema.gov/sites/*`, which looks like where static files are stored, and are probably served directly by a proxy or lower-level server without going through the application that serves HTML pages. So it maybe that this URL is getting handled differently from other FEMA URLs we monitor.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate crawl failures for FEMA's National Resilience Guidance #15

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Investigate crawl failures for FEMA's National Resilience Guidance #15

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions