Investigate crawl failures for FEMA's National Resilience Guidance #15

@Mr0grog

Our crawls have been consistently timing out on https://www.fema.gov/sites/default/files/documents/fema_national-resilience-guidance_august2024.pdf for the last few weeks. The Internet Archive (IA) also seems to be failing to capture it: https://web.archive.org/web/20251001001420/https://www.fema.gov/sites/default/files/documents/fema_national-resilience-guidance_august2024.pdf

So I suspect we are getting blocked. I need to investigate whether we can change any settings (maybe the user agent string) to work around it.

  • Try requesting only this URL, or only this one and a handful of others. If we make just a few requests, do we still get blocked?
  • Try with different user agent strings.
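
A minimal sketch of how both checks could be scripted, using only the Python standard library. The user agent strings, the `probe` and `compare_user_agents` names, and the timeout and pause values are all placeholders for illustration, not anything taken from our crawler's configuration.

```python
import socket
import time
import urllib.error
import urllib.request

# Hypothetical user agent strings to compare (assumptions, not our real config).
USER_AGENTS = {
    "browser-like": "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
    "python-default": "Python-urllib/3",
    "custom-bot": "example-monitoring-bot/0.1 (+https://example.org/bot)",
}


def probe(url, user_agent, timeout=30, opener=urllib.request.urlopen):
    """Make one GET request and summarize the outcome as a dict."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    started = time.monotonic()
    try:
        with opener(request, timeout=timeout) as response:
            # Read only the first KB: enough to confirm the body starts arriving.
            first_bytes = response.read(1024)
            outcome = {"status": response.status, "bytes_read": len(first_bytes)}
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status (e.g. 403 or 429).
        outcome = {"status": error.code, "bytes_read": 0}
    except (urllib.error.URLError, socket.timeout, TimeoutError) as error:
        # No usable answer at all: timeout, reset connection, DNS failure, etc.
        outcome = {"status": None, "bytes_read": 0, "error": repr(error)}
    outcome["seconds"] = round(time.monotonic() - started, 2)
    return outcome


def compare_user_agents(url, user_agents=USER_AGENTS, pause=5,
                        opener=urllib.request.urlopen):
    """Probe the same URL once per user agent, pausing between requests."""
    results = {}
    for label, user_agent in user_agents.items():
        results[label] = probe(url, user_agent, opener=opener)
        # Space requests out so the experiment itself does not look like a crawl.
        time.sleep(pause)
    return results
```

Calling `compare_user_agents(...)` with the PDF URL above and printing the result would show, per user agent, whether we got a status code or a timeout and how long it took. If every variant times out even at this low request rate, the user agent string is probably not the deciding factor.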

Worth noting: we monitor a bunch of URLs at https://www.fema.gov/*, but this is the only one at https://www.fema.gov/sites/*. That path looks like where static files are stored, and they are probably served directly by a proxy or lower-level server without going through the application that serves HTML pages. So it may be that this URL is handled differently from the other FEMA URLs we monitor.
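
One way to test that hypothesis, sketched below, is to compare infrastructure-related response headers from a regular FEMA HTML page against those from the static file. The header list and the function names are assumptions chosen for illustration; which headers FEMA's servers actually send is unknown until we look.

```python
# Headers that often reveal which layer answered a request. This list is an
# assumption for illustration; "x-cache" in particular is non-standard.
INFRA_HEADERS = ("server", "via", "x-cache", "age", "cache-control")


def infra_fingerprint(headers):
    """Pick out the infrastructure-related headers, matching names case-insensitively."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return {name: lowered.get(name) for name in INFRA_HEADERS}


def differing_headers(page_headers, file_headers):
    """Return {header: (page_value, file_value)} for headers that differ."""
    page = infra_fingerprint(page_headers)
    static = infra_fingerprint(file_headers)
    return {
        name: (page[name], static[name])
        for name in INFRA_HEADERS
        if page[name] != static[name]
    }
```

The inputs can be any mapping of header names to values, such as `dict(response.headers.items())` from a `urllib.request.urlopen` response. A non-empty result would be consistent with the two URLs being served by different layers, though matching headers would not rule it out.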

Status: Prioritized