This repository was archived by the owner on Nov 20, 2025. It is now read-only.

Firewall issues when crawling some websites #83

@CarlosR122

Here are a few examples of sites where Heritrix has been blocked by a firewall or a CAPTCHA:

| Target | Website | Example instance or latest instance | Comment |
| --- | --- | --- | --- |
| https://www.webarchive.org.uk/act/targets/128627 | https://www.signatureaviation.com/ | https://www.webarchive.org.uk/act/wayback/archive/20210824094106/https://www.signatureaviation.com/ | seems ok now |
| https://www.webarchive.org.uk/act/targets/3706 | http://www.crawleyobserver.co.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20210402101105/http://www.crawleyobserver.co.uk/ | seems ok now |
| https://www.webarchive.org.uk/act/targets/136007 | https://www.teachwire.net/ | https://www.webarchive.org.uk/act/wayback/archive/20220106111103/https://www.teachwire.net/ | seems ok now |
| https://www.webarchive.org.uk/act/targets/147300 | https://www.schuh.co.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20220721100444/https://www.schuh.co.uk/ | still not crawling |
| https://www.webarchive.org.uk/act/targets/155587#crawlpolicy | https://cilexjournal.org.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20220220090651/https://cilexjournal.org.uk/ | still not crawling |
| https://www.webarchive.org.uk/act/targets/149261 | https://teamnnuh.co.uk/ | | no captures, no info in logs |
| https://www.webarchive.org.uk/act/targets/156010 | https://hospicefoundation.ie/ | https://www.webarchive.org.uk/act/wayback/archive/20220715091752/https://hospicefoundation.ie/ | still not crawling |
| https://www.webarchive.org.uk/act/targets/156865 | https://www.odeon.co.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20220730103903/https://www.odeon.co.uk/ | still an issue, Cloudflare |
| https://www.webarchive.org.uk/act/targets/157334 | https://muslimcharity.org.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20220723100028/https://muslimcharity.org.uk/ | still an issue, Cloudflare |
| https://www.webarchive.org.uk/act/targets/159206 | https://www.greencoat-renewables.com/ | https://www.webarchive.org.uk/act/wayback/archive/20220722094820/https://www.greencoat-renewables.com/ | still an issue |
| https://www.webarchive.org.uk/act/targets/158590 | https://www.diehardia.com/ | | no captures, no info in logs |
| https://www.webarchive.org.uk/act/targets/157211 | https://www.poferries.com/ | | not crawling since March 2022, -5000, -5002 |
| https://www.webarchive.org.uk/act/targets/3851 | https://www.thetimes.co.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20220801103716/https://www.thetimes.co.uk/ | still an issue, CloudFront |
| https://www.webarchive.org.uk/act/targets/160154 | https://www.techagainstterrorism.org/ | https://www.webarchive.org.uk/act/wayback/archive/20220728094007/https://www.techagainstterrorism.org/ | still an issue, Cloudflare |
| https://www.webarchive.org.uk/act/targets/160474 | https://www.riverstonellc.com/ | | not crawling since May 2022, -5002 |
| https://www.webarchive.org.uk/act/targets/161338 | https://www.missguided.co.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20220621091058/https://www.missguided.co.uk/ | still an issue, CAPTCHA |
| https://www.webarchive.org.uk/act/targets/10645 | https://www.fortnumandmason.com/ | https://www.webarchive.org.uk/act/wayback/archive/20220627094005/https://www.fortnumandmason.com/ | still an issue, Cloudflare |
| https://www.webarchive.org.uk/act/targets/161938 | https://www.amnh.org/ | https://www.webarchive.org.uk/act/wayback/archive/20220705102408/https://www.amnh.org/research/darwin-manuscripts/?__cf_chl_rt_tk=AXI6j2vIje19Hv9U7uS5sSTpF9t2GyuzsVvhaLnUVDU-1657016648-0-gaNycGzNCKU | still an issue, Cloudflare |
| https://www.webarchive.org.uk/act/targets/131772 | https://cumbriacrack.com/ | https://www.webarchive.org.uk/act/wayback/archive/20220730101655/https://cumbriacrack.com/ | still an issue, Cloudflare |
| https://www.webarchive.org.uk/act/targets/149065 | https://ort.org/ | https://www.webarchive.org.uk/act/wayback/archive/20220517094742/https://ort.org/ | still an issue, Cloudflare |
| https://www.webarchive.org.uk/act/targets/164270 | https://www.vistrygroup.co.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20220716100156/https://www.vistrygroup.co.uk/ | not crawling, -5002 |
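
One way to triage captures like these might be to flag archived URLs that carry a Cloudflare challenge token, as the amnh.org example does with its `__cf_chl_rt_tk` query parameter. The sketch below is only an illustration: the function names are hypothetical, and it assumes that challenge redirects use query parameters whose names start with `__cf_chl_`. It will not catch challenge pages served at the original URL without such a parameter.

```python
from urllib.parse import parse_qs, urlsplit

# Assumption: Cloudflare challenge redirects add query parameters whose
# names start with this prefix (seen in the amnh.org capture listed above).
CHALLENGE_PREFIX = "__cf_chl_"


def original_url(wayback_url: str) -> str:
    """Return the archived target URL from a '.../archive/<timestamp>/<url>' link.

    Hypothetical helper: if the link does not match that shape, it is
    returned unchanged.
    """
    marker = "/archive/"
    idx = wayback_url.find(marker)
    if idx == -1:
        return wayback_url
    rest = wayback_url[idx + len(marker):]
    timestamp, _, target = rest.partition("/")
    return target if timestamp.isdigit() and target else wayback_url


def looks_like_cloudflare_challenge(wayback_url: str) -> bool:
    """True if the archived URL carries a '__cf_chl_*' query parameter."""
    query = urlsplit(original_url(wayback_url)).query
    params = parse_qs(query, keep_blank_values=True)
    return any(name.startswith(CHALLENGE_PREFIX) for name in params)


if __name__ == "__main__":
    examples = [
        "https://www.webarchive.org.uk/act/wayback/archive/20220705102408/"
        "https://www.amnh.org/research/darwin-manuscripts/"
        "?__cf_chl_rt_tk=AXI6j2vIje19Hv9U7uS5sSTpF9t2GyuzsVvhaLnUVDU-1657016648-0-gaNycGzNCKU",
        "https://www.webarchive.org.uk/act/wayback/archive/20220730103903/"
        "https://www.odeon.co.uk/",
    ]
    for url in examples:
        print(looks_like_cloudflare_challenge(url), original_url(url))
```

Only the amnh.org capture is flagged by this check; the other Cloudflare rows in the table have clean URLs, so they would need a content-based check instead.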
