waybackrobots

Collect old robots.txt files from the Wayback Machine and download the Disallow paths.
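
A minimal sketch of how pulling the Disallow paths out of a robots.txt body could look in Go. This is illustrative only, not the tool's actual implementation; extractDisallows is a hypothetical helper:

// Illustrative sketch (not the tool's source): extract every path that
// follows a "Disallow:" directive from a robots.txt body.
package main

import (
	"bufio"
	"fmt"
	"strings"
)

func extractDisallows(robotsTxt string) []string {
	var paths []string
	scanner := bufio.NewScanner(strings.NewReader(robotsTxt))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if !strings.HasPrefix(strings.ToLower(line), "disallow:") {
			continue
		}
		path := strings.TrimSpace(line[len("disallow:"):])
		if path != "" {
			paths = append(paths, path)
		}
	}
	return paths
}

func main() {
	body := "User-agent: *\nDisallow: /admin\nDisallow: /private\n"
	fmt.Println(extractDisallows(body)) // [/admin /private]
}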

Install:

$ go install github.com/ogow/waybackrobots@latest 

Usage:

Just give the -domain flag a valid domain and it will start downloading all archived responses from 2015 onwards:

$ waybackrobots -domain google.com 
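
Under the hood this means asking the Wayback Machine's CDX API for snapshots of <domain>/robots.txt. A rough sketch of such a query in Go, assuming the public CDX parameters (url, from, filter, collapse, output) rather than the tool's exact request; listSnapshots is a hypothetical helper:

// Sketch of listing archived robots.txt snapshots via the Wayback CDX API.
// The exact query waybackrobots builds may differ.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

func listSnapshots(domain string, fromYear int) ([][]string, error) {
	q := url.Values{}
	q.Set("url", domain+"/robots.txt")
	q.Set("from", fmt.Sprintf("%d", fromYear))
	q.Set("filter", "statuscode:200")
	q.Set("collapse", "digest")
	q.Set("output", "json")

	resp, err := http.Get("https://web.archive.org/cdx/search/cdx?" + q.Encode())
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// The JSON output is an array of rows; the first row is the header
	// (urlkey, timestamp, original, mimetype, statuscode, digest, length).
	var rows [][]string
	if err := json.NewDecoder(resp.Body).Decode(&rows); err != nil {
		return nil, err
	}
	if len(rows) > 0 {
		rows = rows[1:] // drop the header row
	}
	return rows, nil
}

func main() {
	rows, err := listSnapshots("example.com", 2015)
	if err != nil {
		panic(err)
	}
	fmt.Printf("[i] found %d old robots.txt files\n", len(rows))
}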

Get all responses starting from a custom year:

$ waybackrobots -domain google.com -fd 2020

Sometimes the Wayback API can return a lot of results, which takes a long time to download. To avoid this, use the -strat flag. It takes one of these values: day, month or digest (digest is the default). Usually digest is the go-to value, but for a domain like google.com that has been archived a lot, the digest filter will still return many results. In that case, try the day filter, which keeps only one snapshot per day.

An explanation of the collapsing filters used: https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server#collapsing
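
As a rough illustration, the three -strat values could map onto CDX collapse parameters like this. This is an assumed mapping based on the collapsing docs above, not the tool's source; collapseParam is a hypothetical helper:

package main

import "fmt"

// collapseParam maps a -strat value to a CDX "collapse" parameter. Assumed
// mapping: CDX timestamps are yyyyMMddhhmmss, so the first 8 digits identify
// a day and the first 6 a month.
func collapseParam(strat string) string {
	switch strat {
	case "day":
		return "timestamp:8" // at most one snapshot per day
	case "month":
		return "timestamp:6" // at most one snapshot per month
	default:
		return "digest" // skip adjacent snapshots with identical content
	}
}

func main() {
	for _, s := range []string{"digest", "day", "month"} {
		fmt.Printf("-strat %s -> collapse=%s\n", s, collapseParam(s))
	}
}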

Comparing digest with day and month:

$ go run . -domain google.com -strat digest
[i] found 38261 old robots.txt files

$ go run . -domain google.com -strat day 
[i] found 473 old robots.txt files

$ go run . -domain google.com -strat month
[i] found 122 old robots.txt files

Help:

Usage of waybackrobots:
  -domain string
        which domain to find old robots for
  -fd int
        choose date from when to get robots from format: 2015 (default 2015)
  -strat string
        interval to get robots for, possible values: digest, day, month (default "digest") 
