Description
Hi, we recently switched from Reppy to Protego as our robots.txt parser. Everything seemed fine, but we noticed a few differences between Reppy and Protego in the URLs we were crawling: essentially, Protego appeared to allow access to URLs that should be blocked. Protego follows the Google specification and Reppy does not, so some differences are to be expected. However, the official Google Robots Tester also blocks access to these URLs, so there appears to be a genuine error here.
The rule in the robots.txt file that appears to be ignored is /en-uk/*q=*relevance*
and an example of a URL that is not being filtered by this rule is /en-uk/new-wildlife-range/c/4018?q=%3Arelevance%3Atype%3AHedgehog%2BFood
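For reference, the relevant portion of the site's robots.txt presumably looks something like the following (the User-agent line is our assumption; only the Disallow rule is quoted from the actual file):

```
User-agent: *
Disallow: /en-uk/*q=*relevance*
```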
The output from the Google Robots Tester confirms that this URL should be blocked by the aforementioned rule.
Having looked at the Protego code, we believe we have found where this apparent error comes from. We also think we have a fix for it, and will happily submit it for your scrutiny, as we'd like to know whether there are unforeseen consequences from this change.
The problem involves the percent-encoding (ASCII hex-encoding) of the URL string. Protego splits the URL into parts, e.g.:
scheme='http', netloc='www.website.com', path='/en-uk/new-wildlife-range/c/4018', params='', query='q=%3Arelevance%3Atype%3AHedgehog%2BFood', fragment=''
It then percent-encodes symbols in the "path" part, removes the "scheme" and "netloc" parts, and reassembles the URL to compare against the rules in robots.txt. The issue we're seeing is that only the "path" part is encoded; the "query" part is left alone.
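As a rough illustration of the behaviour we're describing (this is our own sketch built on urllib.parse, not Protego's actual code):

```python
from urllib.parse import urlparse, quote

def normalize_for_matching(url):
    # Split the URL into its components.
    parts = urlparse(url)
    # Percent-encode special characters in the path; '%' is kept in the
    # safe set so existing escapes are not double-encoded.
    path = quote(parts.path, safe="/%")
    # The query string is reattached untouched - this is the behaviour
    # that causes the mismatch described below.
    return f"{path}?{parts.query}" if parts.query else path

url = ("http://www.website.com/en-uk/new-wildlife-range/c/4018"
       "?q=%3Arelevance%3Atype%3AHedgehog%2BFood")
print(normalize_for_matching(url))
# -> /en-uk/new-wildlife-range/c/4018?q=%3Arelevance%3Atype%3AHedgehog%2BFood
```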
We end up with this as the URL to be checked:
/en-uk/new-wildlife-range/c/4018?q=%3Arelevance%3Atype%3AHedgehog%2BFood
When a regex search is applied to it using the pattern /en-uk/.*?q%3D.*?relevance.*?, no match is found, because the = in the URL has not been encoded to %3D.
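The mismatch is easy to reproduce with Python's re module, assuming the pattern above is applied against the reassembled URL:

```python
import re

pattern = re.compile(r"/en-uk/.*?q%3D.*?relevance.*?")

unencoded = "/en-uk/new-wildlife-range/c/4018?q=%3Arelevance%3Atype%3AHedgehog%2BFood"
encoded = "/en-uk/new-wildlife-range/c/4018?q%3D%3Arelevance%3Atype%3AHedgehog%2BFood"

print(pattern.match(unencoded))  # None: the URL contains 'q=', never 'q%3D'
print(pattern.match(encoded))    # <re.Match ...>: '=' is encoded as '%3D'
```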
The fix we have is simple: it just encodes the "query" part in the same way as the "path" part. So instead we end up with this URL:
/en-uk/new-wildlife-range/c/4018?q%3D%3Arelevance%3Atype%3AHedgehog%2BFood
which matches the regex pattern correctly, and crawler access is blocked.
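In terms of the sketch above, the fix amounts to quoting the query component as well (again, this is our illustration rather than the actual patch):

```python
from urllib.parse import urlparse, quote

def normalize_for_matching_fixed(url):
    parts = urlparse(url)
    path = quote(parts.path, safe="/%")
    if parts.query:
        # Encode the query the same way as the path, so '=' becomes
        # '%3D' and lines up with the encoded form used in the rule
        # patterns; '%' stays in the safe set to avoid double-encoding.
        return f"{path}?{quote(parts.query, safe='%')}"
    return path

print(normalize_for_matching_fixed(
    "http://www.website.com/en-uk/new-wildlife-range/c/4018"
    "?q=%3Arelevance%3Atype%3AHedgehog%2BFood"
))
# -> /en-uk/new-wildlife-range/c/4018?q%3D%3Arelevance%3Atype%3AHedgehog%2BFood
```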
Is this likely to cause any unforeseen issues?
Thanks