
Conversation


@GrayHat12 GrayHat12 commented Nov 11, 2020

Hello.
I'm new to open source contribution. I saw your issue #6 and created a robots.py file that might help you.
read_disallows(url): takes a URL and returns a list of compiled pattern objects, one for each Disallow entry in the robots.txt of the URL's base URL.
I tested it by passing "https://github.com/GrayHat12" to the function.
It extracted the base URL "https://github.com" and read robots.txt with a GET request to "https://github.com/robots.txt".
Then I used a regex to extract all disallowed paths.
Next, I converted those paths into regex strings that can be compared against any URL with the same base URL (github.com).
For example, one disallowed path is "/*/stargazers".
I converted it to "/[^/]*/stargazers", compiled it into a pattern object, and added it to the disallowed list that the function returns.
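
Roughly, a minimal sketch of what such a `read_disallows` could look like, based on the steps above (this is an assumption about the approach, not the exact code in robots.py):

```python
import re
import urllib.parse
import urllib.request


def read_disallows(url):
    # Extract the base URL, e.g. "https://github.com" from
    # "https://github.com/GrayHat12".
    parts = urllib.parse.urlparse(url)
    base_url = f"{parts.scheme}://{parts.netloc}"

    # Fetch robots.txt with a plain GET request.
    with urllib.request.urlopen(base_url + "/robots.txt") as resp:
        robots_txt = resp.read().decode("utf-8", errors="ignore")

    # Pull out the path of every "Disallow:" line.
    disallowed_paths = re.findall(r"^Disallow:\s*(\S+)", robots_txt, re.MULTILINE)

    # Turn each path into a regex: "*" becomes "[^/]*", so
    # "/*/stargazers" becomes "/[^/]*/stargazers".
    patterns = []
    for path in disallowed_paths:
        regex = re.escape(path).replace(r"\*", "[^/]*")
        patterns.append(re.compile(regex))
    return patterns
```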

Now when you check a URL like "https://github.com/chiphuyen/lazynlp/stargazers" against the pattern "/[^/]*/stargazers", a match is found (e.g. by running re.search on the URL's path) and you can choose not to crawl it.
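
Hypothetical usage (the names follow the sketch above, not necessarily the PR's API): check a candidate URL's path against the compiled patterns before crawling.

```python
import urllib.parse

patterns = read_disallows("https://github.com/GrayHat12")

candidate = "https://github.com/chiphuyen/lazynlp/stargazers"
path = urllib.parse.urlparse(candidate).path  # "/chiphuyen/lazynlp/stargazers"

# One of the patterns is "/[^/]*/stargazers"; re.search finds it inside
# the path ("/lazynlp/stargazers"), so the URL is treated as disallowed.
if any(p.search(path) for p in patterns):
    print("Disallowed by robots.txt, skipping:", candidate)
else:
    print("OK to crawl:", candidate)
```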

Hope this was explanatory enough. I didn't understand the ai.txt part of the issue, though. It would be great if someone could elaborate on that. 🐰

Sorry for any issues with my pull request. I'm new to this and am hoping someone will guide me through it.
