
Conversation


@GrayHat12 GrayHat12 commented Nov 11, 2020

Hello.
I'm new to open source contribution. I saw your issue #6 and created a robots.py file that might help you.
read_disallows(url): takes a URL and returns a list of compiled pattern objects, one for each Disallow entry in the robots.txt of the URL's base URL.
I tested it by passing "https://github.com/GrayHat12" to the function.
It extracted the base URL "https://github.com" and read robots.txt with a GET request to "https://github.com/robots.txt".
Then I used a regex to extract all disallowed paths.
Next, I converted those paths into regex strings that can be compared against any URL with the same base URL (github.com).
For example, one disallowed path is "/*/stargazers".
I converted it to "/[^/]*/stargazers", compiled it into a pattern object, and added it to the disallowed list that the function returns.
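
Roughly, a minimal sketch of what such a `read_disallows` could look like, based on the steps above (this is an assumption about the approach, not the exact code in robots.py):

```python
import re
import urllib.parse
import urllib.request


def read_disallows(url):
    # Extract the base URL, e.g. "https://github.com" from
    # "https://github.com/GrayHat12".
    parts = urllib.parse.urlparse(url)
    base_url = f"{parts.scheme}://{parts.netloc}"

    # Fetch robots.txt with a plain GET request.
    with urllib.request.urlopen(base_url + "/robots.txt") as resp:
        robots_txt = resp.read().decode("utf-8", errors="ignore")

    # Pull out the path of every "Disallow:" line.
    disallowed_paths = re.findall(r"^Disallow:\s*(\S+)", robots_txt, re.MULTILINE)

    # Turn each path into a regex: "*" becomes "[^/]*", so
    # "/*/stargazers" becomes "/[^/]*/stargazers".
    patterns = []
    for path in disallowed_paths:
        regex = re.escape(path).replace(r"\*", "[^/]*")
        patterns.append(re.compile(regex))
    return patterns
```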

Now when you check a URL like "https://github.com/chiphuyen/lazynlp/stargazers" against the pattern "/[^/]*/stargazers", a match is found (e.g. by running re.search on the URL's path) and you can choose not to crawl it.
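
Hypothetical usage (the names follow the sketch above, not necessarily the PR's API): check a candidate URL's path against the compiled patterns before crawling.

```python
import urllib.parse

patterns = read_disallows("https://github.com/GrayHat12")

candidate = "https://github.com/chiphuyen/lazynlp/stargazers"
path = urllib.parse.urlparse(candidate).path  # "/chiphuyen/lazynlp/stargazers"

# One of the patterns is "/[^/]*/stargazers"; re.search finds it inside
# the path ("/lazynlp/stargazers"), so the URL is treated as disallowed.
if any(p.search(path) for p in patterns):
    print("Disallowed by robots.txt, skipping:", candidate)
else:
    print("OK to crawl:", candidate)
```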

Hope this was explanatory enough. I didn't understand the ai.txt part of the issue, though. It would be great if someone could elaborate on that. 🐰

Sorry for any issues with my pull request. I'm new to this and am hoping someone will guide me through it.
