A Python-based data collection tool that demonstrates API rate limiting, proxy rotation, authentication, and organized data management using the GitHub REST API. Efficiently fetches user profiles, repository information, and README files while respecting API constraints and bypassing IP-based rate limits through intelligent proxy management.
This project showcases how to collect and structure public GitHub data in a scalable, organized manner. It implements custom rate limiting, automated proxy rotation, and secure authentication to ensure API compliance while maximizing data collection capacity. All collected data is saved in timestamped, hierarchical folders for easy access and analysis.
- Automatic IP Rotation: Cycles through multiple proxies to distribute requests
- Request Tracking: Monitors usage per proxy (60 requests/proxy limit)
- Auto-Switching: Automatically rotates to next proxy when limit reached
- Load Balancing: Evenly distributes API calls across all available proxies
- Proxy Management: Built-in Webshare API integration for proxy retrieval
- Controls API call frequency with a configurable `@rate_limit` decorator
- Prevents exceeding GitHub's API rate limits
- Demonstrates safe and scalable API usage patterns
- Works in conjunction with proxy rotation for maximum throughput
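
A sliding-window decorator with this behavior could look like the following minimal sketch (the project's actual `rate_limit` implementation may differ):

```python
import time
from functools import wraps

def rate_limit(max_calls, period):
    """Allow at most max_calls calls per rolling window of `period` seconds."""
    def decorator(func):
        call_times = []  # timestamps of recent calls

        @wraps(func)
        def wrapper(*args, **kwargs):
            now = time.monotonic()
            # Discard timestamps that have left the window
            while call_times and now - call_times[0] >= period:
                call_times.pop(0)
            if len(call_times) >= max_calls:
                # Wait until the oldest call exits the window
                time.sleep(period - (now - call_times[0]))
            call_times.append(time.monotonic())
            return func(*args, **kwargs)
        return wrapper
    return decorator
```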
- User Profiles: Fetches comprehensive user details via `/users/{username}`
- Repositories: Retrieves full repository lists via `/users/{username}/repos`
- README Files: Downloads and decodes README content via `/repos/{username}/{repo}/readme`
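
For reference, a bare-bones sketch of these three calls using `requests` (pagination and error handling omitted; note that the README endpoint returns base64-encoded content):

```python
import base64
import requests

BASE = "https://api.github.com"

def fetch_user(username):
    profile = requests.get(f"{BASE}/users/{username}").json()
    repos = requests.get(f"{BASE}/users/{username}/repos").json()
    return profile, repos

def fetch_readme(username, repo):
    resp = requests.get(f"{BASE}/repos/{username}/{repo}/readme")
    if resp.status_code != 200:
        return None  # repository has no README
    # GitHub delivers README content base64-encoded
    return base64.b64decode(resp.json()["content"]).decode("utf-8")
```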
- Timestamped Folders: Each run creates a unique folder (e.g., `data/2025-10-28_16-30-12/`)
- User Directories: Individual folders for each user containing `details.json`
- Aggregated CSV: Combined repository data across all users in `github_repos.csv`
- README Storage: README files organized in a `readmes/{username}/` structure
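
As a sketch, a run layout like this could be created as follows (the helper name is illustrative, not the script's actual one):

```python
import json
from datetime import datetime
from pathlib import Path

# One unique folder per run, e.g. data/2025-10-28_16-30-12/
run_dir = Path("data") / datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

def save_user_details(username, details):
    """Write one details.json per user inside the timestamped run folder."""
    user_dir = run_dir / username
    user_dir.mkdir(parents=True, exist_ok=True)
    with open(user_dir / "details.json", "w") as f:
        json.dump(details, f, indent=2)
```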
- Supports GitHub Personal Access Tokens (PATs)
- Increases rate limits from 60 to 5,000 requests/hour
- Secure token handling for API authentication
- Compatible with proxy authentication
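
Authenticated requests simply carry the PAT in the `Authorization` header; a minimal example (assuming the token is stored in the `GITHUB_TOKEN` environment variable):

```python
import os
import requests

headers = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}
resp = requests.get("https://api.github.com/users/torvalds", headers=headers)
# Authenticated requests draw from the 5,000/hour quota
print(resp.headers.get("X-RateLimit-Remaining"))
```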
Project layout:

```
.
├── data/
│   └── 2025-10-28_16-24-20/        # Timestamped collection
│       ├── torvalds/
│       │   └── details.json        # User profile & repos
│       ├── BhavinOndhiya/
│       │   └── details.json
│       └── github_repos.csv        # Aggregated repository data
├── readmes/
│   ├── torvalds/
│   │   └── linux_README.md         # Decoded README files
│   └── BhavinOndhiya/
│       └── sample_README.md
├── proxy.json                      # Proxy configuration file
├── .env                            # Environment variables
├── fetch_github_data.py            # Main scraper script
├── webshare_proxy_manager.py       # Proxy management script
├── helper_class.py                 # Helper utilities
└── README.md
```
- Python 3.7 or higher
- pip package manager
- Webshare account (for proxy service) - Optional but recommended for large-scale scraping
- Clone the repository

```bash
git clone https://github.com/yourusername/github-data-fetcher.git
cd github-data-fetcher
```

- Install dependencies

```bash
pip install requests pandas python-dotenv
```

- Configure environment variables

Create a `.env` file in the project root:

```env
# GitHub Authentication (Optional but recommended)
GITHUB_TOKEN=your_personal_access_token_here

# Webshare Proxy Configuration (Required for proxy rotation)
API_KEY=your_webshare_api_key
BASE_URL=https://proxy.webshare.io/api/v2/
PROFILE_URL=https://proxy.webshare.io/api/v2/profile/
SUBSCRIPTION_URL=https://proxy.webshare.io/api/v2/subscription/
CONFIG_URL=https://proxy.webshare.io/api/v2/proxy/config/
PROXY_LIST_URL=https://proxy.webshare.io/api/v2/proxy/list/
PROXY_STATS_URL=https://proxy.webshare.io/api/v2/proxy/stats/
```

- Fetch the proxy list (if using proxy rotation)

```bash
python webshare_proxy_manager.py
```

This will create a `proxy.json` file with all available proxies.
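
The scripts can then read these variables with python-dotenv; a minimal sketch:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root
github_token = os.getenv("GITHUB_TOKEN")
webshare_api_key = os.getenv("API_KEY")
```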
Basic usage in Python:

```python
from github_fetcher import GitHubFetcher

# Initialize fetcher
fetcher = GitHubFetcher(token="your_github_token")

# Fetch data for users
users = ["torvalds", "BhavinOndhiya"]
fetcher.fetch_users(users)
```

To run with automatic proxy rotation:

```bash
# First, fetch the proxy list from Webshare
python webshare_proxy_manager.py

# Then run the scraper with automatic proxy rotation
python fetch_github_data.py --users torvalds BhavinOndhiya guido
```

The script will automatically:
- Load proxies from `proxy.json`
- Rotate through proxies for each request
- Track requests per proxy
- Switch to the next proxy after 60 requests
- Display proxy usage statistics at the end
```bash
# Fetch specific users
python fetch_github_data.py --users torvalds BhavinOndhiya

# Fetch with a custom repo limit
python fetch_github_data.py --users torvalds --repo-limit 10

# Run without proxies
python fetch_github_data.py --users torvalds --no-proxy
```
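
These flags could be wired up with `argparse` along these lines (a sketch; the script's actual argument handling may differ):

```python
import argparse

parser = argparse.ArgumentParser(description="Fetch GitHub user data")
parser.add_argument("--users", nargs="+", required=True,
                    help="GitHub usernames to fetch")
parser.add_argument("--repo-limit", type=int, default=None,
                    help="Maximum repositories to fetch per user")
parser.add_argument("--no-proxy", action="store_true",
                    help="Disable proxy rotation")
args = parser.parse_args()
```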
The project includes a dedicated proxy management class (`CWEBSHARE`) for seamless integration with Webshare's proxy service.

```python
from webshare_proxy_manager import CWEBSHARE

# Initialize proxy manager
proxy_manager = CWEBSHARE()

# Authenticate with the API
proxy_manager.authenticate()

# Get user profile information
proxy_manager.get_user_profile_info()

# Get subscription details
proxy_manager.get_subscription_info()

# Get proxy configuration
proxy_manager.get_proxy_configuration_info()

# Fetch all proxies and save to file
proxy_manager.get_proxy_list('proxy.json')

# Get proxy usage statistics
proxy_manager.get_proxy_stats()
```

The `proxy.json` file generated by the proxy manager follows this structure:
```json
{
  "date": "2025-10-29",
  "proxies": [
    {
      "username": "your_username",
      "password": "your_password",
      "proxy_address": "192.186.151.77",
      "ports": {
        "http": 8578,
        "socks5": 8578
      },
      "valid": true,
      "last_verification": "2025-10-29T04:52:05.792371-07:00",
      "country_code": "US",
      "country_code_confidence": 1.0,
      "city_name": "New Orleans"
    }
  ]
}
```

The rotation workflow:
- Loading: The script loads all valid proxies from `proxy.json`
- Cycling: Uses `itertools.cycle` to create an infinite rotation
- Formatting: Converts proxy data to `http://username:password@address:port`
- Tracking: Monitors requests per proxy using a counter dictionary
- Rotation: Automatically switches when a proxy reaches 60 requests
- Reporting: Displays usage statistics for each proxy at completion
```python
import json
import requests
from itertools import cycle

# Load and format proxies
def load_proxies(proxy_file="proxy.json"):
    with open(proxy_file, 'r') as f:
        data = json.load(f)

    proxies_list = data.get("proxies", [])
    formatted_proxies = []
    for proxy in proxies_list:
        username = proxy["username"]
        password = proxy["password"]
        address = proxy["proxy_address"]
        port = proxy["ports"]["http"]
        proxy_url = f"http://{username}:{password}@{address}:{port}"
        formatted_proxies.append({
            "http": proxy_url,
            "https": proxy_url
        })
    return cycle(formatted_proxies)

# Usage in requests
PROXY_CYCLE = load_proxies()
proxy = next(PROXY_CYCLE)
response = requests.get(url, proxies=proxy, timeout=10)
```
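
The tracking and rotation steps can be sketched on top of this, using a counter dictionary like the `REQUEST_COUNT` referenced elsewhere in this README (a sketch, not the script's exact code):

```python
from collections import defaultdict

REQUEST_COUNT = defaultdict(int)  # requests served per proxy URL
MAX_PER_PROXY = 60

def get_next_proxy(proxy_cycle):
    """Return the next proxy with remaining quota, advancing the cycle."""
    proxy = next(proxy_cycle)
    key = proxy["http"]
    if REQUEST_COUNT[key] >= MAX_PER_PROXY:
        # Skip exhausted proxies (assumes at least one still has quota)
        return get_next_proxy(proxy_cycle)
    REQUEST_COUNT[key] += 1
    return proxy
```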
"username": "torvalds",
"name": "Linus Torvalds",
"location": "Portland, OR",
"public_repos": 5,
"followers": 150000,
"following": 0,
"repos": [
{
"repo_name": "linux",
"stars": 150000,
"forks": 45000,
"language": "C",
"description": "Linux kernel source tree",
"readme_found": true
}
]
}| Column | Description |
|---|---|
| Username | Repository owner |
| Repo Name | Repository name |
| Stars | Star count |
| Forks | Fork count |
| Language | Primary programming language |
| Description | Repository description |
| README Found | Whether README was downloaded |
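
A minimal way to produce such a CSV with pandas (column names assumed to match the table above):

```python
import pandas as pd

rows = [
    {"Username": "torvalds", "Repo Name": "linux", "Stars": 150000,
     "Forks": 45000, "Language": "C",
     "Description": "Linux kernel source tree", "README Found": True},
]
pd.DataFrame(rows).to_csv("github_repos.csv", index=False)
```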
Adjust the rate limit decorator in your code:
```python
@rate_limit(max_calls=5, period=2)  # 5 calls every 2 seconds
def api_call():
    pass
```

Modify the request limit per proxy:

```python
# In the get_next_proxy() function
if REQUEST_COUNT[proxy_key] >= 60:  # Change 60 to your desired limit
    print("Proxy reached limit, rotating...")
    return get_next_proxy()
```

| Authentication | Requests/Hour | With Proxies |
|---|---|---|
| Unauthenticated | 60/IP | 60 × Number of Proxies |
| Authenticated (PAT) | 5,000/IP | 5,000 × Number of Proxies |
With 10 proxies:
- Unauthenticated: 60 × 10 = 600 requests/hour
- Authenticated: 5,000 × 10 = 50,000 requests/hour
Recommendation: Use both authentication AND proxy rotation for maximum throughput.
- Sign up at Webshare.io
- Get an API key from Dashboard → API
- Choose a plan:
  - Free: 10 proxies
  - Paid: 100+ proxies
- Add to `.env`: Copy your API key to the `.env` file
- Fetch proxies: Run `python webshare_proxy_manager.py`
The code supports any proxy service that provides:
- HTTP/HTTPS proxy support
- Username/password authentication
- JSON configuration format
Simply format your proxies in the same structure as `proxy.json`.
Example console output:

```
Loaded 10 valid proxies

torvalds (Linus Torvalds)
  Location: Portland, OR
  Public repos: 5
  Followers: 180000 | Following: 0

Fetching repositories and README files...

Repositories:
  linux
    Stars: 180000, Forks: 50000, Language: C
    Linux kernel source tree
    README saved → readmes/torvalds/linux_README.md

Saved user details → data/2025-10-29_16-24-20/torvalds/details.json
Combined CSV saved → data/2025-10-29_16-24-20/github_repos.csv

All data saved inside folder: data/2025-10-29_16-24-20

Proxy Usage Summary:
  • 192.186.151.77:8578: 15 requests
  • 95.214.251.125:7035: 12 requests
  • 64.137.10.30:5680: 18 requests
```
Contributions are welcome! Here's how you can help:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Ensure rate limiting is respected
- Test proxy rotation with multiple proxies
- Add appropriate error handling
- Update documentation
- Include example usage
- Large-Scale Research: Analyze thousands of GitHub users without IP limits
- Data Science Projects: Collect extensive datasets for ML/AI projects
- Portfolio Building: Gather comprehensive repository statistics
- Education: Learn API integration, proxy management, and rate limiting
- Monitoring: Track repository statistics and trends over time
- Competitive Analysis: Research competing projects and developers
- Respect GitHub's Terms of Service when using this tool
- Rate limiting is enforced to prevent API abuse even with proxies
- Personal tokens should never be committed to version control
- Proxy credentials must be kept secure and not shared publicly
- Data is collected from public repositories only
- Each proxy should not exceed 60 requests to avoid detection
- Monitor proxy health and rotate out non-working proxies
- Timestamps follow ISO 8601 format (YYYY-MM-DD_HH-MM-SS)
For slow proxies, increase the request timeout:

```python
# Increase timeout for slow proxies
response = requests.get(url, proxies=proxy, timeout=15)  # Default: 10
```

If a proxy stops working:
- Verify the credentials in `proxy.json` are correct
- Check whether the proxy is still active and valid
- Run `python webshare_proxy_manager.py` to refresh the proxy list

If you still hit rate limits:
- Verify proxy rotation is working (check the console output)
- Ensure `REQUEST_COUNT` is tracking correctly
- Try reducing requests per proxy from 60 to 50
- Add longer delays between requests
If `proxy.json` is missing:

```bash
# Fetch proxies from Webshare
python webshare_proxy_manager.py

# Or create the file manually following the JSON structure shown above
```

If requests are slow:
- Use proxies closer to your geographic location
- Upgrade to a faster proxy service plan
- Reduce concurrent requests
- Check network connectivity
This project is licensed under the MIT License - see the LICENSE file for details.
- Built with the GitHub REST API
- Proxy services provided by Webshare.io
- Inspired by the need for ethical large-scale data collection
- Thanks to all contributors and the open-source community
- GitHub API Documentation
- Webshare API Documentation
- Python Requests Library
- Rate Limiting Best Practices
Bhavin Ondhiya
- GitHub: @BhavinOndhiya
- Location: Surat, Gujarat, India
For questions or suggestions, please:
- Open an issue on GitHub
- Reach out via GitHub profile
Made with ❤️ by Bhavin Ondhiya
Last Updated: October 29, 2025