🧩 GitHub User & Repository Data Fetcher with Proxy Rotation

A Python-based data collection tool that demonstrates API rate limiting, proxy rotation, authentication, and organized data management using the GitHub REST API. Efficiently fetches user profiles, repository information, and README files while respecting API constraints and bypassing IP-based rate limits through intelligent proxy management.

📖 Overview

This project showcases how to collect and structure public GitHub data in a scalable, organized manner. It implements custom rate limiting, automated proxy rotation, and secure authentication to ensure API compliance while maximizing data collection capacity. All collected data is saved in timestamped, hierarchical folders for easy access and analysis.


🚀 Features

✅ Intelligent Proxy Rotation

  • Automatic IP Rotation: Cycles through multiple proxies to distribute requests
  • Request Tracking: Monitors usage per proxy (60 requests/proxy limit)
  • Auto-Switching: Automatically rotates to next proxy when limit reached
  • Load Balancing: Evenly distributes API calls across all available proxies
  • Proxy Management: Built-in Webshare API integration for proxy retrieval

✅ Custom Rate Limiter

  • Controls API call frequency with configurable @rate_limit decorator
  • Prevents exceeding GitHub's API rate limits
  • Demonstrates safe and scalable API usage patterns
  • Works in conjunction with proxy rotation for maximum throughput
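A decorator like the @rate_limit(max_calls=..., period=...) shown in the Configuration section could be implemented roughly as follows. This is a minimal sliding-window sketch, not necessarily the project's actual implementation:

```python
import time
from collections import deque
from functools import wraps

def rate_limit(max_calls, period):
    """Allow at most `max_calls` invocations per rolling `period` seconds."""
    def decorator(func):
        calls = deque()  # timestamps of recent invocations

        @wraps(func)
        def wrapper(*args, **kwargs):
            now = time.monotonic()
            # Discard timestamps that have aged out of the window.
            while calls and now - calls[0] >= period:
                calls.popleft()
            if len(calls) >= max_calls:
                # Sleep until the oldest call leaves the window.
                time.sleep(period - (now - calls[0]))
            calls.append(time.monotonic())
            return func(*args, **kwargs)
        return wrapper
    return decorator

@rate_limit(max_calls=5, period=2)  # 5 calls every 2 seconds
def api_call():
    return "ok"
```

A sliding window is slightly stricter than a fixed-interval sleep: bursts are allowed until the window fills, then calls block only as long as needed.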

✅ GitHub API Integration

  • User Profiles: Fetches comprehensive user details via /users/{username}
  • Repositories: Retrieves full repository lists via /users/{username}/repos
  • README Files: Downloads and decodes README content via /repos/{username}/{repo}/readme
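The README endpoint returns the file base64-encoded inside a JSON envelope, so it must be decoded after download. A sketch of that step (fetch_readme is an illustrative helper name, not the project's actual function):

```python
import base64
import requests

def fetch_readme(username, repo, token=None, proxies=None):
    """Download and decode a repository README via /repos/{user}/{repo}/readme."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"token {token}"
    url = f"https://api.github.com/repos/{username}/{repo}/readme"
    resp = requests.get(url, headers=headers, proxies=proxies, timeout=10)
    if resp.status_code == 404:
        return None  # repository has no README
    resp.raise_for_status()
    # The API wraps the file in JSON with a base64-encoded "content" field.
    return base64.b64decode(resp.json()["content"]).decode("utf-8")
```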

✅ Structured Data Organization

  • Timestamped Folders: Each run creates a unique folder (e.g., data/2025-10-28_16-30-12/)
  • User Directories: Individual folders for each user containing details.json
  • Aggregated CSV: Combined repository data across all users in github_repos.csv
  • README Storage: Organized README files in readmes/{username}/ structure
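The timestamped folder layout can be sketched with the standard library alone (make_run_folder is a hypothetical helper name):

```python
import os
from datetime import datetime

def make_run_folder(base="data"):
    """Create a unique, timestamped folder for this collection run."""
    stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")  # e.g. 2025-10-28_16-30-12
    run_dir = os.path.join(base, stamp)
    os.makedirs(run_dir, exist_ok=True)
    return run_dir
```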

✅ Authenticated Requests

  • Supports GitHub Personal Access Tokens (PATs)
  • Increases rate limits from 60 to 5,000 requests/hour
  • Secure token handling for API authentication
  • Compatible with proxy authentication
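GitHub's GET /rate_limit endpoint (a real endpoint that does not count against the quota) lets you confirm that a token actually raises the limit. The remaining_quota helper below is illustrative:

```python
import requests

def remaining_quota(token=None):
    """Check the current core API quota via GET /rate_limit (this call is free)."""
    headers = {"Authorization": f"token {token}"} if token else {}
    resp = requests.get("https://api.github.com/rate_limit",
                        headers=headers, timeout=10)
    resp.raise_for_status()
    core = resp.json()["resources"]["core"]
    return core["remaining"], core["limit"]
```

Without a token the reported limit is 60; with a valid PAT it should read 5,000.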

πŸ—‚οΈ Project Structure

.
├── data/
│   └── 2025-10-28_16-24-20/          # Timestamped collection
│       ├── torvalds/
│       │   └── details.json          # User profile & repos
│       ├── BhavinOndhiya/
│       │   └── details.json
│       └── github_repos.csv          # Aggregated repository data
├── readmes/
│   ├── torvalds/
│   │   └── linux_README.md           # Decoded README files
│   └── BhavinOndhiya/
│       └── sample_README.md
├── proxy.json                        # Proxy configuration file
├── .env                              # Environment variables
├── fetch_github_data.py              # Main scraper script
├── webshare_proxy_manager.py         # Proxy management script
├── helper_class.py                   # Helper utilities
└── README.md

πŸ› οΈ Installation

Prerequisites

  • Python 3.7 or higher
  • pip package manager
  • Webshare account (for proxy service) - Optional but recommended for large-scale scraping

Setup

  1. Clone the repository
git clone https://github.com/yourusername/github-data-fetcher.git
cd github-data-fetcher
  2. Install dependencies
pip install requests pandas python-dotenv
  3. Configure Environment Variables

Create a .env file in the project root:

# GitHub Authentication (Optional but recommended)
GITHUB_TOKEN=your_personal_access_token_here

# Webshare Proxy Configuration (Required for proxy rotation)
API_KEY=your_webshare_api_key
BASE_URL=https://proxy.webshare.io/api/v2/
PROFILE_URL=https://proxy.webshare.io/api/v2/profile/
SUBSCRIPTION_URL=https://proxy.webshare.io/api/v2/subscription/
CONFIG_URL=https://proxy.webshare.io/api/v2/proxy/config/
PROXY_LIST_URL=https://proxy.webshare.io/api/v2/proxy/list/
PROXY_STATS_URL=https://proxy.webshare.io/api/v2/proxy/stats/
  4. Fetch Proxy List (if using proxy rotation)
python webshare_proxy_manager.py

This will create a proxy.json file with all available proxies.


🎯 Usage

Basic Usage (Without Proxies)

from github_fetcher import GitHubFetcher

# Initialize fetcher
fetcher = GitHubFetcher(token="your_github_token")

# Fetch data for users
users = ["torvalds", "BhavinOndhiya"]
fetcher.fetch_users(users)

Advanced Usage (With Proxy Rotation)

# First, fetch proxy list from Webshare
python webshare_proxy_manager.py

# Then run the scraper with automatic proxy rotation
python fetch_github_data.py --users torvalds BhavinOndhiya guido

The script will automatically:

  1. Load proxies from proxy.json
  2. Rotate through proxies for each request
  3. Track requests per proxy
  4. Switch to next proxy after 60 requests
  5. Display proxy usage statistics at the end

Command Line Options

# Fetch specific users
python fetch_github_data.py --users torvalds BhavinOndhiya

# Fetch with custom repo limit
python fetch_github_data.py --users torvalds --repo-limit 10

# Run without proxies
python fetch_github_data.py --users torvalds --no-proxy

🔄 Proxy Management

Webshare Proxy Integration

The project includes a dedicated proxy management class (CWEBSHARE) for seamless integration with Webshare's proxy service.

Available Methods

from webshare_proxy_manager import CWEBSHARE

# Initialize proxy manager
proxy_manager = CWEBSHARE()

# Authenticate with API
proxy_manager.authenticate()

# Get user profile information
proxy_manager.get_user_profile_info()

# Get subscription details
proxy_manager.get_subscription_info()

# Get proxy configuration
proxy_manager.get_proxy_configuration_info()

# Fetch all proxies and save to file
proxy_manager.get_proxy_list('proxy.json')

# Get proxy usage statistics
proxy_manager.get_proxy_stats()

Proxy JSON Structure

The proxy.json file generated by the proxy manager follows this structure:

{
    "date": "2025-10-29",
    "proxies": [
        {
            "username": "your_username",
            "password": "your_password",
            "proxy_address": "192.186.151.77",
            "ports": {
                "http": 8578,
                "socks5": 8578
            },
            "valid": true,
            "last_verification": "2025-10-29T04:52:05.792371-07:00",
            "country_code": "US",
            "country_code_confidence": 1.0,
            "city_name": "New Orleans"
        }
    ]
}

How Proxy Rotation Works

  1. Loading: Script loads all valid proxies from proxy.json
  2. Cycling: Uses itertools.cycle to create infinite rotation
  3. Formatting: Converts proxy data to http://username:password@address:port
  4. Tracking: Monitors requests per proxy using a counter dictionary
  5. Rotation: Automatically switches when proxy reaches 60 requests
  6. Reporting: Displays usage statistics for each proxy at completion

Proxy Rotation Code Example

import json
import requests
from itertools import cycle

# Load and format proxies
def load_proxies(proxy_file="proxy.json"):
    with open(proxy_file, "r") as f:
        data = json.load(f)
    proxies_list = data.get("proxies", [])

    formatted_proxies = []
    for proxy in proxies_list:
        username = proxy["username"]
        password = proxy["password"]
        address = proxy["proxy_address"]
        port = proxy["ports"]["http"]

        proxy_url = f"http://{username}:{password}@{address}:{port}"
        formatted_proxies.append({
            "http": proxy_url,
            "https": proxy_url,
        })

    return cycle(formatted_proxies)

# Usage in requests
PROXY_CYCLE = load_proxies()
proxy = next(PROXY_CYCLE)
url = "https://api.github.com/users/torvalds"  # example endpoint
response = requests.get(url, proxies=proxy, timeout=10)

📊 Data Schema

details.json Structure

{
  "username": "torvalds",
  "name": "Linus Torvalds",
  "location": "Portland, OR",
  "public_repos": 5,
  "followers": 150000,
  "following": 0,
  "repos": [
    {
      "repo_name": "linux",
      "stars": 150000,
      "forks": 45000,
      "language": "C",
      "description": "Linux kernel source tree",
      "readme_found": true
    }
  ]
}
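Writing this structure to disk only needs the standard library; save_user_details is an illustrative helper, not necessarily the project's actual function:

```python
import json
import os

def save_user_details(run_dir, details):
    """Write one user's profile + repo data to <run_dir>/<username>/details.json."""
    user_dir = os.path.join(run_dir, details["username"])
    os.makedirs(user_dir, exist_ok=True)
    path = os.path.join(user_dir, "details.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(details, f, indent=2, ensure_ascii=False)
    return path
```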

github_repos.csv Columns

| Column       | Description                  |
|--------------|------------------------------|
| Username     | Repository owner             |
| Repo Name    | Repository name              |
| Stars        | Star count                   |
| Forks        | Fork count                   |
| Language     | Primary programming language |
| Description  | Repository description       |
| README Found | Whether README was downloaded |
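Since the project installs pandas, the aggregated CSV could be produced along these lines (build_repo_csv is a hypothetical name):

```python
import pandas as pd

def build_repo_csv(rows, csv_path):
    """Aggregate per-repo rows from all users into github_repos.csv."""
    columns = ["Username", "Repo Name", "Stars", "Forks",
               "Language", "Description", "README Found"]
    df = pd.DataFrame(rows, columns=columns)
    df.to_csv(csv_path, index=False)
    return df
```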

βš™οΈ Configuration

Rate Limiting

Adjust the rate limit decorator in your code:

@rate_limit(max_calls=5, period=2)  # 5 calls every 2 seconds
def api_call():
    pass

Proxy Request Limit

Modify the request limit per proxy:

# In get_next_proxy() function
if REQUEST_COUNT[proxy_key] >= 60:  # Change 60 to your desired limit
    print(f"🔄 Proxy reached limit, rotating...")
    return get_next_proxy()
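A self-contained version of this counter-plus-rotation logic might look like the sketch below. make_proxy_picker is an illustrative name, and it assumes (as the fragment above does) that a run never needs to revisit an exhausted proxy:

```python
from collections import defaultdict
from itertools import cycle

def make_proxy_picker(proxy_urls, limit=60):
    """Return a picker that hands out proxies, rotating after `limit` uses each."""
    pool = cycle(proxy_urls)
    counts = defaultdict(int)        # requests served per proxy
    state = {"current": next(pool)}

    def get_next_proxy():
        if counts[state["current"]] >= limit:
            state["current"] = next(pool)  # rotate to the next proxy
        counts[state["current"]] += 1
        return state["current"]

    get_next_proxy.counts = counts  # exposed for the final usage report
    return get_next_proxy
```

Lowering `limit` tightens the per-proxy budget, exactly as the snippet above describes.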

📈 Rate Limits & Proxy Benefits

GitHub API Limits

| Authentication      | Requests/Hour | With Proxies            |
|---------------------|---------------|-------------------------|
| Unauthenticated     | 60/IP         | 60 × number of proxies  |
| Authenticated (PAT) | 5,000/IP      | 5,000 × number of proxies |

Example Calculation

With 10 proxies:

  • Unauthenticated: 60 × 10 = 600 requests/hour
  • Authenticated: 5,000 × 10 = 50,000 requests/hour

Recommendation: Use both authentication AND proxy rotation for maximum throughput.


🔑 Getting Started with Proxies

Webshare Setup

  1. Sign up at Webshare.io
  2. Get API Key from Dashboard → API
  3. Choose Plan:
    • Free: 10 proxies
    • Paid: 100+ proxies
  4. Add to .env: Copy API key to .env file
  5. Fetch Proxies: Run python webshare_proxy_manager.py

Alternative Proxy Services

The code supports any proxy service that provides:

  • HTTP/HTTPS proxy support
  • Username/password authentication
  • JSON configuration format

Simply format your proxies in the same structure as proxy.json.


🧪 Example Output

Console Output with Proxy Rotation

✅ Loaded 10 valid proxies

👤 torvalds (Linus Torvalds)
🏠 Location: Portland, OR
📦 Public repos: 5
👥 Followers: 180000 | Following: 0
📄 Fetching repositories and README files...

📂 Repositories:
   🔹 linux
      ⭐ Stars: 180000, 🍴 Forks: 50000, 🧠 Language: C
      📜 Linux kernel source tree
      📄 README saved → readmes/torvalds/linux_README.md

💾 Saved user details → data/2025-10-29_16-24-20/torvalds/details.json

💾 Combined CSV saved → data/2025-10-29_16-24-20/github_repos.csv
✅ All data saved inside folder: data/2025-10-29_16-24-20

📊 Proxy Usage Summary:
   • 192.186.151.77:8578: 15 requests
   • 95.214.251.125:7035: 12 requests
   • 64.137.10.30:5680: 18 requests

🤝 Contributing

Contributions are welcome! Here's how you can help:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Guidelines

  • Ensure rate limiting is respected
  • Test proxy rotation with multiple proxies
  • Add appropriate error handling
  • Update documentation
  • Include example usage

πŸ“ Use Cases

  • Large-Scale Research: Analyze thousands of GitHub users without IP limits
  • Data Science Projects: Collect extensive datasets for ML/AI projects
  • Portfolio Building: Gather comprehensive repository statistics
  • Education: Learn API integration, proxy management, and rate limiting
  • Monitoring: Track repository statistics and trends over time
  • Competitive Analysis: Research competing projects and developers

⚠️ Important Notes

  • Respect GitHub's Terms of Service when using this tool
  • Rate limiting is enforced to prevent API abuse even with proxies
  • Personal tokens should never be committed to version control
  • Proxy credentials must be kept secure and not shared publicly
  • Data is collected from public repositories only
  • Each proxy should not exceed 60 requests to avoid detection
  • Monitor proxy health and rotate out non-working proxies
  • Timestamps follow ISO 8601 format (YYYY-MM-DD_HH-MM-SS)

πŸ› Troubleshooting

Proxy Connection Errors

# Increase timeout for slow proxies
response = requests.get(url, proxies=proxy, timeout=15)  # Default: 10
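When a proxy fails outright, retrying the same request through the next proxy is often more effective than a longer timeout. A minimal sketch (get_with_retry is a hypothetical helper, not part of the project's code):

```python
import requests

def get_with_retry(url, proxies_iter, attempts=3, timeout=15):
    """Retry a request through successive proxies on connection errors."""
    last_exc = None
    for _, proxy in zip(range(attempts), proxies_iter):
        try:
            return requests.get(url, proxies=proxy, timeout=timeout)
        except requests.exceptions.RequestException as exc:
            last_exc = exc  # this proxy failed; fall through to the next one
    raise last_exc
```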

Proxy Authentication Failed

  • Verify credentials in proxy.json are correct
  • Check if proxy is still active and valid
  • Run python webshare_proxy_manager.py to refresh proxy list

Rate Limit Still Hit

  • Verify proxy rotation is working (check console output)
  • Ensure REQUEST_COUNT is tracking correctly
  • Try reducing requests per proxy from 60 to 50
  • Add longer delays between requests

Missing proxy.json File

# Fetch proxies from Webshare
python webshare_proxy_manager.py

# Or create manually following the JSON structure shown above

Slow Performance

  • Use proxies closer to your geographic location
  • Upgrade to faster proxy service plan
  • Reduce concurrent requests
  • Check network connectivity

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


🌟 Acknowledgments

  • Built with the GitHub REST API
  • Proxy services provided by Webshare.io
  • Inspired by the need for ethical large-scale data collection
  • Thanks to all contributors and the open-source community

πŸ‘¨β€πŸ’» Author

Bhavin Ondhiya


📧 Contact

For questions or suggestions, please:

  • Open an issue on GitHub
  • Reach out via GitHub profile

Made with ❤️ by Bhavin Ondhiya

Last Updated: October 29, 2025
