This guide explains how to test the Stats Scraper after the refactoring to make it open source and production-ready.
Before testing, make sure you have:
- Installed all dependencies:
      pip install -r requirements.txt
- Set up your configuration:
      cp config.yaml.example config.yaml
  Edit config.yaml to configure your target repository and database settings.
- Set up your environment variables:
      cp .env.example .env
  Edit .env with your API tokens.
To test with MotherDuck (the default database):
- Make sure your MOTHERDUCK_TOKEN is set in your .env file or environment variables.
- Run the refactored repository visitors script:
      python scripts/repo_visitors.py
- Check the logs to see if the script ran successfully.
- Verify the data in your MotherDuck database:
      SELECT * FROM github_visitors ORDER BY date DESC LIMIT 10;
To test with SQLite:
- Edit your config.yaml to use SQLite:
      database:
        type: "sqlite"
        connection:
          sqlite:
            path: "superset_stats.db"
  Or set environment variables:
      export DATABASE_TYPE=sqlite
      export SQLITE_PATH=superset_stats.db
- Run the refactored repository visitors script:
      python scripts/repo_visitors.py
- Verify the data in your SQLite database:
      sqlite3 superset_stats.db "SELECT * FROM github_visitors ORDER BY date DESC LIMIT 10;"
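The same verification query can be run from Python with the standard-library sqlite3 module, which is handy in an automated check. This is a minimal sketch: the function name latest_visitor_rows is hypothetical, and the only schema detail assumed is that github_visitors has a date column, as the query above implies.

```python
import sqlite3


def latest_visitor_rows(db_path="superset_stats.db", limit=10):
    """Return the most recent github_visitors rows, newest first, as dicts."""
    conn = sqlite3.connect(db_path)
    try:
        conn.row_factory = sqlite3.Row  # lets each row be converted to a dict
        cursor = conn.execute(
            "SELECT * FROM github_visitors ORDER BY date DESC LIMIT ?",
            (limit,),
        )
        return [dict(row) for row in cursor.fetchall()]
    finally:
        conn.close()
```

An empty list means the table exists but the script wrote no rows. A sqlite3.OperationalError ("no such table") means the script never created the table in this file, which usually points at a wrong path.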
To test with PostgreSQL:
- Make sure you have a PostgreSQL server running.
- Edit your config.yaml to use PostgreSQL:
      database:
        type: "postgresql"
        connection:
          postgresql:
            host: "localhost"
            port: 5432
            database: "superset_stats"
            username: "postgres"
            password: "your_password"
  Or set environment variables:
      export DATABASE_TYPE=postgresql
      export POSTGRESQL_HOST=localhost
      export POSTGRESQL_PORT=5432
      export POSTGRESQL_DATABASE=superset_stats
      export POSTGRESQL_USERNAME=postgres
      export POSTGRESQL_PASSWORD=your_password
- Run the refactored repository visitors script:
      python scripts/repo_visitors.py
- Verify the data in your PostgreSQL database:
      psql -U postgres -d superset_stats -c "SELECT * FROM github_visitors ORDER BY date DESC LIMIT 10;"
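The POSTGRESQL_* variables carry the same fields as the config.yaml block. This guide does not show how the project merges the two sources, so the sketch below assumes that environment variables take precedence over file values. The function name and the mapping table are hypothetical, not the project's actual code.

```python
import os

# Environment variable names from this guide, mapped to config.yaml keys.
ENV_TO_KEY = {
    "POSTGRESQL_HOST": "host",
    "POSTGRESQL_PORT": "port",
    "POSTGRESQL_DATABASE": "database",
    "POSTGRESQL_USERNAME": "username",
    "POSTGRESQL_PASSWORD": "password",
}


def resolve_postgres_settings(file_settings, env=None):
    """Overlay POSTGRESQL_* variables on the settings read from config.yaml.

    Assumption: a non-empty environment variable wins over the file value.
    """
    env = os.environ if env is None else env
    settings = dict(file_settings)  # copy so the caller's dict is untouched
    for var, key in ENV_TO_KEY.items():
        if env.get(var):
            settings[key] = env[var]
    # Environment values are strings, so normalise the port to an int.
    settings["port"] = int(settings.get("port", 5432))
    return settings
```

Printing the resolved settings (with the password masked) before running the script is a quick way to confirm which source a value actually came from.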
To test with a different GitHub repository:
- Edit your config.yaml to target a different repository:
      github:
        owner: "different-org"
        repo: "different-repo"
  Or set environment variables:
      export GITHUB_OWNER=different-org
      export GITHUB_REPO=different-repo
- Run the refactored repository visitors script:
      python scripts/repo_visitors.py
- Check the logs to see if the script fetched data from the correct repository.
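To know what to look for in the logs, it helps to work out which repository and URL the script should be hitting. The sketch below assumes GITHUB_OWNER and GITHUB_REPO override config.yaml, and that visitor data comes from GitHub's REST traffic views endpoint. Both are assumptions about the script, and the function names are hypothetical.

```python
import os


def resolve_repo(config, env=None):
    """Return (owner, repo), preferring GITHUB_OWNER / GITHUB_REPO when set."""
    env = os.environ if env is None else env
    github = config.get("github", {})
    owner = env.get("GITHUB_OWNER") or github.get("owner")
    repo = env.get("GITHUB_REPO") or github.get("repo")
    if not owner or not repo:
        raise ValueError("GitHub owner and repo must both be configured")
    return owner, repo


def traffic_views_url(owner, repo):
    """Build the GitHub REST API URL for a repository's traffic views.

    Assumption: the visitors script reads this endpoint.
    """
    return f"https://api.github.com/repos/{owner}/{repo}/traffic/views"
```

If the URL in the logs names the old repository, a stale environment variable in the shell is a likely cause.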
To test the GitHub Actions workflow locally:
- Install act if you haven't already.
- Run the test workflow script:
      ./test_workflow.sh
- The script will check for tokens in your environment or .env file, create a .secrets file for act, and run the workflow.
- Check the output to see if all steps ran successfully.
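The token-collection step described above can be sketched in Python to make clear what ends up in .secrets. This is not the contents of test_workflow.sh. It assumes the two tokens named in this guide, that a value in the environment takes precedence over .env, and simple KEY=VALUE lines. All function names are hypothetical.

```python
import os
from pathlib import Path

TOKEN_NAMES = ("GITHUB_TOKEN", "MOTHERDUCK_TOKEN")  # tokens named in this guide


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and comments."""
    values = {}
    p = Path(path)
    if not p.exists():
        return values
    for line in p.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def write_secrets(env_file=".env", secrets_file=".secrets", env=None):
    """Collect tokens from the environment or .env and write them for act.

    Returns the sorted names of the tokens that were found.
    """
    env = os.environ if env is None else env
    from_file = read_env_file(env_file)
    found = {}
    for name in TOKEN_NAMES:
        value = env.get(name) or from_file.get(name)  # environment wins
        if value:
            found[name] = value
    Path(secrets_file).write_text(
        "".join(f"{name}={value}\n" for name, value in found.items())
    )
    return sorted(found)
```

A token missing from the returned list explains a workflow step that fails on authentication. Keep .secrets out of version control.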
If you encounter database connection issues:
- Check that your database credentials are correct.
- Verify that your database is running and accessible.
- Look at the logs for specific error messages.
If you encounter GitHub API rate limiting:
- Make sure your GITHUB_TOKEN has sufficient permissions.
- Consider using a token with higher rate limits.
- Add rate limiting handling in the GitHub client.
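One way to approach the last item is to read the rate-limit headers GitHub returns and wait until the window resets. The sketch below is a standalone helper, not the project's GitHub client. The function name is hypothetical and the caller decides how to sleep and retry.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how long to wait before retrying, from GitHub rate-limit headers.

    Returns 0.0 when requests remain or the headers are absent.
    """
    now = time.time() if now is None else now
    # Header names are case-insensitive, so normalise before looking them up.
    lowered = {str(k).lower(): v for k, v in headers.items()}
    remaining = lowered.get("x-ratelimit-remaining")
    reset = lowered.get("x-ratelimit-reset")  # epoch seconds of the next reset
    if remaining is None or reset is None:
        return 0.0
    if int(remaining) > 0:
        return 0.0
    return max(0.0, float(reset) - now)
```

A client could call this after each response and pass a non-zero result to time.sleep before retrying. Capping the wait keeps a scheduled run from hanging for most of an hour.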
If you encounter configuration issues:
- Check that your config.yaml file is properly formatted.
- Verify that your environment variables are set correctly.
- Look at the logs for specific error messages.
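For setups configured purely through environment variables, a small pre-flight check can catch missing values before the script runs. This sketch covers only the variables named in this guide and assumes that an unset DATABASE_TYPE means the MotherDuck default. Any other value is reported rather than guessed at. The function name is hypothetical, and values supplied through config.yaml are not considered.

```python
import os

# Variables this guide lists for each explicitly documented DATABASE_TYPE.
REQUIRED_VARS = {
    "sqlite": ["SQLITE_PATH"],
    "postgresql": [
        "POSTGRESQL_HOST",
        "POSTGRESQL_PORT",
        "POSTGRESQL_DATABASE",
        "POSTGRESQL_USERNAME",
        "POSTGRESQL_PASSWORD",
    ],
}


def check_database_env(env=None):
    """Return a list of human-readable problems; an empty list means none found."""
    env = os.environ if env is None else env
    db_type = env.get("DATABASE_TYPE")
    if not db_type:
        # Assumption: with no type set, the MotherDuck default applies.
        required = ["MOTHERDUCK_TOKEN"]
    elif db_type in REQUIRED_VARS:
        required = REQUIRED_VARS[db_type]
    else:
        return [f"DATABASE_TYPE={db_type!r} is not a value shown in this guide"]
    return [f"{name} is not set" for name in required if not env.get(name)]
```

Running this at the top of a test session turns a vague connection failure into a named missing variable.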
After testing the refactored code, you can:
- Refactor the remaining scripts to use the new architecture.
- Add unit tests for the core functionality.
- Consider adding a simple CLI interface for running all scrapers at once.