A comprehensive web scraping and data analysis system for tracking product prices across multiple e-commerce platforms. MarketPulse features automated daily scraping, price trend analysis, anomaly detection, and a responsive dashboard for visualization.
- Multi-Store Scraping: Automated price collection from Amazon, eBay, Best Buy, and Target
- Price Trend Analysis: Historical price tracking with trend visualization
- Anomaly Detection: Automatic detection of unusual price changes
- Competitor Analysis: Cross-store price comparison for similar products
- RESTful API: Full-featured API for product search and analytics
- Interactive Dashboard: React-based frontend with price charts and trends
- Automated Scheduling: Daily scraping with Celery task queue
- Data Warehouse: Structured database schema optimized for time-series data
┌─────────────────┐    ┌──────────────┐    ┌─────────────────┐
│  Web Scrapers   │────│ ETL Pipeline │────│  Data Warehouse │
│ (Selenium/BS4)  │    │ (Validation) │    │   (SQLite/PG)   │
└─────────────────┘    └──────────────┘    └─────────────────┘
                                                    │
┌─────────────────┐    ┌──────────────┐             │
│ React Dashboard │────│   FastAPI    │─────────────┘
│   (Charts/UI)   │    │  (REST API)  │
└─────────────────┘    └──────────────┘
┌─────────────────┐    ┌──────────────┐
│ Celery Scheduler│────│  Analytics   │
│  (Daily Tasks)  │    │ (Anomalies)  │
└─────────────────┘    └──────────────┘
# Clone and setup
git clone <repository>
cd price-compare
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements-minimal.txt
# Initialize database and create sample data
python database/setup.py
python demo.py
# Terminal 1: Start API server
source venv/bin/activate
python api/main.py
# Terminal 2: Start frontend
cd frontend
npm install
npm start
# Terminal 3: Start Celery worker (optional, for automated scraping)
source venv/bin/activate
celery -A scrapers.tasks worker --loglevel=info
Once the services are running, access:
- Frontend Dashboard: http://localhost:3000
- API Documentation: http://localhost:8000/docs
- API Health Check: http://localhost:8000/health
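Once the API is up, a quick sanity check from Python (a minimal sketch assuming the requests library is installed; response shapes depend on the Pydantic schemas in api/schemas.py):

```python
import requests

BASE_URL = "http://localhost:8000"

# Confirm the API server is reachable.
health = requests.get(f"{BASE_URL}/health", timeout=5)
print(health.status_code, health.json())

# List products with their latest prices (see the API reference below).
products = requests.get(f"{BASE_URL}/products/", timeout=5)
print(products.json())
```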
price-compare/
├── api/                    # FastAPI application
│   ├── main.py             # API routes and configuration
│   ├── crud.py             # Database operations
│   └── schemas.py          # Pydantic models
├── analytics/              # Price analysis and anomaly detection
│   └── detector.py         # Analytics engine
├── database/               # Database models and utilities
│   ├── models.py           # SQLAlchemy models
│   ├── connection.py       # Database connection management
│   ├── setup.py            # Database initialization
│   └── seed.py             # Initial data seeding
├── scrapers/               # Web scraping framework
│   ├── base.py             # Abstract scraper base class
│   ├── amazon.py           # Amazon-specific scraper
│   ├── ebay.py             # eBay-specific scraper
│   ├── pipeline.py         # ETL data processing
│   └── tasks.py            # Celery background tasks
├── frontend/               # React dashboard
│   └── src/
│       ├── components/     # React components
│       ├── services/       # API client
│       └── types/          # TypeScript definitions
├── config/                 # Configuration management
│   └── settings.py         # Application settings
├── tests/                  # Test suite
├── docker-compose.yml      # Docker orchestration
└── requirements.txt        # Python dependencies
- GET /products/ - List products with latest prices
- GET /products/{id} - Get specific product details
- GET /products/{id}/price-history - Get price history
- GET /products/{id}/trends - Get price trend analysis
- GET /stores/ - List all configured stores
- GET /stores/{id} - Get store details
- POST /search - Search products across stores
- GET /analytics/price-changes - Recent price changes
- POST /scrape/product - Trigger manual product scraping
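A rough illustration of calling these endpoints from Python (a sketch using requests; the product ID and the POST /search body are placeholders, so check the interactive docs at /docs for the actual request and response schemas):

```python
import requests

BASE_URL = "http://localhost:8000"
product_id = 1  # placeholder; pick a real ID from GET /products/

# Price history and trend analysis for a single product
history = requests.get(f"{BASE_URL}/products/{product_id}/price-history").json()
trends = requests.get(f"{BASE_URL}/products/{product_id}/trends").json()

# Cross-store search; the request body shown here is an assumed shape
results = requests.post(f"{BASE_URL}/search", json={"query": "wireless headphones"}).json()

# Recent price changes flagged by the analytics engine
changes = requests.get(f"{BASE_URL}/analytics/price-changes").json()

print(history, trends, results, changes, sep="\n")
```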
Copy .env.example to .env and configure:
# Database
DATABASE_URL=sqlite:///./price_compare.db
# API
API_HOST=0.0.0.0
API_PORT=8000
# Scraping
SCRAPING_DELAY=1
MAX_CONCURRENT_REQUESTS=5
# Redis (for Celery)
REDIS_URL=redis://localhost:6379/0
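These variables are consumed by config/settings.py. One common way to load them, sketched here with pydantic-settings (an illustration only, not necessarily how the project's settings module is written):

```python
# Illustration only: one way the variables above could be loaded.
# The actual implementation lives in config/settings.py and may differ.
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    database_url: str = "sqlite:///./price_compare.db"
    api_host: str = "0.0.0.0"
    api_port: int = 8000
    scraping_delay: float = 1.0
    max_concurrent_requests: int = 5
    redis_url: str = "redis://localhost:6379/0"


settings = Settings()  # values from .env or the environment override the defaults
```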
Run the test suite:
source venv/bin/activate
python -m pytest tests/ -v
To add support for a new store, follow these steps (a minimal sketch follows the list):
- Create a new scraper class inheriting from BaseScraper
- Implement the extract_product_info() and search_products() methods
- Add the scraper to the registry in scrapers/tasks.py
- Update the database with a new store entry
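Here is what such a scraper might look like (the real BaseScraper interface is defined in scrapers/base.py and may expect different method signatures; the store, URL, CSS selectors, and helper methods below are placeholders):

```python
# scrapers/target.py -- hypothetical example of adding a Target scraper.
# Treat this as a structural sketch, not the project's actual interface.
from scrapers.base import BaseScraper


class TargetScraper(BaseScraper):
    store_name = "Target"  # assumed attribute used by the scraper registry

    def extract_product_info(self, product_url: str) -> dict:
        """Fetch a single product page and return its name, price, and URL."""
        soup = self.get_soup(product_url)  # assumed BaseScraper helper
        return {
            "name": soup.select_one("h1").get_text(strip=True),
            "price": float(soup.select_one(".price").get_text(strip=True).lstrip("$")),
            "url": product_url,
        }

    def search_products(self, query: str) -> list[dict]:
        """Search the store and return basic info for each result."""
        soup = self.get_soup(f"https://www.target.com/s?searchTerm={query}")
        return [
            {"name": link.get_text(strip=True), "url": link["href"]}
            for link in soup.select("a.product-title")  # placeholder selector
        ]
```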
For production use, implement proper migrations:
# Install Alembic
pip install alembic
# Initialize migrations
alembic init alembic
# Create migration
alembic revision --autogenerate -m "description"
# Apply migrations
alembic upgrade head
# Start all services
docker-compose up -d
# View logs
docker-compose logs -f api
# Scale workers
docker-compose up -d --scale celery=3
For production deployments:
- Database: Use PostgreSQL instead of SQLite
- Redis: Required for Celery task queue
- Web Server: Use Nginx as reverse proxy
- Process Manager: Use systemd or supervisor for process management
- Monitoring: Implement logging aggregation and metrics collection
- Rate Limiting: Built-in request throttling to avoid IP blocking (a simple sketch follows this list)
- Data Validation: Comprehensive input validation on all endpoints
- Environment Variables: Sensitive configuration via environment variables
- CORS: Configurable cross-origin resource sharing
- Error Handling: Graceful error handling without data exposure
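For instance, the request throttling mentioned above can be as simple as enforcing a minimum delay between outgoing requests, driven by the SCRAPING_DELAY setting (an illustrative sketch, not the project's exact implementation):

```python
import time

import requests


class ThrottledSession:
    """Illustrative throttle: wait at least `delay` seconds between requests."""

    def __init__(self, delay: float = 1.0, user_agent: str = "MarketPulse price tracker"):
        self.delay = delay
        self._last_request = 0.0
        self.session = requests.Session()
        # Identify the scraper honestly, per the considerations below.
        self.session.headers["User-Agent"] = user_agent

    def get(self, url: str, **kwargs) -> requests.Response:
        wait = self.delay - (time.monotonic() - self._last_request)
        if wait > 0:
            time.sleep(wait)
        self._last_request = time.monotonic()
        return self.session.get(url, **kwargs)
```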
When scraping, keep the following legal and ethical considerations in mind:
- Terms of Service: Ensure compliance with each site's ToS
- Rate Limiting: Respectful scraping with appropriate delays
- User Agents: Proper identification in requests
- Data Usage: Only collect publicly available pricing data
- Caching: Implement reasonable caching to minimize requests
To contribute:
- Fork the repository
- Create a feature branch
- Implement changes with tests
- Ensure all tests pass
- Submit a pull request
This project is for educational purposes. Ensure compliance with all applicable terms of service and legal requirements when scraping websites.