The competitor price scraper was reporting inconsistent product counts, fluctuating between 200 and 400 products. This was caused by two main issues:
- **Data Persistence Issues in `scraper_v2.py`**
  - Complete data overwrite with no backup during incremental saves
  - No validation before saving (data could be lost if a scrape failed mid-process)
  - No history of previous scrapes maintained
- **Product Matching Issues in `dashboard_server.py`**
  - Dell Latitude 5320 special case was too broad (grouped ALL variants together)
  - Inconsistent handling of product configurations
  - Poor null/empty value handling in signatures
To address the data persistence issues, the following changes were made in `scraper_v2.py`:

- Added timestamped backups before overwriting data
  - Format: `competitor_prices_backup_YYYYMMDD_HHMMSS.json`
  - Creates a backup only when existing data exists
  - Prevents data loss from failed scrapes
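A minimal sketch of how that backup step might look. The real logic lives inside `_save_incremental_results()`; the standalone function name and `data_file` default here are illustrative:

```python
import shutil
from datetime import datetime
from pathlib import Path

def backup_existing_data(data_file="competitor_prices.json"):
    """Copy the current data file to a timestamped backup, if one exists."""
    src = Path(data_file)
    if not src.exists():
        return None  # first run: nothing to back up yet
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    backup = src.with_name(f"competitor_prices_backup_{stamp}.json")
    shutil.copy2(src, backup)  # copy2 also preserves file timestamps
    print(f"[BACKUP] Created backup: {backup.name}")
    return backup
```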
- Added a 30% threshold check to prevent significant data loss
  - If a new scrape has >30% fewer products, the existing data is kept
  - Rejected data is saved to `{competitor}_temp.json` for manual review
  - Detailed warnings about the data loss are logged
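A hedged sketch of that validation rule. The function name and log wording are illustrative; only the 30% threshold and the `{competitor}_temp.json` destination come from the change description above:

```python
import json

LOSS_THRESHOLD = 0.30  # reject a new scrape that drops more than 30% of products

def validate_new_data(new_products, existing_count, competitor):
    """Return True if the new scrape is safe to save, False if it should be rejected."""
    new_count = len(new_products)
    if existing_count and new_count < existing_count * (1 - LOSS_THRESHOLD):
        loss_pct = (existing_count - new_count) / existing_count * 100
        print(f"[WARNING] Significant data loss detected! "
              f"{existing_count} -> {new_count} products ({loss_pct:.0f}% drop)")
        # Park the rejected data for manual review instead of overwriting good data
        with open(f"{competitor}_temp.json", "w", encoding="utf-8") as f:
            json.dump(new_products, f, indent=2)
        return False
    return True
```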
- Keeps the last 5 backups automatically
  - Deletes older backups to prevent disk space issues
  - Uses file modification time for sorting
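The `_cleanup_old_backups()` helper listed under the modified files could look roughly like this (shown here as a module-level function for brevity):

```python
import os
from glob import glob

MAX_BACKUPS = 5

def cleanup_old_backups(pattern="competitor_prices_backup_*.json"):
    """Delete all but the newest MAX_BACKUPS backup files."""
    backups = sorted(glob(pattern), key=os.path.getmtime, reverse=True)
    for stale in backups[MAX_BACKUPS:]:
        os.remove(stale)  # the oldest files fall off the end of the rotation
```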
- Tracks product count changes in saved data
  - Shows (+X) for increases, (-X) for decreases
  - Includes `previous_count` and `change` fields in the JSON
  - Provides clear visibility into data changes
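A sketch of how those change-tracking fields might be written alongside the data; the exact layout of `competitor_prices.json` may differ from this assumption:

```python
import json

def save_with_change_tracking(products, previous_count, data_file="competitor_prices.json"):
    """Write products plus simple change-tracking metadata."""
    change = len(products) - previous_count
    payload = {
        "previous_count": previous_count,
        "change": change,  # e.g. +12 or -3 relative to the last save
        "products": products,
    }
    with open(data_file, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2)
    print(f"[SAVE] Incremental save: {len(products)} products ({change:+d})")
```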
To address the product matching issues, the overly broad Dell Latitude 5320 special case was removed from `dashboard_server.py`:

- Before: all 5320 variants were grouped together regardless of specs

```python
# Old problematic code:
if brand == 'dell' and '5320' in model.lower():
    return f"dell_latitude_5320_touch_{product_type}"
```

- After: uses the standard matching logic with RAM/storage differentiation
  - Dell Latitude 5320 with 8GB/256GB is now separate from 16GB/512GB
  - Each configuration is tracked independently
  - The approach is consistent for all products
- Better null handling - filters out None/empty values before joining
- Includes key specs in the signature:
  - Brand
  - Model
  - Product Type
  - Processor (if available)
  - RAM (if available)
  - Storage (if available)
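A hedged sketch of what the revised `create_product_signature()` logic could look like. The field names (`brand`, `model`, `product_type`, `processor`, `ram`, `storage`) follow the list above, but the real function in `dashboard_server.py` may normalize values differently:

```python
def create_product_signature(product):
    """Build a matching key from key specs, skipping None/empty values."""
    parts = [
        product.get("brand"),
        product.get("model"),
        product.get("product_type"),
        product.get("processor"),  # optional spec
        product.get("ram"),        # optional spec
        product.get("storage"),    # optional spec
    ]
    # Filter out None/empty values before joining, then normalize case and spacing
    cleaned = [str(p).strip().lower() for p in parts if p not in (None, "")]
    return "_".join(cleaned).replace(" ", "_")
```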
Together, these fixes deliver:

- ✅ No more data loss from failed scrapes
- ✅ Automatic backups for recovery
- ✅ Validation warnings that flag problems early
- ✅ Change tracking for visibility
- ✅ More accurate counts - configurations properly differentiated
- ✅ Consistent matching - no special cases causing issues
- ✅ Better multi-site comparison - same products properly grouped
The incremental save flow now looks like this:

```
1. Start scraping competitor
        ↓
2. Scrape page 1
        ↓
3. Load existing data + count
        ↓
4. Validate new data (>30% loss check)
   ├─ PASS → Create backup → Save new data
   └─ FAIL → Keep old data → Save to temp file
        ↓
5. Continue to page 2...
```
Example of a rejected save:

```
Previous scrape: 100 products
Current scrape:   65 products (35% reduction)
❌ REJECTED - Too much loss!
✅ Existing data preserved
📁 New data saved to competitor_temp.json
```
To test the changes:

- Test with a single competitor:
  `python scraper_v2.py --competitor SystemLiquidation`
- Monitor the output for:
  - ✅ `[BACKUP] Created backup:` messages
  - ✅ `[SAVE] Incremental save: X products (+/-Y)` messages
  - ⚠️ `[WARNING] Significant data loss detected!` alerts
- Check the generated files:
  - `competitor_prices.json` - main data file
  - `competitor_prices_backup_*.json` - timestamped backups (max 5)
  - `{competitor}_temp.json` - rejected data for review
If something goes wrong and data needs to be recovered:

- Check for backup files: `competitor_prices_backup_*.json`
- Restore the latest backup, for example:
  `copy competitor_prices_backup_20250101_123456.json competitor_prices.json`
  (or use the helper sketched below)
- Check the temp files (`{competitor}_temp.json`) and compare them with the backups to determine the correct state
- Manually merge or restore as needed
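For convenience, a small hypothetical helper (not part of `scraper_v2.py`) that restores the most recent backup programmatically:

```python
import os
import shutil
from glob import glob

def restore_latest_backup(data_file="competitor_prices.json"):
    """Copy the newest competitor_prices_backup_*.json over the main data file."""
    backups = sorted(glob("competitor_prices_backup_*.json"), key=os.path.getmtime)
    if not backups:
        raise FileNotFoundError("No backup files found")
    shutil.copy2(backups[-1], data_file)
    print(f"Restored {backups[-1]} -> {data_file}")
```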
Files modified:

- `scraper_v2.py`
  - Modified the `_save_incremental_results()` method
  - Added a `_cleanup_old_backups()` helper method
  - Enhanced error handling and logging
- `dashboard_server.py`
  - Modified the `create_product_signature()` function
  - Removed the Dell Latitude 5320 special case
  - Improved null value handling
- New file created:
  - `competitor_prices_backup_manual.json` - manual backup taken before the changes
What to monitor going forward:

- **Product Count Stability**
  - Counts should remain relatively stable between scrapes
  - Small fluctuations (±5%) are normal (actual inventory changes)
  - Large drops (>30%) will be blocked
- **Backup File Count**
  - There should be 1-5 backup files at any time
  - Files are automatically cleaned up after the 5th backup
- **Change Tracking**
  - Check the `change` field in the JSON data (see the sketch below)
  - Monitor for unexpected large changes
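A quick, hypothetical way to spot-check the `change` field from the command line. It assumes the top-level `previous_count`/`change` layout sketched earlier; adjust the keys to match the actual structure of `competitor_prices.json`:

```python
import json

with open("competitor_prices.json", encoding="utf-8") as f:
    data = json.load(f)

prev = data.get("previous_count") or 0
change = data.get("change") or 0
print(f"previous_count={prev}, change={change:+d}")
if prev and abs(change) > 0.30 * prev:
    print("Unexpectedly large change - review the latest backups and temp files")
```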
Potential improvements for consideration:
- **Web UI for Backup Management**
  - View backup history
  - One-click restore
  - Compare backup versions
- **Email Alerts**
  - Notify on validation failures
  - Alert on significant data changes
- **Merge Logic**
  - Smart merge instead of overwrite
  - Keep products not seen in the new scrape
- **Historical Tracking**
  - Store complete scrape history
  - Track product price changes over time
These changes provide robust data protection and more accurate product matching:
- ✅ Prevents data loss through backups and validation
- ✅ Improves product counts through better matching logic
- ✅ Provides visibility with change tracking and logging
- ✅ Enables recovery with automatic backups
The system is now much more resilient to scraping errors and provides better data quality for the dashboard.