- Project Architecture
- Setup Development Environment
- Backend Development
- Frontend Development
- API Documentation
- Testing
- Deployment
- Contributing
The AI Web Scraper is built using a modern full-stack architecture:
- Backend: FastAPI (Python) with async/await support
- Frontend: React with TypeScript and Vite
- LLM Integration: OpenAI/OpenRouter APIs
- Output Formats: Word, PDF, Excel, Text
- Testing: pytest (backend), Vitest (frontend)
```
ai-webscraper/
├── backend/               # FastAPI backend
│   ├── app/
│   │   ├── api/           # API endpoints
│   │   ├── models.py      # Pydantic models
│   │   ├── services/      # Business logic
│   │   └── utils/         # Utilities
│   ├── tests/             # Backend tests
│   └── requirements.txt
├── frontend/              # React frontend
│   ├── src/
│   │   ├── components/    # React components
│   │   ├── services/      # API services
│   │   ├── types/         # TypeScript types
│   │   └── pages/         # Page components
│   └── tests/             # Frontend tests
└── docs/                  # Documentation
```
- Python 3.8+
- Node.js 16+
- npm or yarn
- Git
```bash
# Backend setup
cd backend
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env
# Edit .env with your API keys
```

```bash
# Frontend setup
cd frontend
npm install
# Create .env.local for environment variables
```

Backend `.env`:

```bash
OPENAI_API_KEY=your_openai_key
OPENROUTER_API_KEY=your_openrouter_key
LLM_PROVIDER=openai  # or openrouter
LOG_LEVEL=INFO
```

Frontend `.env.local`:

```bash
VITE_API_URL=http://localhost:8000
```
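A minimal sketch of how the backend might read these variables at startup, assuming plain `os.environ` access; the real app may use a settings library, and `get_llm_api_key` is a hypothetical helper:

```python
import os

def get_llm_api_key() -> str:
    """Return the API key for whichever LLM provider is configured."""
    provider = os.environ.get("LLM_PROVIDER", "openai")
    key_var = "OPENAI_API_KEY" if provider == "openai" else "OPENROUTER_API_KEY"
    key = os.environ.get(key_var)
    if not key:
        # Fail fast with a pointer to the setup step above
        raise RuntimeError(f"{key_var} is not set; check backend/.env")
    return key
```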
- Handles communication with OpenAI/OpenRouter APIs
- Provides content processing with user prompts
- Implements retry logic and error handling
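The retry behavior described above can be sketched roughly as follows; the class shape, the `complete` method on the client, and the backoff policy are all assumptions, not the real `LLMService` implementation:

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

class LLMService:
    """Illustrative sketch; the real service lives under app/services/."""

    def __init__(self, client, max_retries: int = 3):
        self.client = client          # e.g. an OpenAI/OpenRouter API client
        self.max_retries = max_retries

    async def process_content(self, content: str, prompt: str) -> str:
        for attempt in range(1, self.max_retries + 1):
            try:
                return await self.client.complete(prompt=prompt, content=content)
            except Exception as exc:
                logger.warning("LLM call failed (attempt %d/%d): %s",
                               attempt, self.max_retries, exc)
                if attempt == self.max_retries:
                    raise
                # Exponential backoff between attempts
                await asyncio.sleep(2 ** attempt)
```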
- Web page content extraction
- Content cleaning and preprocessing
- URL validation
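The validation and cleaning steps might look roughly like this; the function names are illustrative, not the actual service API:

```python
import re
from urllib.parse import urlparse

def is_valid_url(url: str) -> bool:
    """Accept only absolute http(s) URLs (sketch of the validation step)."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

def clean_text(raw: str) -> str:
    """Collapse whitespace runs left over after markup removal."""
    return re.sub(r"\s+", " ", raw).strip()
```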
- Multi-format output generation
- File management and cleanup
- Professional document styling
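A minimal sketch of multi-format dispatch, assuming a generator-per-format table; only the text path is implemented here, and all names are hypothetical (real Word/PDF/Excel generation would use libraries such as python-docx and openpyxl):

```python
from pathlib import Path

def _generate_text(content: str, path: Path) -> Path:
    path.write_text(content, encoding="utf-8")
    return path

# Dispatch table; "word", "pdf", "excel" entries omitted in this sketch
GENERATORS = {"text": _generate_text}

def generate_output(content: str, output_format: str, out_dir: Path) -> Path:
    try:
        generator = GENERATORS[output_format]
    except KeyError:
        raise ValueError(f"Unsupported output format: {output_format}")
    out_dir.mkdir(parents=True, exist_ok=True)
    return generator(content, out_dir / f"result.{output_format}")
```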
Example request body:

```json
{
  "url": "https://example.com",
  "prompt": "Extract product information",
  "output_format": "excel"
}
```

Supported `output_format` values: `text`, `word`, `pdf`, `excel`.

Returns system health and LLM service availability.

Downloads generated output files.
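For illustration, a client could build the request above using only the standard library; the `/api/scrape` path is an assumption, not a documented route:

```python
import json
from urllib import request

API_URL = "http://localhost:8000"  # backend default from the setup above

def build_scrape_request(url: str, prompt: str,
                         output_format: str = "text") -> request.Request:
    """Build (but do not send) a POST request matching the body above."""
    payload = {"url": url, "prompt": prompt, "output_format": output_format}
    return request.Request(
        f"{API_URL}/api/scrape",  # endpoint path is an assumption
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```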
- New Output Format:
  - Add the format to the `OutputFormat` enum in `models.py`
  - Implement generation logic in `output_service.py`
  - Add tests
- New LLM Provider:
  - Extend the `LLMService` class
  - Add configuration options
  - Update provider selection logic
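The first step for a new output format might look like this sketch; the enum mirrors the formats listed earlier, and the `markdown` member is a hypothetical example (the real definition lives in `models.py`):

```python
from enum import Enum

class OutputFormat(str, Enum):
    TEXT = "text"
    WORD = "word"
    PDF = "pdf"
    EXCEL = "excel"
    MARKDOWN = "markdown"  # a new format is registered here first
```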
- Use custom exceptions for specific error types
- Log errors with appropriate levels
- Return user-friendly error messages
- Implement retry logic for external APIs
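These guidelines could be sketched as a small exception hierarchy plus a mapper onto the API's error format; all names here are hypothetical, not the app's real exception classes:

```python
import logging

logger = logging.getLogger(__name__)

class ScraperError(Exception):
    """Base class for application errors with a user-facing message."""
    user_message = "Something went wrong. Please try again."

class InvalidURLError(ScraperError):
    user_message = "The URL could not be reached. Check it and try again."

def to_error_response(exc: Exception) -> dict:
    """Map an exception onto the API's consistent error format."""
    if isinstance(exc, ScraperError):
        message = exc.user_message
        logger.warning("Handled error: %s", exc)     # expected failure
    else:
        message = "Internal server error"
        logger.error("Unhandled error", exc_info=exc)  # unexpected failure
    return {"success": False, "message": message, "data": None}
```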
- Handles user input (URL, prompt, format)
- Form validation
- API request submission
- Displays scraping results
- File download functionality
- Status monitoring
- Local component state using React hooks
- API calls using axios
- Type-safe interfaces for all data
- Tailwind CSS for utility-first styling
- Responsive design principles
- Component-based CSS classes
- Create the component in `src/components/`
- Add TypeScript interfaces in `src/types/`
- Write tests in `tests/`
- Update imports in parent components
The API currently requires no authentication; LLM API keys are configured server-side.
For production use, implement rate limiting:
```python
from slowapi import Limiter
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
```

All endpoints return a consistent error format:

```json
{
  "success": false,
  "message": "Error description",
  "data": null
}
```

```bash
cd backend
pytest tests/ -v
pytest tests/ --cov=app  # With coverage
```

- `test_scrape.py`: Scraping endpoint tests
- `test_status.py`: Status endpoint tests
- Mock external dependencies (LLM APIs, web requests)
```bash
cd frontend
npm test
npm run test:coverage
```

- Component testing with React Testing Library
- API service mocking
- User interaction testing
```python
from unittest.mock import patch

import pytest

@pytest.mark.asyncio
async def test_scrape_success():
    with patch('app.services.llm_service.LLMService') as mock_llm:
        mock_llm.process_content.return_value = {"title": "Test"}
        ...  # Test implementation
```

```typescript
test('renders scrape form', () => {
  render(<ScrapeForm onSubmit={mockSubmit} />);
  expect(screen.getByText('URL')).toBeInTheDocument();
});
```

Create a Dockerfile for containerization:
```dockerfile
# Backend Dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

- Security:
  - Use HTTPS
  - Implement API authentication
  - Validate all inputs
  - Apply rate limiting
- Performance:
  - Implement caching
  - Use a database for persistent storage
  - Add load balancing
- Monitoring:
  - Health checks
  - Error tracking (e.g. Sentry)
  - Performance monitoring
```bash
# Production environment variables
export ENVIRONMENT=production
export DEBUG=false
export DATABASE_URL=postgresql://...
```

- Backend: Black formatter, flake8 linting
- Frontend: Prettier, ESLint
- Type hints for all Python functions
- TypeScript strict mode
- Create a feature branch from `main`
- Make changes with descriptive commits
- Add tests for new features
- Submit pull request
- Code review and merge
## Description
Brief description of changes
## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Documentation update
## Testing
- [ ] Tests pass locally
- [ ] New tests added
## Checklist
- [ ] Code follows style guidelines
- [ ] Self-review completed
- [ ] Documentation updated

- Write tests before implementing features
- Keep functions small and focused
- Use meaningful variable names
- Add docstrings for public functions
- Handle errors gracefully
- Log important events
- Update documentation with changes
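A small example that follows these guidelines (small and focused, type-hinted, documented, errors handled, events logged); the function itself is illustrative:

```python
import logging

logger = logging.getLogger(__name__)

def truncate(text: str, limit: int = 200) -> str:
    """Return `text` shortened to at most `limit` characters.

    Raises ValueError for a non-positive limit instead of failing silently.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    if len(text) <= limit:
        return text
    logger.debug("Truncating text from %d to %d characters", len(text), limit)
    # Reserve one character for the ellipsis marker
    return text[: limit - 1] + "…"
```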
- LLM API Errors: Check API keys and rate limits
- Import Errors: Verify Python path and virtual environment
- Port Conflicts: Change port in uvicorn command
- CORS Errors: Configure backend CORS settings
- Build Failures: Check Node.js version and dependencies
- API Connection: Verify VITE_API_URL environment variable
Enable debug logging:

```bash
# Backend
LOG_LEVEL=DEBUG
```

```python
# Add debug endpoints
@app.get("/debug/logs")
async def get_logs():
    return {"logs": "Recent log entries"}
```

To profile slow paths, wrap async functions in a timing decorator:

```python
import logging
import time
from functools import wraps

logger = logging.getLogger(__name__)

def timing_decorator(func):
    @wraps(func)
    async def wrapper(*args, **kwargs):
        start = time.time()
        result = await func(*args, **kwargs)
        end = time.time()
        logger.info(f"{func.__name__} took {end - start:.2f} seconds")
        return result
    return wrapper
```