An automated, zero-touch pipeline that scrapes all 175 problems from Striver's SDE Sheet, generates multi-language solutions with complexity analysis via LLM, persists everything to PostgreSQL, and exports a richly formatted Excel workbook — entirely hands-free.
Result: 175 / 175 problems solved (100%) across 24 topic categories, each with Python, Java, and C++ solutions + Big-O analysis in a single pipeline run.
```
project/
├── scraper/
│   └── scrape_striver.py   ← requests+BS4 → Selenium → 175-problem static fallback
├── llm/
│   └── solver.py           ← delimiter-based single-call LLM solver (no JSON)
├── database/
│   └── models.py           ← psycopg2 / PostgreSQL CRUD
├── exporter/
│   └── excel_export.py     ← openpyxl formatted workbook (25 sheets)
├── main.py                 ← orchestrator: CLI flags, progress, logging, signal handling
├── config.py               ← .env loader, typed constants
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
└── .env.example
```
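A minimal sketch of how main.py's flag handling and Ctrl-C hook could be wired — the flag names match the CLI shown further down, but the function names and exact wiring here are illustrative assumptions, not the actual implementation:

```python
import argparse
import signal
import sys

def build_parser() -> argparse.ArgumentParser:
    # Mutually exclusive step flags; passing none of them runs the full pipeline
    parser = argparse.ArgumentParser(description="Striver SDE Sheet auto-solver")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--scrape", action="store_true", help="scrape problems only")
    group.add_argument("--solve", action="store_true", help="solve unsolved problems")
    group.add_argument("--export", action="store_true", help="export DB to Excel")
    group.add_argument("--stats", action="store_true", help="print DB progress summary")
    return parser

def install_sigint_handler(save_progress):
    # Ctrl-C: persist progress first, then exit cleanly
    def handler(signum, frame):
        save_progress()
        sys.exit(0)
    signal.signal(signal.SIGINT, handler)
```

Making the step flags mutually exclusive keeps a single run unambiguous about which stage it executes.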
All seven fields are extracted from a single LLM call using plain delimiter markers:
```
[APPROACH]...[/APPROACH]
[TIME]...[/TIME]
[SPACE]...[/SPACE]
[EXPLANATION]...[/EXPLANATION]
[PYTHON]...[/PYTHON]
[JAVA]...[/JAVA]
[CPP]...[/CPP]
```
This eliminates the entire class of `json_validate_failed` errors that occur when code
strings containing `\n`, `\"`, triple quotes, or backtick-fenced blocks appear inside JSON
values. Delimiter markers are whitespace-agnostic and never appear in algorithm code.
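A minimal parser for these markers might look like the following — an illustrative sketch, including the greedy fallback used when the response is truncated before a closing tag:

```python
import re

FIELDS = ["APPROACH", "TIME", "SPACE", "EXPLANATION", "PYTHON", "JAVA", "CPP"]

def parse_response(text: str) -> dict:
    """Extract all seven fields from a single LLM response.

    The primary pattern requires a closing tag; the fallback greedily
    takes everything after an opening tag, which recovers the last
    field when the response was cut off mid-stream.
    """
    out = {}
    for field in FIELDS:
        m = re.search(rf"\[{field}\](.*?)\[/{field}\]", text, re.DOTALL)
        if not m:  # unclosed tag: token-cutoff fallback
            m = re.search(rf"\[{field}\](.*)", text, re.DOTALL)
        out[field.lower()] = m.group(1).strip() if m else ""
    return out
```

Because each marker is matched independently, a truncated final block (e.g. a missing `[/CPP]`) still yields usable code instead of failing the whole response.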
```
Groq (llama-3.1-8b-instant)   ← primary (fast + free tier)
        ↓ on failure
OpenAI (gpt-4o-mini)
        ↓ on failure
Google Gemini (gemini-2.0-flash-lite)
        ↓ on failure
Anthropic (claude-3-haiku)
```
Each provider uses full-jitter exponential backoff retry (3 attempts, 5 s base delay).
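The cascade-plus-retry logic can be sketched as follows; function names and the broad exception handling are illustrative assumptions (the real solver also special-cases rate-limit responses):

```python
import random
import time

RETRY_COUNT = 3
BASE_DELAY = 5.0  # seconds

def call_with_retry(call, attempts=RETRY_COUNT, base=BASE_DELAY, sleep=time.sleep):
    # Full jitter: wait a uniform random time in [0, base * 2**attempt]
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(random.uniform(0, base * 2 ** attempt))

def solve(prompt, providers, **retry_kwargs):
    # providers: ordered list of (name, call_fn); cascade when retries are exhausted
    for name, fn in providers:
        try:
            return call_with_retry(lambda: fn(prompt), **retry_kwargs)
        except Exception:
            continue  # fall through to the next provider
    raise RuntimeError("all providers failed")
```

Full jitter avoids retry stampedes: each retry picks a random delay up to the exponential cap rather than a fixed exponential schedule.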
```bash
cp .env.example .env
```

Minimum required values:
| Variable | Description |
|---|---|
| GROQ_API_KEY | Groq API key — free tier at console.groq.com |
| DB_PASSWORD | PostgreSQL password |
| LLM_PROVIDER | groq / openai / google / anthropic (default: groq) |
| DB_HOST | Postgres host (default: localhost) |
| DB_NAME | Database name (default: striver_dsa) |
Optional fallback keys: OPENAI_API_KEY, GOOGLE_API_KEY, ANTHROPIC_API_KEY
```bash
pip install -r requirements.txt

# Full pipeline: scrape → solve → export
python main.py

# Individual steps
python main.py --scrape       # scrape problems only
python main.py --solve        # solve unsolved problems (skips already-solved)
python main.py --export       # export DB → Excel

# Utilities
python main.py --stats        # print DB progress summary
python main.py --validate     # test LLM API connectivity
python main.py --cache-info   # show llm_cache.json stats
```

```bash
cp .env.example .env          # fill in at least GROQ_API_KEY
docker-compose up --build
```

```
takeuforward.org
        │
        ▼
scraper (requests+BS4 / Selenium / static fallback)
        │  175 problems × {topic, difficulty, practice_link}
        ▼
PostgreSQL  ←── ON CONFLICT DO NOTHING (idempotent re-runs)
        │
        ▼
LLM solver  ←── Groq llama-3.1-8b-instant (primary)
        │  file-based JSON cache (llm_cache.json)
        │  full-jitter exponential backoff retry
        ▼
PostgreSQL (7 fields written per problem)
        │
        ▼
Excel export (openpyxl)
        │  1 Summary sheet + 24 per-topic sheets
        ▼
Striver_SDE_Auto_Solved.xlsx
```
Striver_SDE_Auto_Solved.xlsx contains 25 sheets:
- Summary — all 175 problems with all 12 columns, difficulty colour-coded
- 24 topic sheets — e.g. Arrays, Linked List, DP, Graphs, Tries, …
Columns:

```
ID | Topic | Problem | Difficulty | Practice Link |
Best Approach | Time Complexity | Space Complexity | Explanation |
Python Code | Java Code | C++ Code
```
Formatting: frozen headers, alternating row fills, hyperlinked practice URLs,
Courier New monospace code cells, auto-fitted column widths, difficulty colour
coding (green / yellow / red).
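The openpyxl side of this formatting can be sketched as follows; the hex colours and the helper name are assumptions, and only a few of the twelve columns are shown:

```python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill

DIFF_FILL = {  # difficulty colour coding (assumed hex values)
    "Easy": PatternFill("solid", fgColor="C6EFCE"),    # green
    "Medium": PatternFill("solid", fgColor="FFEB9C"),  # yellow
    "Hard": PatternFill("solid", fgColor="FFC7CE"),    # red
}

def write_sheet(wb: Workbook, title: str, rows: list[dict]):
    ws = wb.create_sheet(title)
    ws.append(["ID", "Topic", "Problem", "Difficulty", "Practice Link"])
    ws.freeze_panes = "A2"  # frozen header row
    for row in rows:
        ws.append([row["id"], row["topic"], row["name"],
                   row["difficulty"], row["link"]])
        r = ws.max_row
        if row["difficulty"] in DIFF_FILL:
            ws.cell(row=r, column=4).fill = DIFF_FILL[row["difficulty"]]
        cell = ws.cell(row=r, column=5)  # hyperlinked practice URL
        cell.hyperlink = row["link"]
        cell.font = Font(color="0563C1", underline="single")
    return ws
```

The same helper can serve both the Summary sheet and the per-topic sheets by varying the `rows` it receives.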
| Metric | Value |
|---|---|
| Total problems | 175 |
| Topics / categories | 24 |
| Languages per problem | 3 (Python, Java, C++) |
| Total content fields generated | 1,225 |
| Excel sheets | 25 (1 summary + 24 topic) |
| LLM providers supported | 4 |
| Token budget per call | 3,200 |
| Typical pipeline runtime | ~3–4 minutes |
| Solve rate | 100% (175 / 175) |
| Feature | Detail |
|---|---|
| Resume safety | progress.json + cache: re-runs skip already-solved problems |
| LLM retry | 3 attempts, full-jitter exponential backoff, per-provider |
| Rate-limit handling | Detects HTTP 429, waits before retry |
| Repetition loop prevention | temperature=0.1 + stop=["[/CPP]"] hard boundary |
| Token truncation handling | Greedy fallback regex catches unclosed tags on cutoff |
| Provider cascade | Auto-falls back Groq → OpenAI → Gemini → Anthropic |
| Duplicate protection | ON CONFLICT DO NOTHING — safe to run multiple times |
| Graceful interrupt | Ctrl-C saves progress before exiting |
| Structured logging | Console + log.txt, timestamped, level-filtered |
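The duplicate-protection row relies on the `UNIQUE` constraint on `problem_name`; a sketch of the insert with a psycopg2-style connection (the helper name and column mapping are illustrative):

```python
INSERT_SQL = """
    INSERT INTO problems (topic, problem_name, difficulty, practice_link)
    VALUES (%s, %s, %s, %s)
    ON CONFLICT (problem_name) DO NOTHING
"""

def insert_problems(conn, problems):
    # Re-running the scraper never duplicates rows: the UNIQUE constraint
    # on problem_name turns a conflicting insert into a no-op.
    with conn.cursor() as cur:
        cur.executemany(INSERT_SQL, [
            (p["topic"], p["name"], p["difficulty"], p["link"]) for p in problems
        ])
    conn.commit()
```

Because the statement is idempotent, the scrape step can run before every solve step without any "already exists" bookkeeping in Python.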
All settings are driven by .env / environment variables — nothing is hardcoded.
```ini
# LLM
LLM_PROVIDER=groq
LLM_FALLBACK_PROVIDER=google
GROQ_API_KEY=gsk_...
GROQ_MODEL=llama-3.1-8b-instant
GOOGLE_API_KEY=...
GOOGLE_MODEL=gemini-2.0-flash-lite
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4o-mini
ANTHROPIC_API_KEY=sk-ant-...
LLM_TEMPERATURE=0.1
LLM_RETRY_COUNT=3
LLM_RETRY_DELAY=5

# Token budgets (sum = 3200 per call)
PHASE1_MAX_TOKENS=600    # approach + explanation
PHASE2_MAX_TOKENS=1100   # Python code
PHASE3_MAX_TOKENS=1500   # Java + C++ code

# Cache
LLM_CACHE_ENABLED=true
LLM_CACHE_FILE=llm_cache.json

# Database
DB_HOST=localhost
DB_PORT=5432
DB_NAME=striver_dsa
DB_USER=postgres
DB_PASSWORD=your_password
```

```sql
CREATE TABLE problems (
    id               SERIAL PRIMARY KEY,
    topic            TEXT,
    problem_name     TEXT UNIQUE NOT NULL,
    difficulty       TEXT,
    practice_link    TEXT,
    best_approach    TEXT,
    time_complexity  TEXT,
    space_complexity TEXT,
    explanation      TEXT,
    python_code      TEXT,
    java_code        TEXT,
    cpp_code         TEXT,
    created_at       TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    solved_at        TIMESTAMP
);
```

| Layer | Technology |
|---|---|
| Language | Python 3.10+ |
| LLM inference | Groq (llama-3.1-8b-instant), OpenAI, Google Gemini, Anthropic |
| Web scraping | requests, BeautifulSoup4, lxml, Selenium, webdriver-manager |
| Database | PostgreSQL, psycopg2-binary |
| Excel export | openpyxl, pandas |
| Config | python-dotenv |
| Progress bar | tqdm |
| Containerization | Docker, Docker Compose |
MIT