An automated, zero-touch pipeline that scrapes all 175 problems from Striver's SDE Sheet, generates multi-language solutions with complexity analysis via LLM, persists everything to PostgreSQL, and exports a richly formatted Excel workbook — entirely hands-free.
Result: 175 / 175 problems solved (100%) across 24 topic categories, each with Python, Java, and C++ solutions + Big-O analysis in a single pipeline run.
```
project/
├── scraper/
│   └── scrape_striver.py   ← requests+BS4 → Selenium → 175-problem static fallback
├── llm/
│   └── solver.py           ← delimiter-based single-call LLM solver (no JSON)
├── database/
│   └── models.py           ← psycopg2 / PostgreSQL CRUD
├── exporter/
│   └── excel_export.py     ← openpyxl formatted workbook (25 sheets)
├── main.py                 ← orchestrator: CLI flags, progress, logging, signal handling
├── config.py               ← .env loader, typed constants
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
└── .env.example
```
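A minimal sketch of how main.py's flag handling and Ctrl-C hook could be wired — the flag names match the CLI shown further down, but the function names and exact wiring here are illustrative assumptions, not the actual implementation:

```python
import argparse
import signal
import sys

def build_parser() -> argparse.ArgumentParser:
    # Mutually exclusive step flags; passing none of them runs the full pipeline
    parser = argparse.ArgumentParser(description="Striver SDE Sheet auto-solver")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--scrape", action="store_true", help="scrape problems only")
    group.add_argument("--solve", action="store_true", help="solve unsolved problems")
    group.add_argument("--export", action="store_true", help="export DB to Excel")
    group.add_argument("--stats", action="store_true", help="print DB progress summary")
    return parser

def install_sigint_handler(save_progress):
    # Ctrl-C: persist progress first, then exit cleanly
    def handler(signum, frame):
        save_progress()
        sys.exit(0)
    signal.signal(signal.SIGINT, handler)
```

Making the step flags mutually exclusive keeps a single run unambiguous about which stage it executes.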
All seven fields are extracted from a single LLM call using plain delimiter markers:
```
[APPROACH]...[/APPROACH]
[TIME]...[/TIME]
[SPACE]...[/SPACE]
[EXPLANATION]...[/EXPLANATION]
[PYTHON]...[/PYTHON]
[JAVA]...[/JAVA]
[CPP]...[/CPP]
```
This eliminates the entire class of `json_validate_failed` errors that occur when code
strings containing `\n`, `\"`, triple quotes, or backtick-fenced blocks appear inside JSON
values. Delimiter markers are whitespace-agnostic and never appear in algorithm code.
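A minimal parser for these markers might look like the following — an illustrative sketch, including the greedy fallback used when the response is truncated before a closing tag:

```python
import re

FIELDS = ["APPROACH", "TIME", "SPACE", "EXPLANATION", "PYTHON", "JAVA", "CPP"]

def parse_response(text: str) -> dict:
    """Extract all seven fields from a single LLM response.

    The primary pattern requires a closing tag; the fallback greedily
    takes everything after an opening tag, which recovers the last
    field when the response was cut off mid-stream.
    """
    out = {}
    for field in FIELDS:
        m = re.search(rf"\[{field}\](.*?)\[/{field}\]", text, re.DOTALL)
        if not m:  # unclosed tag: token-cutoff fallback
            m = re.search(rf"\[{field}\](.*)", text, re.DOTALL)
        out[field.lower()] = m.group(1).strip() if m else ""
    return out
```

Because each marker is matched independently, a truncated final block (e.g. a missing `[/CPP]`) still yields usable code instead of failing the whole response.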
```
Groq (llama-3.1-8b-instant)   ← primary (fast + free tier)
        ↓ on failure
OpenAI (gpt-4o-mini)
        ↓ on failure
Google Gemini (gemini-2.0-flash-lite)
        ↓ on failure
Anthropic (claude-3-haiku)
```
Each provider uses full-jitter exponential backoff retry (3 attempts, 5 s base delay).
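The cascade-plus-retry logic can be sketched as follows; function names and the broad exception handling are illustrative assumptions (the real solver also special-cases rate-limit responses):

```python
import random
import time

RETRY_COUNT = 3
BASE_DELAY = 5.0  # seconds

def call_with_retry(call, attempts=RETRY_COUNT, base=BASE_DELAY, sleep=time.sleep):
    # Full jitter: wait a uniform random time in [0, base * 2**attempt]
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(random.uniform(0, base * 2 ** attempt))

def solve(prompt, providers, **retry_kwargs):
    # providers: ordered list of (name, call_fn); cascade when retries are exhausted
    for name, fn in providers:
        try:
            return call_with_retry(lambda: fn(prompt), **retry_kwargs)
        except Exception:
            continue  # fall through to the next provider
    raise RuntimeError("all providers failed")
```

Full jitter avoids retry stampedes: each retry picks a random delay up to the exponential cap rather than a fixed exponential schedule.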
```bash
cp .env.example .env
```

Minimum required values:
| Variable | Description |
|---|---|
| GROQ_API_KEY | Groq API key — free tier at console.groq.com |
| DB_PASSWORD | PostgreSQL password |
| LLM_PROVIDER | groq / openai / google / anthropic (default: groq) |
| DB_HOST | Postgres host (default: localhost) |
| DB_NAME | Database name (default: striver_dsa) |
Optional fallback keys: OPENAI_API_KEY, GOOGLE_API_KEY, ANTHROPIC_API_KEY
```bash
pip install -r requirements.txt

# Full pipeline: scrape → solve → export
python main.py

# Individual steps
python main.py --scrape       # scrape problems only
python main.py --solve        # solve unsolved problems (skips already-solved)
python main.py --export       # export DB → Excel

# Utilities
python main.py --stats        # print DB progress summary
python main.py --validate     # test LLM API connectivity
python main.py --cache-info   # show llm_cache.json stats
```

```bash
cp .env.example .env          # fill in at least GROQ_API_KEY
docker-compose up --build
```

```
takeuforward.org
        │
        ▼
scraper (requests+BS4 / Selenium / static fallback)
        │  175 problems × {topic, difficulty, practice_link}
        ▼
PostgreSQL  ←── ON CONFLICT DO NOTHING (idempotent re-runs)
        │
        ▼
LLM solver  ←── Groq llama-3.1-8b-instant (primary)
        │  file-based JSON cache (llm_cache.json)
        │  full-jitter exponential backoff retry
        ▼
PostgreSQL (7 fields written per problem)
        │
        ▼
Excel export (openpyxl)
        │  1 Summary sheet + 24 per-topic sheets
        ▼
Striver_SDE_Auto_Solved.xlsx
```
Striver_SDE_Auto_Solved.xlsx contains 25 sheets:
- Summary — all 175 problems with all 12 columns, difficulty colour-coded
- 24 topic sheets — e.g. Arrays, Linked List, DP, Graphs, Tries, …
Columns:

```
ID | Topic | Problem | Difficulty | Practice Link |
Best Approach | Time Complexity | Space Complexity | Explanation |
Python Code | Java Code | C++ Code
```
Formatting: frozen headers, alternating row fills, hyperlinked practice URLs,
Courier New monospace code cells, auto-fitted column widths, difficulty colour
coding (green / yellow / red).
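The openpyxl side of this formatting can be sketched as follows; the hex colours and the helper name are assumptions, and only a few of the twelve columns are shown:

```python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill

DIFF_FILL = {  # difficulty colour coding (assumed hex values)
    "Easy": PatternFill("solid", fgColor="C6EFCE"),    # green
    "Medium": PatternFill("solid", fgColor="FFEB9C"),  # yellow
    "Hard": PatternFill("solid", fgColor="FFC7CE"),    # red
}

def write_sheet(wb: Workbook, title: str, rows: list[dict]):
    ws = wb.create_sheet(title)
    ws.append(["ID", "Topic", "Problem", "Difficulty", "Practice Link"])
    ws.freeze_panes = "A2"  # frozen header row
    for row in rows:
        ws.append([row["id"], row["topic"], row["name"],
                   row["difficulty"], row["link"]])
        r = ws.max_row
        if row["difficulty"] in DIFF_FILL:
            ws.cell(row=r, column=4).fill = DIFF_FILL[row["difficulty"]]
        cell = ws.cell(row=r, column=5)  # hyperlinked practice URL
        cell.hyperlink = row["link"]
        cell.font = Font(color="0563C1", underline="single")
    return ws
```

The same helper can serve both the Summary sheet and the per-topic sheets by varying the `rows` it receives.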
| Metric | Value |
|---|---|
| Total problems | 175 |
| Topics / categories | 24 |
| Languages per problem | 3 (Python, Java, C++) |
| Total content fields generated | 1,225 |
| Excel sheets | 25 (1 summary + 24 topic) |
| LLM providers supported | 4 |
| Token budget per call | 3,200 |
| Typical pipeline runtime | ~3–4 minutes |
| Solve rate | 100% (175 / 175) |
| Feature | Detail |
|---|---|
| Resume safety | progress.json + cache: re-runs skip already-solved problems |
| LLM retry | 3 attempts, full-jitter exponential backoff, per-provider |
| Rate-limit handling | Detects HTTP 429, waits before retry |
| Repetition loop prevention | temperature=0.1 + stop=["[/CPP]"] hard boundary |
| Token truncation handling | Greedy fallback regex catches unclosed tags on cutoff |
| Provider cascade | Auto-falls back Groq → OpenAI → Gemini → Anthropic |
| Duplicate protection | ON CONFLICT DO NOTHING — safe to run multiple times |
| Graceful interrupt | Ctrl-C saves progress before exiting |
| Structured logging | Console + log.txt, timestamped, level-filtered |
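The duplicate-protection row relies on the `UNIQUE` constraint on `problem_name`; a sketch of the insert with a psycopg2-style connection (the helper name and column mapping are illustrative):

```python
INSERT_SQL = """
    INSERT INTO problems (topic, problem_name, difficulty, practice_link)
    VALUES (%s, %s, %s, %s)
    ON CONFLICT (problem_name) DO NOTHING
"""

def insert_problems(conn, problems):
    # Re-running the scraper never duplicates rows: the UNIQUE constraint
    # on problem_name turns a conflicting insert into a no-op.
    with conn.cursor() as cur:
        cur.executemany(INSERT_SQL, [
            (p["topic"], p["name"], p["difficulty"], p["link"]) for p in problems
        ])
    conn.commit()
```

Because the statement is idempotent, the scrape step can run before every solve step without any "already exists" bookkeeping in Python.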
All settings are driven by .env / environment variables — nothing is hardcoded.
```ini
# LLM
LLM_PROVIDER=groq
LLM_FALLBACK_PROVIDER=google
GROQ_API_KEY=gsk_...
GROQ_MODEL=llama-3.1-8b-instant
GOOGLE_API_KEY=...
GOOGLE_MODEL=gemini-2.0-flash-lite
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4o-mini
ANTHROPIC_API_KEY=sk-ant-...
LLM_TEMPERATURE=0.1
LLM_RETRY_COUNT=3
LLM_RETRY_DELAY=5

# Token budgets (sum = 3200 per call)
PHASE1_MAX_TOKENS=600    # approach + explanation
PHASE2_MAX_TOKENS=1100   # Python code
PHASE3_MAX_TOKENS=1500   # Java + C++ code

# Cache
LLM_CACHE_ENABLED=true
LLM_CACHE_FILE=llm_cache.json

# Database
DB_HOST=localhost
DB_PORT=5432
DB_NAME=striver_dsa
DB_USER=postgres
DB_PASSWORD=your_password
```

```sql
CREATE TABLE problems (
    id               SERIAL PRIMARY KEY,
    topic            TEXT,
    problem_name     TEXT UNIQUE NOT NULL,
    difficulty       TEXT,
    practice_link    TEXT,
    best_approach    TEXT,
    time_complexity  TEXT,
    space_complexity TEXT,
    explanation      TEXT,
    python_code      TEXT,
    java_code        TEXT,
    cpp_code         TEXT,
    created_at       TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    solved_at        TIMESTAMP
);
```

| Layer | Technology |
|---|---|
| Language | Python 3.10+ |
| LLM inference | Groq (llama-3.1-8b-instant), OpenAI, Google Gemini, Anthropic |
| Web scraping | requests, BeautifulSoup4, lxml, Selenium, webdriver-manager |
| Database | PostgreSQL, psycopg2-binary |
| Excel export | openpyxl, pandas |
| Config | python-dotenv |
| Progress bar | tqdm |
| Containerization | Docker, Docker Compose |
MIT