lakshyakhandelwal2901/Automated-Strive-DSA-Revision-Sheet-Generator

Striver SDE Sheet — AI DSA Knowledge Engine

An automated pipeline that scrapes all 175 problems from Striver's SDE Sheet, generates multi-language solutions with complexity analysis via an LLM, persists everything to PostgreSQL, and exports a formatted Excel workbook, with no manual steps in between.

Result: 175 / 175 problems solved (100%) across 24 topic categories, each with Python, Java, and C++ solutions plus Big-O analysis, in a single pipeline run.


Architecture

project/
├── scraper/
│   └── scrape_striver.py   ← requests+BS4 → Selenium → 175-problem static fallback
├── llm/
│   └── solver.py           ← delimiter-based single-call LLM solver (no JSON)
├── database/
│   └── models.py           ← psycopg2 / PostgreSQL CRUD
├── exporter/
│   └── excel_export.py     ← openpyxl formatted workbook (25 sheets)
├── main.py                 ← orchestrator: CLI flags, progress, logging, signal handling
├── config.py               ← .env loader, typed constants
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
└── .env.example

LLM Solver — Delimiter Architecture

All seven fields are extracted from a single LLM call using plain delimiter markers:

[APPROACH]...[/APPROACH]
[TIME]...[/TIME]
[SPACE]...[/SPACE]
[EXPLANATION]...[/EXPLANATION]
[PYTHON]...[/PYTHON]
[JAVA]...[/JAVA]
[CPP]...[/CPP]

This avoids the json_validate_failed errors that occur when code strings containing \n, \", triple quotes, or backtick-fenced blocks are embedded in JSON values. The delimiter markers tolerate surrounding whitespace and are very unlikely to collide with anything in algorithm code.
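
The extraction step described above can be sketched as follows. This is a minimal illustration, not the code in llm/solver.py: the function name `parse_response` and the exact regular expressions are assumptions. The fallback pattern shows one way to recover a field whose closing marker is missing, for example when the response is cut off or a stop sequence removes the final marker.

```python
import re

# The seven markers listed above.
FIELDS = ("APPROACH", "TIME", "SPACE", "EXPLANATION", "PYTHON", "JAVA", "CPP")
_ANY_OPEN = "|".join(FIELDS)


def parse_response(text: str) -> dict[str, str]:
    """Return one entry per field; a missing field maps to an empty string."""
    fields = {}
    for tag in FIELDS:
        # Strict pass: both the opening and the closing marker are present.
        match = re.search(rf"\[{tag}\](.*?)\[/{tag}\]", text, re.DOTALL)
        if match is None:
            # Fallback for truncated output: read up to the next opening
            # marker, or to the end of the text if none follows.
            match = re.search(
                rf"\[{tag}\](.*?)(?=\[(?:{_ANY_OPEN})\]|\Z)", text, re.DOTALL
            )
        fields[tag.lower()] = match.group(1).strip() if match else ""
    return fields
```

Because the code is never embedded in a JSON string, quotes, newlines, and backticks inside the generated solutions need no escaping.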

Provider Cascade

Groq (llama-3.1-8b-instant)   ← primary (fast + free tier)
  ↓ on failure
OpenAI (gpt-4o-mini)
  ↓ on failure
Google Gemini (gemini-2.0-flash-lite)
  ↓ on failure
Anthropic (claude-3-haiku)

Each provider uses full-jitter exponential backoff retry (3 attempts, 5 s base delay).
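
A minimal sketch of the retry and cascade behaviour, assuming each provider is wrapped in a plain callable that takes a prompt and raises on failure. The names `backoff_delay`, `call_with_retry`, and `solve_with_cascade` are hypothetical, and the 60 s cap is an assumption; only the 3 attempts and the 5 s base delay come from the description above.

```python
import random
import time
from typing import Callable, Sequence


def backoff_delay(attempt: int, base: float = 5.0, cap: float = 60.0) -> float:
    """Full jitter: a uniform random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(
    fn: Callable[[str], str],
    prompt: str,
    attempts: int = 3,
    base: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fn(prompt), sleeping a jittered delay between failed attempts."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn(prompt)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the cascade see the error
            sleep(backoff_delay(attempt, base))
    raise AssertionError("unreachable")


def solve_with_cascade(
    providers: Sequence[tuple[str, Callable[[str], str]]],
    prompt: str,
    **retry_kwargs,
) -> tuple[str, str]:
    """Try each provider in order; return (provider_name, response)."""
    errors = []
    for name, fn in providers:
        try:
            return name, call_with_retry(fn, prompt, **retry_kwargs)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Full jitter spreads retries over the whole interval rather than clustering them at the exponential boundary, which helps when many requests hit the same rate limit at once.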


Quick Start

1 — Copy and fill environment variables

cp .env.example .env

Minimum required values:

| Variable | Description |
| --- | --- |
| GROQ_API_KEY | Groq API key (free tier at console.groq.com) |
| DB_PASSWORD | PostgreSQL password |
| LLM_PROVIDER | groq / openai / google / anthropic (default: groq) |
| DB_HOST | Postgres host (default: localhost) |
| DB_NAME | Database name (default: striver_dsa) |

Optional fallback keys: OPENAI_API_KEY, GOOGLE_API_KEY, ANTHROPIC_API_KEY

2 — Install dependencies

pip install -r requirements.txt

3 — Run

# Full pipeline: scrape → solve → export
python main.py

# Individual steps
python main.py --scrape      # scrape problems only
python main.py --solve       # solve unsolved problems (skips already-solved)
python main.py --export      # export DB → Excel

# Utilities
python main.py --stats       # print DB progress summary
python main.py --validate    # test LLM API connectivity
python main.py --cache-info  # show llm_cache.json stats
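
For orientation, the flags above could be wired up with argparse roughly as follows. This is a sketch under assumptions, not the contents of main.py: `build_parser` and `selected_steps` are hypothetical names, and the rule that utility flags suppress the pipeline steps is a guess at the intended behaviour.

```python
import argparse

PIPELINE_STEPS = ("scrape", "solve", "export")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="main.py")
    flags = [
        ("--scrape", "scrape problems only"),
        ("--solve", "solve unsolved problems"),
        ("--export", "export DB to Excel"),
        ("--stats", "print DB progress summary"),
        ("--validate", "test LLM API connectivity"),
        ("--cache-info", "show llm_cache.json stats"),
    ]
    for flag, help_text in flags:
        # Each flag is a simple on/off switch.
        parser.add_argument(flag, action="store_true", help=help_text)
    return parser


def selected_steps(args: argparse.Namespace) -> list[str]:
    """Map parsed flags to the pipeline steps that should run."""
    if args.stats or args.validate or args.cache_info:
        return []  # assumption: utility flags do not trigger pipeline steps
    chosen = [step for step in PIPELINE_STEPS if getattr(args, step)]
    # No step flag at all means the full pipeline.
    return chosen or list(PIPELINE_STEPS)
```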

4 — Docker (no local Postgres needed)

cp .env.example .env    # fill in at least GROQ_API_KEY
docker-compose up --build

Pipeline Flow

takeuforward.org
      │
      ▼
  scraper (requests+BS4 / Selenium / static fallback)
      │  175 problems × {topic, difficulty, practice_link}
      ▼
  PostgreSQL  ←── ON CONFLICT DO NOTHING (idempotent re-runs)
      │
      ▼
  LLM solver  ←── Groq llama-3.1-8b-instant  (primary)
      │             file-based JSON cache (llm_cache.json)
      │             full-jitter exponential backoff retry
      ▼
  PostgreSQL  (7 fields written per problem)
      │
      ▼
  Excel export (openpyxl)
      │  1 Summary sheet + 24 per-topic sheets
      ▼
  Striver_SDE_Auto_Solved.xlsx
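
The file-based cache in the flow above might look like the sketch below. The class name `JsonCache`, the key scheme (a hash of model and problem name), and the write-then-rename step are all assumptions made for illustration; the README only states that responses are cached in llm_cache.json.

```python
import hashlib
import json
from pathlib import Path


class JsonCache:
    """Tiny file-backed cache keyed by (problem name, model)."""

    def __init__(self, path: str = "llm_cache.json") -> None:
        self.path = Path(path)
        self.data = {}
        if self.path.exists():
            self.data = json.loads(self.path.read_text(encoding="utf-8"))

    @staticmethod
    def key(problem_name: str, model: str) -> str:
        # Hypothetical key scheme: hash of model and problem name.
        return hashlib.sha256(f"{model}::{problem_name}".encode("utf-8")).hexdigest()

    def get(self, problem_name: str, model: str):
        """Return the cached solution, or None on a cache miss."""
        return self.data.get(self.key(problem_name, model))

    def put(self, problem_name: str, model: str, solution: dict) -> None:
        self.data[self.key(problem_name, model)] = solution
        # Write to a temporary file first, then rename it over the target,
        # so an interrupted write does not leave a half-written cache.
        tmp = self.path.with_name(self.path.name + ".tmp")
        tmp.write_text(json.dumps(self.data, indent=2), encoding="utf-8")
        tmp.replace(self.path)
```

A solver loop would call `get` before contacting any provider and `put` after a successful parse, so re-runs cost no API calls for problems already answered.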

Excel Output

Striver_SDE_Auto_Solved.xlsx contains 25 sheets:

  • Summary — all 175 problems with all 12 columns, difficulty colour-coded
  • 24 topic sheets — e.g. Arrays, Linked List, DP, Graphs, Tries, …

Columns:

ID | Topic | Problem | Difficulty | Practice Link |
Best Approach | Time Complexity | Space Complexity | Explanation |
Python Code | Java Code | C++ Code

Formatting: frozen headers, alternating row fills, hyperlinked practice URLs, Courier New monospace code cells, auto-fitted column widths, difficulty colour coding (green / yellow / red).
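
openpyxl has no built-in auto-fit, so column widths have to be computed from the cell contents. The helpers below sketch one way to derive the width and the difficulty fill colour before handing them to openpyxl (for example via `column_dimensions[...].width` and `PatternFill`). The function names, the hex colours, and the width limits are assumptions, not values taken from excel_export.py.

```python
from typing import Iterable, Optional

# Assumed green / yellow / red fill colours (RGB hex without '#').
DIFFICULTY_FILLS = {"easy": "C6EFCE", "medium": "FFEB9C", "hard": "FFC7CE"}


def difficulty_fill(difficulty: str) -> Optional[str]:
    """Return the fill colour for a difficulty label, or None if unknown."""
    return DIFFICULTY_FILLS.get(difficulty.strip().lower())


def fitted_width(values: Iterable[object], min_width: int = 10, max_width: int = 60) -> int:
    """Width in characters: longest single line plus padding, clamped."""
    longest = max(
        (len(line) for value in values for line in (str(value).splitlines() or [""])),
        default=0,
    )
    return max(min_width, min(max_width, longest + 2))
```

Measuring the longest line rather than the whole cell keeps multi-line code cells from producing very wide columns.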


Key Numbers

| Metric | Value |
| --- | --- |
| Total problems | 175 |
| Topics / categories | 24 |
| Languages per problem | 3 (Python, Java, C++) |
| Total content fields generated | 1,225 |
| Excel sheets | 25 (1 summary + 24 topic) |
| LLM providers supported | 4 |
| Token budget per call | 3,200 |
| Typical pipeline runtime | ~3–4 minutes |
| Solve rate | 100% (175 / 175) |

Fault Tolerance

| Feature | Detail |
| --- | --- |
| Resume safety | progress.json + cache: re-runs skip already-solved problems |
| LLM retry | 3 attempts, full-jitter exponential backoff, per-provider |
| Rate-limit handling | Detects HTTP 429, waits before retry |
| Repetition loop prevention | temperature=0.1 + stop=["[/CPP]"] hard boundary |
| Token truncation handling | Greedy fallback regex catches unclosed tags on cutoff |
| Provider cascade | Auto-falls back Groq → OpenAI → Gemini → Anthropic |
| Duplicate protection | ON CONFLICT DO NOTHING: safe to run multiple times |
| Graceful interrupt | Ctrl-C saves progress before exiting |
| Structured logging | Console + log.txt, timestamped, level-filtered |
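
The resume and graceful-interrupt rows can be illustrated with a short sketch. It assumes progress.json holds a JSON list of solved problem names, which the README does not specify, and the class name `GracefulRun` is hypothetical. The SIGINT handler only sets a flag; the loop finishes the current problem, then saves and exits.

```python
import json
import signal
from pathlib import Path
from typing import Callable, Iterable


class GracefulRun:
    def __init__(self, progress_file: str = "progress.json") -> None:
        self.path = Path(progress_file)
        self.solved = set()
        if self.path.exists():
            # Assumed format: a JSON list of solved problem names.
            self.solved = set(json.loads(self.path.read_text(encoding="utf-8")))
        self.stop_requested = False

    def install(self) -> None:
        """Route Ctrl-C to request_stop (call from the main thread)."""
        signal.signal(signal.SIGINT, self.request_stop)

    def request_stop(self, signum=None, frame=None) -> None:
        self.stop_requested = True

    def save(self) -> None:
        self.path.write_text(json.dumps(sorted(self.solved)), encoding="utf-8")

    def run(self, problems: Iterable[str], solve: Callable[[str], None]) -> None:
        try:
            for name in problems:
                if self.stop_requested:
                    break
                if name in self.solved:
                    continue  # already solved in an earlier run
                solve(name)
                self.solved.add(name)
        finally:
            # Progress is written on normal exit, interrupt, or error.
            self.save()
```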

Configuration Reference

All settings are driven by .env / environment variables — nothing is hardcoded.

# LLM
LLM_PROVIDER=groq
LLM_FALLBACK_PROVIDER=google
GROQ_API_KEY=gsk_...
GROQ_MODEL=llama-3.1-8b-instant
GOOGLE_API_KEY=...
GOOGLE_MODEL=gemini-2.0-flash-lite
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4o-mini
ANTHROPIC_API_KEY=sk-ant-...
LLM_TEMPERATURE=0.1
LLM_RETRY_COUNT=3
LLM_RETRY_DELAY=5

# Token budgets (sum = 3200 per call)
PHASE1_MAX_TOKENS=600    # approach + explanation
PHASE2_MAX_TOKENS=1100   # Python code
PHASE3_MAX_TOKENS=1500   # Java + C++ code

# Cache
LLM_CACHE_ENABLED=true
LLM_CACHE_FILE=llm_cache.json

# Database
DB_HOST=localhost
DB_PORT=5432
DB_NAME=striver_dsa
DB_USER=postgres
DB_PASSWORD=your_password
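
A typed loader for a subset of these settings could look like the sketch below. It reads from any mapping and defaults to `os.environ`. In the real project python-dotenv would presumably populate the environment from .env first, and that step is omitted here to keep the example dependency-free. The names `Settings` and `load_settings` are hypothetical; the defaults mirror the values listed above.

```python
import os
from dataclasses import dataclass
from typing import Mapping

_TRUTHY = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    llm_provider: str
    llm_temperature: float
    llm_retry_count: int
    llm_retry_delay: float
    phase_max_tokens: tuple[int, int, int]
    cache_enabled: bool
    db_host: str
    db_port: int

    @property
    def token_budget(self) -> int:
        """Total tokens per call: the sum of the three phase budgets."""
        return sum(self.phase_max_tokens)


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Convert string environment values into typed settings."""
    return Settings(
        llm_provider=env.get("LLM_PROVIDER", "groq").strip().lower(),
        llm_temperature=float(env.get("LLM_TEMPERATURE", "0.1")),
        llm_retry_count=int(env.get("LLM_RETRY_COUNT", "3")),
        llm_retry_delay=float(env.get("LLM_RETRY_DELAY", "5")),
        phase_max_tokens=(
            int(env.get("PHASE1_MAX_TOKENS", "600")),
            int(env.get("PHASE2_MAX_TOKENS", "1100")),
            int(env.get("PHASE3_MAX_TOKENS", "1500")),
        ),
        cache_enabled=env.get("LLM_CACHE_ENABLED", "true").strip().lower() in _TRUTHY,
        db_host=env.get("DB_HOST", "localhost"),
        db_port=int(env.get("DB_PORT", "5432")),
    )
```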

Database Schema

CREATE TABLE problems (
    id               SERIAL PRIMARY KEY,
    topic            TEXT,
    problem_name     TEXT UNIQUE NOT NULL,
    difficulty       TEXT,
    practice_link    TEXT,
    best_approach    TEXT,
    time_complexity  TEXT,
    space_complexity TEXT,
    explanation      TEXT,
    python_code      TEXT,
    java_code        TEXT,
    cpp_code         TEXT,
    created_at       TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    solved_at        TIMESTAMP
);
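
Against this schema, the idempotent insert and the later solution update might be written as below. This is a sketch, not the code in database/models.py: the function names are hypothetical, and `cur` stands for any DB-API cursor such as one obtained from a psycopg2 connection. The UNIQUE constraint on problem_name is what lets ON CONFLICT skip rows that already exist.

```python
INSERT_SQL = """
INSERT INTO problems (topic, problem_name, difficulty, practice_link)
VALUES (%s, %s, %s, %s)
ON CONFLICT (problem_name) DO NOTHING
"""

UPDATE_SQL = """
UPDATE problems
SET best_approach = %s, time_complexity = %s, space_complexity = %s,
    explanation = %s, python_code = %s, java_code = %s, cpp_code = %s,
    solved_at = CURRENT_TIMESTAMP
WHERE problem_name = %s
"""

# The seven LLM-generated columns, in the order used by UPDATE_SQL.
SOLUTION_FIELDS = (
    "best_approach", "time_complexity", "space_complexity",
    "explanation", "python_code", "java_code", "cpp_code",
)


def insert_problem(cur, problem: dict) -> None:
    """Insert a scraped problem; an existing problem_name is left untouched."""
    cur.execute(
        INSERT_SQL,
        (problem["topic"], problem["problem_name"],
         problem["difficulty"], problem["practice_link"]),
    )


def save_solution(cur, problem_name: str, solution: dict) -> None:
    """Write the seven generated fields and stamp solved_at."""
    params = tuple(solution[field] for field in SOLUTION_FIELDS) + (problem_name,)
    cur.execute(UPDATE_SQL, params)
```

Passing values as parameters rather than formatting them into the SQL string matters here, since the generated code fields routinely contain quotes.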

Tech Stack

| Layer | Technology |
| --- | --- |
| Language | Python 3.10+ |
| LLM inference | Groq (llama-3.1-8b-instant), OpenAI, Google Gemini, Anthropic |
| Web scraping | requests, BeautifulSoup4, lxml, Selenium, webdriver-manager |
| Database | PostgreSQL, psycopg2-binary |
| Excel export | openpyxl, pandas |
| Config | python-dotenv |
| Progress bar | tqdm |
| Containerization | Docker, Docker Compose |

License

MIT
