---
title: Book Finder
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.40.0
app_file: app.py
pinned: false
---
# Phase 1: Data Engineering Project
Modern semantic search systems rely on clean, structured, and enriched data.
Our university library records (OPAC) are often:
- Sparse (basic title/author only)
- Missing rich metadata (descriptions, categories, thumbnails)
- Inconsistent for machine learning tasks
This project addresses those challenges by building a production-style data pipeline that transforms raw accession registers into a queryable, enriched database, and exposes the data through a FastAPI service for downstream applications such as:
- Semantic search (Phase 2)
- Recommendation systems
- Library analytics dashboards
The project follows real-world data engineering principles:
- Clear separation of pipeline stages
- Deterministic and resume-safe processing
- CLI-driven configuration
- Database-backed persistence
- API-based data access
This system provides:
- Robust Ingestion: Asynchronous fetching from Google Books API
- Smart Synchronization: Altcha-aware scraping of Koha OPAC
- Data Transformation: Deduplication, normalization, and cleaning
- Persistent Storage: SQLAlchemy + SQLite architecture
- Pipeline Orchestration: Unified controller for all stages
- FastAPI Service: REST interface for browsing and triggering syncs
- Fully Self-Documenting CLI: `--help` support for every script
## Project Structure

Each folder has one clear responsibility, mirroring how production pipelines are organized.
```
Book-Finder/
│
├── api/
│   └── serving.py              # FastAPI service & DB models
├── data/
│   ├── raw/                    # CSVs and BibTeX files
│   └── processed/              # Intermediate JSONL files
├── ingestion/
│   └── ingestion.py            # Async Google Books fetcher
├── logs/
│   └── project_log.md          # Technical development log
├── analysis/
│   └── metrics_analysis.py     # Data quality reporting
├── storage/
│   └── storage.py              # Database loading logic
├── Transformation/
│   └── transformation.py       # Cleaning & Deduplication
├── main.py                     # Pipeline Orchestrator
├── sync_pipeline.py            # OPAC Synchronization
└── README.md
```
This structure ensures:
- Clear data lineage
- Easy debugging
- Independent execution of each stage
## Pipeline Architecture

The pipeline is linear, deterministic, and restartable.
```
┌──────────────────────┐     ┌──────────────────────┐
│  Raw CSV Registers   │     │     OPAC (Koha)      │
│   (Existing Data)    │     │    (New Arrivals)    │
└──────────┬───────────┘     └──────────┬───────────┘
           │                            │
           ▼                            ▼
┌───────────────────────────────────────────────────┐
│                   SYNC & MERGE                    │
│  - Crawl New Arrivals (Altcha-aware)              │
│  - Merge with Accession Register                  │
│  - Detect Incremental Changes                     │
└──────────────────────────┬────────────────────────┘
                           │
                           ▼
┌───────────────────────────────────────────────────┐
│             INGESTION (Google Books)              │
│  - Async I/O (aiohttp)                            │
│  - Rate Limiting & Backoff                        │
│  - Fetch Metadata (ISBN, Desc, Thumbnails)        │
└──────────────────────────┬────────────────────────┘
                           │
                           ▼
┌───────────────────────────────────────────────────┐
│                  TRANSFORMATION                   │
│  - Normalize Titles/Authors                       │
│  - Deduplicate (ISBN > Google ID > Title match)   │
│  - Merge Metadata Conflicts                       │
└──────────────────────────┬────────────────────────┘
                           │
                           ▼
┌───────────────────────────────────────────────────┐
│                     STORAGE                       │
│  - JSONL → SQLite                                 │
│  - SQLAlchemy ORM                                 │
│  - Integrity Checks                               │
└──────────────────────────┬────────────────────────┘
                           │
                           ▼
┌───────────────────────────────────────────────────┐
│                  FASTAPI SERVICE                  │
│  - Browse & Paginate                              │
│  - Search (Title/Author/ISBN)                     │
│  - Trigger Pipeline Sync                          │
└───────────────────────────────────────────────────┘
```
## Running the Full Pipeline

To execute all ETL stages in order, run:

```bash
python main.py
```

You can also skip specific stages or set limits:

```bash
python main.py --skip-sync --ingest-limit 100
```

The pipeline writes its final output to `book_finder.db`. This database becomes the single source of truth for the API.
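The orchestrator's flag handling can be sketched with `argparse`. The two flags shown above come from this README; everything else (help strings, defaults) is illustrative, not the actual `main.py`:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Flags mirror the CLI shown above; wording of help text is an assumption.
    parser = argparse.ArgumentParser(description="Book-Finder pipeline orchestrator")
    parser.add_argument("--skip-sync", action="store_true",
                        help="Skip the OPAC synchronization stage")
    parser.add_argument("--ingest-limit", type=int, default=None,
                        help="Cap the number of records sent to ingestion")
    return parser

# Parsing the example invocation from the section above:
args = build_parser().parse_args(["--skip-sync", "--ingest-limit", "100"])
```

Because every stage reads its configuration from flags like these, any run is reproducible from its command line alone.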
## OPAC Synchronization (`sync_pipeline.py`)

Goal: keep the local dataset aligned with the physical library's "New Arrivals".
### Key Design Choices
- Altcha-Awareness: Detects anti-bot protection and degrades gracefully (warns user instead of crashing).
- Incremental Logic: Checks for new IDs against existing CSV to minimize redundant processing.
- BibTeX Parsing: Converts library standard format to project CSV schema.
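The incremental check reduces to a set difference against the existing register. A minimal sketch, assuming an `id` column in the CSV schema (the real column name may differ):

```python
import csv
import io

def new_arrival_ids(register_csv: str, crawled_ids: list[str]) -> list[str]:
    """Return only the crawled IDs not already in the accession register.
    The 'id' column name is an assumption about the project's CSV schema."""
    with io.StringIO(register_csv) as fh:  # real code would open a file path
        known = {row["id"] for row in csv.DictReader(fh)}
    # Preserve crawl order so downstream processing stays deterministic
    return [i for i in crawled_ids if i not in known]

register = "id,title\nB001,Clean Code\nB002,SICP\n"
fresh = new_arrival_ids(register, ["B002", "B003"])  # only B003 is new
```

Only the genuinely new IDs proceed to ingestion, which keeps repeat runs cheap.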
### Default Run

```bash
python sync_pipeline.py
```

## Metadata Ingestion (`ingestion/ingestion.py`)

Goal: enrich sparse CSV records with rich metadata from Google Books.
### Key Design Choices
- Asyncio + Aiohttp: High-throughput fetching.
- Semaphore Handling: Limits concurrency to prevent IP bans.
- Resumable State: Skips already processed IDs in the output JSONL.
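The concurrency and resume pattern can be sketched with `asyncio` alone; a stub coroutine stands in for the actual `aiohttp` request, and the names here are illustrative rather than the project's real functions:

```python
import asyncio

MAX_CONCURRENCY = 5  # in the real fetcher this guards against rate limits / IP bans

async def fetch_one(book_id: str, sem: asyncio.Semaphore) -> dict:
    async with sem:                  # at most MAX_CONCURRENCY requests in flight
        await asyncio.sleep(0)       # stand-in for an aiohttp GET to Google Books
        return {"id": book_id, "status": "enriched"}

async def fetch_all(ids: list[str], done: set[str]) -> list[dict]:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    # Resume-safe: skip IDs already present in the output JSONL
    todo = [i for i in ids if i not in done]
    return await asyncio.gather(*(fetch_one(i, sem) for i in todo))

results = asyncio.run(fetch_all(["a", "b", "c"], done={"b"}))
```

`asyncio.gather` preserves input order, so re-runs produce the same output sequence for the same remaining IDs.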
### Default Run

```bash
python ingestion/ingestion.py --limit 100
```

### Custom Input / Output

```bash
python ingestion/ingestion.py \
  --input "data/raw/custom_list.csv" \
  --output "data/processed/my_enrichment.jsonl"
```

## Transformation (`Transformation/transformation.py`)

Goal: clean, normalize, and deduplicate the noisy API results.
### Operations
- Step 1 (Transform): Merges CSV metadata (Book No, Publisher) with Google Metadata.
- Step 2 (Dedup): Resolves duplicates using a hierarchy: ISBN-13 > Google ID > Title + Author.
Why this matters:
API results often return the same book for slightly different queries. This stage ensures uniqueness.
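The dedup hierarchy amounts to choosing the strongest available identity for each record. A minimal sketch, assuming field names like `isbn_13` and `google_id` (the actual schema may differ):

```python
def dedup_key(record: dict) -> tuple:
    """Identity for a record: prefer ISBN-13, then Google ID,
    then a normalized title+author pair as a last resort."""
    if record.get("isbn_13"):
        return ("isbn", record["isbn_13"])
    if record.get("google_id"):
        return ("gid", record["google_id"])
    return ("fuzzy",
            record.get("title", "").strip().lower(),
            record.get("authors", "").strip().lower())

def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each identity key."""
    seen, unique = set(), []
    for rec in records:
        key = dedup_key(rec)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```

Because stronger identifiers win first, two API results that share an ISBN collapse into one row even if their titles differ slightly.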
### Default Run

```bash
python Transformation/transformation.py
```

## Storage (`storage/storage.py`)

Goal: persist final records into a structured relational database.
### Design Choices
- SQLAlchemy: ORM-based access for future portability (e.g., to PostgreSQL).
- Batch Commits: Improves write performance.
- Integrity Handling: Skips duplicates at the database level if pre-checks fail.
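The batch-commit-plus-integrity-skip pattern can be sketched at the `sqlite3` level (the project itself uses SQLAlchemy; table and column names here are illustrative):

```python
import sqlite3

def load_books(db_path: str, rows: list[tuple]) -> int:
    """Insert rows with one commit for the whole batch;
    duplicate ISBNs are skipped at the DB level rather than crashing."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS books "
                "(isbn_13 TEXT PRIMARY KEY, title TEXT)")
    inserted = 0
    for row in rows:
        try:
            con.execute("INSERT INTO books VALUES (?, ?)", row)
            inserted += 1
        except sqlite3.IntegrityError:
            pass  # duplicate slipped past upstream dedup; skip, don't fail
    con.commit()   # single commit = far fewer fsyncs than per-row commits
    con.close()
    return inserted

count = load_books(":memory:", [("978-1", "A"), ("978-1", "A"), ("978-2", "B")])
```

The same idea carries over to SQLAlchemy: stage many inserts in one session, commit once, and let the unique constraint act as the final safety net.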
### Default Run

```bash
python storage/storage.py
```

## API Service (`api/serving.py`)

The FastAPI layer provides read-only access to the final dataset and controlled access to the pipeline.

### Run the Server

```bash
python api/serving.py --reload
```

### Endpoints

| Endpoint | Description |
|---|---|
| `GET /books/` | Paginated book listing |
| `GET /books/{isbn}` | ISBN lookup |
| `GET /search/?q=term` | Partial match search |
| `GET /sync/` | Trigger background pipeline run |
Swagger UI:
http://127.0.0.1:8000/docs
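The search endpoint's matching logic can be sketched in plain Python, independent of FastAPI. This is a guess at what `/search/?q=term` does, not the actual implementation; field names are assumptions:

```python
def search(books: list[dict], q: str) -> list[dict]:
    """Case-insensitive partial match over title, authors, and ISBN."""
    needle = q.lower()
    return [
        b for b in books
        if needle in b.get("title", "").lower()
        or needle in b.get("authors", "").lower()
        or needle in b.get("isbn_13", "")
    ]

catalog = [
    {"title": "Clean Code", "authors": "Robert C. Martin", "isbn_13": "9780132350884"},
    {"title": "Deep Learning", "authors": "Goodfellow et al.", "isbn_13": "9780262035613"},
]
hits = search(catalog, "clean")  # matches the first record by title
```

In the service itself the equivalent filter would run as a SQL `LIKE` query against the database rather than over an in-memory list.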
## Results

All statistics can be reproduced using `analysis/metrics_analysis.py` or the `analysis/project_metrics.ipynb` notebook.
| Stage | Input Records | Output Records | Success Rate |
|---|---|---|---|
| Source (CSV) | 36,358 | - | - |
| Ingestion | ~36,358 | 33,502 | 92.1% |
| Transformation | 33,502 | 26,173 | 78.1% (Dedup) |
| Final DB | 26,173 | 26,173 | 100% |
| Metric | Value |
|---|---|
| Total Source Records | 36,358 |
| Successful Matches | 33,502 |
| Final Unique Books | 26,173 |
| Database Size | ~41 MB |
**Observation:** The pipeline successfully enriches over 90% of the library catalog, demonstrating the effectiveness of the fuzzy matching strategy used during ingestion.
## Database Schema

| Field | Description |
|---|---|
| `id` | Internal DB primary key |
| `isbn_13` | 13-digit ISBN (primary identifier) |
| `title` | Book title (from Google Books) |
| `authors` | Comma-separated list of authors |
| `description` | Full-text summary/blurb |
| `categories` | Genre/subject tags |
| `thumbnail` | URL to cover image |
| `average_rating` | Google Books rating |
| `book_no` | Original library call number |
## Engineering Principles

This project emphasizes:
- Separation of concerns: Each module does one thing well.
- Fail-Safe Operation: Network errors or API limits do not crash the pipeline.
- Reproducibility: Everything is code-defined and scriptable.
- Transparency: Extensive logging (`logs/project_log.md`) tracks all decisions.
## Conclusion

This project demonstrates a complete, production-style data pipeline:
- Quantifiable data-quality improvements
- Deterministic ETL stages
- Resume-safe enrichment
- Persistent storage
- API-based data access
It bridges the gap between 'operational' library lists and 'analytical' datasets, forming a strong foundation for Phase 2: Semantic Search.
## Authors

- 202518053: Falak Parmar
- 202518035: Aditya Jana

DA-IICT — Big Data Engineering