Project Notes: DataStreamIQ - Real-Time Streaming & Processing of Unstructured Data
Student: Sneha Rangole | Group: 1
Objective: Build a scalable pipeline to process unstructured job descriptions (JSON, TXT, PDF) from the City of Los Angeles, enabling structured querying and visualization.
Key Goals:
- Data Processing: Transform unstructured data into a structured format using PySpark.
- Storage: Use AWS S3 for reliable, scalable storage.
- Querying: Catalog data with AWS Glue and enable SQL queries via Athena.
- Insights: Visualize processed data in Power BI.
Tools:
- PySpark: Real-time processing, UDFs for parsing unstructured data.
- AWS S3: Store raw/processed data.
- AWS Glue: Data cataloging.
- AWS Athena: Query processed data.
- Power BI: Visualization.
Workflow:
- Ingestion: Collect data from JSON, TXT, and PDF files.
- Processing:
  - PySpark UDFs: Extract fields (salary, dates, requirements) using regex.
  - Unified Schema: Standardize data across formats.
- Storage: Save raw/processed data to S3 in Parquet format.
- Cataloging: Use AWS Glue Crawler to create metadata tables.
- Querying: Run SQL queries in Athena for analysis.
- Visualization: Export results to Power BI for dashboards.
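The "Unified Schema" step above can be illustrated with a small plain-Python sketch. The field names, the JobRecord class, and the to_unified helper are hypothetical and chosen only for illustration; the project's actual schema is not shown in these notes.

```python
from dataclasses import dataclass, asdict, fields
from typing import Optional


@dataclass
class JobRecord:
    # Hypothetical unified schema; the real column names may differ.
    file_name: str
    source_format: str                  # "json", "txt", or "pdf"
    salary: Optional[str] = None
    end_date: Optional[str] = None
    requirements: Optional[str] = None


def to_unified(raw: dict, file_name: str, source_format: str) -> dict:
    """Keep only schema fields; anything a parser did not find stays None."""
    reserved = {"file_name", "source_format"}
    known = {f.name for f in fields(JobRecord)} - reserved
    kept = {key: value for key, value in raw.items() if key in known}
    record = JobRecord(file_name=file_name, source_format=source_format, **kept)
    return asdict(record)


if __name__ == "__main__":
    print(to_unified({"salary": "$62,118 to $90,828"}, "a.json", "json"))
    print(to_unified({"requirements": "Two years of experience"}, "b.pdf", "pdf"))
```

Because every record ends up with the same keys in the same order, the per-format outputs can be combined without column mismatches, which is the property a union of the JSON, TXT, and PDF sources relies on.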
main.py:
- Sets up a Spark session with AWS credentials.
- Reads streaming data from the JSON, TXT, and PDF directories.
- Applies UDFs to extract structured fields (e.g., extract_salary, extract_end_date).
- Unions data from all sources and writes to S3.
udf_utils.py:
- Custom functions for parsing text (e.g., regex patterns for salary ranges, dates).
- Handles PDF text extraction using PyMuPDF.
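A minimal sketch of what regex-based helpers such as extract_salary and extract_end_date might look like. Only the two function names come from these notes; the patterns, return types, and the assumption that bulletins write salaries as "$62,118 to $90,828" and dates as "MARCH 29, 2018" are illustrative guesses, not the project's actual code.

```python
import re
from datetime import date
from typing import Optional, Tuple

# Assumed salary format: "$62,118 to $90,828" or "$62,118 - $90,828".
SALARY_RANGE = re.compile(
    r"\$\s*(\d[\d,]*(?:\.\d+)?)\s*(?:to|-)\s*\$\s*(\d[\d,]*(?:\.\d+)?)",
    re.IGNORECASE,
)
# Assumed date format: "MARCH 29, 2018" (month name, day, four-digit year).
DATE = re.compile(r"\b([A-Za-z]+)\s+(\d{1,2}),\s*(\d{4})\b")
MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}


def extract_salary(text: str) -> Optional[Tuple[float, float]]:
    """Return (low, high) for the first salary range found, else None."""
    match = SALARY_RANGE.search(text)
    if match is None:
        return None
    low, high = (float(group.replace(",", "")) for group in match.groups())
    return low, high


def extract_end_date(text: str) -> Optional[date]:
    """Return the latest valid date in the text (assumed to be the end date)."""
    found = []
    for month, day, year in DATE.findall(text):
        number = MONTHS.get(month.lower())
        if number is None:
            continue  # the word before the day was not a month name
        try:
            found.append(date(int(year), number, int(day)))
        except ValueError:
            continue  # e.g. "FEBRUARY 31, 2018"
    return max(found) if found else None


if __name__ == "__main__":
    sample = "ANNUAL SALARY $62,118 to $90,828. Filing closes MARCH 29, 2018."
    print(extract_salary(sample), extract_end_date(sample))
```

Plain functions like these can be unit-tested without a cluster and then wrapped with pyspark.sql.functions.udf, given a return type that matches what each function produces.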
Challenges and Solutions:
- Unstructured Data Complexity
  - Issue: Varied formats (PDF, JSON, TXT) required flexible parsing.
  - Solution: UDFs with regex patterns and PySpark’s schema enforcement.
- AWS Permissions
  - Issue: Configuring IAM roles and S3 bucket policies.
  - Solution: Defined granular permissions for Glue, Athena, and S3 access.
- Real-Time Processing
  - Issue: Optimizing Spark Streaming for diverse inputs.
  - Solution: Checkpointing in S3 and micro-batch processing (5-second intervals).
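The checkpointing and 5-second micro-batch settings could be wired up roughly as follows. This is a sketch that assumes the pipeline uses Spark Structured Streaming's DataFrame writer; the helper name start_parquet_sink and the example bucket paths are hypothetical.

```python
def start_parquet_sink(stream_df, output_path: str, checkpoint_path: str,
                       interval: str = "5 seconds"):
    """Start a streaming query that appends Parquet files in micro-batches.

    stream_df is expected to be a streaming DataFrame (for example, one
    produced by spark.readStream). The checkpoint location lets Spark
    resume from the last committed batch after a restart.
    """
    return (
        stream_df.writeStream
        .format("parquet")
        .outputMode("append")
        .option("path", output_path)
        .option("checkpointLocation", checkpoint_path)
        .trigger(processingTime=interval)   # one micro-batch per interval
        .start()
    )


# Hypothetical usage (bucket name is a placeholder):
# query = start_parquet_sink(
#     union_df,
#     "s3a://example-bucket/processed/",
#     "s3a://example-bucket/checkpoints/",
# )
# query.awaitTermination()
```

Keeping the checkpoint directory in S3 rather than on local disk means a restarted job on a different machine can still pick up where the previous run stopped.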
