This project demonstrates a scalable approach for real-time processing of unstructured data using Apache Spark and AWS services. The application streams unstructured data, processes it to extract relevant details, and saves it in a structured format on AWS S3. Additionally, AWS Glue and Athena are used for data cataloging and querying, providing a seamless workflow for data analysis.
- Apache Spark: Install Spark with support for Hadoop libraries.
- Hadoop: Provides the file system libraries, including the S3A connector from hadoop-aws, that Spark uses to read from and write to S3.
- PySpark: Python API for Apache Spark.
- AWS IAM: To manage permissions for various AWS services.
- AWS S3: To store the processed data in Parquet format.
- AWS Glue: To catalog data for querying.
- AWS Athena: To query the data from S3.
git clone <repository-url>
cd <repository-directory>
python3 -m venv venv
source venv/bin/activate
Execute the following command to submit the job to Spark:
spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.1,com.amazonaws:aws-java-sdk:1.11.1026 app.py
- The details of the extracted data will be displayed in the terminal.
- Ensure that all dependencies are satisfied for smooth execution.
- Log in to the AWS Console as the root user.
- Navigate to IAM > Users and create a new user with Programmatic access.
To allow the IAM user to access specific services, assign the following inline policies with full access for:
- IAM
- S3
- Glue
- Athena
These policies allow the user to interact with the services required for this project.
- Log out of the root account and sign in as the IAM user you created.
- Go to IAM > Users > Security Credentials and create an access key and secret access key.
- Copy these values into your AWS configuration file (~/.aws/credentials) or specify them directly in your configuration code.
- Create a new S3 bucket named spark-unstructured-streaming.
- Once you run the code, you should see data saved in Parquet format under the /data directory in this bucket.
- Go to AWS Glue and create a database.
- Set up a crawler for the S3 bucket (i.e., spark-unstructured-streaming).
- The crawler will catalog the data, enabling it to be queried from Athena.
- After the Glue crawler completes, go to AWS Athena.
- Choose the database where the crawler stored the data.
- Run queries to retrieve insights from the unstructured data that was transformed into a structured format.
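For the credentials step above, one possible approach is to read the access key pair from a file in the `~/.aws/credentials` INI format and turn it into Spark settings for the S3A connector. This is a minimal sketch, not the project's own code: the helper name `s3a_conf_from_credentials` and the `default` profile are assumptions, while the `spark.hadoop.fs.s3a.*` keys are standard Hadoop S3A settings.

```python
import configparser
from pathlib import Path


def s3a_conf_from_credentials(path=None, profile="default"):
    """Return Spark config entries for S3A built from an AWS credentials file.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    path = Path(path) if path else Path.home() / ".aws" / "credentials"
    parser = configparser.ConfigParser()
    # read() returns the list of files it parsed; empty means nothing was found
    if not parser.read(path):
        raise FileNotFoundError(f"No credentials file at {path}")
    section = parser[profile]
    return {
        "spark.hadoop.fs.s3a.access.key": section["aws_access_key_id"],
        "spark.hadoop.fs.s3a.secret.key": section["aws_secret_access_key"],
        "spark.hadoop.fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider",
    }
```

Each entry can then be passed to `SparkSession.builder.config(key, value)` before the session is created, which keeps the keys out of the source code.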
This step defines the structure for handling unstructured data. The schema is designed to accommodate varying data types, including text, JSON, and other file formats.
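As a rough illustration of the schema step, the sketch below keeps the field list in one place and renders it as a Spark DDL schema string. The field names (`file_name`, `file_type`, `content`, `ingested_at`) are hypothetical; the real schema depends on the documents being processed.

```python
# Hypothetical field list: adjust names and types to the actual data.
SCHEMA_FIELDS = [
    ("file_name", "STRING"),       # source file the record came from
    ("file_type", "STRING"),       # e.g. "text" or "json"
    ("content", "STRING"),         # raw or extracted text content
    ("ingested_at", "TIMESTAMP"),  # when the record entered the stream
]


def to_ddl(fields):
    """Render (name, type) pairs as a Spark DDL schema string."""
    return ", ".join(f"{name} {dtype}" for name, dtype in fields)


SCHEMA_DDL = to_ddl(SCHEMA_FIELDS)
print(SCHEMA_DDL)
```

A DDL string like this can be passed to `spark.readStream.schema(...)`, which accepts either a `StructType` or a DDL-formatted string.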
Custom UDFs allow for specific data extraction tasks based on the type of unstructured data being processed.
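A UDF of this kind might look like the sketch below, which pulls the value that follows a `label:` prefix out of free text. The function names and the label-based layout are assumptions for illustration; the project's actual extraction rules are not shown in this document.

```python
import re


def extract_field(text, label):
    """Return the value following 'label:' on its line, or None if absent."""
    if text is None:
        return None
    # [ \t]* rather than \s* so an empty value does not spill onto the next line
    pattern = rf"^{re.escape(label)}[ \t]*:[ \t]*(.+)$"
    match = re.search(pattern, text, flags=re.IGNORECASE | re.MULTILINE)
    return match.group(1).strip() if match else None


def as_spark_udf(func):
    """Wrap a plain Python function as a string-returning Spark UDF."""
    # Imported lazily so the extraction logic can be tested without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    return udf(func, StringType())
```

Keeping the extraction logic in a plain function makes it easy to unit test; the wrapped UDF can then be applied to a column, for example with `as_spark_udf(extract_field)(col("content"), lit("Title"))`.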
Data from unstructured sources is parsed, and text content is extracted and then structured into a DataFrame format for ease of manipulation and analysis.
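The parsing step can be sketched as a function that normalizes each raw document into a flat record with a fixed set of keys. The key names and the text-versus-JSON detection rule here are assumptions, not the project's actual logic.

```python
import json


def parse_record(file_name, raw):
    """Normalize one raw document into a flat dict with a fixed set of keys."""
    try:
        payload = json.loads(raw)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        payload = None
    if isinstance(payload, dict):
        # JSON object: re-serialize with sorted keys for a stable representation
        return {"file_name": file_name, "file_type": "json",
                "content": json.dumps(payload, sort_keys=True)}
    # Anything else is treated as plain text
    return {"file_name": file_name, "file_type": "text", "content": raw.strip()}


rows = [
    parse_record("a.txt", "  Title: Data Engineer \n"),
    parse_record("b.json", '{"title": "Analyst"}'),
]
```

Records shaped like `rows` can then be loaded into a DataFrame, for instance by converting each dict to a `Row` and calling `spark.createDataFrame`.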
Both structured and unstructured data streams are joined to generate comprehensive insights from heterogeneous data sources.
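A join of the two streams could be expressed as below. This is only a sketch: the key column `file_name` is hypothetical, and the function relies on nothing more than the standard `DataFrame.join(other, on, how)` method.

```python
def join_streams(unstructured_df, structured_df, key="file_name"):
    """Inner-join two DataFrames on a shared key column.

    `key` is a hypothetical column name that both inputs are assumed to have.
    """
    # Inner stream-stream joins are supported in Structured Streaming, but
    # without watermarks and a time constraint the join state grows unbounded.
    return unstructured_df.join(structured_df, on=key, how="inner")
```

For long-running jobs, calling `withWatermark` on both inputs and adding a time-range condition to the join is the usual way to let Spark discard old state.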
The final DataFrame is saved in Parquet format, providing a compact, efficient file storage option that integrates seamlessly with AWS services.
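The write step might be set up along these lines. The bucket name and the `data` prefix come from the setup steps; the `checkpoints` prefix and both function names are assumptions for illustration.

```python
BUCKET = "spark-unstructured-streaming"  # bucket name from the setup steps


def parquet_sink_options(bucket=BUCKET, data_prefix="data",
                         checkpoint_prefix="checkpoints"):
    """Build the S3A output and checkpoint locations for the Parquet sink."""
    return {
        "path": f"s3a://{bucket}/{data_prefix}/",
        # Hypothetical location; streaming file sinks need a checkpoint directory.
        "checkpointLocation": f"s3a://{bucket}/{checkpoint_prefix}/",
    }


def start_parquet_sink(df, options):
    """Start an append-mode streaming query that writes `df` as Parquet."""
    return (
        df.writeStream
        .format("parquet")
        .outputMode("append")
        .options(**options)
        .start()
    )
```

Calling `start_parquet_sink(final_df, parquet_sink_options())` returns a streaming query handle, and `awaitTermination()` on that handle keeps the job running.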
The Glue crawler catalogs the processed data, and Athena allows for querying without needing a dedicated ETL process, making this pipeline efficient and scalable.
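The crawler can also be created from code instead of the console. The sketch below builds the request for the Glue `CreateCrawler` call with boto3; the crawler name, database name, and role ARN passed in are placeholders you would supply, and the role is assumed to be one that Glue is allowed to assume with read access to the bucket.

```python
def crawler_request(name, role_arn, database, bucket, prefix="data"):
    """Build keyword arguments for the Glue CreateCrawler API call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/{prefix}/"}]},
    }


def create_and_start_crawler(request):
    """Create the crawler and start one run (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request builder has no AWS dependency

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])
```

Separating the request builder from the API call makes the configuration easy to inspect before anything is created in the account.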
- Ensure all AWS services are configured as detailed above.
- Execute the Spark job using the spark-submit command.
- Data will automatically be written to the S3 bucket in Parquet format, and the AWS Glue crawler will catalog this data.
- Once cataloged, data can be queried in Athena using standard SQL syntax.
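The final query step can be scripted as well. This sketch builds a request for the Athena `StartQueryExecution` call with boto3; the database, table, and results location are placeholders, since the table name depends on what the crawler creates.

```python
def athena_query_request(database, table, output_location, limit=10):
    """Build keyword arguments for the Athena StartQueryExecution API call."""
    return {
        "QueryString": f'SELECT * FROM "{table}" LIMIT {int(limit)}',
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request):
    """Submit the query and return its execution ID (needs boto3)."""
    import boto3  # imported lazily so the request builder has no AWS dependency

    athena = boto3.client("athena")
    response = athena.start_query_execution(**request)
    return response["QueryExecutionId"]
```

Athena runs queries asynchronously, so the returned ID is what you would pass to `get_query_execution` to check status before fetching results.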