- Create a new Python virtual environment:
```bash
python -m venv env
source env/bin/activate   # For Linux/Mac
env\Scripts\activate      # For Windows
```
- Install the required dependencies:
```bash
pip install -r requirements.txt
```
- Open the `data_collection.py` file.
- Fill in your YouTube API key in the appropriate placeholder.
- Run the script:
```bash
python data_collection.py
```
- The script will save metadata for popular movies in the `trailer_data/data` directory.
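
For reference, here is a minimal sketch of how an API key is typically wired into a YouTube Data API v3 client, assuming the script uses the official `google-api-python-client` library (the actual placeholder lives in `data_collection.py`):

```python
from googleapiclient.discovery import build

API_KEY = "YOUR_YOUTUBE_API_KEY"  # paste your key here

# Build a YouTube Data API v3 client authenticated with the key.
youtube = build("youtube", "v3", developerKey=API_KEY)

# Example request: metadata for the current most-popular videos.
response = youtube.videos().list(
    part="snippet,statistics",
    chart="mostPopular",
    maxResults=5,
).execute()

for item in response["items"]:
    print(item["snippet"]["title"], item["statistics"].get("viewCount"))
```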
- Open the `get_comments.py` file.
- Fill in your YouTube API key in the appropriate placeholder.
- Run the script:
```bash
python get_comments.py
```
- The script will fetch comments for the trailers listed in the `trailer_data/data` directory and save them to the `comments` directory.
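
A minimal sketch of how trailer comments are typically fetched through the `commentThreads` endpoint, again assuming `google-api-python-client` (the video ID is a placeholder; the real logic lives in `get_comments.py`):

```python
from googleapiclient.discovery import build

API_KEY = "YOUR_YOUTUBE_API_KEY"
youtube = build("youtube", "v3", developerKey=API_KEY)

# Fetch top-level comment threads for one trailer.
response = youtube.commentThreads().list(
    part="snippet",
    videoId="VIDEO_ID",      # placeholder trailer video ID
    maxResults=100,
    textFormat="plainText",
).execute()

for thread in response["items"]:
    comment = thread["snippet"]["topLevelComment"]["snippet"]
    print(comment["textDisplay"])
```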
- Run the conversion script:
```bash
python convert_to_parquet.py
```
- This script converts all JSON files in `trailer_data/data` to Parquet format and saves them to the `comment_parquet` directory. Parquet ensures efficient storage and compatibility with Spark jobs.
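
If you want to verify what the conversion step amounts to, here is a minimal sketch using `pandas` with the `pyarrow` engine (directory names follow the ones above; the actual logic lives in `convert_to_parquet.py`):

```python
import pathlib
import pandas as pd

SRC = pathlib.Path("trailer_data/data")
DST = pathlib.Path("comment_parquet")
DST.mkdir(exist_ok=True)

# Convert each JSON file into a columnar Parquet file.
for json_file in SRC.glob("*.json"):
    df = pd.read_json(json_file)
    df.to_parquet(DST / f"{json_file.stem}.parquet", engine="pyarrow")
```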
- Upload the `trailer_data` directory to your Google Cloud Storage (GCS) bucket.
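
One way to do the upload is with `gsutil` (the bucket name is a placeholder):

```bash
gsutil -m cp -r trailer_data gs://<your-bucket>/
```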
- Open each file in the `src` folder.
- Update any hardcoded paths (e.g., bucket names, file paths) to match your GCS bucket configuration.
- Upload all the updated `src` files to your GCS bucket.
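
As an illustration of the kind of change the path update involves (the variable names here are hypothetical; check each script for its actual constants):

```python
# Hypothetical example -- adjust to whatever constants the src/ scripts define.
# Before:
# INPUT_PATH = "gs://original-bucket/trailer_data/comment_parquet/"
# After, pointing at your own bucket:
INPUT_PATH = "gs://<your-bucket>/trailer_data/comment_parquet/"
OUTPUT_PATH = "gs://<your-bucket>/trailer_data/results/"
```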
- Create a Dataproc cluster for running Spark jobs.
- Follow the Spark NLP GCP Dataproc Guide to set up your cluster with Spark and Spark NLP preinstalled.
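
For orientation, here is one possible cluster-creation command following the generic Dataproc `pip-install` initialization-action pattern; the cluster name, region, and image version are placeholders, and the Spark NLP guide linked above remains the authoritative reference for current versions and flags:

```bash
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --image-version=2.1 \
    --metadata='PIP_PACKAGES=spark-nlp' \
    --initialization-actions=gs://goog-dataproc-initialization-actions-us-central1/python/pip-install.sh
```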
- Submit the analysis scripts as Spark jobs using the Google Cloud SDK Shell:
```bash
gcloud dataproc jobs submit pyspark <script_name.py> \
    --cluster=<cluster_name> \
    --region=<region>
```
- Example:
```bash
gcloud dataproc jobs submit pyspark sentiment_analysis.py \
    --cluster=my-cluster \
    --region=us-central1
```
- The analysis results (e.g., sentiment distributions, entity recognition, topic models) will be printed to the terminal and saved in the `trailer_data/results` directory in your GCS bucket.
- You can download or further process the results as needed, as shown in the sketch below.
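
To pull the results back to your machine, `gsutil` works here as well (the bucket name is a placeholder):

```bash
gsutil -m cp -r gs://<your-bucket>/trailer_data/results ./results
```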
- Run the following command to start a local server:
```bash
python -m http.server
```
- Open your browser and go to:
```
http://localhost:8000/website/main.html
```
Note: The website is currently in its alpha stage. Future improvements will include a redesigned frontend and integration with APIs to create a dynamic, fully functional website.
- Local Execution: Steps 2(a) to 2(c) are meant to be executed on your local machine.
- Google Cloud Setup: Ensure you have the necessary permissions and API keys configured for your GCS bucket and Dataproc cluster.
- Hardcoded Paths: It is essential to update the hardcoded paths in all scripts before uploading them to your GCS bucket.
Follow these steps carefully to successfully execute the project. If you encounter any issues, ensure all dependencies are installed, and double-check your Google Cloud configurations.