This repository contains the code and resources for my master's-level Big Data assignment. The project explores distributed data processing, scalability, and data analytics. The code is commented to make it easier to follow.
The data.py script in Pre-Processing performs preprocessing tasks, such as cleaning and transforming data fields, to prepare the data for further analysis.
All data files required for this assignment are available from the IMDb Non-Commercial Datasets page: https://developer.imdb.com/non-commercial-datasets/
Make sure you have Python 3.8 or higher installed.
Install the required Python libraries by running:
pip install pandas numpy
Visit the IMDb Non-Commercial Datasets page, download all the required data files, and place them in the same directory as your Python script.
Run the data.py script from the terminal using the following command, replacing <FILE_NAME> with the name of the script (data for data.py):
python3 <FILE_NAME>.py
The script in Pre-Processing performs the following tasks:
Reads the title.ratings.tsv.gz file into a Pandas DataFrame.
Cleans and transforms specific fields in the dataset (e.g., removing prefixes from IDs).
Prepares the data for further analysis.
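The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of data.py: the helper name strip_id_prefix is hypothetical, the sample rows are made up, and casting the stripped ID to an integer is an assumption. The column names (tconst, averageRating, numVotes) are the ones IMDb documents for title.ratings.

```python
import io

import pandas as pd


def strip_id_prefix(df, column="tconst", prefix="tt"):
    """Return a copy of df with `prefix` removed from the start of `column`.

    Hypothetical helper: the remaining digits are cast to int, which is an
    assumption about what the cleaned ID should look like.
    """
    out = df.copy()
    out[column] = out[column].str.replace(f"^{prefix}", "", regex=True).astype(int)
    return out


# Tiny in-memory stand-in for title.ratings.tsv.gz (values are made up).
sample = io.StringIO(
    "tconst\taverageRating\tnumVotes\n"
    "tt0000001\t5.7\t2000\n"
    "tt0000002\t5.8\t270\n"
)
ratings = pd.read_csv(sample, sep="\t")

cleaned = strip_id_prefix(ratings)
print(cleaned)
```

To try this on the real download, the sample could be replaced with pd.read_csv("title.ratings.tsv.gz", sep="\t", na_values="\\N"); pandas infers gzip compression from the file extension, and IMDb marks missing values with \N.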
The find_functional_dep.py script in the Finding Functional Dependencies directory identifies functional dependencies between attributes in the dataset. It looks for column combinations that uniquely determine other columns, which helps to understand the relationships within the data.
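To make the idea concrete, here is a small brute-force sketch of a functional dependency check. It is not the code from the repository: the function names holds and find_dependencies, the max_lhs limit, and the sample rows are all assumptions made for illustration. A dependency lhs -> rhs holds when no two rows agree on every lhs column but differ on rhs.

```python
from itertools import combinations


def holds(rows, lhs, rhs):
    """Return True if the columns in lhs functionally determine rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        # setdefault stores the first rhs value seen for this key;
        # a later row with the same key must carry the same value.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def find_dependencies(rows, columns, max_lhs=2):
    """List (lhs, rhs) pairs that hold, keeping only minimal left-hand sides."""
    found = []
    for rhs in columns:
        others = [c for c in columns if c != rhs]
        minimal = []
        for size in range(1, max_lhs + 1):
            for lhs in combinations(others, size):
                # Skip supersets of a left-hand side that already works.
                if any(set(m) <= set(lhs) for m in minimal):
                    continue
                if holds(rows, lhs, rhs):
                    minimal.append(lhs)
                    found.append((lhs, rhs))
    return found


# Made-up rows using IMDb-style column names.
rows = [
    {"tconst": 1, "titleType": "movie", "startYear": 1999},
    {"tconst": 2, "titleType": "short", "startYear": 1999},
    {"tconst": 3, "titleType": "movie", "startYear": 2001},
    {"tconst": 4, "titleType": "short", "startYear": 2001},
]
deps = find_dependencies(rows, ["tconst", "titleType", "startYear"])
for lhs, rhs in deps:
    print(", ".join(lhs), "->", rhs)
```

The search grows quickly with the number of columns, so max_lhs caps the size of the left-hand side. Rows from a Pandas DataFrame can be passed in via df.to_dict("records").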
title(): Merges the title.basics.tsv.gz and title.ratings.tsv.gz files to prepare a combined dataset with movie information and ratings, saving the result as title.csv.
genre(): Extracts and counts the unique genres from the title.basics.tsv.gz file, and saves the result as genres.csv.
member(): Processes the name.basics.tsv.gz file to clean up and store member (actor/actress/producer) details in member.csv.
characters(): Extracts and cleans character names from the title.principals.tsv.gz file, storing the unique characters in characters.tsv.
title_actor(): Processes the title.principals.tsv.gz file to extract the relationship between titles and actors/actresses, saving it as title_actor.tsv.
title_writers(): Extracts and processes writer information from the title.crew.tsv.gz file, saving the cleaned data in title_writers.csv.
title_director(): Extracts and processes director information from the title.crew.tsv.gz file, saving the cleaned data in title_directors.csv.
title_producer(): Extracts and processes producer information from the title.principals.tsv.gz file, saving the cleaned data in title_producer.csv.
title_actor_character(): Combines data from title.principals.tsv.gz and characters.tsv to create a detailed relationship between actors and characters, saving the result in title_actor_character.csv.
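As an illustration of the first two functions, the sketch below merges basics with ratings and counts genres on small made-up DataFrames. It is not the repository's implementation: the inner join, the output column names genre and count, and the sample values are assumptions. The input column names follow IMDb's documented schema, where genres is a comma-separated string.

```python
import pandas as pd

# Made-up stand-ins for title.basics.tsv.gz and title.ratings.tsv.gz.
basics = pd.DataFrame(
    {
        "tconst": ["tt0000001", "tt0000002", "tt0000003"],
        "primaryTitle": ["A", "B", "C"],
        "genres": ["Documentary,Short", "Animation,Short", None],
    }
)
ratings = pd.DataFrame(
    {
        "tconst": ["tt0000001", "tt0000002"],
        "averageRating": [5.7, 5.8],
        "numVotes": [2000, 270],
    }
)

# title(): join on the shared ID. An inner join (an assumption here)
# keeps only the titles that have a rating.
title = basics.merge(ratings, on="tconst", how="inner")

# genre(): split the comma-separated genres, put one genre per row,
# then count how often each genre occurs.
genre_counts = (
    basics["genres"]
    .dropna()
    .str.split(",")
    .explode()
    .value_counts()
    .rename_axis("genre")
    .reset_index(name="count")
)

print(title)
print(genre_counts)
```

Each result could then be written out with to_csv, for example title.to_csv("title.csv", index=False).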
find_dependencies.py: The script will print the functional dependencies between columns in the dataset.
main.py: The script will join and merge the datasets, resulting in new_relation.csv.
remove_character.py: The script will clean the data by removing duplicate entries, and the result will be saved as find_dependencies.tsv.
Please do not copy the code files for your assignment; instead, run them locally and understand how they work. Copying content for your assignment or homework is strictly prohibited and may constitute plagiarism.