Skip to content

surajsureshkumar/masters-big-data

Repository files navigation

masters-big-data

This repository contains the code and resources for my master's-level Big Data assignment. The project explores concepts of distributed data processing, scalability, and data analytics. Code has been commented so that you can understand it better

Overview

The data.py script in Pre-Processing performs preprocessing tasks, such as cleaning and transforming data fields, to prepare the data for further analysis.

Dataset

All data files required for this assignment are available from the IMDB Non-Commercial Datasets https://developer.imdb.com/non-commercial-datasets/.

Prerequisites

Make sure to have Python: Version 3.8 or higher.

Libraries

Install the required Python libraries by running: pip install pandas numpy

Download the Dataset

Visit the IMDB Non-Commercial Datasets page. Download all the required data files. Place the downloaded file in the same directory as your python script.

Run the Script

Run the data.py script from the terminal using the following command: python3 <FILE_NAME>.py

Expected Output

The script in Pre-Processing performs the following tasks: Reads the title.ratings.tsv.gz file into a Pandas DataFrame. Cleans and transforms specific fields in the dataset (e.g., removing prefixes from IDs). Prepares the data for further analysis.

The find_functional_dep.py in the Finding Functional Dependencies directory, identifies functional dependencies between attributes within the dataset. It analyzes the dataset to find column combinations that can uniquely identify other columns, helping to understand the relationships within the data.

The script in Pre-Processing performs the following tasks:

  • title(): Merges the title.basics.tsv.gz and title.ratings.tsv.gz files to prepare a combined dataset with movie information and ratings, saving the result as title.csv.
  • genre(): Extracts and counts the unique genres from the title.basics.tsv.gz file, and saves the result as genres.csv.
  • member(): Processes the name.basics.tsv.gz file to clean up and store member (actor/actress/producer) details in member.csv.
  • characters(): Extracts and cleans character names from the title.principals.tsv.gz file, storing the unique characters in characters.tsv.
  • title_actor(): Processes the title.principals.tsv.gz file to extract the relationship between titles and actors/actresses, saving it as title_actor.tsv.
  • title_writers(): Extracts and processes writer information from the title.crew.tsv.gz file, saving the cleaned data in title_writers.csv.
  • title_director(): Extracts and processes director information from the title.crew.tsv.gz file, saving the cleaned data in title_directors.csv.
  • title_producer(): Extracts and processes producer information from the title.principals.tsv.gz file, saving the cleaned data in title_producer.csv.
  • title_actor_character(): Combines data from title.principals.tsv.gz and characters.tsv to create a detailed relationship between actors and characters, saving the result in title_actor_character.csv.

The Scripts in the Finding Functional Dependency performs the following:

find_dependencies.py: The script will print the functional dependencies between columns in the dataset. main.py: The script will join and merge the datasets, resulting in new_relation.csv. remove_character.py: The script will clean the data by removing duplicate entries, and the result will be saved as find_dependencies.tsv

The script file in the Views directory can be run on datagrip or any other IDEs that you prefer

Please do not copy the code files for your assignment, rather run it locally and understand how it works. Copying content for your assignment or homework is strictly prohibited, as it may result in plagiarism.

About

This repository contains the code and resources for my master's-level Big Data assignment. The project explores concepts of distributed data processing, scalability, and data analytics.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors