AWS_Lambda
1_Extract_Poster_Feature_Vectors.ipynb
2_Parsing_Data.ipynb
3_Split_Poster_Feature_Vectors.ipynb
4_Extract_Bert_Tokens.ipynb
5_Data_Exploration.ipynb
README.md

Data

This directory contains all the code for extracting data from various sources.

NOTE: The individual IPython notebooks are linked to Google Colab and must be run on TPUs.

The dataset of movie plots and posters was extracted using the AWS Lambda functions provided in the AWS_Lambda directory. There are two Lambdas, one for the poster image and the other for the plot summary and genres; both take the IMDb movieid as input. The movieid typically has the format ttXXXXX. The movieid file is dropped into an S3 bucket, and an SQS queue is then generated for each of the movieids. The SQS queue feeds the movieid to both Lambdas (getIMDBMetadata and getIMDBPosters), and the results are stored in an output S3 bucket as JSON. This JSON is then read, converted to a CSV file, and saved as imdb_dataset.csv.
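The final JSON-to-CSV step could look roughly like the sketch below. It is only an illustration, not the code used in this repository: the column names (`movieid`, `plot`, `genres`, `poster_url`), the one-JSON-object-per-line input, and the helper names `is_valid_movie_id` and `records_to_csv` are all assumptions, since the actual JSON schema is not documented here. The ID check assumes that "ttXXXXX" means `tt` followed by digits.

```python
import csv
import io
import json
import re

# Assumption: a movieid is "tt" followed by one or more digits.
MOVIE_ID_PATTERN = re.compile(r"tt\d+")

# Hypothetical column names; the real output schema may differ.
FIELDNAMES = ["movieid", "plot", "genres", "poster_url"]


def is_valid_movie_id(movie_id: str) -> bool:
    """Return True if the whole string looks like an ID such as tt0000001."""
    return MOVIE_ID_PATTERN.fullmatch(movie_id) is not None


def records_to_csv(json_lines, out_file) -> int:
    """Write one CSV row per valid JSON record; return the number of rows written."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDNAMES, extrasaction="ignore")
    writer.writeheader()
    written = 0
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        if not is_valid_movie_id(str(record.get("movieid", ""))):
            continue  # skip records without a usable ID
        # Flatten a list of genres into a single pipe-separated cell.
        genres = record.get("genres", [])
        if isinstance(genres, list):
            record["genres"] = "|".join(genres)
        writer.writerow(record)  # missing columns are written as empty cells
        written += 1
    return written


if __name__ == "__main__":
    # Made-up sample record, purely for illustration.
    sample = ['{"movieid": "tt0000001", "plot": "Example plot.", "genres": ["Drama"]}']
    buffer = io.StringIO()
    records_to_csv(sample, buffer)
    print(buffer.getvalue())
```

To produce a file instead of an in-memory buffer, pass a handle opened with `open("imdb_dataset.csv", "w", newline="")` as `out_file`. Reading the JSON from S3 is left out on purpose, because the bucket names and object layout are not given in this README.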
