- Remove rows where `genres` is `NULL`
- If `startYear` is `NULL` then use the last seen value
- Keeping the `tconst` column in all tables as an identifier is mandatory.
- Download gzip files for `title` and `ratings` from IMDb
- Initialize `settings` and SQL engine
- Clean `title` data
  - Remove blocked `titleType`s and `genres` outlined in `settings`
  - Remove `isAdult` if set to `True` in `settings`
  - Drop columns outlined in `settings`
  - If `startYear` is `NULL` then use the last seen value
  - Remove rows where `genres` is `NULL`
- Join `ratings` file with cleaned `title` file
- Split `genres` column from comma-separated string to list and explode it into separate rows (if enabled in `settings`). This also converts the genres value to `int` and creates a ref-table with the corresponding string values.
- Convert `titleType`s to `int` and create ref-table (if enabled in `settings`)
- Parse table info outlined in `settings` and use the cleaned data to create the tables based on it
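
The `startYear` fill, the `genres` filter, and the genre explode step can be sketched roughly as follows. This is a minimal illustration in plain Python over a list of dicts, not the project's actual code: the sample rows, the function names, and the use of `None` for `NULL` are all assumptions made for the example.

```python
# Hypothetical sample rows; None stands in for NULL.
rows = [
    {"tconst": "tt0000001", "startYear": 1994, "genres": "Drama,Crime"},
    {"tconst": "tt0000002", "startYear": None, "genres": "Comedy"},
    {"tconst": "tt0000003", "startYear": 2001, "genres": None},
]


def fill_start_year(rows):
    """Replace a missing startYear with the last value seen above it."""
    last_seen = None
    filled = []
    for row in rows:
        year = row["startYear"] if row["startYear"] is not None else last_seen
        last_seen = year
        filled.append({**row, "startYear": year})
    return filled


def drop_null_genres(rows):
    """Keep only rows that have a genres value."""
    return [row for row in rows if row["genres"] is not None]


def explode_genres(rows):
    """Emit one row per genre and encode each genre as an int id."""
    genre_ids = {}  # genre string -> int id, in order of first appearance
    exploded = []
    for row in rows:
        for genre in row["genres"].split(","):
            genre_id = genre_ids.setdefault(genre, len(genre_ids))
            # tconst is carried along so every output row stays identifiable.
            exploded.append({**row, "genres": genre_id})
    ref_table = [{"id": i, "genre": g} for g, i in genre_ids.items()]
    return exploded, ref_table


# Same order as the list above: fill startYear first, then drop NULL genres.
cleaned = drop_null_genres(fill_start_year(rows))
title_genres, genre_ref = explode_genres(cleaned)
```

Filling `startYear` before dropping rows means a row that is later removed can still supply the "last seen" year for the rows below it, which matches the order the steps are listed in.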
On init or update

- `values` in `settings` is empty
- Duplicates in `values` dict on either key or values (key=imdb_data_col_name and value=sql_col_name)
- `values` dict has key (imdb_data_col_name) that is not present in the dataset
- `dtypes` dict has key not matching any value (sql_col_name) in `values`
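
A rough sketch of how checks like these could be written is shown below. The function name `validate_table_settings`, its arguments, and the sample settings are hypothetical; only the `values` and `dtypes` dict shapes (imdb_data_col_name to sql_col_name, and sql_col_name to dtype) come from the list above.

```python
from collections import Counter


def validate_table_settings(values, dtypes, dataset_columns):
    """Return a list of problems found in one table's settings (hypothetical helper)."""
    if not values:
        return ["`values` is empty"]

    problems = []

    # A loaded dict cannot hold duplicate keys, so duplicated keys would have
    # to be caught while parsing the settings source. Values can still repeat.
    repeated = [name for name, n in Counter(values.values()).items() if n > 1]
    if repeated:
        problems.append(f"duplicate sql_col_name in `values`: {repeated}")

    missing = [col for col in values if col not in dataset_columns]
    if missing:
        problems.append(f"`values` keys not in dataset: {missing}")

    sql_names = set(values.values())
    unknown = [col for col in dtypes if col not in sql_names]
    if unknown:
        problems.append(f"`dtypes` keys not in `values`: {unknown}")

    return problems


# Made-up settings that trip three of the checks.
problems = validate_table_settings(
    values={"tconst": "title_id", "primaryTitle": "title", "runtime": "title"},
    dtypes={"title_id": "VARCHAR(12)", "year": "INT"},
    dataset_columns={"tconst", "primaryTitle"},
)
```

Collecting every problem instead of raising on the first one is a design choice for the sketch; it lets a user fix all settings issues in one pass.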
Only on update

- Target database has tables matching the `values` dict `.values()` (sql_col_name)
- The target tables and source tables have an exact match of COL_NAMES
- Target and source tables have an exact match of DTYPES and DTYPE.LENGTH (if the dtype has a length attr)
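
The update-time comparison could look something like the sketch below. It assumes each schema has already been read into a plain mapping of `{table: {column: (dtype, length)}}`, with `length` set to `None` for dtypes that have no length. That representation and the name `schema_mismatches` are illustrative assumptions, not the project's API.

```python
def schema_mismatches(source, target):
    """Compare two {table: {column: (dtype, length)}} mappings (hypothetical helper)."""
    issues = []
    for table, src_cols in source.items():
        if table not in target:
            issues.append(f"{table}: missing in target database")
            continue

        tgt_cols = target[table]
        # Column names must match exactly; column order is ignored here.
        if set(src_cols) != set(tgt_cols):
            issues.append(f"{table}: column names differ")
            continue

        # Each spec is (dtype, length), so one comparison covers both.
        for col, spec in src_cols.items():
            if spec != tgt_cols[col]:
                issues.append(f"{table}.{col}: {spec} != {tgt_cols[col]}")
    return issues


# Made-up schemas: the VARCHAR length differs between source and target.
source = {"titles": {"title_id": ("VARCHAR", 12), "year": ("INTEGER", None)}}
target = {"titles": {"title_id": ("VARCHAR", 10), "year": ("INTEGER", None)}}
issues = schema_mismatches(source, target)
```

An empty result would mean the target tables are compatible with the source tables under these three checks.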