More details on the dataset can be found via the link: UK Government Air Quality Strategy
Some possible visualisations that can be done on the data, to name a few, are:
- Scatter plot: to find the relationship between two variables.
- Heatmap: to detect multi-collinearity between variables.
- Distplot: to show the distribution of a variable.
All of these plots can be created using Python's data visualisation libraries such as matplotlib, seaborn and plotly.
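As a minimal illustration, assuming a DataFrame with hypothetical pollutant columns such as 'NOx' and 'PM2.5' (the column names and the semicolon delimiter are assumptions), the three plot types could be produced like this:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assumption: the file is semicolon-delimited and contains 'NOx' and
# 'PM2.5' columns; only a sample of rows is read for quick plotting.
df = pd.read_csv('bristol-air-quality-data.csv', sep=';', nrows=10_000)

# Scatter plot: relationship between two variables.
plt.scatter(df['NOx'], df['PM2.5'], s=2)
plt.xlabel('NOx')
plt.ylabel('PM2.5')
plt.show()

# Heatmap of the correlation matrix: highlights multi-collinearity.
sns.heatmap(df.select_dtypes('number').corr(), annot=True, cmap='coolwarm')
plt.show()

# Distribution of a single variable (histplot supersedes the older distplot).
sns.histplot(df['NOx'].dropna(), kde=True)
plt.show()
```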
Task 1A:
For this task, I imported two libraries, pandas and datetime, to help crop the 'bristol-air-quality-data.csv' data. Here are the steps I followed:
- I imported the 'bristol-air-quality-data.csv' into my working environment.
- I then formatted my 'Date Time' column to allow for easy filtering.
- I then set my condition for filtering the 'Date Time' column, which was to delete any record before 00:00 1 Jan 2010.
- Finally, I wrote the result into a CSV file named 'crop.csv'. A sketch of these steps is shown below.
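A minimal sketch of this cropping step, assuming the raw file is semicolon-delimited and the timestamp column is named 'Date Time':

```python
import pandas as pd
from datetime import datetime

# Assumption: the raw file is semicolon-delimited with a 'Date Time' column.
df = pd.read_csv('bristol-air-quality-data.csv', sep=';')

# Parse the timestamps so they can be compared chronologically.
# If the timestamps carry a timezone, pass utc=True and use a tz-aware cutoff.
df['Date Time'] = pd.to_datetime(df['Date Time'])

# Keep only records from 00:00 1 Jan 2010 onwards, i.e. delete anything earlier.
cutoff = datetime(2010, 1, 1)
cropped = df[df['Date Time'] >= cutoff]

cropped.to_csv('crop.csv', sep=';', index=False)
```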
Task 1B:
For this task, I imported four libraries, pandas, glob, sys and datetime, to help clean the 'crop.csv' file generated in Task 1A. Here are the steps I followed:
- I first imported the 'crop.csv' with the help of pandas.
- I then included a try/except statement to help me handle errors.
- I created my first function to hold a dictionary of the station monitors and assigned its key-value pairs to a station lookup.
- I created a temporary list to hold the mismatches and then replaced null values with ' '.
- I then created a series of functions to satisfy the condition of the task, which is to remove any dud or mismatched record.
- The first function checks whether each Location from the DataFrame matches the station monitor values. When that condition is satisfied, a second check compares the SiteID with the Location from the lookup to see if they match; if they do not, it prints a message and stores the mismatched row in the previously created list.
- The second function takes the result from the first function, drops the mismatched rows stored in the temporary list from the data, and prints the number of mismatches. The remaining rows are written to a new CSV file named 'clean.csv', and the number of rows written is printed. A sketch of this cleaning logic is shown below.
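A minimal sketch of this lookup-and-filter approach, assuming semicolon-delimited data with 'SiteID' and 'Location' columns (the station dictionary below is truncated and its entries are for illustration only):

```python
import pandas as pd
import sys

def station_lookup():
    # Assumption: SiteID -> Location mapping, truncated for illustration.
    return {
        188: 'AURN Bristol Centre',
        203: 'Brislington Depot',
        215: 'Parson Street School',
    }

def find_mismatches(df, lookup):
    """Return the index labels of rows whose SiteID/Location pair
    disagrees with the lookup table."""
    mismatches = []
    for idx, row in df.iterrows():
        expected = lookup.get(row['SiteID'])
        if expected is None or row['Location'] != expected:
            print(f"Mismatch at row {idx}: SiteID={row['SiteID']}, "
                  f"Location={row['Location']}")
            mismatches.append(idx)
    return mismatches

def write_clean(df, mismatches):
    clean = df.drop(mismatches)
    print(f"{len(mismatches)} mismatched rows dropped")
    clean.to_csv('clean.csv', sep=';', index=False)
    print(f"{len(clean)} rows written to clean.csv")

try:
    df = pd.read_csv('crop.csv', sep=';')
    df = df.fillna(' ')  # replace null values, as described above
    write_clean(df, find_mismatches(df, station_lookup()))
except FileNotFoundError as err:
    sys.exit(err)
```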
Task 2A:
In this task, I studied the data given, identified the primary keys and broke the table into three parts (stations, readings and schema), with schema acting as a guide to and description of the data. I then created relationships between them. This task was done in MySQL Workbench.
Task 2B:
In this task, I used MySQL Workbench's forward-engineering feature to create an SQL schema from my ER model from Task 2A.
Task 3A:
- I first imported a few libraries, namely the MySQL connector, sys, csv and datetime, and then connected to the MySQL server from my Python environment.
- I then created a database and the three tables with all the entities and their data types.
- After that, two lists were created to hold the contents of clean.csv and schema.csv.
- I then created 3 separate lists, enumerated through the two CSVs and parsed the right rows into their respective lists.
- Finally, I inserted the data from the 3 separate lists into the respective tables created in the database and closed the connection. A sketch of this workflow is shown below.
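A minimal sketch of this workflow, assuming the mysql-connector-python driver and hypothetical database, table and column names (the real schema comes from the Task 2A model):

```python
import csv
import sys
import mysql.connector

# Connection parameters are placeholders; substitute your own credentials.
try:
    conn = mysql.connector.connect(host='localhost', user='root', password='...')
except mysql.connector.Error as err:
    sys.exit(err)

cur = conn.cursor()
cur.execute("CREATE DATABASE IF NOT EXISTS pollution")
cur.execute("USE pollution")

# One table shown for brevity; the other two follow the same pattern.
cur.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        date_time DATETIME,
        site_id   INT,
        nox       FLOAT
    )
""")

# Read clean.csv and parse the relevant fields into a per-table list.
readings = []
with open('clean.csv', newline='') as f:
    reader = csv.reader(f, delimiter=';')
    next(reader)  # skip the header row
    for row in reader:
        # Column positions are assumptions for illustration.
        readings.append((row[0], row[1], row[2]))

cur.executemany(
    "INSERT INTO readings (date_time, site_id, nox) VALUES (%s, %s, %s)",
    readings,
)
conn.commit()
cur.close()
conn.close()
```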
Task 3B:
For this task, a list was created to hold the 100 readings to be inserted. Then, using itertools, I iterated through the first 100 rows and wrote the resulting INSERT statements to an SQL file, as sketched below.
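A minimal sketch using itertools.islice, again with hypothetical column positions and table names:

```python
import csv
from itertools import islice

inserts = []
with open('clean.csv', newline='') as f:
    reader = csv.reader(f, delimiter=';')
    next(reader)                      # skip the header row
    for row in islice(reader, 100):   # first 100 data rows only
        # Column positions and table/column names are assumptions.
        inserts.append(
            "INSERT INTO readings (date_time, site_id, nox) "
            f"VALUES ('{row[0]}', {row[1]}, {row[2]});"
        )

with open('insert-100.sql', 'w') as out:
    out.write('\n'.join(inserts))
```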
Task 4A, B, C:
Simple SQL queries were run on the database created in Task 3.
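Purely as an illustration, one such query, using the hypothetical schema from the earlier sketches, could be run from Python as follows:

```python
import mysql.connector

conn = mysql.connector.connect(
    host='localhost', user='root', password='...', database='pollution'
)
cur = conn.cursor()

# Hypothetical example query: average NOx reading per station.
cur.execute("""
    SELECT site_id, AVG(nox)
    FROM readings
    GROUP BY site_id
""")
for site_id, avg_nox in cur.fetchall():
    print(site_id, avg_nox)

cur.close()
conn.close()
```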
Task 5:
For this task I selected MongoDB, which is a very popular NoSQL database with plenty of resources. I did my research and discovered some of the advantages of NoSQL over RDBMS. I then replicated one of the SQL queries from Task 4 in MongoDB, as sketched below.
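A minimal sketch of how such a replication might look with pymongo, assuming the readings were loaded into a 'pollution' database with the same hypothetical field names used above:

```python
from pymongo import MongoClient

# Assumption: readings live in a 'pollution' database, 'readings'
# collection, with 'site_id' and 'nox' fields per document.
client = MongoClient('mongodb://localhost:27017')
readings = client['pollution']['readings']

# Equivalent of the SQL GROUP BY above: average NOx per station.
pipeline = [
    {'$group': {'_id': '$site_id', 'avg_nox': {'$avg': '$nox'}}},
]
for doc in readings.aggregate(pipeline):
    print(doc['_id'], doc['avg_nox'])

client.close()
```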