Parani-Bala123/exno1

 
 


Exno:1

Data Cleaning Process

AIM

To read the given data and perform data cleaning and save the cleaned data to a file.

Explanation

Data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted. It is not simply about erasing data, but about finding ways to maximize a dataset's accuracy without necessarily deleting information.
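As a small illustration of modifying rather than erasing data, the sketch below repairs inconsistent formatting before removing duplicates. The records and column values are hypothetical and are not taken from the dataset used in this exercise.

```python
import pandas as pd

# Hypothetical records: inconsistent formatting hides a duplicate row.
raw = pd.DataFrame({
    "title": [" Naruto", "naruto ", "Bleach", "Bleach"],
    "rating": [8.3, 8.3, 7.9, 7.9],
})

# Repair the formatting first so that equivalent rows become identical.
cleaned = raw.copy()
cleaned["title"] = cleaned["title"].str.strip().str.title()

# Only then drop exact duplicates; no distinct information is lost.
cleaned = cleaned.drop_duplicates().reset_index(drop=True)
print(cleaned)
```

Cleaning the text before calling drop_duplicates() matters here: on the raw frame, the two "Naruto" rows would not be recognised as duplicates.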

Algorithm

STEP 1: Read the given Data

STEP 2: Get the information about the data

STEP 3: Remove the null values from the data

STEP 4: Save the Clean data to the file

STEP 5: Remove outliers using IQR

STEP 6: Use the z-score to remove outliers
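Steps 1 to 5 can be sketched end to end on a tiny in-memory CSV. This is a minimal sketch under stated assumptions: the CSV text, the column names (name, score), and the output file name are hypothetical stand-ins, not the dataset used below.

```python
import io
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the CSV file; the columns are made up.
csv_text = """name,score
a,10
b,
c,12
d,11
e,95
"""

# STEP 1: read the data.
df = pd.read_csv(io.StringIO(csv_text))

# STEP 2: inspect it (column types and non-null counts).
df.info()

# STEP 3: remove rows that contain null values.
df_clean = df.dropna()

# STEP 4: save the cleaned data to a file (temporary path for this sketch).
out_path = os.path.join(tempfile.mkdtemp(), "cleaned.csv")
df_clean.to_csv(out_path, index=False)

# STEP 5: remove outliers with the 1.5 * IQR rule.
q1, q3 = df_clean["score"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = df_clean["score"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_no_outliers = df_clean[in_range]
print(df_no_outliers)
```

Passing index=False to to_csv() keeps the row index out of the saved file, so reading it back yields the same columns.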

Coding and Output

from google.colab import drive
drive.mount('/content/drive')

ls drive/MyDrive/DS2024/Data_set.csv

Data Cleaning

import pandas as pd
df=pd.read_csv('drive/MyDrive/DS2024/Data_set.csv')
df
Screenshot 2024-12-13 152317

CHECK FOR NULL VALUES IN THE DATA SET USING isnull()

df_null=df.isnull()
df_null
Screenshot 2024-12-13 152317

DISPLAY THE SUM OF NULL VALUES IN EACH COLUMN

df_null_sum=df.isnull().sum()
df_null_sum
Screenshot 2024-12-13 152306
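By default, isnull().sum() counts nulls per column; passing axis=1 counts them per row instead. A minimal sketch on a hypothetical frame (only the column names echo the dataset; the values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the values are invented for illustration.
sample = pd.DataFrame({
    "rating": [8.5, np.nan, 7.0],
    "watchers": [np.nan, np.nan, 1200.0],
})

nulls_per_column = sample.isnull().sum()      # default axis=0: one count per column
nulls_per_row = sample.isnull().sum(axis=1)   # one count per row

print(nulls_per_column)
print(nulls_per_row)
```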

DROP NULL VALUES

df_dropna=df.dropna()
df_dropna
Screenshot 2024-12-13 152258

FILL NULL VALUES WITH CONSTANT VALUE "0"

df_nafill_0=df.fillna(0)
df_nafill_0
Screenshot 2024-12-13 152252

FILL NULL VALUES WITH ffill METHOD

df_ffill=df.ffill()
df_ffill
Screenshot 2024-12-13 152241

FILL NULL VALUES WITH bfill METHOD

df_bfill=df.bfill()
df_bfill
Screenshot 2024-12-13 152230
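The difference between the two fill directions is easiest to see on a short series. The values below are hypothetical; the point is that ffill() leaves a leading null untouched and bfill() leaves a trailing one, because there is no neighbouring value to copy from.

```python
import numpy as np
import pandas as pd

# Hypothetical series with nulls at the start, in the middle, and at the end.
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()    # copies the last valid value downwards
backward = s.bfill()   # copies the next valid value upwards

print(pd.DataFrame({"original": s, "ffill": forward, "bfill": backward}))
```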

CALCULATE MEAN VALUE OF A COLUMN AND FILL NULL VALUES WITH IT

df_mean1=df['num_episodes'].fillna(df['num_episodes'].mean())
df_mean1
Screenshot 2024-12-13 152203

df_mean2=df['rating'].fillna(df['rating'].mean())
df_mean2
Screenshot 2024-12-13 152158

df_mean3=df['current_overall_rank'].fillna(df['current_overall_rank'].mean())
df_mean3
Screenshot 2024-12-13 152153

df_mean4=df['lifetime_popularity_rank'].fillna(df['lifetime_popularity_rank'].mean())
df_mean4
Screenshot 2024-12-13 152148

df_mean5=df['watchers'].fillna(df['watchers'].mean())
df_mean5
Screenshot 2024-12-13 152140
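The per-column statements above each return a new Series and leave df itself unchanged. One possible alternative, sketched here on a hypothetical frame, fills every numeric column with its own mean in a single call by passing the Series of column means to fillna().

```python
import numpy as np
import pandas as pd

# Hypothetical frame; only the column names echo the dataset used above.
sample = pd.DataFrame({
    "name": ["a", "b", "c"],
    "rating": [8.0, np.nan, 6.0],
    "watchers": [100.0, 300.0, np.nan],
})

# Means of the numeric columns only; the text column is skipped.
column_means = sample.mean(numeric_only=True)

# fillna() matches the means to columns by name.
filled = sample.fillna(column_means)
print(filled)
```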

DROP NULL VALUES

df_dropna=df.dropna()
df_dropna
Screenshot 2024-12-13 152130
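Called without arguments, dropna() removes every row that has at least one null. When that discards too much, the how and subset parameters narrow what is dropped. A minimal sketch on hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the values are invented for illustration.
sample = pd.DataFrame({
    "rating": [8.0, np.nan, 6.0, np.nan],
    "watchers": [100.0, 300.0, np.nan, np.nan],
})

any_null = sample.dropna()                       # drop rows with any null
all_null = sample.dropna(how="all")              # drop rows where every value is null
rating_only = sample.dropna(subset=["rating"])   # only consider the rating column

print(any_null.index.tolist())
print(all_null.index.tolist())
print(rating_only.index.tolist())
```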

Outlier Detection and Removal - IQR

import pandas as pd
import seaborn as sns

age=[1,3,28,27,25,92,30,39,40,50,26,24,29,94]
af=pd.DataFrame(age)
af
Screenshot 2024-12-13 152117

USE BOXPLOT FUNCTION HERE TO DETECT OUTLIER

sns.boxplot(af)
Screenshot 2024-12-13 152111

sns.scatterplot(af)
Screenshot 2024-12-13 152104

q1=af.quantile(0.25)
q2=af.quantile(0.5)
q3=af.quantile(0.75)

iqr=q3-q1
iqr
Screenshot 2024-12-13 152055

import numpy as np

Q1=np.percentile(af,25)
Q2=np.percentile(af,50)
Q3=np.percentile(af,75)

IQR=Q3-Q1

lower_bound=Q1-1.5*IQR
upper_bound=Q3+1.5*IQR

outliers = [x for x in age if x < lower_bound or x > upper_bound]

print('Q1:',Q1)
print('Q3:',Q3)
print('IQR:',IQR)
print('Lower bound:',lower_bound)
print('Upper bound:',upper_bound)
print('Outliers:',outliers)
Screenshot 2024-12-13 152049
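The bound calculation can be wrapped in a small helper so that it is reusable across columns. This is a sketch, not part of the original notebook: the function name iqr_bounds and its k parameter are hypothetical, while the age list is the one used above.

```python
import numpy as np


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


age = [1, 3, 28, 27, 25, 92, 30, 39, 40, 50, 26, 24, 29, 94]
lower, upper = iqr_bounds(age)
outliers = [x for x in age if x < lower or x > upper]

print("Lower bound:", lower)   # 3.5
print("Upper bound:", upper)   # 61.5
print("Outliers:", outliers)   # [1, 3, 92, 94]
```

With the default linear interpolation, the quartiles of this list are 25.25 and 39.75, so the IQR is 14.5 and the fences fall at 3.5 and 61.5.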

af=af[((af>=lower_bound)&(af<=upper_bound))]
af
Screenshot 2024-12-13 152044
af=af.dropna()  # assign the result, otherwise the masked NaN rows remain in af
af
Screenshot 2024-12-13 152038
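Indexing a DataFrame with a boolean DataFrame keeps every row and replaces the rejected values with NaN, which is why a dropna() step is needed afterwards. One possible alternative, assuming the values are held in a Series (the name "age" is chosen here for illustration), filters the rows directly with between():

```python
import pandas as pd

age = pd.Series([1, 3, 28, 27, 25, 92, 30, 39, 40, 50, 26, 24, 29, 94], name="age")

q1 = age.quantile(0.25)
q3 = age.quantile(0.75)
iqr = q3 - q1

# between() is inclusive on both ends by default, matching >= and <= above.
kept = age[age.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
print(kept.tolist())
```

Because no NaN values are introduced, the filtered Series keeps its integer dtype.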

sns.boxplot(af)
Screenshot 2024-12-13 152031

sns.scatterplot(af)
Screenshot 2024-12-13 152023

Z Score

from scipy import stats  # stats.zscore is used to implement the z-score method
import numpy as np
import pandas as pd
import seaborn as sns

data=[1,12,15,18,21,24,27,30,33,36,39,42,45,48,51,54,57,60,63,66,69,72,75,78,81,84,87,90,93]
df=pd.DataFrame(data)

USE BOXPLOT FUNCTION HERE TO DETECT OUTLIER

sns.boxplot(df)
Screenshot 2024-12-13 152014

mean=np.mean(data)
mean
Screenshot 2024-12-13 152008

std=np.std(data)
std
Screenshot 2024-12-13 152003

PERFORM Z SCORE METHOD AND DETECT OUTLIER VALUES

z=np.abs(stats.zscore(df))
z
Screenshot 2024-12-13 151958

threshold=3
outliers = df[(z > threshold).any(axis=1)]  # compare the z-scores, not the raw values, with the threshold
print("Outliers:")
print(outliers)
Screenshot 2024-12-13 151927

Remove outliers

df_cleaned = df[(z <= threshold).all(axis=1)]  # keep only rows within the threshold
df_cleaned
Screenshot 2024-12-13 151916
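The z-score of a value is its distance from the mean measured in standard deviations, z = (x - mean) / std. The sketch below computes it with NumPy alone, using the population standard deviation (ddof=0), which is also the default of scipy.stats.zscore. The data list is rebuilt with range() but holds the same 29 values as above.

```python
import numpy as np

# Same values as the list above: 1 followed by 12, 15, ..., 93.
data = np.array([1] + list(range(12, 94, 3)), dtype=float)

# z = (x - mean) / std, with the population standard deviation (ddof=0).
z = (data - data.mean()) / data.std()

threshold = 3
outliers = data[np.abs(z) > threshold]
cleaned = data[np.abs(z) <= threshold]

print("Largest |z|:", np.abs(z).max())
print("Outliers:", outliers)
```

For this list the largest absolute z-score is about 1.94 (the value 1), so a threshold of 3 flags no outliers and the cleaned data equals the input. A lower threshold would be needed for the z-score method to remove anything here.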

USE BOXPLOT FUNCTION HERE TO CHECK OUTLIER IS REMOVED

sns.boxplot(df_cleaned)
Screenshot 2024-12-13 151849

sns.scatterplot(df_cleaned)
Screenshot 2024-12-13 151842

Result

The data cleaning process was executed successfully.
