Data Cleaning Process
To read the given data, perform data cleaning, and save the cleaned data to a file.
Data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted. Data cleaning is not simply about erasing data, but rather about finding ways to maximize a dataset's accuracy without necessarily deleting information.
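
As a minimal sketch of "modifying rather than deleting", the snippet below normalizes formatting first and only then removes exact duplicates. The frame `records` and its values are hypothetical and are not taken from Data_set.csv.

```python
import pandas as pd

# Hypothetical records: inconsistent whitespace/case and ratings stored as text
records = pd.DataFrame({
    "title": [" show a", "Show A ", "Show B"],
    "rating": ["8.1", "8.1", "7.9"],
})

# Repair the formatting instead of discarding the rows
records["title"] = records["title"].str.strip().str.title()
records["rating"] = pd.to_numeric(records["rating"], errors="coerce")

# After normalization the first two rows are identical, so one is dropped
deduped = records.drop_duplicates().reset_index(drop=True)
print(deduped)
```

Normalizing before calling drop_duplicates() matters here: without it, the two spellings of the same title would both survive.
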
STEP 1: Read the given data
STEP 2: Get the information about the data
STEP 3: Remove the null values from the data
STEP 4: Save the cleaned data to a file
STEP 5: Remove outliers using the IQR
STEP 6: Use the z-score to remove outliers
from google.colab import drive
drive.mount('/content/drive')
ls drive/MyDrive/DS2024/Data_set.csv
import pandas as pd
df=pd.read_csv('drive/MyDrive/DS2024/Data_set.csv')
df
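
STEP 2 (getting information about the data) could be sketched as follows. The frame `sample` is a small hypothetical stand-in for the real dataset; only the column names `num_episodes` and `rating` are borrowed from this document.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loaded dataset
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "num_episodes": [12, np.nan, 24, 12],
    "rating": [8.0, 7.0, np.nan, 9.0],
})

print(sample.shape)          # (rows, columns)
sample.info()                # column dtypes and non-null counts
summary = sample.describe()  # count, mean, std, quartiles for numeric columns
print(summary)
```

The non-null counts from info() give a quick first hint about which columns will need null handling.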

df_null_sum=df.isnull().sum()
df_null_sum

# Drop every row that contains at least one null value
df_dropna=df.dropna()
df_dropna

df_nafill_0=df.fillna(0)
df_nafill_0

df_mean1=df['num_episodes'].fillna(df['num_episodes'].mean())
df_mean1

df_mean2=df['rating'].fillna(df['rating'].mean())
df_mean2

df_mean3=df['current_overall_rank'].fillna(df['current_overall_rank'].mean())
df_mean3
df_mean4=df['lifetime_popularity_rank'].fillna(df['lifetime_popularity_rank'].mean())
df_mean4

df_mean5=df['watchers'].fillna(df['watchers'].mean())
df_mean5
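
The per-column mean fills above each return a separate Series and leave `df` itself unchanged. One possible way to fill every numeric column in a single step is sketched below, on a hypothetical frame `sample` with made-up values.

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the column names come from this document
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "num_episodes": [12, np.nan, 24, 12],
    "rating": [8.0, 7.0, np.nan, 9.0],
})

# Select the numeric columns and fill each with its own column mean
numeric_cols = sample.select_dtypes(include="number").columns
filled = sample.copy()
filled[numeric_cols] = sample[numeric_cols].fillna(sample[numeric_cols].mean())
print(filled)
```

Passing a Series of means to fillna() matches each mean to its column by name, so text columns such as `title` are left untouched.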

df_dropna=df.dropna()
df_dropna
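
STEP 4 (saving the cleaned data) is not shown in the cells above. A minimal sketch, assuming the cleaned frame should be written as CSV, is given below. The frame `raw`, the file name `Data_set_cleaned.csv`, and the temporary directory are illustrative assumptions; in Colab the path would more likely point into the mounted Drive folder.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical raw data with some missing values
raw = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "num_episodes": [12, np.nan, 24, 12],
    "rating": [8.0, 7.0, np.nan, 9.0],
})

cleaned = raw.dropna()

# Assumed output location; replace with the desired path
out_path = os.path.join(tempfile.gettempdir(), "Data_set_cleaned.csv")
cleaned.to_csv(out_path, index=False)  # index=False avoids writing the row index as a column

# Read the file back to confirm what was written
reloaded = pd.read_csv(out_path)
print(reloaded)
```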

import pandas as pd
import seaborn as sns
age=[1,3,28,27,25,92,30,39,40,50,26,24,29,94]
af=pd.DataFrame(age)
af

q1=af.quantile(0.25)
q2=af.quantile(0.5)
q3=af.quantile(0.75)
import numpy as np
Q1=np.percentile(af,25)
Q2=np.percentile(af,50)
Q3=np.percentile(af,75)
IQR=Q3-Q1
# Values more than 1.5*IQR outside the quartiles are treated as outliers
lower_bound=Q1-1.5*IQR
upper_bound=Q3+1.5*IQR
outliers = [x for x in age if x < lower_bound or x > upper_bound]
print('Q1:',Q1)
print('Q3:',Q3)
print('IQR:',IQR)
print('Lower bound:',lower_bound)
print('Upper bound:',upper_bound)
print('Outliers:',outliers)

af=af[((af>=lower_bound)&(af<=upper_bound))]
af
af.dropna()
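
The IQR steps above can also be wrapped in a small reusable helper. The function name `iqr_bounds` and the parameter `k` are hypothetical choices for this sketch; the `age` values are the ones used in this section.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

age = pd.Series([1, 3, 28, 27, 25, 92, 30, 39, 40, 50, 26, 24, 29, 94])
lower, upper = iqr_bounds(age)

inside = age.between(lower, upper)  # inclusive on both ends by default
inliers = age[inside]
outliers = age[~inside]

print("Bounds:", lower, upper)
print("Outliers:", outliers.tolist())
```

For this data the fences come out at 3.5 and 61.5, so 1, 3, 92 and 94 are flagged. Working on a Series with a boolean mask removes the rows directly, so no follow-up dropna() is needed.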

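from scipy import stats  # stats provides zscore, used for the z-score method
import numpy as np
import pandas as pd
import seaborn as sns
data=[1,12,15,18,21,24,27,30,33,36,39,42,45,48,51,54,57,60,63,66,69,72,75,78,81,84,87,90,93]
df=pd.DataFrame(data)
# Absolute z-score of every value in the single column (named 0)
z=np.abs(stats.zscore(df[0]))
threshold=3
# Rows whose absolute z-score exceeds the threshold are outliers
outliers = df[z > threshold]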
print("Outliers:")
print(outliers)

df_cleaned = df[(z <= threshold)]
df_cleaned
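
To make the z-score rule explicit, the sketch below computes z = (x - mean) / std by hand with pandas instead of calling scipy. The helper name `zscore_filter` is hypothetical, and `ddof=0` is chosen on the assumption that the result should match the default of scipy.stats.zscore.

```python
import pandas as pd

def zscore_filter(values, threshold=3.0):
    """Split a Series into (inliers, outliers) by absolute z-score."""
    mean = values.mean()
    std = values.std(ddof=0)  # population standard deviation
    z = (values - mean) / std
    keep = z.abs() <= threshold
    return values[keep], values[~keep]

# Same values as the data list in this section: 1, then 12 to 93 in steps of 3
data = pd.Series([1] + list(range(12, 94, 3)))

inliers, outliers = zscore_filter(data, threshold=3)
print("Number of inliers:", len(inliers))
print("Outliers:", outliers.tolist())
```

With this particular data no value lies more than 3 standard deviations from the mean, so the outlier list is empty and all 29 values are kept. A lower threshold would be needed to flag the value 1.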

The data cleaning code was executed successfully.













