Guide on Data Collection and Data Preprocessing

This guide covers data sources, data cleaning, encoding of categorical data, and feature scaling.

1. Data Collection

Data collection is the first step in any data-driven project. The sources of data can be classified into:

A. Primary Data Sources (Self-Collected Data)

  • Surveys: Collect responses from users through Google Forms, Typeform, or Microsoft Forms.
  • Interviews: Structured or unstructured interviews with individuals or experts.
  • Sensors & IoT Devices: Data collected from hardware sensors, such as temperature, humidity, or motion detectors.
  • Web Scraping: Extracting data from websites using libraries like BeautifulSoup or Scrapy (see the sketch after this list).
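
As an illustration, here is a minimal scraping sketch using requests and BeautifulSoup; the URL is a placeholder, and real scraping should respect a site's robots.txt and terms of service.

import requests
from bs4 import BeautifulSoup

# Fetch a page (placeholder URL)
response = requests.get("https://example.com")
response.raise_for_status()

# Parse the HTML and collect the text of every link
soup = BeautifulSoup(response.text, "html.parser")
links = [a.get_text(strip=True) for a in soup.find_all("a")]
print(links)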

B. Secondary Data Sources (Existing Datasets)

  • Kaggle: A vast repository of open-source datasets for machine learning and analytics.
  • data.gov.in: A government portal providing public datasets on demographics, economics, and more.
  • Internet Archive: A digital library containing historical data, text, images, and videos.
  • UCI Machine Learning Repository: Offers various structured datasets for machine learning research.

2. Data Preprocessing

Raw data is often messy and requires extensive preprocessing. The key steps are:

A. Data Cleaning

1. Removing Duplicates

Duplicates can arise from repeated data entry or from merging datasets. Use pandas to remove them:

import pandas as pd

# Load dataset
df = pd.read_csv("data.csv")

# Remove duplicate rows
df = df.drop_duplicates()

# Reset index after dropping
df.reset_index(drop=True, inplace=True)
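
drop_duplicates can also judge duplicates on specific columns via its subset argument; 'id' below is a hypothetical key column used for illustration:

# Keep the first row for each value of the (assumed) 'id' column
df = df.drop_duplicates(subset=['id'], keep='first')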

2. Handling Missing Values

Missing values can negatively impact model performance. There are multiple ways to handle them:

a) Removing Rows with Missing Values

df = df.dropna()

b) Filling Missing Values (Imputation)

  • Mean Imputation (for numerical data):
df.fillna(df.mean(numeric_only=True), inplace=True)
  • Mode Imputation (for categorical data):
df['Category_Column'] = df['Category_Column'].fillna(df['Category_Column'].mode()[0])
  • Forward Fill (using the previous row's value):
df.ffill(inplace=True)
  • Backward Fill (using the next row's value):
df.bfill(inplace=True)
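
Whichever strategy you choose, it helps to first check how much data is missing per column; a quick sketch:

# Count missing values in each column
print(df.isna().sum())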

3. Handling Outliers

Outliers can distort results. A common approach is the IQR (Interquartile Range) method, which drops rows that fall more than 1.5 × IQR outside the quartiles. Quantiles are computed on numeric columns only:

# Compute quartiles and IQR on the numeric columns
num = df.select_dtypes(include='number')
Q1 = num.quantile(0.25)
Q3 = num.quantile(0.75)
IQR = Q3 - Q1

# Keep rows whose numeric values all lie within the IQR fences
df = df[~((num < (Q1 - 1.5 * IQR)) | (num > (Q3 + 1.5 * IQR))).any(axis=1)]
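
As an alternative to dropping rows, values can instead be capped (winsorized) at the IQR fences; a sketch for a single assumed numeric column 'Feature1':

# Cap values at the lower and upper IQR fences
low = Q1['Feature1'] - 1.5 * IQR['Feature1']
high = Q3['Feature1'] + 1.5 * IQR['Feature1']
df['Feature1'] = df['Feature1'].clip(lower=low, upper=high)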

B. Encoding Categorical Data

Most machine learning models require numeric input, so categorical data must be encoded.

1. Label Encoding (For Ordinal Data)

Used when categories have a meaningful order (e.g., Low < Medium < High). Note that LabelEncoder assigns integer codes in alphabetical order of the labels, which may not match the intended order:

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
df['Category'] = encoder.fit_transform(df['Category'])
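
If the codes must follow the semantic order rather than the alphabetical one, an explicit mapping is a simple alternative; 'Category' and the Low/Medium/High labels are placeholders carried over from the example above:

# Map each category to its rank in the intended order
order = {'Low': 0, 'Medium': 1, 'High': 2}
df['Category'] = df['Category'].map(order)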

2. One-Hot Encoding (For Nominal Data)

Used when categories have no specific order:

df = pd.get_dummies(df, columns=['Category'], drop_first=True)
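
Here drop_first=True drops one dummy column per encoded feature: the dropped category is implied when all remaining dummies are zero, which avoids perfectly collinear columns in linear models.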

C. Feature Scaling

Feature scaling puts all numerical features on a comparable scale, so that features with large ranges do not dominate distance-based models or gradient-based training.

1. Min-Max Scaling (Normalization)

Scales values into the [0, 1] range via x' = (x − min) / (max − min):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[['Feature1', 'Feature2']] = scaler.fit_transform(df[['Feature1', 'Feature2']])

2. Standardization (Z-score Scaling)

Transforms data to have zero mean and unit variance via z = (x − μ) / σ:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['Feature1', 'Feature2']] = scaler.fit_transform(df[['Feature1', 'Feature2']])

3. Robust Scaling (Handling Outliers)

Uses the median and IQR, x' = (x − median) / IQR, making it robust to outliers:

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
df[['Feature1', 'Feature2']] = scaler.fit_transform(df[['Feature1', 'Feature2']])
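
Whichever scaler you use, fit it on the training split only and reuse the fitted scaler on the test split (scaler.fit_transform(X_train), then scaler.transform(X_test)); fitting on the full dataset leaks test-set statistics into training.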

Final Process Overview

  1. Collect Data β†’ From surveys, Kaggle, data.gov.in, etc.
  2. Remove Duplicates β†’ df.drop_duplicates()
  3. Handle Missing Values → df.ffill() (or imputation via fillna())
  4. Handle Outliers β†’ Using IQR method
  5. Encode Categorical Data β†’ LabelEncoder() or pd.get_dummies()
  6. Scale Data β†’ Using MinMaxScaler, StandardScaler, or RobustScaler
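
Putting the steps together, here is a minimal end-to-end sketch; "data.csv", 'Category', 'Feature1', and 'Feature2' are placeholder names carried over from the examples above, so adjust them for a real dataset:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# 1. Load the collected data (placeholder file name)
df = pd.read_csv("data.csv")

# 2. Remove duplicates
df = df.drop_duplicates().reset_index(drop=True)

# 3. Fill missing values from the previous row
df = df.ffill()

# 4. Drop outlier rows using the IQR fences on numeric columns
num = df.select_dtypes(include='number')
Q1, Q3 = num.quantile(0.25), num.quantile(0.75)
IQR = Q3 - Q1
df = df[~((num < Q1 - 1.5 * IQR) | (num > Q3 + 1.5 * IQR)).any(axis=1)]

# 5. One-hot encode the categorical column
df = pd.get_dummies(df, columns=['Category'], drop_first=True)

# 6. Standardize the numeric features
scaler = StandardScaler()
df[['Feature1', 'Feature2']] = scaler.fit_transform(df[['Feature1', 'Feature2']])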