This is the first task of my internship. The objective is to clean and preprocess the provided dataset to make it suitable for further analysis and machine learning tasks.
Data cleaning is important because raw data often contains missing values, duplicates, incorrect data types, and outliers. By cleaning the data, we ensure better quality and reliability of results.
- `Task1_DataCleaning.ipynb` → Jupyter Notebook containing all data cleaning steps.
- `dataset.csv` → The raw dataset provided for this task.
- `README.md` → Project documentation (this file).
- Importing Libraries – Pandas, NumPy, Matplotlib, Seaborn.
- Loading Data – Read the dataset with `pandas.read_csv()`.
- Exploring Data – Used `.head()`, `.info()`, `.describe()` to understand the structure.
- Handling Missing Values – Checked with `.isnull().sum()`, then applied mean/median/mode imputation or dropped rows.
- Removing Duplicates – Identified duplicates with `.duplicated()` and removed them.
- Data Type Conversion – Converted incorrect data types to correct ones (e.g., object → datetime/int).
- Outlier Treatment – Used boxplots and IQR method to detect and handle outliers.
- Renaming & Standardizing Columns – Fixed inconsistent column names.
- Final Output – Saved the cleaned dataset for further use.
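The steps above can be sketched as a small pandas pipeline. This is a minimal illustration, not the notebook's exact code; the demo columns (`Product Name`, `Price`) are hypothetical stand-ins for the real dataset:

```python
import numpy as np
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps described above to a DataFrame."""
    df = df.copy()

    # Handling missing values: median for numeric columns, mode otherwise
    for col in df.columns:
        if df[col].isnull().any():
            if pd.api.types.is_numeric_dtype(df[col]):
                df[col] = df[col].fillna(df[col].median())
            else:
                df[col] = df[col].fillna(df[col].mode()[0])

    # Removing duplicates
    df = df.drop_duplicates().reset_index(drop=True)

    # Outlier treatment: clip numeric values to the IQR fences
    # [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    for col in df.select_dtypes(include=np.number).columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        df[col] = df[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

    # Renaming & standardizing columns: lowercase with underscores
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
    return df

# Tiny demo frame standing in for dataset.csv (hypothetical data)
raw = pd.DataFrame({
    "Product Name": ["A", "B", "B", None],
    "Price": [10.0, None, 12.0, 1000.0],
})
cleaned = clean(raw)
# In the notebook the result would then be saved, e.g.:
# cleaned.to_csv("cleaned_dataset.csv", index=False)
```

After imputation the second and third rows become identical, so `drop_duplicates()` removes one of them; clipping rather than dropping outliers keeps the row count stable.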
- The dataset was cleaned successfully.
- Missing values and duplicates were handled.
- Outliers were treated.
- The final dataset is now ready for the upcoming tasks.
- Clone the repository:
```bash
git clone <your-repository-link>
cd DataCleaningTask
```
- Open Jupyter Notebook:
```bash
jupyter notebook
```
- Run the notebook `Task1_DataCleaning.ipynb` step by step.
✅ Conclusion
This task provided practical experience in data preprocessing. The cleaned dataset will be used for analysis and modeling in the upcoming internship tasks.