This repository contains data analysis notebooks and modularized Python scripts for Exploratory Data Analysis (EDA). The notebooks in the notebooks/ folder provide step-by-step analysis, while the scripts/ folder contains reusable functions for EDA to keep the code modular and maintainable.
The repository is structured as follows:
├── .vscode/
│   └── settings.json
├── .github/
│   └── workflows/
│       └── unittests.yml
├── .gitignore
├── requirements.txt
├── README.md
├── src/
│   └── __init__.py
├── notebooks/
│   ├── EDA.ipynb
│   └── README.md
├── tests/
│   └── __init__.py
└── scripts/
    ├── __init__.py
    └── EDA_functions.py
- notebooks/: This folder contains the Jupyter notebooks used for data exploration and cleaning. The main file is EDA.ipynb, which includes the initial implementation of data import, cleaning, and outlier detection using the IQR method.
- scripts/: This folder contains Python scripts that modularize the functions used in the notebooks. The EDA_functions.py file contains reusable functions for handling missing values, detecting outliers, and other EDA tasks.
- tests/: This folder can be used for unit tests that ensure the functionality of the code in the scripts/ directory.
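As a rough illustration of what a test in tests/ might look like, the sketch below uses the standard unittest module. The clean_data defined here is only a stand-in written for this example (it drops rows with missing values, then duplicate rows). The real function in scripts/ may handle missing values differently, and in an actual test file it would be imported rather than redefined. The file name test_eda_functions.py is hypothetical.

```python
# tests/test_eda_functions.py (hypothetical file name)
import unittest

import pandas as pd


def clean_data(df):
    """Stand-in for the real clean_data; its behavior here is an assumption.

    Drops rows that contain missing values, then drops duplicate rows.
    """
    return df.dropna().drop_duplicates().reset_index(drop=True)


class TestCleanData(unittest.TestCase):
    def setUp(self):
        # Small frame with one missing value and one duplicated row.
        self.df = pd.DataFrame(
            {"a": [1.0, 1.0, None, 4.0], "b": ["x", "x", "y", "z"]}
        )

    def test_no_missing_values_remain(self):
        cleaned = clean_data(self.df)
        self.assertFalse(cleaned.isna().any().any())

    def test_duplicates_are_removed(self):
        cleaned = clean_data(self.df)
        self.assertEqual(len(cleaned), 2)
        self.assertFalse(cleaned.duplicated().any())
```

A file like this could be run from the repository root with python -m unittest discover -s tests.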
- Install dependencies: Ensure you have Python 3.x installed and install the required packages using:
    pip install -r requirements.txt
- Running the Jupyter Notebook:
  - Navigate to the notebooks/ directory and open EDA.ipynb in Jupyter Notebook.
  - Run the notebook cells sequentially for data cleaning and EDA.
- Using the Modular Functions:
  - The scripts/EDA_functions.py file contains reusable functions that were initially part of the notebook. You can import these functions in your Python code or notebooks as follows:
      from EDA_functions import *
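The import above only resolves if Python can find EDA_functions.py, which is not the case by default when a notebook runs from notebooks/. One possible way to handle this, assuming the layout shown in the tree, is to put scripts/ on sys.path first. The helper name add_scripts_to_path is made up for this sketch and is not part of the repository.

```python
import sys
from pathlib import Path


def add_scripts_to_path(repo_root):
    """Put <repo_root>/scripts at the front of sys.path if not already there.

    Hypothetical helper, not part of the repository.
    """
    scripts_dir = str(Path(repo_root).resolve() / "scripts")
    if scripts_dir not in sys.path:
        sys.path.insert(0, scripts_dir)
    return scripts_dir


# Assumes the working directory is notebooks/, so the repository root
# is one level up.
add_scripts_to_path(Path.cwd().parent)

# After this, explicit imports work and are easier to trace than `import *`:
# from EDA_functions import clean_data, detect_outliers
```

Importing names explicitly, as in the final comment, makes it clear in the notebook where each function comes from.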
- EDA.ipynb:
  - Loads data using pandas.
  - Performs data cleaning by handling missing values and removing duplicates.
  - Detects outliers using the IQR method.
  - Uses the modular functions defined in EDA_functions.py for better code reuse and clarity.
- EDA_functions.py:
  - clean_data(df): Cleans the input DataFrame by handling missing values and duplicates.
  - detect_outliers(df, column): Detects and removes outliers from a specified column using the IQR method.
  - Additional functions for EDA tasks as needed.
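To make the IQR step concrete, here is a minimal sketch of how a function with the detect_outliers(df, column) signature could be written with pandas. It is not the code from the scripts/ folder: the 1.5 multiplier is the conventional Tukey fence rather than something stated in this README, and the factor parameter is an addition made for this sketch.

```python
import pandas as pd


def detect_outliers(df, column, factor=1.5):
    """Return df without rows whose `column` value lies outside the IQR fences.

    `factor` is a hypothetical extra parameter; 1.5 is the usual convention.
    """
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    # Keep only rows whose value falls inside [lower, upper].
    return df[df[column].between(lower, upper)]


# Example: 100 lies far above the upper fence and is dropped.
data = pd.DataFrame({"value": [10, 12, 11, 13, 12, 100]})
filtered = detect_outliers(data, "value")
```

For this sample the quartiles are 11.25 and 12.75, so the fences are 9.0 and 15.0 and only the row with 100 is removed.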
This project is open-source and available under the MIT License.