This repository is part of a larger project: Adversarial AI in Wealth Management, a Capstone Project for IE University.
- Pandas: used for dataset manipulation (Polars could have been used instead). `pd.get_dummies` gives a quick one-hot-encoding-like result; ordinal variables are treated differently.
- `ydata_profiling.ProfileReport`: used to automatically generate the EDA report.
- `sklearn.preprocessing.OrdinalEncoder`: used to handle ordinal variables.
- `sklearn.impute.KNNImputer`: used to fill in missing values. Not recommended for larger datasets, but for a dataset of ~15,000 records it works fine.
- Attached to the repository as index.html. Check it out here.
| ydata_profiling alert | Alert Type |
|---|---|
| Income has 2250 (15.0%) missing values | Missing |
| Credit Score has 2250 (15.0%) missing values | Missing |
| Loan Amount has 2250 (15.0%) missing values | Missing |
| Assets Value has 2250 (15.0%) missing values | Missing |
| Number of Dependents has 2250 (15.0%) missing values | Missing |
| Previous Defaults has 2250 (15.0%) missing values | Missing |
| Debt-to-Income Ratio has unique values | Unique |
| Years at Current Job has 727 (4.8%) zeros | Zeros |
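These alerts can be reproduced directly in pandas. A minimal sketch on a small synthetic frame (the values below are hypothetical stand-ins for the loan dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loan dataset
data = pd.DataFrame({
    "Income": [50000, np.nan, 72000, np.nan],
    "Years at Current Job": [0, 5, 0, 12],
})

# Count and percentage of missing values per column (the "Missing" alerts)
missing = data.isna().sum()
missing_pct = data.isna().mean() * 100

# Count of zeros in a column (the "Zeros" alert on Years at Current Job)
zeros = (data["Years at Current Job"] == 0).sum()
```

Here `missing["Income"]` is 2 (50.0%) and `zeros` is 2 for the toy frame; on the full dataset the same calls reproduce the ydata_profiling counts above.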
- Risk Rating (Low, Medium, High)
- Education Level (High School, Bachelor's, Master's, PhD)
- Payment History (Poor, Fair, Good, Excellent)
- Number of Dependents
- Previous Defaults
- Gender
- Marital Status
- Loan Purpose
- Unemployment Status
- Age
- Income
- Credit Score
- Loan Amount
- Years at Current Job
- Debt-to-Income Ratio
- Assets Value
- Number of Dependents
- Previous Defaults
```python
from ydata_profiling import ProfileReport

report = ProfileReport(data)
report.to_file("report.html")  # Save report as an HTML file
```
At present, I cannot handle location data. I attempted to concatenate the City and Country columns, but was unsuccessful, so the location columns are dropped.
```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import KNNImputer

# Drop location columns (not handled yet)
data = data.drop(['City', 'State', 'Country'], axis=1)

# Explicit category order for each ordinal variable
ordinal_mapping = {
    'Education Level': ['High School', "Bachelor's", "Master's", "PhD"],
    'Risk Rating': ["Low", "Medium", "High"],
    'Payment History': ["Poor", "Fair", "Good", "Excellent"]
}
ordinal_cols = list(ordinal_mapping.keys())
encoder = OrdinalEncoder(categories=list(ordinal_mapping.values()))
data[ordinal_cols] = encoder.fit_transform(data[ordinal_cols])

# One-hot encode the remaining (nominal) categorical columns
data = pd.get_dummies(data)

# Impute missing values from the 5 nearest neighbours
imputer = KNNImputer(n_neighbors=5)
data = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
```
The final dataset is exported in `.parquet` format.