Roar reproduce #25
Merged
Changes from 3 commits (23 commits in total):

- 4bb4e1d zkhotanlou: fix: gravitational reproduce bug
- 0566741 zkhotanlou: fix: claproar flaky test
- f58d042 zkhotanlou: fix: github workflow updates
- b0b85fb HashirA123: adding reproduce for ROAR; some kinks seem to persist, need fixing
- 33c932d HashirA123: Getting results with the Linear model
- 11bbbe2 HashirA123: Shifted approach to reproduction
- 2b80eed zkhotanlou: Merge pull request #23 from zkhotanlou/zahra/fix-bug
- 45ae92f HashirA123: reduced duplication in loading german data
- 5863322 HashirA123: Reproduction with german dataset on ROAR with LR
- e72b7df HashirA123: Separate asserts for linear and mlp in reproduce
- 2223fa2 HashirA123: Changed a parameter to the loadData and loadModel
- 8c28db3 HashirA123: adding reproduce for ROAR; some kinks seem to persist, need fixing
- 2c806d2 HashirA123: Getting results with the Linear model
- eff95e6 HashirA123: Shifted approach to reproduction
- 4ac5a78 HashirA123: reduced duplication in loading german data
- 488dfb9 HashirA123: Reproduction with german dataset on ROAR with LR
- 294ae01 HashirA123: Separate asserts for linear and mlp in reproduce
- 20431a0 HashirA123: Changed a parameter to the loadData and loadModel
- f5bb515 HashirA123: Modified Linear model to be sklearn Linear
- 1857f39 HashirA123: resolved merge conflicts
- 451acdf HashirA123: Fix formatting and imports
- 7a459ef HashirA123: added results to results.csv
- 7417a2e HashirA123: Ran precommit hooks
data/catalog/_data_main/process_data/process_sba_data.py (196 additions, 0 deletions)
```python
import os
from random import seed

import pandas as pd
from sklearn.preprocessing import StandardScaler

import process_data.process_utils_data as ut

# Set the random seed so that the random permutations can be reproduced.
RANDOM_SEED = 54321
seed(RANDOM_SEED)


def get_feat_types(df):
    """Split columns into categorical and numeric feature names."""
    cat_feat = []
    num_feat = []
    for key in list(df):
        if df[key].dtype == object:
            cat_feat.append(key)
        elif len(set(df[key])) > 2:
            num_feat.append(key)
    return cat_feat, num_feat


def load_sba_data():
    # Define attributes of interest.
    attrs = [
        'Zip', 'NAICS', 'ApprovalDate', 'ApprovalFY', 'Term', 'NoEmp',
        'NewExist', 'CreateJob', 'RetainedJob', 'FranchiseCode', 'UrbanRural',
        'RevLineCr', 'ChgOffDate', 'DisbursementDate', 'DisbursementGross',
        'ChgOffPrinGr', 'GrAppv', 'SBA_Appv', 'New', 'RealEstate', 'Portion',
        'Recession', 'daysterm', 'xx'
    ]
    sensitive_attrs = []  # just an example; pick what matters for fairness
    attrs_to_ignore = []  # IDs or very sparse high-cardinality columns

    # Path to the raw SBA file.
    this_files_directory = os.path.dirname(os.path.realpath(__file__))
    file_name = os.path.join(
        this_files_directory, "..", "raw_data", "SBAcase.11.13.17.csv"
    )

    # Load the file, replace NaNs with a sentinel, and shuffle reproducibly.
    df = pd.read_csv(file_name)
    df = df.fillna(-1)
    df = df.sample(frac=1, random_state=RANDOM_SEED).reset_index(drop=True)

    # Define the target: 1 means the loan did not default.
    y = 1 - df["Default"].values

    # Dicts for storage.
    x_control = {}
    attrs_to_vals = {}

    for k in attrs:
        if k in sensitive_attrs:
            x_control[k] = df[k].tolist()
        elif k in attrs_to_ignore:
            pass
        else:
            attrs_to_vals[k] = df[k].tolist()

    # Combine sensitive and non-sensitive attributes with the label.
    all_attrs_to_vals = attrs_to_vals
    for k in sensitive_attrs:
        all_attrs_to_vals[k] = x_control[k]
    all_attrs_to_vals["label"] = y

    df_all = pd.DataFrame.from_dict(all_attrs_to_vals)

    _, num_feat = get_feat_types(df_all)

    # for key in num_feat:
    #     scaler = StandardScaler()
    #     df_all[key] = scaler.fit_transform(df_all[key].values.reshape(-1, 1))

    # ---- Create processed dataframe with integer encodings ----
    processed_df = pd.DataFrame()

    # Numeric attributes: keep directly.
    num_attrs = [
        'Zip', 'NAICS', 'ApprovalDate', 'ApprovalFY', 'Term', 'NoEmp',
        'NewExist', 'CreateJob', 'RetainedJob', 'FranchiseCode', 'UrbanRural'
    ]
    for a in num_attrs:
        processed_df[a] = df_all[a]

    # RevLineCr ("Y"/"N"/"T"/"0") -> 1, 2, 3, 4; anything else stays NaN
    # and is dropped by the notna() filter below.
    processed_df.loc[df_all["RevLineCr"] == "Y", "RevLineCr"] = 1
    processed_df.loc[df_all["RevLineCr"] == "N", "RevLineCr"] = 2
    processed_df.loc[df_all["RevLineCr"] == "T", "RevLineCr"] = 3
    processed_df.loc[df_all["RevLineCr"] == "0", "RevLineCr"] = 4
    # processed_df.loc[df_all["RevLineCr"] == -1, "RevLineCr"] = 5

    # processed_df['RevLineCr'] = pd.Categorical(processed_df['RevLineCr'])

    # Add recession, real estate, portion, etc. directly.
    for a in ['ChgOffDate', 'DisbursementDate', 'DisbursementGross',
              'ChgOffPrinGr', 'GrAppv', 'SBA_Appv', 'New', 'RealEstate',
              'Portion', 'Recession', 'daysterm', 'xx']:
        processed_df[a] = df_all[a]

    processed_df["Label"] = df_all["label"]

    # Keep only loans approved before fiscal year 2006.
    processed_df = processed_df[processed_df["ApprovalFY"] < 2006]

    # Drop rows whose RevLineCr value did not match any known category.
    processed_df = processed_df[processed_df['RevLineCr'].notna()]

    return processed_df.astype("float64")
```
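The RevLineCr encoding above writes each category with a separate `.loc` assignment, leaving unmatched values as NaN so that the later `notna()` filter drops them. A minimal sketch of the same idea with `Series.map` (the toy frame and its values are illustrative, not part of this PR):

```python
import pandas as pd

# Toy frame standing in for the raw SBA data (values are illustrative).
df_all = pd.DataFrame({"RevLineCr": ["Y", "N", "T", "0", "Q"]})

# Same integer codes as the .loc assignments; unmatched values become NaN.
codes = {"Y": 1, "N": 2, "T": 3, "0": 4}
encoded = df_all["RevLineCr"].map(codes)

# Rows with unrecognized categories are then dropped, mirroring notna().
kept = encoded[encoded.notna()].astype("float64")
print(kept.tolist())  # [1.0, 2.0, 3.0, 4.0]
```

One `map` call keeps the code table in a single dict, which makes it easier to see (and test) which raw categories are recognized.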
```python
def load_sba_data_modified():
    # Same pipeline as load_sba_data, but without the ApprovalFY < 2006 filter.
    attrs = [
        'Zip', 'NAICS', 'ApprovalDate', 'ApprovalFY', 'Term', 'NoEmp',
        'NewExist', 'CreateJob', 'RetainedJob', 'FranchiseCode', 'UrbanRural',
        'RevLineCr', 'ChgOffDate', 'DisbursementDate', 'DisbursementGross',
        'ChgOffPrinGr', 'GrAppv', 'SBA_Appv', 'New', 'RealEstate', 'Portion',
        'Recession', 'daysterm', 'xx'
    ]
    sensitive_attrs = []  # just an example; pick what matters for fairness
    attrs_to_ignore = []  # IDs or very sparse high-cardinality columns

    # Path to the raw SBA file.
    this_files_directory = os.path.dirname(os.path.realpath(__file__))
    file_name = os.path.join(
        this_files_directory, "..", "raw_data", "SBAcase.11.13.17.csv"
    )

    # Load the file, replace NaNs with a sentinel, and shuffle reproducibly.
    df = pd.read_csv(file_name)
    df = df.fillna(-1)
    df = df.sample(frac=1, random_state=RANDOM_SEED).reset_index(drop=True)

    # Define the target: 1 means the loan did not default.
    y = 1 - df["Default"].values

    # Dicts for storage.
    x_control = {}
    attrs_to_vals = {}

    for k in attrs:
        if k in sensitive_attrs:
            x_control[k] = df[k].tolist()
        elif k in attrs_to_ignore:
            pass
        else:
            attrs_to_vals[k] = df[k].tolist()

    # Combine sensitive and non-sensitive attributes with the label.
    all_attrs_to_vals = attrs_to_vals
    for k in sensitive_attrs:
        all_attrs_to_vals[k] = x_control[k]
    all_attrs_to_vals["label"] = y

    df_all = pd.DataFrame.from_dict(all_attrs_to_vals)

    _, num_feat = get_feat_types(df_all)

    # for key in num_feat:
    #     scaler = StandardScaler()
    #     df_all[key] = scaler.fit_transform(df_all[key].values.reshape(-1, 1))

    # ---- Create processed dataframe with integer encodings ----
    processed_df = pd.DataFrame()

    # Numeric attributes: keep directly.
    num_attrs = [
        'Zip', 'NAICS', 'ApprovalDate', 'ApprovalFY', 'Term', 'NoEmp',
        'NewExist', 'CreateJob', 'RetainedJob', 'FranchiseCode', 'UrbanRural'
    ]
    for a in num_attrs:
        processed_df[a] = df_all[a]

    # RevLineCr ("Y"/"N"/"T"/"0") -> 1, 2, 3, 4; anything else stays NaN
    # and is dropped by the notna() filter below.
    processed_df.loc[df_all["RevLineCr"] == "Y", "RevLineCr"] = 1
    processed_df.loc[df_all["RevLineCr"] == "N", "RevLineCr"] = 2
    processed_df.loc[df_all["RevLineCr"] == "T", "RevLineCr"] = 3
    processed_df.loc[df_all["RevLineCr"] == "0", "RevLineCr"] = 4
    # processed_df.loc[df_all["RevLineCr"] == -1, "RevLineCr"] = 5

    # processed_df['RevLineCr'] = pd.Categorical(processed_df['RevLineCr'])

    # Add recession, real estate, portion, etc. directly.
    for a in ['ChgOffDate', 'DisbursementDate', 'DisbursementGross',
              'ChgOffPrinGr', 'GrAppv', 'SBA_Appv', 'New', 'RealEstate',
              'Portion', 'Recession', 'daysterm', 'xx']:
        processed_df[a] = df_all[a]

    processed_df["Label"] = df_all["label"]

    # Drop rows whose RevLineCr value did not match any known category.
    processed_df = processed_df[processed_df['RevLineCr'].notna()]

    return processed_df.astype("float64")
```
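The two loaders differ only in whether loans are restricted to approval years before 2006, so the duplication could be folded into one parameterized function. A hedged sketch of that refactor on toy data (the function name, flag, and toy frame are hypothetical, not part of this PR):

```python
import pandas as pd

def load_sba(df: pd.DataFrame, pre_2006_only: bool = True) -> pd.DataFrame:
    """Shared tail of the pipeline; the flag reproduces the two variants."""
    processed = df.copy()
    if pre_2006_only:
        # Matches load_sba_data's ApprovalFY < 2006 restriction.
        processed = processed[processed["ApprovalFY"] < 2006]
    # Both variants drop rows with an unrecognized RevLineCr category.
    processed = processed[processed["RevLineCr"].notna()]
    return processed.astype("float64")

# Toy stand-in for the processed frame (values are illustrative).
toy = pd.DataFrame({
    "ApprovalFY": [2004, 2007, 2005],
    "RevLineCr": [1.0, 2.0, None],
})
print(len(load_sba(toy, pre_2006_only=True)))   # 1
print(len(load_sba(toy, pre_2006_only=False)))  # 2
```

With this shape, `load_sba_data()` and `load_sba_data_modified()` become one-line wrappers, so any future fix to the shared processing only needs to land once.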