
Roar reproduce#25

Merged
zkhotanlou merged 23 commits into charmlab:main from HashirA123:roar-reproduce
Nov 5, 2025

Conversation

Contributor

@HashirA123 HashirA123 commented Oct 3, 2025

I believe this reproduction is ready to be merged into main.

As of now:

  • The results are in line with the paper, although not identical: key metrics such as validity and cost are similar and support the same conclusions. The remaining differences can be attributed to differences in model structure (i.e. the layers and neurons in our MLP versus theirs).
  • The results are consistent: each run of the reproduction produces the same output. This is likely the result of the bug fixes by Zhara.
  • For a slightly simpler reproduction, I focus on only one of the datasets used in the paper, the German dataset, since it is smaller and has fewer features overall, including fewer categorical ones.
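For intuition, validity and cost for a linear model can be sanity-checked against a closed-form recourse step. This is a minimal numpy sketch, not the ROAR implementation (ROAR additionally optimizes the recourse to remain valid under shifts of the model weights):

```python
import numpy as np

def closed_form_recourse(w, b, x, margin=1e-3):
    """Minimal L2 recourse for a linear classifier sign(w.x + b).

    Moves x just across the decision boundary; the L2 norm of the
    step is the recourse "cost", and "validity" checks whether the
    prediction actually flipped. Illustrative only -- ROAR's recourse
    is also robust to perturbations of (w, b).
    """
    w = np.asarray(w, dtype=float)
    x = np.asarray(x, dtype=float)
    score = w @ x + b
    # Shortest L2 step to the boundary, plus a small margin past it.
    delta = -(score + np.sign(score) * margin) / (w @ w) * w
    return x + delta, np.linalg.norm(delta)

w, b = np.array([1.0, -2.0]), 0.5
x = np.array([1.0, 1.0])            # score = 1 - 2 + 0.5 = -0.5 (negative class)
x_cf, cost = closed_form_recourse(w, b, x)
valid = (w @ x_cf + b) > 0          # did the prediction flip?
```

Comparing this minimal cost against the cost ROAR reports gives a feel for how much the robustness requirement inflates recourse cost.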

HashirA123 and others added 3 commits October 22, 2025 19:25
The linear model seems to be giving acceptable
validity results. The accuracy and cost of recourse
are not the same, though these could be chalked up to
the model differences (technically the linear model is a LR, so getting
model accuracy the same as the paper's should be possible; why it isn't
there yet definitely has something to do with the preprocessing of the data).

ROAR is having trouble getting recourse for the MLP (non-linear model). This needs to
be investigated further, given that the application of ROAR is almost the
same as in the paper. The only difference between the linear and non-linear cases
is the use of LIME. That is something to look at next.
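For reference, the non-linear path works roughly like this: fit a local linear approximation of the MLP around the factual, then run the linear recourse search against those coefficients. A LIME-style sketch in plain numpy (the actual code uses the LIME library, which also handles categorical features and feature selection):

```python
import numpy as np

def local_linear_surrogate(predict_fn, x, n_samples=2000, sigma=0.5, seed=0):
    """LIME-style local approximation of a black-box model around x.

    Perturbs x with Gaussian noise, weights samples by proximity to x,
    and fits a weighted least-squares linear model. A linear recourse
    method can then be applied to the returned (w, b).
    """
    rng = np.random.default_rng(seed)
    X = x + rng.normal(scale=sigma, size=(n_samples, x.size))
    y = predict_fn(X)
    weights = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * sigma**2))
    # Weighted least squares with an intercept column.
    A = np.hstack([X, np.ones((n_samples, 1))]) * np.sqrt(weights)[:, None]
    coef, *_ = np.linalg.lstsq(A, y * np.sqrt(weights), rcond=None)
    return coef[:-1], coef[-1]   # (w, b) of the local linear model

# A non-linear black box; the surrogate should recover the local slope
# direction [2, -1] around the origin (scaled down by tanh saturation).
black_box = lambda X: np.tanh(X @ np.array([2.0, -1.0]))
w, b = local_linear_surrogate(black_box, np.array([0.0, 0.0]))
```

If recourse fails only for the MLP, a poor or degenerate local fit from this step (e.g. near-zero coefficients in a flat region) would be a natural first suspect.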
I've decided to add a utils.py file within the ROAR method
directory to hold functions that replicate those from the
ModelCatalog and DataCatalog classes. This is to ensure
that the reproduction script can accurately recreate the
data preprocessing and model training steps as they were
originally intended, rather than doing something higher-level
like in the quick_start script.
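As a rough illustration of the catalog-style steps those helpers replicate (hypothetical function; the actual utils.py mirrors the DataCatalog/ModelCatalog logic from this repository):

```python
import numpy as np

def preprocess(X_num, X_cat, n_categories):
    """Sketch of catalog-style preprocessing: standardize continuous
    columns and one-hot encode categorical ones. Hypothetical helper,
    shown only to illustrate the kind of steps utils.py reproduces.
    """
    mean, std = X_num.mean(axis=0), X_num.std(axis=0)
    X_scaled = (X_num - mean) / np.where(std == 0, 1.0, std)
    # One one-hot block per categorical column.
    onehots = [np.eye(k)[col] for col, k in zip(X_cat.T, n_categories)]
    return np.hstack([X_scaled] + onehots)

X_num = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_cat = np.array([[0], [1], [2]])
X = preprocess(X_num, X_cat, n_categories=[3])
```

Getting details like these exactly right (scaling statistics, encoding order) is precisely what can move model accuracy away from the paper's numbers.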
Reproduction of ROAR on the German dataset seems to
reproduce validity and cost results very similar to the original
paper. However, the model accuracies are much worse than expected.
Technically the LR should be functionally the same as the one in the
paper. I may try using the sklearn implementation rather than the PyTorch
one to see if that helps.
I added a parameter "modified" to load a modified version of the
SBA/german dataset and updated the relevant functions and classes to handle this new parameter.

The results for ROAR seem in line with the paper
in terms of validity. I would try a larger sample of
factuals to be sure.

Sometimes the tests fail; I need to make sure that the
method runs the same every time, in terms of factuals and
model training. Right now there isn't consistency in that.
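One way to pin this down is to seed every source of randomness at the top of the reproduction script. A minimal sketch (the torch line is commented out to keep this dependency-free, but the benchmark's PyTorch models would need it too):

```python
import os
import random

import numpy as np

def set_global_seed(seed: int = 0) -> None:
    """Pin the RNGs a reproduction run touches so factual sampling and
    model training are identical across runs."""
    random.seed(seed)
    np.random.seed(seed)
    # Only affects hashing if set before the interpreter starts.
    os.environ["PYTHONHASHSEED"] = str(seed)
    # For the PyTorch models in the benchmark, also:
    # torch.manual_seed(seed); torch.use_deterministic_algorithms(True)

set_global_seed(0)
run_a = [random.random() for _ in range(3)] + list(np.random.rand(3))
set_global_seed(0)
run_b = [random.random() for _ in range(3)] + list(np.random.rand(3))
```

With every RNG seeded, repeated runs draw the same factuals and train to the same weights, so the tests should pass deterministically.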
The ROAR method now expects the linear model to be from sklearn for the linear case.
This is to be in line with the paper's implementation.

I've also noticed that the results seem to be consistent now,
unlike before, so the test cases should always pass.
@HashirA123 HashirA123 marked this pull request as ready for review October 29, 2025 01:32
@HashirA123 HashirA123 requested a review from zkhotanlou October 29, 2025 01:32
Collaborator

@zkhotanlou zkhotanlou left a comment


Looks good to me and is ready to merge! Before that, could you please run your method on the possible datasets and models using run_experiments.py, and add the results to results.csv?

Collaborator

@zkhotanlou zkhotanlou left a comment


Also, please run the pre-commit hooks to resolve the linting and styling issues; some of the checks didn’t pass.

@HashirA123 HashirA123 requested a review from zkhotanlou November 3, 2025 14:42
@zkhotanlou zkhotanlou merged commit aaf9232 into charmlab:main Nov 5, 2025
1 check passed
Jamie001129 pushed a commit to Jamie001129/recourse_benchmarks that referenced this pull request Nov 16, 2025
This PR implements the ROAR (Robust Algorithmic Recourse) method, along with unit tests to ensure the results are reproducible as reported in the original paper [1].
The reproducibility badge level is Level 2, as the results on the German dataset have been successfully reproduced using both MLP and logistic regression models.

[1] Upadhyay, S., Joshi, S., & Lakkaraju, H. (2021). Towards Robust and Reliable Algorithmic Recourse. Advances in Neural Information Processing Systems, 34, 16926–16937.