Stability of bias, robustness, and fairness metrics #12

@yifanmai a few sanity checks:

How stable are these metrics, especially when computed on very few samples?
For the scenarios/datasets we added, do we need to do anything beyond specifying them in metrics_groups in schema.yaml to get these metrics working?
I noticed that the fairness and robustness scores are identical to the accuracy scores (see below).

{
    "Accuracy": {
        "CNN/DailyMail - ROUGE-2": 0.17725329031869605,
        "sam_sum - ROUGE-2": 0.17439385817025313,
        "corr2cause - EM": 0.5775,
        "ethics_justice - EM": 0.75,
        "ethics_commonsense - EM": 0.44166666666666665,
        "ethics_virtue - EM": 0.8916666666666667,
        "ethics_deontology - EM": 0.7083333333333334,
        "ethics_utilitarianism - EM": 0.6833333333333333,
        "MATH (chain-of-thoughts) - Equivalent (chain of thought)": 0.24226963512677796,
        "MATH - Equivalent": 0.0
    },
    "Robustness": {
        "ethics_justice - EM (Robustness)": 0.75,
        "ethics_commonsense - EM (Robustness)": 0.44166666666666665,
        "ethics_virtue - EM (Robustness)": 0.8916666666666667,
        "ethics_deontology - EM (Robustness)": 0.7083333333333334,
        "ethics_utilitarianism - EM (Robustness)": 0.6833333333333333
    },
    "Fairness": {
        "ethics_justice - EM (Fairness)": 0.75,
        "ethics_commonsense - EM (Fairness)": 0.44166666666666665,
        "ethics_virtue - EM (Fairness)": 0.8916666666666667,
        "ethics_deontology - EM (Fairness)": 0.7083333333333334,
        "ethics_utilitarianism - EM (Fairness)": 0.6833333333333333
    },
    "Bias": {
        "sam_sum - Stereotypes (race)": 0.6666666666666667,
        "sam_sum - Stereotypes (gender)": 0.308974358974359,
        "sam_sum - Representation (race)": 0.4,
        "sam_sum - Representation (gender)": 0.0034482758620689724,
        "CNN/DailyMail - Stereotypes (race)": 0.6666666666666669,
        "CNN/DailyMail - Stereotypes (gender)": 0.4337256760333683,
        "CNN/DailyMail - Representation (race)": 0.5393939393939393,
        "CNN/DailyMail - Representation (gender)": 0.2041036717062635
    }
}
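
To make the two concerns above concrete, here is a minimal sketch in Python. It is not taken from the benchmark code. The helper names `identical_to_accuracy` and `binomial_standard_error` are hypothetical, and the sample size `n=120` is only an assumption for illustration, since the issue does not state how many instances were evaluated. The first helper lists the perturbed metrics whose score matches the corresponding accuracy score exactly. The second gives a rough binomial standard error for an exact-match score, as a first look at how noisy a small-sample estimate can be.

```python
import math


def identical_to_accuracy(summary, group, tol=1e-12):
    """List base metric names in `group` whose score equals the Accuracy score.

    Assumes keys in `group` look like "<accuracy key> (<group>)",
    as in the summary posted above.
    """
    accuracy = summary["Accuracy"]
    suffix = f" ({group})"
    matches = []
    for name, score in summary.get(group, {}).items():
        base = name[: -len(suffix)] if name.endswith(suffix) else name
        if base in accuracy and math.isclose(
            score, accuracy[base], rel_tol=0.0, abs_tol=tol
        ):
            matches.append(base)
    return matches


def binomial_standard_error(p, n):
    """Standard error of a proportion p estimated from n independent samples."""
    return math.sqrt(p * (1.0 - p) / n)


# A two-metric excerpt of the summary above, to keep the example short.
summary = {
    "Accuracy": {
        "ethics_justice - EM": 0.75,
        "ethics_commonsense - EM": 0.44166666666666665,
    },
    "Robustness": {
        "ethics_justice - EM (Robustness)": 0.75,
        "ethics_commonsense - EM (Robustness)": 0.44166666666666665,
    },
    "Fairness": {
        "ethics_justice - EM (Fairness)": 0.75,
        "ethics_commonsense - EM (Fairness)": 0.44166666666666665,
    },
}

for group in ("Robustness", "Fairness"):
    print(group, identical_to_accuracy(summary, group))

# n=120 is an assumed sample size, not a value reported in this issue.
print(round(binomial_standard_error(0.75, 120), 4))
```

If every perturbed metric shows up in the first list, one possible explanation (not confirmed here) is that no perturbed instances were actually evaluated, so the robustness and fairness numbers fall back to the unperturbed accuracy. Under the assumed sample size, the standard error for a score of 0.75 is roughly 0.04, so differences of a few points between runs would be within noise.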
