Fairness Evaluation

This document reports the findings of the fairness evaluation of the models used to find tech grants.

The analysis used only the models' test data, so none of the data used to evaluate fairness was used to train any of these models.

All of the analysis was carried out in the notebooks/fairness.ipynb notebook.

This analysis has two goals:

To find out how fair the models are.
To pick the best model, i.e. one which performs well on both the test data and the fairness evaluation.

Models evaluated and test scores

We started by considering 8 models: the 4 models which performed well enough on the test data, and 4 ensemble models built from these. The table below summarises how well these models did on the test data.

Model name	Sample size	accuracy	f1	precision_score	recall_score
Ensemble predictions - 1 models	107	0.841	0.860	0.788	0.945
Ensemble predictions - 2 models	107	0.879	0.885	0.862	0.909
Ensemble predictions - 3 models	107	0.869	0.873	0.873	0.873
Ensemble predictions - 4 models	107	0.841	0.832	0.913	0.764
count_SVM_201022 predictions	107	0.841	0.847	0.839	0.855
bert_SVM_scibert_201022 predictions	107	0.879	0.879	0.904	0.855
bert_SVM_bert_201022 predictions	107	0.869	0.881	0.825	0.945
tfidf_log_reg_201022 predictions	107	0.841	0.844	0.852	0.836

'Ensemble predictions - 1 models' and 'Ensemble predictions - 4 models' do not satisfy the metric cutoffs of 0.8 (F1) and 0.82 (recall and precision): the former has a precision of 0.788 and the latter a recall of 0.764. We therefore disregarded these.
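The cutoff check can be reproduced with a few lines of Python. This is a minimal sketch rather than the code from the notebook: the scores are copied from the table above, the names `test_scores` and `passes_cutoffs` are made up for illustration, and we assume the cutoffs are inclusive (a score equal to the cutoff passes).

```python
# Test-set scores copied from the table above: (f1, precision, recall).
test_scores = {
    "Ensemble predictions - 1 models": (0.860, 0.788, 0.945),
    "Ensemble predictions - 2 models": (0.885, 0.862, 0.909),
    "Ensemble predictions - 3 models": (0.873, 0.873, 0.873),
    "Ensemble predictions - 4 models": (0.832, 0.913, 0.764),
    "count_SVM_201022 predictions": (0.847, 0.839, 0.855),
    "bert_SVM_scibert_201022 predictions": (0.879, 0.904, 0.855),
    "bert_SVM_bert_201022 predictions": (0.881, 0.825, 0.945),
    "tfidf_log_reg_201022 predictions": (0.844, 0.852, 0.836),
}

# Cutoffs stated in the text.
F1_CUTOFF = 0.8
PRECISION_RECALL_CUTOFF = 0.82


def passes_cutoffs(f1, precision, recall):
    """Return True if all three metrics meet their cutoff (assumed inclusive)."""
    return (
        f1 >= F1_CUTOFF
        and precision >= PRECISION_RECALL_CUTOFF
        and recall >= PRECISION_RECALL_CUTOFF
    )


# Models that fail at least one cutoff.
rejected = [name for name, scores in test_scores.items() if not passes_cutoffs(*scores)]
print(rejected)
```

With the scores above, only the 1-model and 4-model ensembles are rejected.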

Furthermore, 'tfidf_log_reg_201022 predictions' and 'count_SVM_201022 predictions' both fall in the worst 50% of the models for both precision and recall, so we disregarded these too.

Thus, we evaluate fairness and select the 'best' model from 4 options:

'Ensemble predictions - 2 models'
'Ensemble predictions - 3 models'
'bert_SVM_scibert_201022 predictions'
'bert_SVM_bert_201022 predictions'

Fairness groups

To calculate fairness we needed to decide which groups of the data to compare model performance on. From the options available in the grants data we chose to compare 2 things:

Is the recipient organisation one of the Golden Triangle universities (which we defined as University College London, Imperial College London, King's College London, University of Oxford and University of Cambridge) or not?
Is the region Greater London, UK-based but not Greater London, or international?

Grouping	Group	Number of data points in test data
Recipient organisation	Golden triangle	41
Recipient organisation	Not golden triangle	66
Region	Greater London	30
Region	International	10
Region	UK, not greater London	67
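As an illustration of how per-group scores can be obtained, the sketch below splits labelled predictions by a grouping column and computes precision, recall and F1 for each group. It is not the notebook's code: the record fields (`region`, `label`, `prediction`), the function names and the toy records are all hypothetical, and metrics with a zero denominator are reported as 0.0 by assumption.

```python
from collections import defaultdict


def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumption: an undefined metric (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def metrics_by_group(records, group_key):
    """Return {group: (precision, recall, f1)} for the given grouping column."""
    grouped = defaultdict(lambda: ([], []))
    for record in records:
        truths, preds = grouped[record[group_key]]
        truths.append(record["label"])
        preds.append(record["prediction"])
    return {group: precision_recall_f1(t, p) for group, (t, p) in grouped.items()}


# Made-up toy records, not taken from the grants data.
toy_records = [
    {"region": "Greater London", "label": 1, "prediction": 1},
    {"region": "Greater London", "label": 0, "prediction": 1},
    {"region": "International", "label": 1, "prediction": 1},
    {"region": "International", "label": 1, "prediction": 0},
]

for group, scores in metrics_by_group(toy_records, "region").items():
    print(group, [round(s, 3) for s in scores])
```

The same function can be called with a different `group_key` to split by recipient organisation instead of region.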

Fairness results

The F1 scores for the models split by recipient organisation are:

The F1 scores for the models split by region are:

Since the scores for the International group are so much lower than the other results, and the group contains a much smaller sample of only 10 data points, we chose to remove it from the analysis. This lets us compare the results for the remaining region groups in more detail:

Fairness range of results

A fair model would be one in which the metrics are similar across the different groups of the data. We therefore calculate the difference between the maximum and minimum value of each metric across the groups in a grouping. For example, for the Recipient organisation grouping and the F1 metric this would be:

max('Golden triangle F1', 'Not golden triangle F1') - min('Golden triangle F1', 'Not golden triangle F1')
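The same calculation as a small Python helper, offered as a sketch: `metric_range` is a made-up name, and the two example F1 values are the ones this document reports for 'Ensemble predictions - 3 models' in the section on the best model. The Region values in the table below appear to be computed without the International group, consistent with its removal above.

```python
def metric_range(group_scores):
    """Difference between the best and worst score across the groups."""
    values = list(group_scores.values())
    return max(values) - min(values)


# F1 by recipient organisation for 'Ensemble predictions - 3 models'.
f1_by_recipient = {"Golden triangle": 0.872, "Not golden triangle": 0.873}
print(round(metric_range(f1_by_recipient), 3))  # 0.001
```

Using max minus min also works unchanged for groupings with more than two groups, such as Region.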

Model name	Grouping	f1 difference	precision_score difference	recall_score difference
Ensemble predictions - 2 models	Recipient organisation	0.011	0.071	0.058
	Region	0.020	0.081	0.135
Ensemble predictions - 3 models	Recipient organisation	0.001	0.036	0.034
	Region	0.022	0.048	0.002
bert_SVM_bert_201022 predictions	Recipient organisation	0.038	0.067	0.003
	Region	0.012	0.082	0.081
bert_SVM_scibert_201022 predictions	Recipient organisation	0.021	0.023	0.019
	Region	0.021	0.093	0.123

We can visualise the precision (lighter) and recall (darker) differences for each model and grouping (reds - recipient organisation, blues - region):

Selecting the best model

As mentioned earlier, we want to pick a model with both the best test scores and the best fairness scores.

To find an overall fairness score, we added the metric differences across the groupings, where a lower sum means the metric varies less between groups. For example, if the F1 difference for the recipient organisation grouping is 0.011 and for the region grouping it is 0.020, then we add these to get 0.031.
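The summation can be sketched as follows. The differences are copied from the range table above, and `range_table` and `summed_differences` are illustrative names rather than anything from the notebook. Because the inputs here are already rounded to three decimal places, some sums can differ by 0.001 from the table below, which was presumably computed from unrounded values.

```python
# Differences copied from the range table: (f1, precision, recall) per grouping.
range_table = {
    "Ensemble predictions - 2 models": {
        "Recipient organisation": (0.011, 0.071, 0.058),
        "Region": (0.020, 0.081, 0.135),
    },
    "Ensemble predictions - 3 models": {
        "Recipient organisation": (0.001, 0.036, 0.034),
        "Region": (0.022, 0.048, 0.002),
    },
    "bert_SVM_bert_201022 predictions": {
        "Recipient organisation": (0.038, 0.067, 0.003),
        "Region": (0.012, 0.082, 0.081),
    },
    "bert_SVM_scibert_201022 predictions": {
        "Recipient organisation": (0.021, 0.023, 0.019),
        "Region": (0.021, 0.093, 0.123),
    },
}


def summed_differences(groupings):
    """Add each metric's difference across all groupings of one model."""
    return tuple(round(sum(column), 3) for column in zip(*groupings.values()))


summed = {model: summed_differences(g) for model, g in range_table.items()}

# For each metric, find the model with the smallest summed difference.
for index, metric in enumerate(("f1", "precision_score", "recall_score")):
    best = min(summed, key=lambda model: summed[model][index])
    print(metric, best, summed[best][index])
```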

The lowest scores for all three of the metrics are found in the 'Ensemble predictions - 3 models' model.

Model name	f1 sum difference	precision_score sum difference	recall_score sum difference
Ensemble predictions - 3 models	0.024	0.083	0.035
Ensemble predictions - 2 models	0.031	0.152	0.194
bert_SVM_scibert_201022 predictions	0.042	0.116	0.142
bert_SVM_bert_201022 predictions	0.050	0.150	0.084

A recap of the test results shows that this model also scores relatively highly across all metrics:

Model name	f1	precision_score	recall_score
Ensemble predictions - 2 models	0.885	0.862	0.909
Ensemble predictions - 3 models	0.873	0.873	0.873
bert_SVM_scibert_201022 predictions	0.879	0.904	0.855
bert_SVM_bert_201022 predictions	0.881	0.825	0.945

We therefore select 'Ensemble predictions - 3 models' as our best model.

Fairness results of best model

A recap of all the fairness results for this model:

Grouping	Group	Sample size	f1	precision_score	recall_score
Recipient organisation	Golden triangle	41	0.872	0.850	0.895
Recipient organisation	Not golden triangle	66	0.873	0.886	0.861
Region grouped	Greater London	30	0.867	0.867	0.867
Region grouped	International	10	0.750	0.600	1.000
Region grouped	UK, not greater London	67	0.889	0.914	0.865

Conclusion:

We pick as our best model the ensemble model which looks for agreement in 3 out of 4 models.
This model performs equally well on the F1, precision and recall metrics for the test data (all 0.87).
This model is less precise (0.85 vs 0.89) but has a higher recall (0.90 vs 0.86) for Golden Triangle universities than for other organisations; overall the F1 scores are the same (0.87).
This model performs better for the UK outside of Greater London than for Greater London (0.89 vs 0.87 F1, 0.91 vs 0.87 precision, similar recall).
This model performs badly on International regions compared to Greater London or elsewhere in the UK: the precision is 0.6 and the recall is 1, although the International sample contains only 10 data points.
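For reference, one way to read "agreement in 3 out of 4 models" is as a voting rule. The sketch below assumes the ensemble predicts the positive class when at least 3 of the 4 models do; this document does not spell out the exact rule, so treat the threshold semantics and the name `ensemble_predict` as assumptions.

```python
def ensemble_predict(model_predictions, min_agreement=3):
    """Predict 1 if at least `min_agreement` models predict 1 (assumed rule)."""
    positive_votes = sum(model_predictions)
    return 1 if positive_votes >= min_agreement else 0


# One grant, four binary model predictions (made-up values).
print(ensemble_predict([1, 1, 1, 0]))  # 1: three models agree
print(ensemble_predict([1, 1, 0, 0]))  # 0: only two models agree
```

Under this reading, raising `min_agreement` makes the ensemble stricter, which would fit the pattern in the test scores where precision rises and recall falls from the 1-model to the 4-model ensemble.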
