Hi all, me again on this point. I thought about it a bit, and I'll try to explain here why I would report precision, recall and F1 for label 1 instead of the macro average.
Remember that we are in a binary classification scenario with very unbalanced labels, and we want to know which method is best at correctly predicting the 1 label (i.e. best at finding correct occurrences of a specific sense). Now, consider this setting, where you have gold labels and three approaches: a majority-class baseline (which always predicts 0), a random baseline, and "our approach", which sometimes predicts correctly. We want to know whether we are better than the baselines at capturing the 1 cases.
```
gold     = [0,0,0,0,1]
random   = [1,0,1,0,1]
majority = [0,0,0,0,0]
our      = [1,0,0,0,1]
```
If we compute precision, recall and F1 for each class, for each method, this is what we get (reported as [precision, recall, F1]):
```
random
label_1 = [0.333, 1.0, 0.5]
label_0 = [1.0, 0.5, 0.667]
macro   = [0.667, 0.75, 0.706]
```
Note! I computed the macro F1 score manually from the values of precision and recall. If you use scikit-learn out of the box you will get `[0.667, 0.75, 0.583]`, where 0.583 is not the harmonic mean of p (0.667) and r (0.75) but the average of the F1 of label_1 (0.5) and label_0 (0.667), because of this issue. Reporting `[0.667, 0.75, 0.583]`, I believe, would make the reader (and especially the reviewer) very confused, so I added a patch in #144, at least to compute the macro F1 correctly.
You can try it yourself; you'll get the same behaviour for all the other methods:
```python
from sklearn.metrics import precision_recall_fscore_support

gold     = [0, 0, 0, 0, 1]
random   = [1, 0, 1, 0, 1]
majority = [0, 0, 0, 0, 0]
our      = [1, 0, 0, 0, 1]

method = random  # swap in majority or our to reproduce the other tables

# per-class precision, recall, F1 (treating 1, then 0, as the positive label)
print("label_1", [round(x, 3) for x in precision_recall_fscore_support(gold, method, average='binary', pos_label=1)[:3]])
print("label_0", [round(x, 3) for x in precision_recall_fscore_support(gold, method, average='binary', pos_label=0)[:3]])
# scikit-learn's macro average (its macro F1 is the mean of the per-class F1s)
print("macro", [round(x, 3) for x in precision_recall_fscore_support(gold, method, average='macro')[:3]])
```
```
majority
label_1 = [0.0, 0.0, 0.0]
label_0 = [0.8, 1.0, 0.889]
macro   = [0.4, 0.5, 0.444]
```
```
our
label_1 = [0.5, 1.0, 0.667]
label_0 = [1.0, 0.75, 0.857]
macro   = [0.75, 0.875, 0.808]
```
Now, remember that our goal is to assess which method is better at finding 1s. If we look at the macro averages, it seems that random and our are not that distant, and overall the task seems easy (if you just predict randomly, you get it right around 70% of the time, across the different metrics):
```
random   = [0.667, 0.75, 0.706]
majority = [0.4, 0.5, 0.444]
our      = [0.75, 0.875, 0.808]
```
However, if you look at label 1:
```
random   = [0.333, 1.0, 0.5]
majority = [0.0, 0.0, 0.0]
our      = [0.5, 1.0, 0.667]
```
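(A small loop to reproduce this label-1 table in one go, with the same data as above; `zero_division=0` just silences the 0/0 precision warning for majority, and needs a reasonably recent scikit-learn:)

```python
from sklearn.metrics import precision_recall_fscore_support

gold = [0, 0, 0, 0, 1]
methods = {
    "random":   [1, 0, 1, 0, 1],
    "majority": [0, 0, 0, 0, 0],
    "our":      [1, 0, 0, 0, 1],
}

# precision, recall, F1 for label 1 only
for name, pred in methods.items():
    p, r, f, _ = precision_recall_fscore_support(
        gold, pred, average='binary', pos_label=1, zero_division=0)
    print(name, [round(v, 3) for v in (p, r, f)])
```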
The story is quite different, and closer to reality (especially for precision): the task is actually hard, and if you guess randomly you will return lots of false positives. Majority is a useless approach for label 1, because the majority class is label 0, so it will never return anything for label 1.
If we have to suggest to a historian the best method for finding occurrences of a specific sense of *machine* (which is the goal of our ACL paper), then based on these numbers we would tell them that with our approach 50% of the retrieved results will be correct (precision), with perfect recall, while if they go with random only 33% of the retrieved results will be correct. The macro scores are not informative for the final user, because the final user does not care how we perform on label 0.
To conclude, I would report the results for label 1, because I think it is the most meaningful metric for the task (even if the numbers will all be a bit lower, they will represent the experimental setting and the goal of the paper more precisely). I can write this part of the paper justifying it.
@kasparvonbeelen @BarbaraMcG @mcollardanuy @kasra-hosseini @GiorgiatolfoBL let me know what you think, and especially if you spot any error, as I might just be missing something. However, if you prefer to go with macro, no problem, but then we should maybe change the argumentation in the paper a bit, so that the metric is more in line with the problem.