This page contains benchmark results for the baselines and other methods on the datasets contained in this repository.
Please feel free to submit the results of your model as a pull request.
All the results are for models using only the context feature to select the correct response. Models using extra contexts are not reported here (yet).
For a description of the baseline systems, see baselines/README.md.
These are results on the data from 2015 to 2018 inclusive, (TABLE_REGEX="^201[5678]_[01][0-9]$")$").
| 1-of-100 accuracy | |
|---|---|
| Baselines | |
| TF_IDF | 26.4% |
| BM25 | 27.5% |
| USE_SIM | 36.6% |
| USE_MAP | 40.8% |
| USE_LARGE_SIM | 41.4% |
| USE_LARGE_MAP | 47.7% |
| ELMO_SIM | 12.5% |
| ELMO_MAP | 20.6% |
| BERT_SMALL_SIM | 17.1% |
| BERT_SMALL_MAP | 24.5% |
| BERT_LARGE_SIM | 14.8% |
| BERT_LARGE_MAP | 24.0% |
| USE_QA_SIM | 46.3% |
| USE_QA_MAP | 46.6% |
| Other models | |
| N-gram dual-encoder [1] | 61.3% |
| ConveRT [2] | 68.3% |
| 1-of-100 accuracy | |
|---|---|
| Baselines | |
| TF_IDF | 10.9% |
| BM25 | 10.9% |
| USE_SIM | 13.6% |
| USE_MAP | 15.8% |
| USE_LARGE_SIM | 14.9% |
| USE_LARGE_MAP | 18.0% |
| ELMO_SIM | 9.5% |
| ELMO_MAP | 13.3% |
| BERT_SMALL_SIM | 13.8% |
| BERT_SMALL_MAP | 17.5% |
| BERT_LARGE_SIM | 12.2% |
| BERT_LARGE_MAP | 16.8% |
| USE_QA_SIM | 16.8% |
| USE_QA_MAP | 17.1% |
| Other models | |
| Fine-tuned N-gram dual-encoder [1] | 30.6% |
| ConveRT (not fine-tuned) [2] | 21.5% |
| ConveRT (not fine-tuned) MAP [2] | 23.1% |
| 1-of-100 accuracy | |
|---|---|
| Baselines | |
| TF_IDF | 51.8% |
| BM25 | 52.3% |
| USE_SIM | 47.6% |
| USE_MAP | 54.4% |
| USE_LARGE_SIM | 51.3% |
| USE_LARGE_MAP | 61.9% |
| ELMO_SIM | 16.0% |
| ELMO_MAP | 35.5% |
| BERT_SMALL_SIM | 27.8% |
| BERT_SMALL_MAP | 45.8% |
| BERT_LARGE_SIM | 25.9% |
| BERT_LARGE_MAP | 44.1% |
| USE_QA_SIM | 67.0% |
| USE_QA_MAP | 70.7% |
| Other models | |
| N-gram dual-encoder [1] | 71.3% |
| ConveRT (not fine-tuned) [2] | 67.0% |
| ConveRT (not fine-tuned) MAP [2] | 71.6% |
| ConveRT (fine-tuned) | 84.3% |
Note the result for [1] here differs from the original paper, as we found a bug in the evaluation. Updated versions of the papers are in progress.
[1] A Repository of Conversational Datasets. Henderson et al. Proceedings of the Workshop on NLP for Conversational AI, 2019.
[2] ConveRT: Efficient and Accurate Conversational Representations from Transformers. Henderson et al. arXiv pre-print 2019. These results can be reproduced with baselines/run_baseline.py --method CONVERT_[SIM|MAP].