
FAQ Chatbot Notes

Alexis Smirnov edited this page Mar 22, 2020 · 1 revision

Goal: Build a chatbot that can answer questions for which the FAQ has an answer.

Members: MSSS (811 Quebec), Dialogue, Mila

Datasets

Q&A:

  • https://stanfordnlp.github.io/coqa/
  • https://microsoft.github.io/msmarco/
  • https://rajpurkar.github.io/SQuAD-explorer/

Medical Q&A:

  • https://github.com/durakkerem/Medical-Question-Answer-Datasets
  • https://github.com/abachaa/MedQuAD
  • https://github.com/abachaa/MEDIQA2019
  • https://github.com/abachaa/MeQSum
  • https://github.com/abachaa/LiveQA_MedicalTask_TREC2017

COVID:

  • https://www.kaggle.com/tags/covid19
  • https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge
  • https://pages.semanticscholar.org/coronavirus-research
  • https://www.cdc.gov/coronavirus/2019-ncov/faq.html
  • https://covid-19.dimensions.ai
  • https://www.who.int/emergencies/diseases/novel-coronavirus-2019/global-research-on-novel-coronavirus-2019-ncov
  • https://www.health.gov.au/resources/publications/coronavirus-covid-19-frequently-asked-questions

Twitter

COVID-français:

  • https://www.who.int/fr/emergencies/diseases/novel-coronavirus-2019/advice-for-public/q-a-coronaviruses (multilingual)
  • http://www.emro.who.int/fr/health-topics/corona-virus/questions-and-answers.html (multilingual)
  • https://www.vd.ch/toutes-les-actualites/hotline-et-informations-sur-le-coronavirus/coronavirus-covid-19-reponses-du-medecin-cantonal-aux-questions-frequentes/
  • https://montreal.consulfrance.org/CORONAVIRUS-QUESTIONS-FREQUENTES
  • https://gouv.nc/info-coronavirus-covid-19/questions-frequentes-sur-le-coronavirus-covid-19
  • https://ca.ambafrance.org/FAQ-Coronavirus-COVID-19-questions-frequentes

Useful links

  • Coronavirus FAQ Chatbot Template Now Available to Health Organizations
  • World Health Organization’s WhatsApp bot texts you coronavirus facts

Notes

Chatbot system needs to handle French and English (and possibly a mix of both).

Text-based system:

Needs to be robust to spelling mistakes.
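
One possible way to get some robustness to spelling mistakes is to snap each word of the user question to the closest word of a known vocabulary before matching. The sketch below uses Python's standard `difflib` module; the vocabulary, the function names and the 0.8 cutoff are assumptions made for illustration, not part of the project.

```python
import difflib

# Hypothetical vocabulary; in practice it could be extracted from the FAQ text.
VOCABULARY = ["coronavirus", "symptoms", "quarantine", "fever", "symptômes", "fièvre"]

def correct_token(token, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the token unchanged if none is close enough."""
    matches = difflib.get_close_matches(token.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token

def correct_question(question):
    """Apply correct_token to every whitespace-separated word of the question."""
    return " ".join(correct_token(token) for token in question.split())

print(correct_question("what are coronavirs symtoms"))
```

Words that are not close to any vocabulary entry are left untouched, so the correction only affects near misses of known terms.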

Speech-based system:

Needs to be robust to accents.

Ideas for a Q&A system for Quebec.ca:

The simplest, fastest, and safest solution is to mimic the chatbot at https://www.canada.ca/en/public-health/services/diseases/coronavirus-disease-covid-19.html:

This solution assumes that the content of the website is organized in a tree-like structure. The chatbot guides the user through this structure to reach the information of interest. It doesn’t require handling free-form text, so it is simple, but it’s not “natural”. It is very easy to deploy and keeps working when the website is updated.

One could envision letting the user enter questions at the end without immediately providing answers (similar to Dialogue’s Chloe system). Those questions can help MSSS determine how to augment the information available on the website.
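
The tree-guided approach can be sketched as follows. This is a minimal illustration under stated assumptions: the `Node` class, the `SITE` tree, its section titles and the helper names are all hypothetical and do not reflect the actual structure of any website.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    title: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)

# Hypothetical site tree with placeholder text.
SITE = Node("Home", children=[
    Node("Symptoms", children=[
        Node("Adults", text="(section text for adults)"),
        Node("Children", text="(section text for children)"),
    ]),
    Node("Travel", text="(section text about travel)"),
])

def menu(node):
    """Options the chatbot offers at this node: the titles of its children."""
    return [child.title for child in node.children]

def navigate(root, choices):
    """Follow a sequence of menu indices from the root and return the node reached."""
    node = root
    for index in choices:
        node = node.children[index]
    return node

# Free-form questions are only recorded for later review, never answered directly.
unanswered_questions = []

def leave_question(question):
    unanswered_questions.append(question)

print(menu(SITE))
print(navigate(SITE, [0, 1]).text)
```

Because the bot only ever shows menu options and stored section text, it never has to interpret free-form input.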

Stats can be generated to indicate the sections that users are most interested in. Those sections could be further refined or updated more regularly.
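
Such stats could be as simple as counting how often each section is reached. A minimal sketch, assuming a hypothetical log of visited section titles:

```python
from collections import Counter

def section_popularity(visit_log):
    """Return (section, visit count) pairs, most visited first."""
    return Counter(visit_log).most_common()

# Hypothetical log: one entry per section a user ended up reading.
log = ["Symptoms", "Travel", "Symptoms", "Isolation", "Symptoms", "Travel"]
print(section_popularity(log))
```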

A solution which is a bit more advanced but requires human intervention is to build a set of “master questions”, with each question linked to a section, a paragraph or a sentence in the Quebec.ca text. Note that the master questions and their links to the text need to be built manually and might have to be updated each time Quebec.ca is updated. Two types of approaches can then be considered:

Classification task:

  • Requires a labeled data set containing (user question, corresponding master question) pairs.
  • Model would be trained to predict the right master question.
  • Model would need to be retrained when master questions are modified or augmented.
  • We can use a model pretrained on other large corpora (excluding the classification layer).
  • We would need either two models (one for English and one for French) or a single model for both languages, fed an extra language id “token”. Whichever solution is chosen, the master questions can be written in one language.
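
To make the classification setup concrete, here is a toy sketch in which a small naive Bayes classifier stands in for the pretrained neural model mentioned above. The class name, the master-question ids and the training pairs are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().replace("?", " ").split()

class MasterQuestionClassifier:
    """Multinomial naive Bayes with add-one smoothing; labels are master-question ids."""

    def fit(self, pairs):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter()
        for user_question, master_id in pairs:
            self.label_counts[master_id] += 1
            self.word_counts[master_id].update(tokenize(user_question))
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, user_question):
        total = sum(self.label_counts.values())
        # Words never seen during training are ignored.
        words = [w for w in tokenize(user_question) if w in self.vocabulary]

        def log_score(label):
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(self.label_counts[label] / total)
            for word in words:
                score += math.log((counts[word] + 1) / denominator)
            return score

        return max(self.label_counts, key=log_score)

# Hypothetical (user question, master question id) pairs.
PAIRS = [
    ("what are the signs of covid", "symptoms"),
    ("do i have symptoms if i cough", "symptoms"),
    ("is fever a symptom", "symptoms"),
    ("how long should i stay home", "isolation"),
    ("how many days of quarantine", "isolation"),
    ("when can i stop isolating", "isolation"),
]

classifier = MasterQuestionClassifier().fit(PAIRS)
print(classifier.predict("is cough one of the signs"))
```

The set of labels is fixed at training time, which is why adding or changing a master question means calling `fit` again with new labeled pairs.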

Similarity task (chosen solution):

  • Model generates an embedding for the master question and an embedding for the user question, then uses a dot product (or another similarity-based function) to find the closest master question.
  • Can start with a pretrained model (e.g. BERT), although we will probably need a small COVID-19 dataset to fine-tune and test the model. This dataset would need to contain (user question, corresponding master question) pairs.
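
The matching step can be sketched as below. The `embed` function here is a deliberately crude stand-in (normalized word counts) for the embedding a pretrained model such as BERT would produce; the master questions and function names are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: L2-normalized word counts, stored as a sparse dict."""
    counts = Counter(text.lower().replace("?", " ").split())
    norm = math.sqrt(sum(value * value for value in counts.values()))
    return {word: value / norm for word, value in counts.items()} if norm else {}

def dot(u, v):
    """Dot product of two sparse vectors."""
    return sum(weight * v.get(word, 0.0) for word, weight in u.items())

def closest_master_question(user_question, master_questions):
    """Return the master question whose embedding has the highest dot product with the user's."""
    user_vector = embed(user_question)
    return max(master_questions, key=lambda question: dot(user_vector, embed(question)))

# Hypothetical master questions.
MASTER_QUESTIONS = [
    "What are the symptoms of COVID-19?",
    "How long should I self-isolate?",
    "Where can I get tested?",
]

print(closest_master_question("where do i get tested", MASTER_QUESTIONS))
```

In this sketch, adding a master question only requires embedding it; nothing is retrained. With a real model, the master-question embeddings could be computed once and cached.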

Handling of languages:

Solution #1:

  • Have 2 versions of each master question, one version in English and one in French.
  • Train 2 separate models, one for each language.

Solution #2:

  • Define the master questions in one language.
  • Train a single model for both languages. This model would be fed the user or master question + a token specifying the language id.
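
Feeding the language id as a token could look like the sketch below. The token strings `[EN]` and `[FR]` and the function name are assumptions; a real model would need these tokens added to its vocabulary.

```python
# Hypothetical language tokens; the exact strings are an illustrative choice.
LANGUAGE_TOKENS = {"en": "[EN]", "fr": "[FR]"}

def with_language_token(text, language):
    """Prefix the question with a token identifying its language."""
    return f"{LANGUAGE_TOKENS[language]} {text}"

print(with_language_token("Quels sont les symptômes ?", "fr"))
print(with_language_token("What are the symptoms?", "en"))
```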

Solution #3:

  • Define a reference language (either French or English). Use a machine translation model to map data from the other language into the reference language.
  • Train a single model on the reference language.

Usual extractive Q&A solutions:

  • Requires a labeled set of (user question, text span) pairs.
  • To reduce latency, this solution would first require identifying the section of the Quebec.ca text in which the relevant span will be searched for.
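
The two-stage idea (first pick a section, then look for the answer only inside it) can be sketched as follows. Word overlap stands in for both the section selector and the span-extraction model, and the section texts are made-up placeholders, not Quebec.ca content.

```python
def tokens(text):
    """Lowercased set of words with basic punctuation stripped."""
    for mark in "?.,":
        text = text.replace(mark, " ")
    return set(text.lower().split())

def select_section(question, sections):
    """Return the title of the section sharing the most words with the question."""
    question_words = tokens(question)
    return max(sections, key=lambda title: len(question_words & tokens(sections[title])))

def best_sentence(question, section_text):
    """Crude stand-in for a span-extraction model: pick the most overlapping sentence."""
    question_words = tokens(question)
    sentences = [s.strip() for s in section_text.split(".") if s.strip()]
    return max(sentences, key=lambda sentence: len(question_words & tokens(sentence)))

# Placeholder sections, invented for this example.
SECTIONS = {
    "Symptoms": "Fever and cough are listed as symptoms. Call the information line if they get worse.",
    "Travel": "Travellers returning from abroad are asked to self-isolate. Check travel advisories before booking.",
}

question = "Is cough one of the symptoms?"
section = select_section(question, SECTIONS)
print(section, "->", best_sentence(question, SECTIONS[section]))
```

Only the selected section is passed to the second stage, which is where the latency saving would come from with a real (and much slower) extractive model.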
