Question Answering Chatbot: Project tracking

High-level requirements

Given a user question we want to display a relevant passage (the answer) extracted from the corpus.
The answer needs to be sourced from a medically accurate and reputable public source, such as a government website.
The first corpus source shall be quebec.ca, but the system shall not be limited to this source.
The system shall be deployed at scale in the form of a publicly accessible chatbot.

Overall approach

Several approaches are being considered, but at a high level we intend to create a level of indirection, called a “master question”. A master question is a text passage that is a) linked to the answer and b) semantically similar to the user’s question. Once master questions are available, the task is to select the most appropriate master question for the user input (called the user question), and then display the answer linked to that master question.
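A minimal sketch of this indirection, assuming a simple token-overlap (Jaccard) score as a stand-in for semantic similarity. The `MasterQuestion` class, the `select_master_question` function, and the sample entries are hypothetical names used for illustration; they are not taken from the project code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MasterQuestion:
    text: str    # passage that is semantically close to what users ask
    answer: str  # passage from the corpus that is shown to the user


def tokens(text: str) -> set[str]:
    # Lowercase and strip basic punctuation; a real system would likely use a learned encoder.
    return {w.strip("?.,!").lower() for w in text.split() if w.strip("?.,!")}


def jaccard(a: set[str], b: set[str]) -> float:
    # Size of the intersection over size of the union; 0.0 when both sets are empty.
    return len(a & b) / len(a | b) if a | b else 0.0


def select_master_question(user_question, master_questions):
    # Pick the master question whose tokens overlap most with the user question.
    user_tokens = tokens(user_question)
    return max(master_questions, key=lambda mq: jaccard(user_tokens, tokens(mq.text)))


# Placeholder entries for illustration only (assumption, not project data).
MASTER_QUESTIONS = [
    MasterQuestion("What are the symptoms of COVID-19?", "<answer passage about symptoms>"),
    MasterQuestion("How can I get tested?", "<answer passage about testing>"),
]

best = select_master_question("what symptoms does covid-19 have", MASTER_QUESTIONS)
print(best.answer)
```

The user never sees the master question itself; it only serves as the key that routes a free-form question to a vetted answer passage.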

Workstreams

The overall project is split into workstreams. Workstreams can advance somewhat in parallel, and different strategies can be explored within each one. Each workstream is roughly responsible for a specific step in the training and inference pipeline.

Milestone 1

Do the absolute minimum to get the end-to-end question-answering system up and running on covid19.dialogue.co.

Milestone 2

Prioritize the next steps based on our view of the potential gain, and how much we can parallelize.

How we work

The work is coordinated on the Slack #covi1d19-chatbot channel. Work items are logged as GitHub issues. The work is tracked on the project page.

Corpus construction (Joumana)

The goal is to have the cleanest, most comprehensive corpus possible.

Scrape FAQ pages. Given a publicly accessible website (quebec.ca), produce a complete and continuously updated source of content in a machine-readable format.
Crawl referred pages. Given a FAQ page, determine which links to crawl and how to parse the content.
Merge/resolve duplicated content. Detect duplicated content and resolve it.
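The duplicate-detection step could start as simply as the sketch below, which assumes that duplicates differ only in case and whitespace. The `normalize` and `deduplicate` helpers are hypothetical names, and near-duplicates (reworded or partially overlapping text) would need a fuzzier comparison than this.

```python
import hashlib
import re


def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so trivial formatting differences do not matter.
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(passages):
    """Keep the first occurrence of each passage, comparing normalized text."""
    seen = set()
    unique = []
    for passage in passages:
        digest = hashlib.sha256(normalize(passage).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(passage)
    return unique
```

For example, `deduplicate(["Wash your hands.", "wash  your hands. ", "Stay home."])` keeps only the first and last entries.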

Passage extraction (answer corpus) (Joumana)

The goal is to properly segment the corpus into sections (passages) that each focus on a single question or topic. This is the corpus from which questions will be answered. It can be a subset of the Covid Corpus. Only trusted channels will be included, e.g., Quebec, Canada, CDC, and other countries.

Several approaches can be explored:

Treat all paragraphs under a heading as one passage. issue
Treat each single paragraph as a single answer (heuristic approach).
Some paragraphs refer to the text in the paragraph above and are incomplete as standalone answers. We may improve consistency by detecting coreferences and merging the paragraphs they connect.
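The first approach above (one passage per heading) can be sketched as follows. The input format is an assumption made for illustration: a list of `(kind, text)` tuples in document order, where `kind` is either `"heading"` or `"paragraph"`. The function name `passages_by_heading` is hypothetical.

```python
def passages_by_heading(blocks):
    """Group consecutive paragraphs under the heading that precedes them.

    `blocks` is an assumed input format: (kind, text) tuples where kind is
    "heading" or "paragraph", in document order.
    """
    passages = []
    heading, paragraphs = None, []
    for kind, text in blocks:
        if kind == "heading":
            # A new heading closes the passage collected so far.
            if paragraphs:
                passages.append({"heading": heading, "text": "\n\n".join(paragraphs)})
            heading, paragraphs = text, []
        else:
            paragraphs.append(text)
    # Flush the last passage once the input is exhausted.
    if paragraphs:
        passages.append({"heading": heading, "text": "\n\n".join(paragraphs)})
    return passages
```

Paragraphs that appear before any heading end up in a passage whose heading is `None`, and headings with no paragraphs produce no passage.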

Master question generation (Joumana)

The goal is to have the largest possible set of master questions, each accurately mapped to a properly defined answer passage.

Heuristic-based (combine header, the first sentence in a paragraph, etc.) issue
Increase similarity of master questions (convert text to lowercase, re-write)
Generate questions using a generative model based on a text passage.
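As a rough illustration of the heuristic and normalization ideas listed above, the sketch below combines a header with the first sentence of its paragraph and lowercases the result. The function names are hypothetical, and the sentence splitter is deliberately naive (it will mis-split on abbreviations such as "e.g.").

```python
import re


def first_sentence(paragraph: str) -> str:
    # Naive split on sentence-ending punctuation followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", paragraph.strip(), maxsplit=1)[0]


def heuristic_master_question(header: str, paragraph: str) -> str:
    # Combine the header with the first sentence, then lowercase for consistency.
    combined = f"{header.strip()} {first_sentence(paragraph)}"
    return combined.lower()
```

For instance, the header "Symptoms" and a paragraph starting with "Fever is common." would produce "symptoms fever is common.".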

Dataset engineering (Joumana, Alexis)

The goal is to have the biggest, most current training and test datasets.

Topics dataset. List of topics or question clusters. issue
User questions dataset. Collect questions that users are asking about COVID. issue
Translate user questions using DeepL issue
Tweets that mention COVID. issue
Questions that people ask on Twitter. issue
Questions generated via linguistic rules and vocabularies
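Generating questions from linguistic rules and vocabularies could look like the template expansion below. The templates, the vocabulary, and the `generate_questions` name are all hypothetical examples, not the project's actual rules.

```python
from itertools import product

# Hypothetical templates and vocabulary, for illustration only.
TEMPLATES = [
    "what are the {noun} of {disease}?",
    "how do I know the {noun} of {disease}?",
]
VOCAB = {
    "noun": ["symptoms", "risks"],
    "disease": ["covid-19", "coronavirus"],
}


def generate_questions(templates, vocab):
    """Fill every template with every combination of vocabulary values."""
    keys = sorted(vocab)
    questions = []
    for template in templates:
        for values in product(*(vocab[k] for k in keys)):
            questions.append(template.format(**dict(zip(keys, values))))
    return questions


questions = generate_questions(TEMPLATES, VOCAB)
```

The number of generated questions grows multiplicatively with the vocabulary size, so some filtering for plausibility would probably be needed before using them as training data.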

Labelling/annotations (Joumana, Alexis)

The goal is to set up an effective and time-efficient labelling setup for human labellers

Label user questions as either complete and answerable, or unanswerable in their current form (incomplete, or missing context such as personal medical history, geolocation, etc.). issue
Label user questions with high-level topics (cluster). issue
Covid QA annotations. Label user questions with master questions. Annotate a reasonably sized corpus in the same format as NaturalQuestions, i.e., a question paired with a paragraph. This will be built on top of the Covid Answers Corpus. issue
Create a test set of QA annotations issue
Create a training set of QA annotations for fine-tuning (also contains unanswerable questions).
Verify the mapping between the master question and the answer
Create a labelling step in the chatbot to ask users if the answer is good or not. issue

Language Model (Siva Reddy)

The goal here is to build the language model best suited to represent the language (English and French) that is used when talking about COVID and related subjects.

Covid Language Model Corpus. Collect a large corpus of text talking about Covid. The most relevant corpus is probably directives from government agencies. Scientific literature (AI2 has already released this) and Twitter are also fine. issue
Covid Bert. Since Bert is trained on BookCorpus and Wikipedia, it may not work well for Covid language. Continue Bert training on a large Covid corpus. issue
Translate NaturalQuestions corpus to French. https://github.com/dialoguemd/covid-19/issues/265
Domain-adapt DistilBert to a large Covid corpus. https://github.com/dialoguemd/covid-19/issues/266

Inference (Siva Reddy)

Given the user input, retrieve the answer

Experiment with Google search as a baseline for answer retrieval. https://github.com/dialoguemd/covid-19/issues/248
Baseline: ElasticSearch text index. This will serve as the baseline model. https://github.com/dialoguemd/covid-19/issues/267
Paragraph retrieval. Given a question, retrieve a relevant paragraph
Document retrieval. Given a question, retrieve a relevant document
Use Covid QA annotations to get accuracy numbers (F1) https://github.com/dialoguemd/covid-19/issues/268
Re-rank tf-idf results using Bert. Given the top-10 search results, use Bert to rank them and pick the best possible answer. The model is trained on NaturalQuestions. https://github.com/dialoguemd/covid-19/issues/269
Encoding similarity ranking. Index the BERT encodings of master questions, then look up the closest one using the encoded user question as the key.
Bert span-based QA model. The models above return a whole paragraph as an answer. This model helps identify spans that are more direct answers. We can bold these spans in the paragraph.
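The encoding similarity ranking item above could be prototyped along these lines. This is only a sketch: the toy 3-dimensional vectors stand in for real BERT encodings, cosine similarity is an assumed choice of distance, and the `rank_master_questions` name and the `mq-...` identifiers are hypothetical.

```python
import math


def cosine(u, v):
    # Cosine similarity between two equal-length vectors; 0.0 if either is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_master_questions(query_vec, index, top_k=3):
    """Return the ids of the top_k master questions closest to the query vector."""
    scored = sorted(index.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [mq_id for mq_id, _ in scored[:top_k]]


# Toy vectors standing in for BERT encodings of master questions (assumption).
INDEX = {
    "mq-symptoms": [0.9, 0.1, 0.0],
    "mq-testing": [0.1, 0.9, 0.0],
    "mq-travel": [0.0, 0.2, 0.9],
}

closest = rank_master_questions([1.0, 0.2, 0.0], INDEX, top_k=2)
```

A linear scan like this is fine for a few thousand master questions; a larger index would likely call for an approximate nearest-neighbour structure instead.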

Application integration (LP, Alexis)

Improve ES document structure to have all the data the app needs to display the answer. https://github.com/dialoguemd/covid-19/issues/270
Integrate ES API into Rasa policy https://github.com/dialoguemd/covid-19/issues/247
Call Rasa from the app to get the answer

Deployment (LP, Maxime Belanger)

Deploy ElasticSearch service
Set up indexing pipeline (GitHub, CircleCI)

dialogue.co | covid19.dialogue.co
