Question Answering Chatbot: Project tracking

Alexis Smirnov edited this page Mar 22, 2020 · 7 revisions

High-level requirements

  • Given a user question, we want to display a relevant passage (the answer) extracted from the corpus.
  • The answer needs to be sourced from a medically accurate and reputable public source, such as a government website.
  • The first corpus source shall be quebec.ca, but the system shall not be limited to this source.
  • The system shall be deployed at scale in the form of a publicly accessible chatbot.

Overall approach

Several approaches are being considered, but at a high level we intend to create a level of indirection called a "master question". A master question is a text passage that is (a) linked to an answer and (b) semantically similar to the user's question. Once master questions are available, the task is to select the most appropriate master question for the user's input (called the user question) and then display the answer linked to that master question.
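The selection step can be illustrated with a minimal sketch. It assumes a plain bag-of-words cosine similarity as a stand-in for whatever similarity model the project ends up using; the `MASTER_QUESTIONS` data and the function names are hypothetical and for illustration only.

```python
import math
import re
from collections import Counter

# Hypothetical master questions mapped to answer passages (illustrative data only).
MASTER_QUESTIONS = {
    "what are the symptoms of covid-19": "Placeholder answer passage about symptoms.",
    "how can i protect myself from the virus": "Placeholder answer passage about prevention.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def answer(user_question, master_questions=MASTER_QUESTIONS):
    """Return (master question, answer) for the most similar master question."""
    user_vec = Counter(tokenize(user_question))
    best = max(
        master_questions,
        key=lambda mq: cosine_similarity(user_vec, Counter(tokenize(mq))),
    )
    return best, master_questions[best]


if __name__ == "__main__":
    print(answer("What symptoms does COVID-19 cause?"))
```

A learned sentence representation would likely replace the word-overlap score in practice; the indirection (user question, then master question, then answer) stays the same.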

Workstreams

The overall project is separated into workstreams. Each workstream can advance somewhat in parallel, and different strategies can be explored within each one. Each workstream is roughly responsible for a specific step in the training and inference pipeline.

Milestone 1

Do the absolute minimum to get the end-to-end question-answering system up and running on covid19.dialogue.co.

Milestone 2

Prioritize the next steps based on our view of the potential gain, and how much we can parallelize.

How we work

The work is coordinated in the Slack #covi1d19-chatbot channel. Work items are logged as GitHub issues and tracked on the project page.

Corpus construction (Joumana)

The goal is to have the cleanest, most comprehensive corpus possible.

Passage extraction (answer corpus) (Joumana)

The goal is to properly segment the corpus into sections (passages) that each focus on a single question or topic. This is the corpus from which questions will be answered, and it can be a subset of the Covid Corpus. Only trusted channels will be included, e.g., Quebec, Canada, CDC, and other countries.

Several approaches can be explored:

  • Treat all paragraphs under a heading as one passage. issue
  • Treat each paragraph as a single answer (heuristic approach).
  • Some paragraphs refer to text in the preceding paragraph and are incomplete as standalone answers. We may improve consistency by detecting coreferences and merging the paragraphs they connect.
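The first two segmentation strategies can be sketched as follows. This is a minimal illustration that assumes the crawled page has already been reduced to a list of `(kind, text)` tuples, where `kind` is either `"heading"` or `"paragraph"`; that input format and the function names are assumptions, not part of the project.

```python
def passages_by_heading(blocks):
    """Group all paragraphs under a heading into one passage."""
    passages = []
    heading, paragraphs = None, []
    for kind, text in blocks:
        if kind == "heading":
            # A new heading closes the passage collected so far.
            if paragraphs:
                passages.append({"heading": heading, "text": "\n\n".join(paragraphs)})
            heading, paragraphs = text, []
        else:
            paragraphs.append(text)
    if paragraphs:
        passages.append({"heading": heading, "text": "\n\n".join(paragraphs)})
    return passages


def passages_by_paragraph(blocks):
    """Treat every paragraph as its own passage, remembering its heading."""
    passages = []
    heading = None
    for kind, text in blocks:
        if kind == "heading":
            heading = text
        else:
            passages.append({"heading": heading, "text": text})
    return passages


if __name__ == "__main__":
    # Illustrative input only.
    sample = [
        ("heading", "Symptoms"),
        ("paragraph", "First paragraph."),
        ("paragraph", "Second paragraph."),
        ("heading", "Prevention"),
        ("paragraph", "Third paragraph."),
    ]
    print(len(passages_by_heading(sample)), len(passages_by_paragraph(sample)))
```

Keeping the heading alongside each passage preserves context that a standalone paragraph may lack. The coreference-based merging is not covered by this sketch.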

Master question generation (Joumana)

The goal is to have the largest possible set of master questions, each accurately mapped to a properly defined answer passage.

  • Heuristic-based (combine header, the first sentence in a paragraph, etc.) issue
  • Increase the similarity between master questions and user questions by normalizing the text (convert to lowercase, rewrite).
  • Generate questions from a text passage using a generative model.
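The heuristic and normalization ideas above can be sketched together. This is a minimal example under stated assumptions: the master question is formed by joining the heading with the first sentence of the paragraph, and the sentence split is a naive punctuation-based rule that will mis-split abbreviations such as "e.g.". The function names are hypothetical.

```python
import re


def first_sentence(paragraph):
    """Return the text up to the first '.', '!' or '?' that ends a sentence.

    Naive rule: the punctuation mark must be followed by whitespace or
    the end of the text. Abbreviations are not handled.
    """
    stripped = paragraph.strip()
    match = re.search(r"(.+?[.!?])(\s|$)", stripped, flags=re.DOTALL)
    return match.group(1) if match else stripped


def normalize(text):
    """Lowercase the text and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def heuristic_master_question(heading, paragraph):
    """Combine the heading and the paragraph's first sentence, normalized."""
    return normalize(f"{heading} {first_sentence(paragraph)}")


if __name__ == "__main__":
    print(heuristic_master_question("Symptoms", "Fever is common.  Cough may occur."))
```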

Dataset engineering (Joumana, Alexis)

The goal is to have the biggest, most current training and test datasets.

  • Topics dataset. List of topics or question clusters. issue
  • User questions dataset. Collect questions that users are asking about COVID. issue
  • Translate user questions using DeepL. issue
  • Tweets that mention COVID. issue
  • Questions that people ask on Twitter. issue
  • Questions generated via linguistic rules and vocabularies.

Labelling/annotations (Joumana, Alexis)

The goal is to establish an effective and time-efficient labelling process for human labellers.

  • Label user questions as either complete and answerable, or unanswerable in their current form (incomplete, missing information such as personal medical history, geolocation, etc.). issue
  • Label user questions with high-level topics (cluster). issue
  • Covid QA annotations. Label user questions with master questions. Annotate a reasonably sized corpus in the same format as NaturalQuestions, i.e., each question paired with a paragraph. This will be built on top of the Covid Answers Corpus. issue
  • Create a test set of QA annotations. issue
  • Create a training set of QA annotations for fine-tuning (also contains unanswerable questions).
  • Verify the mapping between each master question and its answer.
  • Create a labelling step in the chatbot to ask users if the answer is good or not. issue
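One possible shape for a single annotation record is sketched below. It is a simplified question-paragraph pairing, not the actual NaturalQuestions schema; the class and field names are hypothetical, and unanswerable questions are represented here by leaving the master question and passage empty.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QAAnnotation:
    """One labelled user question (hypothetical record layout)."""

    user_question: str
    master_question: Optional[str] = None  # None when no master question applies
    passage: Optional[str] = None  # None when the question is unanswerable
    topic: Optional[str] = None  # optional high-level topic (cluster) label

    @property
    def answerable(self) -> bool:
        """True only when both a master question and a passage are attached."""
        return self.master_question is not None and self.passage is not None


if __name__ == "__main__":
    # Illustrative records only.
    labelled = QAAnnotation(
        user_question="What symptoms should I watch for?",
        master_question="what are the symptoms of covid-19",
        passage="Placeholder answer passage about symptoms.",
        topic="symptoms",
    )
    unanswerable = QAAnnotation(user_question="Should I go to work tomorrow?")
    print(labelled.answerable, unanswerable.answerable)
```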

Language Model (Siva Reddy)

The goal here is to build the language model best suited to represent the language (English and French) used when talking about COVID and related subjects.

  • Covid Language Model Corpus. Collect a large corpus of text about Covid. The most relevant source is probably directives from government agencies. Scientific literature (AI2 has already released such a corpus) and Twitter are also suitable. issue
  • Covid BERT. Since BERT is trained on BookCorpus and Wikipedia, it may not work well for Covid language. Continue BERT training on a large Covid corpus. issue
  • Translate the NaturalQuestions corpus to French. https://github.com/dialoguemd/covid-19/issues/265
  • Domain-adapt DistilBERT to a large Covid corpus. https://github.com/dialoguemd/covid-19/issues/266

Inference (Siva Reddy)

Given the user input, retrieve the answer.

Application integration (LP, Alexis)

Deployment (LP, Maxime Belanger)

  • Deploy the Elasticsearch service.
  • Set up the indexing pipeline (GitHub, CircleCI).