Question Answering Chatbot: Project tracking
- Given a user question we want to display a relevant passage (the answer) extracted from the corpus.
- The answer needs to be sourced from a medically accurate and reputable public source, such as a government website.
- The first corpus source shall be quebec.ca, but the system shall not be limited to this source.
- The system shall be deployed at scale in the form of a publicly accessible chatbot.
Several approaches are considered, but at a high level we intend to create a level of indirection, called a “master question”. A master question is a text passage that is (a) linked to an answer and (b) semantically similar to the user’s question. Once "master questions" are available, the task is to select the most appropriate "master question" given the user input (called the user question), and then display the answer linked to that master question.
The overall project is divided into workstreams. Each workstream can advance somewhat in parallel, and different strategies can be explored within a workstream. Each workstream is roughly responsible for a specific step in the training and inference pipeline.
Do the absolute minimum to get the end-to-end question-answering system up and running on covid19.dialogue.co.
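The master-question indirection can be illustrated with a minimal sketch. It assumes a plain bag-of-words cosine similarity as a stand-in for semantic similarity, which the real system would presumably replace with a learned encoder. The function names, the `MASTER_TO_ANSWER` mapping, and the placeholder passages are hypothetical and used only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def answer_for(user_question, master_to_answer):
    """Select the master question closest to the user question.

    Returns the selected master question together with its linked answer.
    """
    query = Counter(tokenize(user_question))
    best = max(
        master_to_answer,
        key=lambda mq: cosine(query, Counter(tokenize(mq))),
    )
    return best, master_to_answer[best]


# Hypothetical master questions and placeholder answers, not real corpus data.
MASTER_TO_ANSWER = {
    "What are the symptoms of COVID-19?": "Placeholder passage about symptoms.",
    "How can I get tested for COVID-19?": "Placeholder passage about testing.",
}

master, answer = answer_for("where do i get tested", MASTER_TO_ANSWER)
print(master)
print(answer)
```

Word overlap is only a crude proxy: a user question phrased with different vocabulary than every master question would score zero against all of them, which is one reason to compare encodings instead.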
Prioritize the next steps based on our view of the potential gain, and how much we can parallelize.
The work is coordinated on the Slack #covi1d19-chatbot channel. The work items are logged as GitHub issues. The work is tracked on the project page.
The goal is to have the cleanest, most comprehensive corpus possible.
- Scrape FAQ pages. Given a publicly accessible website (quebec.ca), scrape it to obtain a complete and continuous source of content in a machine-readable format.
- Crawl referred pages. Given a FAQ page, determine which links to crawl and how to parse the content.
- Merge/resolve duplicated content. Detect duplicated content and resolve it.
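One possible way to detect duplicated content is sketched below. It assumes two checks: a hash of the normalized text for exact duplicates, and Jaccard similarity over word 3-grams for near duplicates. The function names, the shingle size, and the 0.8 threshold are illustrative assumptions, not choices made by the project.

```python
import hashlib
import re


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def fingerprint(text):
    """Hash of the normalized text, used to catch exact duplicates."""
    return hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()


def shingles(text, k=3):
    """Set of word k-grams of the normalized text."""
    words = normalize(text).split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between the shingle sets of two texts."""
    sa, sb = shingles(a), shingles(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def deduplicate(passages, threshold=0.8):
    """Keep the first occurrence of each passage.

    A passage is dropped if it is an exact duplicate (same fingerprint)
    or a near duplicate (Jaccard >= threshold) of a passage already kept.
    """
    kept, seen = [], set()
    for passage in passages:
        fp = fingerprint(passage)
        if fp in seen:
            continue
        if any(jaccard(passage, other) >= threshold for other in kept):
            continue
        seen.add(fp)
        kept.append(passage)
    return kept
```

The pairwise comparison is quadratic in the number of passages, which is acceptable for a few FAQ pages; a larger corpus would likely need an approximate method such as MinHash.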
The goal is to properly segment the corpus into sections (passages) that are each focused on a single question or topic. The resulting passages form the corpus on which questions will be answered. This can be a subset of the Covid Corpus. Only trusted channels will be included, e.g., Quebec, Canada, the CDC, and other countries.
Several approaches can be explored:
- Treat all paragraphs under a heading as one passage. issue
- The heuristic approach of treating a single paragraph as a single answer.
- Some paragraphs refer to the text in the paragraph above and are incomplete as standalone answers. We may increase consistency by detecting coreferences and merging the paragraphs they link.
The goal is to have the largest possible set of master questions, accurately mapped to properly defined answer passages.
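The first two segmentation strategies above can be compared with a small sketch. It assumes the scraped page has already been parsed into a list of `(kind, text)` blocks, where `kind` is `"heading"` or `"paragraph"`; that input format, the function names, and the sample text are hypothetical.

```python
def passages_by_heading(blocks):
    """Treat all paragraphs under a heading as one passage."""
    passages = []
    heading, paragraphs = None, []
    for kind, text in blocks:
        if kind == "heading":
            # A new heading closes the passage collected so far.
            if paragraphs:
                passages.append({"heading": heading, "text": "\n".join(paragraphs)})
            heading, paragraphs = text, []
        else:
            paragraphs.append(text)
    if paragraphs:
        passages.append({"heading": heading, "text": "\n".join(paragraphs)})
    return passages


def passages_by_paragraph(blocks):
    """Treat every paragraph as its own passage, remembering its heading."""
    passages = []
    heading = None
    for kind, text in blocks:
        if kind == "heading":
            heading = text
        else:
            passages.append({"heading": heading, "text": text})
    return passages


# Hypothetical parsed page, for illustration only.
BLOCKS = [
    ("heading", "Symptoms"),
    ("paragraph", "Fever and cough are common."),
    ("paragraph", "They may appear gradually."),
    ("heading", "Testing"),
    ("paragraph", "Call the hotline."),
]

print(len(passages_by_heading(BLOCKS)))
print(len(passages_by_paragraph(BLOCKS)))
```

In the sample, "They may appear gradually." is not a usable answer on its own, which is the situation the coreference-based merging is meant to address.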
- Heuristic-based (combine header, the first sentence in a paragraph, etc.) issue
- Increase similarity of master questions (convert text to lowercase, re-write)
- Generate questions using a generative model based on a text passage.
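The heuristic approach, together with the lowercase normalization, could look roughly like the following sketch. It assumes a passage is available as a heading plus a paragraph, and that sentences can be split on terminal punctuation; the function names and the exact combination of candidates are illustrative assumptions.

```python
import re


def first_sentence(paragraph):
    """Return the text up to the first sentence-ending punctuation mark."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip(), maxsplit=1)
    return parts[0]


def normalize(text):
    """Lowercase the text and collapse whitespace."""
    return " ".join(text.lower().split())


def master_question_candidates(heading, paragraph):
    """Build candidate master questions from a heading and its paragraph.

    Candidates are the heading, the first sentence, and their combination,
    normalized and with duplicates removed while preserving order.
    """
    sentence = first_sentence(paragraph)
    raw = [heading, sentence, f"{heading}. {sentence}"]
    seen, candidates = set(), []
    for text in raw:
        normalized = normalize(text)
        if normalized and normalized not in seen:
            seen.add(normalized)
            candidates.append(normalized)
    return candidates
```

For example, `master_question_candidates("Symptoms", "Fever is common. Rest helps.")` yields three candidates, each of which would be linked to the same answer passage.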
The goal is to have the biggest, most current training and test datasets.
- Topics dataset. List of topics or question clusters. issue
- User questions dataset. Collect questions that users are asking about COVID. issue
- Translate user questions using DeepL issue
- Tweets that mention COVID. issue
- Questions that people ask on Twitter. issue
- Questions generated via linguistic rules and vocabularies
The goal is to set up an effective and time-efficient labelling process for human labellers.
- Label user questions as either complete and answerable, or not answerable in their current form (incomplete, or missing information such as personal medical history, geolocation, etc.). issue
- Label user questions with high-level topics (cluster). issue
- Covid QA annotations. Label user questions with master questions. Annotate a reasonably sized corpus in the same format as NaturalQuestions, i.e., a question paired with a paragraph. This will be built on top of the Covid Answers Corpus. issue
- Create a test set of QA annotations issue
- Create a training set of QA annotations for fine-tuning (also contains unanswerable questions).
- Verify the mapping between the master question and the answer
- Create a labelling step in the chatbot to ask users if the answer is good or not. issue
The goal here is to build the language model best suited to represent the language (English and French) used when talking about COVID and related subjects.
- Covid Language Model Corpus. Collect a large corpus of text talking about Covid. The most relevant corpus is probably directives from government agencies. Scientific literature (which AI2 has already released) and Twitter are also acceptable sources. issue
- Covid Bert. Since Bert is trained on BookCorpus and Wikipedia, it may not work well for Covid language. Continue Bert training on a large Covid corpus. issue
- Translate the NaturalQuestions corpus to French. https://github.com/dialoguemd/covid-19/issues/265
- Domain-adapt DistilBERT to a large Covid corpus. https://github.com/dialoguemd/covid-19/issues/266
Given the user input, retrieve the answer.
- Experiment with Google search as a baseline for answer retrieval. https://github.com/dialoguemd/covid-19/issues/248
- Baseline: ElasticSearch text index, which will serve as a baseline model. https://github.com/dialoguemd/covid-19/issues/267
- Paragraph retrieval. Given a question, retrieve a relevant paragraph.
- Document retrieval. Given a question, retrieve a relevant document.
- Use Covid QA annotations to get accuracy numbers (F1) https://github.com/dialoguemd/covid-19/issues/268
- Re-rank tf-idf results using Bert. Given the top-10 search results, use Bert to rank the best possible answer. The model is trained on NaturalQuestions. https://github.com/dialoguemd/covid-19/issues/269
- Encoding similarity ranking. Index the BERT encodings of master questions, then look up the closest one using the encoding of the user question as the key.
- Bert span-based QA model. The models above return a whole paragraph as an answer. This model helps identify the spans that answer the question more directly. We can bold these spans in the paragraph.
- Improve ES document structure to have all the data the app needs to display the answer. https://github.com/dialoguemd/covid-19/issues/270
- Integrate ES API into Rasa policy https://github.com/dialoguemd/covid-19/issues/247
- Call Rasa from the app to get the answer
- Deploy ElasticSearch service
- Set up indexing pipeline (GitHub, CircleCI)
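The paragraph-retrieval baseline from the retrieval items above can be approximated without a running ElasticSearch service. The sketch below is a small in-memory tf-idf index, not the ElasticSearch scoring function; the class name, the smoothing formula, and the sample paragraphs are assumptions made for illustration. Its top-k output is the kind of candidate list a Bert re-ranker would then reorder.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TfidfIndex:
    """In-memory tf-idf index over a list of paragraphs."""

    def __init__(self, paragraphs):
        self.paragraphs = list(paragraphs)
        self.term_counts = [Counter(tokenize(p)) for p in self.paragraphs]
        n = len(self.paragraphs)
        doc_freq = Counter()
        for counts in self.term_counts:
            doc_freq.update(counts.keys())
        # Smoothed inverse document frequency; always positive.
        self.idf = {
            term: math.log((1 + n) / (1 + df)) + 1.0
            for term, df in doc_freq.items()
        }

    def _score(self, query_terms, counts):
        """Sum of (term frequency * idf) over query terms found in the paragraph."""
        length = sum(counts.values()) or 1
        return sum(
            counts[t] / length * self.idf[t] for t in query_terms if t in counts
        )

    def search(self, question, k=10):
        """Return up to k (paragraph, score) pairs, best first.

        Paragraphs that share no term with the question are omitted.
        """
        query_terms = set(tokenize(question))
        scored = [
            (self._score(query_terms, counts), i)
            for i, counts in enumerate(self.term_counts)
        ]
        scored = [(s, i) for s, i in scored if s > 0]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [(self.paragraphs[i], s) for s, i in scored[:k]]


# Hypothetical paragraphs, not taken from the real corpus.
index = TfidfIndex([
    "Common symptoms include fever and cough.",
    "Testing is available at designated clinics.",
    "Wash your hands often with soap.",
])
results = index.search("where can I get testing", k=2)
print(results[0][0])
```

Because matching is purely lexical, a question that says "tested" rather than "testing" would return nothing here, which illustrates why the encoding-based ranking is listed as a separate item.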