-
Very interesting, I had not thought about reading up on how those datasets were made. I have no idea what it would cost to create such a dataset on MTurk, what the expected quality would be, or how long it would take. But I don't know enough about the subject to trust my own research, so I'm leaving it up to you if that's okay :) If the costs are not staggering and the expected quality is high, I can see myself putting some money into it...
-
Basing my estimates on this article. Unfortunately, reading chunks of text and writing quiz questions about them is far more laborious than labeling images. Assuming 5 questions about an article chunk in 10 minutes, that's 30 questions an hour. The US minimum wage is $7.25/hour, and the dataset should be at least 50k pairs. 50k * $7.25 / 30 is about $12k. I'd say that's staggeringly high for an amateur effort such as this. I wonder how much SQuAD cost to build.
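For what it's worth, here's the arithmetic spelled out (same assumptions as above; note it ignores MTurk's platform fees, which would push the number higher):

```python
# Back-of-the-envelope cost estimate, using the same assumptions as above.
questions_per_hour = 5 * (60 / 10)   # 5 questions per 10-minute chunk -> 30/hour
hourly_wage = 7.25                   # USD, US federal minimum wage
target_pairs = 50_000

total_cost = target_pairs / questions_per_hour * hourly_wage
print(f"~${total_cost:,.0f}")        # ~$12,083 (ignoring MTurk platform fees)
```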
-
Just an update on my research on datasets, etc.

Tl;dr: combine datasets, filter training questions aggressively for desirability, use more robust metrics, and test out different seq2seq models. There is a lot to research! Here is a vomit of some tentative conclusions:

Since it's impractical to construct a custom dataset, combining multiple QA datasets is the next best option. With more datasets we'll have both more data and more exposure to domains beyond Wikipedia. There is precedent for combining QA datasets for QA systems; these two papers are useful references: this paper and this paper. Some small datasets can be used as out-of-distribution test sets to check generalization (as is done in one of the papers I mentioned). There are many QA datasets of variable quality, format, and size, but here are some promising ones I found:
QuoRef, HotPotQA, and MultiRC all focus on questions which require reasoning over multiple sentences or paragraphs. It's possible their inclusion could cause the model to generate more complex questions.

Any post-filtering one might want to do after generating cards could also be done as a pre-filtering step on the training set: grammaticality, etc. With more data (see the above point) we have more latitude to be picky in our filtering. Consider that QA datasets are constructed with training a QA system in mind: QA systems need to be robust to malformed questions, and noisy data can even be an advantage. For QG, learning to generate malformed questions is the last thing you want. Here are some samples from different percentiles of grammaticality from the SQuAD training set. It seems that roughly 5-10% of SQuAD is ungrammatical.
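To make the pre-filtering idea concrete, here is a minimal sketch of what it could look like: score questions with an acceptability (CoLA-style) classifier and drop the low scorers. The checkpoint name and its label convention are assumptions on my part; any grammatical-acceptability classifier from the Hub would do.

```python
# Sketch: pre-filter questions by grammatical acceptability before training.
# The checkpoint and its LABEL_0/LABEL_1 convention are assumptions -- check
# the model card of whatever acceptability classifier you actually use.
from transformers import pipeline

clf = pipeline("text-classification", model="textattack/roberta-base-CoLA")

def acceptability(question: str) -> float:
    result = clf(question)[0]
    # Assumed convention: LABEL_1 = acceptable, LABEL_0 = unacceptable.
    return result["score"] if result["label"] == "LABEL_1" else 1.0 - result["score"]

questions = [
    "What year did the French and Indian War begin?",
    "When the war that begin in what year France?",   # deliberately mangled
]
kept = [q for q in questions if acceptability(q) >= 0.5]
print(kept)
```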
Recently, Google put out a paper on making sentences extracted from their contexts stand alone; they call it "decontextualization". This task is obviously relevant to Autocards, but it's unclear to me whether the dataset they constructed can be of any use. Two ideas occur to me: 1) building a 'decontextualized' classifier for filtering (see above), 2) decontextualizing each sentence of a context before question generation. I haven't been successful in making 1 work, but maybe I just need a better approach. I haven't attempted 2.

Evaluation metrics like BLEU and ROUGE are suspect. Something based on semantic similarity might be worth exploring (rough sketch below).

I've experimented with getting a TPU training setup on Colab (Pro). I was able to finetune t5-base on 65k answer-context pairs for 10 epochs in ~2 hours. Very feasible times for iterating on experiments.

It's not a priori obvious to me that T5 will be the best model. I think it would be worth trying out T5, BART, ProphetNet, and PEGASUS; they're all available on HuggingFace. ERNIE-GEN and UniLMv2 might be better, but the former requires using Baidu's framework and the latter isn't out afaict.

Summarization and question generation seem like strongly related tasks to me. I've found some papers that try to exploit that connection, but none particularly stand out to me. Not sure what to do with this intuition.
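On the metrics point above, a minimal sketch of a semantic-similarity comparison (the sentence-transformers checkpoint is just a common default, not a considered choice of metric):

```python
# Sketch: score a generated question against a reference by embedding
# similarity rather than n-gram overlap (BLEU/ROUGE).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

reference = "What year did the French and Indian War begin?"
generated = "In which year did the French and Indian War start?"

embeddings = model.encode([reference, generated], convert_to_tensor=True)
print(util.cos_sim(embeddings[0], embeddings[1]).item())  # high despite low n-gram overlap
```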
-
The above post is as I wrote it up in my notes a couple of weeks ago. The task of collecting and collating those datasets was daunting, so I decided to wait and see if any further ideas popped up (read: procrastinated). Fortunately, Salesforce published a paper last week which more or less accomplishes the whole program suggested above! See MixQG: Neural Question Generation with Mixed Answer Types.

They did most of what I wanted to do, except filtering the datasets for grammaticality etc. They demonstrate that the diversity of context domains and question/answer types makes the model more robust and performant (as I hoped). They also trained multiple versions of the model, from t5-small all the way up to t5-3b, and released them on HF. They even found a clever way of incorporating summarization: summarize, then treat each summary sentence as the 'answer' along with the context. They claim that the model's training diversity allows it to generate factoid questions from small answer strings (as in SQuAD and therefore Autocards), or longer non-factoid questions with longer answers. See this diagram:

It seems that using this summarization-as-answer-extraction idea along with factoid answer extraction could generate more varied questions. Having more q/a pairs to work with is aligned with the goals of CherryTree, at least. Obviously testing is in order. Here are some comparisons with Autocards, plus a test of their summarization idea. I used patil's answer extraction model to extract answers, which I used for both QG models. I used
Tentative conclusion: MixQG isn't clearly better. Usually the models give roughly similar questions; sometimes one is better, sometimes the other. But the summarization idea is interesting and leads to different types of questions (rough sketch of the pipeline below). I don't think patil's model would work well for this because of its limited training domain. Although, if I can get his... Probably tomorrow I'll show some interesting comparison examples from the above here.
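For reference, here is roughly how I'd wire up the summarization-as-answer-extraction idea with MixQG. The checkpoint names and the `answer \n context` input format are taken from my reading of the MixQG model card and may be off, so treat this as a sketch to verify rather than working code:

```python
# Sketch: summarize the context, then treat each summary sentence as the
# "answer" and ask MixQG to generate a question for it.
# Checkpoints and the "answer \n context" input format are assumptions.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
qg = pipeline("text2text-generation", model="Salesforce/mixqg-base")

context = (
    "The French and Indian War (1754-1763) was the North American theater of "
    "the Seven Years' War. It pitted the colonies of British America against "
    "those of New France."
)

summary = summarizer(context, max_length=60, min_length=10)[0]["summary_text"]

for sentence in summary.split(". "):
    sentence = sentence.strip().rstrip(".")
    if not sentence:
        continue
    prompt = f"{sentence} \\n {context}"   # assumed MixQG separator convention
    question = qg(prompt)[0]["generated_text"]
    print(f"Q: {question}\nA: {sentence}\n")
```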
-
The cards generated by Autocards are tantalizingly good, but still lacking in many ways. Here are some notes about how trying different datasets could improve card quality.
SQuAD
The project that Autocards is based on, https://github.com/patil-suraj/question_generation, trained its model on SQuAD. That dataset is intended for a machine reading comprehension task. You can explore it here: https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/French_and_Indian_War.html
Consider these questions from the link above:
These questions don't make sense outside of the context of the passage from which they're derived. Of course, you can always include the passage, but having to read a chunk of text to answer a card isn't ideal. See https://www.supermemo.com/en/archives1990-2015/articles/20rules
Even putting aside the lack of context, it's not clear why these types of q/a pairs should be good for notecards at all. Here is the interface the SQuAD crowdworkers saw (from the original paper):
The paper says
The paper doesn't show those samples of "good and bad questions and answers", unfortunately. But, from looking at the dataset, the pairs don't seem to target the most "important" parts of the passage. It seems to me that the most you can say about the dataset is that the question/answer pairs are coherent and answerable from the passage (with extracted spans).
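If you'd rather poke at the raw data than the explorer page linked above, a quick way to eyeball how context-dependent the questions are (using the Hugging Face `datasets` library):

```python
# Load SQuAD and inspect a few question/answer/context triples.
from datasets import load_dataset

squad = load_dataset("squad", split="train")

for example in squad.select(range(3)):
    print("Q:", example["question"])
    print("A:", example["answers"]["text"][0])
    print("Context:", example["context"][:200], "...\n")
```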
Perfect dataset?
The above criticisms make me wonder what the perfect dataset for flashcards would be. At the very least the questions should be short but self-contained: answerable by a human educated on the topic without further context. The answer should also be short, but need not be an exact extracted span from the context.
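Even without a learned model, crude heuristics could enforce some of this. A sketch, where the specific thresholds and phrase list are just illustrative guesses, nothing validated:

```python
# Sketch: crude checks for "short and self-contained" q/a pairs.
# The thresholds and phrase list are illustrative guesses, not validated rules.
CONTEXT_REFERENCES = ("the passage", "the article", "the author",
                      "this text", "the above", "the following")

def looks_self_contained(question: str, answer: str,
                         max_q_words: int = 25, max_a_words: int = 10) -> bool:
    q = question.lower()
    if any(phrase in q for phrase in CONTEXT_REFERENCES):
        return False   # question refers to the source text itself
    if len(question.split()) > max_q_words or len(answer.split()) > max_a_words:
        return False   # keep both sides short
    return True

print(looks_self_contained("What did the author argue in the passage?", "reform"))      # False
print(looks_self_contained("What year did the French and Indian War begin?", "1754"))   # True
```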
The selection of the q/a pairs from the context seems trickier. There's an argument to be made that selecting pairs based on how worthy they are to be memorized would be pointless because of individual differences. Maybe it is better to just form every viable pair and let the user select what they're interested in.
On the other hand... it does seem likely that there is large overlap in what people consider important, or at least, what's not important. As an analogy, consider students who are trying to guess 'what will be on the test': there would probably be substantial overlap in their guesses. Or, to translate into crowdworking instructions, something like "write a 5-question quiz on the most important parts of this article" could work.
As for selecting contexts, sources other than Wikipedia would ideally be included: news articles, educational blog posts, etc.
Candidates
Beyond building our own dataset with MTurk (how much would this cost, anyone know?), here are some datasets I've looked at
TriviaQA
These are trivia questions scraped from trivia websites.
Examples:
The questions in this dataset tend to be very self-contained; trivia has to make sense without context, after all. Unfortunately, for the same reason, this dataset doesn't come with clean contexts from which to attempt to derive the pairs. The authors did attempt to automate this by grabbing Wiki articles and search results from random web articles, but there's no indication of where in those articles to find the context containing the answer. Overall, it's not very usable.
However, the KILT dataset incorporated the questions from TriviaQA with its own crowd-sourced Wikipedia provenance (down to the paragraphs). The total number of questions with provenance is around ~50k, a bit low.
Many of these are promising, like
But, consider another example
In this case, the fact that the book was written by C.S. Lewis isn't included in the passage (though it is present elsewhere on the Wiki page from which the paragraph is taken). This pattern, where the passage only includes information about one part of the question (but is technically enough to get the answer), is common.
As another example, sometimes the provenance for an answer is split between different pages
In this case one of the provenances is the wiki page for Alfred Hitchcock, but the selected paragraphs only mention that the movie Rebecca was based on du Maurier's work. Elsewhere on the (very long) page it also mentions that The Birds was too (I went and checked Wikipedia manually), but the provenance does not include that paragraph. There are other provenances from the wiki pages of Rebecca and The Birds, but they don't mention each other.
This is all to say that, simply, getting a contiguous chunk of text from Wikipedia which could plausibly prompt these trivia q/a pairs is non-trivial.
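One crude way to salvage part of it would be to keep only the pairs whose answer string appears verbatim in a single provenance paragraph, and use that paragraph as the training context. A sketch (naive substring matching only, and it would still admit cases like the C.S. Lewis one above, where the answer is present even though part of the question's premise isn't):

```python
# Sketch: keep only (question, answer) pairs whose answer string occurs
# verbatim in at least one provenance paragraph, and use that paragraph
# as the training context. Naive substring matching only.
from typing import List, Optional

def usable_context(answer: str, paragraphs: List[str]) -> Optional[str]:
    for paragraph in paragraphs:
        if answer.lower() in paragraph.lower():
            return paragraph
    return None

# Hypothetical example in the KILT-style question/answer/provenance shape.
paragraphs = [
    "Rebecca is a 1940 American film directed by Alfred Hitchcock.",
    "The film was Hitchcock's first American project.",
]
print(usable_context("Alfred Hitchcock", paragraphs))
```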
Anyway, self-containedness aside, it's not clear that trivia-type questions are ideal for flashcards either. For instance, a large percentage of the questions are based on pop culture.
LearningQ
These are supposed to be educational questions scraped from TedEd and KhanAcademy. Very low quality dataset afaict, especially the KA questions. Not worth looking at. Here are some TedEd ones:
The answers are large chunks from the contexts, which are transcribed from videos.
Something else?
I'm in the process of researching more datasets. I'll write my findings here. I just wanted to get something written about what I've discovered so far.