-
Very interesting, I had not thought about reading up on how those datasets were made. I have no idea what it would cost to create such a dataset on MTurk, what the expected quality would be, or how long it would take. But I don't know enough about the subject to trust my own research, so I'm leaving it up to you if that's okay :) If the costs are not staggering and the expected quality is high, I can see myself putting some money into it...
-
Basing my estimates on this article. Unfortunately, reading chunks of text and writing quiz questions about them is far more laborious than labeling images. Assuming 5 questions about an article chunk in 10 minutes, that's 30 questions an hour. The US minimum wage is $7.25/hour, and the dataset should be at least 50k pairs. 50k * $7.25 / 30 is about $12k. I'd say that's staggeringly high for an amateur effort such as this. I wonder how much SQuAD cost to build.
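For what it's worth, here's the arithmetic spelled out (same assumptions as above; note it ignores MTurk's platform fees, which would push the number higher):

```python
# Back-of-the-envelope cost estimate, using the same assumptions as above.
questions_per_hour = 5 * (60 / 10)   # 5 questions per 10-minute chunk -> 30/hour
hourly_wage = 7.25                   # USD, US federal minimum wage
target_pairs = 50_000

total_cost = target_pairs / questions_per_hour * hourly_wage
print(f"~${total_cost:,.0f}")        # ~$12,083 (ignoring MTurk platform fees)
```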
-
Just an update on my research on datasets, etc.

Tl;dr: combine datasets, filter training questions aggressively for desirability, use more robust metrics, and test out different seq2seq models. There is a lot to research! Here is a vomit of some tentative conclusions:

Since it's impractical to construct a custom dataset, combining multiple QA datasets is the next best option. With more datasets we'll have both more data and more exposure to domains beyond Wikipedia. There is precedent for combining QA datasets for QA systems; these two papers are useful references: this paper and this paper. Some small datasets can be used as out-of-distribution test sets to check generalization (as is done in one of the papers I mentioned). There are many QA datasets of variable quality, format, and size, but here are some promising ones I found:
QuoRef, HotPotQA, and MultiRC all focus on questions which require reasoning over multiple sentences or paragraphs. It's possible their inclusion could cause the model to generate more complex questions.

Any post-filtering one might want to do after generating cards could also be done as a pre-filtering step on the training set: grammaticality, etc. With more data (see the above point) we have more latitude to be picky in our filtering. Consider that QA datasets are constructed with training a QA system in mind: QA systems need to be robust to malformed questions, and noisy data can even be an advantage. For QG, learning to generate malformed questions is the last thing you want. Here are some samples from different percentiles of grammaticality from the SQuAD training set. It seems that roughly 5-10% of SQuAD is ungrammatical.
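To make the pre-filtering idea concrete, here is a minimal sketch of what it could look like: score questions with an acceptability (CoLA-style) classifier and drop the low scorers. The checkpoint name and its label convention are assumptions on my part; any grammatical-acceptability classifier from the Hub would do.

```python
# Sketch: pre-filter questions by grammatical acceptability before training.
# The checkpoint and its LABEL_0/LABEL_1 convention are assumptions -- check
# the model card of whatever acceptability classifier you actually use.
from transformers import pipeline

clf = pipeline("text-classification", model="textattack/roberta-base-CoLA")

def acceptability(question: str) -> float:
    result = clf(question)[0]
    # Assumed convention: LABEL_1 = acceptable, LABEL_0 = unacceptable.
    return result["score"] if result["label"] == "LABEL_1" else 1.0 - result["score"]

questions = [
    "What year did the French and Indian War begin?",
    "When the war that begin in what year France?",   # deliberately mangled
]
kept = [q for q in questions if acceptability(q) >= 0.5]
print(kept)
```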
Recently, Google put out a paper on making sentences extracted from their contexts stand alone; they call it "decontextualization". This task is obviously relevant to Autocards, but it's unclear to me whether the dataset they constructed can be of any use. Two ideas occur to me: 1) building a 'decontextualized' classifier for filtering (see above), 2) decontextualizing each sentence of a context before question generation. I haven't been successful in making 1 work, but maybe I just need a better approach. I haven't attempted 2.

Evaluation metrics like BLEU and ROUGE are suspect. Something based on semantic similarity might be worth exploring (rough sketch below).

I've experimented with getting a TPU training setup on Colab (Pro). I was able to finetune t5-base on 65k answer-context pairs for 10 epochs in ~2 hours. Very feasible times for iterating on experiments.

It's not a priori obvious to me that T5 will be the best model. I think it would be worth trying out T5, BART, ProphetNet, and PEGASUS; they're all available on HuggingFace. ERNIE-GEN and UniLMv2 might be better, but the former requires using Baidu's framework and the latter isn't out afaict.

Summarization and question generation seem like strongly related tasks to me. I've found some papers that try to exploit that connection, but none particularly stand out to me. Not sure what to do with this intuition.
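On the metrics point above, a minimal sketch of a semantic-similarity comparison (the sentence-transformers checkpoint is just a common default, not a considered choice of metric):

```python
# Sketch: score a generated question against a reference by embedding
# similarity rather than n-gram overlap (BLEU/ROUGE).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

reference = "What year did the French and Indian War begin?"
generated = "In which year did the French and Indian War start?"

embeddings = model.encode([reference, generated], convert_to_tensor=True)
print(util.cos_sim(embeddings[0], embeddings[1]).item())  # high despite low n-gram overlap
```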
-
The above post is as I wrote it up in my notes a couple of weeks ago. The task of collecting and collating those datasets was daunting, so I decided to wait and see if any further ideas popped up (read: procrastinated). Fortunately, Salesforce published a paper last week which more or less accomplishes the whole program suggested above! See MixQG: Neural Question Generation with Mixed Answer Types.

They did most of what I wanted to do, except filtering the datasets for grammaticality etc. They demonstrate that the diversity of context domains and question/answer types makes the model more robust and performant (as I hoped). They also trained multiple versions of the model, from t5-small all the way up to t5-3b, and released them on HF. They even found a clever way of incorporating summarization: summarize, then treat each summary sentence as the 'answer' along with the context. They claim that the model's training diversity allows it to generate factoid questions from small answer strings (as in SQuAD and therefore Autocards), or longer non-factoid questions with longer answers. See this diagram:

It seems that using this summarization-as-answer-extraction idea along with factoid answer extraction could generate more varied questions. Having more q/a pairs to work with is aligned with the goals of CherryTree, at least. Obviously testing is in order. Here are some comparisons with Autocards, plus a test of their summarization idea. I used patil's answer extraction model to extract answers, which I used for both QG models. I used
Tentative conclusion: MixQG isn't clearly better. Usually the models give roughly similar questions; sometimes one is better, sometimes the other. But the summarization idea is interesting and leads to different types of questions (rough sketch of the pipeline below). I don't think patil's model would work well for this because of its limited training domain. Although, if I can get his... Probably tomorrow I'll show some interesting comparison examples from the above here.
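For reference, here is roughly how I'd wire up the summarization-as-answer-extraction idea with MixQG. The checkpoint names and the `answer \n context` input format are taken from my reading of the MixQG model card and may be off, so treat this as a sketch to verify rather than working code:

```python
# Sketch: summarize the context, then treat each summary sentence as the
# "answer" and ask MixQG to generate a question for it.
# Checkpoints and the "answer \n context" input format are assumptions.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
qg = pipeline("text2text-generation", model="Salesforce/mixqg-base")

context = (
    "The French and Indian War (1754-1763) was the North American theater of "
    "the Seven Years' War. It pitted the colonies of British America against "
    "those of New France."
)

summary = summarizer(context, max_length=60, min_length=10)[0]["summary_text"]

for sentence in summary.split(". "):
    sentence = sentence.strip().rstrip(".")
    if not sentence:
        continue
    prompt = f"{sentence} \\n {context}"   # assumed MixQG separator convention
    question = qg(prompt)[0]["generated_text"]
    print(f"Q: {question}\nA: {sentence}\n")
```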
-
The cards generated by Autocards are tantalizingly good, but still lacking in many ways. Here are some notes about how trying different datasets could improve card quality.
SQuAD
The project that Autocards is based on, https://github.com/patil-suraj/question_generation, trained its model on SQuAD. That dataset is intended for a machine reading comprehension task. You can explore it here: https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/French_and_Indian_War.html
Consider these questions from the link above:
These questions don't make sense outside of the context of the passage from which they're derived. Of course, you can always include the passage, but having to read a chunk of text to answer a card isn't ideal. See https://www.supermemo.com/en/archives1990-2015/articles/20rules
Even putting aside the lack of context, it's not clear why these types of q/a pairs should be good for notecards at all. Here is the interface the SQuAD crowdworkers saw (from the original paper):
The paper says
The paper doesn't show those samples of "good and bad questions and answers", unfortunately. But, from looking at the dataset, the pairs don't seem to target the most "important" parts of the passage. It seems to me that the most you can say about the dataset is that the question/answer pairs are coherent and answerable from the passage (with extracted spans).
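If you'd rather poke at the raw data than the explorer page linked above, a quick way to eyeball how context-dependent the questions are (using the Hugging Face `datasets` library):

```python
# Load SQuAD and inspect a few question/answer/context triples.
from datasets import load_dataset

squad = load_dataset("squad", split="train")

for example in squad.select(range(3)):
    print("Q:", example["question"])
    print("A:", example["answers"]["text"][0])
    print("Context:", example["context"][:200], "...\n")
```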
Perfect dataset?
The above criticisms make me wonder what the perfect dataset for flashcards would be. At the very least the questions should be short but self-contained: answerable by a human educated on the topic without further context. The answer should also be short, but need not be an exact extracted span from the context.
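Even without a learned model, crude heuristics could enforce some of this. A sketch, where the specific thresholds and phrase list are just illustrative guesses, nothing validated:

```python
# Sketch: crude checks for "short and self-contained" q/a pairs.
# The thresholds and phrase list are illustrative guesses, not validated rules.
CONTEXT_REFERENCES = ("the passage", "the article", "the author",
                      "this text", "the above", "the following")

def looks_self_contained(question: str, answer: str,
                         max_q_words: int = 25, max_a_words: int = 10) -> bool:
    q = question.lower()
    if any(phrase in q for phrase in CONTEXT_REFERENCES):
        return False   # question refers to the source text itself
    if len(question.split()) > max_q_words or len(answer.split()) > max_a_words:
        return False   # keep both sides short
    return True

print(looks_self_contained("What did the author argue in the passage?", "reform"))      # False
print(looks_self_contained("What year did the French and Indian War begin?", "1754"))   # True
```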
The selection of the q/a pairs from the context seems trickier. There's an argument to be made that selecting pairs based on how worthy they are to be memorized would be pointless because of individual differences. Maybe it is better to just form every viable pair and let the user select what they're interested in.
On the other hand... it does seem likely that there is large overlap in what people consider important, or at least, what's not important. As an analogy, consider students who are trying to guess 'what will be on the test': there would probably be substantial overlap in their guesses. Or, to translate into crowdworking instructions, something like "write a 5-question quiz on the most important parts of this article" could work.
As for selecting contexts, sources other than Wikipedia would ideally be included: news articles, educational blog posts, etc.
Candidates
Beyond building our own dataset with MTurk (how much would this cost, anyone know?), here are some datasets I've looked at
TriviaQA
These are trivia questions scraped from trivia websites.
Examples:
The questions in this dataset tend to be very self-contained; trivia has to make sense without context, after all. Unfortunately, for the same reason, this dataset doesn't come with clean contexts from which to attempt to derive the pairs. The authors did attempt to automate this by grabbing Wiki articles and search results from random web articles, but there's no indication of where in those articles to find the context containing the answer. Overall, it's not very usable.
However, the KILT dataset incorporated the questions from TriviaQA with its own crowd-sourced Wikipedia provenance (down to the paragraphs). The total number of questions with provenance is around ~50k, a bit low.
Many of these are promising, like
But, consider another example
In this case, the fact that the book was written by C.S. Lewis isn't included in the passage (though it is present elsewhere on the Wiki page from which the paragraph is taken). This pattern, where the passage only includes information about one part of the question (but is technically enough to get the answer), is common.
As another example, sometimes the provenance for an answer is split between different pages
In this case one of the provenances is the wiki page for Alfred Hitchcock, but the selected paragraphs only mention that the movie Rebecca was based on du Maurier's work. Elsewhere on the (very long) page it also mentions that The Birds was too (I went and checked Wikipedia manually), but the provenance does not include that paragraph. There are other provenances from the wiki pages of Rebecca and The Birds, but they don't mention each other.
This is all to say that, simply, getting a contiguous chunk of text from Wikipedia which could plausibly prompt these trivia q/a pairs is non-trivial.
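One crude way to salvage part of it would be to keep only the pairs whose answer string appears verbatim in a single provenance paragraph, and use that paragraph as the training context. A sketch (naive substring matching only, and it would still admit cases like the C.S. Lewis one above, where the answer is present even though part of the question's premise isn't):

```python
# Sketch: keep only (question, answer) pairs whose answer string occurs
# verbatim in at least one provenance paragraph, and use that paragraph
# as the training context. Naive substring matching only.
from typing import List, Optional

def usable_context(answer: str, paragraphs: List[str]) -> Optional[str]:
    for paragraph in paragraphs:
        if answer.lower() in paragraph.lower():
            return paragraph
    return None

# Hypothetical example in the KILT-style question/answer/provenance shape.
paragraphs = [
    "Rebecca is a 1940 American film directed by Alfred Hitchcock.",
    "The film was Hitchcock's first American project.",
]
print(usable_context("Alfred Hitchcock", paragraphs))
```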
Anyway, self-containedness aside, it's not clear that trivia-type questions are ideal for flashcards either. For instance, a large percentage of the questions are based on pop culture.
LearningQ
These are supposed to be educational questions scraped from TedEd and KhanAcademy. Very low quality dataset afaict, especially the KA questions. Not worth looking at. Here are some TedEd ones:
The answers are large chunks from the contexts, which are transcribed from videos.
Something else?
I'm in the process of researching more datasets. I'll write my findings here. I just wanted to get something written about what I've discovered so far.