🍒CherryTree: GUI for picking from a ton of premade cards #25

deklanw · 2021-09-23T02:57:08Z

deklanw
Sep 23, 2021

I think the advice of making your own cards is overrated. It seems to me that if you have already 'learned' the context of the card, the card is well-made, and you care to remember it, then it's a good card.

I wish every textbook/learning course came with its own expertly-made Anki deck. Fantasy aside, what can we do? I wish I just had a search-engine of premade cards. Then, I could search for cards about topics I'm learning and quickly import them. Well, what if you just look at other people's cards and pick and choose them as you learn relevant material? AnkiWeb doesn't support searching or downloading individual cards from shared decks, and they have taken measures to discourage scraping. Ok, well, there's an add-on for importing from Quizlet. Maybe we can just pick and choose cards from Quizlet and import them to Anki. But, Quizlet also doesn't support searching or downloading individual cards. Hmph.

We find Autocards. Great, now we can make cards from any material. Did you know that the Rosetta Stone was discovered by Napoleon's army while he was in Egypt? That's pretty cool, I'd like to have a card about that. Ok... so I just... feed in this paragraph from my book. I don't have a good dev environment ready. Gotta wait 5 minutes for my premade Colab notebook to start up and install... got the results. Ok, there's a few cards from this paragraph. A couple of them make sense, but none really capture what I wanted to remember. Oh well, I guess I'll just make it myself.

What if we combine the idea of a card search-engine with Autocards? We could generate a ton of cards (at least hundreds of thousands) from pre-selected material and then search, filter, rank, select, etc from a GUI.

This scheme addresses some of the core weaknesses of Autocards: it's slow and inconvenient to generate cards, and the cards are of variable quality. The first one should be obvious: we're computing AOT. So, consider the second. Autocards question-generation quality is highly dependent on the phrasing of the text. An easy solution is ensuring redundancy in the source text. How many history books explain how the Rosetta Stone was discovered by Napoleon's army in Egypt? Surely one of these books phrases it just so for Autocards. And, if we happen to find multiple high-quality, differently-phrased cards about our subject from different sources, that's only a bonus (see point 17).

History seems like a subject particularly amenable to this approach: there are many comprehensive texts, knowledge is redundant across texts, facts alone (without broader reasoning) can get you pretty far, images/videos aren't critical, etc. The practical plan is something like this: select dozens of good history books, make cards for them all, measure linguistic acceptability and perplexity scores, dump into a database, find a search backend like Elasticsearch or some alternative, throw on a GUI. From the GUI you can search, filter, rerank, see original context, easily select multiple cards and then one-click import, see previously-imported cards, etc. I'm thinking an Electron app would be nice so no one needs to worry about servers, and we can remember previously-imported cards without user accounts.
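A rough sketch of the "score, dump into a database, search" part of that plan, just to make it concrete. Everything here is an assumption for illustration: the `Card` fields, the `build_db` and `search_cards` helpers, the 0.5 acceptability cutoff, and the use of in-memory SQLite as a stand-in for whatever database and search backend would actually be chosen.

```python
import sqlite3
from dataclasses import dataclass, astuple


@dataclass
class Card:
    question: str
    answer: str
    source: str           # hypothetical format, e.g. "Book > Chapter > Paragraph"
    acceptability: float  # assumed scale: higher is better, in [0, 1]
    perplexity: float     # assumed: lower is better


def build_db(cards):
    """Create an in-memory SQLite table and bulk-insert the scored cards."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cards (question TEXT, answer TEXT, source TEXT, "
        "acceptability REAL, perplexity REAL)"
    )
    conn.executemany(
        "INSERT INTO cards VALUES (?, ?, ?, ?, ?)", [astuple(c) for c in cards]
    )
    return conn


def search_cards(conn, term, min_acceptability=0.5):
    """Return (question, answer) pairs mentioning `term`, best-scored first."""
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT question, answer FROM cards "
        "WHERE (question LIKE ? OR answer LIKE ?) AND acceptability >= ? "
        "ORDER BY acceptability DESC, perplexity ASC",
        (pattern, pattern, min_acceptability),
    )
    return rows.fetchall()
```

Because the scores are stored next to each card, the GUI could filter and rerank at query time instead of discarding low-scoring cards up front.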

Copyright could be a problem, I'm not sure. I don't think it would be possible to reconstruct an entire text from this, even with full contexts. If anything this whole scheme would only encourage people to go buy the books.

Long-term reach ideas:

community voting on card usefulness/grammar/etc
open-sourcing scripts for people to pre-compute card databases from their own texts/websites/etc which can then be loaded

I would like to experiment with training QAG on better datasets before trying all of this. But, I think it's not too big of a project relative to the potential.

Thoughts?

thiswillbeyourgithub · 2021-09-23T15:38:02Z

thiswillbeyourgithub
Sep 23, 2021
Collaborator

(I'm in kind of a rush so I'll keep it short)

1. I didn't know about sonic. This kind of "finding a needle in a haystack" problem for sentences seems to be part of why sentence-bert was invented. I recommend you take a look at their website; on the left, the "usage" section contains different examples. This could maybe be an interesting alternative as far as search engines go.
2. Instead of creating tons of flashcards beforehand, wouldn't it be better to just make a "card creating website" where the user supplies a document, gets q&a as an output, and which also stores all the cards it ever created? The same website could also be used to simply query for already stored q&a.

As long as autocards is not as efficient as possible then we have no idea what it would cost to host for other users.
So to me, the idea is really interesting but I think we should not really think about it before making autocards faster.

1. You talked about upgrading transformers from 3.0 to 4.1ish.
2. I'm sure making the code more independent of patil's code (which included several other features) could make it way more efficient.
3. Testing other datasets or other finetuned models.

What do you think? Those three points seem to be good objectives.

3 replies

deklanw Sep 23, 2021
Author

Addressing your two comments

Yeah, semantic search is a possibility too. I just think basic low-latency search would be sufficient for something like this. Just type "napoleon rosetta" and get results in milliseconds. Any good card should explicitly have the named entities in text, so I don't see semantic search as a high priority.
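To illustrate what I mean by basic keyword search, here is a minimal sketch of an inverted index with AND semantics. It is not how any particular search backend is implemented; `tokenize`, `build_index`, and `keyword_search` are hypothetical helpers, and a real backend would add ranking, typo tolerance, and persistence.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(cards):
    """Map each token to the set of card ids whose text contains it."""
    index = defaultdict(set)
    for card_id, text in enumerate(cards):
        for token in tokenize(text):
            index[token].add(card_id)
    return index


def keyword_search(index, query):
    """Return sorted ids of cards containing every query token (AND semantics)."""
    tokens = tokenize(query)
    if not tokens:
        return []
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return sorted(result)
```

A query is then just a few set intersections, which is why this kind of lookup stays fast even with hundreds of thousands of cards, as long as the named entities appear literally in the card text.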

A version of that site you describe bootstrapped with content like in my scheme sounds great. But, consider how much more complexity it brings: user accounts, usage limits, content moderation (can't have people making Mein Kampf cards), backend with accelerators and all the devops that come with it. Not to mention the cost. The juice doesn't seem worth the squeeze to me. At the very least I think my idea is a better starting point, to test the idea.

Anyway, the frustration of Autocards isn't just the speed and start-up complexity. The inconsistency of the generated cards is far more frustrating. Offloading the computation to us won't change the feeling of frustration when a user looks at the generated cards from the content they submitted. The issue is the same as in my post: you look for a card about that part you read, but they're low-quality. IMO, the mass scale via redundant books scheme is a satisfactory work-around (I hypothesize, we'll see).

As for improving code, yeah, I've already upgraded parts of patil's code to newer transformers versions. I'll be posting about different datasets in that other thread soon.

thiswillbeyourgithub Sep 24, 2021
Collaborator

Okay, you convinced me.

What roadmap do you suggest?

edit: I just noticed that @paulbricman has not approved my PR. Merging this could be a good start to make sure we are on the same page. That seems like the first step to add on top of your roadmap

deklanw Sep 24, 2021
Author

Cool.

PR Looks merged now :)

And, roadmap: experiment with different QG datasets/models first. I'll post about datasets later today, probably.

paulbricman · 2021-09-24T13:29:02Z

paulbricman
Sep 24, 2021
Maintainer

@deklanw To better understand the use case you envision, I'm curious what your take is on the following: Searching for one card about "napoleon rosetta" and getting a premade one wouldn't be that much faster than creating your own. Wouldn't the point be to grab a batch of flashcards related to a topic, like Napoleon? That could work through the semantic search brought up by @thiswillbeyourgithub, but for the specific purpose of getting the top ones, rather than the single best one.

2 replies

deklanw Sep 24, 2021
Author

"Searching for one card about "napoleon rosetta" and getting a premade one wouldn't be that much faster than creating your own."

I disagree: even for one card, reading a few candidates and clicking a button is definitely faster. But, I think I take your broader point. I didn't mention another idea I had about this: preserving the location in the ToC hierarchy for every card in a navigable way.

Let's say (true story) I watch a KhanAcademy video about the Fifth Coalition and learn about how Napoleon was in conflict with the Pope because of the refusal of the Papal States to participate in the Continental System. Napoleon ended up abducting the Pope! Ok, so I search CherryTree for "napoleon pope". I find a card: "Who frequently clashed with Napoleon over continued interference in central Italy and the extent of papal involvement in the Continental System? Pope Pius VII". Pretty decent card! I look at the context and see something like:

The Napoleonic Wars by Alexander Mikaberidze > Chapter 13: The Grand Empire: 1807-1812 > Paragraph 19

“French Italy” steadily expanded in the northwestern corner of the peninsula, where Piedmont was replaced by six departments that were administered as French provinces. In later years the French-governed areas extended to Parma and Piacenza; the Kingdom of Etruria survived until 1807, when Napoleon dissolved it and established three new departments. In the Papal States, Pope Pius VII frequently clashed with Napoleon over continued interference in central Italy and the extent of papal involvement in the Continental System. The French insistence on the Italian states signing a concordat with the pope only further strained relations, for while the treaty recognized Catholicism as the state religion, it also confirmed freedom of religion, introduced civil marriage and divorce, authorized the republic to nominate bishops, and confirmed the new owners of church land that had been confiscated and sold. Pope Pius VII, unsurprisingly, opposed these changes and fought to preserve the traditions of his office, including the spiritual and temporal independence of the Holy See; neither was he keen on participating in the Continental System, which would have had a profound impact on the local economy. These frictions with the imperial government culminated in a papal humiliation in 1809, when Napoleon occupied and annexed the Papal States while the pope, who excommunicated anyone who participated in this spoliation, was made prisoner and transported to Savona and later to France, where he remained under house arrest for the next five years.

I can then click any level of that hierarchy to see the other cards from the same book, chapter, section, subsection, subsubsection, ..., paragraph. This particular ebook isn't split very finely, but many are. As you recognize, it's unlikely we're just interested in a random fact about Napoleon. We're probably learning about that period of history and may want to find other related cards. Filtering by section in the hierarchy addresses that. Of course, we might be interested in several sections from different books. We could address that with more advanced queries/filters: FROM book1.chapter1.section1 OR book2.chapter3. Maybe I find another card from the same paragraph: "When did Napoleon occupy and annex the Papal States? 1809". Perfect.
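The hierarchy filter could be sketched roughly as below. This is only an illustration under assumptions: each card carries its ToC location as a tuple of path segments, the filter syntax is limited to the `FROM a.b OR c.d` shape from the example above, and `parse_filter`, `matches`, and `filter_cards` are hypothetical names.

```python
def parse_filter(expr):
    """Parse 'FROM a.b OR c.d' into a list of path-prefix tuples."""
    body = expr.strip()
    if body.upper().startswith("FROM "):
        body = body[5:]
    return [tuple(p.strip().split(".")) for p in body.split(" OR ") if p.strip()]


def matches(card_path, prefixes):
    """True if card_path sits at or below any of the given ToC prefixes."""
    return any(card_path[: len(p)] == p for p in prefixes)


def filter_cards(cards, expr):
    """Keep the text of (path, text) cards whose path falls under the filter."""
    prefixes = parse_filter(expr)
    return [text for path, text in cards if matches(path, prefixes)]
```

Comparing whole path segments rather than raw string prefixes means a filter on `book2.chapter3` would not accidentally pick up a hypothetical `book2.chapter30`.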

But, yes, some semantic embedding with classification and maybe a visualization could work well. We could hope that one of our labels would correspond to "Napoleonic History" and navigate with that, or look at apparent clusters in a 2D dimensionality reduction, or have a "related cards" section, etc. Worth trying, probably. History is probably one of the subjects most amenable to the ToC-hierarchy approach. Fuzzier subjects would probably benefit from fuzzier, continuous solutions.

paulbricman Sep 26, 2021
Maintainer

Ah, I see what you mean. And the ToC structure would be really interesting; it could play really well with textbooks. Though I think Quizlet also has many packs like "Textbook A: Chapter B". But they're limited to user contributions. I'll follow up with a separate thread.

paulbricman · 2021-09-26T10:38:42Z

paulbricman
Sep 26, 2021
Maintainer

How would you prioritize the content which has been "indexed" as flashcards (and eventually structured through a ToC)? Most used textbooks of all time? User requests?

1 reply

deklanw Sep 27, 2021
Author

I would have to get a better estimate of how long generation takes for a whole book before really saying. But, I was thinking the "General" subsections from this r/AskHistorians list would be a solid guide. Of course, only the ones with a proper ebook with a ToC.
