Paginate results from API /sentences when sort=random #3263
Make the sort value row-dependent for random sort. This allows seek-based pagination for random sorts, too. It is not well optimized, because the random value has to be computed for every row, but it works and is fast enough.
|
I don't understand all the details, but I wonder whether your algorithm also works when a sentence is added (or removed) while switching from one page to the next. |
|
I didn’t try inserting a new sentence, but it shouldn’t be a problem. The position of each sentence in the result set is directly derived from the sentence id, so if a sentence is added, it will get inserted somewhere inside the result set, without affecting the order of all the other sentences. The main website tatoeba.org paginates using page numbers, but the API uses keyset pagination instead: API consumers can only go from one page to the next (and not the other way around), and the position of the next page is based on the position of the last sentence of the current page, so everything gets shifted without affecting any ongoing page browsing. |
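To illustrate the argument above, here is a minimal Python sketch of keyset pagination over a seeded, row-dependent sort value. It is not the code from this PR: the SHA-256 based `sort_key`, the in-memory `fetch_page`, and the `browse` helper are all hypothetical names chosen for the example, and the real expression used to compute the per-row value is not shown in this thread.

```python
import hashlib


def sort_key(seed, sentence_id):
    # Assumption: any deterministic function of (seed, id) would do here.
    # The id is appended as a tie-breaker so that keys are unique.
    digest = hashlib.sha256(f"{seed}:{sentence_id}".encode()).digest()
    return (int.from_bytes(digest[:8], "big"), sentence_id)


def fetch_page(ids, seed, after=None, limit=3):
    # Keyset pagination: return the rows ranked strictly after the cursor.
    ranked = sorted(ids, key=lambda i: sort_key(seed, i))
    if after is not None:
        ranked = [i for i in ranked if sort_key(seed, i) > after]
    page = ranked[:limit]
    cursor = sort_key(seed, page[-1]) if page else None
    return page, cursor


def browse(ids, seed, limit=3, insert_after_first_page=None):
    # Walk all pages; optionally add a new id after the first page was read.
    ids = list(ids)
    seen, cursor, first = [], None, True
    while True:
        page, cursor = fetch_page(ids, seed, cursor, limit)
        if not page:
            return seen
        seen.extend(page)
        if first and insert_after_first_page is not None:
            ids.append(insert_after_first_page)
        first = False


if __name__ == "__main__":
    print(browse(range(1, 11), seed=12345))
    print(browse(range(1, 11), seed=12345, insert_after_first_page=999))
```

In this sketch, a sentence added mid-browse either lands after the current cursor (and shows up on a later page) or before it (and is simply never seen). In both cases the relative order of the existing sentences is unchanged, so no sentence is repeated or skipped.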
The initial call with sort=random computes a seed value. Paginated links include the seed value as sort=random:<seed>, e.g. sort=random:12345. Subsequent calls read the seed value from the sort= parameter. This ensures the same sentence won’t appear twice within the complete result.
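The seed handling described above could look roughly like the following Python sketch. The function names, the seed range, and the `after` cursor parameter are assumptions made for illustration; only the `sort=random` and `sort=random:<seed>` forms come from this PR.

```python
import secrets
from urllib.parse import urlencode


def resolve_sort(sort_param):
    # Split "random" or "random:<seed>" into (sort name, seed).
    name, _, raw_seed = sort_param.partition(":")
    if name != "random":
        return name, None
    # Initial call: no seed given, so pick one (the range is an assumption).
    # A non-numeric seed raises ValueError in this sketch.
    seed = int(raw_seed) if raw_seed else secrets.randbelow(2**31)
    return name, seed


def next_page_url(base_url, params, seed, cursor):
    # Carry the seed in the link so the next call reuses the same ordering.
    # "after" is a hypothetical name for the keyset cursor parameter.
    query = dict(params, sort=f"random:{seed}", after=cursor)
    return f"{base_url}?{urlencode(query)}"


if __name__ == "__main__":
    _, seed = resolve_sort("random")
    print(next_page_url("https://api.dev.tatoeba.org/v1/sentences",
                        {"lang": "fra"}, seed, "abc"))
```

Note that `urlencode` percent-encodes the colon, which gives the same `sort=random%3A<seed>` form as the curl commands further down in this thread.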
5ee9dc1 to 79fe164
|
I have deployed this branch on https://api.dev.tatoeba.org/ if you want to test it out. |
|
Unfortunately the test didn't pass… After adding a sentence the "random" order is completely changed. :-( |
|
I just tried and it worked for me. Here is my method:
# Get the first 50 sentence IDs in French using random seed 657749637
curl -s "https://api.dev.tatoeba.org/v1/sentences?sort=random%3A657749637&lang=fra"| jq '.data[].id' > /tmp/ids
# Add a new sentence on dev.tatoeba.org ("Il faut mettre un peu d’ordre dans ce bazar.")
# Wait until the new sentence gets indexed, check if it is with this command:
curl -s "https://api.dev.tatoeba.org/v1/sentences?sort=created&lang=fra" -G --data-urlencode "q=Il faut mettre un peu d’ordre dans ce bazar." | jq .
# Get the list of sentence IDs again, using the same random seed
curl -s "https://api.dev.tatoeba.org/v1/sentences?sort=random%3A657749637&lang=fra"| jq '.data[].id' > /tmp/ids.new
# Compare the two lists
diff -u /tmp/ids /tmp/ids.new |
|
I didn't use the API (I haven't looked at it yet), just dev.tatoeba.org. Maybe that is the reason? |
This PR allows paginating results from the API /sentences endpoint when sort=random is used.
The initial call with sort=random computes a random seed value. Paginated links include the seed value as sort=random:<seed>, e.g. sort=random:12345. Subsequent calls read the seed value from the sort= parameter. This ensures the same sentence won’t appear twice within a complete result set.