-
Notifications
You must be signed in to change notification settings - Fork 462
Re-introduce content language for YouTube #257
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only related problem is channel subscription count, so I fixed it this way: if the content language is not English and if the sub count is shortened, it makes a new request in English and get channel sub count.
What is the exact problem? Do you get the wrong format or no result?
...in/java/org/schabi/newpipe/extractor/services/youtube/extractors/YoutubeStreamExtractor.java
Show resolved
Hide resolved
...in/java/org/schabi/newpipe/extractor/services/youtube/extractors/YoutubeStreamExtractor.java
Outdated
Show resolved
Hide resolved
The problem is: Months ago, YouTube shortened sub count for channels, leading to no exact number. As it's shortened, it gives 250k, 1M… but in other languages, it could be 250 k (french), 250 somewhat (with space), and then we only gather the number, leading to TeamNewPipe/NewPipe#2632 |
Ah yes, I remember. In this case, the best solution would be to create a list of the abbreviations for all supported languages and then convert the numbers correctly. Making a new request for a single value seems not the correct approach to me as we cause a bunch of traffic. |
...n/java/org/schabi/newpipe/extractor/services/youtube/extractors/YoutubeChannelExtractor.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the effort. This was a massive work wich definitely cost you hours to complete.
...actor/src/test/java/org/schabi/newpipe/extractor/services/youtube/YoutubeSubscriberTest.java
Outdated
Show resolved
Hide resolved
extractor/src/main/java/org/schabi/newpipe/extractor/utils/Utils.java
Outdated
Show resolved
Hide resolved
extractor/src/main/java/org/schabi/newpipe/extractor/utils/Utils.java
Outdated
Show resolved
Hide resolved
...org/schabi/newpipe/extractor/services/youtube/extractors/YoutubeAbbreviationSubCountMap.java
Outdated
Show resolved
Hide resolved
...org/schabi/newpipe/extractor/services/youtube/extractors/YoutubeAbbreviationSubCountMap.java
Outdated
Show resolved
Hide resolved
...n/java/org/schabi/newpipe/extractor/services/youtube/extractors/YoutubeChannelExtractor.java
Outdated
Show resolved
Hide resolved
...actor/src/test/java/org/schabi/newpipe/extractor/services/youtube/YoutubeSubscriberTest.java
Outdated
Show resolved
Hide resolved
|
||
import java.util.HashMap; | ||
|
||
public class YoutubeAbbreviationSubCountMap { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add a JavaDoc for this class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about the class name, too. I assume that these abbreviations can be used by other services, too. Is this correct? If yes, please move this file to the extractor utils.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it can be used if one wants to parse abbreviations from languages (and if some are missing for a service, it can easily be added).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the JavaDoc good?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, looks good. however, the class is not a map. we should rename it and change that part in the doc, too.
and comment some other files related
extractor/src/main/java/org/schabi/newpipe/extractor/utils/AbbreviationHashMap.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the suggested testing approach, I found that some patterns are missing. Almost sure that even more is missing, but there's too many to get all of them like this (would have to find channels within all the possible ranges).
I found this approach too brittle (referring to how the parsing is done and patterns stored).
What about using a similar approach like the time ago parser? Luckily, it seems that YouTube, unlike the dates, follow closely the data from Unicode (maybe even use it?).
ALL (or most part) the patterns are well known, distributed by Unicode, and freely available:
https://github.com/unicode-org/cldr/tree/master/common/main.
This would make developing a parser a lot easier.
For example, th
is failing for some thousands cases, it would be this case right here, or, using the segmented version, here.
PS: As of now, it seems like the hi
from time ago parser is failing because some year patterns are not included, would have to be fixed before enabling all languages. Will open a PR later.
extractor/src/main/java/org/schabi/newpipe/extractor/utils/AbbreviationHashMap.java
Outdated
Show resolved
Hide resolved
extractor/src/main/java/org/schabi/newpipe/extractor/utils/AbbreviationHashMap.java
Outdated
Show resolved
Hide resolved
...actor/src/test/java/org/schabi/newpipe/extractor/services/youtube/YoutubeSubscriberTest.java
Show resolved
Hide resolved
extractor/src/test/java/org/schabi/newpipe/extractor/utils/UtilsTest.java
Show resolved
Hide resolved
...n/java/org/schabi/newpipe/extractor/services/youtube/extractors/YoutubeChannelExtractor.java
Show resolved
Hide resolved
extractor/src/main/java/org/schabi/newpipe/extractor/utils/Utils.java
Outdated
Show resolved
Hide resolved
extractor/src/main/java/org/schabi/newpipe/extractor/utils/AbbreviationHashMap.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.
So, I made another test file with YouTube discover page, we test about 112 channels for each 80 languages. It just test abbreviations though, as it's not a channel page. I included below actual tests with the extractor for the 80 languages, you can easily test a channel. The downside is that there are more false negatives. The map is fullfilled now, the crash report is straightforward (here), it doesn't make the app crash, and there is an easy workaround for users if somehow one language fails: switch content language to English (until next update). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. almost done
extractor/src/main/java/org/schabi/newpipe/extractor/utils/Utils.java
Outdated
Show resolved
Hide resolved
extractor/src/main/java/org/schabi/newpipe/extractor/utils/Utils.java
Outdated
Show resolved
Hide resolved
extractor/src/test/java/org/schabi/newpipe/extractor/services/youtube/YoutubeSubcriberTest.java
Outdated
Show resolved
Hide resolved
extractor/src/main/java/org/schabi/newpipe/extractor/utils/Utils.java
Outdated
Show resolved
Hide resolved
It now doesn't fail the whole test if one language fail, but show an error on console. You may want to check individually the language that failed after the test, with testOneLanguageExtractor(). for mixedwordtolong, using power of tens may lead to a small rounding error.
@Stypox @mauriciocolli When you think that this is good to go, please merge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I skimmed through everything once again and the code is good. Thank you for the effort :-D @B0pol
@B0pol Tarvis gives two warnings. Could you fix them?
Also, there is an error with full links in description, could this be related to your changes?
|
Because I switched to raw text, instead of HTML, so full links www.youtube.com are not provided, but only youtu.be. |
That's bad, we need the full links provided in the html, otherwise long links in the description won't work... Are you sure there is no way to fix HTML formatting? |
Oh, I saw you edited your comment... Yeah, that would be ok ;-) |
.../org/schabi/newpipe/extractor/services/youtube/stream/YoutubeStreamExtractorDefaultTest.java
Show resolved
Hide resolved
I tested, and links in description are OK. But this is weird. Look at the video, it's only youtu.be links. If you try to change PLAIN_TEXT with HTML first, and print the description, ctrl-F, there are no youtu.be links, but it's still pass the test. |
Ok, then maybe the issue has to do with the video description that was changed (if that's the case ignore my review and please revert the changes in the description test and then replace youtube.com with youtu.be). |
When extracting HTML the description is processed and abbreviated links are converted into the correct ones. Lines 219 to 269 in ea68770
|
What's wrong with having shortened links? |
I think we are misunderstanding ourselves ;-) With the new JSON method I am not sure if full links are provided or if there are Sorry for my misunderstanding 🤦♂️ |
Yes they are ok. (Why close button is exactly at the same place as cancel on issues???) |
.../org/schabi/newpipe/extractor/services/youtube/stream/YoutubeStreamExtractorDefaultTest.java
Outdated
Show resolved
Hide resolved
Otherwise, about this pr: yt_new, ie #258 breaks again, by adding I think it could break more things because i've seen other places where they added like As it's in March, so soon, I'm closing the PR, because it will be pretty useless if we have to then comment the supported languages as yt_new broke subscriber count again. I'll wait for yt_new to be merged, and try to fix again, then if I succeed, reopen the PR and mention you. |
Ok |
Reintroduced content language.
This fixes content language selector being useless, then titles and descriptions are now in the good language.
Fixes TeamNewPipe/NewPipe#3089
The only related problem is channel subscription count, so I fixed it this way:
if the content language is not English and if the sub count is shortened, it makes a new request in English and get channel sub count.We replace the abbreviation by its English equivalent with a HashMap, and then use the
mixedNumberWordToLong
function (as it is right now)