Re-introduce content language for YouTube #257


Closed
B0pol wants to merge 17 commits

Conversation

Member

@B0pol B0pol commented Feb 15, 2020

  • I carefully read the contribution guidelines and agree to them.
  • I did test the API against NewPipe.
  • I agree to create a pull request for NewPipe ASAP to make it compatible if I changed the API.

Reintroduced content language.
This fixes the content language selector being useless: titles and descriptions are now in the correct language.
Fixes TeamNewPipe/NewPipe#3089

The only related problem is channel subscription count, so I fixed it this way:
if the content language is not English and the sub count is shortened, it makes a new request in English and gets the channel sub count.
We replace the abbreviation with its English equivalent using a HashMap, and then use the mixedNumberWordToLong function (as it is right now).
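
Below is a minimal sketch of that idea (not the PR's actual code): a tiny abbreviation table and a simplified number pattern stand in for the real YoutubeAbbreviationSubCountMap and mixedNumberWordToLong, so the class name, regex and example strings are illustrative only.

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: normalize a localized abbreviation to its English equivalent, then
// parse the "mixed number word" (e.g. "250K", "1.2M") into a long.
public final class SubscriberCountSketch {

    // Hypothetical subset of the abbreviation table; the real map in this PR
    // covers the abbreviations of all supported content languages.
    private static final Map<String, String> ABBREVIATION_TO_ENGLISH = new HashMap<>();
    static {
        ABBREVIATION_TO_ENGLISH.put("k", "K");     // e.g. French "250 k abonnés"
        ABBREVIATION_TO_ENGLISH.put("Mio.", "M");  // e.g. German "1,2 Mio. Abonnenten"
        ABBREVIATION_TO_ENGLISH.put("mln", "M");   // e.g. Polish "3 mln subskrybentów"
    }

    // Very simplified: an integer part, an optional decimal part, then an optional abbreviation.
    private static final Pattern NUMBER_AND_ABBREVIATION =
            Pattern.compile("(\\d+(?:[.,]\\d+)?)\\s*([^\\d\\s]+)?");

    public static long mixedNumberWordToLong(final String text) {
        final Matcher m = NUMBER_AND_ABBREVIATION.matcher(text);
        if (!m.find()) {
            throw new IllegalArgumentException("No number found in: " + text);
        }
        final double value = Double.parseDouble(m.group(1).replace(',', '.'));
        String abbreviation = m.group(2) == null ? "" : m.group(2);
        // Replace the localized abbreviation with its English equivalent.
        abbreviation = ABBREVIATION_TO_ENGLISH.getOrDefault(abbreviation, abbreviation);
        switch (abbreviation) {
            case "K": return Math.round(value * 1_000L);
            case "M": return Math.round(value * 1_000_000L);
            case "B": return Math.round(value * 1_000_000_000L);
            default:  return Math.round(value);
        }
    }

    public static void main(final String[] args) {
        System.out.println(mixedNumberWordToLong("250 k abonnés"));       // 250000
        System.out.println(mixedNumberWordToLong("1,2 Mio. Abonnenten")); // 1200000
    }
}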

Contributor

@TobiGr TobiGr left a comment

The only related problem is channel subscription count, so I fixed it this way: if the content language is not English and the sub count is shortened, it makes a new request in English and gets the channel sub count.

What is the exact problem? Do you get the wrong format or no result?

@B0pol
Member Author

B0pol commented Feb 15, 2020

The only related problem is channel subscription count, so I fixed it this way: if the content language is not English and the sub count is shortened, it makes a new request in English and gets the channel sub count.

What is the exact problem? Do you get the wrong format or no result?

The problem is: months ago, YouTube shortened the sub count for channels, so there is no exact number anymore. As it's shortened, it gives 250K, 1M… but in other languages it could be 250 k (French) or 250 plus some other suffix (with a space), and then we only gather the number, leading to TeamNewPipe/NewPipe#2632
Enforcing English fixed this, but it broke titles & descriptions, which came back in the wrong language. Making a new request in English when the number is possibly wrong is the solution I came up with.
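
(To illustrate the failure mode, not the extractor's actual code: keeping only the digits of a shortened, localized count silently drops the magnitude.)

// Illustration only: the "k" suffix carries the magnitude, so extracting the digits alone is wrong.
String french = "250 k abonnés";
long parsed = Long.parseLong(french.replaceAll("\\D+", "")); // 250 instead of 250 000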

@TobiGr
Contributor

TobiGr commented Feb 15, 2020

Ah yes, I remember. In this case, the best solution would be to create a list of the abbreviations for all supported languages and then convert the numbers correctly. Making a new request for a single value does not seem like the right approach to me, as it causes a lot of extra traffic.

Contributor

@TobiGr TobiGr left a comment

Thank you for the effort. This was massive work which definitely cost you hours to complete.


import java.util.HashMap;

public class YoutubeAbbreviationSubCountMap {
Contributor

Can you please add a JavaDoc for this class?

Contributor

I am not sure about the class name either. I assume that these abbreviations can be used by other services, too. Is this correct? If yes, please move this file to the extractor utils.

Member Author

Yes, it can be used if one wants to parse abbreviations from other languages (and if some are missing for a service, they can easily be added).

Member Author

Is the JavaDoc good?

Contributor

Yes, looks good. However, the class is not a map; we should rename it and change that part in the doc, too.

Contributor

@mauriciocolli mauriciocolli left a comment

With the suggested testing approach, I found that some patterns are missing. I'm almost sure that even more are missing, but there are too many to get all of them like this (one would have to find channels within all the possible ranges).

I found this approach too brittle (referring to how the parsing is done and how the patterns are stored).

What about using an approach similar to the time ago parser? Luckily, it seems that YouTube, unlike with the dates, follows the Unicode data closely (maybe it even uses it?).

All (or most) of the patterns are well known, distributed by Unicode, and freely available:
https://github.com/unicode-org/cldr/tree/master/common/main.

This would make developing a parser a lot easier.
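
As a rough sketch of how those patterns could drive a parser (this is not the extractor's code; the element names follow the CLDR files, and the French short patterns are used here only as an assumed example):

// CLDR ships compact-decimal patterns per locale (common/main/<lang>.xml,
// decimalFormatLength type="short"), e.g. type="1000" -> "0 k",
// type="10000" -> "00 k", type="1000000" -> "0 M". Each pattern reduces to a
// (suffix, multiplier) rule instead of a hand-written abbreviation list.
public final class CldrCompactPatternSketch {

    // "00 k" keeps two significant digits of a 10000-type value, so the
    // multiplier is type / 10^(placeholders - 1).
    static long multiplierOf(final long type, final String pattern) {
        final int placeholders = pattern.replaceAll("[^0]", "").length();
        return type / (long) Math.pow(10, placeholders - 1);
    }

    // Strip the digit placeholders and (non-breaking) spaces; what remains is the suffix.
    static String suffixOf(final String pattern) {
        return pattern.replaceAll("[0\\s\u00A0]", "");
    }

    public static void main(final String[] args) {
        // Assuming the French short patterns from CLDR:
        System.out.println(suffixOf("0 k") + " -> " + multiplierOf(1_000L, "0 k"));     // k -> 1000
        System.out.println(suffixOf("00 k") + " -> " + multiplierOf(10_000L, "00 k"));  // k -> 1000
        System.out.println(suffixOf("0 M") + " -> " + multiplierOf(1_000_000L, "0 M")); // M -> 1000000
    }
}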

For example, th is failing for some thousands cases; it would be this case right here, or, using the segmented version, here.


PS: As of now, it seems like the hi of the time ago parser is failing because some year patterns are not included; this would have to be fixed before enabling all languages. I will open a PR later.

Member Author

@B0pol B0pol left a comment

.

@B0pol
Member Author

B0pol commented Feb 17, 2020

So, I made another test file using the YouTube discover page: we test about 112 channels for each of the 80 languages. It only tests abbreviations though, as it's not a channel page.

I included below actual tests with the extractor for the 80 languages, so you can easily test a channel. The downside is that there are more false negatives.

The map is complete now, the crash report is straightforward (here), it doesn't make the app crash, and there is an easy workaround for users if one language somehow fails: switch the content language to English (until the next update).

Contributor

@TobiGr TobiGr left a comment

Thanks, almost done.

It now doesn't fail the whole test if one language fails, but shows an error on the console. You may want to individually check the languages that failed after the test, with testOneLanguageExtractor().

for mixedNumberWordToLong, using powers of ten may lead to a small rounding error.
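
To illustrate the concern behind that commit (a sketch, not the PR's code): when the scale factor comes from Math.pow the product is a double, and truncating it with a plain cast can land one unit short, while rounding or exact decimal arithmetic does not.

import java.math.BigDecimal;

// The double closest to 8.2 is slightly below 8.2, so 8.2 * 10^6 computed in
// doubles ends up just below 8 200 000 and a (long) cast truncates it.
public final class PowerOfTenRoundingSketch {
    public static void main(final String[] args) {
        final double product = Double.parseDouble("8.2") * Math.pow(10, 6);
        System.out.println((long) product);      // 8199999 - truncated toward zero
        System.out.println(Math.round(product)); // 8200000 - rounded to nearest
        System.out.println(new BigDecimal("8.2").movePointRight(6).longValueExact()); // 8200000, exact
    }
}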
@TobiGr TobiGr requested a review from Stypox February 20, 2020 11:48
@TobiGr
Contributor

TobiGr commented Feb 20, 2020

@Stypox @mauriciocolli When you think that this is good to go, please merge.

Member

@Stypox Stypox left a comment

I skimmed through everything once again and the code is good. Thank you for the effort :-D @B0pol

@Stypox Stypox dismissed mauriciocolli’s stale review February 21, 2020 13:10

Everything was fixed

@Stypox
Member

Stypox commented Feb 21, 2020

@B0pol Travis gives two warnings. Could you fix them?

linkhandler/SearchQueryHandlerFactory.java:48: warning - @return tag has no arguments.
utils/Utils.java:108: warning - @param argument "loc:" is not a parameter name.
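
(For context: the first warning means an @return tag without a description, the second that the colon was written as part of the @param name. A sketch follows; the method names below are hypothetical stand-ins, not the real SearchQueryHandlerFactory/Utils signatures.)

import java.util.Locale;

public final class JavadocWarningSketch {

    /**
     * Builds a hypothetical search URL.
     *
     * @param query the search string
     * @return the URL of the search results page (a description after the
     *         {@code @return} tag is what the first warning asks for)
     */
    public static String getSearchUrl(final String query) {
        return "https://example.invalid/search?q=" + query;
    }

    /**
     * Formats a number for the given locale.
     *
     * @param number the value to format
     * @param loc    the locale to use ("loc" without the colon, so the tag
     *               matches an actual parameter name)
     * @return the formatted number
     */
    public static String formatNumber(final long number, final Locale loc) {
        return String.format(loc, "%,d", number);
    }
}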

Also, there is an error with full links in description, could this be related to your changes?

org.schabi.newpipe.extractor.services.youtube.stream.YoutubeStreamExtractorDefaultTest$DescriptionTestUnboxing > testGetFullLinksInDescription FAILED

@B0pol
Member Author

B0pol commented Feb 21, 2020

Because I switched to raw text instead of HTML, full www.youtube.com links are not provided, only youtu.be ones.

@Stypox
Member

Stypox commented Feb 21, 2020

That's bad, we need the full links provided in the HTML, otherwise long links in the description won't work... Are you sure there is no way to fix the HTML formatting?

@Stypox
Member

Stypox commented Feb 21, 2020

Oh, I saw you edited your comment... Yeah, that would be ok ;-)
Are you sure other URLs are not abbreviated?

@B0pol
Member Author

B0pol commented Feb 21, 2020

I tested, and links in the description are OK.
I didn't find any other shortened links.

But this is weird. Look at the video: it has only youtu.be links. If you replace PLAIN_TEXT with HTML first, print the description and Ctrl-F it, there are no youtu.be links, but it still passes the test.

@Stypox
Member

Stypox commented Feb 21, 2020

Ok, then maybe the issue has to do with the video description having been changed (if that's the case, ignore my review, revert the changes in the description test and then replace youtube.com with youtu.be).
Could you run a quick check with this video? It has a long link in the description: https://www.youtube.com/watch?v=gd5bynDvDUw

@B0pol
Member Author

B0pol commented Feb 21, 2020

On the YouTube website it has no full links:
image
Same thing with m.youtube.com.

And the same when gathering via the PLAIN_TEXT way: there are no full links.

@Stypox
Member

Stypox commented Feb 21, 2020

When extracting HTML the description is processed and abbreviated links are converted into the correct ones.
See parseHtmlAndGetFullLinks:

private String parseHtmlAndGetFullLinks(String descriptionHtml)
        throws MalformedURLException, UnsupportedEncodingException, ParsingException {
    final Document description = Jsoup.parse(descriptionHtml, getUrl());
    for (Element a : description.select("a")) {
        final String rawUrl = a.attr("abs:href");
        final URL redirectLink = new URL(rawUrl);
        final Matcher onClickTimestamp;
        final String queryString;
        if ((onClickTimestamp = DESCRIPTION_TIMESTAMP_ONCLICK_REGEX.matcher(a.attr("onclick")))
                .find()) {
            a.removeAttr("onclick");

            String hours = coalesce(onClickTimestamp.group(1), "0");
            String minutes = onClickTimestamp.group(2);
            String seconds = onClickTimestamp.group(3);

            int timestamp = 0;
            timestamp += Integer.parseInt(hours) * 3600;
            timestamp += Integer.parseInt(minutes) * 60;
            timestamp += Integer.parseInt(seconds);
            String setTimestamp = "&t=" + timestamp;

            // Even after clicking https://youtu.be/...?t=6,
            // getUrl() is https://www.youtube.com/watch?v=..., never youtu.be, never &t=.
            a.attr("href", getUrl() + setTimestamp);
        } else if ((queryString = redirectLink.getQuery()) != null) {
            // if the query string is null we are not dealing with a redirect link,
            // so we don't need to override it.
            final String link = Parser.compatParseMap(queryString).get("q");
            if (link != null) {
                // if link is null the a tag is a hashtag.
                // They refer to the youtube search. We do not handle them.
                a.text(link);
                a.attr("href", link);
            } else if (redirectLink.toString().contains("https://www.youtube.com/")) {
                a.text(redirectLink.toString());
                a.attr("href", redirectLink.toString());
            }
        } else if (redirectLink.toString().contains("https://www.youtube.com/")) {
            descriptionHtml = descriptionHtml.replace(rawUrl, redirectLink.toString());
            a.text(redirectLink.toString());
            a.attr("href", redirectLink.toString());
        }
    }
    return description.select("body").first().html();
}

@B0pol
Member Author

B0pol commented Feb 21, 2020

What's wrong with having shortened links?

@Stypox
Member

Stypox commented Feb 21, 2020

I think we are misunderstanding each other ;-)
By "full links" I mean "links that are not abbreviated using ...". Those cannot be clicked in NewPipe descriptions (without being converted to full links beforehand) since the full URL is missing. Shortened YouTube URLs (i.e. "youtu.be"), on the other hand, are perfectly fine and work without problems.

With the new JSON method I am not sure whether full links are provided or whether there are "..."s. Could you check that the JSON microformat description for video "https://www.youtube.com/watch?v=gd5bynDvDUw" contains the full link "https://www.youtube.com/channel/UCf5q0cbFOLbphljteZ9d4Pw" and not "https://www.youtube.com/channel/UCf5q..."?

Sorry for my misunderstanding 🤦‍♂️

@B0pol B0pol closed this Feb 21, 2020
@B0pol B0pol reopened this Feb 21, 2020
@B0pol
Member Author

B0pol commented Feb 21, 2020

Yes, they are ok. (Why is the close button in exactly the same place as cancel on issues???)

@B0pol
Member Author

B0pol commented Feb 21, 2020

Otherwise, about this PR: yt_new, i.e. #258, breaks it again by adding "subscribers" to the count, e.g. 100M subscribers, but in other languages 100 M abonnés, and with the current method the abbreviation obtained would be "Mabonnés".

I think it could break more things, because I've seen other places where they added words like "views" (but I think removing non-digit characters still does the job there, as that number is not rounded).

As that change comes in March, so soon, I'm closing the PR, because it would be pretty useless if we then had to comment out the supported languages once yt_new breaks the subscriber count again.

I'll wait for yt_new to be merged and try to fix it again; if I succeed, I'll reopen the PR and mention you.

@B0pol B0pol closed this Feb 21, 2020
@Stypox
Member

Stypox commented Feb 21, 2020

Ok

@wb9688 wb9688 mentioned this pull request Feb 26, 2020
@B0pol B0pol deleted the localisation branch March 1, 2020 19:41
Development

Successfully merging this pull request may close these issues.

Request: Have video titles and descriptions in the default content language where available