Re-introduce content language for YouTube #257


Closed
B0pol wants to merge 17 commits

Conversation

Member

@B0pol B0pol commented Feb 15, 2020

  • I carefully read the contribution guidelines and agree to them.
  • I did test the API against NewPipe.
  • I agree to create a pull request for NewPipe ASAP to make it compatible if I changed the API.

Reintroduced content language.
This fixes the content language selector being useless: titles and descriptions are now in the correct language.
Fixes TeamNewPipe/NewPipe#3089

The only related problem is channel subscription count, so I fixed it this way:
if the content language is not English and the sub count is shortened, it makes a new request in English and gets the channel sub count.
We replace the abbreviation with its English equivalent using a HashMap, and then use the mixedNumberWordToLong function (as it is right now).
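
Below is a minimal sketch of that idea (not the PR's actual code): a tiny abbreviation table and a simplified number pattern stand in for the real YoutubeAbbreviationSubCountMap and mixedNumberWordToLong, so the class name, regex and example strings are illustrative only.

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: normalize a localized abbreviation to its English equivalent, then
// parse the "mixed number word" (e.g. "250K", "1.2M") into a long.
public final class SubscriberCountSketch {

    // Hypothetical subset of the abbreviation table; the real map in this PR
    // covers the abbreviations of all supported content languages.
    private static final Map<String, String> ABBREVIATION_TO_ENGLISH = new HashMap<>();
    static {
        ABBREVIATION_TO_ENGLISH.put("k", "K");     // e.g. French "250 k abonnés"
        ABBREVIATION_TO_ENGLISH.put("Mio.", "M");  // e.g. German "1,2 Mio. Abonnenten"
        ABBREVIATION_TO_ENGLISH.put("mln", "M");   // e.g. Polish "3 mln subskrybentów"
    }

    // Very simplified: an integer part, an optional decimal part, then an optional abbreviation.
    private static final Pattern NUMBER_AND_ABBREVIATION =
            Pattern.compile("(\\d+(?:[.,]\\d+)?)\\s*([^\\d\\s]+)?");

    public static long mixedNumberWordToLong(final String text) {
        final Matcher m = NUMBER_AND_ABBREVIATION.matcher(text);
        if (!m.find()) {
            throw new IllegalArgumentException("No number found in: " + text);
        }
        final double value = Double.parseDouble(m.group(1).replace(',', '.'));
        String abbreviation = m.group(2) == null ? "" : m.group(2);
        // Replace the localized abbreviation with its English equivalent.
        abbreviation = ABBREVIATION_TO_ENGLISH.getOrDefault(abbreviation, abbreviation);
        switch (abbreviation) {
            case "K": return Math.round(value * 1_000L);
            case "M": return Math.round(value * 1_000_000L);
            case "B": return Math.round(value * 1_000_000_000L);
            default:  return Math.round(value);
        }
    }

    public static void main(final String[] args) {
        System.out.println(mixedNumberWordToLong("250 k abonnés"));       // 250000
        System.out.println(mixedNumberWordToLong("1,2 Mio. Abonnenten")); // 1200000
    }
}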

Contributor

@TobiGr TobiGr left a comment

The only related problem is channel subscription count, so I fixed it this way: if the content language is not English and the sub count is shortened, it makes a new request in English and gets the channel sub count.

What is the exact problem? Do you get the wrong format or no result?

@B0pol
Member Author

B0pol commented Feb 15, 2020

The only related problem is channel subscription count, so I fixed it this way: if the content language is not English and the sub count is shortened, it makes a new request in English and gets the channel sub count.

What is the exact problem? Do you get the wrong format or no result?

The problem is: months ago, YouTube shortened the sub count for channels, so there is no exact number anymore. As it's shortened, it gives 250K, 1M… but in other languages it could be 250 k (French) or 250 plus some other suffix (with a space), and then we only gather the number, leading to TeamNewPipe/NewPipe#2632
Enforcing English fixed this, but it broke titles & descriptions, which came back in the wrong language. Making a new request in English when the number is possibly wrong is the solution I came up with.
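
(To illustrate the failure mode, not the extractor's actual code: keeping only the digits of a shortened, localized count silently drops the magnitude.)

// Illustration only: the "k" suffix carries the magnitude, so extracting the digits alone is wrong.
String french = "250 k abonnés";
long parsed = Long.parseLong(french.replaceAll("\\D+", "")); // 250 instead of 250 000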

@TobiGr
Contributor

TobiGr commented Feb 15, 2020

Ah yes, I remember. In this case, the best solution would be to create a list of the abbreviations for all supported languages and then convert the numbers correctly. Making a new request for a single value does not seem like the right approach to me, as it causes a lot of extra traffic.

Contributor

@TobiGr TobiGr left a comment

Thank you for the effort. This was massive work which definitely cost you hours to complete.


import java.util.HashMap;

public class YoutubeAbbreviationSubCountMap {
Contributor

Can you please add a JavaDoc for this class?

Contributor

I am not sure about the class name either. I assume that these abbreviations can be used by other services, too. Is this correct? If yes, please move this file to the extractor utils.

Member Author

Yes, it can be used if one wants to parse abbreviations from other languages (and if some are missing for a service, they can easily be added).

Member Author

Is the JavaDoc good?

Contributor

Yes, looks good. However, the class is not a map; we should rename it and change that part in the doc, too.

Contributor

@mauriciocolli mauriciocolli left a comment

With the suggested testing approach, I found that some patterns are missing. I'm almost sure that even more are missing, but there are too many to get all of them like this (one would have to find channels within all the possible ranges).

I found this approach too brittle (referring to how the parsing is done and how the patterns are stored).

What about using an approach similar to the time ago parser? Luckily, it seems that YouTube, unlike with the dates, follows the Unicode data closely (maybe it even uses it?).

All (or most) of the patterns are well known, distributed by Unicode, and freely available:
https://github.com/unicode-org/cldr/tree/master/common/main.

This would make developing a parser a lot easier.
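
As a rough sketch of how those patterns could drive a parser (this is not the extractor's code; the element names follow the CLDR files, and the French short patterns are used here only as an assumed example):

// CLDR ships compact-decimal patterns per locale (common/main/<lang>.xml,
// decimalFormatLength type="short"), e.g. type="1000" -> "0 k",
// type="10000" -> "00 k", type="1000000" -> "0 M". Each pattern reduces to a
// (suffix, multiplier) rule instead of a hand-written abbreviation list.
public final class CldrCompactPatternSketch {

    // "00 k" keeps two significant digits of a 10000-type value, so the
    // multiplier is type / 10^(placeholders - 1).
    static long multiplierOf(final long type, final String pattern) {
        final int placeholders = pattern.replaceAll("[^0]", "").length();
        return type / (long) Math.pow(10, placeholders - 1);
    }

    // Strip the digit placeholders and (non-breaking) spaces; what remains is the suffix.
    static String suffixOf(final String pattern) {
        return pattern.replaceAll("[0\\s\u00A0]", "");
    }

    public static void main(final String[] args) {
        // Assuming the French short patterns from CLDR:
        System.out.println(suffixOf("0 k") + " -> " + multiplierOf(1_000L, "0 k"));     // k -> 1000
        System.out.println(suffixOf("00 k") + " -> " + multiplierOf(10_000L, "00 k"));  // k -> 1000
        System.out.println(suffixOf("0 M") + " -> " + multiplierOf(1_000_000L, "0 M")); // M -> 1000000
    }
}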

For example, th is failing for some thousands cases; it would be this case right here, or, using the segmented version, here.


PS: As of now, it seems like the hi of the time ago parser is failing because some year patterns are not included; this would have to be fixed before enabling all languages. I will open a PR later.

Member Author

@B0pol B0pol left a comment

.

@B0pol
Member Author

B0pol commented Feb 17, 2020

So, I made another test file using the YouTube discover page: we test about 112 channels for each of the 80 languages. It only tests abbreviations though, as it's not a channel page.

I included below actual tests with the extractor for the 80 languages, so you can easily test a channel. The downside is that there are more false negatives.

The map is complete now, the crash report is straightforward (here), it doesn't make the app crash, and there is an easy workaround for users if one language somehow fails: switch the content language to English (until the next update).

Contributor

@TobiGr TobiGr left a comment

Thanks, almost done.

It now doesn't fail the whole test if one language fails, but shows an error on the console. You may want to individually check the languages that failed after the test, with testOneLanguageExtractor().

for mixedNumberWordToLong, using powers of ten may lead to a small rounding error.
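
To illustrate the concern behind that commit (a sketch, not the PR's code): when the scale factor comes from Math.pow the product is a double, and truncating it with a plain cast can land one unit short, while rounding or exact decimal arithmetic does not.

import java.math.BigDecimal;

// The double closest to 8.2 is slightly below 8.2, so 8.2 * 10^6 computed in
// doubles ends up just below 8 200 000 and a (long) cast truncates it.
public final class PowerOfTenRoundingSketch {
    public static void main(final String[] args) {
        final double product = Double.parseDouble("8.2") * Math.pow(10, 6);
        System.out.println((long) product);      // 8199999 - truncated toward zero
        System.out.println(Math.round(product)); // 8200000 - rounded to nearest
        System.out.println(new BigDecimal("8.2").movePointRight(6).longValueExact()); // 8200000, exact
    }
}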
@TobiGr TobiGr requested a review from Stypox February 20, 2020 11:48
@TobiGr
Contributor

TobiGr commented Feb 20, 2020

@Stypox @mauriciocolli When you think that this is good to go, please merge.

Member

@Stypox Stypox left a comment

I skimmed through everything once again and the code is good. Thank you for the effort :-D @B0pol

@Stypox Stypox dismissed mauriciocolli’s stale review February 21, 2020 13:10

Everything was fixed

@Stypox
Member

Stypox commented Feb 21, 2020

@B0pol Travis gives two warnings. Could you fix them?

linkhandler/SearchQueryHandlerFactory.java:48: warning - @return tag has no arguments.
utils/Utils.java:108: warning - @param argument "loc:" is not a parameter name.
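
(For context: the first warning means an @return tag without a description, the second that the colon was written as part of the @param name. A sketch follows; the method names below are hypothetical stand-ins, not the real SearchQueryHandlerFactory/Utils signatures.)

import java.util.Locale;

public final class JavadocWarningSketch {

    /**
     * Builds a hypothetical search URL.
     *
     * @param query the search string
     * @return the URL of the search results page (a description after the
     *         {@code @return} tag is what the first warning asks for)
     */
    public static String getSearchUrl(final String query) {
        return "https://example.invalid/search?q=" + query;
    }

    /**
     * Formats a number for the given locale.
     *
     * @param number the value to format
     * @param loc    the locale to use ("loc" without the colon, so the tag
     *               matches an actual parameter name)
     * @return the formatted number
     */
    public static String formatNumber(final long number, final Locale loc) {
        return String.format(loc, "%,d", number);
    }
}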

Also, there is an error with full links in description, could this be related to your changes?

org.schabi.newpipe.extractor.services.youtube.stream.YoutubeStreamExtractorDefaultTest$DescriptionTestUnboxing > testGetFullLinksInDescription FAILED

@B0pol
Member Author

B0pol commented Feb 21, 2020

Because I switched to raw text instead of HTML, full www.youtube.com links are not provided, only youtu.be ones.

@Stypox
Member

Stypox commented Feb 21, 2020

That's bad, we need the full links provided in the HTML, otherwise long links in the description won't work... Are you sure there is no way to fix the HTML formatting?

@Stypox
Member

Stypox commented Feb 21, 2020

Oh, I saw you edited your comment... Yeah, that would be ok ;-)
Are you sure other URLs are not abbreviated?

@B0pol
Member Author

B0pol commented Feb 21, 2020

I tested, and links in the description are OK.
I didn't find any other shortened links.

But this is weird. Look at the video: it has only youtu.be links. If you replace PLAIN_TEXT with HTML first, print the description and Ctrl-F it, there are no youtu.be links, but it still passes the test.

@Stypox
Member

Stypox commented Feb 21, 2020

Ok, then maybe the issue has to do with the video description having been changed (if that's the case, ignore my review, revert the changes in the description test and then replace youtube.com with youtu.be).
Could you run a quick check with this video? It has a long link in the description: https://www.youtube.com/watch?v=gd5bynDvDUw

@B0pol
Member Author

B0pol commented Feb 21, 2020

On the YouTube website it has no full links:
image
Same thing with m.youtube.com.

And the same when gathering via the PLAIN_TEXT way: there are no full links.

@Stypox
Member

Stypox commented Feb 21, 2020

When extracting HTML the description is processed and abbreviated links are converted into the correct ones.
See parseHtmlAndGetFullLinks:

private String parseHtmlAndGetFullLinks(String descriptionHtml)
        throws MalformedURLException, UnsupportedEncodingException, ParsingException {
    final Document description = Jsoup.parse(descriptionHtml, getUrl());
    for (Element a : description.select("a")) {
        final String rawUrl = a.attr("abs:href");
        final URL redirectLink = new URL(rawUrl);
        final Matcher onClickTimestamp;
        final String queryString;
        if ((onClickTimestamp = DESCRIPTION_TIMESTAMP_ONCLICK_REGEX.matcher(a.attr("onclick")))
                .find()) {
            a.removeAttr("onclick");

            String hours = coalesce(onClickTimestamp.group(1), "0");
            String minutes = onClickTimestamp.group(2);
            String seconds = onClickTimestamp.group(3);

            int timestamp = 0;
            timestamp += Integer.parseInt(hours) * 3600;
            timestamp += Integer.parseInt(minutes) * 60;
            timestamp += Integer.parseInt(seconds);
            String setTimestamp = "&t=" + timestamp;

            // Even after clicking https://youtu.be/...?t=6,
            // getUrl() is https://www.youtube.com/watch?v=..., never youtu.be, never &t=.
            a.attr("href", getUrl() + setTimestamp);
        } else if ((queryString = redirectLink.getQuery()) != null) {
            // if the query string is null we are not dealing with a redirect link,
            // so we don't need to override it.
            final String link = Parser.compatParseMap(queryString).get("q");
            if (link != null) {
                // if link is null the a tag is a hashtag.
                // They refer to the youtube search. We do not handle them.
                a.text(link);
                a.attr("href", link);
            } else if (redirectLink.toString().contains("https://www.youtube.com/")) {
                a.text(redirectLink.toString());
                a.attr("href", redirectLink.toString());
            }
        } else if (redirectLink.toString().contains("https://www.youtube.com/")) {
            descriptionHtml = descriptionHtml.replace(rawUrl, redirectLink.toString());
            a.text(redirectLink.toString());
            a.attr("href", redirectLink.toString());
        }
    }
    return description.select("body").first().html();
}

@B0pol
Member Author

B0pol commented Feb 21, 2020

What's wrong with having shortened links?

@Stypox
Member

Stypox commented Feb 21, 2020

I think we are misunderstanding each other ;-)
By "full links" I mean "links that are not abbreviated using ...". Those cannot be clicked in NewPipe descriptions (without being converted to full links beforehand) since the full URL is missing. Shortened YouTube URLs (i.e. "youtu.be"), on the other hand, are perfectly fine and work without problems.

With the new JSON method I am not sure whether full links are provided or whether there are "..."s. Could you check that the JSON microformat description for video "https://www.youtube.com/watch?v=gd5bynDvDUw" contains the full link "https://www.youtube.com/channel/UCf5q0cbFOLbphljteZ9d4Pw" and not "https://www.youtube.com/channel/UCf5q..."?

Sorry for my misunderstanding 🤦‍♂️

@B0pol B0pol closed this Feb 21, 2020
@B0pol B0pol reopened this Feb 21, 2020
@B0pol
Member Author

B0pol commented Feb 21, 2020

Yes, they are ok. (Why is the close button in exactly the same place as cancel on issues???)

@B0pol
Member Author

B0pol commented Feb 21, 2020

Otherwise, about this PR: yt_new, i.e. #258, breaks it again by adding "subscribers" to the count, e.g. 100M subscribers, but in other languages 100 M abonnés, and with the current method the abbreviation obtained would be "Mabonnés".

I think it could break more things, because I've seen other places where they added words like "views" (but I think removing non-digit characters still does the job there, as that number is not rounded).

As that change comes in March, so soon, I'm closing the PR, because it would be pretty useless if we then had to comment out the supported languages once yt_new breaks the subscriber count again.

I'll wait for yt_new to be merged and try to fix it again; if I succeed, I'll reopen the PR and mention you.

@B0pol B0pol closed this Feb 21, 2020
@Stypox
Member

Stypox commented Feb 21, 2020

Ok

@wb9688 wb9688 mentioned this pull request Feb 26, 2020
@B0pol B0pol deleted the localisation branch March 1, 2020 19:41
Development

Successfully merging this pull request may close these issues.

Request: Have video titles and descriptions in the default content language where available