-
Notifications
You must be signed in to change notification settings - Fork 473
Locale order support #789
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Locale order support #789
Conversation
if not locales and use_given_order: | ||
raise ValueError("locales must be given if use_given_order is True") | ||
if not (locales or languages) and settings.USE_GIVEN_LANGUAGE_ORDER: | ||
raise ValueError("locales or languages must be given if USE_GIVEN_LANGUAGE_ORDER is True") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure of this failure. If the given list of languages is empty in this case then it should use the defaults.
I'm thinking on the case where this is combined with a language detection model. For some cases, the prediction could fail and not return any language, but still, dateparser must try its best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do see your point, and I agree. The question is if this is what's expected: to restrict to the given locales (or languages) or to use those as a starting point but try the other locales anyways.
If the second is true, then removing this check should be the right choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can talk about the use case I have in mind: using a page language detector to provide hints to have better date parsing. In this case, what you have is just hints
:
- In some case, you don't get any language from the language detector. Still, you want dateparser to try its best.
- In some cases, the list won't be complete. You don't want to remove locales just because the language detector didn't detect them.
So from my point of view, this list should be just hints: they can change the order of the default locales, but never restrict them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @ivanprado, the languages
and locales
parameters have been used historically for restricting the used languages. This is sometimes really important and helps to improve the performance, so I think we can't change them to work as you mentioned.
However, what you say sounds like adding something new like a exclude_languages
parameter, so you could do something like this:
date = dateparser.parse("<date string>", languages=['ru', 'tr', 'es'])
if not date:
date = dateparser.parse("<date string>", exclude_languages=['ru', 'tr', 'es'])
What do you think? Could this work for you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, on the argument of not raising an exception because the calling code may provide an empty list, the calling code should also be able not to set that setting when that happens. I fail to see the point of removing the exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@noviluni I see your point. It was not clear to me that locales
and languages
were used as contraints to improve performance. The documentation (https://github.com/scrapinghub/dateparser/blob/master/dateparser/__init__.py#L23-L31) is ambiguous, and my first thought was that they would be used as "hints" . I think it would be great to update the documentation to make it clear that they restrict the languages used.
Given that languages
and locales
are for restricting I agree with you than then maybe is not the best place for the use case I'm proposing. What I'm proposing is to be able to propose "hints" for the locales/languages. I don't think exclude_languages
should be way.
What do you think about a new parameter language_hints
? It would be used just to alter the languages/locales priorities.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @mirceachira! Good job!
It seems that there are some errors in the flake8 pipeline: https://travis-ci.org/github/scrapinghub/dateparser/jobs/728909794, could you fix them?
@@ -299,7 +294,7 @@ class DateDataParser: | |||
|
|||
@apply_settings | |||
def __init__(self, languages=None, locales=None, region=None, try_previous_locales=False, | |||
use_given_order=False, settings=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should probably deprecate this instead of just removing it, but maybe we can do it as we are going to release a new version with multiple breaking changes. cc: @Gallaecio
@@ -21,4 +21,5 @@ | |||
'RETURN_TIME_AS_PERIOD': False, | |||
'PARSERS': default_parsers, | |||
'REQUIRE_PARTS': [], | |||
'USE_GIVEN_LANGUAGE_ORDER': False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if it's better to keep the False
by default or to change to True
. Any strong opinion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think False
is right as a default because that way you'd get the added benefit of trying your most likely language first. If you give a list of locales I'd assume you want the date to be extracted correctly regardless of which of your locales is the right one. Also, I would argue that expecting it to be as efficient as possible by default is better if you allow the customization.
2d534ae
to
09e9dd1
Compare
Codecov Report
@@ Coverage Diff @@
## master #789 +/- ##
==========================================
- Coverage 98.37% 98.37% -0.01%
==========================================
Files 231 231
Lines 2522 2519 -3
==========================================
- Hits 2481 2478 -3
Misses 41 41
Continue to review full report at Codecov.
|
closes #770