Locale order support #789

dariuschira · 2020-09-17T14:55:32Z

closes #770

docs/usage.rst

ivanprado · 2020-09-18T09:15:10Z

dateparser/date.py

-        if not locales and use_given_order:
-            raise ValueError("locales must be given if use_given_order is True")
+        if not (locales or languages) and settings.USE_GIVEN_LANGUAGE_ORDER:
+            raise ValueError("locales or languages must be given if USE_GIVEN_LANGUAGE_ORDER is True")
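For readers skimming the thread, here is a minimal standalone sketch of the condition in the added lines. The function name `check_language_order` and its boolean parameter are hypothetical stand-ins for the `settings.USE_GIVEN_LANGUAGE_ORDER` lookup; this is not the code in `dateparser/date.py`.

```python
def check_language_order(locales=None, languages=None, use_given_language_order=False):
    """Raise if an explicit order is requested but there is nothing to order.

    Hypothetical helper: `use_given_language_order` stands in for
    settings.USE_GIVEN_LANGUAGE_ORDER from the diff.
    """
    # Empty lists and None are both falsy, so both trigger the error.
    if not (locales or languages) and use_given_language_order:
        raise ValueError(
            "locales or languages must be given if USE_GIVEN_LANGUAGE_ORDER is True"
        )
```

Note that an empty list behaves the same as `None` here, which is the case the review comments below are about.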


I'm not sure about this failure. If the given list of languages is empty, it should fall back to the defaults.

I'm thinking of the case where this is combined with a language detection model. In some cases the prediction could fail and return no language, but dateparser should still try its best.

I do see your point, and I agree. The question is what's expected: to restrict parsing to the given locales (or languages), or to use them as a starting point and still try the other locales.
If the latter, then removing this check should be the right choice.

I can describe the use case I have in mind: using a page language detector to provide hints for better date parsing. In this case, all you have is hints:

- In some cases, you don't get any language from the language detector. Still, you want dateparser to try its best.
- In some cases, the list won't be complete. You don't want to remove locales just because the language detector didn't detect them.

So from my point of view, this list should be just hints: they can change the order of the default locales, but never restrict them.
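To make the "hints reorder but never restrict" idea concrete, here is a rough sketch. The function `order_with_hints` and its arguments are hypothetical and do not exist in dateparser. It only illustrates the ordering behaviour described above.

```python
def order_with_hints(default_locales, hints):
    """Move hinted locales to the front without dropping any default locale.

    Hypothetical helper, not part of dateparser. Hints that are not in the
    default list are ignored, and an empty or missing hint list leaves the
    default order untouched.
    """
    defaults = list(default_locales)
    known = set(defaults)
    hinted = []
    for hint in hints or []:
        # Keep only known locales, and keep each one once.
        if hint in known and hint not in hinted:
            hinted.append(hint)
    rest = [locale for locale in defaults if locale not in hinted]
    return hinted + rest
```

With this behaviour, a detector that returns nothing, or an incomplete list, costs at most some parsing time. No locale is ever removed.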

Hi @ivanprado, the languages and locales parameters have historically been used to restrict the languages that are tried. This is sometimes really important and helps performance, so I think we can't change them to work as you describe.

However, what you describe sounds like adding something new, such as an exclude_languages parameter, so you could do something like this:

    date = dateparser.parse("<date string>", languages=['ru', 'tr', 'es'])
    if not date:
        date = dateparser.parse("<date string>", exclude_languages=['ru', 'tr', 'es'])

What do you think? Could this work for you?
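Since `exclude_languages` is only a proposal in this thread, a similar two-pass fallback can be sketched with the parameters that already exist. In the sketch below, `parse_with_fallback` is a hypothetical helper, and `parse` is any callable that accepts a `languages` keyword and returns `None` on failure, as `dateparser.parse` does. Unlike the proposal above, the second pass here retries every language, including the preferred ones.

```python
def parse_with_fallback(text, preferred_languages, parse):
    """Try the preferred languages first, then retry without any restriction.

    Hypothetical helper. `parse` is injected so the sketch does not depend on
    a particular dateparser version; pass `dateparser.parse` in real use.
    """
    if preferred_languages:
        result = parse(text, languages=list(preferred_languages))
        if result is not None:
            return result
    # No hints were given, or the restricted pass failed: use the defaults.
    return parse(text)
```

In real use this would be called as `parse_with_fallback("<date string>", ['ru', 'tr', 'es'], dateparser.parse)`.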

Also, on the argument for not raising an exception because the calling code may provide an empty list: the calling code can just as easily leave that setting unset when that happens. I fail to see the point of removing the exception.

@noviluni I see your point. It was not clear to me that locales and languages were used as constraints to improve performance. The documentation (https://github.com/scrapinghub/dateparser/blob/master/dateparser/__init__.py#L23-L31) is ambiguous, and my first thought was that they would be used as "hints". I think it would be great to update the documentation to make it clear that they restrict the languages used.

Given that languages and locales are for restricting, I agree that they may not be the best place for the use case I'm proposing. What I'm proposing is a way to supply "hints" for the locales/languages. I don't think exclude_languages is the way to do that.

What do you think about a new parameter language_hints? It would be used just to alter the languages/locales priorities.

noviluni

Hey @mirceachira! Good job!

It seems that there are some errors in the flake8 pipeline: https://travis-ci.org/github/scrapinghub/dateparser/jobs/728909794, could you fix them?

noviluni · 2020-09-17T15:34:31Z

dateparser/date.py

@@ -299,7 +294,7 @@ class DateDataParser:

    @apply_settings
    def __init__(self, languages=None, locales=None, region=None, try_previous_locales=False,
-                 use_given_order=False, settings=None):


We should probably deprecate this instead of just removing it, but maybe removing it is fine, since we are going to release a new version with multiple breaking changes. cc: @Gallaecio
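If deprecation were preferred over removal, one common pattern is to keep accepting the old keyword, warn, and map it onto the new setting. The class `ParserSketch` below is a hypothetical stand-in, not `DateDataParser`, and the mapping onto `USE_GIVEN_LANGUAGE_ORDER` is an assumption about how the transition could look.

```python
import warnings


class ParserSketch:
    """Hypothetical stand-in for a class that is retiring a keyword argument."""

    def __init__(self, languages=None, use_given_order=None, settings=None):
        self.languages = languages
        self.settings = dict(settings or {})
        if use_given_order is not None:
            # Warn the caller, then honour the old keyword through the setting.
            warnings.warn(
                "use_given_order is deprecated; use the "
                "USE_GIVEN_LANGUAGE_ORDER setting instead",
                DeprecationWarning,
                stacklevel=2,
            )
            # An explicitly passed setting wins over the deprecated keyword.
            self.settings.setdefault(
                "USE_GIVEN_LANGUAGE_ORDER", bool(use_given_order)
            )
```

Defaulting the keyword to `None` rather than `False` lets the constructor tell "not passed" apart from an explicit `False`, so existing callers that never used it see no warning.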

noviluni · 2020-09-17T15:36:07Z

dateparser_data/settings.py

@@ -21,4 +21,5 @@
    'RETURN_TIME_AS_PERIOD': False,
    'PARSERS': default_parsers,
    'REQUIRE_PARTS': [],
+    'USE_GIVEN_LANGUAGE_ORDER': False


I'm not sure if it's better to keep the False by default or to change to True. Any strong opinion?

I think False is right as a default because that way you'd get the added benefit of trying your most likely language first. If you give a list of locales I'd assume you want the date to be extracted correctly regardless of which of your locales is the right one. Also, I would argue that expecting it to be as efficient as possible by default is better if you allow the customization.

codecov · 2020-09-24T10:28:41Z

Codecov Report

Merging #789 (94c96fd) into master (a18cc09) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #789      +/-   ##
==========================================
- Coverage   98.37%   98.37%   -0.01%     
==========================================
  Files         231      231              
  Lines        2522     2519       -3     
==========================================
- Hits         2481     2478       -3     
  Misses         41       41

| Impacted Files | Coverage Δ | |
|---|---|---|
| dateparser/conf.py | `100.00% <ø> (ø)` | |
| dateparser/date.py | `99.56% <100.00%> (-0.01%)` | ⬇️ |
| dateparser/languages/loader.py | `100.00% <100.00%> (ø)` | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Last update a18cc09...94c96fd.

dariuschira added 3 commits September 17, 2020 17:47

new USE_GIVEN_LANGUAGE_ORDER setting

40e117f

remove use_given_order param and use USE_GIVEN_LANGUAGE_ORDER setting in DateDataParser

719e767

docs for USE_GIVEN_LANGUAGE_ORDER

fc8d799

dariuschira requested a review from noviluni September 17, 2020 14:56

ivanprado reviewed Sep 18, 2020

docs/usage.rst (resolved)

ivanprado reviewed Sep 18, 2020

dariuschira added 4 commits September 21, 2020 10:07

fix date order parsing with empty languages or locales

56d35e4

added empty locales case in docs

1e766cb

removed use_given_order tests from init date tests

5bed806

added USE_GIVEN_LANGUAGE_ORDER tests

ee67371

noviluni reviewed Sep 21, 2020

flake8 fix for tests and conf.py

09e9dd1

dariuschira force-pushed the locale-order-support branch from 2d534ae to 09e9dd1 Compare September 24, 2020 10:23

Gallaecio requested changes Sep 24, 2020

dateparser/date.py (outdated, resolved)

dariuschira self-assigned this Sep 24, 2020

noviluni added this to the v1.0.0 milestone Sep 28, 2020

noviluni added the breaking-change label Sep 28, 2020

noviluni modified the milestones: v1.0.0, 1.1.0 Oct 26, 2020

noviluni mentioned this pull request Oct 27, 2020

document settings #722

Merged

add mandatory locales or languages when USE_GIVEN_LANGUAGES is set and fixed tests

94c96fd
