Skip to content

Locale order support #789

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: master
Choose a base branch
from
Open

Locale order support #789

wants to merge 9 commits into from

Conversation

dariuschira
Copy link
Contributor

closes #770

if not locales and use_given_order:
raise ValueError("locales must be given if use_given_order is True")
if not (locales or languages) and settings.USE_GIVEN_LANGUAGE_ORDER:
raise ValueError("locales or languages must be given if USE_GIVEN_LANGUAGE_ORDER is True")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure of this failure. If the given list of languages is empty in this case then it should use the defaults.

I'm thinking on the case where this is combined with a language detection model. For some cases, the prediction could fail and not return any language, but still, dateparser must try its best.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do see your point, and I agree. The question is if this is what's expected: to restrict to the given locales (or languages) or to use those as a starting point but try the other locales anyways.
If the second is true, then removing this check should be the right choice.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can talk about the use case I have in mind: using a page language detector to provide hints to have better date parsing. In this case, what you have is just hints:

  • In some case, you don't get any language from the language detector. Still, you want dateparser to try its best.
  • In some cases, the list won't be complete. You don't want to remove locales just because the language detector didn't detect them.

So from my point of view, this list should be just hints: they can change the order of the default locales, but never restrict them.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ivanprado, the languages and locales parameters have been used historically for restricting the used languages. This is sometimes really important and helps to improve the performance, so I think we can't change them to work as you mentioned.

However, what you say sounds like adding something new like a exclude_languages parameter, so you could do something like this:

date = dateparser.parse("<date string>", languages=['ru', 'tr', 'es'])
if not date:
    date = dateparser.parse("<date string>", exclude_languages=['ru', 'tr', 'es'])

What do you think? Could this work for you?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, on the argument of not raising an exception because the calling code may provide an empty list, the calling code should also be able not to set that setting when that happens. I fail to see the point of removing the exception.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@noviluni I see your point. It was not clear to me that locales and languages were used as contraints to improve performance. The documentation (https://github.com/scrapinghub/dateparser/blob/master/dateparser/__init__.py#L23-L31) is ambiguous, and my first thought was that they would be used as "hints" . I think it would be great to update the documentation to make it clear that they restrict the languages used.

Given that languages and locales are for restricting I agree with you than then maybe is not the best place for the use case I'm proposing. What I'm proposing is to be able to propose "hints" for the locales/languages. I don't think exclude_languages should be way.

What do you think about a new parameter language_hints? It would be used just to alter the languages/locales priorities.

Copy link
Collaborator

@noviluni noviluni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @mirceachira! Good job!

It seems that there are some errors in the flake8 pipeline: https://travis-ci.org/github/scrapinghub/dateparser/jobs/728909794, could you fix them?

@@ -299,7 +294,7 @@ class DateDataParser:

@apply_settings
def __init__(self, languages=None, locales=None, region=None, try_previous_locales=False,
use_given_order=False, settings=None):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably deprecate this instead of just removing it, but maybe we can do it as we are going to release a new version with multiple breaking changes. cc: @Gallaecio

@@ -21,4 +21,5 @@
'RETURN_TIME_AS_PERIOD': False,
'PARSERS': default_parsers,
'REQUIRE_PARTS': [],
'USE_GIVEN_LANGUAGE_ORDER': False
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if it's better to keep the False by default or to change to True. Any strong opinion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think False is right as a default because that way you'd get the added benefit of trying your most likely language first. If you give a list of locales I'd assume you want the date to be extracted correctly regardless of which of your locales is the right one. Also, I would argue that expecting it to be as efficient as possible by default is better if you allow the customization.

@codecov
Copy link

codecov bot commented Sep 24, 2020

Codecov Report

Merging #789 (94c96fd) into master (a18cc09) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #789      +/-   ##
==========================================
- Coverage   98.37%   98.37%   -0.01%     
==========================================
  Files         231      231              
  Lines        2522     2519       -3     
==========================================
- Hits         2481     2478       -3     
  Misses         41       41              
Impacted Files Coverage Δ
dateparser/conf.py 100.00% <ø> (ø)
dateparser/date.py 99.56% <100.00%> (-0.01%) ⬇️
dateparser/languages/loader.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a18cc09...94c96fd. Read the comment docs.

@dariuschira dariuschira self-assigned this Sep 24, 2020
@noviluni noviluni added this to the v1.0.0 milestone Sep 28, 2020
@noviluni noviluni modified the milestones: v1.0.0, 1.1.0 Oct 26, 2020
@noviluni noviluni mentioned this pull request Oct 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Wrong prioritization of languages
4 participants