Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

po2json: Make sure that --fuzzy and --removeuntranslated can be used together. #4257

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

rv0
Copy link

@rv0 rv0 commented Feb 2, 2021

Currently when you combine the options --fuzzy and --removeuntranslated together, the fuzzy translations are also removed.
Both of these options work fine on their own, the problem lies in the check itself, that currently ignores the fuzzy parameter and will skip any fuzzy strings when --removeuntranslated is passed on.

The following PR will improve the logic in the "skip" check, to make sure that:

  • When --removeuntranslated is passed, the string must be empty or missing to be skipped
    or
  • When --fuzzy is passed, the string will be included

@rv0
Copy link
Author

rv0 commented Feb 2, 2021

I made a mistake, will fix :)

@rv0
Copy link
Author

rv0 commented Feb 2, 2021

Not sure how to fix/handle/interpret the flake8 test that is failing.

Also, would love to add a unittest for these changes, but I haven't figured out yet how to run them

Copy link
Member

@nijel nijel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rv0 rv0 marked this pull request as draft February 2, 2021 18:31
@rv0
Copy link
Author

rv0 commented Feb 2, 2021

@nijel
Thanks for providing me the information, got unit tests to work eventually and noticed the flaws of my initial patch.

However, I don't think the test cases reflect the expected behaviour, so I changed them too. This might be radical so allow me to explain the logic used,

given that we start with po input:

#: .foo
msgid "foo"
msgstr "oof"

#: .bar
#, fuzzy
msgid "bar"
msgstr "rab"

#: .baz
#, fuzzy
msgid "baz"
msgstr ""

#: .qux
msgid "qux"
msgstr ""
"""

And json file:

{
    "foo": "foo",
    "bar": "bar",
    "baz": "baz",
    "qux": "qux"
}

I adapted the following tests:

include_fuzzy=false and remove_untranslated=false

This was the expected output:

{
    "foo": "oof",
     "bar": "bar",
     "baz": "baz",
     "qux": "qux"
}

As both bar and baz are fuzzy, they should be removed if include_fuzzy is false:

{
    "foo": "oof",
     "qux": "qux"
}

include_fuzzy=true and remove_untranslated=true

This was the expected output:

{
    "foo": "oof"
}

As bar is fuzzy and translated, imo it should also be included:

{
    "foo": "oof",
     "bar": "bar",
}

In closing..

These changes seem logical to me, however doing a search in the issue queue I see they have caused confusion to others in the past, some examples:
#3456
#3550

Outside of the scope of this change however, I also think the code could be simplified a lot by adding an "allow_fuzzy=False" parameter to the global is_translated function, or even better just removing the fuzzy check from that function all together (because it's an assumption that is only clear when you read the source code), "not translated" and "fuzzy" mean two completely different things to me, but then again there may be conceptual or historical reasons behind all this that I'm completely missing.

@rv0 rv0 marked this pull request as ready for review February 2, 2021 19:43
@nijel
Copy link
Member

nijel commented Feb 3, 2021

Outside of the scope of this change however, I also think the code could be simplified a lot by adding an "allow_fuzzy=False" parameter to the global is_translated function, or even better just removing the fuzzy check from that function all together (because it's an assumption that is only clear when you read the source code), "not translated" and "fuzzy" mean two completely different things to me, but then again there may be conceptual or historical reasons behind all this that I'm completely missing.

This depends on how you interpret "is translated". Right now it means "the translation is completed", which definitely should exclude fuzzy (and other needing review states, as in xliff). I don't think it's reasonable to change the semantics as that could easily silently break third-party code. So, the unit is either untranslated, fuzzy or translated (or approved, but that's out of scope here).

@rv0
Copy link
Author

rv0 commented Feb 3, 2021

Outside of the scope of this change however, I also think the code could be simplified a lot by adding an "allow_fuzzy=False" parameter to the global is_translated function, or even better just removing the fuzzy check from that function all together (because it's an assumption that is only clear when you read the source code), "not translated" and "fuzzy" mean two completely different things to me, but then again there may be conceptual or historical reasons behind all this that I'm completely missing.

This depends on how you interpret "is translated". Right now it means "the translation is completed", which definitely should exclude fuzzy (and other needing review states, as in xliff). I don't think it's reasonable to change the semantics as that could easily silently break third-party code. So, the unit is either untranslated, fuzzy or translated (or approved, but that's out of scope here).

There's not really any indication that it means that a translation is completed, the comments in code say:

Indicates whether this unit is translated.

As for the semantics, the GNU documentation does more or less say a fuzzy entry is a translated entry, for must purposes, only from the viewpoint of the translator one that needs additional review.
I think the fact that the difference is not clear is potentially a source of confusion and bugs.

In any case, the changes in my PR are made with the idea in mind that a fuzzy translation is a translation. Any comments on that?

@nijel
Copy link
Member

nijel commented Feb 3, 2021

I have no clue about the original intents behind this behavior, I'm just describing how it behaves. Most of the things in the translate-toolkit are modelled around XLIFF, so it might be also the reason for this behavior.

Another issue with --removeuntranslated and --no-fuzzy is that each convertor implements this on its own and the behavior is not consistent. This is described in issues you've referenced.

include_fuzzy=false and remove_untranslated=false
As both bar and baz are fuzzy, they should be removed if include_fuzzy is false:

The question is whether they should be removed (what you implement) or source used instead (what is current implementation). po2prop and po2dtd both include source string in this case.

include_fuzzy=true and remove_untranslated=true
As bar is fuzzy and translated, imo it should also be included:

po2prop doesn't include fuzzy in this case and po2dtd does.

@rv0
Copy link
Author

rv0 commented Feb 3, 2021

I understand. So how to continue from here?

In my use case, I need to make sure no source strings end up in the localized json, as the strings are used with speech synthesis and this will make for example an English word to be spoken with a French voice. Not very good ;) If I don't include the source string, the application will automatically fall back to the English word and speak it with an English voice. That's fine.
On the other hand some words are machine translated and thus fuzzy, I want to include these in the localized json as in most cases the translation is good or good enough (using generic short sentences and machine learning driven translations is surprisingly effective nowadays).

I'd be happy to perform additional work on this PR if needed, I think my use-case is not that uncommon so a solution would benefit others, but I'm going to need some guidance on the approach to take. Perhaps additional parameters should be added?

@nijel
Copy link
Member

nijel commented Feb 3, 2021

In the end the "correct" behavior for the first case depends on usage of the resulting file - whether all keys should be present there or not. And I think --removeuntranslated is there to indicate this and in case it's not present, then all keys should be present in the output, even with the source strings. Therefore, I'd prefer reverting this part of the PR.

For fuzzy inclusion, I think the behavior you've made makes more sense - when you explicitly specify --fuzzy you want these strings to be present (as --nofuzzy is the default behavior). Even with --removeuntranslated.

Don't get me wrong, I just don't want to merge change which will end up breaking existing workflows and this easily can do that. The perfect solution would be to have a single logic used by all the convertors (see #3573).

@rv0
Copy link
Author

rv0 commented Feb 3, 2021

Does any of the other converters use some sort of parameter to disable the inclusion of source strings? If so I can add that as well (either here or in another PR)

@nijel
Copy link
Member

nijel commented Feb 3, 2021

I think --removeuntranslated is really the parameter which controls this, but it's supported only by few converters.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants