Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: allow headers to be capitalized TASK-1174 #332

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

rgraber
Copy link

@rgraber rgraber commented Jan 30, 2025

Summary

Allow capitalization of the 'label' and 'name' headers in an xls upload.

Notes

When an XLSForm is uploaded with capital letters in the label or name headers, interpret it the same as their lowercase counterparts. This fixes a ValueError on deploy, which was happening because the import didn't recognize the label column as one that doesn't expect translations.

Preview

Note: when switching between main and this branch in the formpack code, make sure to run pip-compile and pip install again.

  1. clone this repo into your local copy of kpi
  2. update kpi's requirements.in to use -e /container/path/to/kpi/formpack instead of the git address currently there
  3. pip-compile dependencies/pip/requirements.in && pip-compile dependencies/pip/dev_requirements.in
  4. pip install -r dependencies/pip/dev_requirements.txt
  5. Create a project in kpi by uploading this form
    SimpleForm-copy.xlsx
  6. Deploy the project
  7. 🔴 [on main] Error. If you have debug mode on, it will say ValueError: "Label" column is not translated
  8. 🟢 [on branch] Project deploys successfully

@rgraber rgraber changed the title fix: allow headers to be capitalized fix: allow headers to be capitalized TASK-1174 Jan 30, 2025
@rgraber rgraber requested a review from jnm January 30, 2025 18:56
@rgraber rgraber self-assigned this Jan 30, 2025
@coveralls
Copy link

Coverage Status

coverage: 86.674%. remained the same
when pulling cb00206 on rsgraber/TASK-1174-allow-capital-label-header
into cc9aaf6 on main.

@jnm
Copy link
Member

jnm commented Feb 4, 2025

5. Create a project in kpi by uploading this form
SimpleForm-copy.xlsx

I couldn't reproduce the problem with this form, locally or on kf.kobotoolbox.org, but I was able to get the ValueError by using Impact-Study-Consent-Form-DRC-1.xlsx from https://chat.kobotoolbox.org/#narrow/stream/6-Kobo-Support/topic/UI.20Issue/near/470119 (internal link). It seems like the important difference is that this file has mixed-case Label (in survey) and label (in choices). The "SimpleForm" XLSForm has Label in both.

Here's a basic reproduction case that works for me:
survey label upper, choices label lower.xlsx

@jnm
Copy link
Member

jnm commented Feb 4, 2025

Two observations:

  1. This fix results in both Label and label appearing in the translated array in the form JSON (Asset.content in KPI). That's not desirable:
     "translated": [
        "Label",
        "label",
        "media::image"
    ],
    • Contrast that with all questions and choices having the standardized lowercase label in the same form JSON (which is good)
  2. Because the fix affects manipulation of the content when the XLSForm is first imported, existing forms will continue to 500 on deployment attempts. This isn't strictly a dealbreaker, but it's not ideal.
    • It looks like we'd have to tolerate bare-string labels in the form JSON if we want to avoid having people reimport. That seems icky, as does messing with the asset content in KPI's bugfix.py. Let's not worry about it.

@jnm
Copy link
Member

jnm commented Feb 4, 2025

@tiritea, I'm curious what you think about diabolical behavior such as this (in the survey sheet) that we begin to allow with case-insensitive column names:

type name Label label
text name What Is Your Name what is your name

diabolical.xlsx

https://getodk.org/xlsform/ doesn't complain at all, and spits out XForm containing:

<text id="/data/name:label">
	<value>What Is Your Name</value>
</text>

I don't really want to introduce a case-insensitive duplicate column detection mechanism right now, so I'm inclined to just ape the pyxform behavior and take the first column encountered—assuming that's what it's doing.

@jnm
Copy link
Member

jnm commented Feb 4, 2025

OK, final comment of the night from me, I think 😅

This PR aims to tolerate any kind of capitalization in untranslated label and hint columns. It doesn't address case sensitivity in other columns, like label::English (en), media::image, or media::image::English (en).

If we want to be insensitive about case in column names, I think we need to standardize everything as lowercase except for language names. For example, English (en) and eNgLiSh (en), while evil, should be treated separately, and indeed an XLSForm with this survey sheet:

type name label::eNgLiSh (en) label::english (en)
text name What Is Your Name what is your name

…when run through pyxform yields:

<translation lang="eNgLiSh (en)">
	<text id="/data/name:label">
		<value>What Is Your Name</value>
	</text>
</translation>
<translation lang="english (en)">
	<text id="/data/name:label">
		<value>what is your name</value>
	</text>
</translation>

The real use case is that people will capitalize their language names as they see fit, and we should not interfere with that.

Everything else, though, should get standardized, e.g. Media::Image::Español (es) should become media::image::Español (es). This will require some study of what's going on in _get_special_survey_cols().

Oh, and one more thing: I don't see why there's not a continue after the first conditional in the loop. It seems like an old oversight:

    for column_name in uniq_cols.keys():
        if column_name in ['label', 'hint']:
            …
            # put a `continue` here?

I added one out of curiosity, and all the tests still pass 🤷 None of the other string comparisons or regex matches should evaluate as true for label or hint, so I think we're just doing a bunch of logic for nothing without the continue.

@tiritea
Copy link

tiritea commented Feb 4, 2025

I didn't see them referenced, but here are some related pyxform issues regarding case-(in)sensitivity:

XLSForm/pyxform#138
XLSForm/pyxform#738

@tiritea
Copy link

tiritea commented Feb 4, 2025

I'm curious what you think about diabolical behavior such as this (in the survey sheet) that we begin to allow with case-insensitive column names:

My thoughts: 🤮

TBH I dont have no idea if there is proscribed behavior for this! It could well be pyxform is taking the first, or it could just as easily be it is taking the last (but processing them in reverse order?), or...

So I'm probably also inclined to just ape whatever pyxfrom seems to be doing. From the above links, it certainly appears they should be treated case insensitively, but in saying that pyxfrom is throwing an error if you have all upercase 'LABEL' (!?)

type name LABEL
text name What Is Your Name

XLSForm Online sayeth: "Error: The survey element named 'name' has no label or hint."

@tiritea
Copy link

tiritea commented Feb 4, 2025

FWIW, it looks like rather than simply picking up the first matching column, pyxform may scan all them, till it finds a suitable populated match, but then doesn't continue looking further (!?)

Screenshot 2025-02-04 at 4 56 29 PM
DuplicateColumns.xlsx

Screenshot 2025-02-04 at 4 57 01 PM

@rgraber
Copy link
Author

rgraber commented Feb 4, 2025

Somewhat tangential question @jnm were you using kobo-compose or kobo-install? I did repro the issue on my machine using the form I attached (and was able to deploy successfully on the PR branch) and I wonder what the difference was.

@jnm
Copy link
Member

jnm commented Feb 4, 2025

Somewhat tangential question @jnm were you using kobo-compose or kobo-install?

To be honest, I was using my own weird thing this time, but I tried to control for that by testing the example file on production (kf.kobotoolbox.org).

@rgraber
Copy link
Author

rgraber commented Feb 4, 2025

Is the solution to the immediate problem here to convert all column headers to lowercase unless they contain "::" before processing? It doesn't answer the question of which label column we use but it wouldn't make things worse, presumably.

@jnm
Copy link
Member

jnm commented Feb 5, 2025

convert all column headers to lowercase unless they contain "::"

Unfortunately not; the logic is more complex. Some portions of column headers containing :: should be lowercased—essentially everything but language names—as I mentioned earlier:

Everything else, though, should get standardized, e.g. Media::Image::Español (es) should become media::image::Español (es). This will require some study of what's going on in _get_special_survey_cols().


Yes, I'm happy to punt on the duplicate-column question and agree that what we do here won't make that situation worse.

@rgraber
Copy link
Author

rgraber commented Feb 10, 2025

If I am reading this code correctly, we look for

"label" -> untranslated,
"hint" -> untranslated,
regex ^(media\s*::?\s*)?(audio|image|video)\s*::?\s*([^:]+)$ -> translation is the last capture group (eg "media:audio:Spanish", "audio:Spanish", "media :: audio :: Spanish (es)")
regex f'^(media\s*::?\s*)?({audio|image|video})$' -> untranslated (eg "media:audio", "video")
regex '^([^:]+)\s*::?\s*([^:]+)$' -> translation is last capture group (eg "foo::English", "label :: spanish (es)")

It will be a bit of a pain, but we should be able to use these same regexes to pre-process columns and put everything that isn't a language into lowercase. I'm not yet 100% sure where the best place to do that is.

ETA: part of me thinks the place to do this pre-processing is actually in kpi where we first read/load the spreadsheet but I don't love the idea of kpi having to know the internals of formpack. It's possible we can just put a 'standardize column name' method in formpack and have kpi call it though, which keeps everything nice and contained.

EATA: or we could put this pre-processing at the beginning of expand_content. As you may have noticed, I'm working this out while typing this comment : P

@rgraber
Copy link
Author

rgraber commented Feb 10, 2025

@jnm ignoring the current very ugly state of the implementation, does this test look about right:

    @data(
        ('FOO', 'foo'),
        ('LABEL', 'label'),
        ('HINT', 'hint'),
        ('BIND::FOO', 'bind::FOO'),
        ('BODY : FOO', 'body : FOO'),
        ('MEDIA:AUDIO:Spanish', 'media:audio:Spanish'),
        ('VIDEO :: SPANISH', 'video :: SPANISH'),
        ('MEDIA:AUDIO', 'media:audio'),
        ('IMAGE', 'image'),
        ('LABEL : SPANISH', 'label : SPANISH')
    )
    @unpack
    def test_clean_column_name(self, name, expected):
        assert clean_column_name(name) == expected

at the moment i'm calling it on every key in every row, which i don't love, but right now i just want it to work and will optimize later

@rgraber
Copy link
Author

rgraber commented Feb 12, 2025

@jnm @tiritea do either of you have an example handy of an xlsx or xls that has a 'translations' sheet?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants