Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support LANG=C, LC_ALL=C, etc. (Problems with extended ASCII in strings) #529

Open
Crestwave opened this issue Nov 9, 2019 · 8 comments

Comments

@Crestwave
Copy link
Contributor

osh$ LC_ALL=C
osh$ c=$(printf \\377)
osh$ echo "${#c}"
  echo "${#c}"
           ^
[ interactive ]:3: warning: Invalid start of UTF-8 character
-1
osh$ case $c in '') echo a ;; "$c") echo b ;; esac
a
@andychu
Copy link
Contributor

andychu commented Nov 9, 2019

OK yes this is #363 and #528

Not sure exactly what we will do here yet. Although Oil implements length in code points because bash supports it, it's actually a fairly useless operation.

Usually you want length in bytes, e.g. for a length-prefixed serialization like netstrings. I can't think of any use case for length in code points.

@Crestwave
Copy link
Contributor Author

Crestwave commented Nov 9, 2019

As with the previous issue, my main concern is that OSH silently matches it successfully against any pattern. I don't seem to see that mentioned in the other issues.

@andychu
Copy link
Contributor

andychu commented Nov 9, 2019

That may be because a truncated utf-8 character is treated as empty by fnmatch() in libc. In any case we will make sure to use that snippet in the spec tests.


Also note that bash's behavior is totally incoherent when there's invalid utf-8:

https://github.com/oilshell/oil/blob/master/spec/var-op-len.test.sh#L58

When given a monotonically increasing number of bytes, it produces this sequence of length in chars, which goes up and down!

0
1
2
3
3
4
5
6
5
6
7
8
9
7

@Crestwave
Copy link
Contributor Author

That may be because a truncated utf-8 character is treated as empty by fnmatch() in libc.

Why would something empty be successfully matched against anything, though (case $c in asdf) echo a; esac)? This even happens if there are other characters in the string.

@Crestwave
Copy link
Contributor Author

Also note that bash's behavior is totally incoherent when there's invalid utf-8:

https://github.com/oilshell/oil/blob/master/spec/var-op-len.test.sh#L58

When given a monotonically increasing number of bytes, it produces this sequence of length in chars, which goes up and down!

0
1
2
3
3
4
5
6
5
6
7
8
9
7

I guess it's fine not to be compatible with the length, then. I'd probably rather have the error than this.

The matching actually works on Haiku, so it seems that it is a libc issue. At least you seem to already be considering rewriting the function in #523, so supporting this might not take much extra work?

@andychu andychu changed the title Problems with extended ASCII in strings Support LANG=C, LC_ALL=C, etc. (Problems with extended ASCII in strings) Nov 12, 2019
@andychu
Copy link
Contributor

andychu commented Nov 12, 2019

@Crestwave I just copied your test case in to the spec tests, and the underlying cause is that Oil assumes everything is UTF-8 now.

It doesn't support LANG=C or LC_ALL=C, etc.

But after a conversation with @asokoloski on Zulip, I intend for Oil to support those features.

This works depends on #527 , which isn't too bad.

So when we add libc env variable support, then this bug along with a few others should be fixed. Thanks for the report! It took awhile to figure out what we want, but the concrete test cases help.

(BTW the underlying issue LC_ALL=C is handled specially in bash. Just setting that variable doesn't make libc see it. A shell has to do a setlocale() call whenver the variable is changed! Basically Oil isn't calling libc the right way right now. )

@andychu
Copy link
Contributor

andychu commented Jun 10, 2020

Note: the j JSON parser uses LANG=C, and I believe that's why it doesn't run correctly

@andychu
Copy link
Contributor

andychu commented Mar 23, 2021

Hit this with the regex [\xff] -- this doesn't match the right byte in libc's UTF-8 mode, but it would work if LANG=C.

But Oil doesn't respect that yet


I would also like some alias like this:

shopt --unset libc_utf8 {
  if (s ~ pat) {
    echo "matched in C / bytes mode"
  }
}

Test files affected:

  • spec/var-op-patsub
  • spec/oil-regex (for the \xff issue)

andychu pushed a commit that referenced this issue Mar 23, 2021
CPython DOES call it, so there's a workaround in the dev build only:
libc.cpython_reset_locale().

- Remove explicit setlocale() calls from native/libc.c
- Remove setlocale() calls when OVM_MAIN is defined
- Add the cpython_reset_locale() hack for the remaining case
- Refactor OVM_MAIN check into into pyutil.IsAppBundle()

Spec tests:

- spec/glob: Test started passing!  Woohoo we're more consistent.
- spec/oil-regex: Documented that we need LANG=C support.  This is issue
  #529.

Addresses issue #912.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants