Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test effect of locale environment variables on Oil #522

Closed
andychu opened this issue Nov 4, 2019 · 11 comments
Closed

Test effect of locale environment variables on Oil #522

andychu opened this issue Nov 4, 2019 · 11 comments

Comments

@andychu
Copy link
Contributor

andychu commented Nov 4, 2019

  • bash and zsh respect LANG=C, but dash doesn't. What does Oil do? The sh_spec.py framework hard codes LANG, but we should test with a variety of settings.
  • And LC_CTYPE also affects how libc fnmatch(), glob(), and regcomp() work

Also:

  • Consider a hybrid solution to the char class problem, discussed here:

https://oilshell.zulipchat.com/#narrow/stream/121539-oil-dev

@andychu
Copy link
Contributor Author

andychu commented Nov 4, 2019

(moved this comment) to #523

@andychu andychu changed the title Test Unicode char class behavior more Test effect of locale environment variables on Oil Nov 8, 2019
@andychu
Copy link
Contributor Author

andychu commented Nov 8, 2019

Right now all the operations that use libc respect LANG

  • fnmatch() -- for case, and var-op-strip
  • regcomp() -- for var-op-patsub
  • TODO: glob() isn't consistent. See unicode char test case in test/spec.sh glob
  • and I think wcswidth() relies on the locale

However the ones we code ourselves don't:

  • length ${#x}
  • slicing ${x:1:3}

Should we respect LANG in those cases? @asokoloski might have some thoughts.

Basically I think OSH should be consistent across all string operations.

One way to do that would be:

  • Only accept C and utf-8 settings like OpenBSD
  • implement length and slicing on bytes (not code points) when LANG=C
    • basically we're emulating libc a bit

Another way would be:

  • Only accept utf-8 settings. Have some other setting besides LANG for Oil, like set +o unicode-chars or something.

Actually the second option is a subset of the first, so it might be OK to start with that. And if people complain, we can allow LANG=C?

I'm not sure of all the details, but these are my thoughts. Feedback is appreciated! If I have misunderstood how libc works, let me know :)


Another problem: are there systems where you can't set the locale to utf-8? And utf-8 only semantics won't work? That would argue in favor of also allowing LANG=C

@andychu
Copy link
Contributor Author

andychu commented Nov 8, 2019

Related to #523 . But even if we write our own glob() and fnmatch(), we're going to depend on libc for understanding [[:alpha:]], etc. So we don't have to maintain our own unicode tables in Oil.

andychu pushed a commit that referenced this issue Nov 8, 2019
And also add a test to see what happens when LANG=C.

OSH fails because it shouldn't respect unicode with _?_.

Addresses issue #522.
@andychu
Copy link
Contributor Author

andychu commented Nov 8, 2019

@asokoloski

Hm I added a test case and am confused by the result. This new test case explicitly exports export LANG=C, but it differs from other shells.

I would have thought the setlocale("") call in the patch means that is respected. From man setlocale:

If locale is an empty string, "", each part of the locale that should be modified is set according to the environment variables. The details are implementation-dependent. For glibc, first (regardless of category), the envi‐

But it does the substitution of _[mu]_ when other shells don't. Hm.

 16     150     pass    BUG     pass    pass    Substitute one unicode character
 17     171     pass    pass    pass    FAIL    Can't substitute one unicode character when LANG=C

@andychu
Copy link
Contributor Author

andychu commented Nov 8, 2019

Yeah I undid the patch by commenting out setlocale in func_regex_first_group_match(), and now case 16 fails and 17 passes.

Not saying there was anything wrong with the patch -- I still don't have a 100% clear idea of how Oil should behave. As mentioned I want it to be consistent between the operations that use libc and the ones that don't.

I'm just confused because setlocale() doesn't appear to be working the way the man page describes. This is on Ubuntu 16.04 with GNU libc.

andychu pushed a commit that referenced this issue Nov 8, 2019
Move the setlocale() restoration after the regexec() call just in case.
It doesn't appear to change the behavior.

Related to issue #522.
@asokoloski
Copy link
Contributor

@andychu I see the same thing. But the problem, as far as I can tell, is not with setlocale, but with the way we're handling environment vars. Printing out LANG and LC_CTYPE (using getenv) from within func_regex_first_group_match gives en_US.UTF-8 for case 17, not C as I would have expected.

@asokoloski
Copy link
Contributor

@andychu ah, so osh/state.py has its own Mem thingy, which doesn't seem to interact with the actual process environment (in this particular case, at least).

The most straightforward way to make this work, that I can think of right now, is to explicitly pass mem, or just the LANG and LC_CTYPE values we want, down through all the calls down to func_regex_first_group_match and the other calls that need them. I believe the logic in setlocale is to check the particular category of interest like LC_CTYPE first, and then fallback to LANG if it's not set. We can replicate that.

But I'm sure you can tell me if there's a better way :)

@asokoloski
Copy link
Contributor

Other related stuff -- it seems like you can't set locales that use an encoding like UTF-16 or UTF-32, because linux requires character encodings to be null-safe. So the only character encodings I've seen are ones like UTF-8, C, or ISO-8859-.

I've been playing around a bit with the interesting cases where filenames are in a different encoding than the current one. Here's what happens in bash when you create two files containing the word "needle" in swedish ("nål"), one with utf-8 encoding, and the other with iso-8859-1 encoding:

$ LANG=en_US.utf8 echo *$'\345'* | xxd
00000000: 6ee5 6c2d 6973 6f2d 3838 3539 2d31 0a    n.l-iso-8859-1.
$ LANG=en_US.utf8 echo *$'\345'*
n�l-iso-8859-1
$ LANG=en_US.utf8 echo *�* # character copied and pasted from previous line
*�*

Globbing works fine if you escape the weird character, as that escape is probably interpreted after decoding the pattern, during the compilation step, but when you pass an invalid utf-8 string directly, it seems like bash gives up and doesn't treat it as a glob pattern anymore.

I'll try to look into this a bit more soon.

@andychu
Copy link
Contributor Author

andychu commented Nov 8, 2019

Ah I see! Thanks for debugging it. The issue is that in shell, export means:

"set these environment variables on subsequent process invocations"

It doesn't actually call call setenv(). In other words, it mean "THIS process will see different environment variables".

So how does bash work? Here is a little hack I found. It has a special hook.

I think we might need this in some other cases too. Let me file another bug about it.

From locale.c

/* Called when LANG is assigned a value.  Tracks value in `lang'.  Calls                                                                                                                                                                      
   reset_locale_vars() to reset any default values if LC_ALL is unset or                                                                                                                                                                      
   null. */                                                                                                                                                                                                                                   
int                                                                                                                                                                                                                                           
set_lang (var, value)                                                                                                                                                                                                                         
     char *var, *value;                                                                                                                                                                                                                       
{                                                                                                                                                                                                                                             
  FREE (lang);                                                                                                                                                                                                                                
  if (value)                                                                                                                                                                                                                                  
    lang = savestring (value);                                                                                                                                                                                                                
  else                                                                                                                                                                                                                                        
    {                                                                                                                                                                                                                                         
      lang = (char *)xmalloc (1);                                                                                                                                                                                                             
      lang[0] = '\0';                                                                                                                                                                                                                         
    }                                                                                                                                                                                                                                         
                                                                                                                                                                                                                                              
  return ((lc_all == 0 || *lc_all == 0) ? reset_locale_vars () : 0);                                                                                                                                                                          
}                                                                                                                                                                                                                                             
/* Called to reset all of the locale variables to their appropriate values                                                                                                                                                                    
   if (and only if) LC_ALL has not been assigned a value.  DO NOT CALL THIS                                                                                                                                                                   
   IF LC_ALL HAS BEEN ASSIGNED A VALUE. */                                                                                                                                                                                                    
static int                                                                                                                                                                                                                                    
reset_locale_vars ()                                                                                                                                                                                                                          
{                                                                                                                                                                                                                                             
  char *t;                                                                                                                                                                                                                                    
#if defined (HAVE_SETLOCALE)                                                                                                                                                                                                                  
  if (lang == 0 || *lang == '\0')                                                                                                                                                                                                             
    maybe_make_export_env ();   /* trust that this will change environment for setlocale */                                                                                                                                                   
  if (setlocale (LC_ALL, lang ? lang : "") == 0)                                                                                                                                                                                              
    return 0;                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                              
#  if defined (LC_CTYPE)                                                                                                                                                                                                                      
  t = setlocale (LC_CTYPE, get_locale_var ("LC_CTYPE"));   

@andychu
Copy link
Contributor Author

andychu commented Nov 8, 2019

Hm I filed #527 ... but on the other hand, compatibility aside, I wonder if it even makes sense to have a special hook for LANG?

Maybe this is a good opportunity to have something like shopt -s utf8 for the shell. Which could be on by default.

To disable it, you could use blocks:

shopt -u utf8 {
  echo ${#s}  # length in bytes
}
echo ${#s}  # length in code points

This is better than a special global variable with a hidden side effect IMO ...

Again I don't like the global LANG because it's inherently kind of ignorant of a program dealing with data in multiple encodings.

You can easily have a single file system where one dir has filenames in latin1 and another dir has utf-8. Just because grep is dealing with certain encodings, doesn't mean that the bash that calls grep is also dealing with those.

@andychu
Copy link
Contributor Author

andychu commented Nov 8, 2019

OK this testing is done, but the solution isn't. Let's continue discussion on Zulip or #527 and #528

Thanks a lot for looking carefully at this!

@andychu andychu closed this as completed Nov 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants