Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

handle ZWJ and emoji sequences, don't break identifiers within graphemes #372

Merged
merged 6 commits into from
Nov 1, 2023

Conversation

stevengj
Copy link
Member

@stevengj stevengj commented Oct 24, 2023

Closes JuliaLang/julia#40071 — never break an identifier within a grapheme boundary. (Identifiers and graphemes may still only start within the pre-existing allowed set of characters.)

This allows us to accept identifiers like 🏳️‍🌈 == 🏳️‍[ZWJ]🌈 which contain the U+200D "Zero-width joiner" (ZWJ) character, which was not previously allowed since it's in category Cf (Other, format). However, to prevent incomplete graphemes ending with an invisible ZWJ character, we disallow ZWJ at the end of an identifier; similarly for U+200C "Zero-width non-joiner" (ZWNJ).

@stevengj
Copy link
Member Author

stevengj commented Oct 24, 2023

I'm not sure if we also need to update the Scheme parser as well? A similar patch for the Scheme parser would be

diff --git a/src/flisp/julia_extensions.c b/src/flisp/julia_extensions.c
index f29e397275..fa04339b5d 100644
--- a/src/flisp/julia_extensions.c
+++ b/src/flisp/julia_extensions.c
@@ -178,6 +178,7 @@ JL_DLLEXPORT int jl_op_suffix_char(uint32_t wc)
 static int never_id_char(uint32_t wc)
 {
      utf8proc_category_t cat = utf8proc_category((utf8proc_int32_t) wc);
+     if (wc == 0x200c || wc == 0x200d) return 0; // ZWNJ/ZWJ is allowed in emoji, despite being in category Cf
      return (
           // spaces and control characters:
           (cat >= UTF8PROC_CATEGORY_ZS && cat <= UTF8PROC_CATEGORY_CS) ||
@@ -329,6 +330,8 @@ value_t fl_accum_julia_symbol(fl_context_t *fl_ctx, value_t *args, uint32_t narg
     if (!iscprim(args[0]) || ((cprim_t*)ptr(args[0]))->type != fl_ctx->wchartype)
         type_error(fl_ctx, "accum-julia-symbol", "wchar", args[0]);
     uint32_t wc = *(uint32_t*)cp_data((cprim_t*)ptr(args[0])); // peek the first character we'll read
+    uint32_t prev_wc = wc;
+    utf8proc_int32_t grapheme_state = 0;
     ios_t str;
     int allascii = 1;
     ios_mem(&str, 0);
@@ -345,9 +348,10 @@ value_t fl_accum_julia_symbol(fl_context_t *fl_ctx, value_t *args, uint32_t narg
         }
         allascii &= (wc <= 0x7f);
         ios_pututf8(&str, wc);
+        prev_wc = wc;
         if (safe_peekutf8(fl_ctx, s, &wc) == IOS_EOF)
             break;
-    } while (jl_id_char(wc));
+    } while (jl_id_char(wc) || !utf8proc_grapheme_break_stateful(prev_wc, wc, &grapheme_state));
     ios_pututf8(&str, 0);
     return symbol(fl_ctx, allascii ? str.buf : normalize(fl_ctx, str.buf));
 }

although this doesn't implement the rule of forbidding ZWJ at the end of an identifier.

@c42f
Copy link
Member

c42f commented Oct 26, 2023

Cool this looks sensible. I worry about the extra cost of calling into the C code per character for such a relatively hot (presumably) code path. What does the script in test/benchmark.jl say about performance impact?

Personally I'm not worried about keeping the scheme parser perfectly up to date. Primarily it only needs to be able to parse Base until we figure out the bootstrapping story and can eventually delete it.

(Side note... we really should bring Base.is_id_char() into JuliaSyntax at some point --- this library does aim to be installable as a normal package on both old and new versions of Julia and behave the same everywhere...)

@codecov
Copy link

codecov bot commented Oct 26, 2023

Codecov Report

Merging #372 (75a1fc2) into main (1b048aa) will decrease coverage by 0.01%.
Report is 2 commits behind head on main.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #372      +/-   ##
==========================================
- Coverage   96.58%   96.57%   -0.01%     
==========================================
  Files          14       14              
  Lines        4160     4179      +19     
==========================================
+ Hits         4018     4036      +18     
- Misses        142      143       +1     
Files Coverage Δ
src/tokenize.jl 99.08% <100.00%> (+0.01%) ⬆️

... and 2 files with indirect coverage changes

@stevengj
Copy link
Member Author

stevengj commented Oct 26, 2023

Here are the results of test/benchmark.jl:

main branch, minimum times:

  • ParseStream: 205.944 ms
  • GreenNode: 214.803 ms
  • SyntaxNode: 403.870 ms
  • Expr: 472.283 ms
  • flisp: 3.644 s

sgj/ZWJ branch:

  • ParseStream: 179.254 ms
  • GreenNode: 241.707 ms
  • SyntaxNode: 427.242 ms
  • Expr: 496.298 ms
  • flisp: 3.652 s

It looks like a 15% slowdown on ParseStream, a 10% slowdown on GreenNode, and a 5% slowdown on SyntaxNode?

@stevengj
Copy link
Member Author

stevengj commented Oct 26, 2023

As a more realistic data point, here is the cost to parse all of base (Julia 1.9.2) with the following code:

using JuliaSyntax, BenchmarkTools
include(dirname(pathof(JuliaSyntax)) * "/../test/test_utils.jl")

function bench_parse_all_in_path(basedir)
    for filepath in find_source_in_path(basedir)
        text = read(filepath, String)
        stream = ParseStream(text)
        parse!(stream)
        ex = build_tree(Expr, stream, filename=filepath)
    end
end

p = joinpath(Sys.BINDIR, Base.DATAROOTDIR, "julia", "base")
p = isdir(p) ? p : joinpath(Sys.BINDIR, Base.DATAROOTDIR, "julia", "src", "base") 

@btime bench_parse_all_in_path($p)

which gives:

  • main branch: 543.924 ms (2117659 allocations: 139.46 MiB)
  • sgj/ZWJ branch: 558.836 ms (2719985 allocations: 148.77 MiB)
    i.e. a slowdown of < 3%.

My conclusion is that the slowdown in realistic workloads will be negligible, especially since parsing is usually only a very small portion of load time.

@c42f
Copy link
Member

c42f commented Oct 27, 2023

I don't love these slowdowns - we could live with them of course, but it's very significant for such a small change and it does show this is a relatively hot code path (no surprise I guess).

Luckily there's probably a super simple optimization: if you throw in an isascii() before you hit the slow path of isgraphemebreak!() it's probably almost free for most code?

@c42f
Copy link
Member

c42f commented Oct 27, 2023

Another thing - while 3% is somewhat insignificant in building Base, and the slowdown for GreenNode and SyntaxNode may seem irrelevant it's not exactly: it's quite relevant to surface-level tooling which works across a whole registry - for example package analyzer.

@Keno
Copy link
Member

Keno commented Oct 27, 2023

Just hardcore the answer for ascii characters? That feels like is should solve 99% of the perf issues.

@stevengj
Copy link
Member Author

stevengj commented Oct 27, 2023

Following the suggestion by @c42f and @Keno, I added an optimization for the common case of ASCII identifiers. The minimum times are now (on a different computer than above):

sgj/ZWJ branch:

  • ParseStream: 114.662 ms ms
  • GreenNode: 134.884 ms
  • SyntaxNode: 247.248 ms
  • Expr: 300.598 ms
  • flisp: 2.306 s

main branch:

  • ParseStream: 115.894 ms
  • GreenNode: 136.842 ms
  • SyntaxNode: 252.208 ms
  • Expr: 304.795 ms
  • flisp: 2.342 s

Voilá, no performance penalty! (It's actually very slighly faster, since we no longer call is_identifier_char at all for ASCII chars, avoiding a lookup of the Unicode category.)

Copy link
Member

@c42f c42f left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks for sorting out the performance hit (and in fact, making the performance better!) I feel like this could benefit from another couple of test cases but other than that looks great to me.

test/tokenize.jl Outdated Show resolved Hide resolved
src/tokenize.jl Show resolved Hide resolved
Copy link
Member

@c42f c42f left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes we do test other invisible chars elsewhere, thanks for the reminder 😅

Merge when ready :)

@stevengj stevengj merged commit 05c594b into main Nov 1, 2023
32 checks passed
@stevengj stevengj deleted the sgj/ZWJ branch November 1, 2023 12:26
@stevengj
Copy link
Member Author

stevengj commented Nov 1, 2023

I suppose a julia/NEWS.md item will be added when the JuliaSyntax version is bumped in the julia repo? Is there a place to make a note of that in the meantime?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

What to do about ZWJ emoji sequences
3 participants