String patterns #140

@AnthonyDGreen

This builds on proposal #124 "Pattern Matching" with a pattern specifically designed for decomposing (parsing) strings. The syntax is designed to mirror, as closely as practical, the interpolated string syntax introduced in VB14 for creating strings.

Select Case input
    Case Match $"{scheme}://{rest}" When scheme = "ms-word"
        
    Case Match $"{user}@{domain}" When domain <> "outlook.com"
        Throw New NotSupportedException("All must use outlook.com!")
    Case Match $"{*}.docx"
                
    Case Match $"{CInt(id)}|{CDate(date)}|{description}|{CDbl(total)}"
        ' CSVs are never this easy to parse. Sooner or later 75 rows in somebody gets clever.
End Select

After 20 years of InStr, Mid, Left, and Right this really excites me. Even when I'm using IndexOf and Substring instead, this kind of parsing is still a frequently recurring task in my life.
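For comparison, here is a minimal sketch of what the first case above has to look like today with IndexOf and Substring. The helper name TryMatchSchemeAndRest is hypothetical and only illustrates the hand-written equivalent of Match $"{scheme}://{rest}"; it is not part of the proposal.

```vb
Imports System

Module ManualParsing
    ' Hypothetical helper: the hand-written equivalent of
    ' Case Match $"{scheme}://{rest}".
    Function TryMatchSchemeAndRest(input As String,
                                   ByRef scheme As String,
                                   ByRef rest As String) As Boolean
        Const separator As String = "://"

        ' Find the first occurrence of the literal text between the two holes.
        Dim at As Integer = input.IndexOf(separator, StringComparison.Ordinal)
        If at < 0 Then Return False

        scheme = input.Substring(0, at)
        rest = input.Substring(at + separator.Length)
        Return True
    End Function

    Sub Main()
        Dim scheme As String = Nothing
        Dim rest As String = Nothing

        ' The second condition plays the role of the When clause.
        If TryMatchSchemeAndRest("ms-word://docs/report.docx", scheme, rest) AndAlso
           scheme = "ms-word" Then
            Console.WriteLine(rest)
        End If
    End Sub
End Module
```

Every additional hole adds another IndexOf, another bounds check, and another Substring, which is the bookkeeping the pattern syntax is meant to hide.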

This would compose with other patterns too: the "interpolations" bounded by { and } could contain other patterns. Right now I've only been able to figure out how to make it work with lazy matching and without backtracking. That approach falls apart if two interpolations appear side by side (with no text between them), since the first will eat up all the text.
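A minimal sketch of lazy matching without backtracking, under stated assumptions: the pattern is passed as the array of literal texts around the holes, and TryMatchLazy is a hypothetical name. This is one way to model the behavior described above, not the proposed implementation.

```vb
Imports System
Imports System.Collections.Generic

Module LazyMatcher
    ' Assumption: literals(0) precedes the first hole and literals(i)
    ' follows hole i - 1. For $"{scheme}://{rest}" that is {"", "://", ""}.
    Function TryMatchLazy(input As String,
                          literals As String(),
                          ByRef captures As List(Of String)) As Boolean
        captures = New List(Of String)()

        If Not input.StartsWith(literals(0), StringComparison.Ordinal) Then Return False
        Dim pos As Integer = literals(0).Length

        For i As Integer = 1 To literals.Length - 1
            Dim literal As String = literals(i)
            Dim isLast As Boolean = (i = literals.Length - 1)
            Dim found As Integer

            If isLast AndAlso literal.Length = 0 Then
                ' A trailing hole takes whatever is left.
                found = input.Length
            Else
                ' Lazy: stop at the first occurrence and never revisit it.
                found = input.IndexOf(literal, pos, StringComparison.Ordinal)
                If found < 0 Then Return False
            End If

            captures.Add(input.Substring(pos, found - pos))
            pos = found + literal.Length
        Next

        ' The whole input must be consumed.
        Return pos = input.Length
    End Function

    Sub Main()
        Dim parts As List(Of String) = Nothing

        If TryMatchLazy("ms-word://docs/report.docx", New String() {"", "://", ""}, parts) Then
            Console.WriteLine(parts(0) & " | " & parts(1))
        End If

        ' Lazy matching always splits at the first separator.
        If TryMatchLazy("archive.tar.gz", New String() {"", ".", ""}, parts) Then
            Console.WriteLine(parts(0) & " | " & parts(1))
        End If
    End Sub
End Module
```

With two holes side by side the separator is empty, so there is nothing to search for and one hole ends up with all of the text. Which one depends on how the empty separator is treated, but no meaningful split is possible either way.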

Should the "alignment" part of an interpolation be usable to require/match a substring of fixed or minimal length? Match $"{y,4}-{m}-{d}"
Maybe. It could help with the problem mentioned above.
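A minimal sketch of how a width could resolve side-by-side holes, assuming the alignment value means an exact length (the question above also allows for a minimum length, which this does not cover). The pattern $"{y,4}{m,2}{d,2}" and the name TryMatchFixedWidths are hypothetical.

```vb
Imports System
Imports System.Collections.Generic

Module FixedWidthMatcher
    ' Hypothetical model of $"{y,4}{m,2}{d,2}": each hole consumes
    ' exactly the number of characters given by its width.
    Function TryMatchFixedWidths(input As String,
                                 widths As Integer(),
                                 ByRef captures As List(Of String)) As Boolean
        captures = New List(Of String)()
        Dim pos As Integer = 0

        For Each width As Integer In widths
            If pos + width > input.Length Then Return False
            captures.Add(input.Substring(pos, width))
            pos += width
        Next

        ' Reject trailing text that no hole accounts for.
        Return pos = input.Length
    End Function

    Sub Main()
        Dim parts As List(Of String) = Nothing
        If TryMatchFixedWidths("20240131", New Integer() {4, 2, 2}, parts) Then
            Console.WriteLine(parts(0) & "-" & parts(1) & "-" & parts(2))
        End If
    End Sub
End Module
```

Because each hole knows where it ends, no separator text is needed between them.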

Is there anything at all that would make sense with the "format" part of an interpolation? It seems hard since there's really no way to get that part to mean the same thing coming out as it does going in.

We need to find the sweet spot for productivity vs. power (as always). These be dangerous waters.

Can we do something to modernize the VB Like operator instead?
Not sure.

Why not full-blown regex literals like in Perl?
This has always been a great personal temptation for me. At the moment I don't think this is the right way to go. RegEx is about making extremely complicated things more terse (and cryptic). That's counter to our goals with VB of making things straightforward and approachable. All but the very simplest of regexes (regexen?) quickly become arcane magic. One test of this proposal will be how often people still need to fall back to regex once they have it.

Does this pattern need built-in alternation?

It does seem natural to support alternation within this pattern as a way of describing optionality:

Select Case url
    Case Match $"{scheme}://{domain}:{port}/{path}?{query}",
               $"{scheme}://{domain}:{port}/{path}",
               $"{scheme}://{domain}/{path}",
               $"{domain}/{path}?{query}",
               $"{domain}/{path}",
               $"{domain}"
               
    Case Match $"{drive}:\{absolute}", ' Full path.
               $"\{absolute}",         ' Absolute path on current drive.
               $"{drive}:{relative}",  ' Relative path on current drive.
               $"{relative}"           ' Wait, what, this is a thing?
                              
End Select

Any term that isn't matched in all cases may be null (maybe we should require a ? after the name then). The second Case is complete (I think), but the first does not exhaustively handle all permutations. With four independently optional parts (scheme, port, path, and query), I think the number of cases you'd need to write to represent all possibilities is 2^4 = 16 (could be wrong). I think the correct code is:

Select Case url
    Case Match $"{scheme}://{domain}:{port}/{path}?{query}",
               $"{scheme}://{domain}:{port}/{path}",
               $"{scheme}://{domain}:{port}?{query}",
               $"{scheme}://{domain}:{port}",
               $"{scheme}://{domain}/{path}?{query}",
               $"{scheme}://{domain}/{path}",
               $"{scheme}://{domain}?{query}",
               $"{scheme}://{domain}",
               $"{domain}:{port}/{path}?{query}",
               $"{domain}:{port}/{path}",
               $"{domain}:{port}?{query}",
               $"{domain}:{port}",
               $"{domain}/{path}?{query}",
               $"{domain}/{path}",
               $"{domain}?{query}",
               $"{domain}"
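The count of 16 can be checked mechanically. This is a minimal sketch that treats scheme, port, path, and query as four on/off bits and builds one pattern string per combination; BuildVariants is a hypothetical name, and the strings are only text here since the Match syntax does not exist yet.

```vb
Imports System
Imports System.Collections.Generic

Module UrlPatternVariants
    ' Bit 8 = scheme, bit 4 = port, bit 2 = path, bit 1 = query.
    ' Counting down from 15 yields the same order as the list above.
    Function BuildVariants() As List(Of String)
        Dim variants As New List(Of String)()

        For mask As Integer = 15 To 0 Step -1
            Dim pattern As String = ""
            If (mask And 8) <> 0 Then pattern &= "{scheme}://"
            pattern &= "{domain}"
            If (mask And 4) <> 0 Then pattern &= ":{port}"
            If (mask And 2) <> 0 Then pattern &= "/{path}"
            If (mask And 1) <> 0 Then pattern &= "?{query}"
            variants.Add(pattern)
        Next

        Return variants
    End Function

    Sub Main()
        Dim variants As List(Of String) = BuildVariants()
        Console.WriteLine(variants.Count)
        For Each variant As String In variants
            Console.WriteLine(variant)
        Next
    End Sub
End Module
```

Each further optional part doubles the list, which is the practical argument for some built-in way to mark a part as optional.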

Is that really so much more readable than "((?<scheme>.+)://)?(?<domain>.+)(:(?<port>\d+))?(/(?<path>.*))?(\?(?<query>.+))?"?
Well, yes, and infinitely easier to reason about (my brain froze several times writing it), but that's not the point.

Is there some way to support greedier matching or backtracking?
So far, pattern functions as I envision them take the form <Function([ByRef p1 [, ByRef p2, ...]]) As Boolean>. Maybe there's some other form we could consider for strings (or maybe all enumerables?) that could let the match function support this without getting too darn complicated.
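One hypothetical shape for such a form, offered only as a sketch: instead of returning a single Boolean, the match function yields every candidate split in lazy order, and the caller moves on to the next candidate when a later condition rejects one. SplitCandidates and the tuple return type are assumptions for illustration, not something the proposal defines.

```vb
Imports System
Imports System.Collections.Generic

Module BacktrackingSketch
    ' Yields each way to split input around separator, shortest left side first.
    Iterator Function SplitCandidates(input As String,
                                      separator As String) As IEnumerable(Of Tuple(Of String, String))
        If separator.Length = 0 Then Throw New ArgumentException("separator must not be empty")

        Dim at As Integer = input.IndexOf(separator, StringComparison.Ordinal)
        While at >= 0
            Yield Tuple.Create(input.Substring(0, at),
                               input.Substring(at + separator.Length))
            at = input.IndexOf(separator, at + 1, StringComparison.Ordinal)
        End While
    End Function

    Sub Main()
        ' Models $"{name}.{ext}" with the extra condition that ext contains no dot.
        For Each candidate As Tuple(Of String, String) In SplitCandidates("archive.tar.gz", ".")
            If Not candidate.Item2.Contains(".") Then
                Console.WriteLine(candidate.Item1 & " | " & candidate.Item2)
                Exit For
            End If
        Next
    End Sub
End Module
```

The first candidate ("archive" and "tar.gz") is the lazy answer. Rejecting it and asking for the next one is all backtracking amounts to here, at the cost of a more involved contract than a plain Boolean function.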

This could be pretty hard to prototype and needs a lot of spec work.
