A new approach to string-$

The "string-`$`" feature can be a bit of a pain to implement, and we've had buggy implementations for every target language except C (or including C if you count `-Wshadow` warnings in generated code as a bug).

This feature is not used in any of the currently shipped stemmers, which is part of the reason the bugs took years to detect. We do now have runtime tests which exercise string-`$`, which should help.

Martin Porter's Snowball implementation of the Schinke Latin stemmer uses string-`$`, and it would be hard to rewrite it not to, so the feature is potentially useful for future algorithms. We don't currently ship this Latin stemmer though, because it produces two stems (noun and verb) for each input, which doesn't fit the pattern of the other stemmers.

Ideally an operation like `$s 'x'` (test if `s` has prefix _x_) should be O(1) rather than O(max(`size`, `sizeof s`)). That's fairly easy to achieve in target languages where it's possible to swap around the values of string variables (e.g. C, because strings are really just pointers, or Java, where variables are references to objects), since we can do something like this:

```
{
  string tmp = current;
  current = s;
  // code for the subcommand "inside" the string-$
  // Needed if update operations on current might create a new string
  // (e.g. immutable string, COW, or reallocation as the string grows):
  s = current;
  current = tmp;
}
```

(To keep the string handling clear, I've omitted the code for handling and propagating failure of the subcommand, and for saving/resetting/restoring the cursor, limit, etc. In practice, `current` is sometimes saved and restored together with the cursor etc. by copying the base class or context object.)
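As a rough illustration of the pointer-swapping variant in C, here is a minimal sketch for a read-only test such as `$s 'x'`. The `struct ctx` layout and the names `eq_lit` and `dollar_has_prefix` are hypothetical, invented for this example rather than taken from the Snowball runtime. Because the subcommand only reads the string, the `s = current` write-back step is not needed here.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical context object; the real runtime's struct is richer. */
struct ctx {
    const char *current;  /* the string being worked on */
    size_t len;           /* its length */
    size_t cursor;        /* position within current */
};

/* Stand-in subcommand: is the literal `lit` present at the cursor? */
static int eq_lit(const struct ctx *z, const char *lit) {
    size_t n = strlen(lit);
    return z->len - z->cursor >= n &&
           memcmp(z->current + z->cursor, lit, n) == 0;
}

/* Sketch of `$s 'x'`: repoint current at s, run the test, restore. */
static int dollar_has_prefix(struct ctx *z, const char *s, size_t s_len,
                             const char *lit) {
    struct ctx saved = *z;  /* copies a pointer and two indices, no characters */
    assert(s != NULL && lit != NULL);
    z->current = s;         /* "swap in" the string variable */
    z->len = s_len;
    z->cursor = 0;          /* the subcommand starts at the beginning of s */
    int ok = eq_lit(z, lit);
    *z = saved;             /* restore current and cursor together */
    return ok;
}
```

The save and restore cost is independent of the length of either string, which is the property the swap approach is after; only the literal comparison itself touches characters.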

If swapping around pointers or references is not possible, we can instead copy the strings. The generated code may well look very similar to the above, but the underlying operations being invoked are different: each assignment copies string contents rather than a pointer. This copying may be avoidable with a layer of indirection for some languages, but that probably adds overhead to every operation involving the current string.
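To show how the copying variant keeps the same shape while costing more, here is a minimal sketch using value-type buffers in C. The fixed-capacity `struct buf`, the `CAP` constant, and the `append_char` stand-in subcommand are all assumptions made for the example; a real target language would use its own string type.

```c
#include <assert.h>
#include <string.h>

enum { CAP = 64 };  /* hypothetical fixed capacity, for this sketch only */

struct buf {
    char data[CAP];
    size_t len;
};

/* Stand-in subcommand: append one character to the current string. */
static void append_char(struct buf *current, char ch) {
    assert(current->len < CAP);
    current->data[current->len++] = ch;
}

/* Same four steps as the swap version, but each struct assignment
 * copies the character data rather than a pointer. */
static void dollar_append(struct buf *current, struct buf *s, char ch) {
    struct buf tmp = *current;  /* copy current aside */
    *current = *s;              /* copy s into current */
    append_char(current, ch);   /* subcommand operates on "current" */
    *s = *current;              /* copy the updated string back into s */
    *current = tmp;             /* restore the original current */
}
```

Four whole-string copies per string-`$` is the overhead that the swap approach, or the compiler-side retargeting discussed in this issue, would avoid.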

We could potentially handle this in the compiler by tracking whether operations are being applied to the current string or a variable, and then just generating code based on that (e.g. by mapping node types in the parse tree: `c_len` to `c_lenof`, `c_size` to `c_sizeof`, `c_insert` to (new node type) `c_insert_on_string`, etc).
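A very rough sketch of what such a mapping pass might look like follows. The node type names come from the paragraph above, but the `enum`, the `struct node` layout, and the function names are hypothetical stand-ins rather than the compiler's real definitions. The sketch also glosses over the fact that the rewritten nodes would need to record which string variable they apply to, and that a nested string-`$` would need to stop the recursion.

```c
#include <assert.h>
#include <stddef.h>

/* c_insert_on_string is the proposed new node type. */
enum node_type {
    c_len, c_lenof, c_size, c_sizeof, c_insert, c_insert_on_string, c_other
};

struct node {
    enum node_type type;
    struct node *left;   /* first child (hypothetical layout) */
    struct node *right;  /* next sibling (hypothetical layout) */
};

/* Map a "current string" operation to its "string variable" variant. */
static enum node_type on_string_variant(enum node_type t) {
    switch (t) {
        case c_len:    return c_lenof;
        case c_size:   return c_sizeof;
        case c_insert: return c_insert_on_string;
        default:       return t;  /* not string-specific: leave unchanged */
    }
}

/* Rewrite every node in the subtree "inside" a string-$. */
static void retarget(struct node *p) {
    for (; p != NULL; p = p->right) {
        assert(p->type <= c_other);  /* sanity check: a known node type */
        p->type = on_string_variant(p->type);
        retarget(p->left);
    }
}
```

With a pass like this, the code generator for each target language would only need to handle the `..._on_string` node types, rather than implementing the swap or copy itself.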

We would either need to track a separate `cursor`, `limit` and `limit_backwards` for each string variable that is used in string-`$`, or have `c_dollar` still save/reset/restore these.
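The first option could be pictured roughly as below. The `struct tracked_string` type and `enter_dollar` are hypothetical names, and the sketch assumes that entering string-`$` starts with the cursor at the beginning of the string and the limit at its end.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: each string variable used in string-$ carries its own
 * cursor state instead of borrowing the current string's. */
struct tracked_string {
    const char *p;
    size_t cursor;
    size_t limit;
    size_t limit_backwards;
};

/* Entering `$s ...` resets only s's own state. */
static void enter_dollar(struct tracked_string *s, size_t len) {
    assert(s != NULL);
    s->cursor = 0;
    s->limit = len;
    s->limit_backwards = 0;
}
```

The current string's cursor and limits are never touched, so nothing has to be saved and restored around the subcommand; the cost is three extra fields per such variable.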

One wrinkle here is that string-`$` is not simply lexically scoped (if you're familiar with Perl, it's more like `local` than `my`). If a routine is called inside the string-`$` it should also work on the string variable, but the same routine could be called outside string-`$` and should then work on the current string. So we'd need to be prepared to generate multiple versions of such routines (which could just mean duplicating them in the parse tree). That seems feasible but perhaps not worthwhile until string-`$` is more widely used (Latin's userbase isn't what it used to be!), though it would stop this being one of the fiddlier parts of adding a new target language.
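To make the duplication concrete, here is a minimal sketch of what the generated code might look like for a routine `R` whose body is just `'x'`, called both directly and from inside `$s R`. All the names (`struct env`, `R_on_current`, `R_on_s`, and so on) are invented for illustration, and this is only one way the duplicated routines could be emitted.

```c
#include <assert.h>
#include <stddef.h>

struct str {
    const char *p;
    size_t len;
};

/* Hypothetical environment: the current string plus one string variable s. */
struct env {
    struct str current;
    struct str s;
};

/* Shared body of R: does the string start with 'x'? */
static int starts_with_x(const struct str *t) {
    return t->len >= 1 && t->p[0] == 'x';
}

/* Version emitted for calls made outside any string-$. */
static int R_on_current(const struct env *z) {
    assert(z != NULL);
    return starts_with_x(&z->current);
}

/* Duplicate emitted for calls reached from inside `$s ...`. */
static int R_on_s(const struct env *z) {
    assert(z != NULL);
    return starts_with_x(&z->s);
}

/* `R $s R`: each call site is bound to the appropriate version
 * at compile time, so no swapping or copying happens at runtime. */
static int R_then_dollar_s_R(const struct env *z) {
    return R_on_current(z) && R_on_s(z);
}
```

The trade-off is code size: every routine reachable from inside a string-`$` needs one extra copy per string variable it can be applied to.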

This probably isn't worth implementing until we actually have a stemmer we want to ship which uses string-$.





A new approach to string-$ #256
