@mauroporras mauroporras commented Oct 24, 2025

Summary

This PR adds a new selection-word-chars configuration option that allows users to customize which characters mark word boundaries during text selection operations (double-click, word selection, etc.).

Motivation

This has been on my wishlist for a while. Inspired by #9069, which added semicolon as a hardcoded word boundary, this PR takes the concept further by making word boundaries fully configurable. Different workflows benefit from different boundary characters: SQL developers might want semicolons as boundaries, while others working with file paths or URLs might prefer different settings.

This approach is similar to zsh's WORDCHARS environment variable, giving users fine-grained control over text selection behavior.

Changes

  • New config option: selection-word-chars with default value ` \t'"│`|:;,()[]{}<>$`
  • Runtime UTF-8 parsing: Boundary characters are parsed from a UTF-8 string into u32 codepoints
  • Updated function signatures: selectWord() and selectWordBetween() now accept boundary characters as parameters
  • All call sites updated: Surface.zig, embedded.zig, and all test cases updated

Usage

Users can now customize word boundaries in their config:

# Remove semicolon from boundaries (treat as part of words)
selection-word-chars = " \t'\"│`|:,()[]{}<>$"

# Spell out the default set explicitly (it contains no period, so periods already stay inside words)
selection-word-chars = " \t'\"│`|:;,()[]{}<>$"

Implementation Details

  • Boundary characters are stored in DerivedConfig and passed through to selection functions
  • UTF-8 parsing happens at runtime with graceful fallback for invalid input
  • Null character (U+0000) is always included as a boundary automatically
  • Multi-byte UTF-8 characters are fully supported
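
As a rough illustration of the points above (a sketch only, not the code in this PR): the names `parseBoundaries`, `isBoundary`, and `fallback_boundaries` are hypothetical, the fallback set is shortened for brevity, and the sketch uses `u21` because that is the type `std.unicode` yields, whereas this description mentions `u32`. It assumes a caller-supplied buffer rather than an allocator.

```zig
const std = @import("std");

// Hypothetical fallback set; the real default contains more characters.
const fallback_boundaries = [_]u21{ 0, ' ', '\t', ';' };

/// Decode `input` into `buf`, always placing U+0000 first.
/// Returns the fallback set when `input` is not valid UTF-8.
/// Codepoints that do not fit in `buf` are dropped.
fn parseBoundaries(input: []const u8, buf: []u21) []const u21 {
    std.debug.assert(buf.len > 0);
    const view = std.unicode.Utf8View.init(input) catch return &fallback_boundaries;
    var it = view.iterator();
    buf[0] = 0;
    var len: usize = 1;
    while (it.nextCodepoint()) |cp| {
        if (len == buf.len) break;
        buf[len] = cp;
        len += 1;
    }
    return buf[0..len];
}

/// Linear scan; boundary sets are expected to be small.
fn isBoundary(boundaries: []const u21, cp: u21) bool {
    for (boundaries) |b| {
        if (b == cp) return true;
    }
    return false;
}

test "multi-byte boundary characters" {
    var buf: [16]u21 = undefined;
    const set = parseBoundaries(" │;", &buf);
    try std.testing.expect(isBoundary(set, '│'));
    try std.testing.expect(isBoundary(set, 0));
    try std.testing.expect(!isBoundary(set, '-'));
}
```

With a set like this, a hyphen is not a boundary, so a double-click on "some-hyphenated-words" would be expected to select the whole token.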

AI Assistance Disclosure

With gratitude for the team and respect for the Contributing Guidelines, I want to disclose that this PR was written with AI assistance (Claude Code). I have reviewed all the code, and to the extent of my understanding, I'm prepared to answer any questions about the changes.

Related

Add new `selection-word-chars` config option to customize which characters
mark word boundaries during text selection operations (double-click, word
selection, etc.). Similar to zsh's WORDCHARS environment variable, but
specifies boundary characters rather than word characters.

Default boundaries: ` \t'"│`|:;,()[]{}<>$`

Users can now customize word selection behavior, such as treating
semicolons as part of words or excluding periods from boundaries:

    selection-word-chars = " \t'\"│`|:,()[]{}<>$"

Changes:
- Add selection-word-chars config field with comprehensive documentation
- Modify selectWord() and selectWordBetween() to accept boundary_chars parameter
- Parse UTF-8 boundary string to u32 codepoints at runtime
- Update all call sites in Surface.zig and embedded.zig
- Update all test cases to pass boundary characters
@mauroporras mauroporras requested review from a team as code owners October 24, 2025 20:38
@mauroporras mauroporras marked this pull request as draft October 24, 2025 20:39
@mauroporras mauroporras marked this pull request as ready for review October 24, 2025 20:47
Contributor

@mitchellh mitchellh left a comment

I'm conceptually fine with this but I would use a slightly different approach, as noted in the comment.

/// selection-word-chars = " \t'\"│`|:,()[]{}<>$"
///
/// Available since: 1.2.0
@"selection-word-chars": []const u8 = " \t'\"│`|:;,()[]{}<>$",
Contributor

Instead of making this a []const u8, I'd recommend making a new type here that automatically expands these into a list of codepoints.

This way we don't need an arbitrary max, we can limit it by the allocator (or put a really high limit), and we can allocate, in general!

It also limits the runtime cost when we actually do selection since the boundary characters are already built up.

Author

@mitchellh, I made the change, thanks for your feedback.

Also, I ran zig build run and did a quick test with "some-hyphenated-words" and it worked.

Refactor the selection-word-chars implementation to parse UTF-8 boundary
characters once during config initialization instead of on every selection
operation.

Changes:
- Add SelectionWordChars type that stores pre-parsed []const u32 codepoints
- Parse UTF-8 to codepoints in parseCLI() during config load
- Remove UTF-8 parsing logic from selectWord() hot path (27 lines removed)
- Remove arbitrary 64-character buffer limit
- Update selectWord() and selectWordBetween() to accept []const u32
- Update DerivedConfig to store codepoints directly
- Update all tests to use codepoint arrays

Benefits:
- No runtime UTF-8 parsing overhead on every selection
- No arbitrary character limit (uses allocator instead)
- Cleaner separation of concerns (config handles parsing, selection uses data)
- Better performance in selection hot path
Member

@pluiedev pluiedev left a comment

There are a lot of CI errors; have you tried running the tests yourself first? You have to run zig fmt to clean up the code, too.

const value = input orelse return error.ValueRequired;

// Parse UTF-8 string into codepoints
var list = std.ArrayList(u32).init(alloc);
Member

In Zig 0.15, collection types are unmanaged by default: you need to pass the allocator into every use of the list that may (de-)allocate memory.

Suggested change
- var list = std.ArrayList(u32).init(alloc);
+ var list: std.ArrayList(u32) = .empty;
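
To make the unmanaged pattern concrete, here is a minimal sketch, assuming Zig 0.15. `decodeCodepoints` is a hypothetical helper, not the PR's code, and it uses `u21` as the element type since that is what `std.unicode` yields. The point is only that `append`, `deinit`, and `toOwnedSlice` each take the allocator explicitly.

```zig
const std = @import("std");

/// Decode a UTF-8 string into an owned slice of codepoints.
/// Hypothetical helper for illustration only.
fn decodeCodepoints(alloc: std.mem.Allocator, input: []const u8) ![]u21 {
    // The list starts empty and does not store an allocator.
    var list: std.ArrayList(u21) = .empty;
    errdefer list.deinit(alloc);

    const view = try std.unicode.Utf8View.init(input);
    var it = view.iterator();
    while (it.nextCodepoint()) |cp| {
        // Every call that may allocate receives the allocator.
        try list.append(alloc, cp);
    }
    // The caller owns the returned slice and frees it with `alloc.free`.
    return list.toOwnedSlice(alloc);
}

test "decode with an unmanaged list" {
    const alloc = std.testing.allocator;
    const cps = try decodeCodepoints(alloc, "a│b");
    defer alloc.free(cps);
    try std.testing.expectEqualSlices(u21, &.{ 'a', '│', 'b' }, cps);
}
```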

};

/// The parsed codepoints. Always includes null (U+0000) at index 0.
codepoints: []const u32 = &default_codepoints,
Member

Unicode codepoints are expressed as u21s in the Zig standard library, so we should do the same here and avoid the @intCast below.

@mauroporras mauroporras marked this pull request as draft October 26, 2025 15:57
Author

mauroporras commented Oct 26, 2025

There are a lot of CI errors; have you tried running the tests yourself first? You have to run zig fmt to clean up the code, too.

@mitchellh, sorry about this. I was using Zig 0.15.1 since 0.15.2 is not available in Homebrew just yet, so I built 0.15.2 from source.
I converted the PR to draft while I address your comments. Thanks.
Done, thanks.

- Change all codepoint types from u32 to u21 to align with Zig stdlib
- Update ArrayList to use Zig 0.15 unmanaged pattern (.empty)
- Remove unnecessary @intCast when encoding UTF-8
- Fix formatEntry to use stack-allocated buffer
@mauroporras mauroporras marked this pull request as ready for review October 26, 2025 16:04