Skip to content

Add split and split_once to bit_array module#803

Open
ackurat wants to merge 6 commits intogleam-lang:mainfrom
ackurat:add-bitarray-split
Open

Add split and split_once to bit_array module#803
ackurat wants to merge 6 commits intogleam-lang:mainfrom
ackurat:add-bitarray-split

Conversation

@ackurat
Copy link

@ackurat ackurat commented Feb 12, 2025

Hello,
This PR solves issue #568 and also adds a split function. Both split and split_once function as their counterparts in the the gleam/string module, and have feature parity across the targets.

There are two older PRs (#571, #629) which addresses this, but they seem to be outdated/abandoned.

The split function could take options to make it have feature parity with Erlang's binary:split/3, but I opted for a simpler version here. Let me know if you'd like me to add the options feature.

@ackurat ackurat marked this pull request as ready for review February 12, 2025 11:59
Copy link
Member

@lpil lpil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I've left some notes inline.

[A, B] -> {ok, {A, B}};
_ -> {error, nil}
end
catch error:badarg -> {error, nil}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this try-catch for?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It catches this case, where Erlang would raise.

Now that I reread about the Erlang implementation I realize that it can also raise a nif_error, so maybe the catch should be more generic to avoid raising altogether.

try {
const patternEmpty = pattern.buffer.length < 1
const patternLongerThanBits = pattern.buffer.length >= bits.buffer.length
const incorrectArguments = !(bits instanceof BitArray) || !(pattern instanceof BitArray)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No runtime type checking please 🙏

export function bit_array_split_once(bits, pattern) {
try {
const patternEmpty = pattern.buffer.length < 1
const patternLongerThanBits = pattern.buffer.length >= bits.buffer.length
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may not be the length of the bit array itself.

}

return new Error(Nil);
} catch (e) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this try-catch for?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember if it catches a specific case. I'll check and remove it if not.

/// // -> Error(Nil)
/// ```
@external(erlang, "gleam_stdlib", "bit_array_split_once")
@external(javascript, "../gleam_stdlib.mjs", "bit_array_split_once")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not implement this in Gleam rather than Erlang? Could be a bunch nicer, and we wouldn't need to use any private APIs which should not be used.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of this comment: #629 (comment)

I can give a stab at implementing it in Gleam if you'd prefer it. The Erlang binary:split is a BIF so performance-wise it makes sense to use it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I made a typo here. I meant to say JavaScript rather than Erlang 😅

for (let j = 0; j < pattern.buffer.length; j++) {
if (bits.buffer[i + j] !== pattern.buffer[j]) {
continue find;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like quite an expensive algorithm, it is checking bytes multiple times even when we know they could not match.

There's a few established algorithms we could use https://en.wikipedia.org/wiki/String-searching_algorithm. Boyer–Moore–Horspool seems fairly straightforward, but two-way algorithm seems to be the most popular approach.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's a naive approach. I'll have a go at one of the more efficient algorithms.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants