```rust
fn add_token(&mut self, doc_ref: &str, token: &str, term_freq: f64) {
    let mut iter = token.chars();
    if let Some(character) = iter.next() {
        // ...
```
During index building, elasticlunr-rs iterates over the token `&str`'s contents as Unicode Scalar Values, via `str::chars()`. The JS library, on the other hand, does it like this:
```js
elasticlunr.InvertedIndex.prototype.addToken = function (token, tokenInfo, root) {
  var root = root || this.root,
      idx = 0;
  while (idx <= token.length - 1) {
    var key = token[idx];
    // ...
```
Here the JS string is actually iterated in UTF-16 code units (`token.length` and `token[idx]` both operate on code units). A single code unit is an entire character for English, most alphabetic text, and common Chinese characters, but not for emoji and rare Chinese characters, which are encoded as surrogate pairs of two code units.
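To make the mismatch concrete, here is a small, self-contained Rust sketch (not part of either library; the token values are made up for illustration) that compares the two views of the same token:

```rust
fn main() {
    // "ab" is plain ASCII; "a🦀" contains an emoji outside the Basic Multilingual Plane.
    for token in ["ab", "a🦀"] {
        let scalar_values = token.chars().count();      // what elasticlunr-rs walks
        let utf16_units = token.encode_utf16().count(); // what the JS `token[idx]` loop walks
        println!("{:?}: {} scalar values, {} UTF-16 code units", token, scalar_values, utf16_units);
    }
    // Prints:
    // "ab": 2 scalar values, 2 UTF-16 code units
    // "a🦀": 2 scalar values, 3 UTF-16 code units
}
```

For the second token the JS loop above sees three keys, the last two being the surrogate halves of the emoji, while the Rust loop sees only two characters, so the two implementations can index such a token under different key sequences.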
There is a related issue reported against mdBook.