Add fuzzy datasets generation logic by VoletiRam · Pull Request #41 · valkey-io/valkey-perf-benchmark

VoletiRam · 2026-02-25T05:40:10Z

Add fuzzy datasets generation logic

Add fuzzy datasets generation logic

Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

roshkhatri · 2026-02-25T18:51:25Z

scripts/setup_datasets.py

+    elif edit_type == "substitute":
+        pos = random.randint(0, len(word) - 1)
+        return word[:pos] + random.choice(ALPHABET) + word[pos + 1 :]


This can be a no-op substitution if the randomly chosen letter is the same as the one at pos.
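One way to rule out the no-op is to exclude the current character from the candidate letters. The sketch below is illustrative only: `substitute_char` is a hypothetical helper name, `ALPHABET` is assumed to be lowercase ASCII (its definition is not shown in the diff), and taking an explicit generator argument is also an assumption.

```python
import random
import string

# Assumption: the PR's ALPHABET is not shown; lowercase ASCII is used here.
ALPHABET = string.ascii_lowercase


def substitute_char(word: str, rng: random.Random) -> str:
    """Replace one character of a non-empty word with a different letter."""
    pos = rng.randint(0, len(word) - 1)
    # Drop the current character from the candidates so the edit always changes the word.
    candidates = [c for c in ALPHABET if c != word[pos]]
    return word[:pos] + rng.choice(candidates) + word[pos + 1 :]
```

With this filter the result always differs from the input in exactly one position, so a "substitute" edit can no longer yield the original spelling.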

roshkhatri · 2026-02-25T19:05:49Z

scripts/setup_datasets.py


+def _generate_random_word(seed: int, min_length: int, max_length: int) -> str:
+    """Generate deterministic random word from seed."""
+    random.seed(seed)


Why don't we use rng = random.Random(seed)

and then rng.randint(min, max)?

If the order of calls to random differs in the flow, then the output won't be reproducible.
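A minimal sketch of the suggestion, assuming `ALPHABET` is lowercase ASCII (not shown in the diff) and using a hypothetical function name without the leading underscore. The private `random.Random` instance keeps the word generation independent of whatever else touches the module-level generator.

```python
import random
import string

ALPHABET = string.ascii_lowercase  # assumption: not shown in the PR


def generate_random_word(seed: int, min_length: int, max_length: int) -> str:
    """Generate a deterministic word from seed using a private generator."""
    rng = random.Random(seed)  # local state; the global generator is untouched
    word_length = rng.randint(min_length, max_length)
    return "".join(rng.choices(ALPHABET, k=word_length))


# The global generator's state is the same before and after the call.
state_before = random.getstate()
word = generate_random_word(42, 4, 10)
assert random.getstate() == state_before
```

Because the function no longer reseeds the shared generator, other code that relies on `random` is not affected by when or how often words are generated.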

roshkhatri · 2026-02-25T19:07:49Z

scripts/setup_datasets.py

+                variant = base_word  # First variant is correct spelling
+            else:
+                # Generate consistent misspelling for this variant_id
+                random.seed(term_id * 1000 + variant_id)


Same here: `rng = random.Random(seed)`

Using local RNG state here will help reproducibility.

roshkhatri · 2026-02-25T19:09:26Z

scripts/setup_datasets.py

+    return "".join(random.choices(ALPHABET, k=word_length))
+
+
+def _apply_fuzzy_edit(word: str, edit_type: str) -> str:


Also update this to accept the rng state,
so we can use rng.randint for all the operations.
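A sketch of what threading the generator through the helper could look like. Only the "substitute" edit type appears in the diff; the "insert" and "delete" branches, the `ALPHABET` value, and the function name without the leading underscore are assumptions made for illustration. The substitute branch mirrors the diff as written.

```python
import random
import string

ALPHABET = string.ascii_lowercase  # assumption: the PR's ALPHABET is not shown


def apply_fuzzy_edit(word: str, edit_type: str, rng: random.Random) -> str:
    """Apply one edit to a non-empty word, drawing all randomness from rng."""
    if edit_type == "insert":  # assumed edit type
        pos = rng.randint(0, len(word))
        return word[:pos] + rng.choice(ALPHABET) + word[pos:]
    if edit_type == "delete":  # assumed edit type
        pos = rng.randint(0, len(word) - 1)
        return word[:pos] + word[pos + 1 :]
    if edit_type == "substitute":
        pos = rng.randint(0, len(word) - 1)
        return word[:pos] + rng.choice(ALPHABET) + word[pos + 1 :]
    raise ValueError(f"unknown edit type: {edit_type}")
```

The caller owns the generator, so the same seeded `rng` can be passed through several edits in a row and the whole sequence stays reproducible.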

roshkhatri · 2026-02-25T19:23:30Z

scripts/setup_datasets.py

+                variant = base_word  # First variant is correct spelling
+            else:
+                # Generate consistent misspelling for this variant_id
+                random.seed(term_id * 1000 + variant_id)


There can be a seed collision between different (term_id, variant_id) pairs.

I know the chances are low, but we can eliminate them completely.

Can we use a tuple hash for consistency? For example:
rng = random.Random(hash((term_id, variant_id)))
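To make the collision concrete, the arithmetic can be checked directly. The helper names below are hypothetical, and `bounded_seed` is one possible alternative (not the approach taken in the PR) that stays collision-free when the number of variants per term has a known upper bound.

```python
def linear_seed(term_id: int, variant_id: int) -> int:
    """Seed formula from the diff under review."""
    return term_id * 1000 + variant_id


# Two different pairs map to the same seed once variant_id reaches 1000.
assert linear_seed(1, 1000) == linear_seed(2, 0) == 2000


def bounded_seed(term_id: int, variant_id: int, max_variants: int) -> int:
    """Collision-free as long as 0 <= variant_id < max_variants."""
    if not 0 <= variant_id < max_variants:
        raise ValueError("variant_id out of range")
    return term_id * max_variants + variant_id
```

The original formula is therefore safe only while every term has fewer than 1000 variants; making the bound explicit turns that silent assumption into a checked one.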

VoletiRam · 2026-02-26T00:37:56Z

Addressed all the comments. Please take a look when you get a chance @roshkhatri

Address comments of PR to use local RNG for stable seed generation

Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

Add fuzzy scenarios.

### Groups 10-11: Fuzzy search testing

- **Group 10**: Fuzzy best (5 variants of words with distance 1, each × 20 copies = 100 docs/query)
- **Group 11**: Fuzzy worst (200 variants of words with distance 3 × 20 copies = 4000 docs/query)

Dataset generation logic is in perf-benchmark PR: valkey-io/valkey-perf-benchmark#41

Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>
Co-authored-by: Ram Prasad Voleti <ramvolet@amazon.com>

sarthakaggarwal97 · 2026-03-12T17:25:32Z

scripts/setup_datasets.py

+                variant = base_word  # First variant is correct spelling
+            else:
+                # Use tuple hash for collision-free seed
+                variant_rng = random.Random(hash((term_id, variant_id)))


I’d avoid deriving the seed from hash(...). It is not guaranteed to be deterministic.

The random module currently accepts any hashable type as a possible seed value. Unfortunately, some of those types are not guaranteed to have a deterministic hash value.

Link: https://docs.python.org/3/whatsnew/3.9.html
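One possible alternative to `hash(...)` is to derive the integer seed from a cryptographic digest of the two IDs, which depends only on the input bytes. This is a sketch, not the fix adopted in the PR: `stable_seed` is a hypothetical helper, and the `"term_id:variant_id"` key format and the 8-byte truncation are arbitrary choices.

```python
import hashlib
import random


def stable_seed(term_id: int, variant_id: int) -> int:
    """Derive an integer seed that depends only on the two IDs."""
    # The separator keeps (1, 23) and (12, 3) from producing the same key.
    key = f"{term_id}:{variant_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # 8 bytes (64 bits) is plenty for a seed; the truncation length is arbitrary.
    return int.from_bytes(digest[:8], "big")


variant_rng = random.Random(stable_seed(7, 3))
```

Seeding with an `int` also stays within the seed types that the `random` module documents as supported.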

sarthakaggarwal97

Minor comment. I did not look deeply at the search-benchmark-specific code.

Add fuzzy datasets generation logic

c7d342d


VoletiRam mentioned this pull request Feb 25, 2026

Add fuzzy scenarios valkey-io/valkey-search#819

Merged

roshkhatri reviewed Feb 25, 2026


Address comments of PR to use local RNG for stable seed generation

8c57ec1


VoletiRam force-pushed the fuzzy_dataset branch from e6a5b0a to 8c57ec1 on February 26, 2026 02:36

sarthakaggarwal97 reviewed Mar 12, 2026

VoletiRam wants to merge 2 commits into valkey-io:main from VoletiRam:fuzzy_dataset
