Add fuzzy datasets generation logic #41

Open
VoletiRam wants to merge 2 commits into valkey-io:main from VoletiRam:fuzzy_dataset

@VoletiRam
Contributor

Add fuzzy datasets generation logic

Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>
Comment on lines +73 to +75
    elif edit_type == "substitute":
        pos = random.randint(0, len(word) - 1)
        return word[:pos] + random.choice(ALPHABET) + word[pos + 1 :]
Member

This can be a no-op substitution if the randomly chosen letter is the same as the one at pos.
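
One possible way to rule out the no-op, sketched under assumptions: `ALPHABET` is taken to be the lowercase ASCII letters (the PR does not show its definition here), and `substitute_char` is a hypothetical helper name, not a function from the PR. The idea is to draw the replacement from the alphabet minus the character currently at `pos`.

```python
import random
import string

# Assumption: the PR's ALPHABET is the lowercase ASCII letters.
ALPHABET = string.ascii_lowercase


def substitute_char(word: str, rng: random.Random) -> str:
    """Replace one character of a non-empty word with a different letter."""
    pos = rng.randrange(len(word))
    # Exclude the current character so the edit always changes the word.
    candidates = [c for c in ALPHABET if c != word[pos]]
    return word[:pos] + rng.choice(candidates) + word[pos + 1:]
```

With this filter the result always differs from the input in exactly one position. An empty `word` would still raise, as it does in the original expression.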


def _generate_random_word(seed: int, min_length: int, max_length: int) -> str:
    """Generate deterministic random word from seed."""
    random.seed(seed)
Member

Why don't we use rng = random.Random(seed)

and then rng.randint(min, max)?

If the order of calls to random differs anywhere in the flow, the output won't be reproducible.
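
A minimal sketch of the suggestion, not the PR's final code: the function name is shortened to `generate_random_word`, and `ALPHABET` is assumed to be the lowercase ASCII letters. The word is built from a private `random.Random(seed)` instance, so the module-level generator is neither read nor reseeded.

```python
import random
import string

# Assumption: the PR's ALPHABET is the lowercase ASCII letters.
ALPHABET = string.ascii_lowercase


def generate_random_word(seed: int, min_length: int, max_length: int) -> str:
    """Build a word from a private RNG so global random state is untouched."""
    rng = random.Random(seed)
    word_length = rng.randint(min_length, max_length)
    return "".join(rng.choices(ALPHABET, k=word_length))


# The global stream is unaffected by a call in between.
random.seed(7)
expected = random.random()
random.seed(7)
word = generate_random_word(42, 3, 8)
actual = random.random()
assert expected == actual

# The same seed gives the same word regardless of other random calls.
assert word == generate_random_word(42, 3, 8)
```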

        variant = base_word  # First variant is correct spelling
    else:
        # Generate consistent misspelling for this variant_id
        random.seed(term_id * 1000 + variant_id)
Member

Same here: `rng = random.Random(seed)`

Using a local rng instance here will make the output more reproducible.

    return "".join(random.choices(ALPHABET, k=word_length))


def _apply_fuzzy_edit(word: str, edit_type: str) -> str:
Member

Also update this to accept the rng instance,
so we can use rng.randint for all the operations.
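
A rough sketch of what threading the generator through could look like. Only the `"substitute"` branch appears in this thread; the `"insert"` and `"delete"` branches, the `rng` parameter name, and the `ALPHABET` definition are assumptions made for illustration.

```python
import random
import string

# Assumption: the PR's ALPHABET is the lowercase ASCII letters.
ALPHABET = string.ascii_lowercase


def apply_fuzzy_edit(word: str, edit_type: str, rng: random.Random) -> str:
    """Apply one edit to a non-empty word, drawing every choice from rng."""
    if edit_type == "insert":  # hypothetical branch
        pos = rng.randint(0, len(word))
        return word[:pos] + rng.choice(ALPHABET) + word[pos:]
    if edit_type == "delete":  # hypothetical branch
        pos = rng.randint(0, len(word) - 1)
        return word[:pos] + word[pos + 1:]
    if edit_type == "substitute":
        # Same expression as the quoted diff, but drawn from rng.
        pos = rng.randint(0, len(word) - 1)
        return word[:pos] + rng.choice(ALPHABET) + word[pos + 1:]
    raise ValueError(f"unknown edit type: {edit_type}")
```

Because no branch touches the module-level `random` functions, two calls with generators seeded identically return the same result, whatever else the program has drawn in the meantime.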

        variant = base_word  # First variant is correct spelling
    else:
        # Generate consistent misspelling for this variant_id
        random.seed(term_id * 1000 + variant_id)
Member

There can be a seed collision between different (term_id, variant_id) pairs.

I know the chance is low, but we can eliminate it completely.

Can we use a tuple hash for consistency? Like:
rng = random.Random(hash((term_id, variant_id)))
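
To make the collision concrete, here is a small check of the arithmetic, assuming both ids are non-negative integers; `linear_seed` is a hypothetical name for the expression in the diff. Two pairs collide only once `variant_id` reaches 1000.

```python
def linear_seed(term_id: int, variant_id: int) -> int:
    """The seed expression from the diff: term_id * 1000 + variant_id."""
    return term_id * 1000 + variant_id


# Different pairs, same seed, once variant_id is 1000 or more.
assert linear_seed(1, 1000) == linear_seed(2, 0) == 2000

# With variant_id kept below 1000, every pair maps to a distinct seed.
seeds = {linear_seed(t, v) for t in range(50) for v in range(1000)}
assert len(seeds) == 50 * 1000
```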

@VoletiRam
Contributor Author

Addressed all the comments. Please take a look when you get a chance @roshkhatri

Address comments of PR to use local RNG for stable seed generation

Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>
VoletiRam added a commit to valkey-io/valkey-search that referenced this pull request Feb 26, 2026
Add fuzzy scenarios.

### Groups 10-11: Fuzzy search testing
- **Group 10**: Fuzzy best (5 variants of words with distance 1, each ×
20 copies = 100 docs/query)
- **Group 11**: Fuzzy worst (200 variants of words with distance 3 × 20
copies = 4000 docs/query)


Dataset generation logic is in perf-benchmark PR:
valkey-io/valkey-perf-benchmark#41

Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>
Co-authored-by: Ram Prasad Voleti <ramvolet@amazon.com>
        variant = base_word  # First variant is correct spelling
    else:
        # Use tuple hash for collision-free seed
        variant_rng = random.Random(hash((term_id, variant_id)))

I’d avoid deriving the seed from hash(...). It is not guaranteed to be deterministic.

The random module currently accepts any hashable type as a possible seed value. Unfortunately, some of those types are not guaranteed to have a deterministic hash value.

Link: https://docs.python.org/3/whatsnew/3.9.html
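
One alternative that does not rely on `hash(...)`, offered as a sketch rather than as what the PR should necessarily adopt: derive the seed from a cryptographic digest of the two ids. `stable_seed` is a hypothetical helper; the `:` separator and the 8-byte truncation are arbitrary choices.

```python
import hashlib
import random


def stable_seed(term_id: int, variant_id: int) -> int:
    """Derive a seed that is identical across processes and Python versions."""
    # The separator keeps pairs such as (1, 23) and (12, 3) distinct.
    key = f"{term_id}:{variant_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Truncating to 64 bits is an arbitrary choice; collisions are
    # possible in principle but extremely unlikely.
    return int.from_bytes(digest[:8], "big")


variant_rng = random.Random(stable_seed(3, 7))
```

The resulting seed is a plain `int`, which is one of the seed types `random.Random` continues to accept, and it does not depend on per-process hash randomization.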


@sarthakaggarwal97 sarthakaggarwal97 left a comment

Minor comment. I did not look closely at the search-benchmark-specific code.
