Conversation
Add fuzzy datasets generation logic Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>
scripts/setup_datasets.py
Outdated
| elif edit_type == "substitute": | ||
| pos = random.randint(0, len(word) - 1) | ||
| return word[:pos] + random.choice(ALPHABET) + word[pos + 1 :] |
There was a problem hiding this comment.
This can be a no-op substitution of the random alphabet is the same as the one at pos
scripts/setup_datasets.py
Outdated
|
|
||
| def _generate_random_word(seed: int, min_length: int, max_length: int) -> str: | ||
| """Generate deterministic random word from seed.""" | ||
| random.seed(seed) |
There was a problem hiding this comment.
Why dont we use rng = random.Random(seed)
then rng.randint(min,max)
If the order of calling random differes in the flow then it wont be reproducable.
scripts/setup_datasets.py
Outdated
| variant = base_word # First variant is correct spelling | ||
| else: | ||
| # Generate consistent misspelling for this variant_id | ||
| random.seed(term_id * 1000 + variant_id) |
There was a problem hiding this comment.
same here:`rng = random.Random(seed)'
setting this local rng state will help to increase reproducibility.
scripts/setup_datasets.py
Outdated
| return "".join(random.choices(ALPHABET, k=word_length)) | ||
|
|
||
|
|
||
| def _apply_fuzzy_edit(word: str, edit_type: str) -> str: |
There was a problem hiding this comment.
Also update this to accept the rng state
so we can use rng.randint for all the operations
scripts/setup_datasets.py
Outdated
| variant = base_word # First variant is correct spelling | ||
| else: | ||
| # Generate consistent misspelling for this variant_id | ||
| random.seed(term_id * 1000 + variant_id) |
There was a problem hiding this comment.
there can be a seed collision for different term_id and variant_id.
I know the chances are less, but we can completely eliminate these chances.
Can we use a what is a tuple hash for consistency?
like:
rng = random.Random(hash((term_id, variant_id)))
|
Addressed all the comments. Please take a look when you get a chance @roshkhatri |
Address comments of PR to use local RNG for stable seed generation Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>
e6a5b0a to
8c57ec1
Compare
Add fuzzy scenarios. ### Groups 10-11: Fuzzy search testing - **Group 10**: Fuzzy best (5 variants of words with distance 1, each × 20 copies = 100 docs/query) - **Group 11**: Fuzzy worst (200 variants of words with distance 3 × 20 copies = 4000 docs/query) Dataset generation logic is in perf-benchmark PR: valkey-io/valkey-perf-benchmark#41 Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com> Co-authored-by: Ram Prasad Voleti <ramvolet@amazon.com>
| variant = base_word # First variant is correct spelling | ||
| else: | ||
| # Use tuple hash for collision-free seed | ||
| variant_rng = random.Random(hash((term_id, variant_id))) |
There was a problem hiding this comment.
I’d avoid deriving the seed from hash(...). It is not guaranteed to be deterministic.
The random module currently accepts any hashable type as a possible seed value. Unfortunately, some of those types are not guaranteed to have a deterministic hash value.
Add fuzzy datasets generation logic