Inconsistent Diff Query Results / Incorrect Euclidean Distance Calculation for Chunk Selection #183

emanuel-skai · 2025-03-19T03:46:10Z

I'm following the nilrag examples to upload data to nilDB and retrieve the document with the closest distance to my query. However, I'm observing inconsistent results on each execution of my custom client script.

From the examples, it isn’t clear to me how the nilai_chat_completion method uses the nilrag payload to extend the LLM context with the closest document data. In my use case, I need to run a custom differences query that includes specific filters (such as user and agent IDs). For that reason, I am using the diff_query_execute and chunk_query_execute methods to manually find the document with the closest distance to my query, then retrieve and decode the text chunks.

Below is the complete client code I’ve written:

#!/usr/bin/env python3
"""
Client for retrieving and decrypting the closest chunk from nilDB with nilRAG.

This script:
  1. Loads nilDB configuration.
  2. Generates a query embedding from a text prompt.
  3. Encrypts and secret-shares the query embedding.
  4. Executes the difference query to get secret-shared difference vectors.
  5. Reconstructs the full difference vector per chunk and computes its Euclidean norm.
  6. Selects the chunk with the minimum norm.
  7. Retrieves the chunk shares for that chunk.
  8. Groups and decrypts the chunk shares to reveal the plain text.
"""

import argparse
import asyncio
import time
from typing import Any, Dict, List, Tuple

import numpy as np
import nilql
from nilrag.config import load_nil_db_config
from nilrag.nildb_requests import ChatCompletionConfig  # if needed elsewhere
from nilrag.util import (
    create_chunks,
    encrypt_float_list,
    generate_embeddings_huggingface,
    load_file,
    euclidean_distance,
)

DEFAULT_CONFIG = "examples/nildb_config.json"
DEFAULT_PROMPT = "Who is Danielle Miller?"


# -----------------------------------------------
# Utility Functions for Distance & Grouping
# -----------------------------------------------

def reconstruct_difference_vector(share_lists: List[List[int]]) -> List[int]:
    """
    Reconstruct the full difference vector by summing element‑wise the secret shares.
    """
    # Convert the list of share vectors to a NumPy array and sum over axis 0.
    return list(np.sum(np.array(share_lists), axis=0))




def get_euclidean_differences(difference_shares: List[List[Dict[str, Any]]]) -> Dict[str, float]:
    """
    Group difference shares by chunk ID (across all nodes), reconstruct the full
    difference vector, and compute its Euclidean norm.
    
    Args:
        difference_shares: A list (one per node) of lists of difference share dicts.
          Each dict contains an '_id' and a 'difference' (list of ints).
    
    Returns:
        A dictionary mapping each chunk ID to its Euclidean norm.
    """
    grouped_shares: Dict[str, List[List[int]]] = {}
    for node_results in difference_shares:
        for result in node_results:
            chunk_id = result["_id"]
            grouped_shares.setdefault(chunk_id, []).append(result["difference"])
    
    total_differences = {}
    for chunk_id, shares in grouped_shares.items():
        full_diff = reconstruct_difference_vector(shares)
        # Compute Euclidean norm of the full difference vector (distance from zero)
        distance = euclidean_distance(full_diff, [0] * len(full_diff))
        total_differences[chunk_id] = distance
    return total_differences


def get_min_euclidean_chunk_id(total_differences: Dict[str, float]) -> Tuple[str, float]:
    """
    Return the chunk ID that has the smallest Euclidean norm (i.e. is closest) and its distance.
    """
    return min(total_differences.items(), key=lambda item: item[1])


def group_chunks_by_id(chunk_shares_by_node: List[List[Dict[str, Any]]]) -> Dict[str, List[str]]:
    """
    Group chunk shares by their chunk ID across all nodes.
    
    Args:
        chunk_shares_by_node: List of lists of chunk dicts (one per node).
          Each dict contains at least '_id' and 'chunk'.
    
    Returns:
        Dictionary mapping each chunk ID to a list of chunk share strings.
    """
    grouped = {}
    for node_result in chunk_shares_by_node:
        for result in node_result:
            chunk_id = result["_id"]
            chunk = result["chunk"]
            grouped.setdefault(chunk_id, []).append(chunk)
    return grouped


# -----------------------------------------------
# Main Client Logic
# -----------------------------------------------

async def main():
    """
    Main function that:
      - Loads nilDB configuration.
      - Generates and encrypts a query embedding.
      - Executes the difference query.
      - Reconstructs and computes Euclidean distances.
      - Selects the closest chunk.
      - Retrieves, groups, and decrypts the chunk.
    """
    parser = argparse.ArgumentParser(description="Retrieve and decrypt the closest chunk from nilDB")
    parser.add_argument(
        "--config",
        type=str,
        default=DEFAULT_CONFIG,
        help=f"Path to nilDB config file (default: {DEFAULT_CONFIG})",
    )
    parser.add_argument(
        "--prompt",
        type=str,
        default=DEFAULT_PROMPT,
        help="Query prompt",
    )
    args = parser.parse_args()

    # Load nilDB configuration.
    nil_db, _ = load_nil_db_config(
        args.config,
        require_bearer_token=True,
        require_schema_id=True,
        require_diff_query_id=True,
    )
    print("NilDB configuration loaded:")
    print(nil_db)
    print()

    num_nodes = len(nil_db.nodes)

    # Generate a query embedding from the prompt.
    query_text = args.prompt
    print("Generating query embedding for:", query_text)
    query_embedding = generate_embeddings_huggingface([query_text])[0]

    # Generate an additive (sum) key and encrypt/share the query embedding.
    additive_key = nilql.ClusterKey.generate({"nodes": [{}] * num_nodes}, {"sum": True})
    nilql_query_embedding = encrypt_float_list(additive_key, query_embedding)

    print("Executing diff_query_execute on all nodes...")
    start_time = time.time()
    difference_shares = await nil_db.diff_query_execute(nilql_query_embedding)
    end_time = time.time()
    print(f"Difference query executed in {end_time - start_time:.2f} seconds.\n")

    # Reconstruct the full difference vector for each chunk and compute its Euclidean norm.
    total_differences = get_euclidean_differences(difference_shares)
    print("Total Euclidean differences by chunk ID:")
    for cid, dist in total_differences.items():
        print(f"  {cid}: {dist}")

    # Select the chunk with the minimum Euclidean distance.
    min_chunk_id, min_distance = get_min_euclidean_chunk_id(total_differences)
    print(f"\nChunk with minimum Euclidean distance: {min_chunk_id} (distance: {min_distance})\n")

    # Retrieve the chunk shares for the selected chunk.
    print("Retrieving chunk shares for chunk ID:", min_chunk_id)
    try:
        # chunk_query_execute expects a list of chunk IDs.
        chunk_shares = await nil_db.chunk_query_execute([min_chunk_id])
        for i, shares in enumerate(chunk_shares):
            print(f"Node {i} chunk shares: {shares}")
    except Exception as e:
        print("Error retrieving chunks:", e)
        return

    # Group the retrieved chunk shares by their chunk ID.
    grouped_chunks = group_chunks_by_id(chunk_shares)
    print("\nGrouped chunk shares by ID:")
    print(grouped_chunks)

    # Generate a XOR key (for "store" secret sharing) to decrypt the chunk.
    xor_key = nilql.ClusterKey.generate({"nodes": [{}] * num_nodes}, {"store": True})
    
    # Decrypt the chunk shares.
    decrypted_chunks = {}
    for cid, shares in grouped_chunks.items():
        decrypted_chunk = nilql.decrypt(xor_key, shares)
        decrypted_chunks[cid] = decrypted_chunk
    print("\nDecrypted chunk(s):")
    for cid, chunk in decrypted_chunks.items():
        print(f"Chunk ID: {cid}\nContent: {chunk}\n")


if __name__ == "__main__":
    asyncio.run(main())

Issue:

For the query prompt "Who is Danielle Miller?", I expect the script to return the chunk with ID
6efb4de5-1eb0-4aba-b76d-c409f5220a81 (which contains the text: "Danielle Miller works at Bailey and Sons as a Engineer, mining. Danielle Miller was born on 2007-10-22 and lives at 61586 Michael Greens, New Holly, CO 29872.").

However, I am observing inconsistent results between executions.

For instance:

On the first run, the script logs the following:

Generating query embedding for: Who is Danielle Miller?
Executing diff_query_execute on all nodes...
Difference query executed in 1.63 seconds.

Total Euclidean differences by chunk ID:
  23e33584-52ef-4de4-9142-77d7519ae47e: 60282160740.39722
  e23add47-2ad7-4e20-9758-3b459f92ed5f: 58575171897.02381
  11b7bc29-5706-4513-a848-caa0ff8b9b99: 58733461718.640976
  6475520b-fc98-4887-8269-c6ded394d82c: 62976465907.51879
  0adfd929-51f8-4f09-b293-ccb365541c43: 57623830012.65394
  5c752d92-68e9-46d2-86d6-5c709760b99a: 59201676266.129944
  cd0855e0-f432-444d-bbc5-05da56984a50: 59821436416.10022
  160526ab-a464-40ea-b1a9-4e8c029fa7e1: 60588573168.52637
  0b3bded1-34a9-42bd-84be-a2fde8d74597: 59666997367.23322
  63bb0a3e-81f4-4f10-a778-1c8e3130714e: 59822711996.85322
  aacb5f9a-c0ef-474f-b4c2-8b3db3dca29f: 61494594822.23865
  6efb4de5-1eb0-4aba-b76d-c409f5220a81: 59358122941.27182
  c0535cb1-c579-4b4f-9874-1aee1ab8c505: 63414294385.157196
  c77d5698-3a58-4a9d-87d1-998beda156c9: 60589046962.35862
  57236530-38c0-4cf3-afb0-89f65bd77a4d: 58100848542.0182
  a9304415-d0ae-48ef-ac8e-22b7f066f99c: 60129203612.205666
  7f00361a-afc9-420d-ba16-ea73302e0c1a: 59046637037.39088
  ca72198b-cd70-4baa-90f0-408e86110f92: 59357408986.0792
  2259b09f-01ec-463d-9025-6b9b2ff24b16: 59513245884.44197
  a1a6c53e-ba80-4de0-80c2-fbe54eb01885: 56164092479.44692

Chunk with minimum Euclidean distance: a1a6c53e-ba80-4de0-80c2-fbe54eb01885 (distance: 56164092479.44692)

Grouped chunk shares by ID:
{'a1a6c53e-ba80-4de0-80c2-fbe54eb01885': ['iu4qEyqB4QJs7upc/ZfhdM/6UIxltErOsswl193Uk0bBCQMSmQytLKZ6ZnmrBNJJ7W0RrxWU3i2b8lgUodmWomgMfsUjg7c3hXhMXHSHoj+qKeWHho0PTf+RoVMJ09g1ZDq+37QZnWu6PmCi/Wl+Axk6U/BFxhCwCgwPcqgjlxBKIlN5XwsJzqeHvqfOcHIUq5V8M3tkbe/WCxmLol8=', 'shJNDM+vHUks4ng04hIdDKzqLboZHA2DLS9r6kqZjOYoifG8QwSlxRxzKQbrnlRDbBGhmx5PQaeM73w5DJRkS6O21CSMyw0SUTVDyuFON5icpW2VgsUx7I53QQ9xy/7Dg5wdRKjXK9tjT57cIr5bGiz0Xm2EJBWX41OtMNiSTCdpW1n6CJzP+78OdGb55SPKPIF8YhQwtcRqABUDOW8=', 'ObECc4xdjypgX/sFcOvcDwxiFkVcyTNtzYwsVPk+cM7EwpPHtm1xydt6bx5g2epj7xXTVWf77/lufkxCwSKVgLjOhMHiLdZMpz5utsag+MhYrP9zd2hczgOIwDMWOBfP0paOqyjjhIL5EJAa/7tMb1C9Lfy1wjAV2GyXYjbQqVpGCyrKJPunW3yl6o1Y5zit43VkfU8ZjAuFPTi7qx4=']}

Decrypted chunk(s):
Chunk ID: a1a6c53e-ba80-4de0-80c2-fbe54eb01885
Content: Melissa Simon works at Robinson-Bailey as a Clinical psychologist. Melissa Simon was born on 1950-04-22 and lives at 52135 Farmer Island, Loristad, MT 96430.

On a different run of the same script I'm getting:


Generating query embedding for: Who is Danielle Miller?
Executing diff_query_execute on all nodes...
Difference query executed in 1.32 seconds.

Total Euclidean differences by chunk ID:
  23e33584-52ef-4de4-9142-77d7519ae47e: 57783454882.33191
  e23add47-2ad7-4e20-9758-3b459f92ed5f: 57302481811.156296
  11b7bc29-5706-4513-a848-caa0ff8b9b99: 58733855911.456985
  6475520b-fc98-4887-8269-c6ded394d82c: 59668681930.68523
  0adfd929-51f8-4f09-b293-ccb365541c43: 59822916070.28668
  5c752d92-68e9-46d2-86d6-5c709760b99a: 58890426550.444405
  cd0855e0-f432-444d-bbc5-05da56984a50: 59202335479.172615
  160526ab-a464-40ea-b1a9-4e8c029fa7e1: 59669210059.4736
  0b3bded1-34a9-42bd-84be-a2fde8d74597: 60282678253.581375
  63bb0a3e-81f4-4f10-a778-1c8e3130714e: 60741389417.74279
  aacb5f9a-c0ef-474f-b4c2-8b3db3dca29f: 57463514984.55475
  6efb4de5-1eb0-4aba-b76d-c409f5220a81: 57141554153.9198
  c0535cb1-c579-4b4f-9874-1aee1ab8c505: 61345128775.45734
  c77d5698-3a58-4a9d-87d1-998beda156c9: 62389420482.57938
  57236530-38c0-4cf3-afb0-89f65bd77a4d: 58733000288.60714
  a9304415-d0ae-48ef-ac8e-22b7f066f99c: 59513719103.02795
  7f00361a-afc9-420d-ba16-ea73302e0c1a: 60588498392.363304
  ca72198b-cd70-4baa-90f0-408e86110f92: 59667959599.887405
  2259b09f-01ec-463d-9025-6b9b2ff24b16: 58890827502.28923
  a1a6c53e-ba80-4de0-80c2-fbe54eb01885: 58732980587.935875

Chunk with minimum Euclidean distance: 6efb4de5-1eb0-4aba-b76d-c409f5220a81 (distance: 57141554153.9198)
Grouped chunk shares by ID:
{'6efb4de5-1eb0-4aba-b76d-c409f5220a81': ['uAoHHS3osnADXU6ZWpeG89dzYKms/QWUPtG4IOMcNoQwBoynP26Du2KX335wT/oB2HMz+jI/3fFnHVNx1dtnfQsQve2DbRJBmn3f/Py/joeXYKl07JTBnqjyweu0aGAuMX5aabZB0FaFap/nswf6qM64fAmqk8crlFR1fyz9duw9+7cmsmoiCKj6+K1ESXg0De91CBpay1dxd9M+X3bc', '77IDHVF3Ycyt2aFXskhGDtILmMxx0L7jrhitons5ZFLvWQ1+D7r4diahvcRoV1+NbEvAl0zZp1MSNpaq9e5+G9AYRRDlbwnI/9AyuYLQk3DsgxaMQkzAKtFXQiW/FCHCu0c/gvVdzcD9nGn8egwAwCq9pq4CQ2EvZX+IG08TEjO4AJIcBXedE7CXRJ+uS3zZ1gZADzuSbegHm1o2VB9o', 'VvxlbhX6v9DLpKKnhLOljyUPlxe2XpsW5OlX4/FJN6//Pu+9EIcUozcWA8k4eYXJ2l+aAxuDCI5VRqy1SVt+SPtMmZMPZ3flAI2gLBIDeIVblN6LjrpuxheF7KArTnHcvRRU224uL7YZmJI7pWKMDZclu9OI5pcxyR3dKQqNDL7glwV9xXjadWtBnHyPdSSltIVZfg3o5fBW3rAwPFua']}

Decrypted chunk(s):
Chunk ID: 6efb4de5-1eb0-4aba-b76d-c409f5220a81
Content: Danielle Miller works at Bailey and Sons as a Engineer, mining. Danielle Miller was born on 2007-10-22 and lives at 61586 Michael Greens, New Holly, CO 29872.

which corresponds to the expected output.
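Comparing the two logs chunk by chunk shows that the distances themselves change between runs, not just the selected minimum. The sketch below hard-codes three of the values printed above; `relative_drift` is a hypothetical helper name introduced only for this check. With a fixed prompt and fixed stored embeddings, a correct reconstruction would be expected to give the same distance for a given chunk on every run.

```python
# Distances copied from the two runs logged above (three chunk IDs only).
run_1 = {
    "6efb4de5-1eb0-4aba-b76d-c409f5220a81": 59358122941.27182,
    "a1a6c53e-ba80-4de0-80c2-fbe54eb01885": 56164092479.44692,
    "23e33584-52ef-4de4-9142-77d7519ae47e": 60282160740.39722,
}
run_2 = {
    "6efb4de5-1eb0-4aba-b76d-c409f5220a81": 57141554153.9198,
    "a1a6c53e-ba80-4de0-80c2-fbe54eb01885": 58732980587.935875,
    "23e33584-52ef-4de4-9142-77d7519ae47e": 57783454882.33191,
}


def relative_drift(a: dict, b: dict) -> dict:
    """Relative change of each chunk's distance between two runs."""
    return {
        cid: abs(a[cid] - b[cid]) / max(a[cid], b[cid])
        for cid in a.keys() & b.keys()
    }


drift = relative_drift(run_1, run_2)
for cid, d in sorted(drift.items()):
    print(f"{cid}: {d:.2%}")

# Any chunk whose distance moved at all points at the reconstruction step,
# because the query and the stored embeddings were the same in both runs.
unstable = [cid for cid, d in drift.items() if d > 1e-9]
print(f"{len(unstable)} of {len(drift)} chunks changed between runs")
```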

I have verified that the client query example consistently returns the correct result. Therefore, it appears that there is an issue with my custom client script logic.

Request for Assistance:

Can you please review the client logic above and help me identify any potential issues that could be causing these non‑deterministic results? In particular:

Is my approach for reconstructing the difference vector and computing the Euclidean norm correct?
Are there any issues with how I'm aggregating or grouping the secret shares from the diff query?
Could the inconsistent results be caused by variations in the secret sharing or encryption/decryption process?

Any insights or suggestions to resolve the inconsistent behavior would be greatly appreciated.

Thank you!

psofiterol · 2025-03-19T08:00:57Z

Hey @emanuel-skai 👋

The implementation in the nilAI repo is probably a good reference here: it walks through this whole flow step by step (closely mirroring the main client logic in your example) and shows how to use both:

diff_query_execute https://github.com/NillionNetwork/nilAI/blob/main/nilai-api/src/nilai_api/routers/private.py#L257
chunk_query_execute https://github.com/NillionNetwork/nilAI/blob/main/nilai-api/src/nilai_api/routers/private.py#L290
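
One thing worth checking against that flow is how the difference shares are recombined. The following is a minimal sketch, not the nilrag or nilql implementation: it assumes the sum shares are additive shares modulo a prime and that floats are stored as fixed-point integers. `MODULUS`, `SCALE`, and every function name below are illustrative assumptions. Under those assumptions, adding the raw share integers without a modular reduction leaves the random masking in the result, so the computed norm changes whenever fresh shares are generated. The distances in the logs above are on the order of 10^10, which hints at this kind of un-reduced sum.

```python
import math
import secrets
from typing import Dict, List

# Assumed constants for illustration only; not read from nilql or nilrag.
MODULUS = 2**32 + 15  # shares are integers in [0, MODULUS)
SCALE = 2**16         # hypothetical fixed-point scale for float values


def share(value: int, num_nodes: int) -> List[int]:
    """Split an integer into random additive shares (sum == value mod MODULUS)."""
    parts = [secrets.randbelow(MODULUS) for _ in range(num_nodes - 1)]
    parts.append((value - sum(parts)) % MODULUS)
    return parts


def reconstruct(shares: List[int]) -> float:
    """Sum shares modulo MODULUS, restore the sign, undo the fixed-point scale."""
    total = sum(shares) % MODULUS
    signed = total - MODULUS if total > MODULUS // 2 else total
    return signed / SCALE


def group_by_chunk(node_results: List[List[dict]]) -> Dict[str, List[List[int]]]:
    """Collect each chunk's share vectors, one vector per node."""
    grouped: Dict[str, List[List[int]]] = {}
    for results in node_results:
        for row in results:
            grouped.setdefault(row["_id"], []).append(row["difference"])
    return grouped


def modular_distances(node_results: List[List[dict]]) -> Dict[str, float]:
    """Reconstruct every coordinate modulo MODULUS, then take the Euclidean norm."""
    return {
        # zip(*vectors) transposes: one tuple of per-node shares per coordinate.
        cid: math.sqrt(sum(reconstruct(list(col)) ** 2 for col in zip(*vectors)))
        for cid, vectors in group_by_chunk(node_results).items()
    }


def naive_distances(node_results: List[List[dict]]) -> Dict[str, float]:
    """Norm of the plain integer sum of shares, with no reduction or decoding."""
    return {
        cid: math.sqrt(sum(sum(col) ** 2 for col in zip(*vectors)))
        for cid, vectors in group_by_chunk(node_results).items()
    }


def simulate_diff_query(true_diffs: Dict[str, List[float]], num_nodes: int = 3):
    """Stand-in for the diff query: fresh random shares on every call."""
    nodes: List[List[dict]] = [[] for _ in range(num_nodes)]
    for cid, diff in true_diffs.items():
        columns = [share(round(d * SCALE) % MODULUS, num_nodes) for d in diff]
        for n in range(num_nodes):
            nodes[n].append({"_id": cid, "difference": [c[n] for c in columns]})
    return nodes


true_diffs = {"near": [0.1, -0.2, 0.05], "far": [0.9, 0.7, -0.8]}
for run in (1, 2):
    shares = simulate_diff_query(true_diffs)
    print(f"run {run} naive:  ", naive_distances(shares))
    print(f"run {run} modular:", modular_distances(shares))
```

In this toy setup the naive distances change on every run, while the modular ones stay fixed and rank "near" first. For the real client, the recombination is presumably best left to the library's own decryption step for the sum key, as in the nilAI flow linked above, rather than a hand-rolled sum.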
