Skip to content

Kleenex-ultrasoft/Secure_ANN

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Secure ANN

Source code for our secure approximate nearest neighbour (ANN) search under secret-shared MPC, accompanying the extended version of the HNSecW paper (paper.pdf in this repo). The package is source-only and excludes raw datasets, logs, and build outputs.

Repository layout

2pc/                  ABY (2PC) implementation: single, multi, batch query
3pc/aby3/             ABY3 (3PC) backend (faster mixed arithmetic/Boolean SS)
3pc/mp_spdz/          MP-SPDZ (3PC) backend + .mpc programs
baselines/
  panther_generalize/   PANTHER source-modified (DORAM + SS-IP + secret DB)
  sort_3pc/           Exact top-k 3PC SORT baseline (replicated radix sort)
rag/                  End-to-end RAG eval (HotpotQA + MS MARCO 1M)
scripts/              Build / sync helpers + dataset download
docs/                 Protocol notes (online metrics, plaintext HNSW, network)

Implementations

2PC (ABY) 3PC (MP-SPDZ + ABY3)
Single query 2pc/src/hnsecw/hnsecw_single_b2y.cpp 3pc/mp_spdz/Programs/Source/hnsecw_search_3pc.mpc
3pc/aby3/src/hnsecw_search_aby3.cpp (alternative backend)
Multiple queries 2pc/src/hnsecw/hnsecw_multi_b2y_dyn.cpp 3pc/mp_spdz/Programs/Source/hnsecw_search_3pc_multi.mpc
Batch queries 2pc/src/hnsecw/hnsecw_batch_b2y.cpp 3pc/mp_spdz/Programs/Source/hnsecw_search_3pc_batch.mpc

The deployed search parameters from §4.1 (ef^{(0)} = k+10, Δ̄^{(0)} = ⌈log₂(N·d/M)⌉ + C, C = 7) are derived automatically by 2pc/bench/build_npz.py and 3pc/mp_spdz/bench/gen_mpspdz_inputs.py from the dataset shape; pass --k, --ef_base, --tau_base, or --tau_upper to override. The reshuffled per-layer tables G_\ell, D_\ell, and the dummy pool DUM_\ell are built offline by build_hnsecw_index.sh.

Quick start

0. Dependencies

Tested on Ubuntu 20.04 / 24.04 with g++ ≥ 10, cmake ≥ 3.16, Python 3.9+.

0.1 System packages

sudo apt-get install -y build-essential cmake git python3 python3-pip \
  libssl-dev libboost-all-dev libgmp-dev libsodium-dev libntl-dev \
  libtool autoconf automake yasm
pip install hnswlib numpy scipy

0.2 MPC frameworks

Each MPC backend is a separate external dependency. Install only what you need; the table below maps each backend to which parts of this repository require it.

Framework Required for Build (one-time)
ABY 2pc/src/hnsecw/ (paper-default 2PC HNSecW) git clone https://github.com/encryptogroup/ABY && cmake -DABY_BUILD_EXE=ON ...
MP-SPDZ 2pc/mp_spdz/ (alt 2PC), 3pc/mp_spdz/ (3PC HNSecW), baselines/sort_3pc/, baselines/panther_generalize/run_oram_bench.sh git clone https://github.com/data61/MP-SPDZ && make -j tldr replicated-ring-party.x semi2k-party.x
OpenPanther baselines/panther_generalize/ only (source-modified PANTHER) git clone https://github.com/secretflow/spu and place our experimental/panther into it; bazel build (see baseline README)
ABY3 (optional) 3pc/aby3/ alternative 3PC backend git clone https://github.com/ladnir/aby3 && cmake ...

Set the matching env vars before running anything:

export ABY_ROOT=/path/to/ABY                       # for 2pc/src/hnsecw
export MPSPDZ_ROOT=/path/to/MP-SPDZ                # everything MP-SPDZ-based
export SPU_ROOT=/path/to/spu                       # for panther_generalize
ABY
git clone https://github.com/encryptogroup/ABY.git "$ABY_ROOT"
cd "$ABY_ROOT"
cmake -B build -S . -DABY_BUILD_EXE=ON \
      -DCMAKE_CXX_FLAGS='-include cstdlib'        # ABY misses a stdlib include
cmake --build build -j

# Wire our 2PC HNSecW sources into ABY's example tree, then rebuild.
rsync -a 2pc/src/hnsecw "$ABY_ROOT/src/examples/"
cmake --build "$ABY_ROOT/build" -j hnsecw_cli

ABY has two known upstream issues we work around: #114 "Erroneous Multiplication Results" and #152 "IKNP base-OT thread-unsafe". The mitigations are documented in 2pc/README.md and applied via patches/aby_arith_reset_clear_mvC.patch (run-time per-query process isolation also works around them — see scripts/run_multi_via_single.sh).

MP-SPDZ
git clone --depth 1 https://github.com/data61/MP-SPDZ.git "$MPSPDZ_ROOT"
cd "$MPSPDZ_ROOT"

# Use system g++ (clang++ is also fine but slower to set up).
# Most paper experiments only need replicated-ring-party.x and
# semi2k-party.x.  `make tldr` builds all the common deps.
make -j tldr
make -j replicated-ring-party.x        # 3PC honest-majority ring (used everywhere)
make -j semi2k-party.x                 # 2PC semi-honest ring (used by 2pc/mp_spdz)

If your libsodium / gmpxx come from conda, point MP-SPDZ's CONFIG.mine at them (we ship an example in 2pc/mp_spdz/README.md). No special patches needed.

OpenPanther (for baselines/panther_generalize/ only)
git clone --depth 1 https://github.com/secretflow/spu.git "$SPU_ROOT"
cp -r baselines/panther_generalize/OpenPanther/experimental/panther \
      "$SPU_ROOT/experimental/"
cd "$SPU_ROOT" && bazel build -c opt \
      //experimental/panther:random_panther_doram
# bazel ≥ 6.0 required; first build ~1-2 h (LLVM, abseil, SEAL, EMP, yacl).

Skip this step if you only run HNSecW + sort_3pc (MP-SPDZ-based) — neither needs SPU/bazel.

1. Get a small dataset

bash scripts/download_demo_dataset.sh ~/hnsecw_build/datasets
# downloads SIFT-1M (texmex.irisa.fr) and writes the first 10K vectors
# as ~/hnsecw_build/datasets/sift_base.npy so demo.sh / build_hnsecw_index.sh
# can run end-to-end.

Public download links for the 10 datasets reported in the paper:

Dataset Public download landing page
SIFT-1M / GIST-1M / BIGANN-1B http://corpus-texmex.irisa.fr/
Deep1B / DEEP-100M http://sites.skoltech.ru/compvision/noimi/
MNIST-60K http://yann.lecun.com/exdb/mnist/
Fashion-60K https://github.com/zalandoresearch/fashion-mnist
SPACEV-1B / SSNPP-1B https://big-ann-benchmarks.com/
LAION https://laion.ai/blog/laion-5b/
MS MARCO https://microsoft.github.io/msmarco/

2. Build a reshuffled index

bash build_hnsecw_index.sh ~/hnsecw_build/datasets/sift_base.npy /tmp/hnsecw_index
# Output:
#   /tmp/hnsecw_index/plain_hnsw_inputs.npz
#   /tmp/hnsecw_index/meta.json

3. Run a query in MPC

2PC (ABY)

ABY_ROOT=/path/to/ABY
rsync -a ./2pc/src/hnsecw "$ABY_ROOT/src/examples/hnsecw"
cmake -S "$ABY_ROOT" -B "$ABY_ROOT/build" -DABY_BUILD_EXE=ON
cmake --build "$ABY_ROOT/build" -j

./2pc/bench/run_single_query.sh /tmp/hnsecw_index/plain_hnsw_inputs.npz
./2pc/bench/run_multi_query.sh  /tmp/hnsecw_index/plain_hnsw_inputs.npz --num_queries 32
./2pc/bench/run_batch_query.sh  /tmp/hnsecw_index/plain_hnsw_inputs.npz --num_queries 32

3PC (MP-SPDZ replicated-ring)

The MP-SPDZ backend uses replicated-ring-party.x (3-party honest-majority, replicated additive ring sharing). Same revealed-fetch design as the 2PC ABY path: per-iter id' revealed once a layer to plaintext-fetch graph rows + base vectors, while the per-layer reshuffle (paper §4.3) hides the access pattern. The per-iter sort lowers to Asharov et al.'s replicated radix sort under the hood.

# Build MP-SPDZ replicated-ring backend (one-time)
cd $MPSPDZ_ROOT && make -j replicated-ring-party.x && cd -

# Single query
NPZ=/tmp/hnsecw_index/plain_hnsw_inputs.npz \
QUERY=/path/to/queries.npy \
MPSPDZ_ROOT=$MPSPDZ_ROOT \
  bash 3pc/mp_spdz/bench/run_search_3pc.sh
# ->  Recall@10        = 90.00%
# ->  Online latency  = 1.770 s/query

# Multi (Q queries serialized; per-query entry chains computed in plaintext)
NPZ=... QUERY=... MODE=multi NUM_Q=4 \
  bash 3pc/mp_spdz/bench/run_search_3pc.sh
# ->  Recall@10        = 95.00%  (avg over 4 queries)
# ->  Online latency  = 1.623 s/query  (avg over 4 queries)

# Batch (Q queries lockstep; sort circuit shared across queries)
NPZ=... QUERY=... MODE=batch NUM_Q=4 \
  bash 3pc/mp_spdz/bench/run_search_3pc.sh
# ->  Recall@10        = 95.00%  (avg over 4 queries)
# ->  Online latency  = 1.707 s/query  (avg over 4 queries)

Numbers above are SIFT-100K / M=32 / ef=32 / K=10, single LAN host.

Baselines

Two baselines for fair comparison; both ship with end-to-end run scripts.

# (1) PANTHER generalized — source-level modification of upstream
#     OpenPanther (Li et al., USENIX Sec 2025) that swaps:
#         FHE-PIR        → 2PC DORAM bridged to upstream Duoram cpir-read
#                          (Vadapalli, Henry, Goldberg, USENIX Sec '23)
#         FHE-distance   → SS-IP via kernel::hal::matmul on shares
#         Plaintext DB   → secret-shared DB (additive shares both ranks)
#     while preserving PANTHER's k-means ANN training + IVF index +
#     batched-argmin + GC top-k composition.
#
#     New file: baselines/panther_generalize/OpenPanther/experimental/
#               panther/benchmark/random_panther_doram.cc
#     Diff:     baselines/panther_generalize/patches/
#               random_panther_doram_substitution.patch
#
#     Build (requires SPU repo + bazel):
git clone --depth 1 https://github.com/secretflow/spu.git
cp -r baselines/panther_generalize/OpenPanther/experimental/panther \
      spu/experimental/
cd spu && bazel build -c opt \
      //experimental/panther:random_panther_doram
#     Run (LAN, single host):
./bazel-bin/.../random_panther_doram --rank=0 \
      --parties=127.0.0.1:9530,127.0.0.1:9531 &
./bazel-bin/.../random_panther_doram --rank=1 \
      --parties=127.0.0.1:9530,127.0.0.1:9531

# (2) Exact top-k 3PC SORT — distance over all N vectors + replicated
#     radix sort (Asharov et al.) + ORAM fetch of top-k.  Runs end-to-end:
MP_SPDZ=$MPSPDZ_ROOT bash baselines/sort_3pc/bench/run_sort_3pc.sh \
    --cfg /tmp/sort_cfg.json --vecs /path/to/vecs.npy \
    --queries /path/to/queries.npy --num-queries 4 --output id --reveal

See each subfolder's README.md for input-prep, parameter knobs, and the MPC threat-model layout.

4. Demo (one-shot)

Two end-to-end SIFT-100K demos are shipped — both pull a real SIFT query, run the full MPC pipeline, and print top-10 + recall@10:

# (1) HNSecW (2PC ABY) on SIFT-100K, M=16 ef=16
bash large_demo.sh
# ->  [mpc total] online latency(s) / comm(MB)
# ->  Top-1 (MPC) / MPC top-10 / Ground-truth top-10 / recall@10

# (2) panther_generalize baseline (2PC Cheetah + Duoram) on SIFT-100K, u=123
bash baselines/panther_generalize/demo.sh
# ->  Total Latency / Total Communication
# ->  Plaintext IVF top-10 (panther's algorithmic baseline) + recall@10

Both demos default to query #0 of sift_query.npy and the first 100 000 rows of sift_base.npy; pass --query-idx N to use a different query.

Network setup

We emulate two network conditions, matching the paper's experimental setup (§7.1):

  • LAN: 4000 Mbps, 1 ms RTT
  • WAN: 320 Mbps, 50 ms RTT

baselines/panther_generalize/throttle_ns.sh installs the matching tc-netem qdiscs. See docs/ONLINE_METRICS.md for the per-system counter mapping.

Notes

  • External MPC frameworks: ABY, ABY3, and MP-SPDZ.
  • Build / runtime issues are summarised in docs/TROUBLESHOOTING.md.
  • If a run reports unexpectedly low recall, the most common cause is a mismatch between the NPZ's ef_base / tau_base and the values baked into the framework binary. Inspect meta.json to verify.

License

Code in this repository is released for research purposes only.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors