
Commit 089f6f1 (1 parent: 284c2f6)

dataset instructions changed from hugging face to kaggle

changes in readme of cloud-edge-collaborative-inference done to use kaggle instead of huggingface

Signed-off-by: Aryan Nanda <nandaaryan823@gmail.com>
Signed-off-by: Aryan <nandaaryan823@gmail.com>

3 files changed: 43 additions & 39 deletions

examples/cloud-edge-collaborative-inference-for-llm/Dockerfile (16 additions, 12 deletions)

```diff
@@ -9,8 +9,11 @@ RUN apt-get update && apt-get install -y \
     curl \
     gnupg \
     git \
-    git-lfs && \
-    git lfs install
+    unzip
+
+# Copy kaggle.json (Make sure this file is in the same directory as your Dockerfile)
+COPY kaggle.json /root/.kaggle/kaggle.json
+RUN chmod 600 /root/.kaggle/kaggle.json
 
 # Clone Ianvs repo
 RUN git clone https://github.com/kubeedge/ianvs.git
@@ -26,16 +29,17 @@ RUN /bin/bash -c "source activate $CONDA_ENV && \
     pip install -r examples/cloud-edge-collaborative-inference-for-llm/requirements.txt && \
     python setup.py install"
 
-# Download and move dataset (still run inside /ianvs)
+# Download Kaggle CLI
+RUN pip install kaggle
+
+# Download dataset
 RUN cd /ianvs && \
-    git clone https://huggingface.co/datasets/FuryMartin/Ianvs-MMLU-5-shot && \
-    git lfs install && \
-    cd Ianvs-MMLU-5-shot && \
-    git lfs pull && \
-    mkdir -p /ianvs/dataset && \
-    mv mmlu-5-shot /ianvs/dataset/ && \
-    mv workspace-mmlu /ianvs/ && \
-    rm -rf Ianvs-MMLU-5-shot  # Optional cleanup
+    kaggle datasets download -d kubeedgeianvs/ianvs-mmlu-5shot && \
+    kaggle datasets download -d kubeedgeianvs/ianvs-gpqa-diamond && \
+    unzip -o ianvs-mmlu-5shot.zip && \
+    unzip -o ianvs-gpqa-diamond.zip && \
+    rm -rf ianvs-mmlu-5shot.zip && \
+    rm -rf ianvs-gpqa-diamond.zip
 
 # Set final working directory
-WORKDIR /ianvs
+WORKDIR /ianvs
```
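The `COPY`/`chmod 600` pair in the Dockerfile above exists because the Kaggle CLI expects the token file to be readable only by its owner. A minimal sketch of the same setup outside Docker, using a throwaway directory and a placeholder token (the username/key values are illustrative, not real credentials):

```shell
set -euo pipefail

# Throwaway config dir standing in for /root/.kaggle in the image.
tmp="$(mktemp -d)"
mkdir -p "$tmp/.kaggle"

# Placeholder token; a real kaggle.json holds your Kaggle username and API key.
printf '{"username":"placeholder","key":"0000"}\n' > "$tmp/.kaggle/kaggle.json"

# Restrict the token to owner read/write, as the Dockerfile does.
chmod 600 "$tmp/.kaggle/kaggle.json"
stat -c '%a' "$tmp/.kaggle/kaggle.json"   # prints 600 on Linux

rm -rf "$tmp"
```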

examples/cloud-edge-collaborative-inference-for-llm/README.md (26 additions, 26 deletions)

````diff
@@ -99,35 +99,47 @@ Before using this example, you need to have the device ready:
 
 The Docker-based setup assumes you have Docker installed on your system and are using an Ubuntu-based Linux distribution.
 
-*If you don't have Docker installed, follow the Docker Engine installation guide [here](https://docs.docker.com/engine/install/ubuntu/).*
+**Note**:
+- If you don't have Docker installed, follow the Docker Engine installation guide [here](https://docs.docker.com/engine/install/ubuntu/).
+- To download datasets from Kaggle inside your Docker container, you need to configure the Kaggle CLI authentication token. Follow the [official Kaggle API documentation](https://www.kaggle.com/docs/api#:~:text=is%20%24PYTHON_HOME/Scripts.-,Authentication,-In%20order%20to) to download your `kaggle.json` token. Once downloaded, move the file into the `~/ianvs/examples/cloud-edge-collaborative-inference-for-llm/` directory after completing step 1 (cloning the Ianvs repo):
 
-1. From the root directory of Ianvs, build the `cloud-edge-collaborative-inference-for-llm` Docker image:
+   ```bash
+   mv /path/to/kaggle.json ~/ianvs/examples/cloud-edge-collaborative-inference-for-llm/
+   ```
+
+1. Clone the Ianvs repo:
+   ```bash
+   git clone https://github.com/kubeedge/ianvs.git
+   cd ianvs
+   ```
+
+2. From the root directory of Ianvs, build the `cloud-edge-collaborative-inference-for-llm` Docker image:
 
    **Note**: If you have already built the image, move on to the next step directly.
 
   ```bash
   docker build -t ianvs-experiment-image ./examples/cloud-edge-collaborative-inference-for-llm/
   ```
 
-2. Run the image in an interactive shell:
+3. Run the image in an interactive shell:
   ```bash
   docker run -it ianvs-experiment-image /bin/bash
   ```
 
-3. Activate the ianvs-experiment Conda environment:
+4. Activate the ianvs-experiment Conda environment:
   ```bash
   conda activate ianvs-experiment
   ```
 
-4. Set the required environment variables for the API (use either OpenAI or GROQ credentials):
+5. Set the required environment variables for the API (use either OpenAI or GROQ credentials):
   ```bash
   export OPENAI_BASE_URL="https://api.openai.com/v1"
   export OPENAI_API_KEY=sk_xxxxxxxx
   ```
 
 `Alternatively, for GROQ, use GROQ_BASE_URL and GROQ_API_KEY.`
 
-5. Run the Ianvs benchmark:
+6. Run the Ianvs benchmark:
   ```bash
   ianvs -f examples/cloud-edge-collaborative-inference-for-llm/benchmarkingjob.yaml
   ```
````
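The credential step in the list above only takes effect in the shell that later runs `ianvs`, so a small guard before launching the benchmark saves a run that would fail partway through. A sketch, with placeholder values standing in for real keys:

```shell
set -u

# Placeholder credentials standing in for the real values from the export step.
export OPENAI_BASE_URL="https://api.openai.com/v1"
export OPENAI_API_KEY="sk_xxxxxxxx"

# Fail fast with a clear message if either variable is missing or empty.
for var in OPENAI_BASE_URL OPENAI_API_KEY; do
  if [ -z "${!var:-}" ]; then
    echo "error: $var is not set" >&2
    exit 1
  fi
done
echo "API credentials present"
```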
````diff
@@ -171,23 +183,14 @@ If you want to use speculative decoding models like [EAGLE](https://github.com/S
 
 ##### Dataset Configuration
 
-Here, we provide the `MMLU-5-shot` dataset and the `GPQA-diamond` dataset for testing. The following are the preparation instructions for `MMLU-5-shot`; `GPQA-diamond` follows the same process.
-
-1. Download `mmlu-5-shot` from [Ianvs-MMLU-5-shot](https://huggingface.co/datasets/FuryMartin/Ianvs-MMLU-5-shot) (or [Ianvs-GPQA-diamond](https://huggingface.co/datasets/FuryMartin/Ianvs-GPQA-diamond)), which is a transformed MMLU-5-shot dataset formatted to fit Ianvs's requirements.
-
-   ```bash
-   git clone https://huggingface.co/datasets/FuryMartin/Ianvs-MMLU-5-shot
-   git lfs install
-   cd Ianvs-MMLU-5-shot
-   git lfs pull
-   cd ..
-   ```
-
-2. Create a `dataset` folder in the root directory of Ianvs and move `mmlu-5-shot` into the `dataset` folder.
+Here, we provide the `MMLU-5-shot` dataset and the `GPQA-diamond` dataset for testing. The following are the preparation instructions for `MMLU-5-shot`; `GPQA-diamond` follows the same process.
 
+1. Download `mmlu-5-shot` into the root directory of Ianvs from [Ianvs-MMLU-5-shot](https://www.kaggle.com/datasets/kubeedgeianvs/ianvs-mmlu-5shot), which is a transformed MMLU-5-shot dataset formatted to fit Ianvs's requirements.
+   **Note**: To download datasets from Kaggle, you need to configure the Kaggle CLI authentication token. Follow the [official Kaggle API documentation](https://www.kaggle.com/docs/api#:~:text=is%20%24PYTHON_HOME/Scripts.-,Authentication,-In%20order%20to) to download your `kaggle.json` token.
   ```bash
-  mkdir dataset
-  mv Ianvs-MMLU-5-shot/mmlu-5-shot/ dataset/
+  kaggle datasets download -d kubeedgeianvs/ianvs-mmlu-5shot
+  unzip -o ianvs-mmlu-5shot.zip
+  rm -rf ianvs-mmlu-5shot.zip
  ```
 
 3. Then, check the path of `train_data` and `test_data` in
````
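The final step in the hunk above asks you to check the `train_data` and `test_data` paths. A quick sanity check can list each configured path and flag any that do not exist on disk; the `testenv.yaml` fragment below is illustrative, not the example's actual config file:

```shell
set -euo pipefail
tmp="$(mktemp -d)"

# Illustrative fragment; the real file lives under the example's testenv directory.
cat > "$tmp/testenv.yaml" <<'EOF'
testenv:
  dataset:
    train_data: "/ianvs/dataset/mmlu-5-shot/train_data/data.jsonl"
    test_data: "/ianvs/dataset/mmlu-5-shot/test_data/data.jsonl"
EOF

# Print each configured path with OK/MISSING depending on whether it exists.
grep -E 'train_data|test_data' "$tmp/testenv.yaml" \
  | awk -F'"' '{print $2}' \
  | while read -r p; do
      [ -e "$p" ] && echo "OK      $p" || echo "MISSING $p"
    done

rm -rf "$tmp"
```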
````diff
@@ -342,11 +345,8 @@ The testing process may take much time, depending on the number of test cases an
 
 To enable you to get the results directly, here we provide a workspace folder with cached results of `Qwen/Qwen2.5-1.5B-Instruct`, `Qwen/Qwen2.5-3B-Instruct`, `Qwen/Qwen2.5-7B-Instruct` and `gpt-4o-mini`.
 
-You can download the `workspace-mmlu` folder from [Ianvs-MMLU-5-shot](https://huggingface.co/datasets/FuryMartin/Ianvs-MMLU-5-shot) and put it under your `ianvs` folder.
-
-```bash
-mv Ianvs-MMLU-5-shot/workspace-mmlu/ .
-```
+You can download the `workspace-mmlu` folder from [Ianvs-MMLU-5-shot](https://www.kaggle.com/datasets/kubeedgeianvs/ianvs-mmlu-5shot) and put it under your `ianvs` folder.
+- Since the `Ianvs-MMLU-5-shot` dataset was already downloaded during dataset configuration, there is no need to do this again.
 
 ##### Run Joint Inference example
 
````

examples/cloud-edge-collaborative-inference-for-llm/requirements.txt (1 addition, 1 deletion)

```diff
@@ -3,5 +3,5 @@ transformers
 openai
 accelerate
 datamodel_code_generator
-git-lfs
+kaggle
 groq
```
