This project demonstrates how to run GPU-accelerated XGBoost machine learning models using NVIDIA RAPIDS on Azure Container Apps. The implementation uses Apache Spark with RAPIDS acceleration to process the Agaricus mushroom dataset, train an XGBoost classification model to identify edible vs. poisonous mushrooms, and store vector embeddings in Azure Cosmos DB.
Shortlink: https://aka.ms/sparkrapidsgpudemo
The application:
- Leverages NVIDIA RAPIDS for GPU-accelerated data processing
- Trains an XGBoost model with GPU acceleration on the Agaricus mushroom dataset
- Runs in Azure Container Apps with GPU support
- Stores vector embeddings in Azure Cosmos DB for vector search capabilities
- Reads and writes data from Azure Storage
This implementation uses the Agaricus Mushroom dataset from the UCI Machine Learning Repository. The dataset includes descriptions of hypothetical samples of 23 species of gilled mushrooms in the Agaricus and Lepiota family. Each sample is classified as:
- Edible (e)
- Poisonous (p)
The dataset contains 8,124 instances with 22 categorical attributes, such as cap shape, odor, and gill size.
- Azure Subscription
- Azure CLI
- Docker
- Maven
- Java 11 JDK
- NVIDIA CUDA drivers (for local development)
- Clone this repository:

  ```bash
  git clone <repository-url>
  cd bbenz-azure-aca-rapids
  ```

- Install the required dependencies:

  ```bash
  mvn clean install
  ```

- If you want to run the application locally with GPU support, ensure you have an NVIDIA GPU with CUDA drivers installed.
The application uses the following configuration parameters:
| Parameter | Description | Default |
|---|---|---|
| `--data-source` | Path to the input data file (ABFS path) | `abfss://<container>@<account>.dfs.core.windows.net/agaricus_data.csv` |
| `--cosmos-endpoint` | Azure Cosmos DB endpoint URL | From environment variable `COSMOS_ENDPOINT` |
| `--cosmos-key` | Azure Cosmos DB access key | From environment variable `COSMOS_KEY` |
| `--cosmos-db` | Cosmos DB database name | `VectorDB` |
| `--cosmos-container` | Cosmos DB container name | `Vectors` |
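The resolution order implied by this table (explicit flag, then environment variable, then built-in default) can be illustrated with a small, hypothetical helper. This is a sketch of the lookup pattern, not the repo's actual argument parser:

```java
// Hypothetical helper illustrating flag -> env var -> default resolution.
public class ArgResolution {
    static String argOrEnv(String[] args, String flag, String envVar, String defaultValue) {
        for (int i = 0; i < args.length - 1; i++) {
            if (flag.equals(args[i])) {
                return args[i + 1]; // the value follows the flag
            }
        }
        String fromEnv = (envVar == null) ? null : System.getenv(envVar);
        return (fromEnv != null) ? fromEnv : defaultValue;
    }

    public static void main(String[] args) {
        String dataSource = argOrEnv(args, "--data-source", null,
                "abfss://<container>@<account>.dfs.core.windows.net/agaricus_data.csv");
        String cosmosEndpoint = argOrEnv(args, "--cosmos-endpoint", "COSMOS_ENDPOINT", null);
        String cosmosDb = argOrEnv(args, "--cosmos-db", null, "VectorDB");
        System.out.printf("data=%s endpoint=%s db=%s%n", dataSource, cosmosEndpoint, cosmosDb);
    }
}
```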
To run the application locally:
```bash
# Build the application
mvn clean package

# Run with sample data and Cosmos DB connection
java -jar target/xgboost-rapids-aca-1.0-SNAPSHOT.jar \
  --data-source "/scripts/tmp_data/agaricus_data.csv" \
  --cosmos-endpoint "https://your-cosmos-account.documents.azure.com:443/" \
  --cosmos-key "your-cosmos-key"
```

You can retrieve your Cosmos DB connection details using the Azure CLI:
```bash
# Get Cosmos DB connection details
COSMOS_ENDPOINT=$(az cosmosdb show --name your-cosmos-account --resource-group your-resource-group --query "documentEndpoint" -o tsv)
COSMOS_KEY=$(az cosmosdb keys list --name your-cosmos-account --resource-group your-resource-group --query "primaryMasterKey" -o tsv)

echo "COSMOS_ENDPOINT=$COSMOS_ENDPOINT"
echo "COSMOS_KEY=$COSMOS_KEY"
```

The application supports both CPU and GPU modes for performance comparison:
```bash
# Run with GPU acceleration (default)
java -jar target/xgboost-rapids-aca-1.0-SNAPSHOT.jar \
  --data-source "/scripts/tmp_data/agaricus_data.csv" \
  --use-gpu true

# Run in CPU-only mode
java -jar target/xgboost-rapids-aca-1.0-SNAPSHOT.jar \
  --data-source "/scripts/tmp_data/agaricus_data.csv" \
  --use-gpu false
```

For detailed performance comparison instructions, see `comparing-gpu-performance.md`.
To build and run the Docker container locally:
```bash
# Build the Docker image
docker build -t xgboost-rapids .

# Run the container with GPU support
docker run --gpus all \
  -e COSMOS_ENDPOINT="your-cosmos-endpoint" \
  -e COSMOS_KEY="your-cosmos-key" \
  xgboost-rapids
```

This project supports running with Java 21, but due to changes in Java's module system, special JVM flags are needed.
Use the included run.sh script to run the application locally with Java 21:
```bash
# Make the script executable
chmod +x run.sh

# Run with default parameters (GPU mode)
./run.sh

# Run with a specific data source
./run.sh "/path/to/data.csv"

# Run in CPU-only mode
./run.sh "./scripts/tmp_data/agaricus_data.csv" "false"

# Run with specific parameters including Cosmos DB
./run.sh "./scripts/tmp_data/agaricus_data.csv" "true" \
  --cosmos-endpoint "https://your-cosmos-account.documents.azure.com:443/" \
  --cosmos-key "your-cosmos-key"
```

Alternatively, you can run directly with the JVM flags:
```bash
java --add-opens=java.base/sun.nio.ch=ALL-UNNAMED \
  --add-opens=java.base/java.nio=ALL-UNNAMED \
  --add-opens=java.base/java.util=ALL-UNNAMED \
  --add-opens=java.base/java.lang=ALL-UNNAMED \
  --add-opens=java.base/java.util.concurrent=ALL-UNNAMED \
  --add-opens=java.base/java.net=ALL-UNNAMED \
  --add-opens=java.base/java.lang.invoke=ALL-UNNAMED \
  --add-opens=java.base/java.lang.reflect=ALL-UNNAMED \
  -jar target/xgboost-rapids-aca-1.0-SNAPSHOT.jar \
  --data-source "./scripts/tmp_data/agaricus_data.csv" \
  --use-gpu "true"
```

This project comes with two POM files for different scenarios:
- `pom.xml` (default): Includes all dependencies for standalone execution
- `spark-cluster-pom.xml`: For Spark cluster environments where Spark is provided
To build with the Spark cluster POM:
```bash
mvn clean package -f spark-cluster-pom.xml
```

The project includes a deployment script that provisions all required Azure resources:
```bash
# Make the script executable
chmod +x scripts/deploy-to-azure.sh

# Run the deployment script
./scripts/deploy-to-azure.sh
```

This script will:
- Create an Azure Resource Group
- Provision an Azure Container Registry
- Create an Azure Cosmos DB account with a database and container
- Set up an Azure Storage account for data files
- Build and push the Docker image to ACR
- Create an Azure Container Apps environment with GPU support
- Deploy the application to Azure Container Apps
- Resource Group: Contains all resources for the application
- Azure Container Registry: Stores the Docker image
- Azure Cosmos DB: Stores vector embeddings with vector search capability
- Azure Storage: Stores input data files
- Azure Container Apps: Hosts the application with GPU support
The application expects the mushroom dataset in CSV format with a header row. The dataset includes categorical features that are encoded for the XGBoost model. Key columns include:
- `class` - Target variable (edible='e' or poisonous='p')
- `cap-shape` - Bell=b, conical=c, convex=x, flat=f, etc.
- `cap-surface` - Fibrous=f, grooves=g, scaly=y, smooth=s
- `cap-color` - Brown=n, buff=b, cinnamon=c, gray=g, etc.
- `bruises` - Bruises=t, no=f
- `odor` - Almond=a, anise=l, creosote=c, fishy=y, etc.
- And 17 more categorical attributes
The code performs one-hot encoding on these categorical features before training.
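As a rough illustration of that preprocessing step, here is a minimal Spark MLlib sketch. It covers only a subset of the columns and runs a local session for demonstration; the project's actual logic lives under `/src/main/java/com/azure/rapids/xgboost/` and may differ (label indexing is omitted here):

```java
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.OneHotEncoder;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class EncodeMushroomsSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("AgaricusEncodingSketch")
                .master("local[*]") // local run for illustration only
                .getOrCreate();

        // Read the CSV with its header row.
        Dataset<Row> raw = spark.read()
                .option("header", "true")
                .csv("./scripts/tmp_data/agaricus_data.csv");

        // Illustrative subset of the 22 categorical columns.
        String[] cats = {"cap-shape", "cap-surface", "odor"};
        String[] idxCols = {"cap-shape_idx", "cap-surface_idx", "odor_idx"};
        String[] vecCols = {"cap-shape_vec", "cap-surface_vec", "odor_vec"};

        // Map string categories to numeric indices, one-hot encode them,
        // then assemble everything into a single feature vector for XGBoost.
        StringIndexer indexer = new StringIndexer().setInputCols(cats).setOutputCols(idxCols);
        OneHotEncoder encoder = new OneHotEncoder().setInputCols(idxCols).setOutputCols(vecCols);
        VectorAssembler assembler = new VectorAssembler().setInputCols(vecCols).setOutputCol("features");

        Pipeline pipeline = new Pipeline().setStages(new PipelineStage[]{indexer, encoder, assembler});
        Dataset<Row> features = pipeline.fit(raw).transform(raw);
        features.select("features").show(5, false);
    }
}
```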
- `/src/main/java/com/azure/rapids/xgboost/`: Java source code
- `/scripts/`: Deployment and entrypoint scripts
- `/Dockerfile`: Container definition for building the application image
- `pom.xml`: Maven project configuration
The XGBoost model is configured with the following parameters:
params.put("eta", 0.1);
params.put("max_depth", 8);
params.put("objective", "binary:logistic");
params.put("num_round", 100);
params.put("tree_method", "gpu_hist"); // GPU-accelerated training
params.put("gpu_id", 0);
params.put("eval_metric", "auc");These parameters can be adjusted in the trainModel() method to suit your specific use case.
The application stores feature vectors in Cosmos DB, which can be used with Cosmos DB's vector search capabilities. The vectors are stored along with prediction results and metadata.
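For orientation, a minimal sketch of writing one such vector document with the azure-cosmos Java SDK might look like the following. The document shape (field names, partition key) is an assumption for illustration; see the repository's actual persistence code:

```java
import com.azure.cosmos.CosmosClient;
import com.azure.cosmos.CosmosClientBuilder;
import com.azure.cosmos.CosmosContainer;

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

public class VectorWriterSketch {
    public static void main(String[] args) {
        CosmosClient client = new CosmosClientBuilder()
                .endpoint(System.getenv("COSMOS_ENDPOINT"))
                .key(System.getenv("COSMOS_KEY"))
                .buildClient();

        CosmosContainer container = client
                .getDatabase("VectorDB")
                .getContainer("Vectors");

        // Hypothetical document shape: feature vector plus prediction metadata.
        Map<String, Object> doc = new HashMap<>();
        doc.put("id", UUID.randomUUID().toString());
        doc.put("embedding", List.of(0.0, 1.0, 0.0, 1.0)); // one-hot feature vector
        doc.put("prediction", "e");                        // predicted class label
        doc.put("probability", 0.97);                      // model confidence

        container.createItem(doc); // partition key is extracted from the item
        client.close();
    }
}
```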
To make it easy to compare CPU and GPU performance, use the included comparison script:
```bash
# Make the script executable
chmod +x compare-cpu-gpu.sh

# Run the comparison with default dataset
./compare-cpu-gpu.sh

# Run with a specific dataset and additional parameters
./compare-cpu-gpu.sh "./path/to/your/data.csv" --cosmos-endpoint "your-endpoint" --cosmos-key "your-key"
```

The script will:
- Run the application in CPU-only mode
- Run the application in GPU-accelerated mode
- Record execution times for both runs
- Extract detailed timing metrics for each processing phase
- Generate a comprehensive performance report in Markdown format
The script generates a detailed report in `./results/performance_report.md` with:
- Overall execution time comparison
- Phase-by-phase timing breakdown
- Speedup factors for each phase
- Accuracy comparison between CPU and GPU models
This project was inspired by the NVIDIA RAPIDS spark-rapids-examples, which demonstrate the use of XGBoost for mushroom classification. Our implementation is ported to Java, runs on Azure, and is extended to use Azure Cosmos DB for vector storage, Azure Storage for data files, and Azure Container Apps for GPU-backed hosting.
- The current implementation is optimized for binary classification problems
- For production use, consider implementing model evaluation metrics and cross-validation (a minimal evaluation sketch follows this list)
- For very large datasets, consider implementing incremental training or distributed processing
- The vector search capabilities in Cosmos DB should be configured through the Azure Portal
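As a starting point for the evaluation-metrics suggestion above, here is a minimal sketch using Spark MLlib's built-in evaluator. The prediction column names follow common XGBoost4J-Spark defaults and are assumptions:

```java
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class EvalSketch {
    // Computes area under the ROC curve for a predictions DataFrame
    // produced by the trained model.
    static double areaUnderRoc(Dataset<Row> predictions) {
        return new BinaryClassificationEvaluator()
                .setLabelCol("label")
                .setRawPredictionCol("rawPrediction")
                .setMetricName("areaUnderROC")
                .evaluate(predictions);
    }
}
```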
Contributions are welcome! Please feel free to submit a Pull Request.