This project demonstrates how to run GPU-accelerated XGBoost machine learning models using NVIDIA RAPIDS on Azure Container Apps. The implementation uses Apache Spark with RAPIDS acceleration to process the Agaricus mushroom dataset, train an XGBoost classification model to identify edible vs. poisonous mushrooms, and store vector embeddings in Azure Cosmos DB.
Shortlink: https://aka.ms/sparkrapidsgpudemo
The application:
- Leverages NVIDIA RAPIDS for GPU-accelerated data processing
- Trains an XGBoost model with GPU acceleration on the Agaricus mushroom dataset
- Runs in Azure Container Apps with GPU support
- Stores vector embeddings in Azure Cosmos DB for vector search capabilities
- Reads and writes data from Azure Storage
This implementation uses the Agaricus Mushroom dataset from the UCI Machine Learning Repository. The dataset includes descriptions of hypothetical samples of 23 species of gilled mushrooms in the Agaricus and Lepiota family. Each sample is classified as:
- Edible (e)
- Poisonous (p)
The dataset contains 8,124 instances with 22 categorical attributes, such as cap shape, odor, and gill size.
- Azure Subscription
- Azure CLI
- Docker
- Maven
- Java 11 JDK
- NVIDIA CUDA drivers (for local development)
- Clone this repository:

  ```bash
  git clone <repository-url>
  cd bbenz-azure-aca-rapids
  ```

- Install the required dependencies:

  ```bash
  mvn clean install
  ```

- If you want to run the application locally with GPU support, ensure you have an NVIDIA GPU with CUDA drivers installed.
The application uses the following configuration parameters:
| Parameter | Description | Default |
|---|---|---|
| `--data-source` | Path to the input data file (ABFS path) | `abfss://<container>@<account>.dfs.core.windows.net/agaricus_data.csv` |
| `--cosmos-endpoint` | Azure Cosmos DB endpoint URL | From environment variable `COSMOS_ENDPOINT` |
| `--cosmos-key` | Azure Cosmos DB access key | From environment variable `COSMOS_KEY` |
| `--cosmos-db` | Cosmos DB database name | `VectorDB` |
| `--cosmos-container` | Cosmos DB container name | `Vectors` |
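The resolution order implied by this table (explicit flag, then environment variable, then built-in default) can be illustrated with a small, hypothetical helper. This is a sketch of the lookup pattern, not the repo's actual argument parser:

```java
// Hypothetical helper illustrating flag -> env var -> default resolution.
public class ArgResolution {
    static String argOrEnv(String[] args, String flag, String envVar, String defaultValue) {
        for (int i = 0; i < args.length - 1; i++) {
            if (flag.equals(args[i])) {
                return args[i + 1]; // the value follows the flag
            }
        }
        String fromEnv = (envVar == null) ? null : System.getenv(envVar);
        return (fromEnv != null) ? fromEnv : defaultValue;
    }

    public static void main(String[] args) {
        String dataSource = argOrEnv(args, "--data-source", null,
                "abfss://<container>@<account>.dfs.core.windows.net/agaricus_data.csv");
        String cosmosEndpoint = argOrEnv(args, "--cosmos-endpoint", "COSMOS_ENDPOINT", null);
        String cosmosDb = argOrEnv(args, "--cosmos-db", null, "VectorDB");
        System.out.printf("data=%s endpoint=%s db=%s%n", dataSource, cosmosEndpoint, cosmosDb);
    }
}
```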
To run the application locally:
```bash
# Build the application
mvn clean package

# Run with sample data and Cosmos DB connection
java -jar target/xgboost-rapids-aca-1.0-SNAPSHOT.jar \
  --data-source "/scripts/tmp_data/agaricus_data.csv" \
  --cosmos-endpoint "https://your-cosmos-account.documents.azure.com:443/" \
  --cosmos-key "your-cosmos-key"
```

You can retrieve your Cosmos DB connection details using the Azure CLI:
```bash
# Get Cosmos DB connection details
COSMOS_ENDPOINT=$(az cosmosdb show --name your-cosmos-account --resource-group your-resource-group --query "documentEndpoint" -o tsv)
COSMOS_KEY=$(az cosmosdb keys list --name your-cosmos-account --resource-group your-resource-group --query "primaryMasterKey" -o tsv)

echo "COSMOS_ENDPOINT=$COSMOS_ENDPOINT"
echo "COSMOS_KEY=$COSMOS_KEY"
```

The application supports both CPU and GPU modes for performance comparison:
```bash
# Run with GPU acceleration (default)
java -jar target/xgboost-rapids-aca-1.0-SNAPSHOT.jar \
  --data-source "/scripts/tmp_data/agaricus_data.csv" \
  --use-gpu true

# Run in CPU-only mode
java -jar target/xgboost-rapids-aca-1.0-SNAPSHOT.jar \
  --data-source "/scripts/tmp_data/agaricus_data.csv" \
  --use-gpu false
```

For detailed performance comparison instructions, see `comparing-gpu-performance.md`.
To build and run the Docker container locally:
```bash
# Build the Docker image
docker build -t xgboost-rapids .

# Run the container with GPU support
docker run --gpus all \
  -e COSMOS_ENDPOINT="your-cosmos-endpoint" \
  -e COSMOS_KEY="your-cosmos-key" \
  xgboost-rapids
```

This project supports running with Java 21, but due to changes in Java's module system, special JVM flags are needed.
Use the included run.sh script to run the application locally with Java 21:
```bash
# Make the script executable
chmod +x run.sh

# Run with default parameters (GPU mode)
./run.sh

# Run with a specific data source
./run.sh "/path/to/data.csv"

# Run in CPU-only mode
./run.sh "./scripts/tmp_data/agaricus_data.csv" "false"

# Run with specific parameters including Cosmos DB
./run.sh "./scripts/tmp_data/agaricus_data.csv" "true" \
  --cosmos-endpoint "https://your-cosmos-account.documents.azure.com:443/" \
  --cosmos-key "your-cosmos-key"
```

Alternatively, you can run directly with the JVM flags:
```bash
java --add-opens=java.base/sun.nio.ch=ALL-UNNAMED \
  --add-opens=java.base/java.nio=ALL-UNNAMED \
  --add-opens=java.base/java.util=ALL-UNNAMED \
  --add-opens=java.base/java.lang=ALL-UNNAMED \
  --add-opens=java.base/java.util.concurrent=ALL-UNNAMED \
  --add-opens=java.base/java.net=ALL-UNNAMED \
  --add-opens=java.base/java.lang.invoke=ALL-UNNAMED \
  --add-opens=java.base/java.lang.reflect=ALL-UNNAMED \
  -jar target/xgboost-rapids-aca-1.0-SNAPSHOT.jar \
  --data-source "./scripts/tmp_data/agaricus_data.csv" \
  --use-gpu "true"
```

This project comes with two POM files for different scenarios:
- `pom.xml` (default): Includes all dependencies for standalone execution
- `spark-cluster-pom.xml`: For Spark cluster environments where Spark is provided
To build with the Spark cluster POM:
```bash
mvn clean package -f spark-cluster-pom.xml
```

The project includes a deployment script that provisions all required Azure resources:
```bash
# Make the script executable
chmod +x scripts/deploy-to-azure.sh

# Run the deployment script
./scripts/deploy-to-azure.sh
```

This script will:
- Create an Azure Resource Group
- Provision an Azure Container Registry
- Create an Azure Cosmos DB account with a database and container
- Set up an Azure Storage account for data files
- Build and push the Docker image to ACR
- Create an Azure Container Apps environment with GPU support
- Deploy the application to Azure Container Apps
- Resource Group: Contains all resources for the application
- Azure Container Registry: Stores the Docker image
- Azure Cosmos DB: Stores vector embeddings with vector search capability
- Azure Storage: Stores input data files
- Azure Container Apps: Hosts the application with GPU support
The application expects the mushroom dataset in CSV format with a header row. The dataset includes categorical features that are encoded for the XGBoost model. Key columns include:
- `class` - Target variable (edible='e' or poisonous='p')
- `cap-shape` - Bell=b, conical=c, convex=x, flat=f, etc.
- `cap-surface` - Fibrous=f, grooves=g, scaly=y, smooth=s
- `cap-color` - Brown=n, buff=b, cinnamon=c, gray=g, etc.
- `bruises` - Bruises=t, no=f
- `odor` - Almond=a, anise=l, creosote=c, fishy=y, etc.
- And 17 more categorical attributes
The code performs one-hot encoding on these categorical features before training.
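As a rough illustration of that preprocessing step, here is a minimal Spark MLlib sketch. It covers only a subset of the columns and runs a local session for demonstration; the project's actual logic lives under `/src/main/java/com/azure/rapids/xgboost/` and may differ (label indexing is omitted here):

```java
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.OneHotEncoder;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class EncodeMushroomsSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("AgaricusEncodingSketch")
                .master("local[*]") // local run for illustration only
                .getOrCreate();

        // Read the CSV with its header row.
        Dataset<Row> raw = spark.read()
                .option("header", "true")
                .csv("./scripts/tmp_data/agaricus_data.csv");

        // Illustrative subset of the 22 categorical columns.
        String[] cats = {"cap-shape", "cap-surface", "odor"};
        String[] idxCols = {"cap-shape_idx", "cap-surface_idx", "odor_idx"};
        String[] vecCols = {"cap-shape_vec", "cap-surface_vec", "odor_vec"};

        // Map string categories to numeric indices, one-hot encode them,
        // then assemble everything into a single feature vector for XGBoost.
        StringIndexer indexer = new StringIndexer().setInputCols(cats).setOutputCols(idxCols);
        OneHotEncoder encoder = new OneHotEncoder().setInputCols(idxCols).setOutputCols(vecCols);
        VectorAssembler assembler = new VectorAssembler().setInputCols(vecCols).setOutputCol("features");

        Pipeline pipeline = new Pipeline().setStages(new PipelineStage[]{indexer, encoder, assembler});
        Dataset<Row> features = pipeline.fit(raw).transform(raw);
        features.select("features").show(5, false);
    }
}
```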
- `/src/main/java/com/azure/rapids/xgboost/`: Java source code
- `/scripts/`: Deployment and entrypoint scripts
- `/Dockerfile`: Container definition for building the application image
- `pom.xml`: Maven project configuration
The XGBoost model is configured with the following parameters:
params.put("eta", 0.1);
params.put("max_depth", 8);
params.put("objective", "binary:logistic");
params.put("num_round", 100);
params.put("tree_method", "gpu_hist"); // GPU-accelerated training
params.put("gpu_id", 0);
params.put("eval_metric", "auc");These parameters can be adjusted in the trainModel() method to suit your specific use case.
The application stores feature vectors in Cosmos DB, which can be used with Cosmos DB's vector search capabilities. The vectors are stored along with prediction results and metadata.
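For orientation, a minimal sketch of writing one such vector document with the azure-cosmos Java SDK might look like the following. The document shape (field names, partition key) is an assumption for illustration; see the repository's actual persistence code:

```java
import com.azure.cosmos.CosmosClient;
import com.azure.cosmos.CosmosClientBuilder;
import com.azure.cosmos.CosmosContainer;

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

public class VectorWriterSketch {
    public static void main(String[] args) {
        CosmosClient client = new CosmosClientBuilder()
                .endpoint(System.getenv("COSMOS_ENDPOINT"))
                .key(System.getenv("COSMOS_KEY"))
                .buildClient();

        CosmosContainer container = client
                .getDatabase("VectorDB")
                .getContainer("Vectors");

        // Hypothetical document shape: feature vector plus prediction metadata.
        Map<String, Object> doc = new HashMap<>();
        doc.put("id", UUID.randomUUID().toString());
        doc.put("embedding", List.of(0.0, 1.0, 0.0, 1.0)); // one-hot feature vector
        doc.put("prediction", "e");                        // predicted class label
        doc.put("probability", 0.97);                      // model confidence

        container.createItem(doc); // partition key is extracted from the item
        client.close();
    }
}
```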
To make it easy to compare CPU and GPU performance, use the included comparison script:
```bash
# Make the script executable
chmod +x compare-cpu-gpu.sh

# Run the comparison with default dataset
./compare-cpu-gpu.sh

# Run with a specific dataset and additional parameters
./compare-cpu-gpu.sh "./path/to/your/data.csv" --cosmos-endpoint "your-endpoint" --cosmos-key "your-key"
```

The script will:
- Run the application in CPU-only mode
- Run the application in GPU-accelerated mode
- Record execution times for both runs
- Extract detailed timing metrics for each processing phase
- Generate a comprehensive performance report in Markdown format
The script generates a detailed report in `./results/performance_report.md` with:
- Overall execution time comparison
- Phase-by-phase timing breakdown
- Speedup factors for each phase
- Accuracy comparison between CPU and GPU models
This project was inspired by the NVIDIA RAPIDS spark-rapids-examples, which demonstrate the use of XGBoost for mushroom classification. Our implementation is ported to Java, runs on Azure, and is extended to use Azure Cosmos DB for vector storage, Azure Storage for data files, and Azure Container Apps for GPU-backed hosting.
- The current implementation is optimized for binary classification problems
- For production use, consider implementing model evaluation metrics and cross-validation (a minimal evaluation sketch follows this list)
- For very large datasets, consider implementing incremental training or distributed processing
- The vector search capabilities in Cosmos DB should be configured through the Azure Portal
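As a starting point for the evaluation-metrics suggestion above, here is a minimal sketch using Spark MLlib's built-in evaluator. The prediction column names follow common XGBoost4J-Spark defaults and are assumptions:

```java
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class EvalSketch {
    // Computes area under the ROC curve for a predictions DataFrame
    // produced by the trained model.
    static double areaUnderRoc(Dataset<Row> predictions) {
        return new BinaryClassificationEvaluator()
                .setLabelCol("label")
                .setRawPredictionCol("rawPrediction")
                .setMetricName("areaUnderROC")
                .evaluate(predictions);
    }
}
```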
Contributions are welcome! Please feel free to submit a Pull Request.