- Overview
- Prerequisites
- Deployment Steps
- Deployment Validation
- Data Preparation
- Running the Guidance
- Next Steps
- Cleanup
- Notices
This project provides the complete infrastructure for training and deploying a machine learning model that predicts problematic gambling behavior. It creates an Amazon SageMaker Domain with custom JupyterLab environments and the AWS resources required for ML workloads. It is a Maven-based project, so you can open it in any Maven-compatible Java IDE to build and run it.
The betting and gaming industry requires accurate detection of problematic play signals to protect players and maintain regulatory compliance. Late identification of at-risk players damages both player welfare and industry reputation. Traditional monitoring methods detect issues after harm occurs, leading to regulatory penalties and erosion of public trust.
AWS's AI/ML services enable automated, data-driven detection of at-risk players. This solution processes player behavioral and financial data to identify concerning patterns through a machine learning model deployed on Amazon SageMaker. The model analyzes the following key metrics:
| Metric | Description |
|---|---|
| NUM_BETTING_DAYS | Number of distinct days on which the user placed bets |
| AVG_BETS_PER_DAY | Average number of bets placed per day |
| AVG_TOTAL_STAKE_PER_DAY | Average total amount staked per day |
| AVG_TOTAL_PAYOUT_PER_DAY | Average total payout received per day |
| AVG_TOTAL_NET_POSITION_PER_DAY | Average daily net position (winnings minus losses) |
| AVG_STAKE_PER_BET | Average amount staked per individual bet |
| AVG_PRICE_PER_BET | Average odds (price) of bets placed |
| TOTAL_LATE_BETS_0004 | Total number of bets placed between midnight and 4 AM |
| TOTAL_LATE_BETS_2024 | Total number of bets placed between 8 PM and midnight |
| TOTAL_WINS | Total number of winning bets |
| TOTAL_LOSSES | Total number of losing bets |
| NET_ROI | Net Return on Investment (total net position divided by total amount staked) |
| MAX_STAKE_PER_DAY | Maximum amount staked in a single day |
| MIN_STAKE_PER_DAY | Minimum amount staked in a single day |
| STDDEV_STAKE_PER_DAY | Standard deviation of daily stake amounts |
| MAX_PAYOUT_PER_DAY | Maximum payout received in a single day |
| MIN_PAYOUT_PER_DAY | Minimum payout received in a single day |
| STDDEV_PAYOUT_PER_DAY | Standard deviation of daily payout amounts |
| MAX_NET_POSITION_PER_DAY | Maximum net position (profit or loss) in a single day |
| MIN_NET_POSITION_PER_DAY | Minimum net position (profit or loss) in a single day |
| STDDEV_NET_POSITION_PER_DAY | Standard deviation of daily net positions |
| WIN_RATIO | Ratio of winning bets to total bets placed |
The solution generates real-time risk assessments, allowing compliance teams to intervene immediately when the model identifies concerning patterns. This automation augments existing responsible gaming controls and strengthens regulatory compliance frameworks.
The provided data schema captures essential player behavior indicators. Organizations can extend these parameters based on their specific requirements and available data to optimize model performance.
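To make the derivation of these metrics concrete, the following is a minimal pandas sketch, not the project's analyze_betting_stats.py script; the input file and column names (player_id, bet_date, stake, payout) are assumptions for illustration only.

```python
# Illustrative sketch (not the project's analyze_betting_stats.py): deriving a few of the
# per-player metrics above from individual bet records. Column names are assumptions.
import pandas as pd

bets = pd.read_csv("synthetic_output")  # hypothetical file of individual bets

# Aggregate bets to one row per player per day
daily = (bets.groupby(["player_id", "bet_date"])
             .agg(bets_per_day=("stake", "size"), stake_per_day=("stake", "sum"))
             .reset_index())

# Roll the daily rows up to one feature row per player
features = daily.groupby("player_id").agg(
    NUM_BETTING_DAYS=("bet_date", "nunique"),
    AVG_BETS_PER_DAY=("bets_per_day", "mean"),
    AVG_TOTAL_STAKE_PER_DAY=("stake_per_day", "mean"),
    STDDEV_STAKE_PER_DAY=("stake_per_day", "std"),
)

# NET_ROI: total net position (payout minus stake) divided by the total amount staked
totals = bets.groupby("player_id").agg(total_stake=("stake", "sum"),
                                       total_payout=("payout", "sum"))
features["NET_ROI"] = (totals["total_payout"] - totals["total_stake"]) / totals["total_stake"]
```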
The reference architecture of the guidance is shown below:
The platform includes:
- SageMaker Domain with custom configurations
- Dedicated VPC for ML workloads
- S3 bucket for data storage
- Custom IAM roles and permissions
- JupyterLab environment with lifecycle configurations
The main components are:
- Amazon SageMaker Domain: Configured with custom JupyterLab settings
- Jupyter Notebook: Covers the feature engineering, training, and inference steps
- VPC Configuration: Dedicated VPC with public subnets
- Storage: S3 bucket for storing ML datasets and notebooks
- Security: Custom IAM roles with specific permissions for SageMaker execution
- Environment: Custom lifecycle configurations for JupyterLab startup
You are responsible for the cost of the AWS services used while running this Guidance. As of April 2024, the cost for running this Guidance with the default settings in the US East (N. Virginia) Region is approximately $10.50 per month for training the model.
We recommend creating a Budget through AWS Cost Explorer to help manage costs. Prices are subject to change. For full details, refer to the pricing webpage for each AWS service used in this Guidance.
The following table provides a sample cost breakdown for deploying this Guidance with the default parameters in the US East (N. Virginia) Region for one month.
| AWS service | Dimensions | Cost [USD] |
|---|---|---|
| Amazon S3 | 1 GB per month, 1,000 PUT requests and 1,000 GET requests per month | $ 0.03 |
| Amazon SageMaker | 1 user, 8 hours per day and 10 training jobs per month | $ 10.46 |
| Amazon SageMaker | Serverless Inference, 10,000 requests with 100 ms duration per request | $ 0.02 |
| Amazon SageMaker | Real-time Inference, 1 model deployed, 1 model per endpoint, 1 instance per endpoint, 24 hours per day, 30 days per month (ml.c4.2xlarge) | $ 344.16 |
Remember to delete any deployed SageMaker endpoints when not in use to avoid unnecessary charges.
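If you prefer to do this programmatically, the following is a minimal boto3 sketch that lists the SageMaker endpoints in your account and deletes one that is no longer needed; the endpoint and endpoint configuration names are placeholders for the ones created from the notebook.

```python
# Minimal sketch: list SageMaker endpoints and delete one that is no longer needed,
# along with its endpoint configuration. Names below are placeholders.
import boto3

sm = boto3.client("sagemaker")

for endpoint in sm.list_endpoints()["Endpoints"]:
    print(endpoint["EndpointName"], endpoint["EndpointStatus"])

sm.delete_endpoint(EndpointName="<your-endpoint-name>")                    # stops real-time charges
sm.delete_endpoint_config(EndpointConfigName="<your-endpoint-config-name>")
```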
With the exception of the Python data preparation, these deployment instructions are OS-agnostic because this is an AWS Cloud Development Kit (AWS CDK) Java-based project. Once you have installed the required tools, you should be able to deploy the project.
This is an AWS CDK project, so the following tools are needed to build it and deploy the resources:
- Java 17 or later
- Apache Maven 3.8.6 or later
- AWS CDK 2.177 or later
- npm 11.3.0 or later
- AWS CLI 2.23.9 or later
- Python 3.7 or later
- pip (Python package installer)
Ensure Python is installed and added to your system PATH:
python --version
This Guidance uses the AWS CDK. If you are using the AWS CDK for the first time, perform the bootstrapping process described in the AWS CDK Bootstrapping documentation.
- Clone the repository
git clone <github_repo_url>
- cd to the repo folder
cd <repo-name>/deployment
- Install project dependencies:
mvn clean package
- Get the synthesized CloudFormation template
cdk synth
- Deploy the stack to your default AWS account and region
cdk deploy
- Create a Python virtual environment and install Python dependencies:
cd ../source
python -m venv myenv        # create the Python virtual environment
source myenv/bin/activate   # activate the virtual environment
pip install -r requirements.txt
- Other useful commands:
cdk ls     # list all stacks in the app
cdk diff   # compare the deployed stack with the current state
- Open the CloudFormation console and verify the status of the stack with the name starting with xxxxxx.
- If the deployment is successful, you should see an active SageMaker Domain in the SageMaker AI console.
- Run the following CLI command to validate the deployment:
aws cloudformation describe-stacks --stack-name <stack-name>
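The same check can also be scripted. Below is a minimal boto3 sketch, assuming <stack-name> is replaced with the name of the deployed stack; it prints the stack status and its outputs.

```python
# Minimal sketch: confirm the CloudFormation stack deployed successfully and list its outputs.
# "<stack-name>" is a placeholder for the stack deployed with `cdk deploy`.
import boto3

cfn = boto3.client("cloudformation")
stack = cfn.describe_stacks(StackName="<stack-name>")["Stacks"][0]

print(stack["StackStatus"])                 # expect CREATE_COMPLETE or UPDATE_COMPLETE
for output in stack.get("Outputs", []):
    print(output["OutputKey"], "=", output["OutputValue"])
```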
Before training the model, you'll need to prepare the training data through the following steps:
Download the following datasets from The Transparency Project study "Behavioral Characteristics of Internet Gamblers Who Trigger Corporate Responsible Gambling Interventions":
- Raw Dataset 1 - Demographics
- Raw Dataset 2 - Daily Aggregates
Save both files to the source/ directory.
Create synthetic individual bet data from the daily aggregates using the generate_bets.py script:
cd source
python generate_bets.py raw_dataset_II_daily_aggregates synthetic_output
Combine the demographic data, daily aggregates, and synthetic wagers to create the final training dataset:
python analyze_betting_stats.py raw_dataset_II raw_dataset_I_demographics_data synthetic_output betting_statistics
- synthetic_output is the previously generated file
- betting_statistics is the name of the output file. If you choose a different name, you must update the notebook cell that reads this data.
Upload the generated training dataset to your S3 bucket:
aws s3 cp betting_statistics s3://your-bucket-name/training_data/
Note: Replace your-bucket-name with the name of the S3 bucket where the model training data will be stored. The bucket has already been created by the CDK code; you can get its name from the AWS console.
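If you prefer to script the upload from Python instead of the CLI, the following is a minimal boto3 sketch; the bucket name is a placeholder you must replace with the bucket created by the CDK stack.

```python
# Minimal sketch: upload the generated training dataset with boto3 instead of the AWS CLI.
# "<your-bucket-name>" is a placeholder for the bucket created by the CDK stack.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="betting_statistics",
    Bucket="<your-bucket-name>",
    Key="training_data/betting_statistics",
)
```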
Once the CDK code has been deployed to AWS, navigate to the SageMaker AI console. Select the responsible gaming domain and open the Jupyter notebook already deployed there. The notebook includes step-by-step instructions for training the model and deploying different inference endpoints (serverless or provisioned) so you can test it with your own test data.
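Once an endpoint has been deployed from the notebook, you can also invoke it outside the notebook. The snippet below is an illustrative boto3 sketch only: the endpoint name, the feature order, and the CSV content type are assumptions, and the notebook defines the exact payload format the model expects.

```python
# Illustrative sketch: invoke a deployed inference endpoint with one row of player metrics.
# The endpoint name, feature order, and content type are assumptions; follow the notebook
# for the exact payload format used during training.
import boto3

runtime = boto3.client("sagemaker-runtime")

# One comma-separated row of feature values in the order used during training (example values only)
payload = "12,8.5,250.0,230.0,-20.0,29.4,2.1,3,40,55,47,-0.08,600.0,20.0,110.0,580.0,0.0,95.0,150.0,-300.0,120.0,0.54"

response = runtime.invoke_endpoint(
    EndpointName="<your-endpoint-name>",
    ContentType="text/csv",
    Body=payload,
)
print(response["Body"].read().decode())   # the model's risk prediction
```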
Further actions you can take to extend this Guidance:
- You can use your own data to train the model
- Build a CI/CD pipeline to train, monitor, and fine-tune the model
- Build batch or real-time inference endpoints to invoke the model according to your specific use case
To clean up the created resources, perform the following steps:
- Empty the created S3 buckets and delete them manually
- Delete the EFS file system attached to the SageMaker AI domain
- Delete the SageMaker AI domain
- Delete the CloudFormation stack
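For the S3 and CloudFormation steps, a minimal boto3 sketch is shown below; the bucket and stack names are placeholders, and the SageMaker AI domain with its EFS file system should still be removed as described in the list above.

```python
# Minimal sketch of the S3 and CloudFormation cleanup steps; names are placeholders.
import boto3

# Empty and delete the data bucket created by the stack
bucket = boto3.resource("s3").Bucket("<your-bucket-name>")
bucket.objects.all().delete()
bucket.object_versions.delete()   # only needed if versioning is enabled
bucket.delete()

# Delete the CloudFormation stack (equivalent to running `cdk destroy` from the deployment folder)
boto3.client("cloudformation").delete_stack(StackName="<stack-name>")
```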
Customers are responsible for making their own independent assessment of the information in this Guidance. This Guidance: (a) is for informational purposes only, (b) represents AWS current product offerings and practices, which are subject to change without notice, and (c) does not create any commitments or assurances from AWS and its affiliates, suppliers or licensors. AWS products or services are provided “as is” without warranties, representations, or conditions of any kind, whether express or implied. AWS responsibilities and liabilities to its customers are controlled by AWS agreements, and this Guidance is not part of, nor does it modify, any agreement between AWS and its customers.
The dataset analyzed in this Guidance was obtained from The Transparency Project, a public data repository for privately funded datasets, such as industry-funded data, related to addictive behavior. The Division on Addiction at Harvard Medical School created this repository to promote transparency for privately funded science and better access to scientific information. The data originates from the study "Behavioral Characteristics of Internet Gamblers Who Trigger Corporate Responsible Gambling Interventions" and is publicly available at: http://www.thetransparencyproject.org/index.html
