Set of interfaces to run various AI models from the public MAD repository.
An AI Models automation and dashboarding command-line tool to run LLMs and Deep Learning models locally or remotely with CI.
The madengine library supports AI automation with the following features:
- AI Models run reliably on supported platforms and drive software quality
- Simple, minimalistic, out-of-the-box solution that enables confidence in the hardware and software stack
- Real-time, audience-relevant AI Models performance metrics tracking, presented in a clear, intuitive manner
- Best practices for handling internal projects and external open-source projects
madengine is meant to be used in conjunction with MAD. Below are the steps to set it up and run it using the command line interface (CLI).
git clone [email protected]:ROCm/MAD.git
cd MAD
# Create virtual environment if necessary
python3 -m venv venv
# Activate the virtual environment venv
source venv/bin/activate
# Clone madengine
git clone [email protected]:ROCm/madengine.git
# Change current working directory to madengine
cd madengine
# Install madengine from source:
pip install .
You can also install the madengine library directly from the GitHub repository.
pip install git+https://github.com/ROCm/madengine.git@main
How to run madengine CLI on your local machine.
(venv) test-node:~/MAD$ madengine --help
usage: madengine [-h] [-v] {run,discover,report,database} ...
A Model automation and dashboarding command-line tool to run LLMs and Deep Learning models locally.
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
Commands:
Available commands for running models, generating reports, and toolings.
{run,discover,report,database}
run Run models on container
discover Discover the models
report Generate report of models
database CRUD for database
Command to run LLMs and Deep Learning models in containers.
# An example CLI command to run a model
madengine run --tags pyt_huggingface_bert --live-output --additional-context "{'guest_os': 'UBUNTU'}"
(venv) test-node:~/MAD$ madengine run --help
usage: madengine run [-h] [--tags TAGS [TAGS ...]] [--timeout TIMEOUT] [--live-output] [--clean-docker-cache] [--additional-context-file ADDITIONAL_CONTEXT_FILE]
[--additional-context ADDITIONAL_CONTEXT] [--data-config-file-name DATA_CONFIG_FILE_NAME] [--tools-json-file-name TOOLS_JSON_FILE_NAME]
[--generate-sys-env-details GENERATE_SYS_ENV_DETAILS] [--force-mirror-local FORCE_MIRROR_LOCAL] [--keep-alive] [--keep-model-dir]
[--skip-model-run] [--disable-skip-gpu-arch] [-o OUTPUT]
Run LLMs and Deep Learning models on container
optional arguments:
-h, --help show this help message and exit
--tags TAGS [TAGS ...]
tags to run (can be multiple).
--timeout TIMEOUT time out for model run in seconds; Overrides per-model timeout if specified or default timeout of 7200 (2 hrs). Timeout of 0 will never
timeout.
--live-output prints output in real-time directly on STDOUT
--clean-docker-cache rebuild docker image without using cache
--additional-context-file ADDITIONAL_CONTEXT_FILE
additonal context, as json file, to filter behavior of workloads. Overrides detected contexts.
--additional-context ADDITIONAL_CONTEXT
additional context, as string representation of python dict, to filter behavior of workloads. Overrides detected contexts and additional-
context-file.
--data-config-file-name DATA_CONFIG_FILE_NAME
custom data configuration file.
--tools-json-file-name TOOLS_JSON_FILE_NAME
custom tools json configuration file.
--generate-sys-env-details GENERATE_SYS_ENV_DETAILS
generate system config env details by default
--force-mirror-local FORCE_MIRROR_LOCAL
Path to force all relevant dataproviders to mirror data locally on.
--keep-alive keep Docker container alive after run; will keep model directory after run
--keep-model-dir keep model directory after run
--skip-model-run skips running the model; will not keep model directory after run unless specified through keep-alive or keep-model-dir
--disable-skip-gpu-arch
disables skipping model based on gpu architecture
-o OUTPUT, --output OUTPUT
output file
For each model in models.json, the script
- builds the Docker image associated with the model. The image is named 'ci-$(model_name)' and is not removed after the script completes.
- starts the Docker container, named 'container_$(model_name)'. The container should automatically be stopped and removed whenever the script exits.
- clones the git 'url' and runs the 'script'
- compiles the final perf.csv and perf.html
With the tag functionality, the user can select a subset of the models to run: only models whose tags match the user-specified tags are selected. Tags are specified with the --tags argument. If multiple tags are specified, all models that match any tag are selected.
Each model name in models.json is automatically a tag that can be used to run that model, as shown below. Tags are also supported in comma-separated form as a Jenkins parameter.
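For example, since each model name is itself a tag, two models can be selected in one invocation:
# Select the union of two model-name tags
madengine run --tags pyt_huggingface_bert pyt_huggingface_gpt2 --live-output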
Use cases of running models with static and dynamic search: the --tags option supports searching for models in models.json, scripts/{model_dir}/models.json, and scripts/{model_dir}/get_models_json.py. A user can add new models not only to the models.json file of DLM but also to the model folder in Flexible. To do this, the user needs to follow these steps (a sketch of the resulting models.json entry follows the steps):
1. Update models.json: add the new model's configuration details to the models.json file. This includes specifying the model's name, version, and any other relevant metadata.
2. Place model files: copy the model files into the appropriate directory within the model folder in Flexible. Ensure that the folder structure and file naming conventions match the expected format.
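A minimal sketch of such a models.json entry is shown below. The field names echo those used elsewhere in this README (name, url, script, tags, timeout, cred), but the values and exact schema are illustrative assumptions; use an existing entry in MAD as the authoritative reference.
{
    "name": "my_new_model",
    "url": "https://github.com/example/my_new_model.git",
    "script": "run_my_new_model.sh",
    "tags": ["my_new_model"],
    "timeout": 7200,
    "cred": "AMD_GITHUB"
}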
# 1. run models in ~/MAD/models.json
(venv) test-node:~/MAD$ madengine run --tags dummy --live-output
# 2. run model in ~/MAD/scripts/dummy2/models.json
(venv) test-node:~/MAD$ madengine run --tags dummy2:dummy_2 --live-output
# 3. run model in ~/MAD/scripts/dummy3/get_models_json.py
(venv) test-node:~/MAD$ madengine run --tags dummy3:dummy_3 --live-output
# 4. run model with configurations
(venv) test-node:~/MAD$ madengine run --tags dummy2:dummy_2:batch_size=512:in=32:out=16 --live-output
# 5. run model with configurations
(venv) test-node:~/MAD$ madengine run --tags dummy3:dummy_3:batch_size=512:in=32:out=16 --live-output
The configurations batch_size=512, in=32, and out=16 will be passed to the container as environment variables and Docker build arguments.
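Inside the container, the model script can then read these values as ordinary environment variables. A hypothetical script excerpt (train.py and the fallback defaults are illustrative):
#!/bin/bash
# Tag-supplied configs such as batch_size, in, and out arrive as env vars
echo "batch_size=${batch_size:-256} in=${in:-16} out=${out:-16}"
python3 train.py --batch-size "${batch_size:-256}" --input-len "${in:-16}" --output-len "${out:-16}"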
The default timeout for a model run is 2 hrs. This can be overridden if the model in models.json contains a 'timeout': TIMEOUT entry. Both the default timeout and any timeout specified in models.json can be overridden using the --timeout TIMEOUT command-line argument. Setting TIMEOUT to 0 means that the model run will never time out.
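For example:
# Allow 1 hour instead of the default or per-model timeout
madengine run --tags dummy --timeout 3600 --live-output
# Disable the timeout entirely
madengine run --tags dummy --timeout 0 --live-output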
By default, madengine is silent; the output is piped into log files. By specifying --live-output, the output is printed in real-time to STDOUT.
Contexts are run-time parameters that change how the model is executed. Some contexts are auto-detected, and detected contexts may be overridden. Contexts are also used to filter the Dockerfile used for a model.
For more details, see How to provide contexts.
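For example, the context passed inline earlier can instead be supplied as a JSON file via --additional-context-file (the file name context.json is arbitrary):
# context.json
{"guest_os": "UBUNTU"}
# Pass the file instead of an inline dict
madengine run --tags dummy --additional-context-file context.json --live-output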
Credentials to clone model git URLs are provided in a centralized credential.json file. Models that require special credentials for cloning have a special cred field in the model definition in models.json. This field denotes the specific credential in credential.json to use. Public model repositories can skip the cred field.
There are several types of credentials supported (see the sketch of credential.json after this list).
- For HTTP/HTTPS git URLs, username and password should be provided in the credential. For Source Code Management (SCM) systems that support access tokens, the token can be substituted for the password field. The username and password will be passed as a Docker build argument and a container environment variable in the Docker build and run steps. For example, for a "cred": "AMD_GITHUB" field in models.json and an entry "AMD_GITHUB": { "username": "github_username", "password": "pass" } in credential.json, the following Docker build arguments and container environment variables will be added: AMD_GITHUB_USERNAME="github_username" and AMD_GITHUB_PASSWORD="pass".
- For SSH git URLs, username and ssh_key_file should be provided in the credential. The username is the SSH username, and ssh_key_file is the private SSH key that has been registered with the SCM system. Due to legal requirements, the credentials to access all models are not provided by default in DLM. Please contact the model owner if you wish to access and run the model.
- For NAS URLs, HOST, PORT, USERNAME, and PASSWORD should be provided in the credential. Please check the environment variables starting with NAS in [Environment Variables](https://github.com/ROCm/madengine/blob/main/README.md#environment-variables).
- For AWS S3 URLs, USERNAME and PASSWORD should be provided in the credential, with the variable name MAD_AWS_S3 as mentioned in [Environment Variables](https://github.com/ROCm/madengine/blob/main/README.md#environment-variables).
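Putting the four credential types together, a credential.json might look like the sketch below. The entry names (MY_SSH, MY_NAS) and values are illustrative assumptions; only the per-type fields follow the descriptions above.
{
    "AMD_GITHUB": {"username": "github_username", "password": "github_token"},
    "MY_SSH": {"username": "git", "ssh_key_file": "/home/user/.ssh/id_rsa"},
    "MY_NAS": {"HOST": "nas.example.com", "PORT": "22", "USERNAME": "nas_user", "PASSWORD": "nas_pass"},
    "MAD_AWS_S3": {"USERNAME": "aws_access_key_id", "PASSWORD": "aws_secret_access_key"}
}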
The DLM user may wish to run a model locally multiple times, with the input data downloaded once and reused subsequently. This functionality is only supported on models that support the Data Provider functionality, that is, models whose specification in models.json has a data field, which points to a data specification in data.json.
To use existing data on a local path, add a local field to the data specification in data.json. By default, this path is mounted read-only. To change this path to read-write, set the readwrite field to 'true' in the data configuration.
If no data exists in the local path, a local copy of the data can be downloaded by setting the mirrorlocal field in the data specification in data.json. Not all providers support mirrorlocal. For the ones that do support this feature, the remote data is mirrored on this host path during the first run. In subsequent runs, the data may be reused through synchronization mechanisms. If the user wishes to skip the remote synchronization, the same location can be set as a local data provider in data.json, with higher precedence, or as the only provider for the data, by locally editing data.json.
Alternatively, the command-line argument --force-mirror-local forces local mirroring on all workloads, to the provided FORCE_MIRROR_LOCAL path.
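As a rough illustration, a data.json entry combining a local provider and a mirrorlocal path might look like the sketch below; the nesting and the dataset name dummy_data are assumptions, so consult data.json in MAD for the actual schema.
{
    "dummy_data": {
        "local": {"path": "/data/dummy_data", "readwrite": "true"},
        "mirrorlocal": "/data/mirror/dummy_data"
    }
}
# Force local mirroring for all data providers of a run
madengine run --tags dummy_data_aws --force-mirror-local /data/mirror --live-output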
Commands for discovering models through models.json, scripts/{model_dir}/models.json, or scripts/{model_dir}/get_models_json.py
(venv) test-node:~/MAD$ madengine discover --help
usage: madengine discover [-h] [--tags TAGS [TAGS ...]]
Discover the models
optional arguments:
-h, --help show this help message and exit
--tags TAGS [TAGS ...]
tags to discover models (can be multiple).
Use cases about how to discover models:
# 1. discover all models in DLM
(venv) test-node:~/MAD$ madengine discover
# 2. discover specified model using tags in models.json of DLM
(venv) test-node:~/MAD$ madengine discover --tags dummy
# 3. discover specified model using tags in scripts/{model_dir}/models.json with static search i.e. models.json
(venv) test-node:~/MAD$ madengine discover --tags dummy2/dummy_2
# 4. discover specified model using tags in scripts/{model_dir}/get_models_json.py with dynamic search i.e. get_models_json.py
(venv) test-node:~/MAD$ madengine discover --tags dummy3/dummy_3
# 5. pass additional args to your model script from CLI
(venv) test-node:~/MAD$ madengine discover --tags dummy3/dummy_3:bs16
# 6. get multiple models using tags
(venv) test-node:~/MAD$ madengine discover --tags pyt_huggingface_bert pyt_huggingface_gpt2
Note: You cannot use a forward slash '/' or a colon ':' in a model name or a tag for a model in models.json or get_models_json.py.
Commands for generating reports.
(venv) test-node:~/MAD$ madengine report --help
usage: madengine report [-h] {update-perf,to-html,to-email} ...
optional arguments:
-h, --help show this help message and exit
Report Commands:
Available commands for generating reports.
{update-perf,to-html,to-email}
update-perf Update perf.csv to database
to-html Convert CSV to HTML report of models
to-email Convert CSV to Email of models
Update perf.csv to database
(venv) test-node:~/MAD$ madengine report update-perf --help
usage: madengine report update-perf [-h] [--single_result SINGLE_RESULT] [--exception-result EXCEPTION_RESULT] [--failed-result FAILED_RESULT]
[--multiple-results MULTIPLE_RESULTS] [--perf-csv PERF_CSV] [--model-name MODEL_NAME] [--common-info COMMON_INFO]
Update performance metrics of models perf.csv to database.
optional arguments:
-h, --help show this help message and exit
--single_result SINGLE_RESULT
path to the single result json
--exception-result EXCEPTION_RESULT
path to the single result json
--failed-result FAILED_RESULT
path to the single result json
--multiple-results MULTIPLE_RESULTS
path to the results csv
--perf-csv PERF_CSV
--model-name MODEL_NAME
--common-info COMMON_INFO
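A typical invocation might look like the following; perf.csv is the file produced by a run, and the model name is illustrative:
# Push a run's perf.csv into the database
madengine report update-perf --perf-csv perf.csv --model-name dummy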
Convert CSV to HTML report of models
(venv) test-node:~/MAD$ madengine report to-html --help
usage: madengine report to-html [-h] [--csv-file-path CSV_FILE_PATH]
Convert CSV to HTML report of models.
optional arguments:
-h, --help show this help message and exit
--csv-file-path CSV_FILE_PATH
Convert CSV to Email report of models
(venv) test-node:~/MAD$ madengine report to-email --help
usage: madengine report to-email [-h] [--csv-file-path CSV_FILE_PATH]
Convert CSV to Email of models.
optional arguments:
-h, --help show this help message and exit
--csv-file-path CSV_FILE_PATH
Path to the directory containing the CSV files.
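For example (the paths are illustrative; per the help above, to-email expects a directory containing the CSV files):
# Render perf results as an HTML report, then as an email report
madengine report to-html --csv-file-path perf.csv
madengine report to-email --csv-file-path .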
Commands for database, such as create and update table of DB.
(venv) test-node:~/MAD$ madengine database --help
usage: madengine database [-h] {create-table,update-table,upload-mongodb} ...
optional arguments:
-h, --help show this help message and exit
Database Commands:
Available commands for database, such as creating and updating table in DB.
{create-table,update-table,upload-mongodb}
create-table Create table in DB
update-table Update table in DB
upload-mongodb Update table in DB
(venv) test-node:~/MAD$ madengine database create-table --help
usage: madengine database create-table [-h] [-v]
Create table in DB.
optional arguments:
-h, --help show this help message and exit
-v, --verbose verbose output
(venv) test-node:~/MAD$ madengine database update-table --help
usage: madengine database update-table [-h] [--csv-file-path CSV_FILE_PATH] [--model-json-path MODEL_JSON_PATH]
Update table in DB.
optional arguments:
-h, --help show this help message and exit
--csv-file-path CSV_FILE_PATH
Path to the csv file
--model-json-path MODEL_JSON_PATH
Path to the model json file
(venv) test-node:~/MAD$ madengine database upload-mongodb --help
usage: madengine database upload-mongodb [-h] [--type TYPE] [--file-path FILE_PATH] [--name NAME]
Update table in DB.
optional arguments:
-h, --help show this help message and exit
--type TYPE type of document to upload: job or run
--file-path FILE_PATH
total path to directory where perf_entry.csv, *env.csv, and *.log are stored
--name NAME name of model to upload
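For example, to upload the artifacts of a finished run (the directory and model name are illustrative):
# Upload a run directory containing perf_entry.csv, *env.csv, and *.log
madengine database upload-mongodb --type run --file-path ./run_directory --name dummy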
Some tools are distributed together with madengine. They work with the madengine CLI to profile GPUs and get traces of ROCm libraries.
Profile GPU usage of running LLMs and Deep Learning models.
(venv) test-node:~/MAD$ madengine run --tags pyt_huggingface_bert --additional-context "{'guest_os': 'UBUNTU','tools': [{'name':'rocprof'}]}"
Trace library usage of running LLMs and Deep Learning models. A demo of running a model while tracing rocBLAS:
(venv) test-node:~/MAD$ madengine run --tags pyt_huggingface_bert --additional-context "{'guest_os': 'UBUNTU','tools': [{'name':'rocblas_trace'}]}"
Madengine also exposes environment variables that allow setting the models location or configuring data loading at DLM/MAD runtime.
| Field | Description |
|---|---|
| MODEL_DIR | the location of the models dir |
| PUBLIC_GITHUB_ROCM_KEY | username and token of GitHub |
| MAD_AWS_S3 | the username and password of AWS S3 |
| NAS_NODES | the list of credentials of NAS nodes |
Examples for running models using environment variables.
# Apply AWS S3
MAD_AWS_S3='{"USERNAME":"username","PASSWORD":"password"}' madengine run --tags dummy_data_aws --live-output
# Apply customized NAS
NAS_NODES='[{"HOST":"hostname","PORT":"22","USERNAME":"username","PASSWORD":"password"}]' madengine run --tags dummy_data_austin_nas --live-output
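MODEL_DIR can be supplied the same way; the path is illustrative:
# Point madengine at a custom models directory
MODEL_DIR=~/my_models madengine run --tags dummy --live-output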
Run pytest to validate unit tests of MAD Engine.
pytest -v -s