Overview

ModelMesh is a mature, general-purpose model serving management/routing layer designed for high-scale, high-density, and frequently-changing model use cases. It works with existing or custom-built model servers and acts as a distributed LRU cache for serving runtime models.

For full Kubernetes-based deployment and management of ModelMesh clusters and models, see the ModelMesh Serving repo. That repo includes a separate controller and provides Kubernetes custom-resource-based management of ServingRuntimes and InferenceServices, along with common, abstracted handling of model repository storage and ready-to-use integrations with some existing OSS model servers.

For more information on supported features and design details, see these charts.

What is a model?

In ModelMesh, a model is an abstraction of a machine-learning model; ModelMesh itself is not aware of the underlying model format. There are two model types: regular models and vmodels. Regular models are assumed, and required, to be immutable. VModels add a layer of indirection in front of the immutable models, as illustrated in the sketch below. See the VModels Reference for further reading.
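To make the indirection concrete, here is a purely conceptual sketch in Python (plain dictionaries, not the ModelMesh API): regular model ids are never re-pointed once registered, while a vmodel is a mutable alias that can be moved to a newer immutable model without changing callers.

```python
# Conceptual sketch only (not the ModelMesh API): how vmodel indirection works.
immutable_models = {
    "sentiment-v1": "s3://models/sentiment/v1",  # regular models never change
    "sentiment-v2": "s3://models/sentiment/v2",
}

vmodels = {"sentiment": "sentiment-v1"}  # vmodel id -> current target model id

def resolve(vmodel_id: str) -> str:
    # Requests addressed to a vmodel are routed to its current target model.
    return vmodels[vmodel_id]

# Roll out a new version by re-pointing the vmodel; callers keep using "sentiment".
vmodels["sentiment"] = "sentiment-v2"
assert resolve("sentiment") == "sentiment-v2"
```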

Implement a model runtime

  1. Wrap your model-loading and invocation logic in this model-runtime.proto gRPC service interface (a sketch of such a servicer appears after this list).
    • runtimeStatus() - called only during startup to obtain some basic configuration parameters from the runtime, such as version, capacity, and model-loading timeout.
    • loadModel() - load the specified model into memory from backing storage, returning when complete.
    • modelSize() - determine the size (memory usage) of a previously loaded model. If this is very fast to compute, it can be omitted and the size provided instead in the response from loadModel.
    • unloadModel() - unload a previously loaded model, returning when complete.
    • Use a separate, arbitrary gRPC service interface for model inferencing requests. It can have any number of methods, and they are assumed to be idempotent. See predictor.proto for a very simple example.
    • The methods of your custom applier interface will be called only for models that are already fully loaded.
  2. Build a gRPC server Docker container which exposes these interfaces on localhost port 8085, or via a mounted Unix domain socket.
  3. Extend the Kustomize-based Kubernetes manifests to use your Docker image, with appropriate memory and CPU resource allocations for your container.
  4. Deploy to a Kubernetes cluster as a regular Service, which will expose ModelMesh's gRPC service interface via kube-dns (you do not implement this interface yourself). Consume it using the gRPC client of your choice from your upstream service components (see the client sketch after this list):
    • registerModel() and unregisterModel() for registering/removing models managed by the cluster
    • Any custom inferencing interface methods to invoke a previously registered model at runtime, making sure to set an mm-model-id or mm-vmodel-id metadata header (or the -bin suffix equivalents for UTF-8 ids)
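As referenced in step 1, here is a minimal sketch of a model runtime servicer, assuming Python stubs have been generated from model-runtime.proto with grpcio-tools. The module, message, and field names (model_runtime_pb2, RuntimeStatusResponse, modelId, etc.) and the helpers load_from_storage/estimate_size are illustrative assumptions; check the generated code for the exact names.

```python
from concurrent import futures

import grpc

# Hypothetical names for stubs generated from model-runtime.proto;
# verify against your actual generated modules.
import model_runtime_pb2 as mr
import model_runtime_pb2_grpc as mr_grpc


def load_from_storage(model_id):
    """Hypothetical helper: fetch and deserialize the model from backing storage."""
    raise NotImplementedError


def estimate_size(model):
    """Hypothetical helper: estimate the in-memory footprint of a loaded model."""
    raise NotImplementedError


class MyModelRuntime(mr_grpc.ModelRuntimeServicer):
    def __init__(self):
        self.loaded = {}  # model id -> in-memory model object

    def runtimeStatus(self, request, context):
        # Called only during startup to report readiness and basic capacity
        # parameters (the field names here are assumptions).
        return mr.RuntimeStatusResponse(
            status=mr.RuntimeStatusResponse.READY,
            capacityInBytes=8 * 1024**3,
            modelLoadingTimeoutMs=90_000,
        )

    def loadModel(self, request, context):
        # Load the specified model into memory; return only when complete.
        self.loaded[request.modelId] = load_from_storage(request.modelId)
        return mr.LoadModelResponse()

    def modelSize(self, request, context):
        # Report memory usage of an already-loaded model. This can be omitted
        # if the size is returned in the loadModel response instead.
        size = estimate_size(self.loaded[request.modelId])
        return mr.ModelSizeResponse(sizeInBytes=size)

    def unloadModel(self, request, context):
        # Free the model's memory; return only when complete.
        self.loaded.pop(request.modelId, None)
        return mr.UnloadModelResponse()


def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=8))
    mr_grpc.add_ModelRuntimeServicer_to_server(MyModelRuntime(), server)
    server.add_insecure_port("0.0.0.0:8085")  # the localhost port from step 2
    server.start()
    server.wait_for_termination()


if __name__ == "__main__":
    serve()
```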
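And as referenced in step 4, a hedged sketch of an upstream client that registers a model and then invokes it. The stub module, service, and message names, the Service DNS name, and the port are all illustrative assumptions; verify them against the stubs generated from model-mesh.proto and your custom inferencing proto.

```python
import grpc

# Hypothetical names for stubs generated from model-mesh.proto and a custom
# predictor.proto; check the actual generated code.
import model_mesh_pb2 as mm
import model_mesh_pb2_grpc as mm_grpc
import predictor_pb2 as pred
import predictor_pb2_grpc as pred_grpc

# kube-dns name and port of the deployed Service (assumed values).
channel = grpc.insecure_channel("model-mesh.my-namespace.svc.cluster.local:8033")

# Register a model to be managed by the cluster (request fields are assumed).
mgmt = mm_grpc.ModelMeshStub(channel)
mgmt.registerModel(mm.RegisterModelRequest(modelId="example-model"))

# Invoke the custom inferencing interface over the same channel. ModelMesh
# routes the call based on the mm-model-id (or mm-vmodel-id) metadata header.
predictor = pred_grpc.PredictorStub(channel)
response = predictor.predict(
    pred.PredictRequest(features=[1.0, 2.0, 3.0]),
    metadata=(("mm-model-id", "example-model"),),
)
print(response)
```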

Development

Please see the Developer Guide for details.