@@ -55,60 +55,66 @@ spec:
                   FamilyName represents the model type, like llama2, which will be auto-injected
                   to the labels with the key of `llmaz.io/model-family-name`.
                 type: string
-              inferenceFlavors:
-                description: |-
-                  InferenceFlavors represents the accelerator requirements to serve the model.
-                  Flavors are fungible following the priority represented by the slice order.
-                items:
-                  description: |-
-                    Flavor defines the accelerator requirements for a model and the necessary parameters
-                    in autoscaling. Right now, it will be used in two places:
-                    - Pod scheduling with node selectors specified.
-                    - Cluster autoscaling with essential parameters provided.
-                  properties:
-                    name:
-                      description: Name represents the flavor name, which will be
-                        used in the model claim.
-                      type: string
-                    nodeSelector:
-                      additionalProperties:
-                        type: string
-                      description: |-
-                        NodeSelector represents the node candidates for Pod placements; if a node doesn't
-                        meet the nodeSelector, it will be filtered out in the resourceFungibility scheduler plugin.
-                        If nodeSelector is empty, it means every node is a candidate.
-                      type: object
-                    params:
-                      additionalProperties:
-                        type: string
-                      description: |-
-                        Params stores other useful parameters and will be consumed by the autoscaling components
-                        like cluster-autoscaler, Karpenter.
-                        E.g. when scaling up nodes with 8x Nvidia A100, the parameter can be injected with
-                        instance-type: p4d.24xlarge for AWS.
-                      type: object
-                    requests:
-                      additionalProperties:
-                        anyOf:
-                        - type: integer
-                        - type: string
-                        pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
-                        x-kubernetes-int-or-string: true
+              inferenceConfig:
+                description: InferenceConfig represents the inference configurations
+                  for the model.
+                properties:
+                  flavors:
+                    description: |-
+                      Flavors represents the accelerator requirements to serve the model.
+                      Flavors are fungible following the priority represented by the slice order.
+                    items:
                       description: |-
-                        Requests defines the required accelerators to serve the model for each replica,
-                        like <nvidia.com/gpu: 8>. For multi-host cases, the requests here indicate
-                        the resource requirements for each replica. This may change in the future.
-                        Not recommended to set the cpu and memory usage here:
-                        - if using playground, you can define the cpu/mem usage at backendConfig.
-                        - if using inference service, you can define the cpu/mem at the container resources.
-                        However, if you define the same accelerator requests at playground/service as well,
-                        the requests here will be covered.
+                        Flavor defines the accelerator requirements for a model and the necessary parameters
+                        in autoscaling. Right now, it will be used in two places:
+                        - Pod scheduling with node selectors specified.
+                        - Cluster autoscaling with essential parameters provided.
+                      properties:
+                        name:
+                          description: Name represents the flavor name, which will
+                            be used in the model claim.
+                          type: string
+                        nodeSelector:
+                          additionalProperties:
+                            type: string
+                          description: |-
+                            NodeSelector represents the node candidates for Pod placements; if a node doesn't
+                            meet the nodeSelector, it will be filtered out in the resourceFungibility scheduler plugin.
+                            If nodeSelector is empty, it means every node is a candidate.
+                          type: object
+                        params:
+                          additionalProperties:
+                            type: string
+                          description: |-
+                            Params stores other useful parameters and will be consumed by cluster-autoscaler / Karpenter
+                            for autoscaling or be defined as model parallelism parameters like TP or PP size.
+                            E.g. with autoscaling, when scaling up nodes with 8x Nvidia A100, the parameter can be injected
+                            with <INSTANCE-TYPE: p4d.24xlarge> for AWS.
+                            Preset parameters: TP, PP, INSTANCE-TYPE.
+                          type: object
+                        requests:
+                          additionalProperties:
+                            anyOf:
+                            - type: integer
+                            - type: string
+                            pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
+                            x-kubernetes-int-or-string: true
+                          description: |-
+                            Requests defines the required accelerators to serve the model for each replica,
+                            like <nvidia.com/gpu: 8>. For multi-host cases, the requests here indicate
+                            the resource requirements for each replica, usually equal to the TP size.
+                            Not recommended to set the cpu and memory usage here:
+                            - if using playground, you can define the cpu/mem usage at backendConfig.
+                            - if using inference service, you can define the cpu/mem at the container resources.
+                            However, if you define the same accelerator requests at playground/service as well,
+                            the requests will be overwritten by the flavor requests.
+                          type: object
+                      required:
+                      - name
                       type: object
-                  required:
-                  - name
-                  type: object
-                maxItems: 8
-                type: array
+                    maxItems: 8
+                    type: array
+                type: object
               source:
                 description: |-
                   Source represents the source of the model; there are several ways to load
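
To make the new schema concrete, here is a minimal sketch of a model manifest exercising `inferenceConfig.flavors`. The `apiVersion`, `kind`, object names, and the GPU node label are assumptions for illustration only; this diff defines just the `spec` schema, so only the field names under `spec` are taken from it.

```yaml
# Minimal sketch: apiVersion/kind, names, and the node label are assumed,
# not taken from this diff; the spec fields mirror the schema above.
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: llama3-8b
spec:
  familyName: llama3          # injected as the llmaz.io/model-family-name label
  source:
    uri: ollama://llama3.3
  inferenceConfig:
    flavors:                  # fungible; earlier entries have higher priority
    - name: a100              # flavor name, referenced in the model claim
      nodeSelector:           # hypothetical GPU label; an empty selector means every node is a candidate
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
      params:
        TP: "8"               # preset tensor-parallelism parameter (see "Preset parameters" above)
        INSTANCE-TYPE: p4d.24xlarge   # consumed by cluster-autoscaler / Karpenter when scaling up
      requests:
        nvidia.com/gpu: 8     # per-replica accelerator request, usually equal to the TP size
    - name: v100              # fallback flavor if no a100 candidate node qualifies
      requests:
        nvidia.com/gpu: 8
```

Per the `requests` description above, accelerator requests declared in a flavor overwrite the same requests set at the playground/service level, so cpu/mem belong in backendConfig or the container resources instead.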
@@ -158,8 +164,10 @@ spec:
                     type: object
                 uri:
                   description: |-
-                    URI represents various kinds of model sources following the uri protocol, e.g.:
-                    - OSS: oss://<bucket>.<endpoint>/<path-to-your-model>
+                    URI represents various kinds of model sources following the uri protocol, protocol://<address>, e.g.
+                    - oss://<bucket>.<endpoint>/<path-to-your-model>
+                    - ollama://llama3.3
+                    - host://<path-to-your-model>
                   type: string
               type: object
             required:
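
For the `uri` field, a short sketch of the three addressing forms now documented; the angle-bracket values are the schema's own placeholders, not real endpoints:

```yaml
# Exactly one uri of the form protocol://<address>, chosen by where the model lives.
spec:
  source:
    uri: oss://<bucket>.<endpoint>/<path-to-your-model>   # object storage (OSS)
    # uri: ollama://llama3.3                              # pulled via the Ollama registry
    # uri: host://<path-to-your-model>                    # a path already present on the host
```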