Each invocation of helm-run performs a number of evaluation runs. Each evaluation run uses a single scenario and a single model, and is performed independently from other evaluation runs. The default runner usually executes evaluation runs serially (one at a time), though some alternate runners (e.g. SlurmRunner) may execute them in parallel.
An evaluation run has the following steps:
- Get in-context learning and evaluation instances from scenario. Each instance has an input (e.g. question) and a set of reference outputs (e.g. multiple choice options).
- (Advanced) Run data augmenters / perturbations on the base instances to generate perturbed instances.
- Perform adaptation to transform the in-context learning instances and evaluation instances into model inference requests, which contain prompts and other request parameters such as request temperature and stop sequences.
- Send the requests to the models and receive the request responses.
- Compute the per-instance stats and aggregate them to per-run stats.
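The steps above can be sketched as a toy pipeline. This is only an illustration under stated assumptions: every name below (Instance, Request, get_instances, adapt, fake_model, run_evaluation) is hypothetical and simplified, not HELM's actual classes or signatures, and the perturbation step is skipped.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Instance:
    """Hypothetical stand-in for an instance: an input plus reference outputs."""
    input: str             # e.g. a question
    references: List[str]  # e.g. multiple choice options
    correct: str           # the reference treated as the right answer


@dataclass
class Request:
    """Hypothetical stand-in for a model inference request."""
    prompt: str
    temperature: float
    stop_sequences: List[str]


def get_instances() -> Tuple[List[Instance], List[Instance]]:
    # Get in-context learning and evaluation instances from a toy scenario.
    train = [Instance("2 + 2 = ?", ["3", "4"], "4")]
    evals = [
        Instance("1 + 1 = ?", ["2", "3"], "2"),
        Instance("3 + 3 = ?", ["5", "6"], "6"),
    ]
    return train, evals


def adapt(train: List[Instance], instance: Instance) -> Request:
    # Adaptation: build a few-shot prompt and attach request parameters.
    shots = "".join(f"Q: {t.input}\nA: {t.correct}\n\n" for t in train)
    return Request(
        prompt=f"{shots}Q: {instance.input}\nA:",
        temperature=0.0,
        stop_sequences=["\n"],
    )


def fake_model(request: Request) -> str:
    # Stand-in for sending a request to a model: always answers "2".
    return "2"


def run_evaluation() -> Dict[str, float]:
    train, evals = get_instances()
    requests = [adapt(train, inst) for inst in evals]
    responses = [fake_model(req) for req in requests]
    # Per-instance stats (exact match), then aggregation to a per-run stat.
    per_instance = [float(resp == inst.correct) for inst, resp in zip(evals, responses)]
    return {"exact_match": sum(per_instance) / len(per_instance)}


print(run_evaluation())
```

In this sketch the fake model gets one of the two evaluation instances right, so the aggregated exact-match stat is 0.5.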
The following code and data objects are involved in an evaluation run:
- A Scenario provides the in-context learning and evaluation Instances.
- A DataAugmenter takes in a base Instance and generates perturbed Instances.
- An Adapter transforms in-context learning instances and evaluation instances into model inference Requests.
- A Client sends the Requests to the models and receives RequestResponses.
- Metrics take in RequestStates (which each contain an Instance, Request, RequestResponse, and additional instance context) and compute aggregated and per-instance Stats.
Each evaluation run is fully specified by a run specification (RunSpec), which contains a specification for each of the above code objects (except the Client, whose ClientSpec is a special case):
- A ScenarioSpec specifies a Scenario instance.
- A DataAugmenterSpec specifies a DataAugmenter instance.
- An AdapterSpec specifies an Adapter instance.
- MetricSpecs specify Metric instances.
Note: The RunSpec does not contain a ClientSpec that specifies the Client instance. Instead, the RunSpec specifies the name of the model deployment inside AdapterSpec. During the evaluation run, the model deployment name is used to retrieve the ClientSpec from built-in or user-provided model deployment configurations, and that ClientSpec is then used to construct the Client. This late binding allows the HELM user to perform user-specific configuration of clients, such as changing the type or location of the model inference platform for the model.
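The late binding described in this note can be pictured as a lookup by deployment name at run time. The sketch below is a minimal illustration, not HELM's implementation: the ClientSpec fields, the registry dictionaries, the deployment name example/model-a, the client class names, and the rule that user-provided entries override built-in ones are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class ClientSpec:
    """Hypothetical spec: which client class to build and with what arguments."""
    class_name: str
    args: Dict[str, Any] = field(default_factory=dict)


# Hypothetical built-in deployment configuration, keyed by deployment name.
BUILT_IN_DEPLOYMENTS: Dict[str, ClientSpec] = {
    "example/model-a": ClientSpec("HostedApiClient", {"base_url": "https://api.example.invalid"}),
}


def resolve_client_spec(deployment_name: str, user_deployments: Dict[str, ClientSpec]) -> ClientSpec:
    # Assumption for this sketch: user-provided entries take precedence.
    merged = {**BUILT_IN_DEPLOYMENTS, **user_deployments}
    try:
        return merged[deployment_name]
    except KeyError:
        raise ValueError(f"Unknown model deployment: {deployment_name}") from None


# The run specification only carries the deployment name; the client is chosen later.
user_config = {
    "example/model-a": ClientSpec("LocalServerClient", {"base_url": "http://localhost:8000"}),
}
spec = resolve_client_spec("example/model-a", user_config)
print(spec.class_name)  # prints "LocalServerClient"
```

Because the run specification stores only the name, the same run specification can be reused unchanged while the user points the deployment at a different inference platform.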
The objects above can be grouped into three categories:
- Specifications (RunSpec, ScenarioSpec, DataAugmenterSpec, AdapterSpec, ClientSpec, and MetricSpec) are serializable. They may be written to evaluation run output files, to provide a record of how the evaluation run was configured and how to reproduce it.
- Code objects (Scenario, DataAugmenter, Adapter, Client, Metric) are not serializable. These contain the program logic used by the evaluation run. Users can implement custom subclasses of these objects if needed.
- Data objects (Instance, Request, Response, Stat) are serializable. These are typically produced as outputs of code objects and written to the evaluation run output files.
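To make the "serializable specification" idea concrete, here is a minimal sketch using standard-library dataclasses and JSON. The ObjectSpec and ToyRunSpec classes, their fields, and the class names stored in them are hypothetical; the real specification classes and output file layout may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class ObjectSpec:
    """Hypothetical spec: names a code object and the arguments to build it with."""
    class_name: str
    args: Dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class ToyRunSpec:
    """Hypothetical, heavily reduced run specification."""
    name: str
    scenario_spec: ObjectSpec
    metric_specs: List[ObjectSpec]


run_spec = ToyRunSpec(
    name="toy:subject=anatomy",
    scenario_spec=ObjectSpec("ToyScenario", {"subject": "anatomy"}),
    metric_specs=[ObjectSpec("ExactMatchMetric")],
)

# A spec is plain data, so it can be written out as a record of the run...
serialized = json.dumps(asdict(run_spec), indent=2, sort_keys=True)

# ...and read back later to see how the run was configured.
restored_dict = json.loads(serialized)
print(restored_dict["scenario_spec"]["class_name"])  # prints "ToyScenario"
```

The code objects themselves (for example, a scenario holding open file handles or a client holding a network session) are not written out this way; only the specification needed to reconstruct them is.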
When a user runs helm-run, the evaluation runner performs a number of evaluation runs, each specified by a RunSpec. However, the user typically does not provide the RunSpecs directly. Instead, the user passes one or more run entries to helm-run. Run entries are short strings (e.g. mmlu:subject=anatomy,model=openai/gpt2) that specify how to invoke run spec functions, which produce the actual RunSpecs.
The run entry format is explained further in its own documentation.
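As a rough illustration of how a run entry string could map onto a run spec function call, the sketch below splits an entry of the form name:key=value,key=value and dispatches on the name. The parser, the RUN_SPEC_FUNCTIONS registry, and toy_mmlu_run_spec are hypothetical and cover only the simple form shown in the example above; the actual format and the way helm-run handles arguments such as model may differ.

```python
from typing import Callable, Dict, Tuple


def parse_run_entry(entry: str) -> Tuple[str, Dict[str, str]]:
    """Split 'name:key=value,key=value' into a name and a dict of string arguments."""
    name, _, arg_string = entry.partition(":")
    args: Dict[str, str] = {}
    if arg_string:
        for pair in arg_string.split(","):
            key, sep, value = pair.partition("=")
            if not sep:
                raise ValueError(f"Expected key=value, got: {pair!r}")
            args[key] = value
    return name, args


def toy_mmlu_run_spec(subject: str, model: str) -> Dict[str, str]:
    # Hypothetical run spec function: returns a plain dict instead of a real RunSpec.
    return {"scenario": "mmlu", "subject": subject, "model": model}


# Hypothetical registry mapping run entry names to run spec functions.
RUN_SPEC_FUNCTIONS: Dict[str, Callable[..., Dict[str, str]]] = {
    "mmlu": toy_mmlu_run_spec,
}

name, args = parse_run_entry("mmlu:subject=anatomy,model=openai/gpt2")
toy_spec = RUN_SPEC_FUNCTIONS[name](**args)
print(toy_spec)
```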