Problem description:
- Evalchemy currently supports only multi-GPU (not multi-node) evaluation, via two mutually exclusive options: data parallelism based on accelerate, and tensor parallelism based on HF transformers. An existing PR extends the accelerate approach to multi-node setups, and it appears to work well for small models.
- For larger models, however, the accelerate-based approach is not feasible.
Proposed solution:
- Support a vLLM backend (using Ray for multi-node orchestration), as lm-eval-harness does.
- vLLM provides highly optimized multi-node inference and supports both data and tensor parallelism.
- In principle, vLLM offers two advantages: 1) faster evaluation than HF transformers, especially with batched inference, and 2) support for models too large to fit on a single GPU.
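For reference, lm-eval-harness already exposes this kind of backend through its CLI; a sketch of the intended usage pattern (model name and parallelism degrees here are placeholders, and exact flag support may vary by lm-eval-harness version):

```shell
# Tensor parallelism: shard one large model across 4 GPUs,
# with data parallelism replicating the sharded model twice (8 GPUs total).
lm_eval \
  --model vllm \
  --model_args pretrained=<model-name>,tensor_parallel_size=4,data_parallel_size=2,dtype=auto,gpu_memory_utilization=0.8 \
  --tasks <task-name> \
  --batch_size auto
```

An evalchemy vLLM backend could mirror these `model_args`, letting `tensor_parallel_size` handle models that do not fit on one GPU and `data_parallel_size` scale throughput across the remaining devices.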