LLM Offline Inference

LLM Offline Inference is a toolkit for offline inference with large language models (LLMs). It processes large batches of prompts efficiently by wrapping popular libraries such as vLLM and Hugging Face Transformers. It is designed to be extensible, and support for additional libraries such as SGLang is planned.

Overview

Key Features

Multi-Library Integration: Support for several LLM inference libraries, so you can switch or combine frameworks as needed.
Dynamic & Manual Batching: Batch sizes can be determined dynamically or specified manually to optimize throughput.
Guided Generation: Constrain outputs with JSON schemas or regular-expression-based choices, using the outlines library, for more controlled and deterministic results.
Unified Generation Parameters: A single set of generation parameters controls model behavior across the supported libraries.
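
To make the difference between manual and dynamic batching concrete, here is a minimal sketch in plain Python. It is not the toolkit's implementation: the helper names `manual_batches` and `dynamic_batches` are hypothetical, and using character count as a stand-in for token count is an assumption made only for illustration.

```python
from typing import Iterator, List, Sequence


def manual_batches(prompts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of batch_size prompts (the last may be smaller)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(prompts), batch_size):
        yield list(prompts[start:start + batch_size])


def dynamic_batches(prompts: Sequence[str], max_chars: int) -> Iterator[List[str]]:
    """Yield batches whose combined prompt length stays within max_chars.

    Character count is a crude proxy for token count (an assumption of this
    sketch). A single prompt longer than max_chars becomes its own batch
    instead of being dropped.
    """
    batch: List[str] = []
    used = 0
    for prompt in prompts:
        # Close the current batch if adding this prompt would exceed the budget.
        if batch and used + len(prompt) > max_chars:
            yield batch
            batch, used = [], 0
        batch.append(prompt)
        used += len(prompt)
    if batch:
        yield batch
```

With manual batching the caller fixes the number of prompts per batch. With the dynamic variant the batch size varies with prompt length, which tends to keep memory use per batch more even when prompts differ a lot in size.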

Performance and Scalability

Built for high-throughput offline inference, the toolkit uses batching and performance logging to process large volumes of prompts efficiently. Its design aims to maximize resource utilization and to scale to resource-intensive workloads.
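
As a rough illustration of what per-batch performance logging can look like, the sketch below times each batch and reports overall throughput. It assumes nothing about the toolkit's internals: `timed_generate` is a hypothetical helper, and `generate_fn` stands in for any callable that maps a list of prompts to a list of outputs.

```python
import logging
import time
from typing import Callable, List, Sequence

logger = logging.getLogger("offline_inference_sketch")


def timed_generate(
    generate_fn: Callable[[List[str]], List[str]],
    prompts: Sequence[str],
    batch_size: int,
) -> List[str]:
    """Run generate_fn over fixed-size batches and log timing per batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    outputs: List[str] = []
    run_start = time.perf_counter()
    for start in range(0, len(prompts), batch_size):
        batch = list(prompts[start:start + batch_size])
        batch_start = time.perf_counter()
        outputs.extend(generate_fn(batch))
        elapsed = time.perf_counter() - batch_start
        logger.info(
            "batch %d: %d prompts in %.3fs", start // batch_size, len(batch), elapsed
        )
    total = time.perf_counter() - run_start
    # Guard against a zero-length interval on very fast stand-in functions.
    rate = len(prompts) / total if total > 0 else float("inf")
    logger.info("processed %d prompts at %.1f prompts/s", len(prompts), rate)
    return outputs
```

Calling `logging.basicConfig(level=logging.INFO)` before `timed_generate` makes the log lines visible on the console.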

Installation

Clone the repository and install the library in editable mode:

git clone https://github.com/brotSchimmelt/llm-offline-inference.git
cd llm-offline-inference

pip install -e .

Usage

from pydantic import BaseModel
from llm_offline_inference import VLLM, GenerationParams

# Initialize the model.
llm = VLLM(
    name="human friendly model name",
    model_path="/path/to/model",
    prompt_format="llama2",  # supported prompt formats are listed in config/prompt_formats.py
)

# Set the parameters for model generation.
generation_params = GenerationParams(
    temperature=0.5,
    max_tokens=16,
    # the full list of parameters is in generation_params.py
)

# Set up a JSON return schema (optional).
class City(BaseModel):
    city: str


output = llm.generate(
    "What is the capital of Iceland?",
    generation_params,
    return_string=True,
    json_schema=City,  # optional
    system_prompt="You are a helpful assistant.",  # optional
)
# output[0]: { "city": "Reykjavik" }
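
The comment above suggests that with `return_string=True` each output is JSON text matching the schema. Assuming that is the case, a caller would typically parse and check the strings before using them. The sketch below does this with the standard library only; `parse_city_outputs` is a hypothetical helper and not part of the toolkit.

```python
import json
from typing import Any, Dict, List


def parse_city_outputs(raw_outputs: List[str]) -> List[Dict[str, Any]]:
    """Parse JSON strings and check that each has a string-valued "city" field."""
    parsed: List[Dict[str, Any]] = []
    for raw in raw_outputs:
        record = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
        if not isinstance(record, dict) or not isinstance(record.get("city"), str):
            raise ValueError(f"output does not match the City schema: {raw!r}")
        parsed.append(record)
    return parsed


cities = parse_city_outputs(['{ "city": "Reykjavik" }'])
print(cities[0]["city"])
```

With pydantic v2 installed, `City.model_validate_json(raw)` would perform the equivalent check directly against the schema class from the usage example.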

License

This project is licensed under the MIT license.
