This document explains how to build and run the Inferflow tools and service on Windows.
Before getting started, please make sure that:
- The Inferflow source code has been cloned to the local machine.
- Microsoft Visual Studio (2017 or newer) has been installed.
- NVCC 12.2 is installed after Visual Studio. If Visual Studio is installed later, you may have to reinstall NVCC 12.2. During the reinstallation, choose the custom installation option and uncheck the "driver components" checkbox so that the driver is NOT installed.
Open one of the following solution files in build/vs_projects or build/vs2022_projects:
- inferflow.sln: The GPU version that supports GPU/CPU hybrid inference.
- inferflow_cpu.sln: The CPU-only version.
For the Debug configuration, executables are written to bin/x64_Debug/; for the Release configuration, the output directory is bin/x64_Release/.
- Example-1: Load a tiny model and perform inference
  - Step-1: Download the model:
    #> cd {inferflow-root-dir}/data/models/llama2.c/
    #> download.win.cmd
    Instead of running the batch script, you can manually download the model files and copy them to the above folder. The source URL and file names can be found in download.win.cmd.
  - Step-2: Run the llm_inference tool:
    #> cd {inferflow-root-dir}/bin/
    #> x64_Release\llm_inference.exe llm_inference.tiny.ini
    Note that llm_inference.exe and llm_inference.tiny.ini do not need to be in the same folder (llm_inference.tiny.ini is in bin/ and llm_inference.exe is in bin/x64_Release/).
- Example-2: Run the llm_inference tool to load a larger model for inference
  - Step-1: Edit the configuration file bin/inferflow_service.ini to choose a model.
    In the "transformer_engine" section of bin/inferflow_service.ini, there are multiple lines starting with "models =" or ";models =". Lines starting with the ";" character are comments. To choose a model for inference, uncomment the line corresponding to that model and comment out the lines of the other models. By default, the phi-2 model is selected.
  - Step-2: Download the selected model:
    #> cd {inferflow-root-dir}/data/models/{model-name}/
    #> download.win.cmd
  - Step-3: Edit the configuration file bin/llm_inference.ini to choose or edit a query.
    In the configuration file, queries are organized into query lists. A query list can contain one or more queries, and different query lists serve different purposes. For example, query_list.decoder_only is for testing decoder-only models, and its details are configured in the query_list.decoder_only section. The first line of this section is "query_count = 1", which means the query list contains only one query. Among the following lines with the key query1, only one is uncommented and therefore effective; the others (i.e., the lines starting with a ";" character) are commented out. To choose a query for testing, uncomment it and comment out all the other queries. You can also add new queries or change the content of an existing one.
  - Step-4: Run the inference tool:
    #> cd {inferflow-root-dir}/bin/
    #> x64_Release\llm_inference.exe
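The comment/uncomment convention from Step-1 of Example-2 can also be applied by a script. The following is a minimal Python sketch, not part of Inferflow: the function name `select_model`, the `[section]` header syntax, and the sample values `phi-2` and `other_model` are assumptions made for illustration, and the real `models =` lines in bin/inferflow_service.ini may use a different value format.

```python
import re

# Matches "models = ..." lines, with or without leading ";" comment markers.
MODELS_LINE = re.compile(r"^\s*;*\s*(models\s*=.*)$")
# Matches an INI section header such as "[transformer_engine]" (assumed syntax).
SECTION_HEADER = re.compile(r"^\s*\[(.+?)\]\s*$")


def select_model(ini_text, model_value, section="transformer_engine"):
    """Uncomment the "models =" line whose value equals model_value and
    comment out every other "models =" line in the given section."""
    current_section = None
    result = []
    for line in ini_text.splitlines():
        header = SECTION_HEADER.match(line)
        if header:
            current_section = header.group(1)
        match = MODELS_LINE.match(line)
        if match and current_section == section:
            setting = match.group(1).strip()
            value = setting.split("=", 1)[1].strip()
            # Keep the chosen line active; prefix all the others with ";".
            line = setting if value == model_value else ";" + setting
        result.append(line)
    return "\n".join(result) + "\n"


# Hypothetical file content; the real values may look different.
SAMPLE = """[transformer_engine]
models = phi-2
;models = other_model
"""

UPDATED = select_model(SAMPLE, "other_model")
print(UPDATED)
```

The sketch works on text only, so it can be tried on a copy of the configuration file before the result is written back.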
- Start the Inferflow service:
  - Step-1: Edit the service configuration file (bin/inferflow_service.ini).
  - Step-2: Start the service:
    #> cd bin
    #> x64_Release\inferflow_service.exe
    Alternatively, you can run the service from the x64_Release folder:
    #> cd bin\x64_Release
    #> inferflow_service.exe
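Before sending requests, it can help to confirm that the service is accepting connections. The following Python sketch polls a TCP port using only the standard library. It assumes that the service listens on port 8080 of the local machine, which is inferred from the example URL in this document rather than from the service configuration; the helper names `is_port_open` and `wait_for_service` are hypothetical.

```python
import socket
import time


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_service(host="localhost", port=8080, attempts=10, delay=1.0):
    """Poll the port up to `attempts` times, sleeping `delay` seconds
    between tries. The default host and port are assumptions."""
    for _ in range(attempts):
        if is_port_open(host, port):
            return True
        time.sleep(delay)
    return False
```

An open port only shows that something is listening; it does not prove that the model has finished loading.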
- Run an HTTP client to interact with the Inferflow service over HTTP and get inference results.
  - Option-1: Run the Inferflow client tool, inferflow_client.exe.
    - Step-1: Edit the configuration file (bin/inferflow_client.ini) to set the query text and options.
    - Step-2: Run the client tool to get inference results:
      #> cd bin
      #> x64_Release\inferflow_client.exe
  - Option-2: Use a third-party REST client (e.g., the Tabbed Postman Chrome extension).
    - URL: http://localhost:8080 (if you access the service from a different machine, replace "localhost" with the IP address of the service machine)
    - HTTP method: POST
    - Example body text:
      {"text": "Write an article about the weather of Seattle.", "res_prefix": "", "decoding_alg": "sample.top_p", "random_seed": 1, "temperature": 0.7, "is_streaming_mode": 0}
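The same POST request can be sent from a script instead of a REST client. The Python sketch below builds the example body shown above and posts it with the standard library. The helper names `build_request` and `send_query` are hypothetical, the `Content-Type` header is an assumption (this document does not say whether the service requires it), and the response is returned as raw text because its format is not described here.

```python
import json
import urllib.request


def build_request(text, url="http://localhost:8080", **options):
    """Build a POST request that carries the example body from this document.
    Keyword arguments override the default field values."""
    body = {
        "text": text,
        "res_prefix": "",
        "decoding_alg": "sample.top_p",
        "random_seed": 1,
        "temperature": 0.7,
        "is_streaming_mode": 0,
    }
    body.update(options)
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        method="POST",
        # Assumed header; the service may or may not need it.
        headers={"Content-Type": "application/json"},
    )


def send_query(text, **options):
    """Send the query to a running service and return the raw response text."""
    request = build_request(text, **options)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")


# Usage, once the service is running:
#   print(send_query("Write an article about the weather of Seattle."))
```

Building the request separately from sending it keeps the payload easy to inspect without a running service.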
- (To be added)