Commit 3b1d362

Merge pull request #158 from danielferr85/main
Include vision capabilities for llama.cpp integration. Better error handling. Better documentation in code. Change notation for config.ini for GGUF models
2 parents 01d3261 + d8d58ac commit 3b1d362

5 files changed

Lines changed: 130 additions & 82 deletions

README.md

Lines changed: 42 additions & 39 deletions
@@ -1,6 +1,6 @@
 # RKLLama: LLM Server and Client for Rockchip 3588/3576

-### [Version: 0.0.69](#New-Version)
+### [Version: 0.0.70](#New-Version)

 Video demo ( version 0.0.1 ):

@@ -533,10 +533,12 @@ The structure of a GGUF model is similar to rkllm models. You need a folder with
 └── qwen3.5-4b:q8_0
     └── model.gguf (can have any name but must end in .gguf)
     └── config.ini (optional)
+    └── mmproj.gguf (optional; can have any name, but including the substring 'mmproj' in the name is recommended. Must end in .gguf. Only applies to multimodal models with vision capabilities)
+

 ```

-The contents of the config.ini are llama.cpp environment vars for RKNPU inference explained by the author of the fork: https://github.com/invisiofficial/rk-llama.cpp/tree/rknpu2/ggml/src/ggml-rknpu2 (RKNPU_DOMAINS variable is skipped because rkllama handles it) and llama.cpp argument for the llama-server process: https://github.com/invisiofficial/rk-llama.cpp/blob/rknpu2/tools/server/README.md
+The contents of the config.ini are llama.cpp environment vars for RKNPU inference explained by the author of the fork: https://github.com/invisiofficial/rk-llama.cpp/tree/rknpu2/ggml/src/ggml-rknpu2 (RKNPU_DOMAINS variable is skipped because rkllama handles it) and llama.cpp arguments for the llama-server process: https://github.com/invisiofficial/rk-llama.cpp/blob/rknpu2/tools/server/README.md (You are only allowed to use arguments that start with '--')

 Some examples of config.ini files:

@@ -549,20 +551,20 @@ The structure of a GGUF model is similar to rkllm models. You need a folder with
 RKNPU_HYBRID=W8A8_HADAMARD

 [ARGS]
---mmap =
---no-repack=
---no-warmup=
---cache-type-k = q8_0
---cache-type-v = q8_0
---cache-ram = 2048
---batch-size = 2048
---ubatch-size = 2048
---top-p = 1.0
---top-k = 0
---min-p = 0.01
---temp = 1.0
---chat-template-kwargs = {"reasoning_effort": "low"}
---log-file = /opt/rkllama/models/gpt-oss-20b:q8_0/llamacpp.log
+mmap =
+no-repack=
+no-warmup=
+cache-type-k = q8_0
+cache-type-v = q8_0
+cache-ram = 2048
+batch-size = 2048
+ubatch-size = 2048
+top-p = 1.0
+top-k = 0
+min-p = 0.01
+temp = 1.0
+chat-template-kwargs = {"reasoning_effort": "low"}
+log-file = llamacpp.log (can have any name and be an absolute path if you don't want logs in the same model folder)
 ```


@@ -573,20 +575,20 @@ The structure of a GGUF model is similar to rkllm models. You need a folder with
 RKNPU_HYBRID=W4A4_HADAMARD

 [ARGS]
---mmap =
---no-repack=
---no-warmup=
---cache-type-k = q8_0
---cache-type-v = q8_0
---cache-ram = 2048
---batch-size = 2048
---ubatch-size = 2048
---ctx-size = 65536
---predict = 2048
---top-p = 0.95
---top-k = 64
---temp = 1.0
---log-file = /home/orangepi/github/danielferr85/rkllama/gemma-4-26b-a4b-it:ud-iq4_xs/llamacpp.log
+mmap =
+no-repack=
+no-warmup=
+cache-type-k = q8_0
+cache-type-v = q8_0
+cache-ram = 2048
+batch-size = 2048
+ubatch-size = 2048
+ctx-size = 65536
+predict = 2048
+top-p = 0.95
+top-k = 64
+temp = 1.0
+log-file = llamacpp.log (can have any name and be an absolute path if you don't want logs in the same model folder)
 ```


@@ -597,15 +599,16 @@ The structure of a GGUF model is similar to rkllm models. You need a folder with
 RKNPU_HYBRID=W8A8_STANDARD

 [ARGS]
---mmap =
---no-repack=
---no-warmup=
---cache-type-k = q8_0
---cache-type-v = q8_0
---cache-ram = 2048
---batch-size = 2048
---ubatch-size = 2048
---log-file = /home/orangepi/github/danielferr85/rkllama/qwen3.5-4b:q8_0/llamacpp.log
+mmap =
+no-repack=
+no-warmup=
+cache-type-k = q8_0
+cache-type-v = q8_0
+cache-ram = 2048
+batch-size = 2048
+ubatch-size = 2048
+mmproj = mmproj-F16.gguf (projector for vision capabilities for the model)
+log-file = llamacpp.log (can have any name and be an absolute path if you don't want logs in the same model folder)
 ```

 For qwen3.5 follow the recommendations:
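
To make the config.ini layout above concrete, here is a minimal sketch of how such a file parses with Python's standard `configparser`, which is the module the diff imports. The helper name `load_config` and the sample text are illustrative assumptions, not code from the repository. Two behaviours are worth keeping in mind: by default `configparser` lowercases option names, and a flag written as `mmap =` parses to an empty string. The parenthetical notes in the README examples are explanatory; if left in a real file they would be read as part of the value.

```python
import configparser

# Illustrative sample only; section names follow the ENV/ARGS sections read in the diff.
EXAMPLE_CONFIG = """
[ENV]
RKNPU_HYBRID=W8A8_STANDARD

[ARGS]
mmap =
no-warmup=
cache-type-k = q8_0
mmproj = mmproj-F16.gguf
log-file = llamacpp.log
"""


def load_config(text):
    """Hypothetical helper: return (env, args) dictionaries parsed from config text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    env = dict(parser["ENV"]) if parser.has_section("ENV") else {}
    args = dict(parser["ARGS"]) if parser.has_section("ARGS") else {}
    return env, args


env, args = load_config(EXAMPLE_CONFIG)
print(env)   # option names come back lowercased by default
print(args)  # value-less flags come back as empty strings
```

If the original casing of environment variable names matters, the caller has to restore it (for example with `str.upper`) or set `parser.optionxform = str` before reading.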

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [project]
 name = "rkllama"
-version = "0.0.69"
+version = "0.0.70"
 authors = [
     { name="NotPunchnox", email="punchnoxpro@gmail.com" },
     { name="TomJacobsUK", email="tom@tomjacobs.co.uk" },

src/rkllama/api/model_utils.py

Lines changed: 24 additions & 5 deletions
@@ -5,6 +5,7 @@
 from pathlib import Path
 import rkllama.config
 import time
+import configparser

 # Configure logger
 logger = logging.getLogger("rkllama.model_utils")
@@ -569,10 +570,22 @@ def get_gguf_model_path(model_name) -> str:

     # Search for the GGUF files
     if os.path.isdir(model_path):
+
+        # Read the config for the GGUF model for llama.cpp (if it exists)
+        config_file = os.path.join(model_path, "config.ini")
+        configuration = configparser.ConfigParser()
+        configuration.read(config_file)
+
+        # Read possible projector for vision models
+        expected_mmproj_subname = "mmproj"
+        if configuration is not None and "ARGS" in configuration.keys() and any(x in configuration["ARGS"].keys() for x in ["mmproj","--mmproj"]):
+            expected_mmproj_subname = configuration["ARGS"]["--mmproj"] if "--mmproj" in configuration["ARGS"].keys() else configuration["ARGS"]["mmproj"]
+
+        # Loop over the files in the model directory
         for root, dirs, files in os.walk(model_path):
             for file in files:
                 file_path = os.path.join(root, file)
-                if file_path.lower().endswith(".gguf"):
+                if file_path.lower().endswith(".gguf") and expected_mmproj_subname.lower() not in file_path.lower(): # Prevent returning the projector
                     # return the file
                     return file_path
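
The selection rule in this hunk (take the first `.gguf` file that does not look like the vision projector) can be sketched in isolation as follows. The function name `find_main_gguf` and its `projector_hint` parameter are hypothetical. Unlike the diff, which compares against the full path, this sketch compares against the file name only, so a model directory that happens to contain the hint in its own name does not hide every file; that is a deliberate variation, not the repository's behaviour.

```python
import os


def find_main_gguf(model_dir, projector_hint="mmproj"):
    """Hypothetical helper: first .gguf file whose name does not contain the projector hint."""
    hint = projector_hint.lower()
    for root, _dirs, files in os.walk(model_dir):
        for name in sorted(files):  # sorted for a deterministic pick
            lowered = name.lower()
            if lowered.endswith(".gguf") and hint not in lowered:
                return os.path.join(root, name)
    return None  # no model file found
```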
@@ -641,6 +654,7 @@ def wait_for_service(
     Wait until an HTTP service becomes available.

     Parameters:
+        process (subprocess.Popen): Popen process to check status
         url (str): URL to check.
         timeout (float): Requests timeout in seconds.
         interval (float): Seconds to wait between retry attempts.
@@ -671,8 +685,8 @@ def wait_for_service(
             stdout, _ = process.communicate()

             # Kill the process
-            server_process.kill()
-            server_process.wait(timeout=5)
+            process.kill()
+            process.wait(timeout=5)

             # Check if insufficient memory in the current domain
             if "RKNPU ERROR: Out of memory in allowed IOMMU domains" in stdout:
@@ -687,6 +701,11 @@ def wait_for_service(
             # requests.get() waits for the server response unless a timeout is set [InlineCitation-1-Guide to Handling Python Requests Timeout](https://oxylabs.io/blog/python-requests-timeout)
             response = requests.get(url, timeout=timeout)
             if response.status_code == expected_status:
+
+                # Wait for warm up subprocess to prevent error:
+                # requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
+                logger.debug(f"Waiting to finish warmup subprocess for llama-server...")
+                time.sleep(5)
                 return True, None

         except requests.RequestException:
@@ -697,8 +716,8 @@ def wait_for_service(
     logger.error(f"Timeout waiting for llama-server process to start....")

     # Kill the process
-    server_process.kill()
-    server_process.wait(timeout=5)
+    process.kill()
+    process.wait(timeout=5)

     # Return not initiated
     return False, False
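
The overall shape of `wait_for_service` after this change (poll until the service answers, give up if the child process exits, kill it on timeout, and pause briefly once it is ready) can be sketched without any HTTP dependency. Everything here is an assumption for illustration: the name `wait_until_ready`, the `probe` callable standing in for the `requests.get` check, and the `settle` pause standing in for the fixed `time.sleep(5)`. The sketch returns a single boolean and leaves out the out-of-memory detection the real function performs.

```python
import time


def wait_until_ready(process, probe, max_wait=30.0, interval=0.5, settle=0.0):
    """Hypothetical sketch: True once probe() succeeds, False on early exit or timeout.

    process: a subprocess.Popen-like object (needs poll(), kill(), wait()).
    probe:   zero-argument callable returning True when the service answers.
    """
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        if process.poll() is not None:
            return False  # the child exited before becoming ready
        if probe():
            time.sleep(settle)  # optional pause so warm-up can finish
            return True
        time.sleep(interval)
    # Timed out: stop the child so it does not linger
    process.kill()
    process.wait(timeout=5)
    return False
```

Passing the process explicitly, as the commit now does, is what lets the kill step act on the right object; the removed lines referred to a `server_process` name that is not a parameter of `wait_for_service`.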

src/rkllama/api/worker.py

Lines changed: 18 additions & 7 deletions
@@ -374,6 +374,8 @@ def run_llama_cpp_model_server(model_name, gguf_model_dir, gguf_model_path, port
     """
     Run an instance of llama.cpp for the required model program using subprocess.run.
     Args:
+        model_name (str): Model Name to load
+        gguf_model_dir (str): Directory where the model resides
         gguf_model_path (str): GGUF Model to load
         port (int): port to assign to the llama.cpp model
         base_domain_id (int): Domain to execute the llama.cpp server
@@ -391,7 +393,7 @@ def run_llama_cpp_model_server(model_name, gguf_model_dir, gguf_model_path, port
     configuration = configparser.ConfigParser()
     configuration.read(config_file)

-    # Read custom environment vars
+    # Read custom environment vars for llama.cpp
     rk_llama_cpp_env = { "RKNPU_DOMAINS": f"{','.join(str(base_domain_id))}"}
     if configuration is not None and "ENV" in configuration.keys():
         for var in configuration["ENV"].keys():
@@ -409,14 +411,26 @@ def run_llama_cpp_model_server(model_name, gguf_model_dir, gguf_model_path, port
     cmd = ["taskset" ,"--cpu-list", cpu , os.path.join(rkllama.config.get_path("llamacpp"), "llama-server"),
            "--model" , gguf_model_path,
            "--port" , str(port),
-           "--threads" , "4"]
+           "--threads" , "4"] # Default: 4 threads

     # Read custom arguments to llama.cpp
     if configuration is not None and "ARGS" in configuration.keys():
         for arg in configuration["ARGS"].keys():
             arg_value = configuration["ARGS"][arg]
+
+            # Preparing format of arguments for llama-server
+            if not arg.startswith("--"):
+                logger.debug(f"Adding -- to the start of the argument '{arg}'")
+                arg = f"--{arg}"
+
             # Check that the arguments are not the required ones calculated by rkllama
-            if arg not in ["-m", "--port", "--model", "--cpu-list"]:
+            if arg not in ["--port", "--model", "--cpu-list"]:
+
+                # Checking if projector and log-file are given without a path in config
+                if arg in ["--mmproj","--log-file"] and not arg_value.startswith("/"):
+                    logger.debug(f"Adding model directory to argument '{arg}' because currently relative path specified '{arg_value}'")
+                    arg_value = os.path.join(gguf_model_dir, arg_value)
+
                 logger.debug(f"Adding custom argument to llama.cpp '{arg}' with value '{arg_value}'")
                 cmd.append(arg)
                 if arg_value is not None and arg_value:
@@ -432,10 +446,6 @@ def run_llama_cpp_model_server(model_name, gguf_model_dir, gguf_model_path, port
         text=True,
     )

-    # Wait for warm up subprocess to prevent error:
-    # requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
-    time.sleep(3) # 3 seconds
-
     # Waiting for service up
     logger.debug(f"Waiting for model {gguf_model_path} Up and running...")
     initialized, need_more_iommu_domains = wait_for_service(server_process, f"http://localhost:{port}/v1/models", max_wait = int(rkllama.config.get('model', 'max_seconds_waiting_worker_response')))
@@ -1515,6 +1525,7 @@ def create_worker_process(self, base_domain_ids, model_path, model_dir, options=
             # Create the process
             logger.debug(f"Trying to create the process for the worker for model {self.worker_model_info.model} with IOMMU domains {domains_assigned}")
             self.process = run_llama_cpp_model_server(self.worker_model_info.model, model_dir,model_path,self.worker_model_info.llama_cpp_port,domains_assigned)
+            # Check whether loading failed because more base domains are needed
             if isinstance(self.process, int) and self.process == -1:
                 # More base domains needed
                 continue
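
The argument handling added to `run_llama_cpp_model_server` boils down to three rules: prefix `--` when it is missing, drop arguments that rkllama sets itself, and resolve relative `--mmproj` and `--log-file` values against the model directory. A standalone sketch of those rules follows; the names `build_extra_args`, `RESERVED`, and `PATH_ARGS` are hypothetical, and `os.path.isabs` is used where the diff checks for a leading `/`.

```python
import os

# Arguments computed by rkllama itself, so user-supplied values are ignored
RESERVED = {"--port", "--model", "--cpu-list"}
# Arguments whose values are file paths, resolved against the model directory
PATH_ARGS = {"--mmproj", "--log-file"}


def build_extra_args(args, model_dir):
    """Hypothetical helper: turn the [ARGS] mapping into llama-server command-line tokens."""
    cmd = []
    for name, value in args.items():
        flag = name if name.startswith("--") else f"--{name}"
        if flag in RESERVED:
            continue
        if flag in PATH_ARGS and value and not os.path.isabs(value):
            value = os.path.join(model_dir, value)
        cmd.append(flag)
        if value:  # flags such as "mmap =" carry no value
            cmd.append(value)
    return cmd
```

Accepting both `mmap` and `--mmap` keeps config.ini files written in the older notation working alongside the new one.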

src/rkllama/server/server.py

Lines changed: 45 additions & 30 deletions
@@ -1337,7 +1337,7 @@ def embeddings_ollama():
 def ollama_version():
     """Return a dummy version to be compatible with Ollama clients"""
     return jsonify({
-        "version": "0.0.69"
+        "version": "0.0.70"
     }), 200


@@ -1573,7 +1573,8 @@ def forward_request_to_llama_cpp_worker(is_openai_request,request):
     Route to llama.cpp worker for inference for GGUF models

     Args:
-        request : Original request to forward
+        is_openai_request (bool): True for an OpenAI-format request, False for an Ollama one
+        request (dict): Original request to forward
     """

     # Check if llama.cpp directory exists defined
@@ -1627,36 +1628,50 @@ def forward_request_to_llama_cpp_worker(is_openai_request,request):

     def generate():

-        # Make the call to the llama-server with stream enable
-        with requests.post(
-            proxy_route_url,
-            json=data,
-            headers=headers,
-            timeout=120,
-            stream=stream
-        ) as response:
+        try:
+            # Make the call to the llama-server with stream enable
+            with requests.post(
+                proxy_route_url,
+                json=data,
+                headers=headers,
+                timeout=120,
+                stream=stream
+            ) as response:
+
+                # Check for error codes
+                try:
+                    response.raise_for_status()
+                except requests.HTTPError as e:
+                    error = f"OpenAI API error for model '{model_name}': {str(e)}"
+                    logger.error(error)
+                    return jsonify({"error": error}), 500
+
+
+                # Create a converter to Ollama (if needed)
+                converter = OpenAIToOllamaStreamConverter()

-            # Check for error codes
-            try:
-                response.raise_for_status()
-            except requests.HTTPError as e:
-                raise RuntimeError(f"OpenAI API error: {e}") from e
-
-            # Create a converter to Ollama (if needed)
-            converter = OpenAIToOllamaStreamConverter()
-
-            # Loop over the chunks returned by llama.cpp
-            for line in response.iter_lines():
-                # Decode the bytes line
-                line = line.decode("utf-8")
-
-                if not is_openai_request: # Ollama
-                    for chunk in converter.process_line(line):
-                        yield json.dumps(chunk) + "\n"
-                else: # OpenAI
-                    # Return the chunk to the client of the request
-                    yield f"{line}\n"
+                # Loop over the chunks returned by llama.cpp
+                for line in response.iter_lines():
+                    # Decode the bytes line
+                    line = line.decode("utf-8")

+                    if not is_openai_request: # Ollama
+                        for chunk in converter.process_line(line):
+                            yield json.dumps(chunk) + "\n"
+                    else: # OpenAI
+                        # Return the chunk to the client of the request
+                        yield f"{line}\n"
+        except Exception as e:
+            # Log the error
+            logger.error(f"Streaming error: {e}", exc_info=True)
+            # Send a JSON error message instead of breaking the stream
+            yield json.dumps({
+                "error": {
+                    "code": "INTERNAL_ERROR",
+                    "message": str(e)
+                }
+            }) + "\n"
+
     logger.debug("Making the streaming call to llama-server...")
     return Response(stream_with_context(generate()), mimetype="text/event-stream") # OpenAI
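
The error-handling pattern introduced in `generate()` (catch the exception inside the generator and emit one final JSON error line rather than letting the stream break) can be shown with a small self-contained sketch. The names `stream_lines` and `fetch_lines` are hypothetical stand-ins for the generator and for `response.iter_lines()`. One Python detail is relevant when reading the hunk: a `return <value>` inside a generator only ends the iteration and the value is not delivered to the consumer as an item, so the sketch reports every failure, including upstream HTTP errors, by yielding.

```python
import json


def stream_lines(fetch_lines):
    """Hypothetical sketch: relay decoded lines; on failure, yield one JSON error line.

    fetch_lines: zero-argument callable returning an iterable of bytes lines.
    """
    try:
        for raw in fetch_lines():
            yield raw.decode("utf-8") + "\n"
    except Exception as exc:  # broad on purpose: the stream must end cleanly
        yield json.dumps({
            "error": {
                "code": "INTERNAL_ERROR",
                "message": str(exc),
            }
        }) + "\n"
```

A client consuming the stream can then check the last line for an `error` key instead of dealing with an aborted connection.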