Performance of llama.cpp on Intel GPU with SYCL backend #23313

arthw · 2026-05-19T03:45:41Z

arthw
May 19, 2026
Collaborator

Purpose

This thread is for sharing llama.cpp performance data on Intel GPUs with the SYCL backend.

The data is for reference only, since the reported results are not double-checked.

It must not be used for any commercial purpose.

Rules

Testing with the default settings (environment variables) is encouraged.

If you want to report data from a non-default build or run configuration, please create a new table.
Create or update the tables directly, following the existing format.
Insert a new record instead of updating an existing one with the same keys; sort the records by col1, col2, col3.
Add your comments at the end of the thread for further discussion.
Don't add tables that compare with other hardware, frameworks, or backends.
Please run the test more than once and report the stable result.
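
The insert-and-sort rule can be sketched in Python as follows. This is only an illustration: it assumes each record is a tuple of strings in table column order, that "col1, col2, col3" means LLM, GPU, and Host, and that ordering is case-insensitive. The helper name `add_record` is hypothetical.

```python
def add_record(table, record):
    """Return a new table with `record` added and rows sorted by the first three columns.

    Existing rows with the same keys are kept, matching the rule
    "insert a new record instead of updating it".
    """
    rows = list(table) + [tuple(record)]
    # Assumption: sort case-insensitively on (LLM, GPU, Host). sorted() is stable,
    # so rows with identical keys keep their insertion order.
    return sorted(rows, key=lambda r: tuple(col.lower() for col in r[:3]))


table = [
    ("llama-2-7b.Q4_0", "B580x1", "Ryzen7 5700X3D", "fp16", "0"),
    ("llama-2-7b.Q4_0", "Arc770x1", "i7-13700K 64GB", "fp32", "0"),
]
table = add_record(table, ("llama-2-7b.Q4_0", "Arc770x1", "i5-14600k", "fp32", "0"))
for row in table:
    print("\t".join(row))
```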

Performance data on Intel GPU

Default setting

Build:

# fp32
./examples/sycl/build.sh

# fp16: first set -DGGML_SYCL_F16=ON in ./examples/sycl/build.sh, then run
./examples/sycl/build.sh

Run:

# Select the GPU(s) used in the test.
export ONEAPI_DEVICE_SELECTOR="level_zero:0"
source /opt/intel/oneapi/setvars.sh
./build/bin/llama-bench -fa 0,1 -m ../models/llama-2-7b.Q4_0.gguf

Data:

| LLM | GPU | Host | OS | FP32/FP16 | FA | pp512 t/s | tg128 t/s | Commit | Reporter | Date |
| --- | --- | --- | --- | --- | --: | ---: | ---: | --- | --- | --- |
| gemma4 26B.A4B Q5_K - Medium | ARC 140T | Intel Ultra 255H 32GB | Linux | fp16 | 0 | 223.82 | 12.70 | `d4c8e2c` | @jlionhan | 2026/5/31 |
| gemma4 26B.A4B Q5_K - Medium | ARC 140T | Intel Ultra 255H 32GB | Linux | fp16 | 1 | 166.46 | 13.70 | `d4c8e2c` | @jlionhan | 2026/5/31 |
| llama-2-7b.Q4_0 | Arc770x1 | i7-13700K 64GB | Ubuntu 24.04.4 | fp32 | 0 | 937.24 | 59.03 | `053e01d` | @arthw | 2026/5/19 |
| llama-2-7b.Q4_0 | Arc770x1 | i7-13700K 64GB | Ubuntu 24.04.4 | fp32 | 1 | 706.72 | 67.09 | `19e92c3` | @arthw | 2026/5/29 |
| llama-2-7b.Q4_0 | Arc770x1 | i5-14600k | cachyOS | fp32 | 0 | 894.44 | 55.53 | `5306f4b` | @digitalscream | 2026/5/22 |
| llama-2-7b.Q4_0 | Arc770x1 | i5-14600k | cachyOS | fp32 | 1 | 666.89 | 64.49 | `5306f4b` | @digitalscream | 2026/5/22 |
| llama-2-7b.Q4_0 | B580x1 | Ryzen7 5700X3D | Ubuntu 25.10 | fp16 | 0 | 2063.52 | 73.76 | `c6e4088` | @bedovyy | 2026/5/27 |
| llama-2-7b.Q4_0 | B580x2 | Ryzen7 5700X3D | Ubuntu 25.10 | fp16 | 0 | 1721.91 | 65.67 | `c6e4088` | @bedovyy | 2026/5/27 |
| llama-2-7b.Q4_0 | B70x1 | EPYC8124P | Linux | fp16 | 0 | 2763.84 | 105.47 | `dbe7901` | @FCLC | 2026/5/21 |
| llama-2-7b.Q4_0 | B70x1 | EPYC8124P | Linux | fp32 | 0 | 928.65 | 106.48 | `6a257d4` | @FCLC | 2026/5/21 |
| llama-2-7b.Q4_0 | B70x2 | EPYC8124P | Linux | fp16 | 0 | 2683.19 | 103.70 | `dbe7901` | @FCLC | 2026/5/21 |
| llama-2-7b.Q4_0 | B70x2 | EPYC8124P | Linux | fp32 | 0 | 926.58 | 104.29 | `6a257d4` | @FCLC | 2026/5/21 |
| Qwen3.6-27B-Q4_0 | B70x1 | EPYC8124P | Linux | fp32 | 0 | 721.08 | 26.09 | `6a257d4` | @FCLC | 2026/5/21 |
| Qwen3.6-27B-Q8_0 | B70x1 | EPYC8124P | Linux | fp32 | 0 | 833.93 | 15.48 | `6a257d4` | @FCLC | 2026/5/21 |

FCLC · 2026-05-21T00:09:54Z

FCLC
May 21, 2026

compiled with cmake -B build-sycl -DGGML_SYCL=ON -DGGML_SYCL_F16=ON -DGGML_SYCL_TARGET=INTEL -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_C_FLAGS="-march=znver4" -DCMAKE_CXX_FLAGS="-march=znver4" -DCMAKE_BUILD_TYPE=Release && cmake --build build-sycl --config Release -j 16

single b70

ONEAPI_DEVICE_SELECTOR="level_zero:0" ./build-sycl/bin/llama-bench -m ../models/llama27/llama-2-7b.Q4_0.gguf
warning: asserts enabled, performance may be affected
register_backend: registered backend SYCL (1 devices)
register_device: registered device SYCL0 (Intel(R) Graphics [0xe223])
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD EPYC 8124P 16-Core Processor)
load_backend: failed to find ggml_backend_init in /home/lea/Developement/llama.cpp/build-sycl/bin/libggml-sycl.so
load_backend: failed to find ggml_backend_init in /home/lea/Developement/llama.cpp/build-sycl/bin/libggml-cpu.so
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           pp512 |       2763.84 ± 4.23 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           tg128 |        105.47 ± 0.05 |

build: dbe7901ca (9147)

dual b70:

~/Developement/llama.cpp$ ./build-sycl/bin/llama-bench -m ../models/llama27/llama-2-7b.Q4_0.gguf
warning: asserts enabled, performance may be affected
register_backend: registered backend SYCL (2 devices)
register_device: registered device SYCL0 (Intel(R) Graphics [0xe223])
register_device: registered device SYCL1 (Intel(R) Graphics [0xe223])
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD EPYC 8124P 16-Core Processor)
load_backend: failed to find ggml_backend_init in /home/lea/Developement/llama.cpp/build-sycl/bin/libggml-sycl.so
load_backend: failed to find ggml_backend_init in /home/lea/Developement/llama.cpp/build-sycl/bin/libggml-cpu.so
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           pp512 |      2683.19 ± 10.32 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           tg128 |        103.70 ± 0.27 |

build: dbe7901ca (9147)
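
llama-bench output like the tables above is plain markdown, so a few lines of Python can turn it into records for the summary table. This is a sketch under assumptions: `parse_bench` is a hypothetical helper, not part of llama.cpp, and it assumes the last column has the form `mean ± stddev` as in the output above.

```python
def parse_bench(text):
    """Parse llama-bench markdown output into a list of dicts keyed by column name."""
    rows, header = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip log lines such as "register_backend: ..."
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells  # first table line holds the column names
            continue
        if all(set(c) <= set("-: ") for c in cells):
            continue  # separator row made of dashes and colons
        row = dict(zip(header, cells))
        mean, _, std = row["t/s"].partition("±")
        row["t/s"] = float(mean)
        row["stddev"] = float(std)
        rows.append(row)
    return rows


sample = """\
warning: asserts enabled, performance may be affected
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           pp512 |       2763.84 ± 4.23 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           tg128 |        105.47 ± 0.05 |
"""
results = parse_bench(sample)
for r in results:
    print(r["test"], r["t/s"], r["stddev"])
```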

0 replies

FCLC · 2026-05-21T00:38:14Z

FCLC
May 21, 2026

Compiling and running with F16=OFF instead:

cmake -B build-sycl -DGGML_SYCL=ON -DGGML_SYCL_F16=OFF -DGGML_SYCL_TARGET=INTEL -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_C_FLAGS="-march=znver4" -DCMAKE_CXX_FLAGS="-march=znver4" -DCMAKE_BUILD_TYPE=Release && cmake --build build-sycl --config Release -j 16

Single B70:

ONEAPI_DEVICE_SELECTOR="level_zero:0" ./build-sycl/bin/llama-bench -m ../models/llama27/llama-2-7b.Q4_0.gguf
warning: asserts enabled, performance may be affected
register_backend: registered backend SYCL (1 devices)
register_device: registered device SYCL0 (Intel(R) Graphics [0xe223])
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD EPYC 8124P 16-Core Processor)
load_backend: failed to find ggml_backend_init in /home/lea/Developement/llama.cpp/build-sycl/bin/libggml-sycl.so
load_backend: failed to find ggml_backend_init in /home/lea/Developement/llama.cpp/build-sycl/bin/libggml-cpu.so
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           pp512 |        928.65 ± 0.69 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           tg128 |        106.48 ± 0.05 |

build: 6a257d446 (9263)

Dual B70:

ONEAPI_DEVICE_SELECTOR="level_zero:0,1" ./build-sycl/bin/llama-bench -m ../models/llama27/llama-2-7b.Q4_0.gguf
warning: asserts enabled, performance may be affected
register_backend: registered backend SYCL (2 devices)
register_device: registered device SYCL0 (Intel(R) Graphics [0xe223])
register_device: registered device SYCL1 (Intel(R) Graphics [0xe223])
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD EPYC 8124P 16-Core Processor)
load_backend: failed to find ggml_backend_init in /home/lea/Developement/llama.cpp/build-sycl/bin/libggml-sycl.so
load_backend: failed to find ggml_backend_init in /home/lea/Developement/llama.cpp/build-sycl/bin/libggml-cpu.so
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           pp512 |       926.58 ± 14.44 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           tg128 |        104.29 ± 0.15 |

build: 6a257d446 (9263)
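
One rough way to judge whether a result is stable enough to report is the standard deviation relative to the mean, both of which llama-bench prints. The sketch below uses the fp32 figures above; the 2% cut-off is an arbitrary value chosen for illustration, not a rule from this thread.

```python
def relative_stddev(mean, stddev):
    """Return the standard deviation as a percentage of the mean."""
    return 100.0 * stddev / mean


# (mean t/s, stddev) taken from the fp32 runs above
runs = {
    "single B70 pp512": (928.65, 0.69),
    "single B70 tg128": (106.48, 0.05),
    "dual B70 pp512": (926.58, 14.44),
    "dual B70 tg128": (104.29, 0.15),
}
THRESHOLD_PCT = 2.0  # arbitrary cut-off, for illustration only

for name, (mean, std) in runs.items():
    pct = relative_stddev(mean, std)
    flag = "ok" if pct <= THRESHOLD_PCT else "noisy, consider rerunning"
    print(f"{name}: {pct:.2f}% ({flag})")
```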

0 replies

FCLC · 2026-05-21T00:52:30Z

FCLC
May 21, 2026

And with a much more interesting model, namely Qwen 3.6 27B:

q4

 ONEAPI_DEVICE_SELECTOR="level_zero:0" ./build-sycl/bin/llama-bench -m ../models/qwen36_27b/Qwen3.6-27B-Q4_0.gguf 
warning: asserts enabled, performance may be affected
register_backend: registered backend SYCL (1 devices)
register_device: registered device SYCL0 (Intel(R) Graphics [0xe223])
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD EPYC 8124P 16-Core Processor)
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 27B Q4_0                |  14.70 GiB |    26.90 B | SYCL       |  99 |           pp512 |        721.08 ± 0.73 |
| qwen35 27B Q4_0                |  14.70 GiB |    26.90 B | SYCL       |  99 |           tg128 |         26.09 ± 0.03 |

build: 6a257d446 (9263)

and q8:

ONEAPI_DEVICE_SELECTOR="level_zero:0" ./build-sycl/bin/llama-bench -m ../models/qwen36_27b/Qwen3.6-27B-Q8_0.gguf 
warning: asserts enabled, performance may be affected
register_backend: registered backend SYCL (1 devices)
register_device: registered device SYCL0 (Intel(R) Graphics [0xe223])
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD EPYC 8124P 16-Core Processor)
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | SYCL       |  99 |           pp512 |        833.93 ± 1.96 |
| qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | SYCL       |  99 |           tg128 |         15.48 ± 0.00 |

1 reply

arthw May 21, 2026
Collaborator Author

@FCLC
I updated the first post with your results.

Thank you!

digitalscream · 2026-05-21T16:14:16Z

digitalscream
May 21, 2026

Ooft. A770 16GB, i5 14600k, current cachyOS.

fp16:

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           pp512 |      1526.53 ± 31.96 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           tg128 |         27.42 ± 0.04 |

fp32:

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           pp512 |      1515.53 ± 16.97 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           tg128 |         27.33 ± 0.21 |

Very weird, compared with the one in the table - much better prefill, half the decode performance.

9 replies

NeoZhangJianyu May 22, 2026

Judging from the change in performance, I guess flash attention is enabled in the second case.

34 tokens/s is very similar to my old test result, which was affected by the driver.

Could you run:

lspci -nnk | grep -i vga -A3
00:02.0 VGA compatible controller [0300]: Intel Corporation Arrow Lake-U [Intel Graphics] [8086:7d67] (rev 06)
	DeviceName: Onboard - Video
	Subsystem: Gigabyte Technology Co., Ltd Device [1458:d000]
	Kernel driver in use: i915
--
03:00.0 VGA compatible controller [0300]: Intel Corporation Device [8086:e211]
	Subsystem: Shenzhen Gunnir Technology Development Co., Ltd Device [1ef7:2542]
	Kernel driver in use: xe
	Kernel modules: xe

digitalscream May 22, 2026

Interesting. Just tried it with the FA switch:

lj@seraph:~/bin/llama.cpp> ./bin/llama-bench -m ../../llm/llama-2-7b.Q4_0.gguf -fa 0,1
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |  0 |           pp512 |        912.36 ± 3.15 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |  0 |           tg128 |         34.47 ± 0.15 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |  1 |           pp512 |        692.82 ± 2.47 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |  1 |           tg128 |         39.09 ± 0.09 |

So...I do get a bit of a boost from FA on! I don't know what's happened with the performance from recompiling it, it's just fallen off a cliff.

The lspci result is:

lj@seraph:~> lspci -nnk | grep -i vga -A3
03:00.0 VGA compatible controller [0300]: Intel Corporation DG2 [Arc A770] [8086:56a0] (rev 08)
	Subsystem: Sparkle Computer Co., Ltd. Device [172f:4134]
	Kernel driver in use: i915
	Kernel modules: i915, xe

Could it be the i915 driver that's the problem?

NeoZhangJianyu May 22, 2026

Yes, this is the root cause.
You need to keep only one of i915 or xe to get better performance.

FA's impact is still smaller than I expected.
That may be affected by the driver too.

Please remove one of them.
i915 has better performance, but it is not recommended because it is the older driver.

Removing a GPU driver is risky.
Don't do it remotely.
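
The condition discussed here (a GPU that lists both i915 and xe under "Kernel modules") can be spotted automatically from the lspci output. A minimal sketch, assuming text in the format produced by `lspci -nnk | grep -i vga -A3` as pasted in this thread; `find_dual_driver_gpus` is a hypothetical helper that only reports and does not change any driver.

```python
def find_dual_driver_gpus(lspci_text):
    """Return (device line, driver in use, modules) for GPUs listing both i915 and xe."""
    flagged, device, in_use = [], None, None
    for line in lspci_text.splitlines():
        stripped = line.strip()
        if "VGA compatible controller" in stripped:
            device, in_use = stripped, None  # start of a new device entry
        elif stripped.startswith("Kernel driver in use:"):
            in_use = stripped.split(":", 1)[1].strip()
        elif stripped.startswith("Kernel modules:"):
            modules = [m.strip() for m in stripped.split(":", 1)[1].split(",")]
            if {"i915", "xe"} <= set(modules):
                flagged.append((device, in_use, modules))
    return flagged


sample = """\
03:00.0 VGA compatible controller [0300]: Intel Corporation DG2 [Arc A770] [8086:56a0] (rev 08)
\tSubsystem: Sparkle Computer Co., Ltd. Device [172f:4134]
\tKernel driver in use: i915
\tKernel modules: i915, xe
"""
for device, in_use, modules in find_dual_driver_gpus(sample):
    print(f"{device}\n  in use: {in_use}, available: {modules}")
```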

digitalscream May 22, 2026

Aha...winning!

MESA: warning: Support for this platform is experimental with Xe KMD, bug reports may be ignored.
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |  0 |           pp512 |        894.44 ± 4.69 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |  0 |           tg128 |         55.53 ± 0.16 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |  1 |           pp512 |        666.89 ± 1.44 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |  1 |           tg128 |         64.49 ± 0.29 |

Mildly frustrating that I can't reproduce that nice PP result now, though.
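
For reference, the effect of `-fa 1` in the table above can be expressed as a percentage change. A small sketch using only the numbers from that table; `pct_change` is a hypothetical helper name.

```python
def pct_change(before, after):
    """Percentage change from `before` to `after`."""
    return 100.0 * (after - before) / before


# t/s values from the table above, as (fa=0, fa=1)
results = {"pp512": (894.44, 666.89), "tg128": (55.53, 64.49)}

for test, (fa_off, fa_on) in results.items():
    print(f"{test}: {pct_change(fa_off, fa_on):+.1f}% with -fa 1")
```

On these figures, flash attention costs roughly a quarter of the prefill rate and adds about 16% to decode.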

NeoZhangJianyu May 22, 2026

It's a known issue. We are looking into it.

It's great to see better performance in your test.
I will update it in the table.

Thank you!

bedovyy · 2026-05-27T16:20:57Z

bedovyy
May 27, 2026

B580, AMD Ryzen 7 5700X3D, Ubuntu 25.10

built with fp16 in ./examples/sycl/build.sh (applied #23612)

./build/bin/llama-ls-sycl-device
Found 2 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc B580 Graphics|   20.1|    160|    1024|   32| 12168M|            1.6.34666|
| 1| [level_zero:gpu:1]|                Intel Arc B580 Graphics|   20.1|    160|    1024|   32| 12168M|            1.6.34666|
SYCL Optimization Feature:
|ID|        Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]|      Y|
| 1| [level_zero:gpu:1]|      Y|

### 2xB580
ZES_ENABLE_SYSMAN=1 ./build/bin/llama-bench -m ../models/llama-2-7b.Q4_0.gguf
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           pp512 |      1721.91 ± 35.90 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           tg128 |         65.67 ± 0.46 |

build: c6e408837 (9368)

### 1xB580
ONEAPI_DEVICE_SELECTOR="level_zero:0" ./build/bin/llama-bench -m ../models/llama-2-7b.Q4_0.gguf
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           pp512 |       2063.52 ± 3.52 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           tg128 |         73.76 ± 0.13 |

build: c6e408837 (9368)

1 reply

arthw May 28, 2026
Collaborator Author

Updated per your feedback!
Thank you!

thordarsen · 2026-05-28T20:05:12Z

thordarsen
May 28, 2026

Intel Arc Pro B50, Intel i7-8700 32GB RAM
Recent changes (9397 was today, 9298 was less than a week ago) seem to have badly lowered PP on SYCL which was already lacking compared to Vulkan. TG can drastically outpace Vulkan though, but at this point I have to start thinking if prefill or decode will be more important for what I'm doing.

build: c0c7e14 (9298)

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | SYCL | 99 | pp512 | 424.50 ± 1.34 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | SYCL | 99 | tg128 | 45.92 ± 0.09 |

build: 2f6c815 (9397)

| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | SYCL | 99 | 0 | pp512 | 397.09 ± 1.02 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | SYCL | 99 | 0 | tg128 | 45.74 ± 0.09 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | SYCL | 99 | 1 | pp512 | 397.73 ± 0.23 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | SYCL | 99 | 1 | tg128 | 47.80 ± 0.03 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 0 | pp512 | 590.01 ± 0.93 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 0 | tg128 | 40.13 ± 0.03 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp512 | 581.78 ± 1.61 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | tg128 | 41.82 ± 0.08 |

Command line arguments:
./llama-bench -r 7 -hf TheBloke/Llama-2-7B-GGUF:Q4_0 -fa 0,1

Build options:
cmake .. -B build -DGGML_SYCL=ON -DGGML_RPC=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_SYCL_DEVICE_ARCH=bmg-g21

cmake .. -B build -DGGML_VULKAN=1 -DGGML_RPC=ON

Nothing else changed between these runs: I tested my old version, ran "git pull", rebuilt, and retested.
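
To check whether a drop between two builds is larger than run-to-run noise, the means can be compared against the reported standard deviations. This is a rough sketch, not a rigorous statistical test: the threshold `k=3` is arbitrary, `compare_runs` is a hypothetical helper, and it assumes the build 9298 run (which has no fa column) is comparable to the fa=0 rows of build 9397.

```python
import math


def compare_runs(old, new, k=3.0):
    """Compare two (mean, stddev) results.

    Returns (percent change, True if the gap exceeds k combined standard deviations).
    k=3 is an arbitrary illustrative threshold, not a llama.cpp convention.
    """
    (m_old, s_old), (m_new, s_new) = old, new
    pct = 100.0 * (m_new - m_old) / m_old
    beyond_noise = abs(m_new - m_old) > k * math.hypot(s_old, s_new)
    return pct, beyond_noise


# SYCL figures from the two builds above (9298 vs 9397, fa=0)
pp_pct, pp_real = compare_runs((424.50, 1.34), (397.09, 1.02))
tg_pct, tg_real = compare_runs((45.92, 0.09), (45.74, 0.09))
print(f"pp512: {pp_pct:+.1f}%, beyond noise: {pp_real}")
print(f"tg128: {tg_pct:+.1f}%, beyond noise: {tg_real}")
```

By this measure the pp512 drop stands well clear of the noise, while the tg128 difference does not.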

9 replies

thordarsen May 30, 2026

On my Ryzen 5700X the results are in line - in Windows. I used the last release with a Windows SYCL download (b9334) and got similar results

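| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | SYCL | 99 | 0 | pp512 | 418.87 ± 1.26 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | SYCL | 99 | 0 | tg128 | 45.71 ± 0.04 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | SYCL | 99 | 1 | pp512 | 401.27 ± 6.11 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | SYCL | 99 | 1 | tg128 | 47.66 ± 0.23 |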

(Sysinfo is from Windows. The only relevant detail I think is missing is that the RAM is DDR4-3200.)

thordarsen May 30, 2026

And just as a note: Vulkan performance skyrocketed (I assume because of the Windows drivers, but I dunno).

| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | -1 | 0 | pp512 | 1186.45 ± 4.43 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | -1 | 0 | tg128 | 46.30 ± 0.07 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | -1 | 1 | pp512 | 1122.53 ± 3.73 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | -1 | 1 | tg128 | 47.69 ± 0.06 |

build: aa46bda (9437)

arthw Jun 1, 2026
Collaborator Author

@thordarsen
Are "Above 4G Decoding" and "Resizable BAR" enabled in the BIOS?

thordarsen Jun 1, 2026

Yes - the Intel driver software shows Resizable BAR is active.

I know that PCIe3 isn't ideal, but based on my Vulkan results, I don't think it's the primary culprit

thordarsen Jun 1, 2026

OK, setting -DGGML_SYCL_F16=ON made a big difference.

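| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | SYCL | 99 | 0 | pp512 | 1174.51 ± 2.33 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | SYCL | 99 | 0 | tg128 | 45.81 ± 0.11 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | SYCL | 99 | 1 | pp512 | 602.57 ± 1.31 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | SYCL | 99 | 1 | tg128 | 48.03 ± 0.07 |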

jlionhan · 2026-05-31T11:30:48Z

jlionhan
May 31, 2026

I hope it helps.

255H, ARC 140T, 32GB RAM

| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: |
| gemma4 26B.A4B Q5_K - Medium | 17.80 GiB | 25.23 B | SYCL | -1 | 0 | pp512 | 223.82 ± 5.05 |
| gemma4 26B.A4B Q5_K - Medium | 17.80 GiB | 25.23 B | SYCL | -1 | 0 | tg128 | 12.70 ± 0.18 |
| gemma4 26B.A4B Q5_K - Medium | 17.80 GiB | 25.23 B | SYCL | -1 | 1 | pp512 | 166.46 ± 3.61 |
| gemma4 26B.A4B Q5_K - Medium | 17.80 GiB | 25.23 B | SYCL | -1 | 1 | tg128 | 13.70 ± 0.15 |

build: d4c8e2c (9442)

cmake --fresh -B build -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_BUILD_TYPE=Release -DGGML_SYCL=1 -DBUILD_SHARED_LIBS=0 -DGGML_SYCL_F16=1

-- Using oneAPI Release SYCL compiler (icpx).
-- SYCL found
-- SYCL Compiler version: 20260000
-- SYCL_INCLUDE_DIR: /opt/intel/oneapi/compiler/2026.0/include
-- SYCL_LIBRARY=/opt/intel/oneapi/compiler/2026.0/lib/libsycl.so
-- Found IntelSYCL: /opt/intel/oneapi/compiler/2026.0/include (found version "202012")
-- GGML_SYCL_SUPPORT_LEVEL_ZERO ON
-- Level Zero loader found: /lib/libze_loader.so
-- Level Zero headers found: /usr/include
-- Found oneDNN: /opt/intel/oneapi/dnnl/2026.0/lib/libdnnl.so.3.11

0.00.794.752 I Build with Macros:
0.00.794.757 I   GGML_SYCL_FORCE_MMQ: no
0.00.794.757 I   GGML_SYCL_F16: yes
0.00.794.757 I   GGML_SYCL_GRAPH: yes
0.00.794.758 I   GGML_SYCL_DNNL: yes
0.00.794.758 I   GGML_SYCL_SUPPORT_LEVEL_ZERO: yes
0.00.794.759 I   GGML_SYCL_USE_VMM: yes
0.00.794.759 I Running with Environment Variables:
0.00.794.760 I   GGML_SYCL_DEBUG: 0
0.00.794.760 I   GGML_SYCL_DISABLE_OPT: 0
0.00.794.761 I   GGML_SYCL_DISABLE_GRAPH: 1
0.00.794.761 I   GGML_SYCL_ENABLE_LEVEL_ZERO: 1
0.00.794.761 I   GGML_SYCL_DISABLE_DNN: 0
0.00.794.762 I   GGML_SYCL_ENABLE_VMM: 1
0.00.794.762 I   GGML_SYCL_PRIORITIZE_DMMV: 0
0.00.794.764 I   GGML_SYCL_USE_ASYNC_MEM_OP: 1
0.00.794.764 I   GGML_SYCL_ENABLE_FLASH_ATTN: 1
0.00.794.767 I Found 1 SYCL devices:
0.00.794.768 I |  |                   |                                       |       |Max    |        |Max  |Global |                     |
0.00.794.768 I |  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
0.00.794.769 I |ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
0.00.794.769 I |--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
0.00.794.938 I | 0| [level_zero:gpu:0]|                     Intel Arc Graphics|  12.74|    128|    1024|   32| 30588M|           1.15.38308|
0.00.794.938 I SYCL Optimization Feature:
0.00.794.939 I |ID|        Device Type|Reorder|
0.00.794.939 I |--|-------------------|-------|
0.00.794.941 I | 0| [level_zero:gpu:0]|      Y|

lspci -nnk | grep -i vga -A3
00:02.0 VGA compatible controller [0300]: Intel Corporation Arrow Lake-P [Arc Pro 130T/140T] [8086:7d51] (rev 03)
        DeviceName: Onboard IGD
        Subsystem: Hewlett-Packard Company Device [103c:8dea]
        Kernel driver in use: xe

1 reply

arthw Jun 1, 2026
Collaborator Author

@jlionhan
Updated them in the table!

Thank you!

bobguns · 2026-06-01T14:06:05Z

bobguns
Jun 1, 2026

~/llama.cpp$ cmake -B build/ReleaseOV -G Ninja -DCMAKE_BUILD_TYPE=Release -DGGML_OPENVINO=ON
CMAKE_BUILD_TYPE=Release
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- GGML_SYSTEM_ARCH: x86
-- Including CPU backend
-- x86 detected
-- Adding CPU backend variant ggml-cpu: -march=native
-- Including OPENVINO backend
-- ggml version: 0.13.1
-- ggml commit: 55ac090
-- OpenSSL found: 3.5.5
-- Generating embedded license file for target: llama-app
-- Configuring done (0.2s)
-- Generating done (0.1s)
-- Build files have been written to: /home/llama.cpp/build/ReleaseOV

:~/llama.cpp$ GGML_OPENVINO_STATEFUL_EXECUTION=1 GGML_OPENVINO_DEVICE=GPU ./build/ReleaseOV/bin/llama-bench -m ~/models/Qwen2.5-7B-Instruct-Q4_0.gguf -fa 1 -p 1024,4096 -n 128,512
OpenVINO: using device GPU

| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: |
| qwen2 7B Q4_0 | 4.13 GiB | 7.62 B | OPENVINO | -1 | 1 | pp1024 | 2988.73 ± 8.18 |
| qwen2 7B Q4_0 | 4.13 GiB | 7.62 B | OPENVINO | -1 | 1 | pp4096 | 2551.74 ± 6.83 |
| qwen2 7B Q4_0 | 4.13 GiB | 7.62 B | OPENVINO | -1 | 1 | tg128 | 17.51 ± 0.08 |
| qwen2 7B Q4_0 | 4.13 GiB | 7.62 B | OPENVINO | -1 | 1 | tg512 | 16.80 ± 0.02 |

specs https://www.asrockind.com/en-gb/NUC%20BOX-358H
Crucial 5600 SO-DIMMS

0 replies

bobguns · 2026-06-01T15:13:39Z

bobguns
Jun 1, 2026

📊 Intel Panther Lake Xe3 iGPU (12 EU) Benchmark Matrix: OpenVINO vs. Vulkan vs. SYCL

I have completed a thorough benchmarking sweep across all three major acceleration backends available in llama.cpp for Intel hardware: OpenVINO, Vulkan, and SYCL (oneAPI).

Environment

Hardware: Intel Panther Lake mobile processor with Xe3 Graphics (12 Execution Units)
OS: Ubuntu Linux 26.04
Memory Architecture: Unified Memory Architecture (UMA) sharing system RAM directly with the iGPU.
llama.cpp Build: 55ac0909e (9458)

Models Tested

Qwen2.5-7B-Instruct-Q4_0 (Dense, 4.13 GiB)
Qwen3-Coder-Next-Q4_K_M (80B MoE, 45.15 GiB). Note: relies on UMA to place the whole 45 GB model in system-shared memory.

📈 Performance Summary Matrix

| Model | Backend | Prompt Processing 1024 (t/s) | Prompt Processing 4096 (t/s) | Token Gen 128 (t/s) | Token Gen 512 (t/s) | Status / Observation |
| --- | --- | ---: | ---: | ---: | ---: | --- |
| Qwen 2.5 7B (Dense) | OpenVINO | 2988.73 | 2551.74 | 17.51 | 16.80 | Champion prompt processing |
| Qwen 2.5 7B (Dense) | Vulkan | 769.85 | 502.20 | 14.62 | 14.55 | Strong ingestion, trailing generation |
| Qwen 2.5 7B (Dense) | SYCL | 305.08 | 269.69 | 17.55 | 17.50 | Maximum generation throughput |
| Qwen3 80B (MoE) | OpenVINO | n/a | n/a | n/a | n/a | CRASH (`CPY` memory layout bug) |
| Qwen3 80B (MoE) | Vulkan | 341.48 | 295.77 | 16.39 | 16.37 | Fully stable, superior ingestion |
| Qwen3 80B (MoE) | SYCL | 166.85 | 172.98 | 17.53 | 17.32 | Fully stable, maximum generation |

🛠️ Deep-Dive Analysis

1. The OpenVINO Qwen3 MoE Crash

OpenVINO panics and drops a core dump immediately when attempting to initialize the new Qwen3-Coder-Next model.

Error: ggml-backend.cpp:898: pre-allocated tensor (cache_r_l0 (view) (copy of )) in a buffer (OPENVINO0) that cannot run the operation (CPY)
Root Cause: This is an upstream translation issue. The OpenVINO backend does not yet correctly compute memory copying operations (CPY) over the complex memory layouts required by the Gated DeltaNet layers unique to the new Qwen3 architecture.

2. SYCL vs. Vulkan on the 80B MoE Architecture

Both SYCL and Vulkan load and run the model without hitting this crash, including the large UMA allocation:

Token Generation: SYCL wins by 6-7% (17.53 t/s vs 16.39 t/s at tg128, 17.32 t/s vs 16.37 t/s at tg512). By binding directly into Intel's low-level Level Zero driver layer, SYCL passes active experts through the matrix engines with minimal scheduling overhead.
Prompt Processing: Vulkan leads SYCL by about 2x at pp1024 (341.48 t/s vs 166.85 t/s) and about 1.7x at pp4096 (295.77 t/s vs 172.98 t/s). The current SYCL implementation lacks the advanced cache-tiling optimizations needed to cleanly parallelize large prompt ingestion blocks across hybrid MoE weights.
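
The ratios behind this comparison can be recomputed directly from the summary matrix. A short sketch using only the Qwen3 80B (MoE) figures reported above:

```python
# t/s values for Qwen3 80B (MoE) from the summary matrix
vulkan = {"pp1024": 341.48, "pp4096": 295.77, "tg128": 16.39, "tg512": 16.37}
sycl = {"pp1024": 166.85, "pp4096": 172.98, "tg128": 17.53, "tg512": 17.32}

for test in vulkan:
    ratio = sycl[test] / vulkan[test]
    print(f"{test}: SYCL/Vulkan = {ratio:.2f}x")
```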

📋 Raw Build & Execution Logs

1. Qwen3 80B MoE — OpenVINO Crash Log

:~/llama.cpp$ GGML_OPENVINO_STATEFUL_EXECUTION=1 GGML_OPENVINO_DEVICE=GPU ./build/ReleaseOV/bin/llama-bench -m ~/models/Qwen3-Coder-Next-Q4_K_M.gguf -fa 1 -p 1024,4096 -n 128,512
OpenVINO: using device GPU
| model                          |       size |     params | backend    | ngl |  fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --: | --------------: | -------------------: |
/home/llama.cpp/ggml/src/ggml-backend.cpp:898: pre-allocated tensor (cache_r_l0 (view) (copy of )) in a buffer (OPENVINO0) that cannot run the operation (CPY)

#0  0x000070d0479484f3 in ggml_print_backtrace () from /home/llama.cpp/build/ReleaseOV/bin/libggml-base.so.0
#1  0x000070d0479486a6 in ggml_abort () from /home/llama.cpp/build/ReleaseOV/bin/libggml-base.so.0
#2  0x000070d04796170c in ggml_backend_sched_backend_id_from_cur(ggml_backend_sched*, ggml_tensor*) () from /home/llama.cpp/build/ReleaseOV/bin/libggml-base.so.0
#3  0x000070d04796375f in ggml_backend_sched_split_graph () from /home/llama.cpp/build/ReleaseOV/bin/libggml-base.so.0
Aborted (core dumped)

specs https://www.asrockind.com/en-gb/NUC%20BOX-358H
Crucial 5600 SO-DIMMS

0 replies

yqYo1 · 2026-06-01T17:02:06Z

yqYo1
Jun 1, 2026

HW:ryzen5 5600X, DDR4-3600 128GB, ARC B570
OS:Ubuntu 24.04.4, 6.17.0-29-generic
Since the output to the display uses a separate GPU (GT730), the B570 should only be processing llama.cpp.

FP32

❯ ./build/bin/llama-bench -fa 0,1 -m /data/llm_models/TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_0.gguf
| model                          |       size |     params | backend    | ngl |  fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  -1 |   0 |           pp512 |        388.77 ± 0.54 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  -1 |   0 |           tg128 |         72.48 ± 0.18 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  -1 |   1 |           pp512 |        412.46 ± 1.13 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  -1 |   1 |           tg128 |         76.76 ± 0.06 |

build: 5aba5364d (9456)

FP16

❯ ./build/bin/llama-bench -fa 0,1 -m /data/llm_models/TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_0.gguf
| model                          |       size |     params | backend    | ngl |  fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  -1 |   0 |           pp512 |       1355.95 ± 7.83 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  -1 |   0 |           tg128 |         72.52 ± 0.12 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  -1 |   1 |           pp512 |        685.89 ± 0.31 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  -1 |   1 |           tg128 |         76.72 ± 0.07 |

build: 5aba5364d (9456)
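
Results like these need to be pivoted before they fit the summary table in the first post, which has one row per FA setting with pp512 and tg128 side by side. A minimal sketch of that step, assuming the rows have already been reduced to (fa, test, t/s) tuples; `to_summary_rows` is a hypothetical helper, and the GPU label `B570x1` is an assumption that follows the thread's naming pattern.

```python
from collections import defaultdict


def to_summary_rows(bench_rows, fixed):
    """Pivot per-test rows into one summary row per FA setting.

    `bench_rows` are (fa, test, t/s) tuples; `fixed` holds the columns that
    llama-bench does not report (GPU, host, OS, precision, ...).
    """
    by_fa = defaultdict(dict)
    for fa, test, tps in bench_rows:
        by_fa[fa][test] = tps
    return [
        {**fixed, "FA": fa, "pp512 t/s": tests["pp512"], "tg128 t/s": tests["tg128"]}
        for fa, tests in sorted(by_fa.items())
    ]


# FP16 figures from the run above
fp16_rows = [(0, "pp512", 1355.95), (0, "tg128", 72.52),
             (1, "pp512", 685.89), (1, "tg128", 76.72)]
fixed = {"LLM": "llama-2-7b.Q4_0", "GPU": "B570x1", "Precision": "fp16"}  # GPU label assumed

for row in to_summary_rows(fp16_rows, fixed):
    print("\t".join(str(v) for v in row.values()))
```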

0 replies
