Accelerating llama.cpp for the Copilot+PCs / Snapdragon X (updated) #8336
Replies: 7 comments 17 replies
-
I have not had time to test with Windows on ARM (and I'm not sure I'm brave enough to venture into such murky waters!). However, I would like to point out that there are currently some Vulkan build-related changes in #8119. If you make some headway with this, it would be great to test against that branch to ensure CMake compatibility. 😊
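If it helps, here is a minimal sketch of testing against that PR branch from an existing llama.cpp checkout (the local branch name pr-8119 is just a label I made up):
```
# Fetch the head of PR #8119 into a local branch and try a Vulkan build against it
git fetch origin pull/8119/head:pr-8119
git checkout pr-8119

cmake -B build -DGGML_VULKAN=1
cmake --build build --config Release -j
```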
-
A quick "how-to" for compiling llama.cpp on Windows 11 22H2 under WSL2 (Ubuntu 24.04, toolchain installed via apt):
Build llama.cpp "normally" first (CPU only, to get a performance baseline) and then with Vulkan (cmake -B build -DGGML_VULKAN=1).
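Spelled out a bit more, the two builds might look roughly like this on a fresh WSL2 Ubuntu 24.04 (the apt package list is my assumption; the LunarG Vulkan SDK can be used instead of the distro packages):
```
# Build prerequisites (package names are an assumption for Ubuntu 24.04)
sudo apt install -y build-essential cmake git libvulkan-dev glslc

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# 1) "Normal" CPU-only build, as a performance baseline
cmake -B build-cpu
cmake --build build-cpu --config Release -j

# 2) Vulkan build
cmake -B build-vulkan -DGGML_VULKAN=1
cmake --build build-vulkan --config Release -j
```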
-
Thanks Andi
I've got the Snapdragon X Elite Surface tablet. Went all in, only to discover this lack of functionality afterwards. The bleeding edge gets messy sometimes. Like you, I'm hoping for an update of arm64 llama.cpp.
cheers
Berowne
________________________________
From: Andreas (Andi) Kunar
Subject: Re: [ggerganov/llama.cpp] Compiling llama.cpp with Vulkan for new Copilot+PCs / Snapdragon X (updated) (Discussion #8336)
Hi @AndreasKunar, did the Vulkan build work?
It works on WSL2/Ubuntu, but WSL2/Ubuntu's Vulkan driver just emulates via the CPU (no forwarding to the Windows-GPU driver via e.g. Venus).
I could not get it to work natively on WoA. And Vulkan is not expected to accelerate things hugely, because the Snapdragon X's Adreno GPU is not that strong. I'm watching the developments for running llama.cpp with Qualcomm's QNN framework on the NPU and hope this gives better results. At least the NPU should bring power savings vs. running on the CPUs.
Details:
* I think the build-process for Windows arm64 llama.cpp with Vulkan works correctly (including vulkan-shaders-gen)
* With the native Windows Snapdragon X Vulkan driver (by Qualcomm), it crashes with a C++ runtime error in one of the Vulkan calls, and I can't find the reason why with debugging. For details see issue #8455
* It runs out of Vulkan memory when using Windows's Vulkan-to-DirectX12 translation driver as a possible fallback solution.
I gave up on Vulkan llama.cpp for the Snapdragon X because the Snapdragon X's Adreno GPU is not that strong. With the new arm64 CPU optimizations for the Q4_0_4_4 format contributed to llama.cpp by the ARM team, running on the CPUs will probably be faster than running on the Adreno GPU with Vulkan. In my benchmarking, the improvement from Q4_0 to Q4_0_4_4 on the Snapdragon X CPU is nearly the same as the improvement on my M2 MacBook Air going from its 4 CPU p-cores to its 10 GPU cores (Metal) - see discussion #8273. The ARM-team-provided Q4_0_4_4 acceleration does not improve speed on the M-chips, because Metal already accelerates a lot vs. plain C/C++ ARM code. Q4_0_4_4 acceleration also works in VMs/containers, which Vulkan currently does not.
All this compute acceleration primarily affects the compute-intensive prompt processing. Also, the Snapdragon X Plus's asymmetrical CPU cores might not be able to use the full memory bandwidth - similar to the M-series chips, which seem to need all their GPU cores working to utilize the full memory bandwidth. LLM token generation is largely memory-bandwidth bound: the CPU's/GPU's fast vector compute doesn't run at full speed there, because it has to wait for the memory to pump all the model parameters, KV-cache, etc. into the on-chip caches for each and every token generated (prompt processing works in batches of tokens).
I had a Snapdragon X Plus Surface Pro entry model, and am currently exchanging it for a Snapdragon X Elite laptop, so I'm not able to run tests at the moment.
-
@AndreasKunar Hello, where can I find more details about llama.cpp with a QNN backend? I already found some PRs about a QNN backend, but unfortunately none of them got merged.
-
FWIW, I just built on a Surface 11 ARM64 with the public beta Vulkan SDK from https://vulkan.lunarg.com/sdk/home#windows (cmake -B build -DGGML_VULKAN=ON), and it all seems to work fine with Llama-3.2-3B-Instruct-Q4_K_M.gguf, with better performance compared to CPU only.
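For reference, once the Vulkan SDK is installed the whole thing boils down to something like this (the model path is just an example; depending on the CMake generator the binaries may land in build/bin instead of build/bin/Release):
```
# Vulkan build on Windows ARM64, from a developer shell with CMake in PATH
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release

# Quick sanity check with a small model
./build/bin/Release/llama-cli.exe -m ./models/Llama-3.2-3B-Instruct-Q4_K_M.gguf -p "Hello" -n 64
```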
-
With support for the Q4_0_x_x ARM-specific GGUF variants removed, what's a good alternative to re-downloading all model files?
-
Hello, this does not require any requantization into the likes of Q4_0_4_4 / Q4_0_4_8, as we internally cache the packed weights after the first forward pass. Thanks
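In other words, existing Q4_0 files can be kept as-is; the packing into the optimized layout happens at runtime. A quick way to check the resulting speed (the model path is just an example) is to benchmark the plain Q4_0 file:
```
# Benchmark a plain Q4_0 model on the CPU; no separate Q4_0_4_4/Q4_0_4_8
# file is needed, the optimized packing is handled internally at runtime.
./build/bin/llama-bench -m ./models/Llama-3.2-3B-Instruct-Q4_0.gguf -t 12
```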
-
Update + complete re-write of overview/summary on 2024-12-13:
This is intended as a collection of ideas/how-tos for getting llama.cpp to work accelerated on the new Snapdragon X based Copilot+PCs. It is an updated summary/TL;DR of the discussions below.
I currently use a Surface Laptop 7 with a 12-core Snapdragon X Elite and 16GB RAM, so my tests are on this hardware.
Running Accelerated on the CPU (Q4_0 models,...):
Due to the work of llama.cpp contributors, Q4_0 quantizations (and later IQ4_NL + maybe others) were accelerated by special CPU GEMM/GEMV code. There are no longer special switches/builds/re-quantizations required; the different quantization layout is automatically re-packed during Q4_0 model loading on supported hardware. This makes running Q4_0 models extremely fast on the CPU. And because it only runs on the CPU, it also works with nearly the same performance in WSL2, containers and VMs! With this, my Snapdragon X Elite has similar CPU performance to my M2 MacBook Air with its 10 GPU cores (which cannot be used in containers/VMs).
Benchmark results (tables not reproduced here):
- Windows 11 24H2 (Build 26100.2454), 12 CPU cores
- Windows 11 24H2 (Build 26100.2454) / WSL2, Ubuntu 24.04.1 LTS, 12 CPU cores
Running on the GPU:
There are currently two approaches - running via the OpenCL or the Vulkan llama.cpp backend. Both are work in progress.
New llama.cpp OpenCL backend - Windows 11 24H2 (Build 26100.2454), Qualcomm(R) Adreno(TM) X1-85 GPU:
This was newly merged by the contributors in build a76c56f (4325) today, as a first step. Thanks a lot!
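If you want to try it, the build is roughly as follows (the CMake option name follows the new backend; you additionally need the OpenCL headers and ICD loader for the Adreno driver set up, which I'm not covering here):
```
# Experimental OpenCL backend for the Adreno GPU (still work-in-progress)
cmake -B build -DGGML_OPENCL=ON
cmake --build build --config Release
```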
Vulkan - Windows 11 24H2 (Build 26100.2454), 12 CPU cores, 16 GB RAM:
There is now a Windows on ARM Vulkan SDK available for the Snapdragon X, but although llama.cpp compiles/runs with it, it currently (as of Dec 13, 2024) produces unusably low-quality results. Escalated in issue #8455.
Running on the NPU:
There is also some work on DirectML/QNN for the Snapdragon X ongoing by Qualcomm and others, but this is not yet complete.
Any further ideas/help welcome!