Accelerating llama.cpp for the Copilot+PCs / Snapdragon X (updated) #8336
Replies: 7 comments 17 replies
-
I have not had time to test with Windows on ARM (and I'm not sure I'm brave enough to venture into such murky waters!). However, I would like to point out that there are currently some Vulkan build-related changes in #8119. If you make some headway with this, it would be great to test against that branch to ensure CMake compatibility. 😊
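If it helps, here is a minimal sketch of testing against that PR branch from an existing llama.cpp checkout (the local branch name pr-8119 is just a label I made up):
```
# Fetch the head of PR #8119 into a local branch and try a Vulkan build against it
git fetch origin pull/8119/head:pr-8119
git checkout pr-8119

cmake -B build -DGGML_VULKAN=1
cmake --build build --config Release -j
```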
-
A quick "how-to" for compiling llama.cpp on Windows 11 22H2 under WSL2 (Ubuntu 24.04, toolchain installed via apt):
Build llama.cpp "normally" first (CPU only, to get a performance baseline) and then with Vulkan (cmake -B build -DGGML_VULKAN=1).
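Spelled out a bit more, the two builds might look roughly like this on a fresh WSL2 Ubuntu 24.04 (the apt package list is my assumption; the LunarG Vulkan SDK can be used instead of the distro packages):
```
# Build prerequisites (package names are an assumption for Ubuntu 24.04)
sudo apt install -y build-essential cmake git libvulkan-dev glslc

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# 1) "Normal" CPU-only build, as a performance baseline
cmake -B build-cpu
cmake --build build-cpu --config Release -j

# 2) Vulkan build
cmake -B build-vulkan -DGGML_VULKAN=1
cmake --build build-vulkan --config Release -j
```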
-
Thanks Andi
I've got the Snapdragon X Elite Surface tablet. Went all in, only to discover this lack of functionality afterwards. The bleeding edge gets messy sometimes. Like you, I'm hoping for an update of arm64 llama.cpp.
cheers
Berowne
________________________________
From: Andreas (Andi) Kunar
Subject: Re: [ggerganov/llama.cpp] Compiling llama.cpp with Vulkan for new Copilot+PCs / Snapdragon X (updated) (Discussion #8336)
Hi @AndreasKunar, did the Vulkan build work?
It works on WSL2/Ubuntu, but WSL2/Ubuntu's Vulkan driver just emulates via the CPU (no forwarding to the Windows-GPU driver via e.g. Venus).
I could not get it to work natively on WoA. And Vulkan is not expected to accelerate things hugely, because the Snapdragon X's Adreno GPU is not that strong. I'm watching the developments for running llama.cpp with Qualcomm's QNN framework on the NPU and hope this gives better results. At least the NPU should bring power savings vs. running on the CPUs.
Details:
* I think the build-process for Windows arm64 llama.cpp with Vulkan works correctly (including vulkan-shaders-gen)
* With the native Windows Snapdragon X Vulkan driver (by Qualcomm), it crashes with a C++ runtime error in one of the Vulkan calls, and I can't find the reason why with debugging. For details see issue #8455
* It runs out of Vulkan memory when using Windows's Vulkan-to-DirectX12 translation driver as a possible fallback solution.
I gave up on Vulkan llama.cpp for the Snapdragon X because the Snapdragon X's Adreno GPU is not that strong. With the new arm64 CPU optimizations for the Q4_0_4_4 format contributed to llama.cpp by the ARM team, running on the CPUs will probably be faster than running on the Adreno GPU with Vulkan. In my benchmarking, the improvement from Q4_0 to Q4_0_4_4 on the Snapdragon X CPU is nearly the same as the improvement on my M2 MacBook Air going from its 4 CPU p-cores to its 10 GPU cores (Metal) - see discussion #8273. The ARM-team-provided Q4_0_4_4 acceleration does not improve speed on the M-chips, because Metal already accelerates a lot vs. plain C/C++ ARM code. Q4_0_4_4 acceleration also works in VMs/containers, which Vulkan currently does not.
All this compute acceleration primarily affects the compute-intensive prompt processing. Also, the Snapdragon X Plus's asymmetrical CPU cores might not be able to use the full memory bandwidth - similar to the M-series chips, which seem to need all their GPU cores working to utilize the full memory bandwidth. LLM token generation is largely memory-bandwidth bound: the CPU's/GPU's fast vector compute doesn't run at full speed there, because it has to wait for the memory to pump all the model parameters, KV-cache, etc. into the on-chip caches for each and every token generated (prompt processing works in batches of tokens).
I had a Snapdragon X Plus Surface Pro entry model, and am currently exchanging it for a Snapdragon X Elite laptop, so I'm not able to run tests at the moment.
-
@AndreasKunar Hello, where can I find more details about llama.cpp with a QNN backend? I already found some PRs about a QNN backend, but unfortunately none of them got merged.
-
FWIW, I just built on a Surface 11 ARM64 with the public beta Vulkan SDK from https://vulkan.lunarg.com/sdk/home#windows (cmake -B build -DGGML_VULKAN=ON), and it all seems to work fine with Llama-3.2-3B-Instruct-Q4_K_M.gguf, with better performance compared to CPU only.
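For reference, once the Vulkan SDK is installed the whole thing boils down to something like this (the model path is just an example; depending on the CMake generator the binaries may land in build/bin instead of build/bin/Release):
```
# Vulkan build on Windows ARM64, from a developer shell with CMake in PATH
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release

# Quick sanity check with a small model
./build/bin/Release/llama-cli.exe -m ./models/Llama-3.2-3B-Instruct-Q4_K_M.gguf -p "Hello" -n 64
```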
-
With support for the Q4_0_x_x ARM-specific GGUF variants removed, what's a good alternative to re-downloading all model files?
-
Hello, this does not require any requantization into the likes of Q4_0_4_4 / Q4_0_4_8, as we internally cache the packed weights after the first forward pass. Thanks
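In other words, existing Q4_0 files can be kept as-is; the packing into the optimized layout happens at runtime. A quick way to check the resulting speed (the model path is just an example) is to benchmark the plain Q4_0 file:
```
# Benchmark a plain Q4_0 model on the CPU; no separate Q4_0_4_4/Q4_0_4_8
# file is needed, the optimized packing is handled internally at runtime.
./build/bin/llama-bench -m ./models/Llama-3.2-3B-Instruct-Q4_0.gguf -t 12
```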
-
Update + complete re-write of overview/summary on 2024-12-13:
This is intended as a collection of ideas/how-tos for getting llama.cpp to work accelerated on the new Snapdragon X based Copilot+PCs. It is an updated summary/TL;DR of the discussions below.
I currently use a Surface Laptop 7 with a 12-core Snapdragon X Elite and 16GB RAM, so my tests are on this hardware.
Running Accelerated on the CPU (Q4_0 models,...):
Due to the work of llama.cpp contributors, Q4_0 quantizations (and later IQ4_NL + maybe others) were accelerated by special CPU GEMM/GEMV code. There are no longer special switches/builds/re-quantizations required; the different quantization layout is automatically re-packed during Q4_0 model loading on supported hardware. This makes running Q4_0 models extremely fast on the CPU. And because it only runs on the CPU, it also works with nearly the same performance in WSL2, containers and VMs! With this, my Snapdragon X Elite has similar CPU performance to my M2 MacBook Air with its 10 GPU cores (which cannot be used in containers/VMs).
Benchmark results (tables not reproduced here):
- Windows 11 24H2 (Build 26100.2454), 12 CPU cores
- Windows 11 24H2 (Build 26100.2454) / WSL2, Ubuntu 24.04.1 LTS, 12 CPU cores
Running on the GPU:
There are currently two approaches - running via the OpenCL or the Vulkan llama.cpp backend. Both are work in progress.
New llama.cpp OpenCL backend - Windows 11 24H2 (Build 26100.2454), Qualcomm(R) Adreno(TM) X1-85 GPU:
This was newly merged by the contributors in build a76c56f (4325) today, as a first step. Thanks a lot!
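If you want to try it, the build is roughly as follows (the CMake option name follows the new backend; you additionally need the OpenCL headers and ICD loader for the Adreno driver set up, which I'm not covering here):
```
# Experimental OpenCL backend for the Adreno GPU (still work-in-progress)
cmake -B build -DGGML_OPENCL=ON
cmake --build build --config Release
```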
Vulkan - Windows 11 24H2 (Build 26100.2454), 12 CPU cores, 16 GB RAM:
There is now a Windows on ARM Vulkan SDK available for the Snapdragon X, but although llama.cpp compiles/runs with it, it currently (as of Dec 13, 2024) produces unusably low-quality results. Escalated in issue #8455.
Running on the NPU:
There is also some work on DirectML/QNN for the Snapdragon X ongoing by Qualcomm and others, but this is not yet complete.
Any further ideas/help welcome!