
vulkan: multi-row k quants #10846

Open · wants to merge 4 commits into master

Conversation

netrunnereve (Collaborator)

This allows our k-quant mat-vec shaders to process multiple rows at a time, just like mul_mat_vec.comp. It's way faster now, and Q4_K_S is catching up to IQ4_NL and Q4_0 on my RX 470.

At this point we might want to consider merging the separate k-quant files into mul_mat_vec.comp, as they reuse quite a bit of code, and maybe do some templating with ifdefs to choose the correct dequantization function. That's better left to another PR though.
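For context, here is a minimal sketch of what multi-row processing in a mat-vec shader looks like. This is illustrative only, not the PR's code: the real k-quant shaders dequantize superblocks inline where this sketch reads plain floats, and the buffer names, workgroup size, and `NUM_ROWS` specialization constant are assumptions for the example.

```glsl
#version 450
#extension GL_EXT_control_flow_attributes : enable

// Hypothetical sketch: A is shown as plain floats, whereas the real
// k-quant shaders dequantize superblocks inline at this point.
// Assumes the number of rows in A is a multiple of NUM_ROWS.
layout(local_size_x = 32, local_size_y = 1, local_size_z = 1) in;
layout(constant_id = 0) const uint NUM_ROWS = 2; // rows per workgroup (tunable)

layout(binding = 0) readonly  buffer A { float data_a[]; };
layout(binding = 1) readonly  buffer B { float data_b[]; };
layout(binding = 2) writeonly buffer D { float data_d[]; };

layout(push_constant) uniform PC { uint ncols; } pc;

shared float tmpsh[NUM_ROWS][32];

void main() {
    const uint first_row = gl_WorkGroupID.x * NUM_ROWS;
    const uint tid = gl_LocalInvocationID.x;

    float sums[NUM_ROWS];
    [[unroll]] for (uint r = 0; r < NUM_ROWS; r++) sums[r] = 0.0;

    // Stride over the columns; each loaded element of B is reused for
    // every row this workgroup handles, which is where the speedup comes from.
    for (uint col = tid; col < pc.ncols; col += 32) {
        const float bv = data_b[col];
        [[unroll]] for (uint r = 0; r < NUM_ROWS; r++) {
            sums[r] += data_a[(first_row + r) * pc.ncols + col] * bv;
        }
    }

    // Tree-reduce the partial sums in shared memory.
    [[unroll]] for (uint r = 0; r < NUM_ROWS; r++) tmpsh[r][tid] = sums[r];
    barrier();
    [[unroll]] for (uint s = 16; s > 0; s >>= 1) {
        if (tid < s) {
            [[unroll]] for (uint r = 0; r < NUM_ROWS; r++) {
                tmpsh[r][tid] += tmpsh[r][tid + s];
            }
        }
        barrier();
    }
    if (tid == 0) {
        [[unroll]] for (uint r = 0; r < NUM_ROWS; r++) {
            data_d[first_row + r] = tmpsh[r][0];
        }
    }
}
```

The win comes from reusing each loaded element of the input vector across all NUM_ROWS rows, at the cost of more registers and shared memory per workgroup, which is also why the best row count differs between GPUs.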

PR:

| model | size | params | backend | ngl | threads | main_gpu | sm | test | t/s |
| ----- | ---- | ------ | ------- | --- | ------- | -------- | -- | ---- | --- |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | Vulkan | 100 | 8 | 1 | none | tg128 | 21.88 ± 0.00 |
| llama 8B Q3_K - Medium | 3.74 GiB | 8.03 B | Vulkan | 100 | 8 | 1 | none | tg128 | 18.89 ± 0.04 |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | Vulkan | 100 | 8 | 1 | none | tg128 | 27.12 ± 0.12 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | Vulkan | 100 | 8 | 1 | none | tg128 | 22.55 ± 0.00 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | Vulkan | 100 | 8 | 1 | none | tg128 | 20.39 ± 0.00 |
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   4260 runs -   242.40 us/run - 117.44 MFLOP/run - 484.49 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2556 runs -   447.88 us/run - 117.44 MFLOP/run - 262.22 GFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   4260 runs -   238.10 us/run - 117.44 MFLOP/run - 493.23 GFLOPS
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   315.13 us/run - 117.44 MFLOP/run - 372.68 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   368.04 us/run - 117.44 MFLOP/run - 319.10 GFLOPS

Master:

| model | size | params | backend | ngl | threads | test | t/s |
| ----- | ---- | ------ | ------- | --- | ------- | ---- | --- |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | Vulkan | 100 | 8 | tg128 | 17.66 ± 0.07 |
| llama 8B Q3_K - Medium | 3.74 GiB | 8.03 B | Vulkan | 100 | 8 | tg128 | 15.74 ± 0.02 |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | Vulkan | 100 | 8 | tg128 | 20.58 ± 0.00 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | Vulkan | 100 | 8 | tg128 | 16.04 ± 0.01 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | Vulkan | 100 | 8 | tg128 | 17.57 ± 0.06 |
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   391.03 us/run - 117.44 MFLOP/run - 300.33 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2556 runs -   528.63 us/run - 117.44 MFLOP/run - 222.16 GFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   374.13 us/run - 117.44 MFLOP/run - 313.90 GFLOPS
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2556 runs -   472.77 us/run - 117.44 MFLOP/run - 248.41 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2556 runs -   461.75 us/run - 117.44 MFLOP/run - 254.34 GFLOPS

The number of rows used was chosen for my card and may need tuning for different architectures.

netrunnereve requested a review from 0cc4m on Dec 16, 2024.
The github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Dec 16, 2024.
0cc4m (Collaborator) commented Dec 16, 2024

Please rebase to reduce the number of commits.

jeffbolznv (Collaborator) commented

Using multiple rows is a bit slower on RTX 4070, so please change to one row for NVIDIA:

before:

| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     | 1000 |  1 |         tg128 |        114.99 ± 2.51 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     | 1000 |  1 |         tg128 |        118.99 ± 0.70 |

after:

| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     | 1000 |  1 |         tg128 |        114.69 ± 1.86 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     | 1000 |  1 |         tg128 |        115.25 ± 1.42 |

I read through the shader changes and they look good to me.

0cc4m (Collaborator) commented Dec 17, 2024

Intel is being weird again...
Master:

  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  10224 runs -    98.25 us/run - 117.44 MFLOP/run -   1.20 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2556 runs -   425.65 us/run - 117.44 MFLOP/run - 275.91 GFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   8520 runs -   124.92 us/run - 117.44 MFLOP/run - 940.15 GFLOPS
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   6816 runs -   152.10 us/run - 117.44 MFLOP/run - 772.15 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2556 runs -   545.88 us/run - 117.44 MFLOP/run - 215.14 GFLOPS

PR:

  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   7668 runs -   140.19 us/run - 117.44 MFLOP/run - 837.73 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   361.53 us/run - 117.44 MFLOP/run - 324.84 GFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   7668 runs -   141.64 us/run - 117.44 MFLOP/run - 829.16 GFLOPS
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   6816 runs -   160.36 us/run - 117.44 MFLOP/run - 732.34 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   376.82 us/run - 117.44 MFLOP/run - 311.66 GFLOPS

With 1*rm instead of 2*rm (equivalent to rm=1, which was not good for the legacy quants):

  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  11928 runs -    84.55 us/run - 117.44 MFLOP/run -   1.39 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   322.00 us/run - 117.44 MFLOP/run - 364.73 GFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  10224 runs -    99.47 us/run - 117.44 MFLOP/run -   1.18 TFLOPS
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   7668 runs -   138.77 us/run - 117.44 MFLOP/run - 846.27 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2556 runs -   565.67 us/run - 117.44 MFLOP/run - 207.61 GFLOPS

It seems to prefer fewer rows for q2_K through q5_K and more rows for q6_K (but performance is bad there either way). I tested this with Q4_K_S and Q6_K models and the results confirm these findings.

0cc4m (Collaborator) commented Dec 17, 2024

I can also confirm that 1*rm (fewer rows) is better on Nvidia RTX 3090.

The PR looks good, it just needs some changes to the selection logic. It's probably not worth complicating it for Intel Q6_K, so let's just stick to fewer rows for k-quants on Nvidia and Intel. The merge conflict needs to be fixed, too.

Edit: Also looks good on AMD RX 6800 XT.

netrunnereve (Collaborator, Author) commented

> It's probably not worth complicating it for Intel Q6_K, so let's just stick to fewer rows for k-quants on Nvidia and Intel.

Considering there's a 50% difference in Q6_K performance on Intel, I've added a separate variable for it, along with Q8_0, which is also a special case. If there are other quants that don't work well with certain GPUs we can add them to the list as well.

BTW, have you checked the assembly dump for Intel? I have a feeling it doesn't like certain memory access patterns and splits them up into a bunch of small loads. Maybe you could try loading each superblock into shared memory before doing the actual dequantization.
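For illustration, here is a minimal hypothetical fragment of that staging idea. None of this is code from the PR: the names `data_a`, `block_words`, and `stage_superblock` are made up, and the 36-word figure assumes Q4_K's 144-byte superblock (2 fp16 scales + 12 scale bytes + 128 quant bytes).

```glsl
// Hypothetical sketch of the shared-memory staging idea.
const uint WORDS_PER_SUPERBLOCK = 36; // 144 bytes / 4 for Q4_K (assumed layout)

layout(binding = 0) readonly buffer AQ { uint data_a[]; }; // raw quantized words

shared uint block_words[WORDS_PER_SUPERBLOCK];

void stage_superblock(const uint block_idx, const uint tid) {
    // The whole workgroup cooperatively copies one superblock with wide,
    // coalesced 32-bit loads, so the later dequantization reads shared
    // memory instead of issuing many small scattered global loads.
    for (uint w = tid; w < WORDS_PER_SUPERBLOCK; w += gl_WorkGroupSize.x) {
        block_words[w] = data_a[block_idx * WORDS_PER_SUPERBLOCK + w];
    }
    barrier(); // make all words visible before anyone starts unpacking
}
```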

netrunnereve (Collaborator, Author) commented

> Edit: Also looks good on AMD RX 6800 XT.

Does that mean it works best with two rows per shader?

0cc4m (Collaborator) commented Dec 19, 2024

>> It's probably not worth complicating it for Intel Q6_K, so let's just stick to fewer rows for k-quants on Nvidia and Intel.
>
> Considering there's a 50% difference in Q6_K performance on Intel, I've added a separate variable for it, along with Q8_0, which is also a special case. If there are other quants that don't work well with certain GPUs we can add them to the list as well.

It's a big difference, but performance is marginal either way. I would prefer not to make it more complex because it increases the number of parameters we need to hand-tune. Maybe it's time for an optimizer.

> BTW, have you checked the assembly dump for Intel? I have a feeling it doesn't like certain memory access patterns and splits them up into a bunch of small loads. Maybe you could try loading each superblock into shared memory before doing the actual dequantization.

No, I don't have that much time to devote to Intel.

>> Edit: Also looks good on AMD RX 6800 XT.
>
> Does that mean it works best with two rows per shader?

I meant the PR got optimal performance on it already.
