-
Notifications
You must be signed in to change notification settings - Fork 10k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
vulkan: multi-row k quants #10846
base: master
Are you sure you want to change the base?
vulkan: multi-row k quants #10846
Conversation
Please rebase to reduce the number of commits. |
943185e
to
c656d92
Compare
Multiple rows is a bit slower on RTX 4070, so please change to one row for NVIDIA:
I read through the shader changes and they look good to me. |
Intel is being weird again..
PR:
With
It seems to prefer fewer rows on q2_k to q5_k and more rows on q6_k (but performance is bad there either way). I tested this with models for Q4_K_S and Q6_K and it confirms the findings. |
I can also confirm that 1*rm (fewer rows) is better on Nvidia RTX 3090. The PR looks good, it just needs some changes to the selection logic. It's probably not worth complicating for Intel Q6_K, so let's just stick to fewer rows for k-quants on Nvidia and Intel. The merge conflict needs to be fixed, too. Edit: Also looks good on AMD RX 6800 XT. |
Considering there's a 50% difference in Q6_K performance for Intel I've added a separate variable for it, along with Q8_0 which is also a special case. If there are other quants that don't work well with certain GPUs we can also add them to the list. BTW have you checked the assembly dump for Intel? I have a feeling that it doesn't like certain memory access patterns and splits those up into a bunch of small loads. Maybe you could try loading each superblock into shared memory first before doing the actual dequantizing. |
Does that mean it works best with two rows per shader? |
It's a big difference, but performance is marginal either way. I would prefer not making it more complex cause it increases the number of parameters we need to hand-tune. Maybe it's time for an optimizer.
No, I don't have that much time to devote to Intel.
I meant the PR got optimal performance on it already. |
This allows our k quant mat vec shaders to process multiple rows at a time just like
mul_mat_vec.comp
. It's way faster now and Q4_K_S is catching up to IQ4_NL and Q4_0 on my RX 470.At this point we might want to consider merging the separate k quant files into
mul_mat_vec.comp
as they're reusing quite a bit of code, and maybe do some templating using ifdefs to choose the correct dequantization function. That's better left to another PR though.PR:
Master:
The number of rows used was chosen for my card and may need tuning for different architectures.