numpy with OpenBLAS and blis are on par for gemm. However, this does not use intermediate optimization in numpy's einsum, which can be enabled by passing optimize=True:

Only slightly slower than blis now. However, I am skeptical of the claim that parallelization does not help in inference. The matrix sizes used in the benchmark are fairly typical for inference (e.g. the standard transformer attention matrices are 768x768). Testing with 4 threads (fairly modest on current multi-core SMT CPUs):

Maybe it's worthwhile compiling blis with multi-threading support?
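For illustration, a minimal sketch of the kind of comparison being discussed (the 768x768 float32 sizes and the timing loop are assumptions, not the exact benchmark from this issue; the OpenBLAS thread count can be pinned via the OPENBLAS_NUM_THREADS environment variable before numpy is imported):

```python
import os
# Assumption: pin the backend's thread count before numpy loads its BLAS.
os.environ.setdefault("OPENBLAS_NUM_THREADS", "4")

import time
import numpy as np

A = np.random.rand(768, 768).astype(np.float32)
B = np.random.rand(768, 768).astype(np.float32)

def bench(fn, repeats=1000):
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return time.perf_counter() - start

print("dot             :", bench(lambda: A.dot(B)))
print("einsum          :", bench(lambda: np.einsum("ij,jk->ik", A, B)))
print("einsum optimize :", bench(lambda: np.einsum("ij,jk->ik", A, B, optimize=True)))
```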
Ah, when I wrote that, the optimize flag for einsum wasn't available in numpy. Thanks for pointing it out!

Regarding threading in inference, the big issue is that you can usually parallelise inference at an outer loop, e.g. by starting more processes to work on your data. This is generally the better strategy: outside of benchmarks, inference is usually part of a much larger workload, and it makes sense to parallelise that whole pipeline. In a cloud context, we can also choose to use smaller instances rather than larger ones.
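As a rough illustration of that strategy (my sketch, not code from this thread): keep each worker single-threaded at the BLAS level and fan documents out over processes instead. The infer function below is a hypothetical stand-in for the per-document model call.

```python
import os
# Assumption: cap per-process BLAS threads before numpy/blis are imported
# in the workers, so the only parallelism is across processes.
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["BLIS_NUM_THREADS"] = "1"

from multiprocessing import Pool

def infer(doc):
    # Hypothetical per-document inference; in practice this would run the
    # model's gemm calls on a single thread.
    return len(doc)

if __name__ == "__main__":
    docs = ["some text to process"] * 1000
    with Pool(processes=4) as pool:
        results = pool.map(infer, docs)
    print(sum(results))
```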
The position I take is that no library should launch more threads than the user explicitly requests. OpenBLAS's default behaviour of launching as many threads as possible causes a lot of problems.
That said, I do think it'd be good to compile Blis with threading and leave it disabled until the user calls a function to increase it. But for our current purposes, the single-threading mode is good.
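For reference, users who want to rein in an already-threaded BLAS from Python can do so with the third-party threadpoolctl package (mentioned here as an aside, not something shipped with blis):

```python
from threadpoolctl import threadpool_info, threadpool_limits
import numpy as np

A = np.random.rand(768, 768)
B = np.random.rand(768, 768)

# Inspect which BLAS numpy is linked against and how many threads it uses.
print(threadpool_info())

# Temporarily cap the BLAS thread pool to a single thread for this gemm.
with threadpool_limits(limits=1, user_api="blas"):
    C = A @ B
```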
Agreed. I can't say I like OpenBLAS' default behavior, which often also leads to performance regressions on many-core CPUs. Making it a user-configurable option sounds sensible.