numpy with OpenBLAS and blis are on par for gemm. However, this does not use intermediate optimization in numpy's einsum, which can be enabled by passing optimize=True:

Only slightly slower than blis now. However, I am skeptical of the claim that parallelization does not help in inference. The matrix sizes used in the benchmark are fairly typical for inference (e.g. the standard transformer attention matrices are 768x768). Testing with 4 threads (fairly modest on current multi-core SMT CPUs):

Maybe it's worthwhile compiling blis with multi-threading support?
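For illustration, a minimal sketch of the kind of comparison being discussed (the 768x768 float32 sizes and the timing loop are assumptions, not the exact benchmark from this issue; the OpenBLAS thread count can be pinned via the OPENBLAS_NUM_THREADS environment variable before numpy is imported):

```python
import os
# Assumption: pin the backend's thread count before numpy loads its BLAS.
os.environ.setdefault("OPENBLAS_NUM_THREADS", "4")

import time
import numpy as np

A = np.random.rand(768, 768).astype(np.float32)
B = np.random.rand(768, 768).astype(np.float32)

def bench(fn, repeats=1000):
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return time.perf_counter() - start

print("dot             :", bench(lambda: A.dot(B)))
print("einsum          :", bench(lambda: np.einsum("ij,jk->ik", A, B)))
print("einsum optimize :", bench(lambda: np.einsum("ij,jk->ik", A, B, optimize=True)))
```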
Ah, when I wrote that, the optimize flag for einsum wasn't available in numpy. Thanks for pointing it out!

Regarding threading in inference, the big issue is that you can usually parallelise inference at an outer loop, e.g. by starting more processes to work on your data. This is generally the better strategy: outside of benchmarks, inference is usually part of a much larger workload, and it makes sense to parallelise that whole pipeline. In a cloud context, we can also choose to use smaller instances rather than larger ones.
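As a rough illustration of that strategy (my sketch, not code from this thread): keep each worker single-threaded at the BLAS level and fan documents out over processes instead. The infer function below is a hypothetical stand-in for the per-document model call.

```python
import os
# Assumption: cap per-process BLAS threads before numpy/blis are imported
# in the workers, so the only parallelism is across processes.
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["BLIS_NUM_THREADS"] = "1"

from multiprocessing import Pool

def infer(doc):
    # Hypothetical per-document inference; in practice this would run the
    # model's gemm calls on a single thread.
    return len(doc)

if __name__ == "__main__":
    docs = ["some text to process"] * 1000
    with Pool(processes=4) as pool:
        results = pool.map(infer, docs)
    print(sum(results))
```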
The position I take is that no library should launch more threads than the user explicitly requests. OpenBLAS's default behaviour of launching as many threads as possible causes a lot of problems.
That said, I do think it'd be good to compile Blis with threading and leave it disabled until the user calls a function to increase it. But for our current purposes, the single-threading mode is good.
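For reference, users who want to rein in an already-threaded BLAS from Python can do so with the third-party threadpoolctl package (mentioned here as an aside, not something shipped with blis):

```python
from threadpoolctl import threadpool_info, threadpool_limits
import numpy as np

A = np.random.rand(768, 768)
B = np.random.rand(768, 768)

# Inspect which BLAS numpy is linked against and how many threads it uses.
print(threadpool_info())

# Temporarily cap the BLAS thread pool to a single thread for this gemm.
with threadpool_limits(limits=1, user_api="blas"):
    C = A @ B
```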
Agreed. I can't say I like OpenBLAS' default behavior, which often also leads to performance regressions on many-core CPUs. Making it a user-configurable option sounds sensible.