Which BLAS and LAPACK should I use?
TL;DR
- If you are on a Mac, use vecLib
- Otherwise, if you have an Intel CPU, use OpenBLAS or MKL
- If you have an AMD CPU, use OpenBLAS or BLIS + libflame
- For all other architectures such as ARM, use ATLAS
- Avoid the reference BLAS implementation from Netlib
- Avoid uBLAS
Motivation
I’ve been vaguely aware of BLAS and LAPACK forever, but lately I finally decided to take the time to read up on them properly. Below is a summary of my findings.
Background
LAPACK is a standard library – written in Fortran – for common linear algebra tasks such as solving systems of linear equations and least squares problems, finding eigenvalues and singular values, and factorizing matrices into decompositions including LU, Cholesky, QR and SVD (see this page for a detailed list). LAPACK underpins much scientific computing, engineering, statistics and financial software, including MATLAB, R, NumPy and GSL.
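The tasks listed above can be exercised from Python, since NumPy and SciPy delegate them to whatever LAPACK they were built against. This is a minimal sketch (the LAPACK routine names in the comments are the typical backing routines, which may vary by build):

```python
import numpy as np
from scipy import linalg

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

x = np.linalg.solve(A, b)           # system of linear equations (gesv)
w = np.linalg.eigvalsh(A)           # eigenvalues of a symmetric matrix (syevd)
U, s, Vt = np.linalg.svd(A)         # singular value decomposition (gesdd)
L = linalg.cholesky(A, lower=True)  # Cholesky factorization (potrf)
P, Lm, Um = linalg.lu(A)            # LU factorization (getrf)
Q, R = np.linalg.qr(A)              # QR factorization (geqrf)

# The factorizations reproduce A:
assert np.allclose(A @ x, b)
assert np.allclose(L @ L.T, A)
```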
LAPACK is open source software and can be downloaded from Netlib, the numerical software repository, as well as from GitHub. It is licensed under the permissive “3-clause BSD license”. Here is a plain English explanation of this license. A C interface called LAPACKE is included with LAPACK. To maximize vectorization while abstracting away architectural differences, LAPACK algorithms are written in terms of fundamental “building blocks” such as matrix-matrix and matrix-vector multiplication. LAPACK specifies a Fortran interface for such operations, called BLAS. The equivalent C interface is called CBLAS.
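The BLAS "building blocks" can also be called directly. As a sketch, SciPy exposes low-level wrappers around the underlying BLAS, including the two operations just mentioned: `dgemm` (matrix-matrix multiply, level-3 BLAS) and `dgemv` (matrix-vector multiply, level-2 BLAS):

```python
import numpy as np
from scipy.linalg.blas import dgemm, dgemv

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])
x = np.array([1.0, 1.0])

C = dgemm(alpha=1.0, a=A, b=B)  # C = alpha * A @ B
y = dgemv(alpha=1.0, a=A, x=x)  # y = alpha * A @ x

# Both match the equivalent NumPy operations:
assert np.allclose(C, A @ B)
assert np.allclose(y, A @ x)
```

In practice you would rarely call these wrappers by hand; NumPy routes its own matrix products through the same BLAS.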
LAPACK is designed to efficiently handle dense matrices (mostly non-zero entries) and banded matrices (e.g. tridiagonal). General sparse matrices (whose entries are mostly zero), which often come up in Machine Learning applications, require specialized algorithms not included in the standard LAPACK distribution.
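To illustrate the dense/sparse split, here is a sketch contrasting SciPy's sparse direct solver (a specialized algorithm outside standard LAPACK) with the dense LAPACK-backed solve on the same system:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import spsolve

A_dense = np.array([[4.0, 0.0, 1.0],
                    [0.0, 3.0, 0.0],
                    [1.0, 0.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

x_sparse = spsolve(csr_matrix(A_dense), b)  # specialized sparse direct solver
x_dense = np.linalg.solve(A_dense, b)       # dense LAPACK solve (gesv)

# Both solvers agree on this small example:
assert np.allclose(x_sparse, x_dense)
```

For a matrix this small the dense solver is fine; the sparse path pays off when the matrix is large and mostly zero, as is common in Machine Learning.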
Note that BLAS is an interface specification, not an implementation. While a Fortran reference implementation – also licensed under the 3-clause BSD license – is available from Netlib and as part of the LAPACK GitHub repo, its use in production is strongly discouraged. This is because LAPACK’s performance is primarily driven by how fast the underlying BLAS is, and the reference implementation is not especially fast. Instead, the idea is for system vendors to provide optimized BLAS implementations that fully exploit whatever support for parallelism their architectures provide (such as SIMD instruction sets including MMX, SSE and AVX). If such a vendor-supplied BLAS implementation exists for your system, that should be your first choice.
In practice, vendors often treat LAPACK itself as a reference implementation and re-implement portions of it to boost performance, as well as providing various extensions. In particular, many vendors supply routines for efficient handling of sparse matrices. See this page for more information.
For the historically minded, the following “oral history” of BLAS may prove of some interest.
Select Vendor Implementations
- A highly-tuned BLAS implementation is included with Intel’s MKL, which also provides optimized implementations of various LAPACK routines. MKL is licensed under a permissive license called the ISSL, which you can read about here. Note, however, that Intel does not make the code for the linear algebra portion of the MKL freely available. On the other hand, the Deep Neural Nets-specific portion is on GitHub, licensed under version 2.0 of the permissive Apache license. Additional background on the MKL is available on this page.
- AMD has their own BLAS implementation called BLIS, as well as their own LAPACK-like library called libflame. Both BLIS and libflame are licensed under the “3-clause BSD license”, just like BLAS and LAPACK (see here and here). For an in-depth look at BLIS, check out these papers. The libflame reference manual can be found here.
- Apple provides an optimized BLAS implementation as part of its vecLib library, which also includes an optimized LAPACK. vecLib is part of Apple’s Accelerate framework.
- Arm Performance Libraries provides optimized BLAS and LAPACK implementations for the ARM architecture. You will need a valid Arm Allinea Studio license to download it.
- Cray provides proprietary BLAS and LAPACK implementations for their supercomputers as part of CMSL / LibSci. You can read more about LibSci here.
Other BLAS Implementations
- ATLAS is a portable (i.e. architecture-independent) yet still quite efficient BLAS implementation. It is generally slower than a BLAS tuned for a particular architecture such as MKL, but much faster than Netlib’s reference BLAS implementation. ATLAS uses the same 3-clause BSD license as LAPACK. It can be obtained from SourceForge.
- GotoBLAS was a BLAS implementation hand-optimized for Intel and AMD processors. It is no longer maintained, but can still be downloaded here. GotoBLAS is open source software, licensed under the BSD license.
- OpenBLAS is a fork of GotoBLAS which is still under active development. It is available on GitHub, licensed under the “3 clause BSD license”. OpenBLAS claims to be about as fast as MKL, which for many years was only available commercially.
- uBLAS is a C++ template class library providing a portable BLAS implementation. It is distributed as part of Boost. You can read more about the rationale behind uBLAS here. Some tips for getting the most out of uBLAS can be found here. There is a FAQ here and another one here. Like ATLAS, uBLAS prioritizes portability over performance, but it is slower than ATLAS due to the “abstraction penalty”, and thus even slower than BLAS implementations tuned for particular architectures such as OpenBLAS, MKL or BLIS.
So Which BLAS should I Use?
Based on my research so far, I would do the following:
- If you are on a Mac, use vecLib
- Otherwise, if you have an Intel CPU, use OpenBLAS or MKL
- If you have an AMD CPU, use OpenBLAS or BLIS + libflame
- For all other architectures such as ARM, use ATLAS
- Avoid the reference BLAS implementation from Netlib
- Avoid uBLAS
These are just rules of thumb meant to maximize performance without sacrificing redistribution rights. I suggested using vecLib on Macs mostly as a matter of convenience; I have no idea how its performance compares to, say, OpenBLAS, but it would be interesting to check. Similarly, it’s not clear to me whether MKL and BLIS outperform OpenBLAS on Intel and AMD, respectively, or whether OpenBLAS outperforms ATLAS on those architectures.
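Whichever implementation you pick, it is worth confirming which BLAS/LAPACK your numerical software actually links against. As a sketch, NumPy can report the libraries it was built with:

```python
import numpy as np

# Prints the build configuration, including which BLAS/LAPACK
# libraries (e.g. OpenBLAS, MKL, Accelerate) NumPy links against.
np.show_config()
```

The exact output format varies between NumPy versions, but the BLAS/LAPACK section is what tells you whether you are getting an optimized implementation or the slow reference one.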