TL;DR

  • If you are on a Mac, use vecLib
  • Otherwise, if you have an Intel CPU, use OpenBLAS or MKL
  • If you have an AMD CPU, use OpenBLAS or BLIS + libflame
  • For all other architectures such as ARM, use ATLAS
  • Avoid the reference BLAS implementation from Netlib
  • Avoid uBLAS

Motivation

I’ve been vaguely aware of BLAS and LAPACK forever, but I recently took the time to read up on them properly. Below is a summary of my findings.

Background

LAPACK is a standard library – written in Fortran – for common linear algebra tasks such as solving systems of linear equations and least squares problems, finding eigenvalues and singular values, and computing matrix decompositions including LU, Cholesky, QR and SVD (see this page for a detailed list). LAPACK underpins a great deal of scientific computing, engineering, statistics and financial software, including MATLAB, R, NumPy and GSL.

LAPACK is open source software and can be downloaded from Netlib, the numerical software repository, as well as from GitHub. It is licensed under the permissive “3-clause BSD license”. Here is a plain English explanation of this license. A C interface called LAPACKE is included with LAPACK. To maximize vectorization while abstracting away architectural differences, LAPACK algorithms are written in terms of fundamental “building blocks” such as matrix-matrix and matrix-vector multiplication. LAPACK specifies a Fortran interface for such operations, called BLAS. The equivalent C interface is called CBLAS.
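For concreteness, here is a minimal sketch of what calling LAPACK through LAPACKE looks like: solving a small linear system Ax = b with the LU-based driver dgesv. The header name lapacke.h is standard, but the link flags in the comment are assumptions that vary by system and distribution.

```c
/* Solve the 3x3 system A x = b via LAPACK's LU-based driver dgesv,
 * called through the LAPACKE C interface.
 * Assumed build line (flags vary by system):
 *   cc solve.c -llapacke -llapack -lblas -o solve
 */
#include <stdio.h>
#include <lapacke.h>

int main(void) {
    /* A in row-major order; LAPACKE converts to LAPACK's
     * column-major convention internally. */
    double A[3 * 3] = {
        2.0, 1.0, 1.0,
        1.0, 3.0, 2.0,
        1.0, 0.0, 0.0,
    };
    double b[3] = {4.0, 5.0, 6.0}; /* right-hand side; overwritten with x */
    lapack_int ipiv[3];            /* pivot indices from the LU factorization */

    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 3, 1, A, 3, ipiv, b, 1);
    if (info != 0) {
        fprintf(stderr, "dgesv failed: info = %d\n", (int)info);
        return 1;
    }
    printf("x = (%g, %g, %g)\n", b[0], b[1], b[2]); /* expect (6, 15, -23) */
    return 0;
}
```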

LAPACK is designed to efficiently handle dense matrices (mostly non-zero entries) and banded matrices (e.g. tridiagonal). General sparse matrices (whose entries are mostly zero), which often come up in Machine Learning applications, require specialized algorithms not included in the standard LAPACK distribution.
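As an illustration of the banded case, LAPACK’s dgtsv solves a tridiagonal system directly from its three diagonals, without ever forming the full n-by-n matrix. A minimal sketch via LAPACKE, under the same header and linking assumptions as above:

```c
/* Solve a 4x4 tridiagonal system (2 on the diagonal, -1 on the
 * off-diagonals) passing only the three diagonals to dgtsv. */
#include <stdio.h>
#include <lapacke.h>

int main(void) {
    double dl[3] = {-1.0, -1.0, -1.0};       /* subdiagonal   (n-1 entries) */
    double d[4]  = { 2.0,  2.0,  2.0,  2.0}; /* main diagonal (n entries)   */
    double du[3] = {-1.0, -1.0, -1.0};       /* superdiagonal (n-1 entries) */
    double b[4]  = { 1.0,  0.0,  0.0,  1.0}; /* RHS; overwritten with x     */

    lapack_int info = LAPACKE_dgtsv(LAPACK_ROW_MAJOR, 4, 1, dl, d, du, b, 1);
    if (info != 0) {
        fprintf(stderr, "dgtsv failed: info = %d\n", (int)info);
        return 1;
    }
    printf("x = (%g, %g, %g, %g)\n", b[0], b[1], b[2], b[3]); /* (1,1,1,1) */
    return 0;
}
```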

Note that BLAS is an interface specification, not an implementation. While a Fortran reference implementation – also licensed under the 3-clause BSD license – is available from Netlib and as part of the LAPACK GitHub repo, its use in production is strongly discouraged. This is because LAPACK’s performance is primarily driven by how fast the underlying BLAS is, and the reference implementation is not especially fast. Instead, the idea is for system vendors to provide optimized BLAS implementations that fully exploit whatever support for parallelism their architectures provide (such as SIMD instruction sets including MMX, SSE and AVX). If such a vendor-supplied BLAS implementation exists for your system, that should be your first choice.
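One practical consequence of BLAS being a specification is that application code is implementation-agnostic: the same CBLAS call runs against the reference BLAS, OpenBLAS, MKL or any other conforming implementation, and only the link line changes. A minimal sketch (the cblas.h header is standard; the link flags in the comment are assumptions that vary by system):

```c
/* C = alpha*A*B + beta*C for small row-major matrices. The code is
 * identical for every BLAS; only the (assumed) link line differs:
 *   cc gemm.c -lblas       # Netlib reference BLAS
 *   cc gemm.c -lopenblas   # OpenBLAS
 */
#include <stdio.h>
#include <cblas.h>

int main(void) {
    double A[2 * 3] = {1, 2, 3,
                       4, 5, 6};        /* 2x3 */
    double B[3 * 2] = {7,  8,
                       9,  10,
                       11, 12};         /* 3x2 */
    double C[2 * 2] = {0, 0, 0, 0};     /* 2x2 result */

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 3,           /* M, N, K       */
                1.0, A, 3,         /* alpha, A, lda */
                B, 2,              /* B, ldb        */
                0.0, C, 2);        /* beta, C, ldc  */

    printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]); /* 58 64 / 139 154 */
    return 0;
}
```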

In practice, vendors often treat LAPACK itself as a reference implementation and re-implement portions of it to boost performance, as well as providing various extensions. In particular, many vendors supply routines for efficient handling of sparse matrices. See this page for more information.

For the historically minded, the following “oral history” of BLAS may prove of some interest.

Select Vendor Implementations

  • MKL (Math Kernel Library) is Intel’s BLAS/LAPACK implementation, heavily optimized for Intel processors. Long a commercial product, it is now available free of charge.
  • vecLib is Apple’s BLAS/LAPACK implementation for macOS, shipped as part of the Accelerate framework.
  • For its processors, AMD provides optimized builds of BLIS (a BLAS-like framework) and libflame (a LAPACK replacement), both developed at the University of Texas at Austin.

Other BLAS Implementations

  • ATLAS is a portable yet still quite efficient BLAS implementation: rather than being hand-tuned for one architecture, it automatically tunes itself to the host machine at build time (hence the name, Automatically Tuned Linear Algebra Software). It is generally slower than a BLAS tuned for a particular architecture such as MKL, but much faster than Netlib’s reference BLAS implementation. ATLAS uses the same 3-clause BSD license as LAPACK. It can be obtained from SourceForge.
  • GotoBLAS was a BLAS implementation hand-optimized for Intel and AMD processors. It is no longer maintained, but can still be downloaded here. GotoBLAS is open source software, licensed under the BSD license.
  • OpenBLAS is a fork of GotoBLAS which is still under active development. It is available on GitHub, licensed under the “3-clause BSD license”. OpenBLAS claims to be about as fast as MKL, which for many years was only available commercially.
  • uBLAS is a C++ template class library providing a portable BLAS implementation. It is distributed as part of Boost. You can read more about the rationale behind uBLAS here. Some tips for getting the most out of uBLAS can be found here. There is a FAQ here and another one here. Like ATLAS, uBLAS prioritizes portability over performance, but the “abstraction penalty” of its template-based design makes it slower than ATLAS, and slower still than BLAS implementations tuned for particular architectures such as OpenBLAS, MKL or BLIS.

So Which BLAS Should I Use?

Based on my research so far, I would do the following:

  • If you are on a Mac, use vecLib
  • Otherwise, if you have an Intel CPU, use OpenBLAS or MKL
  • If you have an AMD CPU, use OpenBLAS or BLIS + libflame
  • For all other architectures such as ARM, use ATLAS
  • Avoid the reference BLAS implementation from Netlib
  • Avoid uBLAS

These are just rules of thumb meant to maximize performance without sacrificing redistribution rights. I suggested using vecLib on Macs mostly as a matter of convenience; I have no idea how its performance compares to, say, OpenBLAS, but it would be interesting to check. Similarly, it’s not clear to me whether MKL and BLIS outperform OpenBLAS on Intel and AMD, respectively, or whether OpenBLAS outperforms ATLAS on those architectures.
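For what it’s worth, the check is easy to sketch: time a large dgemm, convert to GFLOP/s, and rebuild the same program against each candidate BLAS. The snippet below is a crude single-run measurement, not a careful benchmark. Assumptions: a CBLAS header on the include path and a POSIX clock_gettime; on a Mac, vecLib is reached through the Accelerate framework, so the include and link lines differ.

```c
/* Time one large dgemm and report GFLOP/s. Rebuild against each BLAS
 * (e.g. -lopenblas, -lblas, MKL's link line) and compare the numbers.
 * Single-run wall-clock timing is crude, but enough to separate
 * implementations that differ by integer factors. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

int main(void) {
    const int n = 2000;                          /* matrix dimension */
    double *A = malloc(sizeof(double) * n * n);
    double *B = malloc(sizeof(double) * n * n);
    double *C = malloc(sizeof(double) * n * n);
    if (!A || !B || !C) return 1;
    for (int i = 0; i < n * n; i++) {            /* arbitrary dense data */
        A[i] = (double)rand() / RAND_MAX;
        B[i] = (double)rand() / RAND_MAX;
        C[i] = 0.0;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double gflops = 2.0 * n * n * n / secs / 1e9; /* dgemm is ~2n^3 flops */
    printf("n=%d: %.2f s, %.1f GFLOP/s\n", n, secs, gflops);

    free(A); free(B); free(C);
    return 0;
}
```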