Will my CPU beat your CPU?
Despite the market cap of a certain company that makes GPUs, CPUs can still be used for computations!
In numerical computing, two metrics are crucial: FLOPS (floating-point operations per second) and memory bandwidth.
For a long time I was looking for a simple way to quickly measure the peak values of these two metrics. Be it for deciding what kind of hardware to rent in the cloud. Be it for cross-checking that whatever tool I’m using (NumPy, PyTorch, MATLAB, Octave, …) is correctly compiled/configured to make the most of the hardware it’s running on. And, of course, totally not for bragging to my peers that my computer can beat up their computer :)
In the end, I implemented my own synthetic benchmark. Behold, the simple benchmark!
Simple benchmark is easy to install and supports Linux and macOS. It just needs a C++ compiler and, on Linux, the OpenBLAS library.
It determines peak memory bandwidth and FLOPS on all kinds of hardware.
To determine memory bandwidth it performs multithreaded copying using two different methods. To determine peak FLOPS it performs matrix multiplication using BLAS (→ Basic Linear Algebra Subprograms) libraries. In particular: OpenBLAS on Linux and the Accelerate framework on macOS. These libraries autodetect the hardware they’re running on and use the best strategy to perform the computation using all cores and SIMD (→ Single instruction, multiple data) extensions.
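For illustration, here is a rough Python sketch of the multithreaded-copy idea. This is not the benchmark’s actual C++ code, and the thread count and block size below are made up; Python adds overhead, so expect somewhat lower numbers than memcopy reports:

import threading
import time
import numpy as np

N_THREADS = 4                   # made-up values, adjust for your machine
BLOCK_BYTES = 512 * 2**20       # 512 MiB per thread

src = [np.ones(BLOCK_BYTES, dtype=np.uint8) for _ in range(N_THREADS)]
dst = [np.empty(BLOCK_BYTES, dtype=np.uint8) for _ in range(N_THREADS)]

def copy_block(i):
    # NumPy typically releases the GIL for large copies, so the threads can overlap.
    np.copyto(dst[i], src[i])

t0 = time.perf_counter()
threads = [threading.Thread(target=copy_block, args=(i,)) for i in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
dt = time.perf_counter() - t0

gib = N_THREADS * BLOCK_BYTES / 2**30
print("copied %.1f GiB in %.3f s -> %.2f GiB/s" % (gib, dt, gib / dt))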
So, let’s run the tests on a PC with a Ryzen 9 7950X (“Raphael”, Zen 4) and 128 GB of DDR5-3600 ECC-RAM.
First, memory bandwidth:
$ ./memcopy
[...]
measuring memory copy performance
threads method GiB/s
[...]
2 memcpy 18.536
3 memcpy 20.722
4 memcpy 21.738
6 memcpy 21.468
8 memcpy 21.197
12 memcpy 20.949
peak value: 21.738 GiB/s
That’s ~22 GiB/s for copying a large memory block (5 GiB) into another block of the same size.
Linux actually does have a tool to measure this: mbw. Running it with the same block size (mbw 5120) measures a peak of 19.0 GiB/s, which roughly confirms that value (it looks like mbw uses an efficient albeit single-threaded way to perform the copy).
Second, let’s look at FLOPS:
$ ./matmul
debug: using OpenBLAS 0.3.21 NO_LAPACKE DYNAMIC_ARCH NO_AFFINITY Cooperlake MAX_THREADS=64
[...]
measuring FP32 matrix multiplication performance
matrix size GFLOPS fp32
1024 921.6
2048 1565.4
4096 1993.3
8192 1992.7
16384 2094.5
peak value: 2094.5 GFLOPS fp32
That’s a cool 2 TFLOPS!
We can see OpenBLAS uses 32 threads. The machine actually has 16 physical
cores. We could run it as OPENBLAS_NUM_THREADS=16 ./matmul to make sure
hyper-threading is not in the way. It isn’t on this machine, but that’s not
always the case.
We also see it detects the hardware as “Cooperlake”. That’s an Intel codename which OpenBLAS uses here to identify a generation of x86_64 SIMD capabilities. In this case it means OpenBLAS picks the right SIMD variant for this generation. That’s fine (even though this is an AMD CPU). We could force a different detection, for example downgrade with OPENBLAS_CORETYPE=Nehalem ./matmul. However, in my experience the auto-detection is good, so this isn’t necessary. If you’re curious, here is the → list of all supported core types.
Can we cross-check this?
Yes! Let’s perform the same matrix multiplication using NumPy.
Note that to multiply two square matrices of size n, one needs to perform 2 * n^3 - n^2 operations: each of the n^2 output elements requires n multiplications and n - 1 additions. This allows computing the FLOPS from the execution time, the same way matmul does. Here is the Python code:
import time
import numpy as np

rng = np.random.default_rng()
size = 8192
big1 = rng.random((size, size), dtype=np.float32)
big2 = rng.random((size, size), dtype=np.float32)
t0 = time.time()
prod = big1 @ big2
t1 = time.time()
print("matrix mul %d x %d took %.3f seconds -> %.1f GFLOPS single" % (
    size, size, t1 - t0, (2 * size ** 3 - size ** 2) / (t1 - t0) / 1.0E9))
Running this using NumPy 2.3.4 gives:
matrix mul 8192 x 8192 took 0.524 seconds -> 2098.0 GFLOPS single
Yes, a peak of 2 TFLOPS fp32 seems to be the real deal on this PC.
In Octave, it’s even easier. This one-liner gives a result in the same ballpark:
n=8192; A=rand(n, 'single'); B=rand(n, 'single'); tic; C=A*B; (2 * n .^ 3 - n .^ 2) ./ toc / 1e9
As a matter of fact, both NumPy and Octave actually use OpenBLAS as well… At least we see that matmul is working as intended.
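If you want to verify this on your own setup, NumPy can report its build configuration, and the optional threadpoolctl package (an extra dependency, not part of NumPy) can inspect the BLAS that is actually loaded at runtime:

import numpy as np
np.show_config()        # build-time BLAS/LAPACK information

# Runtime view: requires "pip install threadpoolctl"
from threadpoolctl import threadpool_info
for pool in threadpool_info():
    print(pool["internal_api"], pool["num_threads"], "threads")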
As an astute reader, you might now ask: how is it possible to feed 2 * 10^12 floating-point numbers of 4 bytes each (8 * 10^12 bytes) into the CPU registers every second if the memory bandwidth is just 20 * 10^9 bytes per second?!
The trick lies in the word peak. Matrix multiplication requires using the same matrix elements many times in the computation, so clever BLAS implementations perform the operation in blocks that will fit into the fast cache hierarchy of the CPU and keep the instruction pipelines busy.
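A back-of-envelope number (my own estimate, not output of the benchmark) makes this plausible: the larger the matrices, the more operations are performed per byte that unavoidably has to cross the memory bus.

# Arithmetic intensity of an n x n fp32 matrix multiplication.
n = 8192
flops = 2 * n**3 - n**2          # operations performed
bytes_min = 3 * n**2 * 4         # A and B read once, C written once (fp32)
print(flops / bytes_min, "flops per byte of unavoidable memory traffic")
# ~1365 flops/byte: at 2000 GFLOPS the unavoidable traffic is only ~1.5 GB/s,
# far below the ~22 GiB/s measured above -- provided the caches deliver the reuse.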
Leaving memory out of the equation, how fast could a Ryzen 9 7950X compute this theoretically? We have 16 cores with → AVX512 registers that operate on vectors of 16 floats each. They support fused multiply-add (FMA), thus doing 2 floating-point operations per vector element per clock. The clock speed is 4.5 GHz. So that gives 16 * 16 * 2 * 4.5 * 10^9 = 2304 * 10^9, thus 2304 GFLOPS.
This shows how exceptionally well optimized OpenBLAS is, to be able to reach 2100 GFLOPS.
Do not expect to get this performance for other problems that cannot make good use of caches the same way matrix multiplication does. A related operation, multiplying a matrix by a vector, for example, is typically bottlenecked by memory bandwidth, not by peak FLOPS.
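To see the difference on your own machine, here is a quick (and unscientific) NumPy timing of matrix-vector products; the matrix is streamed from RAM on every repetition, so the result lands far below the matmul peak:

import time
import numpy as np

rng = np.random.default_rng()
n = 8192
A = rng.random((n, n), dtype=np.float32)   # 256 MiB, much larger than the caches
x = rng.random(n, dtype=np.float32)

reps = 100
t0 = time.time()
for _ in range(reps):
    y = A @ x                              # each repetition reads all of A from memory
t1 = time.time()
flops = reps * (2 * n**2 - n)              # ~2*n^2 operations per matrix-vector product
print("%.1f GFLOPS fp32" % (flops / (t1 - t0) / 1e9))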
Here are some results for machines of different classes.
Physical hardware
| machine type, CPU, RAM | OS and compiler | memcopy | matmul fp32 | note |
|---|---|---|---|---|
| PC Ryzen 9 7950X (2024), 16 cores, 128 GB DDR5-3600 | Debian 12, GCC 12 | 21.738 GiB/s | 2094.5 GFLOPS | |
| Notebook MacBook Pro (2023), M2 Pro 6+4 cores, 32 GB | macOS 15, clang 17 | 82.620 GiB/s | 2169.4 GFLOPS | |
| Notebook MacBook Pro (2019), i9-9880H 8 cores, 32 GB | macOS 15, clang 17 | 10.711 GiB/s | 499.6 GFLOPS | |
| Raspberry Pi 4 (2019) Cortex-A72, 4 cores, 4 GB | Debian 10, GCC 8 | 2.055 GiB/s | 15.4 GFLOPS | ^1 |
Small cloud VMs
| machine type, CPU, RAM | OS and compiler | memcopy | matmul fp32 | note |
|---|---|---|---|---|
| POP2-4C-16G, 4 vCPUs, 16 GB | Debian 13, GCC 14 | 21.981 GiB/s | 227.2 GFLOPS | ^2 |
| m7a.xlarge, 4 vCPUs, 16 GB | Debian 13, GCC 14 | 24.485 GiB/s | 420.0 GFLOPS | ^3 |
| m7i.xlarge, 4 vCPUs, 16 GB | Debian 13, GCC 14 | 21.901 GiB/s | 356.3 GFLOPS | ^4 |
| m8g.xlarge, 4 vCPUs, 16 GB | Debian 13, GCC 14 | 58.072 GiB/s | 159.9 GFLOPS | ^5 |
Large cloud VMs
| machine type, CPU, RAM | OS and compiler | memcopy | matmul fp32 | note |
|---|---|---|---|---|
| POP2-32C-128G, 32 vCPUs, 128 GB | Debian 13, GCC 14 | 65.566 GiB/s | 1511.6 GFLOPS | ^2 |
| m7a.8xlarge, 32 vCPUs, 128 GB | Debian 13, GCC 14 | 80.613 GiB/s | 2722.3 GFLOPS | ^3 |
| m7i.8xlarge, 32 vCPUs, 128 GB | Debian 13, GCC 14 | 76.559 GiB/s | 3113.1 GFLOPS | ^4 |
| m8g.8xlarge, 32 vCPUs, 128 GB | Debian 13, GCC 14 | 93.482 GiB/s | 1335.5 GFLOPS | ^5 |
[^1] memory block size reduced from 5 GiB to 1.25 GiB, maximum matrix size reduced from 16384×16384 to 4096×4096
[^2] cloud VM at Scaleway, 1 vCPU == 1 dedicated hyper-thread of an EPYC 7543 (“Milan”, Zen 3)
[^3] cloud VM at AWS, 1 vCPU == 1 dedicated hyper-thread of an EPYC 9R14 (“Genoa”, Zen 4)
[^4] cloud VM at AWS, 1 vCPU == 1 dedicated hyper-thread of a Xeon Platinum 8488C (“Sapphire Rapids”)
[^5] cloud VM at AWS, 1 vCPU == 1 dedicated core of a Graviton4