Will my CPU beat your CPU?
Despite the market cap of a certain company that makes GPUs, CPUs can still be used for computations!
In numerical computing, two metrics are crucial: FLOPS (floating-point operations per second) and memory bandwidth.
For a long time I was looking for a simple way to quickly measure the peak values of these two metrics. Be it for deciding what kind of hardware to rent in the cloud. Be it for cross-checking that whatever tool I’m using (NumPy, PyTorch, MATLAB, Octave, …) is correctly compiled/configured to make the most of the hardware it’s running on. And, of course, totally not for bragging to my peers that my computer can beat up their computer :)
In the end, I implemented my own synthetic benchmark. Behold, the simple benchmark!
Simple benchmark is easy to install and supports Linux and macOS. It just needs a C++ compiler and, on Linux, the OpenBLAS library.
It determines peak memory bandwidth and FLOPS on all kinds of hardware.
To determine memory bandwidth it performs multithreaded copying using two different methods. To determine peak FLOPS it performs matrix multiplication using BLAS (→ Basic Linear Algebra Subprograms) libraries. In particular: OpenBLAS on Linux and the Accelerate framework on macOS. These libraries autodetect the hardware they’re running on and use the best strategy to perform the computation using all cores and SIMD (→ Single instruction, multiple data) extensions.
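For illustration, here is a rough Python sketch of the multithreaded-copy idea. This is not the benchmark’s actual C++ code, and the thread count and block size below are made up; Python adds overhead, so expect somewhat lower numbers than memcopy reports:

import threading
import time
import numpy as np

N_THREADS = 4                   # made-up values, adjust for your machine
BLOCK_BYTES = 512 * 2**20       # 512 MiB per thread

src = [np.ones(BLOCK_BYTES, dtype=np.uint8) for _ in range(N_THREADS)]
dst = [np.empty(BLOCK_BYTES, dtype=np.uint8) for _ in range(N_THREADS)]

def copy_block(i):
    # NumPy typically releases the GIL for large copies, so the threads can overlap.
    np.copyto(dst[i], src[i])

t0 = time.perf_counter()
threads = [threading.Thread(target=copy_block, args=(i,)) for i in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
dt = time.perf_counter() - t0

gib = N_THREADS * BLOCK_BYTES / 2**30
print("copied %.1f GiB in %.3f s -> %.2f GiB/s" % (gib, dt, gib / dt))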
So, let’s run the tests on a PC with a Ryzen 9 7950X (“Raphael”, Zen 4) and 128 GB of DDR5-3600 ECC-RAM.
First, memory bandwidth:
$ ./memcopy
[...]
measuring memory copy performance
threads method GiB/s
[...]
2 memcpy 18.536
3 memcpy 20.722
4 memcpy 21.738
6 memcpy 21.468
8 memcpy 21.197
12 memcpy 20.949
peak value: 21.738 GiB/s
That’s ~22 GiB/s for copying a large memory block (5 GiB) into another block of the same size.
Linux actually does have a tool to measure this: mbw. Running it with the same block size (mbw 5120) measures a peak of 19.0 GiB/s, which roughly confirms that value (it looks like mbw uses an efficient albeit single-threaded way to perform the copy).
Second, let’s look at FLOPS:
$ ./matmul
debug: using OpenBLAS 0.3.21 NO_LAPACKE DYNAMIC_ARCH NO_AFFINITY Cooperlake MAX_THREADS=64
[...]
measuring FP32 matrix multiplication performance
matrix size GFLOPS fp32
1024 921.6
2048 1565.4
4096 1993.3
8192 1992.7
16384 2094.5
peak value: 2094.5 GFLOPS fp32
That’s a cool 2 TFLOPS!
We can see OpenBLAS uses 32 threads. The machine actually has 16 physical
cores. We could run it as OPENBLAS_NUM_THREADS=16 ./matmul to make sure
hyper-threading is not in the way. It isn’t on this machine, but that’s not
always the case.
We also see it detects the hardware as “Cooperlake”. That’s an Intel codename which OpenBLAS uses here to identify a generation of x86_64 SIMD capabilities. In this case it means OpenBLAS picks the right SIMD variant for this generation. That’s fine (even though this is an AMD CPU). We could force a different detection, for example downgrade with OPENBLAS_CORETYPE=Nehalem ./matmul. However, in my experience the auto-detection is good, so this isn’t necessary. If you’re curious, here is the → list of all supported core types.
Can we cross-check this?
Yes! Let’s perform the same matrix multiplication using NumPy.
Note that to multiply two square matrices of size n, one needs to perform 2 * n^3 - n^2 operations: each of the n^2 output elements requires n multiplications and n - 1 additions. This allows computing the FLOPS from the execution time, the same way matmul does. Here is the Python code:
import time
import numpy as np

rng = np.random.default_rng()
size = 8192
big1 = rng.random((size, size), dtype=np.float32)
big2 = rng.random((size, size), dtype=np.float32)
t0 = time.time()
prod = big1 @ big2
t1 = time.time()
print("matrix mul %d x %d took %.3f seconds -> %.1f GFLOPS single" % (
    size, size, t1 - t0, (2 * size ** 3 - size ** 2) / (t1 - t0) / 1.0E9))
Running this using NumPy 2.3.4 gives:
matrix mul 8192 x 8192 took 0.524 seconds -> 2098.0 GFLOPS single
Yes, a peak of 2 TFLOPS fp32 seems to be the real deal on this PC.
In Octave, it’s even easier. This one-liner gives a result in the same ballpark:
n=8192; A=rand(n, 'single'); B=rand(n, 'single'); tic; C=A*B; (2 * n .^ 3 - n .^ 2) ./ toc / 1e9
As a matter of fact, both NumPy and Octave actually use OpenBLAS as well… At least we see that matmul is working as intended.
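If you want to verify this on your own setup, NumPy can report its build configuration, and the optional threadpoolctl package (an extra dependency, not part of NumPy) can inspect the BLAS that is actually loaded at runtime:

import numpy as np
np.show_config()        # build-time BLAS/LAPACK information

# Runtime view: requires "pip install threadpoolctl"
from threadpoolctl import threadpool_info
for pool in threadpool_info():
    print(pool["internal_api"], pool["num_threads"], "threads")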
As an astute reader, you might now ask: how is it possible to feed 2 * 10^12 floating-point numbers of 4 bytes each (8 * 10^12 bytes) into the CPU registers every second if the memory bandwidth is just 20 * 10^9 bytes per second?!
The trick lies in the word peak. Matrix multiplication requires using the same matrix elements many times in the computation, so clever BLAS implementations perform the operation in blocks that will fit into the fast cache hierarchy of the CPU and keep the instruction pipelines busy.
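A back-of-envelope number (my own estimate, not output of the benchmark) makes this plausible: the larger the matrices, the more operations are performed per byte that unavoidably has to cross the memory bus.

# Arithmetic intensity of an n x n fp32 matrix multiplication.
n = 8192
flops = 2 * n**3 - n**2          # operations performed
bytes_min = 3 * n**2 * 4         # A and B read once, C written once (fp32)
print(flops / bytes_min, "flops per byte of unavoidable memory traffic")
# ~1365 flops/byte: at 2000 GFLOPS the unavoidable traffic is only ~1.5 GB/s,
# far below the ~22 GiB/s measured above -- provided the caches deliver the reuse.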
Leaving memory out of the equation, how fast could a Ryzen 9 7950X compute this theoretically? We have 16 cores with → AVX512 registers that operate on vectors of 16 floats each. They support fused multiply-add (FMA), thus doing 2 floating-point operations per vector element per clock. The clock speed is 4.5 GHz. So that gives 16 * 16 * 2 * 4.5 * 10^9 = 2304 * 10^9, thus 2304 GFLOPS.
This shows how exceptionally well optimized OpenBLAS is, to be able to reach 2100 GFLOPS.
Do not expect to get this performance for other problems that cannot make good use of caches the same way matrix multiplication does. A related operation, multiplying a matrix by a vector, for example, is typically bottlenecked by memory bandwidth, not by peak FLOPS.
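To see the difference on your own machine, here is a quick (and unscientific) NumPy timing of matrix-vector products; the matrix is streamed from RAM on every repetition, so the result lands far below the matmul peak:

import time
import numpy as np

rng = np.random.default_rng()
n = 8192
A = rng.random((n, n), dtype=np.float32)   # 256 MiB, much larger than the caches
x = rng.random(n, dtype=np.float32)

reps = 100
t0 = time.time()
for _ in range(reps):
    y = A @ x                              # each repetition reads all of A from memory
t1 = time.time()
flops = reps * (2 * n**2 - n)              # ~2*n^2 operations per matrix-vector product
print("%.1f GFLOPS fp32" % (flops / (t1 - t0) / 1e9))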
Here are some results for machines of different classes.
Physical hardware
| machine type, CPU, RAM | OS and compiler | memcopy | matmul fp32 | note |
|---|---|---|---|---|
| PC Ryzen 9 7950X (2024), 16 cores, 128 GB DDR5-3600 | Debian 12, GCC 12 | 21.738 GiB/s | 2094.5 GFLOPS | |
| Notebook MacBook Pro (2023), M2 Pro 6+4 cores, 32 GB | macOS 15, clang 17 | 82.620 GiB/s | 2169.4 GFLOPS | |
| Notebook MacBook Pro (2019), i9-9880H 8 cores, 32 GB | macOS 15, clang 17 | 10.711 GiB/s | 499.6 GFLOPS | |
| Raspberry Pi 4 (2019) Cortex-A72, 4 cores, 4 GB | Debian 10, GCC 8 | 2.055 GiB/s | 15.4 GFLOPS | ^1 |
Small cloud VMs
| machine type, CPU, RAM | OS and compiler | memcopy | matmul fp32 | note |
|---|---|---|---|---|
| POP2-4C-16G, 4 vCPUs, 16 GB | Debian 13, GCC 14 | 21.981 GiB/s | 227.2 GFLOPS | ^2 |
| m7a.xlarge, 4 vCPUs, 16 GB | Debian 13, GCC 14 | 24.485 GiB/s | 420.0 GFLOPS | ^3 |
| m7i.xlarge, 4 vCPUs, 16 GB | Debian 13, GCC 14 | 21.901 GiB/s | 356.3 GFLOPS | ^4 |
| m8g.xlarge, 4 vCPUs, 16 GB | Debian 13, GCC 14 | 58.072 GiB/s | 159.9 GFLOPS | ^5 |
Large cloud VMs
| machine type, CPU, RAM | OS and compiler | memcopy | matmul fp32 | note |
|---|---|---|---|---|
| POP2-32C-128G, 32 vCPUs, 128 GB | Debian 13, GCC 14 | 65.566 GiB/s | 1511.6 GFLOPS | ^2 |
| m7a.8xlarge, 32 vCPUs, 128 GB | Debian 13, GCC 14 | 80.613 GiB/s | 2722.3 GFLOPS | ^3 |
| m7i.8xlarge, 32 vCPUs, 128 GB | Debian 13, GCC 14 | 76.559 GiB/s | 3113.1 GFLOPS | ^4 |
| m8g.8xlarge, 32 vCPUs, 128 GB | Debian 13, GCC 14 | 93.482 GiB/s | 1335.5 GFLOPS | ^5 |
[^1] memory block size reduced from 5 GiB to 1.25 GiB, maximum matrix size reduced from 16384×16384 to 4096×4096
[^2] cloud VM at Scaleway, 1 vCPU == 1 dedicated hyper-thread of an EPYC 7543 (“Milan”, Zen 3)
[^3] cloud VM at AWS, 1 vCPU == 1 dedicated hyper-thread of an EPYC 9R14 (“Genoa”, Zen 4)
[^4] cloud VM at AWS, 1 vCPU == 1 dedicated hyper-thread of a Xeon Platinum 8488C (“Sapphire Rapids”)
[^5] cloud VM at AWS, 1 vCPU == 1 dedicated core of a Graviton4