Running and Evaluating STREAM benchmark for NVM in Gem5

May 22, 2023

In my ongoing research, I have been looking at how non-volatile memory (NVM) interacts with the CPU and whether it can improve various computing tasks. You can look at my previous work on setting up and executing benchmarks.

Assessing the performance of such a system requires a workload that moves a lot of data between memory and the CPU. Finding a good benchmark for this is the natural place to start.

Building STREAM

The STREAM benchmark is maintained by John McCalpin and hosted by the Department of Computer Science at the University of Virginia. It measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for four simple vector kernels (Copy, Scale, Add, and Triad) to assess how much the memory system constrains an otherwise fast processor.
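To make the four kernels concrete, here is a minimal Python sketch of one pass over the arrays, in the order stream.c runs them (Copy, Scale, Add, Triad). The function name, the tiny array length, and the starting values are mine and are for illustration only; the real benchmark is written in C and times each loop separately.

```python
def stream_pass(a, b, c, scalar=3.0):
    """Run the four STREAM kernels once over equal-length lists."""
    n = len(a)
    for j in range(n):  # Copy: read a, write c
        c[j] = a[j]
    for j in range(n):  # Scale: read c, write b
        b[j] = scalar * c[j]
    for j in range(n):  # Add: read a and b, write c
        c[j] = a[j] + b[j]
    for j in range(n):  # Triad: read b and c, write a
        a[j] = b[j] + scalar * c[j]
    return a, b, c


# Illustrative starting values, not a claim about the benchmark's setup.
n = 4
a, b, c = stream_pass([1.0] * n, [2.0] * n, [0.0] * n)
print(a[0], b[0], c[0])  # prints: 15.0 3.0 4.0
```

Each loop does almost no arithmetic per element, so the time it takes is dominated by how fast the arrays can be streamed to and from memory.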

The code itself is compact: just a few source files on an FTP server. Better still, the project includes a Makefile, so building it is fairly easy.

Inspecting the Makefile, I can see that there is a Fortran version as well. We won't be using that.

I’ll copy over stream.c and Makefile to tests/test-progs/stream-benchmark in my Gem5 workspace.

As an aside, the Makefile has rules for several other programs and source files that I'm not copying over, so if you follow these directions and run something like make all, it will throw an error.

cd gem5/tests/test-progs/stream-benchmark
make stream_c.exe

This creates a binary called stream_c.exe. Next I just need to run it in my simulated system.

Running STREAM benchmark

In a previous post I demonstrated how to run several benchmarks with several Gem5 CPU configurations. Sadly, some of that work has been lost, but it won't be too hard to recreate.

The script configs/learning_gem5/part1/simple.py provides all the bootstrapping needed to run a binary. As before, I'll run the STREAM benchmark with three memory architectures: DDR3_1600_8x8, DDR4_2400_16x4, and NVM_2400_1x64.

I can set the system.mem_ctrl.dram field to each of these memory systems in turn. In addition, I can update the binary variable to point to my stream executable.
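Editing the script by hand three times is easy to get wrong, so here is a rough sketch of doing the memory swap on the script's text instead. It assumes the config contains a single line of the form `system.mem_ctrl.dram = SomeClass()`. The helper name `with_memory` and the two-line `sample` excerpt are my own stand-ins, not part of Gem5, and the binary path still has to be changed separately.

```python
import re

MEMORY_CLASSES = ["DDR3_1600_8x8", "DDR4_2400_16x4", "NVM_2400_1x64"]

# Matches the assignment to system.mem_ctrl.dram itself, but not attribute
# lines such as system.mem_ctrl.dram.range = ...
DRAM_ASSIGNMENT = re.compile(r"(system\.mem_ctrl\.dram\s*=\s*)\w+\(\)")


def with_memory(config_source, mem_class):
    """Return config_source with the memory interface class swapped."""
    new_source, count = DRAM_ASSIGNMENT.subn(
        lambda m: m.group(1) + mem_class + "()", config_source
    )
    if count != 1:
        raise ValueError(f"expected one dram assignment, found {count}")
    return new_source


# Stand-in for the relevant lines of simple.py (assumed layout).
sample = (
    "system.mem_ctrl.dram = DDR3_1600_8x8()\n"
    "system.mem_ctrl.dram.range = system.mem_ranges[0]\n"
)
variants = {name: with_memory(sample, name) for name in MEMORY_CLASSES}
print(variants["NVM_2400_1x64"])
```

In practice you would read the real simple.py, write each variant to its own file, and pass that file to gem5.opt. Whether the NVM interface can be assigned to the dram field directly depends on the Gem5 version, so treat this as a starting point.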

Then it’s as easy as running:

build/GCN3_X86/gem5.opt configs/learning_gem5/part1/simple.py

Results

root@8fa8d1b910d8:~/gem5# build/GCN3_X86/gem5.opt configs/learning_gem5/part1/simple.py
...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 3 microseconds.
Each test below will take on the order of 4391208 microseconds.
(= 1463736 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 41.3 3.875758 3.875757 3.875760
Scale: 29.6 5.409337 5.409335 5.409346
Add: 40.2 5.973723 5.973722 5.973723
Triad: 34.3 6.998141 6.998140 6.998143
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Exiting @ tick 247947740137000 because exiting with last active thread context
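The Best Rate column can be checked by hand. STREAM counts the bytes each kernel moves (two arrays for Copy and Scale, three for Add and Triad) and divides by the minimum time. Here is a small sketch that redoes that arithmetic using the array size, element size, and minimum times printed above; the variable and function names are mine.

```python
ARRAY_ELEMENTS = 10_000_000  # "Array size" from the output
BYTES_PER_ELEMENT = 8        # "8 bytes per array element"

# Number of arrays each kernel reads or writes per element.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

# "Min time" column from the output, in seconds.
MIN_TIME_SECONDS = {
    "Copy": 3.875757,
    "Scale": 5.409335,
    "Add": 5.973722,
    "Triad": 6.998140,
}


def best_rate_mb_per_s(kernel):
    """Bytes moved by one kernel pass divided by its best time, in MB/s."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_ELEMENT * ARRAY_ELEMENTS
    return 1.0e-6 * bytes_moved / MIN_TIME_SECONDS[kernel]


rates = {kernel: best_rate_mb_per_s(kernel) for kernel in ARRAYS_TOUCHED}
for kernel, rate in rates.items():
    print(f"{kernel}: {rate:.1f} MB/s")
```

The results round to 41.3, 29.6, 40.2, and 34.3 MB/s, matching the reported rates. Note that these are simulated seconds inside Gem5, not wall-clock time on the host.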

The program generates a lot of text and took a long time to complete. I have summarized the results in the following table:
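Collecting these numbers across three long runs is easier with a small parser. Here is a minimal sketch that pulls the four kernel rows out of a saved log. The function name and dictionary keys are my own choices, and `sample_log` is just the result rows from the run shown above.

```python
import re

# One result row: kernel name, best rate, then avg/min/max times.
ROW = re.compile(
    r"^(Copy|Scale|Add|Triad):\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$",
    re.MULTILINE,
)


def parse_stream_rows(log_text):
    """Map each kernel name to its rate (MB/s) and timings (seconds)."""
    rows = {}
    for kernel, rate, avg, tmin, tmax in ROW.findall(log_text):
        rows[kernel] = {
            "rate_mb_s": float(rate),
            "avg_s": float(avg),
            "min_s": float(tmin),
            "max_s": float(tmax),
        }
    return rows


sample_log = """\
Function Best Rate MB/s Avg time Min time Max time
Copy: 41.3 3.875758 3.875757 3.875760
Scale: 29.6 5.409337 5.409335 5.409346
Add: 40.2 5.973723 5.973722 5.973723
Triad: 34.3 6.998141 6.998140 6.998143
"""
rows = parse_stream_rows(sample_log)
for kernel, values in rows.items():
    print(kernel, values["rate_mb_s"])
```

Running this over one log per memory type gives the rows for a side-by-side comparison.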

Conclusion

With the processor held constant and only the memory system changed, all three configurations perform similarly. Non-volatile memory has a slight lead on the Scale kernel, but otherwise does not perform as well as DDR4.

With regard to the original purpose of the benchmark, it appears that the choice of memory does not significantly change performance here, which is a good thing in general. It also suggests that further research is needed into system configurations where such a bottleneck would actually occur.

Written by Nick Felker

Social Media Expert -- Rowan University 2017 -- IoT & Assistant @ Google
