Running a heterogeneous CPU-GPU simulation with Gem5

May 19, 2023

In my ongoing research, I have looked at how non-volatile memory interacts with the CPU, with the aim of improving various computing tasks. You can look at previous work on setting up and executing benchmarks (Part 1, Part 2).

For standard processing, non-volatile memory is simply not going to be better. As such it is useful to look at other kinds of opportunities where it can develop a niche.

This has brought me to GPUs. Graphics processing units can perform a lot of computations in parallel, though there is a bottleneck when sending data back and forth between the GPU, CPU, and RAM. Moving some operations to in-memory computing might improve performance.

In order to begin testing this hypothesis, I will need to find a way to simulate GPU behaviors. Given my existing research on top of Gem5, that is where I will begin.

Gem5-GPU

A GPU integrated into Gem5 was first described in a 2013 paper, gem5-gpu: A Heterogeneous CPU-GPU Simulator. The main project source now includes a GPU model of its own, GCN3, though there is only a little documentation on it.

The documentation focuses more on compatibility details than on how to use the model in a simulation. As such, I aim to compile a small app and run it on a simple GPU.

Gem5 provides a few test programs that are set up for a GPU, including Square, which is the simplest. It just squares a bunch of numbers.

Because it is a simple file, I’ll copy over the key program files to tests/test-progs/gpu-square. This program also includes a Makefile, making it relatively easy to build on the command line.

In order to do this, I will need the GPU sample sources, which live in a separate repository, and an environment to build them in. I can get that environment by using Docker to create a persistent, interactive terminal.

docker pull gcr.io/gem5-test/gcn-gpu:v22-0
docker images # Get image id
docker run --name gem5-gpu -it $IMAGE_ID

Sidetrack: Docker problems

Somehow, as I was working on this, my Docker installation became corrupted! Everything had to be wiped. That sucked. Reinstalling, I ran into a problem reconnecting to VSCode.

[2317 ms] @devcontainers/cli 0.40.0. Node.js v16.17.1. win32 10.0.19043 x64.
[2317 ms] Start: Run: docker buildx version
[2380 ms] 
[2381 ms] docker: 'buildx' is not a docker command.
See 'docker --help'.

As I’m using Windows with WSL, there are a lot of strange interconnections happening. After toggling some settings and rebooting a few times, it worked. Sadly, I don’t entirely know what I managed to fix, which doesn’t bode well for the future. As such, I realized it would be critical to keep my data preserved.
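One way to keep data preserved is to bind-mount a host directory into the container, so clones and build output live on the host rather than inside Docker. The following is a minimal sketch, not what I originally ran: the host directory `$HOME/gem5-work` and the mount point `/workspace` are hypothetical names chosen for illustration. The image can also be referenced by name, so looking up its ID by hand is optional.

```shell
#!/usr/bin/env bash
# Sketch only: WORK_DIR and /workspace are hypothetical names,
# not part of the original setup.
IMAGE="gcr.io/gem5-test/gcn-gpu:v22-0"
CONTAINER="gem5-gpu"
WORK_DIR="${WORK_DIR:-$HOME/gem5-work}"

start_gem5_container() {
    mkdir -p "$WORK_DIR"
    # -v bind-mounts the host directory so its contents survive even if
    # the container (or the Docker installation) is wiped.
    # -w makes the mounted directory the working directory inside.
    docker run --name "$CONTAINER" -it \
        -v "$WORK_DIR:/workspace" -w /workspace \
        "$IMAGE"
}

# Re-enter the same container later instead of creating a new one.
resume_gem5_container() {
    docker start -ai "$CONTAINER"
}
```

Calling `start_gem5_container` once creates the container, and `resume_gem5_container` reattaches to it afterwards. Anything cloned or built under `/workspace` ends up in the host directory.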

Build Gem5

It took me long enough, but I got the new container attached to VSCode. To build Gem5, I ran these steps from the container terminal:

git clone https://gem5.googlesource.com/public/gem5
cd gem5
scons build/GCN3_X86/gem5.opt -j 32
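The `-j 32` above matches my machine and may be too high elsewhere. A rough sketch of picking the job count from the machine instead is below. The `GB_PER_JOB` figure is an assumed rule of thumb of mine, not a documented Gem5 requirement, and the script only prints the resulting command rather than running it.

```shell
#!/usr/bin/env bash
# Sketch: choose a -j value from the machine instead of hard-coding 32.
# GB_PER_JOB is an assumed rule of thumb, not a gem5 requirement.
GB_PER_JOB="${GB_PER_JOB:-2}"

pick_jobs() {
    local cores mem_kb mem_jobs
    cores="$(nproc)"
    # MemAvailable is reported in kB in /proc/meminfo (Linux only).
    mem_kb="$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo 2>/dev/null)"
    mem_kb="${mem_kb:-0}"
    mem_jobs=$(( mem_kb / 1024 / 1024 / GB_PER_JOB ))
    if (( mem_jobs < 1 )); then mem_jobs=1; fi
    # Use whichever limit is smaller: CPU cores or available memory.
    if (( mem_jobs < cores )); then
        echo "$mem_jobs"
    else
        echo "$cores"
    fi
}

# Print the build command; run it by hand from inside the gem5 directory.
echo "scons build/GCN3_X86/gem5.opt -j $(pick_jobs)"
```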

Here I’m told to add commit hooks. I consent by typing y and pressing enter.

This ended up not working, throwing the error:

[VER TAGS] -> GCN3_X86/sim/tags.cc
/usr/bin/env: python\r: No such file or directory

Something odd is happening here, since python\r shouldn't include the carriage return character. From a StackOverflow question, I ran:

apt install dos2unix
find . -type f -print0 | xargs -0 dos2unix

This might be an odd consequence of me jumping between Windows and Linux and may only be applicable in this situation.
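The error makes sense once you see that the shebang line is read up to the newline: with Windows line endings, the interpreter name becomes `python` followed by a carriage return. As a narrower alternative to converting every file, here is a sketch that lists only the text files containing a carriage return and strips it, while skipping `.git`. It uses GNU `grep` and `sed` in place of `dos2unix`, which is my substitution, and it demonstrates itself on a throwaway directory.

```shell
#!/usr/bin/env bash
# Sketch: find files with CRLF line endings and strip the trailing CR,
# leaving .git internals and binary files alone.
# Uses GNU grep and GNU sed rather than dos2unix.

list_crlf_files() {
    # -r recurse, -l print names only, -I skip binary files;
    # $'\r' is a literal carriage return.
    grep -rlI --exclude-dir=.git $'\r' "${1:-.}"
}

fix_crlf_files() {
    list_crlf_files "${1:-.}" | while IFS= read -r f; do
        sed -i 's/\r$//' "$f"   # remove the CR at the end of each line
    done
}

# Self-contained demonstration in a temporary directory.
demo="$(mktemp -d)"
printf '#!/usr/bin/env python\r\nprint("hi")\r\n' > "$demo/script.py"
list_crlf_files "$demo"        # prints the path of script.py
fix_crlf_files "$demo"
head -n 1 "$demo/script.py"    # shebang line, now without the CR
```

Running `list_crlf_files .` first shows what would be touched before anything is changed.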

Obtaining Square-GPU Sample

I still need to get the GPU-related sample projects. They’re in a separate code repository.

cd .. # Should be one level above
git clone https://gem5.googlesource.com/public/gem5-resources

Running Square-GPU Sample

Now I can build my simulation and run the square sample.

cd gem5 # scons must run from inside the gem5 directory
scons build/GCN3_X86/gem5.opt -j 32
cd ../gem5-resources/src/gpu/square
make square # Will spit out perl warnings but will work
cd ~
gem5/build/GCN3_X86/gem5.opt gem5/configs/example/apu_se.py -n 3 -c gem5-resources/src/gpu/square/bin/square

Those commands will produce this output:

gem5 Simulator System.  https://www.gem5.org
gem5 is copyrighted software; use the --copyright option for details.
gem5 version 22.1.0.0
gem5 compiled May 18 2023 15:42:00
gem5 started May 19 2023 01:51:27
gem5 executing on 8fa8d1b910d8, pid 28469
command line: gem5/build/GCN3_X86/gem5.opt gem5/configs/example/apu_se.py -n 3 -c gem5-resources/src/gpu/square/bin/square
Num SQC =  1 Num scalar caches =  1 Num CU =  4
warn: The `get_runtime_isa` function is deprecated. Please migrate away from using this function.
...
Global frequency set at 1000000000000 ticks per second
warn: system.ruby.network adopting orphan SimObject param 'ext_links'
warn: system.ruby.network adopting orphan SimObject param 'int_links'
build/GCN3_X86/mem/dram_interface.cc:690: warn: DRAM device capacity (8192 Mbytes) does not match the address range assigned (512 Mbytes)
build/GCN3_X86/base/stats/storage.hh:279: warn: Bucket size (10000) does not divide range [1:1.6e+06] into equal-sized buckets. Rounding up.
...
Forcing maxCoalescedReqs to 32 (TLB assoc.) 
...
build/GCN3_X86/base/statistics.hh:280: warn: One of the stats is a legacy stat. Legacy stat is a stat that does not belong to any statistics::Group. Legacy stat is deprecated.
Forcing maxCoalescedReqs to 32 (TLB assoc.) 
Forcing maxCoalescedReqs to 32 (TLB assoc.) 
0: system.remote_gdb: listening for remote gdb on port 7000
build/GCN3_X86/sim/simulate.cc:192: info: Entering event queue @ 0.  Starting simulation...
build/GCN3_X86/mem/ruby/system/Sequencer.cc:613: warn: Replacement policy updates recently became the responsibility of SLICC state machines. Make sure to setMRU() near callbacks in .sm files!
build/GCN3_X86/sim/mem_state.cc:443: info: Increasing stack size by one page.
build/GCN3_X86/sim/syscall_emul.cc:74: warn: ignoring syscall mprotect(...)
...
build/GCN3_X86/arch/generic/debugfaults.hh:145: warn: MOVNTDQ: Ignoring non-temporal hint, modeling as cacheable!
build/GCN3_X86/arch/x86/generated/exec-ns.cc.inc:27: warn: instruction 'frndint' unimplemented
build/GCN3_X86/sim/mem_state.cc:443: info: Increasing stack size by one page.
build/GCN3_X86/gpu-compute/gpu_compute_driver.cc:710: warn: unimplemented ioctl: AMDKFD_IOC_ACQUIRE_VM
build/GCN3_X86/sim/syscall_emul.hh:1890: warn: mmap: writing to shared mmap region is currently unsupported. The write succeeds on the target, but it will not be propagated to the host or shared mappings
build/GCN3_X86/sim/mem_state.cc:443: info: Increasing stack size by one page.
build/GCN3_X86/gpu-compute/gpu_compute_driver.cc:460: warn: Signal events are only supported currently
build/GCN3_X86/sim/power_state.cc:105: warn: PowerState: Already in the requested power state, request ignored
build/GCN3_X86/gpu-compute/gpu_compute_driver.cc:604: warn: unimplemented ioctl: AMDKFD_IOC_SET_SCRATCH_BACKING_VA
build/GCN3_X86/gpu-compute/gpu_compute_driver.cc:614: warn: unimplemented ioctl: AMDKFD_IOC_SET_TRAP_HANDLER
info: running on device 
info: architecture on AMD GPU device is: 801
info: allocate host and device mem (  7.63 MB)
info: launch 'vector_square' kernel
build/GCN3_X86/sim/syscall_emul.cc:85: warn: ignoring syscall sched_yield(...)
      (further warnings will be suppressed)
...
info: check result
PASSED!
Ticks: 139381328500
Exiting because  exiting with last active thread context

That’s cool.
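The `Ticks` line can be turned into simulated time using the global frequency printed earlier in the same output (1000000000000 ticks per second). A small sketch of that arithmetic, with both numbers copied from the output above:

```shell
#!/usr/bin/env bash
# Sketch: convert the tick count from the run above into simulated seconds.
# Both numbers are copied from the output shown above.
TICKS=139381328500
TICKS_PER_SECOND=1000000000000

ticks_to_seconds() {
    # awk does the floating-point division; bash arithmetic is integer-only.
    awk -v t="$1" -v f="$2" 'BEGIN { printf "%.6f\n", t / f }'
}

ticks_to_seconds "$TICKS" "$TICKS_PER_SECOND"   # prints 0.139381
```

So the whole run covers roughly 0.14 seconds of simulated time. That is time inside the simulated system, not how long the simulation took to run on my machine.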

Looking at the config script, apu_se.py, it seems like there are a few settings for GPUs but none for switching CPU memory architectures.

Wrapping up

You can take a look at the square.cpp code itself if you want. Admittedly, as someone who has never written GPU code before, I don’t quite get what is happening here. The apu_se.py config has many settings for configuring shaders and other components. It is impressive that all of this is simulated.

In general, it’s clear that I will have to spend more time with this in order to complete my research.