Implementing IMPLY operation in RISC-V with Gem5 simulations

Jun 24, 2023

In my continued research into RISC-V and non-volatile memory, I’ve started exploring in-memory processing. This technique promises large gains in compute performance by performing work inside the memory array rather than shuttling data to the CPU, sidestepping the “Von Neumann bottleneck”.

In this blog post, I will be building upon previous work to add an IMPLY instruction into RISC-V as a custom instruction. Then, I will build a program that can run with a simulated implementation of this instruction using Gem5.

IMPLY

This builds off a recent paper in which the authors designed an in-memory computation for IMPLY. However, their work only simulated the electrical properties in Cadence Virtuoso rather than focusing on the software side.

IMPLY gate symbol

An IMPLY op is one that performs a logical implication. Essentially it follows the rule that, for inputs p and q:

p implies q: If p, then q.

Framing this with more of a programming mindset, we can refer to the truth table:

Implication truth table
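
Since the image may not come through everywhere, here is the same truth table in text form:

p q | p → q
0 0 |   1
0 1 |   1
1 0 |   0
1 1 |   1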

From a previous post, we can immediately tell that this requires two inputs and thus is an R-type instruction in RISC-V. It’ll be fairly straightforward to add this in.

Why do we want an IMPLY? In high-density memory applications it is a gate that is relatively easy to implement directly in the cells, and IMPLY together with a FALSE (always-zero) operation is functionally complete, so more complex mathematical algorithms can be composed entirely out of it, as sketched below.
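
To illustrate that point, here is a minimal C sketch I wrote for this post (pure software, assuming only the rule !p || q; the helper names imply, not_, or_, and and_ are mine, not anything from the toolchain) showing how NOT, OR, and AND can all be built from a single imply primitive plus the constant false:

#include <assert.h>
#include <stdbool.h>

/* Software model of the gate: p -> q is !p || q. */
static bool imply(bool p, bool q) { return !p || q; }

/* NOT p == p -> false */
static bool not_(bool p) { return imply(p, false); }

/* p OR q == (NOT p) -> q */
static bool or_(bool p, bool q) { return imply(not_(p), q); }

/* p AND q == NOT (p -> NOT q) */
static bool and_(bool p, bool q) { return not_(imply(p, not_(q))); }

int main(void)
{
    for (int p = 0; p <= 1; p++)
        for (int q = 0; q <= 1; q++) {
            assert(not_(p) == !p);
            assert(or_(p, q) == (p || q));
            assert(and_(p, q) == (p && q));
        }
    return 0;
}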

Adding a RISC-V instruction

The first step will be opening up riscv-gnu-toolchain/binutils/opcodes/riscv-opc.c and getting to the bottom of our basic instructions list.

{"imply",      0, INSN_CLASS_I,       "d,s,t",         MATCH_IMPLY, MASK_IMPLY, match_opcode, 0 }

Now we have to pick values of MATCH_IMPLY and MASK_IMPLY and insert them into riscv-gnu-toolchain/binutils/include/opcode/riscv-opc.h. These are meant to fulfill the equation (ins & MASK_IMPLY) == MATCH_IMPLY.

An R-type instruction has three fields of disambiguation: OPCODE, FUNCT7, and FUNCT3. To match all of these, we will pick the mask 0xfe00707f, which we've used before. For the match, we should first find a free space.

Jumping temporarily to Gem5, we can open up gem5/src/arch/riscv/isa/decoder.isa and scroll down to where the opcode is 0x09 and the FUNCT3 is 0x06. Since I used this area previously to implement gcd, and it's not being used for anything else, I'll go ahead and define a FUNCT7 of 0x1.

Going back to our R-type instruction, we can now define the bits we need and turn that into a MATCH:

Imply instruction breakdown
funct7   rs2    rs1    funct3  rd     opcode[6:2]  [1:0]
0000001  00000  00000  110     00000  01001        11
= 0x2006027

So now I can add these defines to my header file:

#define MATCH_IMPLY 0x2006027
#define MASK_IMPLY 0xfe00707f

Further down I also need to add:

DECLARE_INSN(imply, MATCH_IMPLY, MASK_IMPLY)
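
To double-check these constants, here is a small C sketch (the encode_rtype helper is my own, not part of binutils) that packs the R-type fields and verifies the match/mask relation, including for the imply a5,a5,a4 encoding that shows up later in the disassembly:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define MATCH_IMPLY 0x2006027
#define MASK_IMPLY  0xfe00707f

/* Pack the standard R-type fields:
   funct7 | rs2 | rs1 | funct3 | rd | opcode */
static uint32_t encode_rtype(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                             uint32_t funct3, uint32_t rd, uint32_t opcode)
{
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | opcode;
}

int main(void)
{
    /* All register fields zero -> the MATCH constant itself. */
    assert(encode_rtype(0x1, 0, 0, 0x6, 0, 0x27) == MATCH_IMPLY);

    /* imply a5, a5, a4 (rd=x15, rs1=x15, rs2=x14), as emitted later by gcc. */
    uint32_t insn = encode_rtype(0x1, 14, 15, 0x6, 15, 0x27);
    assert((insn & MASK_IMPLY) == MATCH_IMPLY);
    printf("imply a5,a5,a4 = 0x%08x\n", (unsigned)insn);  /* 0x02e7e7a7 */
    return 0;
}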

Now I can build the RISC-V toolchain. This will take some time, so I can shift attention to Gem5.

./configure --prefix=/opt/riscv --host=riscv64-unknown-elf
sudo make clean
sudo make -j16

Implementing IMPLY

In gem5/src/arch/riscv/isa/decoder.isa, we have implementations for each instruction. As shown in the last blog post, we can write essentially arbitrary C-like behavior for each decoded instruction. There are some 'magic variables' available for us to use: for an R-type, we get source operands Rs1 and Rs2 and a destination Rd.

As I look through this file, I can see many other kinds of function calls, including reads from status registers and more complex library calls. I have yet to fully understand the limits of what each definition can do, but an imply operation is relatively simple to build.

Using the truth table from before, we can do some quick boolean algebra to pull out an equation.

(p → q) = (!p & !q) || (!p & q) || (p & q)
(p → q) = !p || q

So, traversing down the decode tree to right below my gcd instruction, I can add my imply definition:

0x1: imply({{
    Rd = !Rs1 | Rs2;
}});
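
One small design note: !Rs1 | Rs2 only yields a clean 0-or-1 result when Rs2 itself holds 0 or 1, which is true for the bool operands in the benchmark below. A more defensive definition could be Rd = (Rs1 == 0) || (Rs2 != 0).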

Then I can just build Gem5 and, if this was done correctly, I will end up with a dev environment ready to run programs.

scons build/RISCV/gem5.opt -j 16

...
scons: `build/RISCV/gem5.opt' is up to date.
scons: done building targets.

Benchmarking

Now, like before, I could write a simple program with a bit of inline assembly to show that this logic works. But I feel like at this point I can’t keep performing the same trick, so instead I’m going to write two programs. They will be similar but not identical.

The first will be a traditional C program that computes a batch of implications in software. The second will use the new instruction to compute the same implications. I can use Gem5’s built-in timing mechanism to compare their speed.

implyram.c

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define ITERATIONS 10

int main(void)
{
    /* Input pairs and the expected result of p1 -> p2 for each pair. */
    bool param1[] = {false, false, true, true};
    bool param2[] = {false, true, false, true};
    bool expect[] = {true, true, false, true};

    for (uint32_t i = 0; i < ITERATIONS; i++) {
        bool p1 = param1[i % 4];
        bool p2 = param2[i % 4];
        bool expected = expect[i % 4];
        /* Plain software implication. */
        bool actual = !p1 || p2;
        if (actual == expected) {
            printf("Iteration %u pass\n", i);
        } else {
            printf("Iteration %u failed\n", i);
        }
    }
    return 0;
}

implynvm.c

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define ITERATIONS 10

int main(void)
{
    bool param1[] = {false, false, true, true};
    bool param2[] = {false, true, false, true};
    bool expect[] = {true, true, false, true};

    for (uint32_t i = 0; i < ITERATIONS; i++) {
        bool p1 = param1[i % 4];
        bool p2 = param2[i % 4];
        bool expected = expect[i % 4];
        bool actual;
        /* Custom instruction: actual = p1 -> p2.
           "=r" binds the output register, "r" the two input registers. */
        asm volatile("imply %0, %1, %2\n"
                     : "=r"(actual)
                     : "r"(p1), "r"(p2)
                     :);
        if (actual == expected) {
            printf("Iteration %u pass\n", i);
        } else {
            printf("Iteration %u failed\n", i);
        }
    }
    return 0;
}
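
One side note: GCC itself doesn’t need to know anything about imply here. It passes the inline assembly string straight to the assembler, and it is the rebuilt binutils from the opcode-table change above that recognizes the mnemonic. Without that change, we would have had to hand-encode the instruction, for example with a .insn directive.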

As you can see, the two are functionally identical. However, when I build and disassemble them both:

/opt/riscv/bin/riscv64-unknown-elf-gcc implyram.c -o implyram.o
/opt/riscv/bin/riscv64-unknown-elf-gcc implynvm.c -o implynvm.o
/opt/riscv/bin/riscv64-unknown-elf-objdump -D implyram.o > implyram
/opt/riscv/bin/riscv64-unknown-elf-objdump -D implynvm.o > implynvm

You will see that the underlying machine instructions are slightly different:

implyram.o

00000000000101a2 <main>:
101a2: 7179 add sp,sp,-48
101a4: f406 sd ra,40(sp)
101a6: f022 sd s0,32(sp)
101a8: 1800 add s0,sp,48
101aa: 7301a783 lw a5,1840(gp) # 1f8a8 <mask+0x10>
101ae: fef42023 sw a5,-32(s0)
101b2: 7381a783 lw a5,1848(gp) # 1f8b0 <mask+0x18>
101b6: fcf42c23 sw a5,-40(s0)
101ba: 7401a783 lw a5,1856(gp) # 1f8b8 <mask+0x20>
101be: fcf42823 sw a5,-48(s0)
101c2: fe042623 sw zero,-20(s0)
101c6: a865 j 1027e <main+0xdc>
101c8: fec42783 lw a5,-20(s0)
101cc: 8b8d and a5,a5,3
101ce: 2781 sext.w a5,a5
101d0: 1782 sll a5,a5,0x20
101d2: 9381 srl a5,a5,0x20
101d4: 17c1 add a5,a5,-16
101d6: 97a2 add a5,a5,s0
101d8: ff07c783 lbu a5,-16(a5)
101dc: fef405a3 sb a5,-21(s0)
101e0: fec42783 lw a5,-20(s0)
101e4: 8b8d and a5,a5,3
101e6: 2781 sext.w a5,a5
101e8: 1782 sll a5,a5,0x20
101ea: 9381 srl a5,a5,0x20
101ec: 17c1 add a5,a5,-16
101ee: 97a2 add a5,a5,s0
101f0: fe87c783 lbu a5,-24(a5)
101f4: fef40523 sb a5,-22(s0)
101f8: fec42783 lw a5,-20(s0)
101fc: 8b8d and a5,a5,3
101fe: 2781 sext.w a5,a5
10200: 1782 sll a5,a5,0x20
10202: 9381 srl a5,a5,0x20
10204: 17c1 add a5,a5,-16
10206: 97a2 add a5,a5,s0
10208: fe07c783 lbu a5,-32(a5)
1020c: fef404a3 sb a5,-23(s0)
10210: feb44783 lbu a5,-21(s0)
10214: 0017c793 xor a5,a5,1
10218: 0ff7f793 zext.b a5,a5
1021c: e791 bnez a5,10228 <main+0x86>
1021e: fea44783 lbu a5,-22(s0)
10222: 0ff7f793 zext.b a5,a5
10226: c399 beqz a5,1022c <main+0x8a>
10228: 4785 li a5,1
1022a: a011 j 1022e <main+0x8c>
1022c: 4781 li a5,0
1022e: fef40423 sb a5,-24(s0)
10232: fe844783 lbu a5,-24(s0)
10236: 8b85 and a5,a5,1
10238: fef40423 sb a5,-24(s0)
1023c: fe844783 lbu a5,-24(s0)
10240: 873e mv a4,a5
10242: fe944783 lbu a5,-23(s0)
10246: 0ff77713 zext.b a4,a4
1024a: 0ff7f793 zext.b a5,a5
1024e: 00f71b63 bne a4,a5,10264 <main+0xc2>
10252: fec42783 lw a5,-20(s0)
10256: 85be mv a1,a5
10258: 67f5 lui a5,0x1d
1025a: 85078513 add a0,a5,-1968 # 1c850 <__clzdi2+0x40>
1025e: 170000ef jal 103ce <printf>
10262: a809 j 10274 <main+0xd2>
10264: fec42783 lw a5,-20(s0)
10268: 85be mv a1,a5
1026a: 67f5 lui a5,0x1d
1026c: 86878513 add a0,a5,-1944 # 1c868 <__clzdi2+0x58>
10270: 15e000ef jal 103ce <printf>
10274: fec42783 lw a5,-20(s0)
10278: 2785 addw a5,a5,1
1027a: fef42623 sw a5,-20(s0)
1027e: fec42783 lw a5,-20(s0)
10282: 0007871b sext.w a4,a5
10286: 47a5 li a5,9
10288: f4e7f0e3 bgeu a5,a4,101c8 <main+0x26>
1028c: 4781 li a5,0
1028e: 853e mv a0,a5
10290: 70a2 ld ra,40(sp)
10292: 7402 ld s0,32(sp)
10294: 6145 add sp,sp,48
10296: 8082 ret

implynvm.o

00000000000101a2 <main>:
101a2: 7179 add sp,sp,-48
101a4: f406 sd ra,40(sp)
101a6: f022 sd s0,32(sp)
101a8: 1800 add s0,sp,48
101aa: 7301a783 lw a5,1840(gp) # 1f888 <mask+0x10>
101ae: fef42023 sw a5,-32(s0)
101b2: 7381a783 lw a5,1848(gp) # 1f890 <mask+0x18>
101b6: fcf42c23 sw a5,-40(s0)
101ba: 7401a783 lw a5,1856(gp) # 1f898 <mask+0x20>
101be: fcf42823 sw a5,-48(s0)
101c2: fe042623 sw zero,-20(s0)
101c6: a871 j 10262 <main+0xc0>
101c8: fec42783 lw a5,-20(s0)
101cc: 8b8d and a5,a5,3
101ce: 2781 sext.w a5,a5
101d0: 1782 sll a5,a5,0x20
101d2: 9381 srl a5,a5,0x20
101d4: 17c1 add a5,a5,-16
101d6: 97a2 add a5,a5,s0
101d8: ff07c783 lbu a5,-16(a5)
101dc: fef405a3 sb a5,-21(s0)
101e0: fec42783 lw a5,-20(s0)
101e4: 8b8d and a5,a5,3
101e6: 2781 sext.w a5,a5
101e8: 1782 sll a5,a5,0x20
101ea: 9381 srl a5,a5,0x20
101ec: 17c1 add a5,a5,-16
101ee: 97a2 add a5,a5,s0
101f0: fe87c783 lbu a5,-24(a5)
101f4: fef40523 sb a5,-22(s0)
101f8: fec42783 lw a5,-20(s0)
101fc: 8b8d and a5,a5,3
101fe: 2781 sext.w a5,a5
10200: 1782 sll a5,a5,0x20
10202: 9381 srl a5,a5,0x20
10204: 17c1 add a5,a5,-16
10206: 97a2 add a5,a5,s0
10208: fe07c783 lbu a5,-32(a5)
1020c: fef404a3 sb a5,-23(s0)
10210: feb44783 lbu a5,-21(s0)
10214: fea44703 lbu a4,-22(s0)
10218: 02e7e7a7 imply a5,a5,a4
1021c: fef40423 sb a5,-24(s0)
10220: fe844783 lbu a5,-24(s0)
10224: 873e mv a4,a5
10226: fe944783 lbu a5,-23(s0)
1022a: 0ff77713 zext.b a4,a4
1022e: 0ff7f793 zext.b a5,a5
10232: 00f71b63 bne a4,a5,10248 <main+0xa6>
10236: fec42783 lw a5,-20(s0)
1023a: 85be mv a1,a5
1023c: 67f5 lui a5,0x1d
1023e: 83078513 add a0,a5,-2000 # 1c830 <__clzdi2+0x3c>
10242: 170000ef jal 103b2 <printf>
10246: a809 j 10258 <main+0xb6>
10248: fec42783 lw a5,-20(s0)
1024c: 85be mv a1,a5
1024e: 67f5 lui a5,0x1d
10250: 84878513 add a0,a5,-1976 # 1c848 <__clzdi2+0x54>
10254: 15e000ef jal 103b2 <printf>
10258: fec42783 lw a5,-20(s0)
1025c: 2785 addw a5,a5,1
1025e: fef42623 sw a5,-20(s0)
10262: fec42783 lw a5,-20(s0)
10266: 0007871b sext.w a4,a5
1026a: 47a5 li a5,9
1026c: f4e7fee3 bgeu a5,a4,101c8 <main+0x26>
10270: 4781 li a5,0
10272: 853e mv a0,a5
10274: 70a2 ld ra,40(sp)
10276: 7402 ld s0,32(sp)
10278: 6145 add sp,sp,48
1027a: 8082 ret

By comparing the machine code of the two, you can see that the second, with the built-in imply, needs fewer instructions in main: 73 compared to 83.

I’ll run both and compare the times. I could also change the value of ITERATIONS to adjust the total compute time. Alternatively, a future version could accept this value through stdin.

Once I update the binary to run in my Python configuration, I can simply run build/RISCV/gem5.opt configs/learning_gem5/part1/simple-riscv.py and track the time it takes to complete in ticks. What is a tick? My Gem5 output prints Global frequency set at 1000000000000 ticks per second, so one tick is a picosecond; this is independent of the simulated 1GHz CPU clock. The tick count for each run also appears in m5out/stats.txt.

Comparing ticks to perform each program

The “implications” of this new capability

At some point I’m going to turn this research into academic papers, and it’ll be a shame that I can’t use bad puns as the section headers like I can here.

One limitation of this blog post, and of Gem5 generally, is that physical properties are not modeled. While past research has explored the area and energy benefits of non-volatile memory, this CPU simulation toolchain doesn’t factor in those properties, so I rely on previous research to justify those claims.

It’s clear that using a single imply is faster. For 10 iterations, we save about 8 million ticks, roughly 8 microseconds at a picosecond per tick. That corresponds to about a 1% improvement in performance. So this is objectively better, but I don’t expect venture capitalists to start knocking on my door.

Future work will require identifying other novel instructions that can be implemented. One downside of Gem5 is that, since it doesn’t have much physical representation, I can’t exactly model how a non-volatile memory cell could be faster at the compute than SRAM or an ALU. This means future work may need to oversell the benefits.
