"Insanity is doing the same thing over and over again and expecting different results."
Often misattributed to Albert Einstein
Before we get into the RISC-V, I want to cover something that may be obvious to some but
may have never occurred to others. Any loop structure can be converted to any other
(possibly with the addition of an if
statement). So a for
can be written as a while
and vice versa. Even a do-while
can be written as a for
or while
loop. Let’s look
at some equivalencies.
for (int i=0; i<a; i++) {
do_something;
}
int i = 0;
while (i < a) {
do_something;
i++;
}
int i = 0;
if (i < a) {
do {
do_something;
i++;
} while (i < a);
}
// you could also have an if (i >= a) goto loop_done; to jump over do-while
In general, when writing assembly, it can help to think more in terms of while
or
do-while
rather than for
because the former more closely resemble what the
assembly looks like in terms of what goes where. Like in the last chapter,
where we would think of the if-else
statements in "jump-form" or "branch-form",
we can do the same here, converting for
to while
in our head as an intermediary
step before going to assembly.
Speaking of "jump-form", lets apply it to the loop above:
int i=0;
if (i >= a)
goto done_loop;
loop:
do_something;
i++
if (i < a)
goto loop;
done_loop:
You can see how that starts to look more like assembly. Another thing to note is that
unlike with if
statements where we test for the opposite to jump over the block of code,
when you’re doing the loop test at the bottom like with a do-while
, it is unchanged
from C because you are jumping to continue the loop. If you put the test at the top it
becomes inverted, and you put an unconditional jump at the bottom:
int i=0;
loop:
if (i >= a)
goto done_loop;
do_something;
i++
goto loop:
done_loop:
In general it’s better to test at the bottom, both because the condition matches
the higher level form, and because when you know the loop is going to execute at least once it requires
only one jump + label, rather than 2 since you can forgo the the initial if
check:
for (int i=0; i<10; i++)
do_something;
// becomes
int i=0;
loop:
do_something;
i++
if (i < 10)
goto loop;
Ok, now that we’ve got the theory and structure out of the way, let’s try doing a simple one in RISC-V.
int sum = 0;
for (int i=0; i<100; i++) {
sum += i;
}
That’s about as basic as it gets, adding up the numbers 0 to 99.
li t0, 0 # sum = 0
li t1, 1 # i = 1 we can start at 1 because obviously adding 0 is pointless
li t2, 100
loop:
addi t0, t0, t1 # sum += i
addi t1, t1, 1 # i++
blt t1, t2, loop # while (i < 100)
Ok I don’t think there’s much point in doing any more without getting to what loops are most often used for, looping through data structures, most commonly arrays.
Looping and arrays go together like peanut butter and jam. An array is a sequence of variables of the same type, almost always related in some way. Naturally, you want to operate on them all together in various ways; sorting, searching, accumulating, etc. Given that the only way to do that is with loops, in this section we’ll cover different ways of looping through arrays, including multidimensional arrays.
Let’s pretend there’s an array int numbers[10];
filled with 10 random numbers.
int total = 0;
for (int i=0; i<10; i++) {
total += numbers[i];
}
There are several ways to do this. The first is the most literal translation.
li t0, 0 # total = 0
li t1, 0 # i = 0
la t2, numbers # t2 = numbers
li t3, 10
sum_loop:
slli t4, t1, 2 # t4 = i*sizeof(int) == i*4
add t4, t4, t2 # t4 = &numbers[i]
lw t4, 0(t4) # t4 = numbers[i]
add t0, t0, t4 # total += numbers[i]
addi t1, t1, 1 # i++
blt t1, t3, sum_loop # while (i < 10)
We initialize the relevant variables beforehand (numbers
and 10
could be loaded
every iteration but that’s less efficient). Now what’s with the i*4
? We already
discussed using shifts to multiply and divide by powers of 2 in a previous chapter,
but here we’re doing something that higher level languages do automatically for you
every time you do an array access. When you access the i'th element, under the hood
it is multiplying i
by the size of the type of the array and adding that number of
bytes to the base address and then loading the element located there.
If you’re unfamiliar with the C syntax in the comments, &
means "address of", so
t4
is being set to the address of the i'th element. Actually that C syntax is
redundant because the the &
counteracts the brackets. In C adding a number to a
pointer does pointer math (ie it multiplies by the size of the items as discussed
above). This means that the following comparison is true:
&numbers[i] == numbers + i
which means that this is true too
&numbers[0] == numbers
The reason I use the form on the left in C/C++ even when I can use the right is it makes it more explicit and obvious that I’m getting the address of an element of an array. If you were scanning the code quickly and saw the expression on the right, you might not realize that’s an address at all, it could be some mathematical expression (though the array name would hopefully clue you in if it was picked well).
Anyway, back to the RISC-V code. After we get the address of the element we want, we
have to actually read it from memory (ie load it). Since it’s an array of words
(aka 4 byte ints) we can use load word, lw
.
Finally, we add that value to total
, increment i
, and perform the loop check.
Now, I said at the beginning that this was the most literal, direct translation
(not counting the restructuring to a do-while
form). However, it is not my preferred
form because it’s not the simplest, nor the shortest.
Rather than calculate the element address every iteration, why not keep a pointer to the current element and iterate through the array with it? In C what I’m suggesting is this:
int* p = &numbers[0];
int i = 0, total = 0;
do {
total += *p;
i++;
p++;
} while (i < 10);
In other words, we set p
to point at the first element and then increment it every
step to keep it pointing at numbers[i]
. Again, all mathematical operations on pointers
in C deal in increments of the byte syze of the type, so p++
is really adding 1*sizeof(int)
.
li t0, 0 # total = 0
li t1, 0 # i = 0
la t2, numbers # p = numbers
li t3, 10
sum_loop:
lw t4, 0(t2) # t4 = *p
add t0, t0, t4 # total += *p
addi t1, t1, 1 # i++
addi t2, t2, 4 # p++ ie p += sizeof(int)
blt t1, t3, sum_loop # while (i < 10)
Now, that may not look much better, we only saved 1 instuction, and if we were
looping through a string (aka an array of characters, sizeof(char) == 1
) we wouldn’t
have saved any. However, imagine if we weren’t using slli
to do the multiply but
mul
. That would take 2 instructions, even if one could be above the loop.
And remember we would have to use mul
instead of slli
if we were iterating
through an array of structures with a size that wasn’t a power of 2, so using this
method saves even more in that rare case.
However, there is one more variant that you can use that can save a few more instructions.
Instead of using i
and i<10
to control the loop, use p
and the address just past the
end of the array. In C it would be this:
int* p = &numbers[0];
int* end = &numbers[10];
int total = 0;
do {
total += *p;
p++;
} while (p < end);
You could also use !=
instead of <
. This is similar to using the .end()
method
on many C++ data structures when using iterators. Now the RISC-V version:
li t0, 0 # total = 0
la t2, numbers # p = numbers
addi t3, t2, 40 # end = &numbers[10] = numbers + 10*sizeof(int)
sum_loop:
lw t4, 0(t2) # t4 = *p
add t0, t0, t4 # total += *p
addi t2, t2, 4 # p++ ie p += sizeof(int)
blt t2, t3, sum_loop # while (p < end)
So we dropped from 10 to 7 instructions, 6 to 4 in the loop itself which is the most important for performance. And this was for a 1D array. Imagine if you had 2 or 3 indices you had to use to calculate the correct offset. That’s what we go over in the next section.
The first thing to understand is what’s really happening when you declare a 2D array in C. The contents of a 2D array are tightly packed, in row-major order, meaning that all the elements from the first row are followed by all the elements of the second row and so on. What this means is that a 2D array is equivalent to a 1D array with rows*cols elements in the same order:
#define ROWS 2
#define COLS 4
// The memory of these two arrays are identical
int array[ROWS][COLS] = { { 1, 2, 3, 4 }, { 5, 6, 7, 8 } };
int array1d[ROWS*COLS] = { 1, 2, 3, 4, 5, 6, 7, 8 };
See the code example 2d_arrays.c for more details.
What this means is that when we declare a 2D array, it’s basically a 1D array with the size equal to rows * columns. Also, when we loop through a 2D array, we can often treat it like a 1D array with a single loop. So everything that we learned before applies.
Let’s do an example.
for (int i=0; i<rows; i++) {
for (int j=0; j<cols; ++j) {
array[i][j] = i + j;
}
}
// becomes
int r, c;
for (int i=0; i<rows*cols; i++) {
r = i / cols;
c = i % cols;
array[i] = r + c;
}
So assuming rows
and cols
are in a0
and a1
(and nonzero), it would
look like this:
la t0, array # p = &array[0]
li t1, 0 # i = 0
mul t2, a0, a1 # t2 = rows * cols
loop:
div t3, t1, a1 # r = i / cols
rem t4, t1, a1 # c = i % cols
add t3, t3, t4 # t3 = r + c
sw t3, 0(t0) # array[i] = *p = r + c
addi t1, t1, 1 # i++
addi t0, t0, 4 # p++ (keep pointer in sync with i, aka p = &array[i])
blt t1, t2, loop # while (i < rows*cols)
You might ask if it’s it worth it to convert it to a single loop when you still
need the original i
and j
as if you were doing nested loops. Generally, it is
much nicer to avoid nested loops in assembly if you can. There are many cases
when you get the best of both worlds though. If you’re doing a clear for example,
setting the entire array to a single value, there’s no need to calculate the row
and column like we did here. I only picked this example to show how you could
get them back if you needed them.
For comparison here’s the nested translation (while still taking advantage of the 1D arrangement of memory and pointer iterators):
la t0, array # p = &array[0]
li t1, 0 # i = 0
looprows:
li t2, 0 # j = 0
loopcols:
add t3, t1, t2 # t3 = i + j
sw t3, 0(t0) # array[i][j] = *p = i + j
addi t2, t2, 1 # j++
addi t0, t0, 4 # p++ (keep pointer in sync with i and j, aka p = &array[i][j])
blt t2, a1, loopcols # while (j < cols)
addi t1, t1, 1 # i++
blt t1, a0, looprows # while (i < rows)
It’s the same number of instructions, but with an extra label and branch. I think I prefer this version despite the extra branch. On the other hand, either of the last 2 versions are better than the literal translation below:
la t0, array # p = &array[0]
li t1, 0 # i = 0
looprows:
li t2, 0 # j = 0
loopcols:
add t3, t1, t2 # t3 = i + j
# need to calculate the byte offset of element array[i][j]
mul t4, t1, a1 # t4 = i * cols
add t4, t4, t2 # t4 = i * cols + j
slli t4 t4, 2 # t4 = (i * cols + j) * sizeof(int)
add t4, t4, t0 # t4 = &array[i][j] (calculated as array + (i*cols + j)*4)
sw t3, 0(t4) # array[i][j] = i + j
addi t2, t2, 1 # j++
blt t2, a1, loopcols # while (j < cols)
addi t1, t1, 1 # i++
blt t1, a0, looprows # while (i < rows)
That chunk in the middle calculating the offset of every element? Not only is it far slower than iterating the pointer through the array, but you can imagine how much worse it would be for a 3D array with 3 nested loops.
Hopefully after those examples you have a more solid understanding of looping in RISC-V and how to transform various loops and array accesses into the form that makes your life the easiest. There is more we could cover here, like looping through a linked list, but I think that’s beyond the scope of what we’ve covered so far. Perhaps in a later chapter.