Skip to content

Latest commit

 

History

History
413 lines (331 loc) · 13 KB

ch4.adoc

File metadata and controls

413 lines (331 loc) · 13 KB

Chapter 4: Loops

"Insanity is doing the same thing over and over again and expecting different results."
— Unknown
Often misattributed to Albert Einstein

Before we get into the RISC-V, I want to cover something that may be obvious to some but may have never occurred to others. Any loop structure can be converted to any other (possibly with the addition of an if statement). So a for can be written as a while and vice versa. Even a do-while can be written as a for or while loop. Let’s look at some equivalencies.

	for (int i=0; i<a; i++) {
		do_something;
	}

	int i = 0;
	while (i < a) {
		do_something;
		i++;
	}

	int i = 0;
	if (i < a) {
		do {
			do_something;
			i++;
		} while (i < a);
	}
	// you could also have an if (i >= a) goto loop_done; to jump over do-while

In general, when writing assembly, it can help to think more in terms of while or do-while rather than for because the former more closely resemble what the assembly looks like in terms of what goes where. Like in the last chapter, where we would think of the if-else statements in "jump-form" or "branch-form", we can do the same here, converting for to while in our head as an intermediary step before going to assembly.

Speaking of "jump-form", lets apply it to the loop above:

	int i=0;
	if (i >= a)
		goto done_loop;
loop:
	do_something;
	i++
	if (i < a)
		goto loop;

done_loop:

You can see how that starts to look more like assembly. Another thing to note is that unlike with if statements where we test for the opposite to jump over the block of code, when you’re doing the loop test at the bottom like with a do-while, it is unchanged from C because you are jumping to continue the loop. If you put the test at the top it becomes inverted, and you put an unconditional jump at the bottom:

	int i=0;
loop:
	if (i >= a)
		goto done_loop;
	do_something;
	i++
	goto loop:

done_loop:

In general it’s better to test at the bottom, both because the condition matches the higher level form, and because when you know the loop is going to execute at least once it requires only one jump + label, rather than 2 since you can forgo the the initial if check:

	for (int i=0; i<10; i++)
		do_something;

	// becomes

	int i=0;
loop:
	do_something;
	i++
	if (i < 10)
		goto loop;

Ok, now that we’ve got the theory and structure out of the way, let’s try doing a simple one in RISC-V.

	int sum = 0;
	for (int i=0; i<100; i++) {
		sum += i;
	}

That’s about as basic as it gets, adding up the numbers 0 to 99.

	li     t0, 0   # sum = 0
	li     t1, 1   # i = 1  we can start at 1 because obviously adding 0 is pointless
	li     t2, 100
loop:
	addi   t0, t0, t1     # sum += i
	addi   t1, t1, 1      # i++
	blt    t1, t2, loop   # while (i < 100)

Ok I don’t think there’s much point in doing any more without getting to what loops are most often used for, looping through data structures, most commonly arrays.

Looping Through Arrays

Looping and arrays go together like peanut butter and jam. An array is a sequence of variables of the same type, almost always related in some way. Naturally, you want to operate on them all together in various ways; sorting, searching, accumulating, etc. Given that the only way to do that is with loops, in this section we’ll cover different ways of looping through arrays, including multidimensional arrays.

1D Arrays

Let’s pretend there’s an array int numbers[10]; filled with 10 random numbers.

	int total = 0;
	for (int i=0; i<10; i++) {
		total += numbers[i];
	}

There are several ways to do this. The first is the most literal translation.

	li    t0, 0     # total = 0
	li    t1, 0     # i = 0
	la    t2, numbers   # t2 = numbers
	li    t3, 10
sum_loop:
	slli  t4, t1, 2   # t4 = i*sizeof(int) == i*4
	add   t4, t4, t2  # t4 = &numbers[i]
	lw    t4, 0(t4)   # t4 = numbers[i]
	add   t0, t0, t4  # total += numbers[i]

	addi  t1, t1, 1   # i++
	blt   t1, t3, sum_loop   # while (i < 10)

We initialize the relevant variables beforehand (numbers and 10 could be loaded every iteration but that’s less efficient). Now what’s with the i*4? We already discussed using shifts to multiply and divide by powers of 2 in a previous chapter, but here we’re doing something that higher level languages do automatically for you every time you do an array access. When you access the i'th element, under the hood it is multiplying i by the size of the type of the array and adding that number of bytes to the base address and then loading the element located there.

If you’re unfamiliar with the C syntax in the comments, & means "address of", so t4 is being set to the address of the i'th element. Actually that C syntax is redundant because the the & counteracts the brackets. In C adding a number to a pointer does pointer math (ie it multiplies by the size of the items as discussed above). This means that the following comparison is true:

&numbers[i] == numbers + i

which means that this is true too

&numbers[0] == numbers

The reason I use the form on the left in C/C++ even when I can use the right is it makes it more explicit and obvious that I’m getting the address of an element of an array. If you were scanning the code quickly and saw the expression on the right, you might not realize that’s an address at all, it could be some mathematical expression (though the array name would hopefully clue you in if it was picked well).

Anyway, back to the RISC-V code. After we get the address of the element we want, we have to actually read it from memory (ie load it). Since it’s an array of words (aka 4 byte ints) we can use load word, lw.

Finally, we add that value to total, increment i, and perform the loop check.

Now, I said at the beginning that this was the most literal, direct translation (not counting the restructuring to a do-while form). However, it is not my preferred form because it’s not the simplest, nor the shortest.

Rather than calculate the element address every iteration, why not keep a pointer to the current element and iterate through the array with it? In C what I’m suggesting is this:

	int* p = &numbers[0];
	int i = 0, total = 0;
	do {
		total += *p;
		i++;
		p++;
	} while (i < 10);

In other words, we set p to point at the first element and then increment it every step to keep it pointing at numbers[i]. Again, all mathematical operations on pointers in C deal in increments of the byte syze of the type, so p++ is really adding 1*sizeof(int).

	li     t0, 0     # total = 0
	li     t1, 0     # i = 0
	la     t2, numbers   # p = numbers
	li     t3, 10
sum_loop:
	lw     t4, 0(t2)    # t4 = *p
	add    t0, t0, t4   # total += *p

	addi   t1, t1, 1    # i++
	addi   t2, t2, 4    # p++  ie p += sizeof(int)
	blt    t1, t3, sum_loop   # while (i < 10)

Now, that may not look much better, we only saved 1 instuction, and if we were looping through a string (aka an array of characters, sizeof(char) == 1) we wouldn’t have saved any. However, imagine if we weren’t using slli to do the multiply but mul. That would take 2 instructions, even if one could be above the loop. And remember we would have to use mul instead of slli if we were iterating through an array of structures with a size that wasn’t a power of 2, so using this method saves even more in that rare case.

However, there is one more variant that you can use that can save a few more instructions. Instead of using i and i<10 to control the loop, use p and the address just past the end of the array. In C it would be this:

	int* p = &numbers[0];
	int* end = &numbers[10];
	int total = 0;
	do {
		total += *p;
		p++;
	} while (p < end);

You could also use != instead of <. This is similar to using the .end() method on many C++ data structures when using iterators. Now the RISC-V version:

	li     t0, 0        # total = 0
	la     t2, numbers  # p = numbers
	addi   t3, t2, 40   # end = &numbers[10] = numbers + 10*sizeof(int)
sum_loop:
	lw     t4, 0(t2)    # t4 = *p
	add    t0, t0, t4   # total += *p

	addi   t2, t2, 4    # p++  ie p += sizeof(int)
	blt    t2, t3, sum_loop   # while (p < end)

So we dropped from 10 to 7 instructions, 6 to 4 in the loop itself which is the most important for performance. And this was for a 1D array. Imagine if you had 2 or 3 indices you had to use to calculate the correct offset. That’s what we go over in the next section.

2D Arrays

The first thing to understand is what’s really happening when you declare a 2D array in C. The contents of a 2D array are tightly packed, in row-major order, meaning that all the elements from the first row are followed by all the elements of the second row and so on. What this means is that a 2D array is equivalent to a 1D array with rows*cols elements in the same order:

	#define ROWS 2
	#define COLS 4
	// The memory of these two arrays are identical
	int array[ROWS][COLS] = { { 1, 2, 3, 4 }, { 5, 6, 7, 8 } };
	int array1d[ROWS*COLS] = { 1, 2, 3, 4, 5, 6, 7, 8 };

See the code example 2d_arrays.c for more details.

What this means is that when we declare a 2D array, it’s basically a 1D array with the size equal to rows * columns. Also, when we loop through a 2D array, we can often treat it like a 1D array with a single loop. So everything that we learned before applies.

Let’s do an example.

	for (int i=0; i<rows; i++) {
		for (int j=0; j<cols; ++j) {
			array[i][j] = i + j;
		}
	}

	// becomes

	int r, c;
	for (int i=0; i<rows*cols; i++) {
		r = i / cols;
		c = i % cols;
		array[i] = r + c;
	}

So assuming rows and cols are in a0 and a1 (and nonzero), it would look like this:

	la     t0, array    # p = &array[0]
	li     t1, 0        # i = 0
	mul    t2, a0, a1   # t2 = rows * cols
loop:
	div    t3, t1, a1   # r = i / cols
	rem    t4, t1, a1   # c = i % cols
	add    t3, t3, t4    # t3 = r + c

	sw     t3, 0(t0)      # array[i] = *p = r + c

	addi   t1, t1, 1      # i++
	addi   t0, t0, 4      # p++ (keep pointer in sync with i, aka p = &array[i])
	blt    t1, t2, loop   # while (i < rows*cols)

You might ask if it’s it worth it to convert it to a single loop when you still need the original i and j as if you were doing nested loops. Generally, it is much nicer to avoid nested loops in assembly if you can. There are many cases when you get the best of both worlds though. If you’re doing a clear for example, setting the entire array to a single value, there’s no need to calculate the row and column like we did here. I only picked this example to show how you could get them back if you needed them.

For comparison here’s the nested translation (while still taking advantage of the 1D arrangement of memory and pointer iterators):

	la     t0, array    # p = &array[0]
	li     t1, 0        # i = 0
looprows:
	li     t2, 0        # j = 0
loopcols:
	add    t3, t1, t2     # t3 = i + j
	sw     t3, 0(t0)      # array[i][j] = *p = i + j

	addi   t2, t2, 1         # j++
	addi   t0, t0, 4         # p++ (keep pointer in sync with i and j, aka p = &array[i][j])
	blt    t2, a1, loopcols  # while (j < cols)

	addi   t1, t1, 1          # i++
	blt    t1, a0, looprows   # while (i < rows)

It’s the same number of instructions, but with an extra label and branch. I think I prefer this version despite the extra branch. On the other hand, either of the last 2 versions are better than the literal translation below:

	la     t0, array    # p = &array[0]
	li     t1, 0        # i = 0
looprows:
	li     t2, 0        # j = 0
loopcols:
	add    t3, t1, t2    # t3 = i + j

	# need to calculate the byte offset of element array[i][j]
	mul    t4, t1, a1      # t4 = i * cols
	add    t4, t4, t2      # t4 = i * cols + j
	slli   t4  t4, 2       # t4 = (i * cols + j) * sizeof(int)

	add    t4, t4, t0      # t4 = &array[i][j] (calculated as array + (i*cols + j)*4)

	sw     t3, 0(t4)       # array[i][j] = i + j

	addi   t2, t2, 1         # j++
	blt    t2, a1, loopcols  # while (j < cols)

	addi   t1, t1, 1          # i++
	blt    t1, a0, looprows   # while (i < rows)

That chunk in the middle calculating the offset of every element? Not only is it far slower than iterating the pointer through the array, but you can imagine how much worse it would be for a 3D array with 3 nested loops.

Conclusion

Hopefully after those examples you have a more solid understanding of looping in RISC-V and how to transform various loops and array accesses into the form that makes your life the easiest. There is more we could cover here, like looping through a linked list, but I think that’s beyond the scope of what we’ve covered so far. Perhaps in a later chapter.