How to document the code
Disassembling the code is easy, and computers are very good at it.
What is difficult is to label and comment the original code, finding the meaning behind the function calls and memory manipulations.
Of course, we can just read the disassembled sources in a text editor, understand what the assembly does, and label it accordingly. However, this is a slow and difficult process. Fortunately, there are a few techniques to help us.
See also: Tooling for reverse-engineering
In the raw disassembly, functions are only referenced by their address. It looks like:
func_003_4563::
; This is the function originally located in bank 003, at address 4563
There are a few useful things to do with a raw function address like this:
- Cross-reference it (if it is called from bank 0)
- Label the function according to its purpose
- Document the inner workings of the function
Functions called from bank 0 are usually called by memory address. Once a new bank is disassembled, by reading the code, we can actually infer that an address stands for a function in a specific bank.
For instance, consider the following code:
; Switch to bank $03
ld a, $03
ld [MBC3SelectBank], a
; Call function at address $4563
call $4563
This obviously calls the address $4563 in bank $03. So we can cross-reference the function, and rewrite it as:
ld a, BANK(func_003_4563)
ld [MBC3SelectBank], a
call func_003_4563
This is good: now, even if the address of func_003_4563 changes (or the function is moved to another bank), the code will still be assembled correctly.
We can do even better: as calling a function in another bank (a "farcall") is done quite often, macros were written to shorten this sequence. So it can be rewritten as:
callsb func_003_4563
callsb is a macro that will expand to exactly the code above (see macros.asm for more details).
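As a rough illustration of the cross-referencing step, the rewrite from a raw bank switch and call to a callsb line could be automated with a short script. This is only a sketch, not part of the project's tooling: the helper name rewrite_farcalls and the regular expression are assumptions, and it only recognises the exact three-instruction sequence with no comments or other lines in between.

```python
import re

# The three-instruction sequence: load a bank number, select it, call an address.
# MBC3SelectBank and the func_BBB_AAAA naming come from the text above;
# everything else here is a hypothetical helper.
FARCALL = re.compile(
    r"^(?P<indent>[ \t]*)ld[ \t]+a,[ \t]*\$(?P<bank>[0-9A-Fa-f]{2})[ \t]*\n"
    r"[ \t]*ld[ \t]+\[MBC3SelectBank\],[ \t]*a[ \t]*\n"
    r"[ \t]*call[ \t]+\$(?P<addr>[0-9A-Fa-f]{4})[ \t]*$",
    re.MULTILINE,
)


def rewrite_farcalls(source):
    """Replace each raw bank-switch-and-call sequence with a callsb line."""

    def replace(match):
        bank = int(match.group("bank"), 16)
        addr = int(match.group("addr"), 16)
        # Only $4000-$7FFF is the switchable ROM window; a call below $4000
        # targets bank 0, so that sequence is left untouched.
        if not 0x4000 <= addr < 0x8000:
            return match.group(0)
        return f"{match.group('indent')}callsb func_{bank:03X}_{addr:04X}"

    return FARCALL.sub(replace, source)


raw = "ld a, $03\nld [MBC3SelectBank], a\ncall $4563\n"
print(rewrite_farcalls(raw), end="")
```

The output still needs a manual review: the script cannot tell whether the bank number loaded into a is really the one in effect when the call happens.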
Gaining a generic understanding of what a function does as a whole (even without understanding the details) is quite useful: we can give it a name, and find it called in other places.
Some methods for understanding the general purpose of a function:
- Change the first instruction of the function to be a ret (return), to see what happens when the function is not executed;
- Use a tool to display the function as C-like code, and read it (like awake; see below);
- As a last resort, read the assembly code.
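The first method is normally done by editing the assembly source and rebuilding. As an alternative sketch, the same experiment can be run on a copy of the built ROM by overwriting the first byte of the function with the ret opcode ($C9). The helper names below are hypothetical, and the offset computation assumes the usual layout where each 16 KiB bank is stored consecutively in the file and banks other than 0 are mapped at $4000-$7FFF.

```python
RET_OPCODE = 0xC9   # encoding of `ret` on the Game Boy CPU
BANK_SIZE = 0x4000  # 16 KiB per ROM bank


def rom_offset(bank, address):
    """Convert a (bank, CPU address) pair into an offset in the ROM file."""
    if 0 <= address < BANK_SIZE:
        # $0000-$3FFF always maps to bank 0, whatever bank is selected.
        return address
    if BANK_SIZE <= address < 2 * BANK_SIZE:
        # $4000-$7FFF is the switchable window.
        return bank * BANK_SIZE + (address - BANK_SIZE)
    raise ValueError(f"${address:04X} is not a ROM address")


def stub_out_function(rom, bank, address):
    """Return a copy of the ROM where the function returns immediately."""
    patched = bytearray(rom)
    patched[rom_offset(bank, address)] = RET_OPCODE
    return bytes(patched)


# func_003_4563 lives at file offset $C563 under these assumptions.
print(hex(rom_offset(0x03, 0x4563)))
```

Always patch a copy, never the original ROM, and expect crashes when the stubbed function was supposed to set up state that later code relies on.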
A binary ROM has no fixed structure. At first glance, it just looks like a bag of bytes, without any clear separation between code, text, graphics, and data. There is no metadata to tell the disassembler which parts are executable code, and which parts are data.
So we have to figure out the data parts ourselves.
By default, the disassembler considers everything as code. But sometimes we see a pattern like this:
nop ; 5EA6: $00
nop ; 5EA7: $00
rst $38 ; 5EA8: $FF
add a, b ; 5EA9: $80
add a, b ; 5EAA: $80
func_001_5EAB::
ld hl, $5EA6 ; 5EAB:
add hl, de ; 5EAE:
ld a, [hl] ; 5EB0:
Looking at this automatically-disassembled code, there are several hints that some of this is actually data:
; This sequence of code doesn't really make sense:
; usually code doesn't contain `nop` instructions,
; and adding the same value twice is dubious.
nop ; 5EA6: $00
nop ; 5EA7: $00
rst $38 ; 5EA8: $FF
add a, b ; 5EA9: $80
add a, b ; 5EAA: $80
func_001_5EAB::
; Loading a 2-byte number into `hl` is a common pattern for accessing a memory address as data
ld hl, $5EA6 ; 5EAB:
; Also, adding a variable to `hl` is a common pattern for accessing an array-like data structure
add hl, de ; 5EAE:
ld a, [hl] ; 5EB0:
All these hints point to $5EA6 being not code, but an array of data.
So we can rewrite it as:
data_001_5EA6::
db $00, $00, $FF, $80, $80
func_001_5EAB::
ld hl, data_001_5EA6 ; 5EAB:
add hl, de ; 5EAE:
ld a, [hl] ; 5EB0:
In this version:
- We labelled the data address with a reference (data_001_5EA6) rather than a raw numeric address;
- We converted the code into data instructions.
We still don't know exactly how this data is used, but at least we know it is data.
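Typing the db line by hand is error-prone for longer tables. The conversion above can be sketched as a small formatter that takes the raw bytes and emits a labelled data block. The function name and the per_line parameter are hypothetical; only the data_BBB_AAAA label scheme and the db syntax come from the example.

```python
def format_data_block(bank, address, data, per_line=8):
    """Render raw bytes as a labelled block of `db` directives."""
    lines = [f"data_{bank:03X}_{address:04X}::"]
    for start in range(0, len(data), per_line):
        chunk = data[start:start + per_line]
        lines.append("db " + ", ".join(f"${byte:02X}" for byte in chunk))
    return "\n".join(lines)


# The five bytes that were wrongly disassembled as nop / rst / add.
print(format_data_block(0x01, 0x5EA6, bytes([0x00, 0x00, 0xFF, 0x80, 0x80])))
```

Deciding where the data starts and ends remains a manual judgment; the formatter only removes the transcription step.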
A Game Boy game has no local variables. All of the game state is stored in globally accessible RAM areas.
Documenting the meaning behind a RAM address is very useful. It is difficult to understand the purpose of a function that operates on a raw RAM address:
func_020_52C3::
ld a, [$C14B]
inc a
ld [$C14B], a
cp $1F
jp z, label_020_52CF
ld a, $01
ld [$C409], a
label_020_52CF:
ret
However, if we can understand how the RAM address is used, it will be much easier to understand what the function is doing.
To understand the purpose of a variable, we can either:
- search for other places in the code where this variable is used, and how;
- or use BGB’s debugger to watch the variable change while the game is running;
- or even use BGB to change the value of the variable while the game is running, and see what it does.
If we’re lucky, we can eventually understand that $C14B is modified when the Pegasus Boots are charging.
We can then document this finding in wram.asm:
;wram.asm
wPegasusBootsChargeMeter:: ; C14B
ds 1
We can now perform a search-and-replace across the whole code base, to give meaning to the memory location:
func_020_52C3::
ld a, [wPegasusBootsChargeMeter]
inc a
ld [wPegasusBootsChargeMeter], a
cp $1F
jp z, label_020_52CF
ld a, $01
ld [$C409], a
label_020_52CF:
ret
The purpose of the function now becomes quite clear: it increments the Pegasus Boots charge meter.
This even gives us the probable use of $C409: it very probably stores whether the Pegasus Boots are fully charged.
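The search-and-replace step can be done in any editor. As a sketch of what a scripted version might look like, the hypothetical helper below swaps every occurrence of a raw address for its new label in a source string. It assumes addresses are written as four hex digits after a $ sign, as in the listings here.

```python
import re


def label_address(source, address, label):
    """Replace every `$XXXX` literal equal to `address` with `label`."""
    # The lookahead stops the pattern from matching the start of a longer
    # hex literal; matching is case-insensitive to catch `$c14b` as well.
    pattern = re.compile(rf"\${address:04X}(?![0-9A-Fa-f])", re.IGNORECASE)
    return pattern.sub(label, source)


code = "ld a, [$C14B]\ninc a\nld [$C14B], a\n"
print(label_address(code, 0xC14B, "wPegasusBootsChargeMeter"), end="")
```

A numeric match is not proof that a literal refers to the variable: the same value can appear as an unrelated constant, so the resulting diff is worth reading before committing.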
We can now label the extra variable, rename the function according to its purpose, and document its code.
ChargePegasusBoots::
; Increment wPegasusBootsChargeMeter
ld a, [wPegasusBootsChargeMeter]
inc a
ld [wPegasusBootsChargeMeter], a
; If wPegasusBootsChargeMeter == $1F, return early
cp $1F
jp z, .return
; Set the boots as charged
ld a, $01
ld [wPegasusBootsReady], a
.return
ret
#TODO