Notes
Just random notes.
Do not add call sequence pseudos if we don't actually need the stack in the call sequence; PEI thinks that anything with a call sequence adjusts the stack.
- Technically, calls do adjust the stack. Does this count?
- Use processFunctionBeforeFrameFinalized in TargetFrameInfo to do push-pop optimization. We can eliminate dead frame objects by marking them as such and they will be excluded from the stack.
- We also would like to order the stack objects to group similar stack objs together (for doing push-pop on 8-bit values) but we can't do that until AFTER processFunctionBeforeFrameFinalized... which is sort of annoying.
- Might be possible to do anyway somehow...
- Implement saveScavengerRegister?
- LAO is -1 since the SP points to top of stack at function entry.
- Prologue can either use `ADD SP+e` if we are GB (what about optz? can they expand the stack somehow?).
- Otherwise we need to do some kind of `LD HL, dd; ADD HL, SP; LD SP, HL`. This needs to save HL if it's live (an input parameter).
  - The solution in this case could be to add HL as a CSR. CalleeSavedInfo has a 'Restored' parameter that controls restoration; this is good, as this HL should not actually be restored in the epilogue.
  - Wait, what about adjusting the Min/MaxCSRObject variables? Probably still won't work.
  - Basically what we want is simply to add a frame object before/after the CSR frame objects are added, so we can do a `PUSH HL` before/after CSR and then restore it with a frame index load after the prologue.
  - This is all a good argument for excluding HL from the calling convention.
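The frame-object notes earlier in this list (drop dead objects, then group similar-sized objects so 8-bit values end up adjacent) can be modelled outside LLVM. This is only a sketch: `FrameObject`, `Slot` and `layoutFrame` are hypothetical names, not part of MachineFrameInfo, and real offsets would also have to respect alignment and the CSR area.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a stack object; NOT LLVM's MachineFrameInfo.
struct FrameObject {
  int index;      // frame index
  uint32_t size;  // size in bytes
  bool dead;      // marked dead, so it gets no stack space
};

// Hypothetical result: where each surviving object ends up.
struct Slot {
  int index;
  uint32_t offset;  // byte offset from the start of the local area
};

std::vector<Slot> layoutFrame(std::vector<FrameObject> objs) {
  // Dead objects are excluded from the stack entirely.
  objs.erase(std::remove_if(objs.begin(), objs.end(),
                            [](const FrameObject &o) { return o.dead; }),
             objs.end());

  // Group objects of the same size together. stable_sort keeps the
  // original order inside each size class.
  std::stable_sort(objs.begin(), objs.end(),
                   [](const FrameObject &a, const FrameObject &b) {
                     return a.size < b.size;
                   });

  // Assign consecutive offsets (no alignment handling in this sketch).
  std::vector<Slot> slots;
  uint32_t offset = 0;
  for (const FrameObject &o : objs) {
    slots.push_back({o.index, offset});
    offset += o.size;
  }
  return slots;
}
```

With this ordering, two neighbouring 1-byte objects share one 16-bit slot, which is the case the push-pop idea wants to exploit.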
Prologue is basically:
    if HL is live && (!GB || largeStack):
        Push HL (this should have the last frame object index if we do it right, so it must be added AFTER CSR is processed)
    Push CSR
    if GB:
        Repeat:
            ADD SP-stackSize
    else:
        LD HL, -stackSize
        ADD HL, SP
        LD SP, HL
        HL = LD16_FI (frameidx of the HL we pushed) (how to make this frameindex access efficient? It's at the very far end of the stack from the HL we just made!)
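The stack-adjust step of the prologue above could look roughly like this. It is a standalone sketch that only produces assembly text: `emitStackAdjust` and its parameters are hypothetical, and the HL save/restore and CSR pushes are left out. It assumes the GB `ADD SP` displacement is a signed 8-bit value, so a frame larger than 128 bytes needs the instruction repeated.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: returns the instructions that move SP down by
// stackSize bytes. isGB selects the ADD SP form; otherwise the
// HL-based sequence from the notes is used (which clobbers HL).
std::vector<std::string> emitStackAdjust(bool isGB, int stackSize) {
  assert(stackSize >= 0 && "frame size cannot be negative");
  std::vector<std::string> out;
  if (stackSize == 0)
    return out;  // nothing to allocate

  if (isGB) {
    // Signed 8-bit displacement: at most -128 per instruction.
    int remaining = stackSize;
    while (remaining > 0) {
      int chunk = std::min(remaining, 128);
      out.push_back("ADD SP, -" + std::to_string(chunk));
      remaining -= chunk;
    }
  } else {
    out.push_back("LD HL, -" + std::to_string(stackSize));
    out.push_back("ADD HL, SP");
    out.push_back("LD SP, HL");
  }
  return out;
}
```

For example, `emitStackAdjust(true, 300)` yields three `ADD SP` instructions (128 + 128 + 44), while `emitStackAdjust(false, 300)` always yields the three-instruction HL sequence regardless of frame size.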
Epilogue is much harder if HL is live-out, since we can't do the PUSH HL trick on function exit.
Look for patterns that are more efficient:
- Constant multiplication => add/shift combinations.
- Constant adds => combinations of inc/dec.
  - Add of i16 with constant has a break even at 4 inc/dec.
    - Cycles: normal 1+2+1+1+2+1=8, inc/dec*4: 2+2+2+2=8.
    - Size does not break even there: normal=8 bytes, inc/dec*4=4 bytes.
- Shift patterns which can be optimized differently to get fewer actual shifts. subreg extracts etc.
- Fix shifts in general. It's messy right now with the large expansion thing. Better to keep the original SHL and SHR nodes, maybe?
- Undo that idiotic DAGCombine for add -> or.
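The inc/dec break-even for 16-bit constant adds noted above can be written down as a small cost model. This is a sketch using only the numbers from these notes (8 cycles and 8 bytes for the normal sequence, 2 cycles and 1 byte per 16-bit inc/dec); `Cost`, `incDecCost` and `preferIncDec` are hypothetical names, and the tie-breaking rule is an assumption.

```cpp
#include <cstdlib>

struct Cost {
  int cycles;
  int bytes;
};

// Normal i16 add-with-constant sequence, costed as 1+2+1+1+2+1 in the notes.
constexpr Cost kNormalAdd16{8, 8};

// Cost of reaching the same result with |k| 16-bit INC or DEC instructions.
Cost incDecCost(int k) {
  int n = std::abs(k);
  return {2 * n, n};
}

// Decide whether the inc/dec chain should replace the normal sequence.
// Ties on the primary metric are broken by the secondary one (assumption).
bool preferIncDec(int k, bool optForSize) {
  Cost c = incDecCost(k);
  if (optForSize) {
    if (c.bytes != kNormalAdd16.bytes)
      return c.bytes < kNormalAdd16.bytes;
    return c.cycles < kNormalAdd16.cycles;
  }
  if (c.cycles != kNormalAdd16.cycles)
    return c.cycles < kNormalAdd16.cycles;
  return c.bytes < kNormalAdd16.bytes;
}
```

When optimizing for speed the chain wins up to |k| = 4 (equal cycles, half the bytes); when optimizing for size it wins up to |k| = 7.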
Move ADD16r expansion to PostRA or PostPEI so we can make a large ADD if possible. Hint the def with HL.
Add more allocation hints in general, on shifts, def of arith ops, postmod ptrs, etc.
Figure out why MachineCopyPropagation isn't working terribly well. We will have to add our own post-ra register coalescer most likely. Good to run it after both postra and postpei.
Add some kind of pass that tries to break up actual R16 regs into R8 components. This is probably beneficial in many cases since the register allocator won't be forced to keep a large number of R16 regs live at once.
- It might make push-pop optimization harder, though.