I didn't say it when I posted the patch, but I thought that changing lines like "b = a / 100" into "b = divu_100(a)" was pretty crappy. What I said was that it felt "error prone" -- borderline unreadable would be nearer the mark. So I'm glad to report that I have come up with a better way. No need to make the C code unreadable, better performance, and smaller executable.
What I've done is put a somewhat general purpose optimiser inside the build process. I tweaked Makefile.ms2 so that C code would go through a pipeline:
- gcc takes the .c file, compiles it into its assembly language equivalent, and stashes the result as a .S file
- peepopt (a Perl script) takes the .S file, applies a number of optimisations to the code, and stores the result as a .s file
- the .s file is assembled and linked as usual by the GNU assembler and linker
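For illustration, the pipeline could be driven by something like this. This is a Python sketch, not the actual Makefile.ms2 rules, and the toolchain names ("m6812-elf-...") and flags are guesses:

```python
# Hypothetical sketch of the three-stage build pipeline; the compiler and
# assembler names and flags here are assumptions, not the real Makefile rules.
def pipeline_commands(c_file):
    """Return the commands that take a .c file through .S and .s to .o."""
    stem = c_file[:-2]                                       # strip ".c"
    return [
        ["m6812-elf-gcc", "-S", c_file, "-o", stem + ".S"],  # C -> raw asm
        ["perl", "peepopt", stem + ".S", stem + ".s"],       # optimise asm
        ["m6812-elf-as", stem + ".s", "-o", stem + ".o"],    # assemble
    ]
```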
Peepopt isn't all that fancy. In a way it's just an automated text editor which does search and replace of dumb code patterns with smarter ones.
It is, itself, pretty dumb -- but I hope it'll be worthwhile. It currently performs three optimisations. Each makes the code slightly smaller and faster.
Optimisation 1 -- merge consecutive MOVB instructions into MOVW instructions. Given that a MOVW moves two bytes in just the same time and with just the same amount of code as a MOVB, this can make for a tidy saving. Here's an example from ms2_extra_misc.S:
Code:
Before:

.LSM306:
    movw outpc+6,txbuf+6
.LSM307:
    movw outpc+8,txbuf+8
.LSM308:
    movb outpc+10,txbuf+10
.LSM309:
    movb outpc+11,txbuf+11
.LSM310:
    movb outpc+12,txbuf+12
.LSM311:
    movb outpc+13,txbuf+13
.LSM312:
    movb outpc+14,txbuf+14
.LSM313:
    movb outpc+15,txbuf+15
.LSM314:
    movw outpc+16,txbuf+16
.LSM315:
    movw outpc+18,txbuf+18

After:

.LSM306:
    movw outpc+6,txbuf+6
.LSM307:
    movw outpc+8,txbuf+8
.LSM308:
.LSM309:
    MOVW outpc+10,txbuf+10
.LSM310:
.LSM311:
    MOVW outpc+12,txbuf+12
.LSM312:
.LSM313:
    MOVW outpc+14,txbuf+14
.LSM314:
    movw outpc+16,txbuf+16
.LSM315:
    movw outpc+18,txbuf+18
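As a rough illustration of what peepopt does here -- a Python sketch of the idea, not the actual Perl code -- a pair of byte moves at consecutive addresses collapses into one word move. For simplicity this version only handles directly adjacent pairs; the real script also has to step over the .LSM debug labels sitting between them:

```python
import re

# Sketch of optimisation 1: merge two MOVBs at consecutive addresses
# into a single MOVW. Illustration only, not the real peepopt script.
MOVB = re.compile(r"^\s*movb\s+(\w+)\+(\d+),(\w+)\+(\d+)\s*$")

def merge_movb(lines):
    out, i = [], 0
    while i < len(lines):
        a = MOVB.match(lines[i])
        b = MOVB.match(lines[i + 1]) if i + 1 < len(lines) else None
        if (a and b
                and a.group(1) == b.group(1) and a.group(3) == b.group(3)
                and int(b.group(2)) == int(a.group(2)) + 1
                and int(b.group(4)) == int(a.group(4)) + 1):
            # two one-byte moves become a single two-byte MOVW
            out.append("\tMOVW %s+%s,%s+%s" % a.group(1, 2, 3, 4))
            i += 2
        else:
            out.append(lines[i])
            i += 1
    return out
```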
Optimisation 2 is quite similar: it merges CLR instructions so that memory is cleared 16 bits at a time. Unfortunately, there isn't a way of doing this quite as easily as with the MOV* instructions, so each candidate CLR block is evaluated to see whether the overhead will be worthwhile (as long as there are more than 3 byte-pairs to be cleared, it is). If so, the block is bracketed in PSHD, CLRA, CLRB, then STD to whatever addresses need 16-bit clears, all rounded off with a PULD. Here's one of the blocks as updated:
Code:
Before:

.L54:
.LSM49:
    clr dt2.0+2
    clr dt2.0+3
    clr dt2.0
    clr dt2.0+1
.LSM50:
    clr dt3.1+2
    clr dt3.1+3
    clr dt3.1
    clr dt3.1+1

After:

.L54:
; ========================================
; Merge CLR instructions
    PSHD
    CLRA
    CLRB
    STD dt2.0
.LSM49:
    STD dt2.0+2
    STD dt3.1
.LSM50:
    STD dt3.1+2
    PULD
; End merged CLR instructions
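The shape of that transformation can be sketched like so. Again this is Python for illustration, not the real Perl; it assumes the clr operands in a run come in low/high byte pairs, and applies the more-than-3-pairs threshold described above:

```python
import re

# Sketch of optimisation 2: a run of byte CLRs is turned into CLRA/CLRB
# plus 16-bit STDs when the run is long enough to pay for the
# PSHD/CLRA/CLRB/PULD overhead. Illustration only.
CLR = re.compile(r"^\s*clr\s+(\S+)\s*$")

def merge_clr(lines, min_pairs=4):            # "more than 3 byte-pairs"
    ops = [CLR.match(l).group(1) for l in lines]  # operands of the clr run
    # assumed layout: operands alternate pair-start / pair-second byte,
    # so every second operand is the address of a 16-bit pair
    pairs = ops[0::2]
    if len(pairs) < min_pairs:
        return lines                          # not worth the overhead
    out = ["\tPSHD", "\tCLRA", "\tCLRB"]      # save D, then set D = 0
    out += ["\tSTD %s" % p for p in pairs]    # one word store per pair
    out.append("\tPULD")                      # restore D
    return out
```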
Optimisation 3 is quite a bit more involved and implements the divide stuff that I was talking about earlier. It looks for 32-bit divisions where the denominator is a 16-bit constant and, when it finds one, reconstructs the call to invoke the assembly language division function I wrote before, which is always a bit faster than the default one, and sometimes very much faster. Here's an example:
Code:
Before:

    bsr __mulsi3
    leas 4,sp
    std 26,sp
    stx 24,sp
    movw #100,2,-sp
    clr 1,-sp
    clr 1,-sp
    ldd 30,sp
    ldx 28,sp
    bsr __udivsi3
    leas 4,sp
    std 26,sp
    stx 24,sp
.L51:

After:

    bsr __mulsi3
    leas 4,sp
    std 26,sp
    stx 24,sp
    LDD 26,sp
    LDX 24,sp
    ldy #100
    jsr divu32_16
    std 26,sp
    stx 24,sp
.L51:
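The rewrite boils down to matching the push-constant / call __udivsi3 shape and substituting the divu32_16 call, adjusting the sp-relative offsets because the pushed denominator is gone. A Python sketch of just that shape (the real Perl script is more thorough):

```python
import re

# Sketch of optimisation 3: match the generic 32/32 divide-by-constant
# sequence and reroute it through the hand-written divu32_16 routine.
# Illustration only; just this exact instruction shape is handled.
PAT = [r"movw\s+#(\d+),2,-sp",   # push 16-bit constant denominator
       r"clr\s+1,-sp",           # ...zero-extended to 32 bits on the stack
       r"clr\s+1,-sp",
       r"ldd\s+(\S+)",           # numerator into D:X
       r"ldx\s+(\S+)",
       r"bsr\s+__udivsi3",       # generic 32/32 library divide
       r"leas\s+4,sp"]           # pop the denominator again

def adjust(operand):
    # with the four pushed bytes gone, sp-relative offsets drop by 4
    off, base = operand.split(",")
    return "%d,%s" % (int(off) - 4, base)

def rewrite_udiv(lines):
    for i in range(len(lines) - len(PAT) + 1):
        m = [re.fullmatch(r"\s*" + p + r"\s*", lines[i + j])
             for j, p in enumerate(PAT)]
        if all(m):
            return (lines[:i]
                    + ["\tLDD " + adjust(m[3].group(1)),
                       "\tLDX " + adjust(m[4].group(1)),
                       "\tldy #" + m[0].group(1),   # denominator in Y
                       "\tjsr divu32_16"]
                    + lines[i + len(PAT):])
    return lines
```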
Still, I think this is greatly preferable to what I was suggesting a couple of weeks ago. The C code is much more readable, and the Perl script hides the ugliness (until you look inside it!). I think this approach could be applied to other areas where we want particular optimisations. For instance, some of the existing blocks of inline assembly language could have been handled with this same sort of hack, keeping the C code more readable and, more importantly, giving us one place to fix errors in the inserted code.
Now, to the performance change. Not huge, but significant enough I think. The image was something like 1300 bytes smaller, and the looptime on the stim dropped from about 400us to 370us with my configuration -- a bit less than 10% faster. And my car runs OK with it too, so it doesn't seem to have ballsed anything up (for me).
Here's the patch:

Where to from here?
Obviously we can add more patterns for improvement, but there are limits. This is a peephole optimiser. It is a pretty stupid beast that simply looks for patterns and has little, if any, understanding of what the instructions do.
It is possible to write a more sophisticated optimiser. It would need to understand the effects of each instruction and could then spot things like the snippet on the RHS of the last example above -- where D and X are stored and immediately retrieved from the same addresses. These data movement concerns are probably the majority of the bad code that gcc generates, but writing a full data flow analysis based optimiser is a big job and pretty hard to justify for the benefits we'll reap.
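For what it's worth, the very simplest case of that store-then-load pattern could even be caught by a peephole rule, but only when the store and load are adjacent -- the interleaved D/X case in the example above already needs real knowledge of what the instructions do. A naive Python sketch of the adjacent-only case, to show where the peephole approach runs out:

```python
import re

# Sketch: drop a load of D (or X) that immediately follows a store to the
# same address. This only proves the idea -- it handles adjacent pairs
# only, and a real optimiser would also have to know the loaded value is
# otherwise dead. Illustration, not part of peepopt.
STORE_LOAD = {"std": "ldd", "stx": "ldx"}   # store -> matching load

INSN = re.compile(r"\s*(\w+)\s+(\S+)\s*$")

def drop_redundant_load(lines):
    out = []
    for line in lines:
        m = INSN.match(line)
        if m and out:
            prev = INSN.match(out[-1])
            if (prev
                    and STORE_LOAD.get(prev.group(1).lower()) == m.group(1).lower()
                    and prev.group(2) == m.group(2)):
                continue        # reload from the just-stored address: drop
        out.append(line)
    return out
```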
I think there is probably more future in the hybrid approach I used above to implement div32. It gives us a new way to integrate custom assembly language code while still keeping the C code readable.
Have fun,
Rob.