Experimental peephole optimiser

This is a forum for discussing the development and testing of alpha MS2/Extra code. (Runs on MS2 and Microsquirt)

Moderators: jsmcortina, muythaibxr

robs
Master MS/Extra'er
Posts: 564
Joined: Sun Jan 17, 2010 4:26 pm
Location: Sydney, Australia

Experimental peephole optimiser

Post by robs »

A new thread for a new year -- besides, the old thread had already drifted well away from trying to make the code more modular.

I didn't say it when I posted the patch, but I thought that changing lines like "b = a / 100" into "b = divu_100(a)" was pretty crappy. What I said was that it felt "error prone" -- borderline unreadable would be nearer the mark. So I'm glad to report that I have come up with a better way. No need to make the C code unreadable, better performance, and smaller executable.

What I've done is put a somewhat general purpose optimiser inside the build process. I tweaked Makefile.ms2 so that C code would go through a pipeline:
  1. gcc takes .c file, compiles into its assembly language equivalent and stashes this as a .S file
  2. peepopt (a Perl script) takes the .S file, applies a number of optimisations to the code and stores the result as a .s file
  3. the .s file is assembled and linked as usual by the GNU assembler and linker
So, the key question is, what does peepopt do?

Peepopt isn't all that fancy. In a way it's just an automated text editor which does search and replace of dumb code patterns with smarter ones.
It is, itself, pretty dumb -- but I hope it'll be worthwhile. It currently performs three optimisations. Each makes the code slightly smaller and faster.
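
To give a feel for the shape of the thing, here's a minimal sketch of the sort of driver involved. To be clear, this isn't lifted from peepopt -- the pass names and structure are made up for illustration:

Code: Select all

#!/usr/bin/perl
# Sketch of a peephole-optimiser driver: read the .S that gcc produced,
# run each rewrite pass over the whole listing, print the .s result.
# The pass subs here are empty placeholders, not the real peepopt rules.
use strict;
use warnings;

my @lines = <>;                    # gcc's .S output

for my $pass (\&merge_movb, \&merge_clr, \&rewrite_div) {
    @lines = $pass->(@lines);
}

print @lines;                      # goes on to the assembler as the .s file

sub merge_movb  { return @_ }      # placeholder -- see optimisation 1
sub merge_clr   { return @_ }      # placeholder -- see optimisation 2
sub rewrite_div { return @_ }      # placeholder -- see optimisation 3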

Optimisation 1 -- merge consecutive MOVB instructions into MOVW instructions. Given that a MOVW moves two bytes in just the same time and with just the same amount of code as a MOVB, this can make for a tidy saving. Here's an example from ms2_extra_misc.S:

Code: Select all

.LSM306:                                  .LSM306:
        movw    outpc+6,txbuf+6                  movw    outpc+6,txbuf+6
.LSM307:                                  .LSM307:
        movw    outpc+8,txbuf+8                  movw    outpc+8,txbuf+8
.LSM308:                                  .LSM308:
        movb    outpc+10,txbuf+10         .LSM309:
.LSM309:                                         MOVW    outpc+10,txbuf+10
        movb    outpc+11,txbuf+11         .LSM310:
.LSM310:                                  .LSM311:
        movb    outpc+12,txbuf+12                MOVW    outpc+12,txbuf+12
.LSM311:                                  .LSM312:
        movb    outpc+13,txbuf+13         .LSM313:
.LSM312:                                         MOVW    outpc+14,txbuf+14
        movb    outpc+14,txbuf+14         .LSM314:
.LSM313:                                         movw    outpc+16,txbuf+16
        movb    outpc+15,txbuf+15         .LSM315:
.LSM314:                                         movw    outpc+18,txbuf+18
        movw    outpc+16,txbuf+16
.LSM315:
        movw    outpc+18,txbuf+18
(sorry about all the ugly labels -- gcc generated them and they need to be kept for the later build stages)
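
The matching for this one needs little more than a regular expression. Here's a rough sketch of how such a rule might look -- it pretends the two movb's are on adjacent lines and ignores the .LSM labels the real script has to carry along, so it's an illustration, not the actual peepopt code:

Code: Select all

# Sketch of the movb-merging pass: two byte moves at adjacent offsets
# become one word move.  Only handles the symbol+offset form, and only
# adjacent lines.
sub merge_movb {
    my @in = @_;
    my @out;
    while (@in) {
        my $line = shift @in;
        if ($line =~ /^\s*movb\s+(\S+)\+(\d+),(\S+)\+(\d+)\s*$/ && @in) {
            my ($dst, $doff, $src, $soff) = ($1, $2, $3, $4);
            if ($in[0] =~ /^\s*movb\s+\Q$dst\E\+(\d+),\Q$src\E\+(\d+)\s*$/
                && $1 == $doff + 1 && $2 == $soff + 1) {
                shift @in;                            # drop the second movb
                push @out, "\tmovw\t$dst+$doff,$src+$soff\n";
                next;
            }
        }
        push @out, $line;
    }
    return @out;
}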

Optimisation 2 is quite similar: merging CLR instructions so that memory is cleared 16 bits at a time. Unfortunately, there isn't a way of doing this quite as easily as with the MOV* instructions, so each candidate CLR block is evaluated to see whether the overhead will be worthwhile (as long as there are more than 3 byte-pairs to be cleared, it's worthwhile). If it is, the block is bracketed by PSHD, CLRA, CLRB, then STD to whatever addresses need 16-bit clears, all rounded off with a PULD. Here's one of the blocks as updated:

Code: Select all

.L54:                       .L54:
.LSM49:                     ; ========================================
        clr     dt2.0+2     ; Merge CLR instructions
        clr     dt2.0+3             PSHD
        clr     dt2.0               CLRA
        clr     dt2.0+1             CLRB
.LSM50:                             STD     dt2.0
        clr     dt3.1+2     .LSM49:
        clr     dt3.1+3             STD     dt2.0+2
        clr     dt3.1               STD     dt3.1
        clr     dt3.1+1     .LSM50:
                                    STD     dt3.1+2
                                    PULD
                            ; End merged CLR instructions
It might look longer, but it's mostly comments. The original (left) is 24 bytes worth, the merged version (right) is 16. And 24 CPU cycles vs 19.
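
The decision and the shape of the replacement look something like this -- a sketch only; pairing the byte CLRs up into words (and the label juggling) is the part I've waved away here:

Code: Select all

# Sketch of re-emitting one run of byte CLRs.  $clr_lines is the original
# "clr" lines, $words is the 16-bit start address of each byte-pair.
# The pairing-up is done elsewhere in the script.
sub emit_clr_block {
    my ($clr_lines, $words) = @_;

    # Fewer than four byte-pairs: the PSHD/CLRA/CLRB/PULD overhead
    # isn't worth it, so leave the original CLRs alone.
    return @$clr_lines if @$words <= 3;

    return ("\tPSHD\n",                          # can't know if D is live, so save it
            "\tCLRA\n",
            "\tCLRB\n",                          # D is now zero
            (map { "\tSTD\t$_\n" } @$words),     # one 16-bit store per byte-pair
            "\tPULD\n");
}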

Optimisation 3 is quite a bit more involved and implements the divide stuff that I was talking about earlier. It looks for 32-bit divisions where the denominator is a 16-bit constant. When it finds one, it reconstructs the call to invoke the assembly language division function I wrote before, which is always a bit faster than the default one, and sometimes very much faster. Here's an example:

Code: Select all

        bsr     __mulsi3             bsr     __mulsi3
        leas    4,sp                 leas    4,sp
        std     26,sp                std     26,sp
        stx     24,sp                stx     24,sp
        movw    #100,2,-sp	        LDD     26,sp
        clr     1,-sp                LDX     24,sp
        clr     1,-sp                ldy     #100
        ldd     30,sp                jsr     divu32_16
        ldx     28,sp                std     26,sp
        bsr     __udivsi3            stx     24,sp
        leas    4,sp          .L51:   
        std     26,sp
        stx     24,sp
.L51:
The denominator is passed in the Y register, and my menagerie of div[su]32_nnn functions has been reduced to just one each for signed and unsigned. If you look closely, you'll see the arguments have had a bit of shuffling and the script has accounted for the movement on the stack. A bit tricky.
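
In its simplest form -- where the dividend is already sitting in the registers -- the pattern being hunted is just the push of the constant with a cleared high word, followed by the __udivsi3 call. A rough sketch, leaving out the argument-shuffling case shown above:

Code: Select all

# Sketch of the easy case of the divide rewrite: constant 16-bit divisor
# pushed with a cleared high word, straight into __udivsi3.  The harder
# case (reloading D and X and fixing the stack offsets) isn't shown.
sub rewrite_div {
    my @lines = @_;
    for (my $i = 0; $i + 4 < @lines; $i++) {
        next unless $lines[$i] =~ /^\s*movw\s+#(\d+),2,-sp\s*$/;
        my $divisor = $1;
        if ($lines[$i + 1] =~ /^\s*clr\s+1,-sp\s*$/     &&
            $lines[$i + 2] =~ /^\s*clr\s+1,-sp\s*$/     &&
            $lines[$i + 3] =~ /^\s*bsr\s+__udivsi3\s*$/ &&
            $lines[$i + 4] =~ /^\s*leas\s+4,sp\s*$/) {
            # the 32-bit numerator is already in the registers here;
            # the divisor now travels in Y instead of on the stack
            splice @lines, $i, 5, "\tldy\t#$divisor\n", "\tjsr\tdivu32_16\n";
        }
    }
    return @lines;
}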

Still, I think this is greatly preferable to what I was suggesting a couple of weeks ago. The C code is much more readable, and the Perl script hides the ugliness (until you look inside it!). I think this approach might be applied to other areas where we want particular optimisations. For instance, there are blocks of inline assembly language that could probably have been handled with this same sort of hack, keeping the code more readable and, more importantly, giving us one place to fix errors in the inserted code.

Now, to the performance change. Not huge, but significant enough I think. The image was something like 1300 bytes smaller, and the looptime on the stim dropped from about 400us to 370us with my configuration -- a bit less than 10% faster. And my car runs OK with it too, so it doesn't seem to have ballsed anything up (for me).

Here's the patch:
patch_peepopt.txt
Where to from here?

Obviously we can add more patterns for improvement, but there are limits. This is a peephole optimiser. It is a pretty stupid beast that simply looks for patterns and has little, if any, understanding of what the instructions do.

It is possible to write a more sophisticated optimiser. It would need to understand the effects of each instruction and could then spot things like the snippet on the RHS of the last example above -- where D and X are stored and immediately retrieved from the same addresses. These data movement concerns are probably the majority of the bad code that gcc generates, but writing a full data flow analysis based optimiser is a big job and pretty hard to justify for the benefits we'll reap.

I think there is probably more future in the hybrid approach I used above to implement div32. It gives us a new way to integrate custom assembly language code while still keeping the C code readable.

Have fun,

Rob.
muythaibxr
Site Admin
Posts: 8230
Joined: Thu Oct 14, 2004 12:48 pm

Re: Experimental peephole optimiser

Post by muythaibxr »

Pretty cool idea!

Fixing the compiler itself so it doesn't make those code mistakes in the first place would be a good way to take it a step further, but at the very least this is a very good method to prove out your ideas!

Ken
jsmcortina
Site Admin
Posts: 39619
Joined: Mon May 03, 2004 1:34 am
Location: Birmingham, UK
Contact:

Re: Experimental peephole optimiser

Post by jsmcortina »

The CLR improvement can go a step further on S12X by using the CLRW mnemonic.

On the divide, what happens if the 16bit divisor is a small number - how do you handle overflow?
e.g. 65536 / 1

James
I can repair or upgrade Megasquirts in UK. http://www.jamesmurrayengineering.co.uk

My Success story: http://www.msextra.com/forums/viewtopic ... 04&t=34277
MSEXTRA documentation at: http://www.msextra.com/doc/index.html
New users, please read the "Forum Help Page".
robs
Master MS/Extra'er
Posts: 564
Joined: Sun Jan 17, 2010 4:26 pm
Location: Sydney, Australia

Re: Experimental peephole optimiser

Post by robs »

jsmcortina wrote:The CLR improvement can go a step further on S12X by using the CLRW mnemonic.

On the divide, what happens if the 16bit divisor is a small number - how do you handle overflow?
e.g. 65536 / 1
Not too hard to give the script a "-X" option (say) to let the optimiser know it can use S12X instructions.

The divide functions divs32_16 and divu32_16 are 32-bit divided by 16-bit functions. They return a 32-bit quotient. This common case will be much faster than 32/32 division because it can be achieved in at most two hardware divide operations (split the 32-bit numerator into 16-bit halves, divide the high half first, then divide its remainder combined with the low half). Full 32/32 division can only be done using bit-by-bit operations, which take ages.
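
If it helps to see why two divides are enough: it's just schoolbook long division in base 65536. Here's the arithmetic written out in Perl, purely as an illustration -- the real divu32_16 is assembly, and I'm taking the 32/16 hardware divide as the primitive:

Code: Select all

# The idea behind divu32_16, written out in Perl: each int(...) below
# corresponds to one 32/16 hardware divide, so two of them give the
# full 32-bit quotient.
sub divu32_16_idea {
    my ($n, $d) = @_;                    # 32-bit dividend, 16-bit divisor (d > 0)
    my ($n_hi, $n_lo) = (int($n / 65536), $n % 65536);

    my $q_hi = int($n_hi / $d);                       # first divide
    my $r    = $n_hi - $q_hi * $d;                    # its remainder
    my $q_lo = int(($r * 65536 + $n_lo) / $d);        # second divide

    return $q_hi * 65536 + $q_lo;        # 32-bit quotient
}

# e.g. divu32_16_idea(100_000_000, 100) == 1_000_000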

This mixed-mode stuff, where we have known 32/16-bit divisions and such like, is something that the compiler doesn't do well. Fair enough, in a way, since the C standard only requires the result to be as if the computation were done at the promoted precision, leaving the compiler free to narrow the actual operations when the result is unaffected. There are probably still some rewards to be reaped from other similar "strength reduction" optimisations, e.g. a few hand-coded functions for things like 16x32-bit multiplication, in combination with the peephole optimiser spotting that the high order word of a 32-bit operand is always zero and tweaking the call accordingly.
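
To make the 16x32 case concrete: once you know the high word of one operand is zero, the low 32 bits of the product need only two 16x16 multiplies, where the general 32x32 case needs more (plus the carry juggling). In Perl, purely to illustrate the arithmetic (and assuming a 64-bit perl so the intermediate products stay exact):

Code: Select all

# Low 32 bits of a 16-bit by 32-bit product, using two 16x16 multiplies.
# Just the arithmetic idea -- a hand-coded HCS12 version would use EMUL.
sub mulu16_32_idea {
    my ($a, $b) = @_;                      # a < 2^16, b < 2^32
    my ($b_hi, $b_lo) = (int($b / 65536), $b % 65536);

    my $lo = $a * $b_lo;                   # first 16x16 multiply
    my $hi = $a * $b_hi;                   # second 16x16 multiply

    return ($lo + ($hi % 65536) * 65536) % 4294967296;
}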

BTW, in ms2_extra_idle.c, I noticed a hardwired division by 10,000,000. I'm almost certain this would execute faster as consecutive divisions by (say) 10,000 and 1,000 so that we have 16-bit denominators. Whether it amounts to much...

And Ken, glad the idea appeals.

The compiler could certainly be improved upon, but I can't get terribly enthusiastic about it. Gcc is pretty complex and I'm rather slow at getting to grips with other people's code. Intercepting its output afterwards seemed a much easier approach.

What I'd really like to think about is an application specific language for ECU development. In my imagined future, your config file would lead to an image being generated specifically for you -- just like the Linux kernel (and Unix kernels for decades). So our top level language might be something like the current config file. At the bottom is HCS12 machine code. In the middle is some sort of vocabulary of components that can be put in various combinations. But there we run the risk of my harping on about some mythical driver layer again.

Have fun,

Robert.
davcol
Experienced MS/Extra'er
Posts: 160
Joined: Tue Dec 22, 2009 6:12 pm

Re: Experimental peephole optimiser

Post by davcol »

Great work.
So, will there be anything for testing soon? :RTFM:
m2 extra v3.57 3sgte full sequential injection V1.1A P&H Board
full sequential spark low current c.o.p.
holset turbocharger
jsmcortina
Site Admin
Posts: 39619
Joined: Mon May 03, 2004 1:34 am
Location: Birmingham, UK
Contact:

Re: Experimental peephole optimiser

Post by jsmcortina »

Robs,
I wish we could persuade you to tackle this in gcc itself! I'm sure there are some really big gains to be had there.

For one, brset/brclr need fixing up. There are some notes from Carrez in the hc11md file about problems with that, which is why the generated code is often so poor for single bit-checks. I recall it saying that his previous optimisation caused a nasty bug.

What results exactly were you seeing with the reduction of the soft registers? I recall trying this a while back but I forget what I saw. I know taking it too far caused an internal compiler error.

How do the latest ideas impact ms2_extra_ign_in? That's the single best place to try to make speedups.

James
I can repair or upgrade Megasquirts in UK. http://www.jamesmurrayengineering.co.uk

My Success story: http://www.msextra.com/forums/viewtopic ... 04&t=34277
MSEXTRA documentation at: http://www.msextra.com/doc/index.html
New users, please read the "Forum Help Page".
robs
Master MS/Extra'er
Posts: 564
Joined: Sun Jan 17, 2010 4:26 pm
Location: Sydney, Australia

Re: Experimental peephole optimiser

Post by robs »

jsmcortina wrote: I wish we could persuade you to tackle this in gcc itself! I'm sure there are some really big gains to be had there.
For sure. My few benchmarks have CodeWarrior about 30% better on both size and speed -- its output looks about as good as you could hope for from a compiler.

OTOH, this single little optimiser got about 7% speedup, which wasn't bad for a fairly small amount of work.
jsmcortina wrote: For one, brset/brclr need fixing up. There are some notes from Carrez in the hc11md file about problems with that, which is why the generated code is often so poor for single bit-checks. I recall it saying that his previous optimisation caused a nasty bug.
Where can I find hc11md?
jsmcortina wrote: What results exactly were you seeing with the reduction of the soft registers? I recall trying this a while back but I forget what I saw. I know taking it too far caused an internal compiler error.
Nothing concrete -- just a general feel I got from looking at the output. Less pointless storing and retrieving of values. No question that the output was a few hundred bytes smaller which often means faster with these processors.
jsmcortina wrote: How do the latest ideas impact ms2_extra_ign_in? That's the single best place to try to make speedups.
The object size has gone from 13,703 bytes to 13,601. A diff of the two (tkdiff gives nicer output, but I'll spare you the screen dump):

Code: Select all

--- ms2_extra_ign_in.S	2011-01-06 11:44:33.000000000 +1100
+++ ms2_extra_ign_in.s	2011-01-08 09:19:40.000000000 +1100
@@ -436,16 +436,18 @@
 .LSM48:
 	clr	tooth_no
 .L54:
+; Merge CLR instructions
+	PSHD
+	CLRA
+	CLRB
+	STD	dt2.0
 .LSM49:
-	clr	dt2.0+2
-	clr	dt2.0+3
-	clr	dt2.0
-	clr	dt2.0+1
+	STD	dt2.0+2
+	STD	dt3.1
 .LSM50:
-	clr	dt3.1+2
-	clr	dt3.1+3
-	clr	dt3.1
-	clr	dt3.1+1
+	STD	dt3.1+2
+	PULD
+; End merged CLR instructions
 .LSM51:
 	ldx	#0
 	ldd	dt3.1+2
@@ -464,35 +466,31 @@
 .LSM54:
 	movw	_.d4,tooth_time1.2+2
 	movw	_.d3,tooth_time1.2
-.LSM55:
-	clr	tooth_time2.3+2
-	clr	tooth_time2.3+3
-	clr	tooth_time2.3
-	clr	tooth_time2.3+1
-.LSM56:
-	clr	tooth_diff_this+2
-	clr	tooth_diff_this+3
-	clr	tooth_diff_this
-	clr	tooth_diff_this+1
+; Merge CLR instructions
+	PSHD
+	CLRA
+	CLRB
+.LSM61:
+	CLR	firstsync_iter
+.LSM60:
+	CLR	fuel_cntr
+	STD	tooth_diff_last
 .LSM57:
-	clr	tooth_diff_last+2
-	clr	tooth_diff_last+3
-	clr	tooth_diff_last
-	clr	tooth_diff_last+1
+	STD	tooth_diff_last+2
+	STD	tooth_diff_last_1
 .LSM58:
-	clr	tooth_diff_last_1+2
-	clr	tooth_diff_last_1+3
-	clr	tooth_diff_last_1
-	clr	tooth_diff_last_1+1
+	STD	tooth_diff_last_1+2
+	STD	tooth_diff_last_2
 .LSM59:
-	clr	tooth_diff_last_2+2
-	clr	tooth_diff_last_2+3
-	clr	tooth_diff_last_2
-	clr	tooth_diff_last_2+1
-.LSM60:
-	clr	fuel_cntr
-.LSM61:
-	clr	firstsync_iter
+	STD	tooth_diff_last_2+2
+	STD	tooth_diff_this
+.LSM56:
+	STD	tooth_diff_this+2
+	STD	tooth_time2.3
+.LSM55:
+	STD	tooth_time2.3+2
+	PULD
+; End merged CLR instructions
 .LSM62:
 	bclr	flagbyte4, #16
 .LSM63:
@@ -5946,13 +5944,11 @@
 	movw	2,x,next_spark+2
 	movw	0,x,next_spark
 .LSM1052:
-	movb	4,x,next_spark+4
 .LSM1053:
-	movb	5,x,next_spark+5
+	MOVW	4,x,next_spark+4
 .LSM1054:
-	movb	6,x,next_spark+6
 .LSM1055:
-	movb	7,x,next_spark+7
+	MOVW	6,x,next_spark+6
 .L861:
 .LSM1056:
 	ldab	15,sp
@@ -6030,9 +6026,8 @@
 	movw	2,x,next_dwell+2
 	movw	0,x,next_dwell
 .LSM1072:
-	movb	4,x,next_dwell+4
 .LSM1073:
-	movb	5,x,next_dwell+5
+	MOVW	4,x,next_dwell+4
 .L869:
 .LSM1074:
 	ldab	15,sp
@@ -6098,9 +6093,8 @@
 	movw	2,x,next_dwl_trl+2
 	movw	0,x,next_dwl_trl
 .LSM1088:
-	movb	4,x,next_dwl_trl+4
 .LSM1089:
-	movb	5,x,next_dwl_trl+5
+	MOVW	4,x,next_dwl_trl+4
 .L877:
 .LSM1090:
 	ldab	15,sp
@@ -6166,9 +6160,8 @@
 	movw	2,x,next_spk_trl+2
 	movw	0,x,next_spk_trl
 .LSM1104:
-	movb	4,x,next_spk_trl+4
 .LSM1105:
-	movb	5,x,next_spk_trl+5
+	MOVW	4,x,next_spk_trl+4
 .L885:
 .LSM1106:
 	ldab	15,sp
@@ -6359,11 +6352,8 @@
 	movw	dt2.0,2,-sp
 	bsr	__mulsi3
 	leas	4,sp
-	movw	#100,2,-sp
-	clr	1,-sp
-	clr	1,-sp
-	bsr	__udivsi3
-	leas	4,sp
+	ldy	#100
+	jsr	divu32_16
 	ldy	#dt3.1
 	cpx	0,y
 	blo	.L905
@@ -6405,11 +6395,8 @@
 	movw	dt2.0,2,-sp
 	bsr	__mulsi3
 	leas	4,sp
-	movw	#100,2,-sp
-	clr	1,-sp
-	clr	1,-sp
-	bsr	__udivsi3
-	leas	4,sp
+	ldy	#100
+	jsr	divu32_16
 	std	_.d2
 	stx	_.d1
 .LSM1151:
@@ -6827,17 +6814,19 @@
 	tbeq	d,.L951
 .LSM1253:
 .L1008:
-	clr	outpc+2
-	clr	outpc+3
-.LSM1254:
-	clr	req_pw1
-	clr	req_pw1+1
+; Merge CLR instructions
+	PSHD
+	CLRA
+	CLRB
+	STD	outpc+2
 .LSM1255:
-	clr	outpc+4
-	clr	outpc+5
+	STD	outpc+4
+.LSM1254:
+	STD	req_pw1
 .LSM1256:
-	clr	req_pw2
-	clr	req_pw2+1
+	STD	req_pw2
+	PULD
+; End merged CLR instructions
 .LSM1257:
 	ldab	seq_inj_ctrl
 	anda	#0
So, a few CLRs and MOVBs merged and a couple of divides sped up. Won't hurt in any case.

So, to your particular wish that I'd put my efforts into gcc... There are a few things that make me unwilling. Firstly, as I said, I'm slower than most at getting to grips with other people's code. This is particularly the case with typical GNU stuff, where the authors seem to believe that their code is so eloquent that it needs no comments.

On a more technical level, it would require me to get to grips with gcc's internal data structures. Since the '80s or so, compilers have generally worked in the same way. They have a general purpose language parser and code generator which, typically, generates bloody awful code for a pseudo-machine. At this stage some optimisations are usually performed. Then there is the real code generator which turns that pseudo-machine's code into code for the actual target machine. This code can then have machine-specific optimisations applied to it. So, apart from the last two steps, the process is not terribly relevant to making things better for the HCS12 processor, but to do this work within gcc, I bet I'd need to get a pretty good understanding of what was going on in the earlier steps.

A better way, I believe, is to write a real optimiser in just the same way that I've done the peephole optimiser. A proper optimiser would be a lot more sophisticated, knowing the inputs, outputs and side-effects of all machine instructions and library functions. The code would be broken up into "basic blocks" (by definition, a basic block is a sequence of instructions only ever entered at the first and exited at the last). Each basic block is analysed for its combined inputs, outputs and side-effects. You can then reap two benefits -- instructions within each basic block can be analysed for better ways to achieve the combined inputs, outputs and side-effects, and whole blocks themselves can be reordered and possibly eliminated when you consider the communication between blocks. What's more, an application-wide optimiser might be able to do this between source files, not just within files like the C compiler does.
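
Just to show the basic-block carving itself is no great drama, here's roughly what it might look like -- the opcode list is illustrative, nowhere near the full HCS12 set:

Code: Select all

#!/usr/bin/perl
# Sketch of carving an assembly listing into basic blocks: a new block
# starts at every label, and a block ends after any instruction that can
# transfer control.  The branch list below is illustrative, not complete.
use strict;
use warnings;

my @lines = <>;                                    # the assembly listing

my $label  = qr/^\S+:/;                            # e.g. ".L54:" or ".LSM49:"
my $branch = qr/^\s*(?:b\w\w|lb\w\w|bra|jmp|jsr|bsr|rts|rti|tb\w+|db\w+|ib\w+)\b/i;

my @blocks = ([]);
for my $line (@lines) {
    push @blocks, [] if $line =~ $label && @{ $blocks[-1] };   # label starts a block
    push @{ $blocks[-1] }, $line;
    push @blocks, [] if $line =~ $branch;                      # control transfer ends one
}
pop @blocks if @blocks > 1 && !@{ $blocks[-1] };   # drop a trailing empty block

printf "%d basic blocks, largest is %d lines\n",
       scalar @blocks, (sort { $b <=> $a } map { scalar @$_ } @blocks)[0];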

But the job is BIG and error prone. And it's very hard to justify when it's only to be used for a fairly small amount of code. Pragmatically, it would be far easier to compile to assembly language using CodeWarrior just once, then (the hard part) walk through what it's generated, work out what the hell it's done, comment and tweak appropriately, and maintain it henceforth as assembly language code. A bit ethically dubious on the demo licence, of course.

A peephole optimiser is a much simpler beast. It still has the basic blocks worked out (demarcated by the ";======" lines in the output), but only so that it can apply its pre-cooked rules to recognised patterns within blocks, not across them. But, simple as it is, its benefits can be substantial. Perhaps the BRSET/BRCLR problems might be able to be tackled this way. Don't know. What I think it does give is a more palatable alternative to inline assembly language.

Have fun,

Rob.
jsmcortina
Site Admin
Posts: 39619
Joined: Mon May 03, 2004 1:34 am
Location: Birmingham, UK
Contact:

Re: Experimental peephole optimiser

Post by jsmcortina »

robs wrote:Where can I find hc11md?
It is part of the m68hc11/12 port to gcc

Code: Select all

[jsm@jsm2 m68hc11]$ pwd
/usr/src/jsm/gcc-3.3.6/gcc/config/m68hc11
[jsm@jsm2 m68hc11]$ ls -l
total 580
-rw-r--r-- 1 jsm users  20886 2010-11-09 20:47 larith.asm
-rw-rw-rw- 1 jsm users   9538 2010-11-09 20:47 ldivmod.asm
-rw-r--r-- 1 jsm users 155544 2010-11-09 20:47 m68hc11.c
-rw-r--r-- 1 jsm users   3151 2010-11-09 20:47 m68hc11-crt0.S
-rw-r--r-- 1 jsm users  75080 2010-11-09 20:47 m68hc11.h
-rw-r--r-- 1 jsm users 231156 2010-11-09 20:47 m68hc11.md
-rw-r--r-- 1 jsm users   6997 2010-11-09 20:47 m68hc11-protos.h
-rw-r--r-- 1 jsm users   2382 2010-11-09 20:47 m68hc12.h
-rw-rw-rw- 1 jsm users   2690 2010-11-09 20:47 m9s12x.h
-rw-r--r-- 1 jsm users   2703 2010-11-09 20:47 t-m68hc11-gas
[jsm@jsm2 m68hc11]$ 
m68hc11.md is the "machine description" file.

James
I can repair or upgrade Megasquirts in UK. http://www.jamesmurrayengineering.co.uk

My Success story: http://www.msextra.com/forums/viewtopic ... 04&t=34277
MSEXTRA documentation at: http://www.msextra.com/doc/index.html
New users, please read the "Forum Help Page".
robs
Master MS/Extra'er
Posts: 564
Joined: Sun Jan 17, 2010 4:26 pm
Location: Sydney, Australia

Re: Experimental peephole optimiser

Post by robs »

jsmcortina wrote: It is part of the m68hc11/12 port to gcc
A cunning way of getting me to load the compiler source!

While the file is well commented (proving me wrong) I don't see anything in there about brset/brclr. I think I got the right file -- loaded the generic gcc-3.3.6 source, then applied the gcc patch from your latest compiler release (s12x-tools-source-20101109.zip).

Still, while it looks interesting, there is plenty to get to grips with before getting useful work done. There's the lisp-like language to begin with, but I think the difficult part would be getting on top of the machine-independent intermediate representation generated by the compiler front-end. I Googled around for a while, but didn't find much documenting this, or other details on the gentle art of writing your own code generator. A few promising leads, but all that I looked at led to dead ends.

I think I'm firming up on the view that writing an optimiser outside gcc is the better way, restricting gcc work to fixing actual errors, not optimising. A few reasons:
  • gcc's intermediate representation is a long way from the hcs12 architecture. The first code generation phase will necessarily generate pretty poor code. So everything hinges on the subsequent optimisation steps.
  • gcc's optimisation phase looks like it tries to be a bit "machine independent" too, with standard functions to help define peephole windows and such like. This is good for people who are going to be writing backends for several processors, but it probably makes it a good deal harder to do very machine-specific tweaks
  • writing a post-processor for gcc's assembly language output avoids all this indirectness. You're just taking hcs12 code as input, and trying to generate better hcs12 code as output. To someone who understands hcs12 code, it couldn't be much more direct.
What are the drawbacks of the Perl script approach I've put forward here? Native Windows developers mightn't like having to use Perl, but I'm guessing Cygwin is used for Windows, in which case Perl is no big deal. Other problems?

Have fun,

Rob.
muythaibxr
Site Admin
Posts: 8230
Joined: Thu Oct 14, 2004 12:48 pm

Re: Experimental peephole optimiser

Post by muythaibxr »

The main problem with Perl is for the few people who make their own mods and don't install Cygwin. It is another dependency.

Ken
hassmaschine
Super MS/Extra'er
Posts: 1331
Joined: Mon May 21, 2007 8:36 am

Re: Experimental peephole optimiser

Post by hassmaschine »

People on Windows could use a web-based script that anyone can get to.

I love perl :D

(mainly because it's the only language I've ever been able to pick up and use)
robs
Master MS/Extra'er
Posts: 564
Joined: Sun Jan 17, 2010 4:26 pm
Location: Sydney, Australia

Re: Experimental peephole optimiser

Post by robs »

I've been poking around in the code a bit more, and I think I see what the optimisation with BRCLR/BRSET might be.

I spotted a bunch of rather stupid stuff (most frequently in ms2_extra_ign_in.S) where it was doing:

Code: Select all

    LDAB    flagbyte0
    ANDA    #0
    ANDB    #32
    TBEQ    D,.L51
The instantly obvious silliness is that you can easily code ANDA #0 as CLRA and save a byte. There are 601 instances of ANDA #0 in the code, and the executable size shrinks by 608 bytes. The extra 7 bytes are because the shrinkage occasionally allows short branches where there used to be long branches. So, despite CLRA and ANDA #0 both running for one cycle, the optimisation would actually be slightly quicker (very slightly).

But, when you look at the larger snippet above, a much better optimisation is available. All four instructions can be replaced with:

Code: Select all

BRCLR flagbyte0,32,.L51
That's 5 bytes of code replacing 10, and 5 cycles replacing 8. Good win.

Unfortunately, a peephole optimiser can't be free to do this because the original code left D containing either 0 or 32 and the new version leaves D with whatever it used to have still in it. There's just the chance (and I bet it's rare with this compiler) that subsequent code is relying on D's contents and, without this dependency information, you can't know whether or not it can be reduced this much. I wonder if the optimiser problem with BRCLR was hitting a similar snag?

The best peephole optimisation I see for this one is:

Code: Select all

    LDAB    flagbyte0
    CLRA                ; one byte shorter
    ANDB    #32
    BEQ    .L51         ; previous instruction has set the flag
Because the ANDB affects the Z flag, we hardly need the TBEQ form testing the register again. TBEQ can jump twice as far as BEQ, but the label always looks to be nearby. So we have 6/8 cycles depending on whether the branch is taken vs. 8 cycles in the original, and 8 bytes vs 10 in the original. A smaller win, but not bad for not much work.
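
In peepopt terms the rule might look roughly like this. A sketch only, but note the nice property of this particular rewrite: D ends up holding exactly what it did before, so the "is D relied on later?" worry doesn't arise:

Code: Select all

# Sketch of the TBEQ rewrite: ANDA #0 becomes CLRA, and since ANDB has
# already set the Z flag, TBEQ D,label can become a plain BEQ.
# D still ends up as 0 or the mask, just as before, so this one is safe.
sub rewrite_tbeq {
    my @lines = @_;
    for (my $i = 0; $i + 3 < @lines; $i++) {
        next unless $lines[$i]     =~ /^\s*ldab\s+(\S+)\s*$/i;      my $var  = $1;
        next unless $lines[$i + 1] =~ /^\s*anda\s+#0\s*$/i;
        next unless $lines[$i + 2] =~ /^\s*andb\s+#(\d+)\s*$/i;     my $mask = $1;
        next unless $lines[$i + 3] =~ /^\s*tbeq\s+d,(\S+)\s*$/i;    my $dest = $1;

        splice @lines, $i, 4,
            "\tldab\t$var\n",
            "\tCLRA\n",                  # one byte shorter than ANDA #0
            "\tandb\t#$mask\n",
            "\tBEQ\t$dest\n";            # ANDB has set Z for us already
    }
    return @lines;
}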

I've already added the simple blanket replace of ANDA #0 with CLRA to the peephole optimiser and will probably pop in the above too. I'm interested to know whether there's any enthusiasm for this approach of doing the optimisations outside the compiler itself. The existing Perl script is just an ugly prototype to get an idea of the problem. Before I get to making it more solid, I'd like to know what the chances are of the work being adopted.

Of course the optimiser should be an optional step in the build process, and it'd be pretty easy to substitute "cp" for "peepopt" in order to bypass it.

Have fun,

Rob.
jsmcortina
Site Admin
Posts: 39619
Joined: Mon May 03, 2004 1:34 am
Location: Birmingham, UK
Contact:

Re: Experimental peephole optimiser

Post by jsmcortina »

Over the last few days I have done some work on gcc on that exact issue. The brset/brclr patterns were actually disabled in the first place, which rather explains why they were NEVER used, but with them enabled the optimiser is still choosing other methods. I actually made an improvement by disabling the TBEQ instructions (as I was sick of seeing them) and then gcc finally does a compare simply on B instead of D. (This is in m68hc11.md)

So that code snippet you posted will often become

Code: Select all

    LDAB    flagbyte0
    ANDB    #32
    BEQ    .L51
I posted about this on the gcc/hc11 yahoo group. http://tech.groups.yahoo.com/group/gnu- ... ssage/9563

While I was at it, I did add some S12X code additions (CLRW, ALSX, ROLX etc.) which get used. However, where you started on this topic is still not fully handled. The CLRW addition allows 1000 instances to be emitted for setting words to zero, but where multiple variables are initialised to a non-zero value, MOVW # is used instead of LDD #, STD, STD...
I haven't explored that greatly as, after two days of compiler coding, I saved a mere 10us (at best) on the mainloop time and was a little deflated.

I like the idea of using just 4 soft-registers as that would appear to reduce the register swapping. I tried bumping up the 'cost' of the soft-regs but that had no effect.

PS. Don't waste your time trying to use a later gcc, that is a GREAT way to burn up days of time and get nowhere. Only use 3.3.6 with the s12x/scz patches.

James
I can repair or upgrade Megasquirts in UK. http://www.jamesmurrayengineering.co.uk

My Success story: http://www.msextra.com/forums/viewtopic ... 04&t=34277
MSEXTRA documentation at: http://www.msextra.com/doc/index.html
New users, please read the "Forum Help Page".
jsmcortina
Site Admin
Posts: 39619
Joined: Mon May 03, 2004 1:34 am
Location: Birmingham, UK
Contact:

Re: Experimental peephole optimiser

Post by jsmcortina »

I compared the cycles of that code segment I posted earlier and the brset/clr we think we want. Much the same.
An 8-bit compare and branch is 3 or 5 cycles depending on whether the branch is taken or not. brset/clr are 4 cycles. So not worth the effort to go beyond removing tbeq(HI)

James
I can repair or upgrade Megasquirts in UK. http://www.jamesmurrayengineering.co.uk

My Success story: http://www.msextra.com/forums/viewtopic ... 04&t=34277
MSEXTRA documentation at: http://www.msextra.com/doc/index.html
New users, please read the "Forum Help Page".
robs
Master MS/Extra'er
Posts: 564
Joined: Sun Jan 17, 2010 4:26 pm
Location: Sydney, Australia

Re: Experimental peephole optimiser

Post by robs »

I added the optimisation mentioned in my last posting, so TBxx now becomes Bxx, and the ANDA #0 becomes CLRA. Here are the size results of the few optimisations I have added:

Code: Select all

opt             size    diff
standard        98095   0               default build
CLRmerge        97599   496             CLR 16-bits at a time
MOVBmerge       97743   352             MOVB->MOVW
DIVx32_16       97603   492             Special function for 16-bit divisor
CLRATBxx        97067   1028            ANDA #0->CLRA;TBxx -> Bxx
all             95729   2366            all of the above
So a 4% size reduction. Not sure about the speed improvement, and I haven't even checked that it still runs, but I'm fairly confident. The tweaks are pretty conservative (apart from the division one).

Interesting that you were also focussed on tidying up the TBxx, James. As you say, using several simpler instructions is no slower. They do cost you the D register though, which a smart optimiser might have been able to use for something else. But gcc's doesn't seem to be terribly smart. I know that I'm harping on about this, but I doubt the wisdom of trying to fix this within the gcc structure -- it feels like going two sides of a triangle when there's a more direct path to follow, or maybe like doing everything with lazy tongs. If a decent standalone optimiser were developed, there's nothing to stop it being contributed to the GNU project as a supplement for the m68hc1x compiler, so the wider community might benefit.

I'm surprised that LDD #xx;STD a1;STD a2;... is useful (for xx other than zero). None of my optimisations would address this. Obviously it's faster than a bunch of MOVWs, but is it such a common thing? BTW, I'd be happy to put in the S12X CLRW optimisation if you want to give it a go. Just say the word.

Have fun,

Rob.
jsmcortina
Site Admin
Posts: 39619
Joined: Mon May 03, 2004 1:34 am
Location: Birmingham, UK
Contact:

Re: Experimental peephole optimiser

Post by jsmcortina »

jsmcortina wrote:I compared the cycles of that code segment I posted earlier and the brset/clr we think we want.
Learned some more on this.. talking with Ken I checked the size of the constant - gcc is treating it as a short - this follows the C standard.

So (uchar & CONSTANT)
is _supposed_ to treat CONSTANT as a 16bit variable. So gcc is "correct" to do what it does - that's why the 8bit patterns were never matching.

I saved 1 byte on each call by replacing anda #0 with clra. Same one cycle though.

I've got gcc to emit CLRW and a bunch of other S12Xisms too, using some of my work and some 2008 work of a friend of mine. Using the X,Y instructions saves 300 XGDX/XGDY instructions.

It still needs a lot more testing before it is safe to use, and while the s19 is 2% smaller, the mainloop time is barely changed -- 20us if I'm lucky. A slight compiler error can totally break the code too...

James
I can repair or upgrade Megasquirts in UK. http://www.jamesmurrayengineering.co.uk

My Success story: http://www.msextra.com/forums/viewtopic ... 04&t=34277
MSEXTRA documentation at: http://www.msextra.com/doc/index.html
New users, please read the "Forum Help Page".
robs
Master MS/Extra'er
Posts: 564
Joined: Sun Jan 17, 2010 4:26 pm
Location: Sydney, Australia

Re: Experimental peephole optimiser

Post by robs »

jsmcortina wrote: Learned some more on this.. talking with Ken I checked the size of the constant - gcc is treating it as a short - this follows the C standard.

So (uchar & CONSTANT)
is _supposed_ to treat CONSTANT as a 16bit variable. So gcc is "correct" to do what it does - that's why the 8bit patterns were never matching.
It's supposed to treat it as if it were an int (16-bit), but that doesn't mean it must promote it. If it's only to be used in a conditional, the compiler is certainly at liberty to "know" that the high order 8 bits will always be zero and only work with the lower 8 bits. I commented on this in the "driver layer" thread, where an 8x8-bit multiply was promoted to 32x32-bit because of the result being cast to long. It complies with the standard to do so, but the standard doesn't require it. It just demands the correct long result as if it had been done in longs.
jsmcortina wrote: It still needs a lot more testing before it is safe to use, and while the s19 is 2% smaller, the mainloop time is barely changed -- 20us if I'm lucky. A slight compiler error can totally break the code too...
Not a great reward! There might be more to be gained tracking down where gcc has generated 32-bit operations unnecessarily. And I agree with your worries about compiler errors.

Have fun,

Rob.
jsmcortina
Site Admin
Posts: 39619
Joined: Mon May 03, 2004 1:34 am
Location: Birmingham, UK
Contact:

Re: Experimental peephole optimiser

Post by jsmcortina »

robs wrote:Not a great reward! There might be more to be gained tracking down where gcc has generated 32-bit operations unnecessarily. And I agree with your worries about compiler errors.
There is some reward in seeing less crappy looking code though.. and that I've won a small battle with mastering a little of gcc. It really is heavy going.

The concern about compiler errors is very real. The build I had yesterday would build the MS3 code ok and it ran without crashing but the PW was wrong. This afternoon I had a look at it and it was very nasty indeed. The new S12X 16bit subtract (subhi3_x) was somehow incorrect and was giving a negative result.

The piece of code was (10000 - tmp1). gcc was actually doing (tmp1 - 10000) and so the calculation was horribly broken. The result was subtle though as it created a MAT/CLT blend temperature off by 60degF and influenced Gair by 23%.

Yesterday I found another nasty but tiny bug.
On S12 for a 32bit add you'll typically see:

Code: Select all

ldd  ...
ldx ...
add ...
xgdx
adcb ....
adca ...
xgdx
std ...
stx ...
Well, on S12X it can be:

Code: Select all

ldd  ...
ldx ...
add ...
adex ...
std ...
stx ...
fine...

But one piece of code was checking the Z bit after the adex, assuming it would be set if X was zero. But adex doesn't work like that. The Z bit is only set if ALL bits in the addition chain are zero (i.e. D == X == 0). That caused the PW to be stuck on 65000us.

One area where there are some more potential savings is with movb/movw - S12X supports a whole ton of modes and gcc does use them if encouraged. I've tried it and they are emitted, but I haven't written the binutils support yet to assemble them.

I'm actually softening on my "it must be in gcc" stance too. Implementing the improvement within gcc IS the right way to do it, but given how extremely difficult I find it to work in, the external approach has its merits too.

The 16 bit constant is likely a tough one to resolve as it may well be in core gcc not the m68hc11 port. "Fixing" that within gcc is likely the only place to do it. The current code will sometimes stash a copy of the promoted or anded value into Y, or use D later. So it needs to be recognised as an 8bit value before that stage of processing.

James
I can repair or upgrade Megasquirts in UK. http://www.jamesmurrayengineering.co.uk

My Success story: http://www.msextra.com/forums/viewtopic ... 04&t=34277
MSEXTRA documentation at: http://www.msextra.com/doc/index.html
New users, please read the "Forum Help Page".
robs
Master MS/Extra'er
Posts: 564
Joined: Sun Jan 17, 2010 4:26 pm
Location: Sydney, Australia

Re: Experimental peephole optimiser

Post by robs »

jsmcortina wrote: There is some reward in seeing less crappy looking code though.. and that I've won a small battle with mastering a little of gcc. It really is heavy going.
Less crappy looking code is a definite plus. Mastering some of gcc's internals -- maybe, but perhaps not much more than a satisfying intellectual challenge, a bit more useful than getting a crossword out.

The headaches with subtle and not so subtle code generation problems were the sort of thing that I was concerned with. It's hard enough getting the code right without having the vagaries of working out which gcc lever to pull to make it happen. And I'm glad to hear that you're warming to the idea of taking a more direct route. Not that it's easy, just less weird.
jsmcortina wrote: The 16 bit constant is likely a tough one to resolve as it may well be in core gcc not the m68hc11 port. "Fixing" that within gcc is likely the only place to do it. The current code will sometimes stash a copy of the promoted or anded value into Y, or use D later. So it needs to be recognised as an 8bit value before that stage of processing.
I think it may be attacked by other means -- at least sometimes. The 32/16-bit division was pretty easy to identify by code pattern (look for the top two bytes of the divisor being initialised with CLR instructions), then shanghai the inefficient 32/32-bit call and replace it with a 32/16-bit one. A similar easily-found pattern might be there.

Also, while I've been saying (truly enough) that a full data flow analysis based optimiser is a big job, a basic-block decomposition and analysis of outputs and inputs to these blocks is not all that hard and would allow us to see if (say) D's value is depended upon later, or can be ignored. IOW, we could probably get some reasonable gains fairly quickly; if we got carried away we could spend our lives working on the optimiser.

Have fun,

Rob.