Forum (#44895) - Plus/4 World





Posted By

Mad
on 2022-05-31
15:29:24

Re: cc65 compressor too slow: help needed optimizing

Maybe you have to go through your own process in optimizing stuff.. Some tips for that anyways:

- Here is a list of the cycle counts of every asm instruction: http://www.oxyron.de/html/opcodes02.html
- You can use illegal opcodes
- especially lax is sometimes worth gold to me (it loads the accu and x with the same value at once); you can see in this list that some forms of lax are not stable: http://www.oxyron.de/html/opcodes02.html
- Use self-modifying code wherever you can
- Unroll loops; most of the time you can even add a lot of functionality while still getting a lot faster
- conditional branches like beq, bne and so on take 3 cycles if they jump (4 if the target is in another page) and 2 cycles if they don't. That means it's sometimes worth using the inverse branch (bne instead of beq, for instance) so that the common case falls through
- don't use jsr in timing-critical code.. jsr takes 6 cycles and rts takes 6 cycles too, which is a lot to bear (12 cycles :/).
- use tables wherever possible (precalc stuff if you have the memory); you can even use a table instead of an add if it takes fewer cycles (or if you want to keep the carry bit intact).
- there is a lot more, but mainly you have to go through your own process.. Optimizing stuff is where the fun starts, and most of the demoscene people rely solely on the speed of their routines.. So be a good optimizer yourself, it really will help you a lot on this platform.
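
To put numbers on the branch and jsr/rts tips, here is a small Python sketch that adds up the cycles of a simple indexed copy loop. It uses the documented 6502 cycle counts and ignores page-crossing penalties and any cycles the TED steals; the function names are made up for illustration.

```python
# Documented 6502 cycle counts (no page-crossing penalty assumed)
LDA_ABS_X = 4
STA_ABS_X = 5
DEX = 2
BRANCH_TAKEN = 3
BRANCH_NOT_TAKEN = 2
JSR = 6
RTS = 6

def simple_copy_cycles(n):
    """Cycles for: loop: lda src,x / sta dst,x / dex / bpl loop (n bytes)."""
    body = LDA_ABS_X + STA_ABS_X + DEX
    # The branch jumps back n-1 times and falls through once at the end.
    return n * body + (n - 1) * BRANCH_TAKEN + BRANCH_NOT_TAKEN

def call_overhead():
    """Extra cycles paid for wrapping a routine in jsr/rts."""
    return JSR + RTS

print(simple_copy_cycles(16))  # cycles for a 16-byte copy
print(call_overhead())         # cycles spent just on the call and return
```

About a third of the loop time goes into dex and the branch, which is what unrolling attacks.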

good luck with your project..

Ok, some simple optimization anyways:
This one:
"@a01:\n"
"\tlda\t_vo3_cur,x\n"
"\tsta\t_vo3_best,x\n"
"\tdex\n"
"\tbpl\t@a01\n"

Maybe it's worth to do "fragment copies" here if it's a long region to copy:

inx ; x is now the byte count; no bpl needed this way (so it can also copy more than 128 bytes)
beq .error ; copying exactly 256 bytes is not possible with this technique, I think
cpx #$10
bcc .no16loop ; maybe bmi if you have signed values in x
; the cpx leaves the carry set here, so the sbc needs no sec (with bmi it would need a sec here)
.a
"\tlda\t_vo3_cur-0-1,x\n" ; -1 because of inx
"\tsta\t_vo3_best-0-1,x\n"
"\tlda\t_vo3_cur-1-1,x\n"
"\tsta\t_vo3_best-1-1,x\n"
"\tlda\t_vo3_cur-2-1,x\n"
"\tsta\t_vo3_best-2-1,x\n"
"\tlda\t_vo3_cur-3-1,x\n"
"\tsta\t_vo3_best-3-1,x\n"
"\tlda\t_vo3_cur-4-1,x\n"
"\tsta\t_vo3_best-4-1,x\n"
"\tlda\t_vo3_cur-5-1,x\n"
"\tsta\t_vo3_best-5-1,x\n"
....
"\tlda\t_vo3_cur-15-1,x\n"
"\tsta\t_vo3_best-15-1,x\n"
txa
sbc #$10
tax
cpx #$10
bcs .a ; carry is still set for the next sbc (if you use a bmi you need a sec somewhere before the sbc)
.no16loop

cpx #$00 ; everything already copied by the 16-byte loop?
beq .no
.b
"\tlda\t_vo3_cur-0-1,x\n"
"\tsta\t_vo3_best-0-1,x\n"
dex
bne .b
.no

etc..

You save almost 16 times the:
dex
bpl @a01
which is 16 * (2+3) = 80 cycles, minus the 11 cycles for txa/sbc/tax/cpx/bcs = around 69 cycles saved for every 16-byte block
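
The same arithmetic as a rough Python sketch, in case you want to play with other unroll factors. The cycle counts are the documented 6502 ones, page crossings are ignored, and the function names are only for illustration.

```python
DEX, BRANCH_TAKEN = 2, 3
TXA, SBC_IMM, TAX, CPX_IMM = 2, 2, 2, 2
LDA_ABS_X, STA_ABS_X = 4, 5

def simple_loop_cycles(n):
    """n bytes copied with lda/sta/dex/bpl, counting every branch as taken."""
    return n * (LDA_ABS_X + STA_ABS_X + DEX + BRANCH_TAKEN)

def unrolled_block_cycles(n=16):
    """n unrolled lda/sta pairs followed by txa/sbc/tax/cpx/bcs."""
    overhead = TXA + SBC_IMM + TAX + CPX_IMM + BRANCH_TAKEN
    return n * (LDA_ABS_X + STA_ABS_X) + overhead

saved = simple_loop_cycles(16) - unrolled_block_cycles(16)
print(saved)                                      # cycles saved per 16 bytes
print(round(saved / simple_loop_cycles(16), 2))   # fraction of the original time
```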

I didn't test this code, so it may be buggy; it's just so you get the idea of unrolling at least parts of this loop..
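
Since the assembly is untested, one way to gain some confidence is to model its control flow in Python and check every length. This is only a sketch of the logic (x register, carry after cpx, the 16-byte block and the byte-wise tail), not a 6502 emulator; `fragment_copy` and its arguments are names invented here.

```python
def fragment_copy(cur, last_index):
    """Model of the partly unrolled copy; last_index is X on entry (0..255)."""
    best = [None] * len(cur)
    x = (last_index + 1) & 0xFF            # inx
    if x == 0:                             # beq .error
        raise ValueError("copying exactly 256 bytes is not supported")
    carry = x >= 0x10                      # cpx #$10
    while carry:                           # bcc .no16loop / bcs .a
        for k in range(16):                # the 16 unrolled lda/sta pairs
            best[x - k - 1] = cur[x - k - 1]
        x -= 0x10                          # txa / sbc #$10 / tax (carry was set)
        carry = x >= 0x10                  # cpx #$10
    while x != 0:                          # cpx #$00 / beq .no, then loop .b
        best[x - 1] = cur[x - 1]
        x -= 1                             # dex / bne .b
    return best

data = list(range(256))
copied = fragment_copy(data, 36)           # X = 36 means 37 bytes
print(copied[:37] == data[:37], copied[37] is None)
```

In this model every length from 1 to 255 copies exactly the intended bytes, and X = 255 ends up in the error branch, as the comment in the assembly suspects.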

Another technique would be:

loop:
"\tlda\t_vo3_cur+0\n"
"\tsta\t_vo3_best+0\n"
"\tlda\t_vo3_cur+1\n"
"\tsta\t_vo3_best+1\n"
"\tlda\t_vo3_cur+2\n"
"\tsta\t_vo3_best+2\n"
"\tlda\t_vo3_cur+3\n"
"\tsta\t_vo3_best+3\n"
"\tlda\t_vo3_cur+4\n"
"\tsta\t_vo3_best+4\n"
..
"\tlda\t_vo3_cur+255\n"
"\tsta\t_vo3_best+255\n"
rts ; if you want to copy all 256 values this byte needs to be here

put an rts into the loop at position "loop + (x+1) * (3*2)" (3 bytes for lda and 3 bytes for sta, assuming absolute addressing); you can use a table (actually two tables, lo and hi) for the multiplication
jsr loop
remove the rts from the loop again (put the lda opcode back)

(you would just have to insert an rts by self-modifying code at the point where you want the (completely unrolled) loop to stop.
But be aware that you have to remove the rts again afterwards.)
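
A hedged Python sketch of the patch arithmetic: it builds the lo/hi offset tables and checks on a byte-array stand-in for the unrolled code that patching an rts at "(x+1) * 6" stops after exactly x+1 copies. The opcodes ($AD lda abs, $8D sta abs, $60 rts) are the standard ones; the operand bytes are placeholders and all names are invented for this example.

```python
LDA_ABS, STA_ABS, RTS = 0xAD, 0x8D, 0x60
PAIR_SIZE = 3 + 3                      # lda abs (3 bytes) + sta abs (3 bytes)

# Precalculated tables for (x+1)*6, split into low and high byte.
offsets = [(x + 1) * PAIR_SIZE for x in range(256)]
offs_lo = [o & 0xFF for o in offsets]
offs_hi = [o >> 8 for o in offsets]

def build_code():
    """256 lda/sta pairs with dummy operands, followed by the final rts."""
    code = bytearray()
    for _ in range(256):
        code += bytes([LDA_ABS, 0, 0, STA_ABS, 0, 0])
    code.append(RTS)
    return code

def copies_with_patch(code, x):
    """Patch an rts, count the stores that run before it, then restore the byte."""
    pos = offs_lo[x] | (offs_hi[x] << 8)
    saved = code[pos]
    code[pos] = RTS
    pc, stores = 0, 0
    while code[pc] != RTS:             # every instruction here is 3 bytes long
        stores += code[pc] == STA_ABS
        pc += 3
    code[pos] = saved                  # remove the rts again
    return stores

code = build_code()
print(copies_with_patch(code, 9))      # X = 9 should give 10 copies
```

For x = 255 the patch position is the final rts itself, so the restore simply writes the rts back.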

Maybe this doesn't help, but you asked for some suggestions.. I don't know what your code should do, I don't even know if this loop is a problem at all. This loop needs to copy more than, let's say, 32 bytes for the optimizations to pay off.. If it only copies around 16 bytes every time, these optimizations are pointless..

For small loops you could use

lda postable,x ; this is (20-(x + 1))*(3*2)
sta .modify + 1 ; patch the branch offset
lda #$00 ; sets the zero flag, so the beq is always taken
.modify beq loop2 ; this jumps into the unrolled code section at the right position (assuming loop2 starts right after this beq), no need for a hi and lo table this way. The unrolled code must be in reverse order and end with an rts, like this:

loop2
"\tlda\t_vo3_cur+19\n"
"\tsta\t_vo3_best+19\n"
...
"\tlda\t_vo3_cur+0\n"
"\tsta\t_vo3_best+0\n"
"\tsta\t_vo3_best+0\n"
rts

but this would only copy 20 bytes at most (a branch can only jump about 127 bytes forward)
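
A small Python sketch of how postable could be precalculated, assuming 20 lda/sta pairs of 6 bytes each and loop2 starting directly after the patched beq. The names are made up; the only hard limit used is the 6502's forward branch range of 127 bytes.

```python
PAIR_SIZE = 3 + 3          # lda abs + sta abs
MAX_BYTES = 20             # number of unrolled pairs in loop2
MAX_FORWARD_BRANCH = 127   # largest positive 8-bit branch offset

def postable_entry(x):
    """Branch offset that skips the pairs for indices above x."""
    return (MAX_BYTES - (x + 1)) * PAIR_SIZE

postable = [postable_entry(x) for x in range(MAX_BYTES)]

def bytes_copied(offset):
    """Pairs that still run when entering loop2 at the given offset."""
    return MAX_BYTES - offset // PAIR_SIZE

print(postable[0], postable[MAX_BYTES - 1])    # largest and smallest offset
print(max(postable) <= MAX_FORWARD_BRANCH)     # does every entry fit a branch?
```

With 6 bytes per pair a forward branch can skip at most 127 // 6 = 21 pairs, which is why this trick only works for short copies.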