Posted By
indi on 2006-01-05 18:49:35
| Xply routines...
Hiya...
IM tryng to get the fastest XPLY I can, and currently my best is 131 cylces best, 163 worst case for an 8x8 bit multiply. I do have a faster one, but it uses a nibble lookup table, which I dont want to waste memory on.
Anyone got a better one? Heres mine in case someone can improve it.... (no illegal ops though please)
xp macro ;xply macro lsr xplyc ; 5 bcc !here ; 2 (worst case, since 3 if branch taken actually saves 4 cycles) clc ; 2 adc xplyd ; 3 !here ror a ; 2 ror xplyb ; 5 = 19 endm
;***************************************************************************** ; XPLY routine ; ; enter : c = xp 1 ; d = xp 2 ; exit : A = ans hi / b = ans lo ; ; used : a ; ; Totals including RTS (not including jsr to here) ; 163 WORST case ; 131 BEST case ; ;***************************************************************************** XPLY: lda #$00 sta xply xp xp ;xp = lsr memC xp ; bcc !here xp ; clc xp ;!here adc memD xp ; ror a xp ; ror memB xp rts
|
|
Posted By
indi on 2006-01-05 18:50:04
| Re: Xply routines...
hell.... is there a way to post "pre-formated" text?
|
|
Posted By
Csabo on 2006-01-05 19:38:00
| Re: Xply routines...
Sure, just put a PRE HTML tag around your code. I'll have to think about the multiply routine though.
|
|
Posted By
filker on 2006-01-07 02:24:57
| Re: Xply routines...
C=Hacking #9 discusses some multiply algorithms. Discussed methods use these functions:
f(x) = x*x/4 f(a+b) - f(a-b) = a*b
and
log(x*y) = log(x) + log(y) log_b(x) = y <=> b^y = x
I know you don't want tables, but article's writer Stephen Judd says it takes only 24 to 36 cycles. I think worth to try...
|
|
Posted By
indi on 2006-01-07 05:47:05
| Re: Xply routines...
I dont quite follow this.... but form what I can see, it only returns an 8bit number. I need a full 8*8=16bit number back.
My old nibble table was 8k and took 92 cycles(fixed) due to the most part of having to deal with 16 bit answers.
I think this can be modified to make a 16bit answer, but it would probably take around the same time as mine...Im not sure.
I have pinched their "noirmal" xply, as it seems to be a bit quicker...same idea... different code
The new one comes out at 134 worst, and 98 best... which is much better.
|
|
Posted By
Rachy on 2006-01-07 06:03:21
| Re: Xply routines...
I never thought over, but once we talked with Pigmy about this topic, and he mentioned a lookup table version. You can split up each byte into two 4 bits, which means 4*4=8 bit result. This result fits into a byte, so the initial two 4 bits can be used for indexing on a multiplication lookup table. After multiply the lower and the higher 4 bits you can add them and voa'la.
I don't know if it would be faster, but it is possible IMO, it worth a try.
|
|
Posted By
indi on 2006-01-07 06:13:45
| Re: Xply routines...
BTW.... the listing in C= hacking is wrong...
This is whats listed.....
MULT LDA #$00 LDY #$09 ]LOOP LSR ROR ACC BCC MULT2 CLC ADC AUX MULT2 DEY BNE ]LOOP STA EXT
This is what it should be...
MULT LDA #$00 LDY #$09 CLC ]LOOP ROR ROR ACC BCC MULT2 CLC ADC AUX MULT2 DEY BNE ]LOOP STA EXT
He probably unrolls the loop in is code, and has only the first one as a LSR to avoid the CLC
|
|
Posted By
indi on 2006-01-07 06:38:18
| Re: Xply routines...
I coded in a nibble * nibble.... but before I even try and load and shift the answers, its hitting 98 cycles... so that won't work
|
|
Posted By
filker on 2006-01-07 07:22:42
| Re: Xply routines...
Hello again,
6502.org has some math code. And one of the examples implements S. Judd's method for 16bit output. It seems it's no more than 50 cycles, but has 2K of table.
Related links are:
http://www.6502.org/source/ http://www.6502.org/source/integers/fastmult.htm
I hope this helps.
bye.
Ilker Ficicilar http://cbm.ficicilar.name.tr/
--
|
|
Posted By
Rachy on 2006-01-08 05:41:17
| Re: Xply routines...
@MikeD I don't exactly get what is wrong with LSR in that code...
|
|
Posted By
indi on 2006-01-08 10:40:51
| Re: Xply routines...
Because its in a loop, all following shifts must include the carry from the add at the bottom.
I tried several tests to make sure... honest
If you unroll the loop, then using a LSR instead on a ROR saves you 2 clocks because you dont need the CLC. Chances are his own routine is expanded (since it saves 44 clocks!), but its not nearly as easy to read, so he probably compacted it and forgot about the first LSR.
|
|
Posted By
Rachy on 2006-01-09 02:13:33
| Re: Xply routines...
Ah, sorry! I missed the loop label... Now it is perfectly clear.
|
|
Posted By
gerliczer on 2006-01-18 06:58:32
| Re: Xply routines...
Maybe it is possible to decrease the number of atomic add+rotate operations if you BIT test the value of c and skip one or two with the appropriate braches (BNE and BVC (?)). But this testing makes the worst case even worse.
|
|
Posted By
bubis on 2006-01-18 07:28:43
| Re: Xply routines...
Hi,
I don't think the first routine posted here is correct. "adc xplyd" may set the carry and the next ror moves it into the lowest bit of A, which is not what u want.
Regards, Andras
|
|
Posted By
indi on 2006-01-18 09:39:23
| Re: Xply routines...
Actually... ROR shifts the bit into the TOP of "A". Also, since A starts out as 0, the resulting carry from this is always 0.
C -> [76543210] -> C
This routine was lifted directly from the source to Blood Money where it was used in circle calculations. It was then used in xeo3 for exactly the same thing (along with some clipping code) before I posted here to try and speed things up. It works fine - honest. Id say try it and see, but the second one posted is much quicker... so use that instead -particually when you unroll it.
Actually...I just thought.... if you pre-decriment "AUX" you can ignore the CLC (since it will now add 1 extra ALWAYS) and that'll save 18 cycles more - worst case. But adds 5 best case... Still its more consistant then... And if its a constant your multiplying by, you can "pre-decrement" as well, saving you those 5 cycles...
It was 134 worst, and 98 best... Which now becomes 121 worst, 103 best. Thats "almost" a whole scan line saved from the first version posted.
|
|
Posted By
bubis on 2006-01-20 08:33:05
| Re: Xply routines...
Ups, sorry. You are right!
|
|
Posted By
indi on 2006-01-25 14:09:58
| Re: Xply routines...
Thought I'd post thew final routine.... To give you an idea of how important this was, my path system uses a "circle" command which allows complex paths from a single command. Each takes 2 xply's. 6 baddies*2 muls=36 scan lines spent in just the xply routine.
Now however, I've cut that by 12 scan lines... quite a saving... So here it is. You may notice I've added the "dec xplyc" to offset clearing the carry all the time, and I've removed the first "lsr a" since it was shifting nothing, and moved it to "lsr xplyd", which saved another 2.
For a non-table driver routine, its pretty good! Also, if you add a one over table, you can use this as a divide.
;***************************************************************************** ; XPLY routine ; ; enter : xplyc = xp 1 ; xplyd = xp 2 ; exit : A = ans hi / xplyc = ans lo ; ; used : a ; ; Totals including RTS (not including jsr to here) ; 163 WORST case - OLD ; 131 BEST case - OLD ; ; 119 worst ; 101 best ;***************************************************************************** XPLY: LDA #$00 ; 2 dec xplyc ; 5 pre-decrement to offset a SET carry when adding "xplyc"
lsr xplyd ; 5 BCC !skip1 ; 2 ADC xplyc ; 3 =10 !skip1 ROR ; 2 ROR xplyd ; 5 BCC !skip2 ; 3 = 10 (*9=90 best) ADC xplyc ; !skip2 ROR ; 2 ROR xplyd ; 5 BCC !skip3 ; 2 ADC xplyc ; 3 = 12 (*8=96 worst) !skip3 ROR ROR xplyd BCC !skip4 ADC xplyc !skip4 ROR ROR xplyd BCC !skip5 ADC xplyc !skip5 ROR ROR xplyd BCC !skip6 ADC xplyc !skip6 ROR ROR xplyd BCC !skip7 ADC xplyc !skip7 ROR ROR xplyd BCC !skip8 ADC xplyc !skip8 ROR ROR xplyd BCC !skip9 ADC xplyc !skip9 rts
|
|
Posted By
indi on 2008-07-01 15:58:59
| Re: Xply routines...
Mmm... Found a bug with this routine - after all this time! I was plugging it into my old Blood Money source and it was working for most things, EXCEPT when you multiply by a zero! (0), then its just wrong! Pooh.
So, xplyd can be zero, but xplyc can't be... The pre-decrement is killing it. This means you'd have to check it either before calling, or inside the routine itself.
Also the comment here was wrong. A = ans hi, xplyd = ans lo
|
|
Posted By
Degauss on 2008-07-01 17:25:08
| Re: Xply routines...
I feel ashamed: I was using your routine somewhen and found that "bug" too, but i didn't post about this. I thought its a fair trade: fast-code but faulty when multiplying with zero. Anyway this routine is pretty cool.
|
|