Login
Back to forumReply to this topicGo to last reply

Posted By

indi
on 2006-01-05
18:49:35
 Xply routines...

Hiya...

IM tryng to get the fastest XPLY I can, and currently my best is 131 cylces best, 163 worst case for an 8x8 bit multiply. I do have a faster one, but it uses a nibble lookup table, which I dont want to waste memory on.

Anyone got a better one?
Heres mine in case someone can improve it.... (no illegal ops though please)

xp			macro			;xply macro
lsr xplyc ; 5
bcc !here ; 2 (worst case, since 3 if branch taken actually saves 4 cycles)
clc ; 2
adc xplyd ; 3
!here ror a ; 2
ror xplyb ; 5 = 19
endm



;*****************************************************************************
; XPLY routine
;
; enter : c = xp 1 ; d = xp 2
; exit : A = ans hi / b = ans lo
;
; used : a
;
; Totals including RTS (not including jsr to here)
; 163 WORST case
; 131 BEST case
;
;*****************************************************************************
XPLY: lda #$00
sta xply
xp
xp ;xp = lsr memC
xp ; bcc !here
xp ; clc
xp ;!here adc memD
xp ; ror a
xp ; ror memB
xp
rts


Posted By

indi
on 2006-01-05
18:50:04
 Re: Xply routines...

hell.... is there a way to post "pre-formated" text?

Posted By

Csabo
on 2006-01-05
19:38:00
 Re: Xply routines...

Sure, just put a PRE HTML tag around your code. I'll have to think about the multiply routine though.

Posted By

filker
on 2006-01-07
02:24:57
 Re: Xply routines...

C=Hacking #9 discusses some multiply algorithms. Discussed methods use these functions:

f(x) = x*x/4
f(a+b) - f(a-b) = a*b

and

log(x*y) = log(x) + log(y)
log_b(x) = y <=> b^y = x

I know you don't want tables, but article's writer Stephen Judd says it takes only 24 to 36 cycles. I think worth to try...

Posted By

indi
on 2006-01-07
05:47:05
 Re: Xply routines...

I dont quite follow this.... but form what I can see, it only returns an 8bit number. I need a full 8*8=16bit number back.

My old nibble table was 8k and took 92 cycles(fixed) due to the most part of having to deal with 16 bit answers.

I think this can be modified to make a 16bit answer, but it would probably take around the same time as mine...Im not sure.

I have pinched their "noirmal" xply, as it seems to be a bit quicker...same idea... different code happy

The new one comes out at 134 worst, and 98 best... which is much better.

Posted By

Rachy
on 2006-01-07
06:03:21
 Re: Xply routines...

I never thought over, but once we talked with Pigmy about this topic, and he mentioned a lookup table version.
You can split up each byte into two 4 bits, which means 4*4=8 bit result. This result fits into a byte, so the initial two 4 bits can be used for indexing on a multiplication lookup table. After multiply the lower and the higher 4 bits you can add them and voa'la.

I don't know if it would be faster, but it is possible IMO, it worth a try.

Posted By

indi
on 2006-01-07
06:13:45
 Re: Xply routines...

BTW.... the listing in C= hacking is wrong...


This is whats listed.....

MULT LDA #$00
LDY #$09
]LOOP LSR
ROR ACC
BCC MULT2
CLC
ADC AUX
MULT2 DEY
BNE ]LOOP
STA EXT


This is what it should be...

MULT LDA #$00
LDY #$09
CLC
]LOOP ROR
ROR ACC
BCC MULT2
CLC
ADC AUX
MULT2 DEY
BNE ]LOOP
STA EXT


He probably unrolls the loop in is code, and has only the first one as a LSR to avoid the CLC

happy

Posted By

indi
on 2006-01-07
06:38:18
 Re: Xply routines...

I coded in a nibble * nibble.... but before I even try and load and shift the answers, its hitting 98 cycles... so that won't work happy

Posted By

filker
on 2006-01-07
07:22:42
 Re: Xply routines...

Hello again,

6502.org has some math code. And one of the examples implements S. Judd's method for 16bit output.
It seems it's no more than 50 cycles, but has 2K of table.

Related links are:

http://www.6502.org/source/
http://www.6502.org/source/integers/fastmult.htm

I hope this helps.

bye.

Ilker Ficicilar
http://cbm.ficicilar.name.tr/

--

Posted By

Rachy
on 2006-01-08
05:41:17
 Re: Xply routines...

@MikeD
I don't exactly get what is wrong with LSR in that code...

Posted By

indi
on 2006-01-08
10:40:51
 Re: Xply routines...

Because its in a loop, all following shifts must include the carry from the add at the bottom.

I tried several tests to make sure... honest happy

If you unroll the loop, then using a LSR instead on a ROR saves you 2 clocks because you dont need the CLC. Chances are his own routine is expanded (since it saves 44 clocks!), but its not nearly as easy to read, so he probably compacted it and forgot about the first LSR.

Posted By

Rachy
on 2006-01-09
02:13:33
 Re: Xply routines...

Ah, sorry! I missed the loop label... Now it is perfectly clear. happy

Posted By

gerliczer
on 2006-01-18
06:58:32
 Re: Xply routines...

Maybe it is possible to decrease the number of atomic add+rotate operations if you BIT test the value of c and skip one or two with the appropriate braches (BNE and BVC (?)). But this testing makes the worst case even worse.

Posted By

bubis
on 2006-01-18
07:28:43
 Re: Xply routines...

Hi,

I don't think the first routine posted here is correct. "adc xplyd" may set the carry and the next ror moves it into the lowest bit of A, which is not what u want.

Regards,
Andras

Posted By

indi
on 2006-01-18
09:39:23
 Re: Xply routines...

Actually... ROR shifts the bit into the TOP of "A". Also, since A starts out as 0, the resulting carry from this is always 0.

C -> [76543210] -> C

This routine was lifted directly from the source to Blood Money where it was used in circle calculations. It was then used in xeo3 for exactly the same thing (along with some clipping code) before I posted here to try and speed things up. It works fine - honest. Id say try it and see, but the second one posted is much quicker... so use that instead -particually when you unroll it. happy



Actually...I just thought.... if you pre-decriment "AUX" you can ignore the CLC (since it will now add 1 extra ALWAYS) and that'll save 18 cycles more - worst case. But adds 5 best case... Still its more consistant then... And if its a constant your multiplying by, you can "pre-decrement" as well, saving you those 5 cycles...

It was 134 worst, and 98 best... Which now becomes 121 worst, 103 best. Thats "almost" a whole scan line saved from the first version posted.

Posted By

bubis
on 2006-01-20
08:33:05
 Re: Xply routines...

Ups, sorry. You are right!

Posted By

indi
on 2006-01-25
14:09:58
 Re: Xply routines...

Thought I'd post thew final routine.... To give you an idea of how important this was, my path system uses a "circle" command which allows complex paths from a single command. Each takes 2 xply's. 6 baddies*2 muls=36 scan lines spent in just the xply routine.

Now however, I've cut that by 12 scan lines... quite a saving... So here it is. You may notice I've added the "dec xplyc" to offset clearing the carry all the time, and I've removed the first "lsr a" since it was shifting nothing, and moved it to "lsr xplyd", which saved another 2.

For a non-table driver routine, its pretty good! Also, if you add a one over table, you can use this as a divide.


;*****************************************************************************
; XPLY routine
;
; enter : xplyc = xp 1 ; xplyd = xp 2
; exit : A = ans hi / xplyc = ans lo
;
; used : a
;
; Totals including RTS (not including jsr to here)
; 163 WORST case - OLD
; 131 BEST case - OLD
;
; 119 worst
; 101 best
;*****************************************************************************
XPLY:
LDA #$00 ; 2
dec xplyc ; 5 pre-decrement to offset a SET carry when adding "xplyc"

lsr xplyd ; 5
BCC !skip1 ; 2
ADC xplyc ; 3 =10
!skip1
ROR ; 2
ROR xplyd ; 5
BCC !skip2 ; 3 = 10 (*9=90 best)
ADC xplyc ;
!skip2
ROR ; 2
ROR xplyd ; 5
BCC !skip3 ; 2
ADC xplyc ; 3 = 12 (*8=96 worst)
!skip3
ROR
ROR xplyd
BCC !skip4
ADC xplyc
!skip4
ROR
ROR xplyd
BCC !skip5
ADC xplyc
!skip5
ROR
ROR xplyd
BCC !skip6
ADC xplyc
!skip6
ROR
ROR xplyd
BCC !skip7
ADC xplyc
!skip7
ROR
ROR xplyd
BCC !skip8
ADC xplyc
!skip8
ROR
ROR xplyd
BCC !skip9
ADC xplyc
!skip9
rts


Posted By

indi
on 2008-07-01
15:58:59
 Re: Xply routines...

Mmm... Found a bug with this routine - after all this time! I was plugging it into my old Blood Money source and it was working for most things, EXCEPT when you multiply by a zero! (0), then its just wrong! Pooh.

So, xplyd can be zero, but xplyc can't be... The pre-decrement is killing it. This means you'd have to check it either before calling, or inside the routine itself.

Also the comment here was wrong. A = ans hi, xplyd = ans lo

happy

Posted By

Degauss
on 2008-07-01
17:25:08
 Re: Xply routines...

I feel ashamed: I was using your routine somewhen and found that "bug" too, but i didn't post about this. I thought its a fair trade: fast-code but faulty when multiplying with zero. Anyway this routine is pretty cool.



Back to topReply to this topic


Copyright © Plus/4 World Team, 2001-2025. Support Plus/4 World on Patreon