Rocksolid Light - comp.arch - Re: AMD CPU funny

Re: AMD CPU funny

<c2d29ce394401627a285cd67760b773f@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=36565&group=comp.arch#36565

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: AMD CPU funny
Date: Wed, 3 Jan 2024 17:53:53 +0000
Organization: novaBBS
Message-ID: <c2d29ce394401627a285cd67760b773f@news.novabbs.com>
References: <ulv7j4$l0o0$1@dont-email.me> <um43of$1j3jj$1@dont-email.me> <eVh*Lrxyz@news.chiark.greenend.org.uk> <3d2dcfa36ceca1ee856b9de3ab416a01@news.novabbs.com> <fVh*Dgyyz@news.chiark.greenend.org.uk> <um7jom$27ikh$3@dont-email.me> <um7o72$1bj9r$1@newsreader4.netcologne.de> <umc56l$32ucu$1@dont-email.me> <umhbqp$3um9d$1@dont-email.me> <20231227183842.00001b4f@yahoo.com> <umn15d$svun$6@dont-email.me> <kv8gt0Fca4nU6@mid.individual.net> <ump7l3$19ud7$1@dont-email.me> <umpn9d$1c212$1@dont-email.me> <bc7bd58cc9f25ca592dc5c62f415ceec@news.novabbs.com> <umq8hi$1ebuo$1@dont-email.me> <umrb9a$1m3mj$1@dont-email.me> <86cyuksoc0.fsf@linuxsc.com> <umvfc1$1qq5r$1@newsreader4.netcologne.de> <86y1d8qh1m.fsf@linuxsc.com> <un0hu1$2je9f$1@dont-email.me> <86plyjr2br.fsf@linuxsc.com> <un3n40$37ahl$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="2150890"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Spam-Checker-Version: SpamAssassin 4.0.0
X-Rslight-Site: $2y$10$LTAmNaDAGIpJwNlLwdEtIOPUkDURj0s2Cm7GZ6/MAKxfylSrrQZO2
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949

by: MitchAlsup - Wed, 3 Jan 2024 17:53 UTC

Terje Mathisen wrote:

>
> Something like

> if (n < 11) return ((0<<60)+(1<<54)+(1<<48)+(2<<42)...(21<<6)+34)>>
> (60-n*6) & 63;

unsigned char bytes[14] = { 0, 1, 1, 2, 3, 5, 8, 13, 34, 55, 89, 144, 233 };
unsigned short half[11] = {377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368 };

fib( unsigned n )
{ if( n <= 14 ) return bytes[n];
if( n <= 25 ) return half[n-14];
return fib( n-1 ) + fib( n-2 );
}

Re: AMD CPU funny

<un4bs7$3a8u6$1@dont-email.me>

copy mid

https://news.novabbs.org/devel/article-flat.php?id=36573&group=comp.arch#36573

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: terje.mathisen@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: AMD CPU funny
Date: Wed, 3 Jan 2024 20:17:26 +0100
Organization: A noiseless patient Spider
Lines: 33
Message-ID: <un4bs7$3a8u6$1@dont-email.me>
References: <ulv7j4$l0o0$1@dont-email.me> <um43of$1j3jj$1@dont-email.me>
<eVh*Lrxyz@news.chiark.greenend.org.uk>
<3d2dcfa36ceca1ee856b9de3ab416a01@news.novabbs.com>
<fVh*Dgyyz@news.chiark.greenend.org.uk> <um7jom$27ikh$3@dont-email.me>
<um7o72$1bj9r$1@newsreader4.netcologne.de> <umc56l$32ucu$1@dont-email.me>
<umhbqp$3um9d$1@dont-email.me> <20231227183842.00001b4f@yahoo.com>
<umn15d$svun$6@dont-email.me> <kv8gt0Fca4nU6@mid.individual.net>
<ump7l3$19ud7$1@dont-email.me> <umpn9d$1c212$1@dont-email.me>
<bc7bd58cc9f25ca592dc5c62f415ceec@news.novabbs.com>
<umq8hi$1ebuo$1@dont-email.me> <umrb9a$1m3mj$1@dont-email.me>
<86cyuksoc0.fsf@linuxsc.com> <umvfc1$1qq5r$1@newsreader4.netcologne.de>
<86y1d8qh1m.fsf@linuxsc.com> <un0hu1$2je9f$1@dont-email.me>
<86plyjr2br.fsf@linuxsc.com> <un3n40$37ahl$1@dont-email.me>
<c2d29ce394401627a285cd67760b773f@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable
Injection-Date: Wed, 3 Jan 2024 19:17:27 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="6bcf59e7c5b1f9833db8830a1d1f2250";
logging-data="3482566"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/zsf2N4XL2ECoup7/q7ow0e+5HZlRTu4ht+q/x6DYERQ=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Firefox/91.0 SeaMonkey/2.53.18
Cancel-Lock: sha1:hJC7dSu53hNsIApjFkTl11No2cQ=
In-Reply-To: <c2d29ce394401627a285cd67760b773f@news.novabbs.com>

by: Terje Mathisen - Wed, 3 Jan 2024 19:17 UTC

MitchAlsup wrote:
> Terje Mathisen wrote:
>
>>
>> Something like
>
>>    if (n < 11) return ((0<<60)+(1<<54)+(1<<48)+(2<<42)...(21<<6)+34)>>
>> (60-n*6) & 63;
>
>
> unsigned char bytes[14] = { 0, 1, 1, 2, 3, 5, 8, 13, 34, 55, 89, 144,
> 233 };
> unsigned short half[11] = {377, 610, 987, 1597, 2584, 4181, 6765, 10946,
> 17711, 28657, 46368 };
>
> fib( unsigned n )
> {
>     if( n <= 14 ) return bytes[n];
>     if( n <= 25 ) return half[n-14];
>     return fib( n-1 ) + fib( n-2 );
> }
Very nice except for not using the unroll-by-two, started with the two
last table entries? (Or second-to-last to fix any parity issue)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: AMD CPU funny

<8877b685b1d6ff730ab2d1d1f5ffc79e@news.novabbs.com>

copy mid

https://news.novabbs.org/devel/article-flat.php?id=36578&group=comp.arch#36578

copy link Newsgroups: comp.arch

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: AMD CPU funny
Date: Wed, 3 Jan 2024 21:05:32 +0000
Organization: novaBBS
Message-ID: <8877b685b1d6ff730ab2d1d1f5ffc79e@news.novabbs.com>
References: <ulv7j4$l0o0$1@dont-email.me> <um43of$1j3jj$1@dont-email.me> <eVh*Lrxyz@news.chiark.greenend.org.uk> <3d2dcfa36ceca1ee856b9de3ab416a01@news.novabbs.com> <fVh*Dgyyz@news.chiark.greenend.org.uk> <um7jom$27ikh$3@dont-email.me> <um7o72$1bj9r$1@newsreader4.netcologne.de> <umc56l$32ucu$1@dont-email.me> <umhbqp$3um9d$1@dont-email.me> <20231227183842.00001b4f@yahoo.com> <umn15d$svun$6@dont-email.me> <kv8gt0Fca4nU6@mid.individual.net> <ump7l3$19ud7$1@dont-email.me> <umpn9d$1c212$1@dont-email.me> <bc7bd58cc9f25ca592dc5c62f415ceec@news.novabbs.com> <umq8hi$1ebuo$1@dont-email.me> <umrb9a$1m3mj$1@dont-email.me> <86cyuksoc0.fsf@linuxsc.com> <umvfc1$1qq5r$1@newsreader4.netcologne.de> <86y1d8qh1m.fsf@linuxsc.com> <un0hu1$2je9f$1@dont-email.me> <86plyjr2br.fsf@linuxsc.com> <un3n40$37ahl$1@dont-email.me> <c2d29ce394401627a285cd67760b773f@news.novabbs.com> <un4bs7$3a8u6$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="2167993"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Spam-Checker-Version: SpamAssassin 4.0.0
X-Rslight-Site: $2y$10$D/ezz7.UmucCygAxAwmuke8PV8fmZbOd7I1znSZl3x40cPYPEkIRW
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949

by: MitchAlsup - Wed, 3 Jan 2024 21:05 UTC

unsigned char bytes[14] = { 0, 1, 1, 2, 3, 5, 8, 13, 34, 55, 89, 144, 233 };
unsigned short half[11] = {377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368 };

fib( unsigned n )
{ if( n <= 14 ) return bytes[n];
if( n <= 25 ) return half[n-14];
uint64_t a = half[9], b = half[10];
for( n -= 25; n; n-- )
{
b += a;
a += b;
}
return a;
}

Re: AMD CPU funny

<86ttntpxus.fsf@linuxsc.com>

copy mid

https://news.novabbs.org/devel/article-flat.php?id=36589&group=comp.arch#36589

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: tr.17687@z991.linuxsc.com (Tim Rentsch)
Newsgroups: comp.arch
Subject: Re: AMD CPU funny
Date: Wed, 03 Jan 2024 19:43:23 -0800
Organization: A noiseless patient Spider
Lines: 29
Message-ID: <86ttntpxus.fsf@linuxsc.com>
References: <ulv7j4$l0o0$1@dont-email.me> <3d2dcfa36ceca1ee856b9de3ab416a01@news.novabbs.com> <fVh*Dgyyz@news.chiark.greenend.org.uk> <um7jom$27ikh$3@dont-email.me> <um7o72$1bj9r$1@newsreader4.netcologne.de> <umc56l$32ucu$1@dont-email.me> <umhbqp$3um9d$1@dont-email.me> <20231227183842.00001b4f@yahoo.com> <umn15d$svun$6@dont-email.me> <kv8gt0Fca4nU6@mid.individual.net> <ump7l3$19ud7$1@dont-email.me> <umpn9d$1c212$1@dont-email.me> <bc7bd58cc9f25ca592dc5c62f415ceec@news.novabbs.com> <umq8hi$1ebuo$1@dont-email.me> <umrb9a$1m3mj$1@dont-email.me> <86cyuksoc0.fsf@linuxsc.com> <umvfc1$1qq5r$1@newsreader4.netcologne.de> <86y1d8qh1m.fsf@linuxsc.com> <un0hu1$2je9f$1@dont-email.me> <86plyjr2br.fsf@linuxsc.com> <cd3668bf81038e750419de48521e3ab9@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Info: dont-email.me; posting-host="672bd22e7372e7cd28625c54c459a735";
logging-data="3726108"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+m0450xd7ZTlMDL+SLpgkejpEi2YEa5ws="
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux)
Cancel-Lock: sha1:MIwW+BAxVRdFOZUZf1PzEjxS5dA=
sha1:j84b4hdjypGGgOjdjjJNZdltFKE=

by: Tim Rentsch - Thu, 4 Jan 2024 03:43 UTC

mitchalsup@aol.com (MitchAlsup) writes:

> Tim Rentsch wrote:
>
>> Terje Mathisen <terje.mathisen@tmsw.no> writes:
>
>>>
>
>>>> Incidentally your code produces wrong values, as
>>>> fib(0) is 0 and fib(1) is 1.
>>>
>>> So return n for (n<2)?
>>
>> In situations like this one, usually I would favor something
>> more like this:
>>
>> if( n < 8 ) return 0xd8532110u >> n*4 & 15;
>
> Which is::
>
> CMP R2,R1,#8
> PLT R2,TTT
> LEA R2,[IP,R1<<2,0xF00000000-.]
> SLL R1,0xd853110,R2
> RET
>
> ....

Probably not exactly that, but thank you.

Re: AMD CPU funny

<86plyhpvcw.fsf@linuxsc.com>

copy mid

https://news.novabbs.org/devel/article-flat.php?id=36591&group=comp.arch#36591

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!nntp.terraraq.uk!news.gegeweb.eu!gegeweb.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: tr.17687@z991.linuxsc.com (Tim Rentsch)
Newsgroups: comp.arch
Subject: Re: AMD CPU funny
Date: Wed, 03 Jan 2024 20:37:19 -0800
Organization: A noiseless patient Spider
Lines: 88
Message-ID: <86plyhpvcw.fsf@linuxsc.com>
References: <ulv7j4$l0o0$1@dont-email.me> <3d2dcfa36ceca1ee856b9de3ab416a01@news.novabbs.com> <fVh*Dgyyz@news.chiark.greenend.org.uk> <um7jom$27ikh$3@dont-email.me> <um7o72$1bj9r$1@newsreader4.netcologne.de> <umc56l$32ucu$1@dont-email.me> <umhbqp$3um9d$1@dont-email.me> <20231227183842.00001b4f@yahoo.com> <umn15d$svun$6@dont-email.me> <kv8gt0Fca4nU6@mid.individual.net> <ump7l3$19ud7$1@dont-email.me> <umpn9d$1c212$1@dont-email.me> <bc7bd58cc9f25ca592dc5c62f415ceec@news.novabbs.com> <umq8hi$1ebuo$1@dont-email.me> <umrb9a$1m3mj$1@dont-email.me> <86cyuksoc0.fsf@linuxsc.com> <umvfc1$1qq5r$1@newsreader4.netcologne.de> <86y1d8qh1m.fsf@linuxsc.com> <un0hu1$2je9f$1@dont-email.me> <86plyjr2br.fsf@linuxsc.com> <un3n40$37ahl$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Info: dont-email.me; posting-host="672bd22e7372e7cd28625c54c459a735";
logging-data="3737573"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/c0+KEbHc8zfZqovbx5TLEiQTi3nvuedw="
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux)
Cancel-Lock: sha1:fymjNDIw9vAaSeguvZeSwRQfiZg=
sha1:8nfL/j/mkDilmPgEBqFySv5M8Ik=

by: Tim Rentsch - Thu, 4 Jan 2024 04:37 UTC

Terje Mathisen <terje.mathisen@tmsw.no> writes:

> Tim Rentsch wrote:
>
>> Terje Mathisen <terje.mathisen@tmsw.no> writes:
>>
>>> Tim Rentsch wrote:
>>>
>>>> Thomas Koenig <tkoenig@netcologne.de> writes:
>>
>> [...]
>>
>>>>> unsigned long int fib (int n)
>>>>> {
>>>>> unsigned long f1, f2, fn;
>>>>>
>>>>> if (n < 2)
>>>>> return 1;
>>>>>
>>>>> f1 = 1;
>>>>> f2 = 1;
>>>>> while (n >= 2) {
>>>>> fn = f2 + f1;
>>>>> f1 = f2;
>>>>> f2 = fn;
>>>>> n--;
>>>>> }
>>>>> return fn;
>>>>> }
>>
>> [...]
>>
>>>> Incidentally your code produces wrong values, as
>>>> fib(0) is 0 and fib(1) is 1.
>>>
>>> So return n for (n<2)?
>>
>> In situations like this one, usually I would favor something
>> more like this:
>>
>> if( n < 8 ) return 0xd8532110u >> n*4 & 15;
>
> Nice!
>
> Could be extended a little bit more with a lookup table, or to 16
> values with a SIMD in-register lookup operation.

No reason to provide more values. It turns out these initial 8
values are just what the doctor ordered as a lead in to my latest
iterative version. In earlier versions putting in this special
case hurt more than it helped.

> Using a 64-bit
> constant allows 11 values to fit inside 6-bit sub-containers, as long
> as they are stored in reverse order so that the bottom value (0) can
> survive without the top two bits.
>
> Something like
>
> if (n < 11) return ((0<<60)+(1<<54)+(1<<48)+(2<<42)...(21<<6)+34)>>
> (60-n*6) & 63;

Probably more clever than useful. If I were doing this most likely
I would write the constant in octal:

0001010203051015254267u >> (60-6*n) & 63

Note by the way that you left off the value 55, which puts the
values 21 and 34 in the wrong places.

> Anyway, this is code golfing and not a realistic scenario where you
> would definitely use a direct lookup table for all the possible
> values.

A full lookup table is too doggone big. Also part of the point
of using the constant shift method is not to access data memory,
as that has other costs (and which are unlikely to show up in
speed measurements). The computational form is plenty fast
enough (worst case time between 7 and 8 nanoseconds in my tests).

This technique of packing several small values into a constant
and shifting to get a specific result is useful in lots of
situations. For example, if we are writing a function to test
for primality, it can start with

if( n < 64 ) return 0240404242024042424254 >>n &1;

for a fast test on small integers.

Re: AMD CPU funny

<86le95pv6k.fsf@linuxsc.com>

copy mid

https://news.novabbs.org/devel/article-flat.php?id=36592&group=comp.arch#36592

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!newsfeed.endofthelinebbs.com!news.hispagatos.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: tr.17687@z991.linuxsc.com (Tim Rentsch)
Newsgroups: comp.arch
Subject: Re: AMD CPU funny
Date: Wed, 03 Jan 2024 20:41:07 -0800
Organization: A noiseless patient Spider
Lines: 21
Message-ID: <86le95pv6k.fsf@linuxsc.com>
References: <ulv7j4$l0o0$1@dont-email.me> <fVh*Dgyyz@news.chiark.greenend.org.uk> <um7jom$27ikh$3@dont-email.me> <um7o72$1bj9r$1@newsreader4.netcologne.de> <umc56l$32ucu$1@dont-email.me> <umhbqp$3um9d$1@dont-email.me> <20231227183842.00001b4f@yahoo.com> <umn15d$svun$6@dont-email.me> <kv8gt0Fca4nU6@mid.individual.net> <ump7l3$19ud7$1@dont-email.me> <umpn9d$1c212$1@dont-email.me> <bc7bd58cc9f25ca592dc5c62f415ceec@news.novabbs.com> <umq8hi$1ebuo$1@dont-email.me> <umrb9a$1m3mj$1@dont-email.me> <86cyuksoc0.fsf@linuxsc.com> <umvfc1$1qq5r$1@newsreader4.netcologne.de> <86y1d8qh1m.fsf@linuxsc.com> <un0hu1$2je9f$1@dont-email.me> <86plyjr2br.fsf@linuxsc.com> <un3n40$37ahl$1@dont-email.me> <c2d29ce394401627a285cd67760b773f@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Info: dont-email.me; posting-host="672bd22e7372e7cd28625c54c459a735";
logging-data="3737573"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18VQfgCQsyM8BQOQXjqZ4v8t0VRaJNC/zI="
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux)
Cancel-Lock: sha1:ijLd8m1dGTggV3e4LanEVEpGlI4=
sha1:LaB0k0X3kXIPtHtuQrbmzMy5Lz4=

by: Tim Rentsch - Thu, 4 Jan 2024 04:41 UTC

mitchalsup@aol.com (MitchAlsup) writes:

> Terje Mathisen wrote:
>
>> Something like
>>
>> if (n < 11) return
>> ((0<<60)+(1<<54)+(1<<48)+(2<<42)...(21<<6)+34)>> (60-n*6) & 63;
>
> unsigned char bytes[14] = { 0, 1, 1, 2, 3, 5, 8, 13, 34, 55, 89, 144, 233 };
> unsigned short half[11] = {377, 610, 987, 1597, 2584, 4181, 6765,
> 10946, 17711, 28657, 46368 };
>
> fib( unsigned n )
> {
> if( n <= 14 ) return bytes[n];
> if( n <= 25 ) return half[n-14];
> return fib( n-1 ) + fib( n-2 );
> }

Several bugs in this code (and exponential blowup when n > 25).

Re: AMD CPU funny

<86h6jtpurv.fsf@linuxsc.com>

copy mid

https://news.novabbs.org/devel/article-flat.php?id=36593&group=comp.arch#36593

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: tr.17687@z991.linuxsc.com (Tim Rentsch)
Newsgroups: comp.arch
Subject: Re: AMD CPU funny
Date: Wed, 03 Jan 2024 20:49:56 -0800
Organization: A noiseless patient Spider
Lines: 22
Message-ID: <86h6jtpurv.fsf@linuxsc.com>
References: <ulv7j4$l0o0$1@dont-email.me> <um7o72$1bj9r$1@newsreader4.netcologne.de> <umc56l$32ucu$1@dont-email.me> <umhbqp$3um9d$1@dont-email.me> <20231227183842.00001b4f@yahoo.com> <umn15d$svun$6@dont-email.me> <kv8gt0Fca4nU6@mid.individual.net> <ump7l3$19ud7$1@dont-email.me> <umpn9d$1c212$1@dont-email.me> <bc7bd58cc9f25ca592dc5c62f415ceec@news.novabbs.com> <umq8hi$1ebuo$1@dont-email.me> <umrb9a$1m3mj$1@dont-email.me> <86cyuksoc0.fsf@linuxsc.com> <umvfc1$1qq5r$1@newsreader4.netcologne.de> <86y1d8qh1m.fsf@linuxsc.com> <un0hu1$2je9f$1@dont-email.me> <86plyjr2br.fsf@linuxsc.com> <un3n40$37ahl$1@dont-email.me> <c2d29ce394401627a285cd67760b773f@news.novabbs.com> <un4bs7$3a8u6$1@dont-email.me> <8877b685b1d6ff730ab2d1d1f5ffc79e@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Info: dont-email.me; posting-host="672bd22e7372e7cd28625c54c459a735";
logging-data="3737573"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+MnzeKQo6PbV5sYqjBTZltKLthShTRvwY="
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux)
Cancel-Lock: sha1:W9ZBE0Zu5YtStBMCBYHOSH8K1zY=
sha1:yStnBJ/LBsmmyAGeAN6147rgb2c=

by: Tim Rentsch - Thu, 4 Jan 2024 04:49 UTC

mitchalsup@aol.com (MitchAlsup) writes:

> unsigned char bytes[14] = { 0, 1, 1, 2, 3, 5, 8, 13, 34, 55, 89, 144, 233 };
> unsigned short half[11] = {377, 610, 987, 1597, 2584, 4181, 6765,
> 10946, 17711, 28657, 46368 };
>
> fib( unsigned n )
> {
> if( n <= 14 ) return bytes[n];
> if( n <= 25 ) return half[n-14];
> uint64_t a = half[9], b = half[10];
> for( n -= 25; n; n-- )
> {
> b += a;
> a += b;
> }
> return a;
> }

You need to divide n-25 by 2, and adjust for an even or odd
value before the divide.

Re: AMD CPU funny

<un5l09$3it6l$1@dont-email.me>

copy mid

https://news.novabbs.org/devel/article-flat.php?id=36596&group=comp.arch#36596

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: terje.mathisen@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: AMD CPU funny
Date: Thu, 4 Jan 2024 07:59:21 +0100
Organization: A noiseless patient Spider
Lines: 109
Message-ID: <un5l09$3it6l$1@dont-email.me>
References: <ulv7j4$l0o0$1@dont-email.me>
<3d2dcfa36ceca1ee856b9de3ab416a01@news.novabbs.com>
<fVh*Dgyyz@news.chiark.greenend.org.uk> <um7jom$27ikh$3@dont-email.me>
<um7o72$1bj9r$1@newsreader4.netcologne.de> <umc56l$32ucu$1@dont-email.me>
<umhbqp$3um9d$1@dont-email.me> <20231227183842.00001b4f@yahoo.com>
<umn15d$svun$6@dont-email.me> <kv8gt0Fca4nU6@mid.individual.net>
<ump7l3$19ud7$1@dont-email.me> <umpn9d$1c212$1@dont-email.me>
<bc7bd58cc9f25ca592dc5c62f415ceec@news.novabbs.com>
<umq8hi$1ebuo$1@dont-email.me> <umrb9a$1m3mj$1@dont-email.me>
<86cyuksoc0.fsf@linuxsc.com> <umvfc1$1qq5r$1@newsreader4.netcologne.de>
<86y1d8qh1m.fsf@linuxsc.com> <un0hu1$2je9f$1@dont-email.me>
<86plyjr2br.fsf@linuxsc.com> <un3n40$37ahl$1@dont-email.me>
<86plyhpvcw.fsf@linuxsc.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 4 Jan 2024 06:59:21 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="f0ddfc33a47954e9db22546cc8017656";
logging-data="3765461"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19H9sLQK5M966QHOziNi1+JAHJ3IdQYwXU0k7+NUYCe+A=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Firefox/91.0 SeaMonkey/2.53.18
Cancel-Lock: sha1:969b4g1VzUj00WMt3Lmu+ST5qpo=
In-Reply-To: <86plyhpvcw.fsf@linuxsc.com>

by: Terje Mathisen - Thu, 4 Jan 2024 06:59 UTC

Tim Rentsch wrote:
> Terje Mathisen <terje.mathisen@tmsw.no> writes:
>
>> Tim Rentsch wrote:
>>
>>> Terje Mathisen <terje.mathisen@tmsw.no> writes:
>>>
>>>> Tim Rentsch wrote:
>>>>
>>>>> Thomas Koenig <tkoenig@netcologne.de> writes:
>>>
>>> [...]
>>>
>>>>>> unsigned long int fib (int n)
>>>>>> {
>>>>>> unsigned long f1, f2, fn;
>>>>>>
>>>>>> if (n < 2)
>>>>>> return 1;
>>>>>>
>>>>>> f1 = 1;
>>>>>> f2 = 1;
>>>>>> while (n >= 2) {
>>>>>> fn = f2 + f1;
>>>>>> f1 = f2;
>>>>>> f2 = fn;
>>>>>> n--;
>>>>>> }
>>>>>> return fn;
>>>>>> }
>>>
>>> [...]
>>>
>>>>> Incidentally your code produces wrong values, as
>>>>> fib(0) is 0 and fib(1) is 1.
>>>>
>>>> So return n for (n<2)?
>>>
>>> In situations like this one, usually I would favor something
>>> more like this:
>>>
>>> if( n < 8 ) return 0xd8532110u >> n*4 & 15;
>>
>> Nice!
>>
>> Could be extended a little bit more with a lookup table, or to 16
>> values with a SIMD in-register lookup operation.
>
> No reason to provide more values. It turns out these initial 8
> values are just what the doctor ordered as a lead in to my latest
> iterative version. In earlier versions putting in this special
> case hurt more than it helped.
>
>> Using a 64-bit
>> constant allows 11 values to fit inside 6-bit sub-containers, as long
>> as they are stored in reverse order so that the bottom value (0) can
>> survive without the top two bits.
>>
>> Something like
>>
>> if (n < 11) return ((0<<60)+(1<<54)+(1<<48)+(2<<42)...(21<<6)+34)>>
>> (60-n*6) & 63;
>
> Probably more clever than useful. If I were doing this most likely
> I would write the constant in octal:
>
> 0001010203051015254267u >> (60-6*n) & 63
>
> Note by the way that you left off the value 55, which puts the
> values 21 and 34 in the wrong places.
>
>> Anyway, this is code golfing and not a realistic scenario where you
>> would definitely use a direct lookup table for all the possible
>> values.
>
> A full lookup table is too doggone big. Also part of the point
> of using the constant shift method is not to access data memory,
> as that has other costs (and which are unlikely to show up in
> speed measurements). The computational form is plenty fast
> enough (worst case time between 7 and 8 nanoseconds in my tests).

I agree, this is indeed "fast enough".
>
> This technique of packing several small values into a constant
> and shifting to get a specific result is useful in lots of
> situations. For example, if we are writing a function to test
> for primality, it can start with
>
> if( n < 64 ) return 0240404242024042424254 >>n &1;
>
> for a fast test on small integers.
>

I have been using this particular trick since at least the 386, so
around 1990. :-)

It did not make the same kind of sense when registers only had 16 bits
and a table lookup was about the same cost as a shift & mask.

Your 64-bit bitmap here is (inversely) related to the modulo 30 packing
I'm using in my sieve, where the 8 possible prime locations in each
block of 30 numbers will fit in a byte.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: AMD CPU funny

<un5lfs$3iv05$1@dont-email.me>

copy mid

https://news.novabbs.org/devel/article-flat.php?id=36597&group=comp.arch#36597

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: terje.mathisen@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: AMD CPU funny
Date: Thu, 4 Jan 2024 08:07:40 +0100
Organization: A noiseless patient Spider
Lines: 38
Message-ID: <un5lfs$3iv05$1@dont-email.me>
References: <ulv7j4$l0o0$1@dont-email.me> <um43of$1j3jj$1@dont-email.me>
<eVh*Lrxyz@news.chiark.greenend.org.uk>
<3d2dcfa36ceca1ee856b9de3ab416a01@news.novabbs.com>
<fVh*Dgyyz@news.chiark.greenend.org.uk> <um7jom$27ikh$3@dont-email.me>
<um7o72$1bj9r$1@newsreader4.netcologne.de> <umc56l$32ucu$1@dont-email.me>
<umhbqp$3um9d$1@dont-email.me> <20231227183842.00001b4f@yahoo.com>
<umn15d$svun$6@dont-email.me> <kv8gt0Fca4nU6@mid.individual.net>
<ump7l3$19ud7$1@dont-email.me> <umpn9d$1c212$1@dont-email.me>
<bc7bd58cc9f25ca592dc5c62f415ceec@news.novabbs.com>
<umq8hi$1ebuo$1@dont-email.me> <umrb9a$1m3mj$1@dont-email.me>
<86cyuksoc0.fsf@linuxsc.com> <umvfc1$1qq5r$1@newsreader4.netcologne.de>
<86y1d8qh1m.fsf@linuxsc.com> <un0hu1$2je9f$1@dont-email.me>
<86plyjr2br.fsf@linuxsc.com> <un3n40$37ahl$1@dont-email.me>
<c2d29ce394401627a285cd67760b773f@news.novabbs.com>
<un4bs7$3a8u6$1@dont-email.me>
<8877b685b1d6ff730ab2d1d1f5ffc79e@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable
Injection-Date: Thu, 4 Jan 2024 07:07:40 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="f0ddfc33a47954e9db22546cc8017656";
logging-data="3767301"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18aEYVORBr6qOKR+23nBH7w2Bd5fNaHqsMae7sYIN3BdA=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Firefox/91.0 SeaMonkey/2.53.18
Cancel-Lock: sha1:Zylu6Vz3ccqcwjwWaRn5aOPL9Tg=
In-Reply-To: <8877b685b1d6ff730ab2d1d1f5ffc79e@news.novabbs.com>

by: Terje Mathisen - Thu, 4 Jan 2024 07:07 UTC

MitchAlsup wrote:
> unsigned char bytes[14] = { 0, 1, 1, 2, 3, 5, 8, 13, 34, 55, 89, 144,
> 233 };
> unsigned short half[11] = {377, 610, 987, 1597, 2584, 4181, 6765, 10946,
> 17711, 28657, 46368 };
>
> fib( unsigned n )
> {
>     if( n <= 14 ) return bytes[n];
>     if( n <= 25 ) return half[n-14];
>     uint64_t a = half[9], b = half[10];
>     for( n -= 25; n; n-- )
>     {
>          b += a;
>          a += b;
>     }
>     return a;
> }

As Tim wrote, you need something like

uint64_t a = half[8+(n&1)], b = half[9+(n&1)];
for (n = (n-24)>>1; n; n--) {}

to start the unrolled loop with the correct parity, although I might
have an off-by-one error in the (a,b) initializer here...

Since Tim is using a 4-way unroll, I believe he would be loading 4
values from a table of at least 7 entries to handle loop alignement.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: AMD CPU funny

<un5sl2$3jouh$1@dont-email.me>

copy mid

https://news.novabbs.org/devel/article-flat.php?id=36601&group=comp.arch#36601

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: terje.mathisen@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: AMD CPU funny
Date: Thu, 4 Jan 2024 10:09:53 +0100
Organization: A noiseless patient Spider
Lines: 39
Message-ID: <un5sl2$3jouh$1@dont-email.me>
References: <ulv7j4$l0o0$1@dont-email.me> <um43of$1j3jj$1@dont-email.me>
<eVh*Lrxyz@news.chiark.greenend.org.uk>
<3d2dcfa36ceca1ee856b9de3ab416a01@news.novabbs.com>
<fVh*Dgyyz@news.chiark.greenend.org.uk> <um7jom$27ikh$3@dont-email.me>
<um7o72$1bj9r$1@newsreader4.netcologne.de> <umc56l$32ucu$1@dont-email.me>
<umhbqp$3um9d$1@dont-email.me> <20231227183842.00001b4f@yahoo.com>
<umn15d$svun$6@dont-email.me> <kv8gt0Fca4nU6@mid.individual.net>
<ump7l3$19ud7$1@dont-email.me> <umpn9d$1c212$1@dont-email.me>
<bc7bd58cc9f25ca592dc5c62f415ceec@news.novabbs.com>
<umq8hi$1ebuo$1@dont-email.me> <umrb9a$1m3mj$1@dont-email.me>
<86cyuksoc0.fsf@linuxsc.com> <umvfc1$1qq5r$1@newsreader4.netcologne.de>
<86y1d8qh1m.fsf@linuxsc.com> <un0hu1$2je9f$1@dont-email.me>
<86plyjr2br.fsf@linuxsc.com> <un3n40$37ahl$1@dont-email.me>
<c2d29ce394401627a285cd67760b773f@news.novabbs.com>
<un4bs7$3a8u6$1@dont-email.me>
<8877b685b1d6ff730ab2d1d1f5ffc79e@news.novabbs.com>
<un5lfs$3iv05$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 4 Jan 2024 09:09:55 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="16135d3076a7f0c078b92e0b6b660678";
logging-data="3793873"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/d6l2CMFwPL2zLWd2BADd/IbMv8sDb0zi8EAWhPs7/IQ=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Firefox/91.0 SeaMonkey/2.53.18
Cancel-Lock: sha1:MRFyk5orOyB0sns9jB37DWD1Two=
In-Reply-To: <un5lfs$3iv05$1@dont-email.me>

by: Terje Mathisen - Thu, 4 Jan 2024 09:09 UTC

Terje Mathisen wrote:
>
> Since Tim is using a 4-way unroll, I believe he would be loading 4
> values from a table of at least 7 entries to handle loop alignement.
>

I took another look at this, and it looks pretty nice to use 4 starting
entries (a..d in increasing order), then a two-cycle recurrence:

a = c+d
b = c+2*d
c = a+b
d = a+2*b

which every LEA-capable compiler around could convert into

lea rax,[rcx+rdx]
lea rbx,[rcx+2*rdx]

lea rcx,[rax+rbx]
lea rdx,[rax+2*rbx]

The pairs of instructions can run at the same time, so this can run in
two cycles as long as back-to-back LEA can run in subsequent cycles.

To run even faster than this I seem to need actual multiplications, with
small fibonacci numbers as the multipliers:

f(n+1) = f(n)+f(n-1)
f(n+2) = 2*f(n)+f(n-1)
f(n+3) = 3*f(n)+2*f(n-1)
f(n+4) = 5*f(n)+3*(f(n-1)
....
f(n+8) = 34*(f(n)+21*f(n-1)

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: AMD CPU funny

<un5vjo$3k52r$1@dont-email.me>

copy mid

https://news.novabbs.org/devel/article-flat.php?id=36603&group=comp.arch#36603

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: terje.mathisen@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: AMD CPU funny
Date: Thu, 4 Jan 2024 11:00:23 +0100
Organization: A noiseless patient Spider
Lines: 96
Message-ID: <un5vjo$3k52r$1@dont-email.me>
References: <ulv7j4$l0o0$1@dont-email.me>
<e2bb329488167ada6ece600c8f28d3ed@news.novabbs.com>
<um43of$1j3jj$1@dont-email.me> <eVh*Lrxyz@news.chiark.greenend.org.uk>
<3d2dcfa36ceca1ee856b9de3ab416a01@news.novabbs.com>
<fVh*Dgyyz@news.chiark.greenend.org.uk> <um7jom$27ikh$3@dont-email.me>
<um7o72$1bj9r$1@newsreader4.netcologne.de> <umc56l$32ucu$1@dont-email.me>
<umhbqp$3um9d$1@dont-email.me> <20231227183842.00001b4f@yahoo.com>
<umn15d$svun$6@dont-email.me> <kv8gt0Fca4nU6@mid.individual.net>
<ump7l3$19ud7$1@dont-email.me> <umpn9d$1c212$1@dont-email.me>
<bc7bd58cc9f25ca592dc5c62f415ceec@news.novabbs.com>
<umq8hi$1ebuo$1@dont-email.me> <umrb9a$1m3mj$1@dont-email.me>
<ums29d$1p6cc$1@dont-email.me> <8634vgrwc2.fsf@linuxsc.com>
<un0hcp$2jca3$1@dont-email.me> <861qazqtxu.fsf@linuxsc.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 4 Jan 2024 10:00:24 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="16135d3076a7f0c078b92e0b6b660678";
logging-data="3806299"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+tXI9ukLTjFnD6EZggT2zM1SKu4nHiZMA8cBviqhy1PQ=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Firefox/91.0 SeaMonkey/2.53.18
Cancel-Lock: sha1:gMVhKPpUfZUBAkeJYDwz42w0ayg=
In-Reply-To: <861qazqtxu.fsf@linuxsc.com>

by: Terje Mathisen - Thu, 4 Jan 2024 10:00 UTC

Tim Rentsch wrote:
> Terje Mathisen <terje.mathisen@tmsw.no> writes:
>
>> Tim Rentsch wrote:
>>
>>> Daniel James <daniel@me.invalid> writes:
>>>
>>>> On 31/12/2023 09:12, Pancho wrote:
>>>>
>>>>> In fact, given that fib grows exponentially, so we only ever need to
>>>>> calculate it for small values of n, the only ?bad? solution is the
>>>>> naive double recursive one.
>>>>
>>>> Well, I might disagree.
>>>>
>>>> Given that we only ever need to calculate it for small values of n (as
>>>> you say, and if that's true) its efficiency doesn't matter. If you
>>>> need to compute it often enough that efficiency might become relevant
>>>> just put the results in a table and use that. Given that fib goes over
>>>> 10^12 after about 60 terms it needn't be a very big table.
>>>
>>> When the size of the table is more than ten times the size of
>>> the code, and the code is reasonably fast, using a table is a
>>> poor choice. Fibonacci functions don't have to be slow.
>>>
>>>> The naive doubly recursive solution has the huge merit of being
>>>> human-readable.
>>>
>>> But unacceptably slow.
>>>
>>>> Unlike, say:
>>>>
>>>> double fib( int n )
>>>> {
>>>> return (pow( (1+sqrt(5.0))/2, n+1 )
>>>> - pow( (1-sqrt(5.0))/2, n+1 ))
>>>> / sqrt(5.0);
>>>> }
>>>>
>>>> ... which is great for fibs of big values, but not fast for small ones.
>>>
>>> Unacceptable for fibs of big values, because the values computed
>>> are wrong.
>>
>> So either you move to quad or arbitrary precision (which will of
>> course be somewhat slow,
>
> To me, what makes floating point unattractive in this case is
> great care must be taken to ensure correct values are produced.
> Add to that the increased time cost, and using floating point
> just looks like a lot more downside than upside.
>
>> or you use a hybrid technique, with lookup
>> tables for smallish n, iterative loop for somewhat larger and your
>> O(log(n)) approach only for very large n?
>
> Oh, you remembered I wrote an O(log(n)) fibonacci function. That
> is nice. :)
>
> Based on what I know now, my inclination is to use the constant-shift
> technique (as seen in another post) for n < 8, the O(log(n)) method
> for results that will be more than 64 bits, and a small-constant-linear
> method in between those two regions. (Incidentally, the intermediate
> method is something I had already written, and might be seen as a
> generalization of your two-fold speedup code. You might want to take
> a look at enhancing the two-fold speedup idea so it does at most one of
> those, and then after that gets a four-fold speedup. The best
> performing code I have for 64 bits goes two steps farther, so all the
> iterations after the "startup transient" are 16-fold speedup.)

Do you actually get 16 results per clock cycle?

I don't see _any_ way to do that...
>
>> Working with 256-bit uints, each addition will take 4-8 clock cycles,
>> and it overflows well before n reaches 400, right? In that case a
>> worst case iterative solution is less than 3000 clock cycles or about
>> a microsecond away.
>
> By test, my O(log(n)) code runs in 25-30 nanoseconds for 128-bit

That is impressive!

> results. At a guess, 256-bit results would run about 5 to 10
> times slower than that.
>
> By the way, your two-fold speedup code runs about 1.6 times as
> fast as using simple iteration.
>
That's in the expected ballpark.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: AMD CPU funny

<86v889nul2.fsf@linuxsc.com>

copy mid

https://news.novabbs.org/devel/article-flat.php?id=36608&group=comp.arch#36608

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: tr.17687@z991.linuxsc.com (Tim Rentsch)
Newsgroups: comp.arch
Subject: Re: AMD CPU funny
Date: Thu, 04 Jan 2024 04:36:57 -0800
Organization: A noiseless patient Spider
Lines: 22
Message-ID: <86v889nul2.fsf@linuxsc.com>
References: <ulv7j4$l0o0$1@dont-email.me> <um7jom$27ikh$3@dont-email.me> <um7o72$1bj9r$1@newsreader4.netcologne.de> <umc56l$32ucu$1@dont-email.me> <umhbqp$3um9d$1@dont-email.me> <20231227183842.00001b4f@yahoo.com> <umn15d$svun$6@dont-email.me> <kv8gt0Fca4nU6@mid.individual.net> <ump7l3$19ud7$1@dont-email.me> <umpn9d$1c212$1@dont-email.me> <bc7bd58cc9f25ca592dc5c62f415ceec@news.novabbs.com> <umq8hi$1ebuo$1@dont-email.me> <umrb9a$1m3mj$1@dont-email.me> <86cyuksoc0.fsf@linuxsc.com> <umvfc1$1qq5r$1@newsreader4.netcologne.de> <86y1d8qh1m.fsf@linuxsc.com> <un0hu1$2je9f$1@dont-email.me> <86plyjr2br.fsf@linuxsc.com> <un3n40$37ahl$1@dont-email.me> <86plyhpvcw.fsf@linuxsc.com> <un5l09$3it6l$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Info: dont-email.me; posting-host="672bd22e7372e7cd28625c54c459a735";
logging-data="3841422"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+dZ99M08+bOnWe+MTj2P5L3RfV+INJ48M="
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux)
Cancel-Lock: sha1:SZIGIvoPX5gkB5ssjz5nWqqzNmg=
sha1:Bik5QT14r5bM3buhzVSUZedSW4U=

by: Tim Rentsch - Thu, 4 Jan 2024 12:36 UTC

Terje Mathisen <terje.mathisen@tmsw.no> writes:

> Tim Rentsch wrote:
[...]
>> This technique of packing several small values into a constant
>> and shifting to get a specific result is useful in lots of
>> situations. For example, if we are writing a function to test
>> for primality, it can start with
>>
>> if( n < 64 ) return 0240404242024042424254 >>n &1;
>>
>> for a fast test on small integers.
>
> [...]
>
> Your 64-bit bitmap here is (inversely) related to the modulo 30
> packing I'm using in my sieve, where the 8 possible prime
> locations in each block of 30 numbers will fit in a byte.

I wrote one of these about 20 years, and recently was motivated
to write one again. I'm curious - have you tried running your
sieve for all primes less than a trillion (1e12)?

Re: AMD CPU funny

<86r0ixnt6q.fsf@linuxsc.com>

copy mid

https://news.novabbs.org/devel/article-flat.php?id=36609&group=comp.arch#36609

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: tr.17687@z991.linuxsc.com (Tim Rentsch)
Newsgroups: comp.arch
Subject: Re: AMD CPU funny
Date: Thu, 04 Jan 2024 05:07:09 -0800
Organization: A noiseless patient Spider
Lines: 50
Message-ID: <86r0ixnt6q.fsf@linuxsc.com>
References: <ulv7j4$l0o0$1@dont-email.me> <umhbqp$3um9d$1@dont-email.me> <20231227183842.00001b4f@yahoo.com> <umn15d$svun$6@dont-email.me> <kv8gt0Fca4nU6@mid.individual.net> <ump7l3$19ud7$1@dont-email.me> <umpn9d$1c212$1@dont-email.me> <bc7bd58cc9f25ca592dc5c62f415ceec@news.novabbs.com> <umq8hi$1ebuo$1@dont-email.me> <umrb9a$1m3mj$1@dont-email.me> <86cyuksoc0.fsf@linuxsc.com> <umvfc1$1qq5r$1@newsreader4.netcologne.de> <86y1d8qh1m.fsf@linuxsc.com> <un0hu1$2je9f$1@dont-email.me> <86plyjr2br.fsf@linuxsc.com> <un3n40$37ahl$1@dont-email.me> <c2d29ce394401627a285cd67760b773f@news.novabbs.com> <un4bs7$3a8u6$1@dont-email.me> <8877b685b1d6ff730ab2d1d1f5ffc79e@news.novabbs.com> <un5lfs$3iv05$1@dont-email.me> <un5sl2$3jouh$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Info: dont-email.me; posting-host="672bd22e7372e7cd28625c54c459a735";
logging-data="3854445"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18tvnHL9gguHh6GCZZrPxI9bsJpAIfm1KY="
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux)
Cancel-Lock: sha1:Dn+CjcGUXwI5VicoaxlR1FNjqPc=
sha1:gWfI+TCuPgyv2W3Vqj3Gu6CslK0=

by: Tim Rentsch - Thu, 4 Jan 2024 13:07 UTC

Terje Mathisen <terje.mathisen@tmsw.no> writes:

> Terje Mathisen wrote:
>
>> Since Tim is using a 4-way unroll, I believe he would be loading 4
>> values from a table of at least 7 entries to handle loop alignement.
>
> I took another look at this, and it looks pretty nice to use 4
> starting entries (a..d in increasing order), then a two-cycle
> recurrence:
>
> a = c+d
> b = c+2*d
> c = a+b
> d = a+2*b
>
> which every LEA-capable compiler around could convert into
>
> lea rax,[rcx+rdx]
> lea rbx,[rcx+2*rdx]
>
> lea rcx,[rax+rbx]
> lea rdx,[rax+2*rbx]
>
> The pairs of instructions can run at the same time, so this can run in
> two cycles as long as back-to-back LEA can run in subsequent cycles.
>
> To run even faster than this I seem to need actual multiplications,
> with small fibonacci numbers as the multipliers:
>
> f(n+1) = f(n)+f(n-1)
> f(n+2) = 2*f(n)+f(n-1)
> f(n+3) = 3*f(n)+2*f(n-1)
> f(n+4) = 5*f(n)+3*(f(n-1)
> ...
> f(n+8) = 34*(f(n)+21*f(n-1)

As much as I am tempted to post the actual code, let me start
with something more like a hint, which I fully expect you can
carry forward to do the whole deal.

(forward by 0) a b
(forward by 1) b a+b
(forward by 2) a+b a+2b
(forward by 3) a+2b 2a+3b
(forward by 4) 2a+3b 3a+5b
(forward by 5) 3a+5b 5a+8b
(forward by 6) 5a+8b 8a+13b

and so forth.

Re: AMD CPU funny

<86mstlnsve.fsf@linuxsc.com>

copy mid

https://news.novabbs.org/devel/article-flat.php?id=36611&group=comp.arch#36611

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: tr.17687@z991.linuxsc.com (Tim Rentsch)
Newsgroups: comp.arch
Subject: Re: AMD CPU funny
Date: Thu, 04 Jan 2024 05:13:57 -0800
Organization: A noiseless patient Spider
Lines: 15
Message-ID: <86mstlnsve.fsf@linuxsc.com>
References: <ulv7j4$l0o0$1@dont-email.me> <eVh*Lrxyz@news.chiark.greenend.org.uk> <3d2dcfa36ceca1ee856b9de3ab416a01@news.novabbs.com> <fVh*Dgyyz@news.chiark.greenend.org.uk> <um7jom$27ikh$3@dont-email.me> <um7o72$1bj9r$1@newsreader4.netcologne.de> <umc56l$32ucu$1@dont-email.me> <umhbqp$3um9d$1@dont-email.me> <20231227183842.00001b4f@yahoo.com> <umn15d$svun$6@dont-email.me> <kv8gt0Fca4nU6@mid.individual.net> <ump7l3$19ud7$1@dont-email.me> <umpn9d$1c212$1@dont-email.me> <bc7bd58cc9f25ca592dc5c62f415ceec@news.novabbs.com> <umq8hi$1ebuo$1@dont-email.me> <umrb9a$1m3mj$1@dont-email.me> <ums29d$1p6cc$1@dont-email.me> <8634vgrwc2.fsf@linuxsc.com> <un0hcp$2jca3$1@dont-email.me> <861qazqtxu.fsf@linuxsc.com> <un5vjo$3k52r$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Info: dont-email.me; posting-host="672bd22e7372e7cd28625c54c459a735";
logging-data="3854445"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/h0Wz9m0/9rvjpqnN56E+Z1LsKYU79St4="
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux)
Cancel-Lock: sha1:gbrmUDWmg21A1x4TqlmegwSjOQ4=
sha1:PJygt5YTH+pukVPV273dNQIHm7o=

by: Tim Rentsch - Thu, 4 Jan 2024 13:13 UTC

Terje Mathisen <terje.mathisen@tmsw.no> writes:

> Tim Rentsch wrote:

Just one short followup here for now...

>> By test, my O(log(n)) code runs in 25-30 nanoseconds for 128-bit
>
> That is impressive!

I took the O(log(n)) code and put it in python (with arbirary
precision integers). Running fib( 1000000 ) took well under
a second.

Exponentials are insanely slow, and O(log(n)) is insanely fast...

Re: AMD CPU funny

<un6o7u$3nkq4$1@dont-email.me>

copy mid

https://news.novabbs.org/devel/article-flat.php?id=36617&group=comp.arch#36617

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: terje.mathisen@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: AMD CPU funny
Date: Thu, 4 Jan 2024 18:00:45 +0100
Organization: A noiseless patient Spider
Lines: 33
Message-ID: <un6o7u$3nkq4$1@dont-email.me>
References: <ulv7j4$l0o0$1@dont-email.me> <um7jom$27ikh$3@dont-email.me>
<um7o72$1bj9r$1@newsreader4.netcologne.de> <umc56l$32ucu$1@dont-email.me>
<umhbqp$3um9d$1@dont-email.me> <20231227183842.00001b4f@yahoo.com>
<umn15d$svun$6@dont-email.me> <kv8gt0Fca4nU6@mid.individual.net>
<ump7l3$19ud7$1@dont-email.me> <umpn9d$1c212$1@dont-email.me>
<bc7bd58cc9f25ca592dc5c62f415ceec@news.novabbs.com>
<umq8hi$1ebuo$1@dont-email.me> <umrb9a$1m3mj$1@dont-email.me>
<86cyuksoc0.fsf@linuxsc.com> <umvfc1$1qq5r$1@newsreader4.netcologne.de>
<86y1d8qh1m.fsf@linuxsc.com> <un0hu1$2je9f$1@dont-email.me>
<86plyjr2br.fsf@linuxsc.com> <un3n40$37ahl$1@dont-email.me>
<86plyhpvcw.fsf@linuxsc.com> <un5l09$3it6l$1@dont-email.me>
<86v889nul2.fsf@linuxsc.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 4 Jan 2024 17:00:46 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="f0ddfc33a47954e9db22546cc8017656";
logging-data="3920708"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19p7CGbiYBPSWAgHOdc/XKubJ4V/BvwDa25OFIvUxafFw=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Firefox/91.0 SeaMonkey/2.53.18
Cancel-Lock: sha1:YvQ6oABbGd9/xZ4BXLrMeZ6rV8Y=
In-Reply-To: <86v889nul2.fsf@linuxsc.com>

by: Terje Mathisen - Thu, 4 Jan 2024 17:00 UTC

Tim Rentsch wrote:
> Terje Mathisen <terje.mathisen@tmsw.no> writes:
>
>> Tim Rentsch wrote:
> [...]
>>> This technique of packing several small values into a constant
>>> and shifting to get a specific result is useful in lots of
>>> situations. For example, if we are writing a function to test
>>> for primality, it can start with
>>>
>>> if( n < 64 ) return 0240404242024042424254 >>n &1;
>>>
>>> for a fast test on small integers.
>>
>> [...]
>>
>> Your 64-bit bitmap here is (inversely) related to the modulo 30
>> packing I'm using in my sieve, where the 8 possible prime
>> locations in each block of 30 numbers will fit in a byte.
>
> I wrote one of these about 20 years, and recently was motivated
> to write one again. I'm curious - have you tried running your
> sieve for all primes less than a trillion (1e12)?
>
No, I wrote this back in the 32-bit days and even though it is blocked
(with a 128 kB block size corresponding to nearly 4M numbers sieved per
block afair), I never ran it _that_ far!

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: AMD CPU funny

<557aa299a229d08e934e98e8503aae7e@news.novabbs.com>

copy mid

https://news.novabbs.org/devel/article-flat.php?id=36624&group=comp.arch#36624

copy link Newsgroups: comp.arch

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: AMD CPU funny
Date: Thu, 4 Jan 2024 23:08:26 +0000
Organization: novaBBS
Message-ID: <557aa299a229d08e934e98e8503aae7e@news.novabbs.com>
References: <ulv7j4$l0o0$1@dont-email.me> <um7o72$1bj9r$1@newsreader4.netcologne.de> <umc56l$32ucu$1@dont-email.me> <umhbqp$3um9d$1@dont-email.me> <20231227183842.00001b4f@yahoo.com> <umn15d$svun$6@dont-email.me> <kv8gt0Fca4nU6@mid.individual.net> <ump7l3$19ud7$1@dont-email.me> <umpn9d$1c212$1@dont-email.me> <bc7bd58cc9f25ca592dc5c62f415ceec@news.novabbs.com> <umq8hi$1ebuo$1@dont-email.me> <umrb9a$1m3mj$1@dont-email.me> <86cyuksoc0.fsf@linuxsc.com> <umvfc1$1qq5r$1@newsreader4.netcologne.de> <86y1d8qh1m.fsf@linuxsc.com> <un0hu1$2je9f$1@dont-email.me> <86plyjr2br.fsf@linuxsc.com> <un3n40$37ahl$1@dont-email.me> <c2d29ce394401627a285cd67760b773f@news.novabbs.com> <un4bs7$3a8u6$1@dont-email.me> <8877b685b1d6ff730ab2d1d1f5ffc79e@news.novabbs.com> <un5lfs$3iv05$1@dont-email.me> <un5sl2$3jouh$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="2297637"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Spam-Checker-Version: SpamAssassin 4.0.0
X-Rslight-Site: $2y$10$qq12kjNx4Wwdumxumoa27OCM1Rdj6pgsfV4bPm17qwXtEcIl4QWrm
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949

by: MitchAlsup - Thu, 4 Jan 2024 23:08 UTC

Terje Mathisen wrote:

>
> To run even faster than this I seem to need actual multiplications, with
> small fibonacci numbers as the multipliers:

> f(n+1) = f(n)+f(n-1)
> f(n+2) = 2*f(n)+f(n-1)
> f(n+3) = 3*f(n)+2*f(n-1)
> f(n+4) = 5*f(n)+3*(f(n-1)
> ....
> f(n+8) = 34*(f(n)+21*f(n-1)

Combining::

unsigned short table[18] = { 0, 1, 1, 2, 3, 5, 8, 13, 34, 55, 89,
144, 233, 377, 610, 987, 1597, 2584 };

fib( unsigned n )
{ i = n & 15;
a = table[i ];
b = table[i+1];
for( n -= i; n ; n -= 16 )
{
i = n & 15;
a = a*table[i ] + b*table[i+1];
b = a*table[i+1] + b*table[i+2];
}
return a;
}

Knocking off 16 fibs per loop.

Re: AMD CPU funny

<189a652eb8b48230eaf2ba1fe099fd9e@news.novabbs.com>

copy mid

https://news.novabbs.org/devel/article-flat.php?id=36625&group=comp.arch#36625

copy link Newsgroups: comp.arch

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: AMD CPU funny
Date: Thu, 4 Jan 2024 23:19:28 +0000
Organization: novaBBS
Message-ID: <189a652eb8b48230eaf2ba1fe099fd9e@news.novabbs.com>
References: <ulv7j4$l0o0$1@dont-email.me> <um7o72$1bj9r$1@newsreader4.netcologne.de> <umc56l$32ucu$1@dont-email.me> <umhbqp$3um9d$1@dont-email.me> <20231227183842.00001b4f@yahoo.com> <umn15d$svun$6@dont-email.me> <kv8gt0Fca4nU6@mid.individual.net> <ump7l3$19ud7$1@dont-email.me> <umpn9d$1c212$1@dont-email.me> <bc7bd58cc9f25ca592dc5c62f415ceec@news.novabbs.com> <umq8hi$1ebuo$1@dont-email.me> <umrb9a$1m3mj$1@dont-email.me> <86cyuksoc0.fsf@linuxsc.com> <umvfc1$1qq5r$1@newsreader4.netcologne.de> <86y1d8qh1m.fsf@linuxsc.com> <un0hu1$2je9f$1@dont-email.me> <86plyjr2br.fsf@linuxsc.com> <un3n40$37ahl$1@dont-email.me> <c2d29ce394401627a285cd67760b773f@news.novabbs.com> <un4bs7$3a8u6$1@dont-email.me> <8877b685b1d6ff730ab2d1d1f5ffc79e@news.novabbs.com> <un5lfs$3iv05$1@dont-email.me> <un5sl2$3jouh$1@dont-email.me> <557aa299a229d08e934e98e8503aae7e@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="2298471"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Spam-Checker-Version: SpamAssassin 4.0.0
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
X-Rslight-Site: $2y$10$AVKbVDP8iVHM1sAPyV5/fOIMLgOAWG2y/PuE.FeAMNlguriU1H1vu

by: MitchAlsup - Thu, 4 Jan 2024 23:19 UTC

MitchAlsup wrote:

> Terje Mathisen wrote:

>>
>> To run even faster than this I seem to need actual multiplications, with
>> small fibonacci numbers as the multipliers:

>> f(n+1) = f(n)+f(n-1)
>> f(n+2) = 2*f(n)+f(n-1)
>> f(n+3) = 3*f(n)+2*f(n-1)
>> f(n+4) = 5*f(n)+3*(f(n-1)
>> ....
>> f(n+8) = 34*(f(n)+21*f(n-1)

> Combining::

> unsigned short table[18] = { 0, 1, 1, 2, 3, 5, 8, 13, 34, 55, 89,
> 144, 233, 377, 610, 987, 1597, 2584 };

> fib( unsigned n )
> {
> i = n & 15;
> a = table[i ];
> b = table[i+1];
> for( n -= i; n ; n -= 16 )
> {
> i = n & 15;
t = a*table[i ] + b*table[i+1];
> b = a*table[i+1] + b*table[i+2];
a = t;
> }
> return a;
> }

> Knocking off 16 fibs per loop.

Re: AMD CPU funny

<un8gmv$3nm5$1@dont-email.me>

copy mid

https://news.novabbs.org/devel/article-flat.php?id=36628&group=comp.arch#36628

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: terje.mathisen@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: AMD CPU funny
Date: Fri, 5 Jan 2024 10:04:31 +0100
Organization: A noiseless patient Spider
Lines: 87
Message-ID: <un8gmv$3nm5$1@dont-email.me>
References: <ulv7j4$l0o0$1@dont-email.me> <umhbqp$3um9d$1@dont-email.me>
<20231227183842.00001b4f@yahoo.com> <umn15d$svun$6@dont-email.me>
<kv8gt0Fca4nU6@mid.individual.net> <ump7l3$19ud7$1@dont-email.me>
<umpn9d$1c212$1@dont-email.me>
<bc7bd58cc9f25ca592dc5c62f415ceec@news.novabbs.com>
<umq8hi$1ebuo$1@dont-email.me> <umrb9a$1m3mj$1@dont-email.me>
<86cyuksoc0.fsf@linuxsc.com> <umvfc1$1qq5r$1@newsreader4.netcologne.de>
<86y1d8qh1m.fsf@linuxsc.com> <un0hu1$2je9f$1@dont-email.me>
<86plyjr2br.fsf@linuxsc.com> <un3n40$37ahl$1@dont-email.me>
<c2d29ce394401627a285cd67760b773f@news.novabbs.com>
<un4bs7$3a8u6$1@dont-email.me>
<8877b685b1d6ff730ab2d1d1f5ffc79e@news.novabbs.com>
<un5lfs$3iv05$1@dont-email.me> <un5sl2$3jouh$1@dont-email.me>
<86r0ixnt6q.fsf@linuxsc.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 5 Jan 2024 09:04:31 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="8beb3b537139c86d07e7244064d3bc26";
logging-data="122565"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18M+YvwEbCo+cZ9fMxQIMxZAVSpg0XJC51CKE+/WUs9fA=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Firefox/91.0 SeaMonkey/2.53.18
Cancel-Lock: sha1:toNXt3QuMf8ykJGZ1PsvmIa1W5M=
In-Reply-To: <86r0ixnt6q.fsf@linuxsc.com>

by: Terje Mathisen - Fri, 5 Jan 2024 09:04 UTC

Tim Rentsch wrote:
> Terje Mathisen <terje.mathisen@tmsw.no> writes:
>
>> Terje Mathisen wrote:
>>
>>> Since Tim is using a 4-way unroll, I believe he would be loading
>>> 4 values from a table of at least 7 entries to handle loop
>>> alignement.
>>
>> I took another look at this, and it looks pretty nice to use 4
>> starting entries (a..d in increasing order), then a two-cycle
>> recurrence:
>>
>> a = c+d b = c+2*d c = a+b d = a+2*b
>>
>> which every LEA-capable compiler around could convert into
>>
>> lea rax,[rcx+rdx] lea rbx,[rcx+2*rdx]
>>
>> lea rcx,[rax+rbx] lea rdx,[rax+2*rbx]
>>
>> The pairs of instructions can run at the same time, so this can run
>> in two cycles as long as back-to-back LEA can run in subsequent
>> cycles.
>>
>> To run even faster than this I seem to need actual
>> multiplications, with small fibonacci numbers as the multipliers:
>>
>> f(n+1) = f(n)+f(n-1) f(n+2) = 2*f(n)+f(n-1) f(n+3) =
>> 3*f(n)+2*f(n-1) f(n+4) = 5*f(n)+3*(f(n-1) ... f(n+8) =
>> 34*(f(n)+21*f(n-1)
>
> As much as I am tempted to post the actual code, let me start with
> something more like a hint, which I fully expect you can carry
> forward to do the whole deal.
>
> (forward by 0) a b (forward by 1) b a+b (forward
> by 2) a+b a+2b (forward by 3) a+2b 2a+3b (forward by
> 4) 2a+3b 3a+5b (forward by 5) 3a+5b 5a+8b (forward by 6)
> 5a+8b 8a+13b
>
> and so forth.

Isn't that exactly the same as what I posted just above, i.e. that
subsequent levels require pairs of multiplications by smaller fib numbers?

All these muls are of course independent, but since we only have a very
small number of multipliers in current CPUS, we cannot expect to do more
than a pair of them in each cycle (all with latency 3-5).

Doing it with adds instead means that we have to generate
a,2*a,3*a,5*a,8*a,13*a etc.

We can use LEA and shifts to handle a*(2,3,5,8) and b ditto, so that
gets us to the next 5 fib numbers all done in parallel and at latency 1
or 2:

All these can be done in parallel on a 5-wide CPU:
a5 = LEA(a,a*4);
b3 = LEA(b,b*2); b5 = LEA(b,b*4);
f(n) = a+b;
f(n+1) = LEA(b+a*2);

These next operations can run in the second cycle, so we've generated 5
new fib() numbers in two cycles:

f(n+2) = f(n)+f(n+1);
f(n+3) = a5+b3;
f(n+4) = LEA(b5+a*8);

In the same cycle we can generate a couple more aggregate multiples:

a13 = LEA(a5,a*8); b13=LEA(b5,b*8);

and then we get to spew out the 6th fib value in the third cycle

f(n+5) = LEA(a13,b*8);

but I get the feel that it is better to stay at 4 or 5 new numbers in
each step?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: AMD CPU funny

<86il47oh3l.fsf@linuxsc.com>

copy mid

https://news.novabbs.org/devel/article-flat.php?id=36635&group=comp.arch#36635

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: tr.17687@z991.linuxsc.com (Tim Rentsch)
Newsgroups: comp.arch
Subject: Re: AMD CPU funny
Date: Fri, 05 Jan 2024 08:55:10 -0800
Organization: A noiseless patient Spider
Lines: 121
Message-ID: <86il47oh3l.fsf@linuxsc.com>
References: <ulv7j4$l0o0$1@dont-email.me> <umn15d$svun$6@dont-email.me> <kv8gt0Fca4nU6@mid.individual.net> <ump7l3$19ud7$1@dont-email.me> <umpn9d$1c212$1@dont-email.me> <bc7bd58cc9f25ca592dc5c62f415ceec@news.novabbs.com> <umq8hi$1ebuo$1@dont-email.me> <umrb9a$1m3mj$1@dont-email.me> <86cyuksoc0.fsf@linuxsc.com> <umvfc1$1qq5r$1@newsreader4.netcologne.de> <86y1d8qh1m.fsf@linuxsc.com> <un0hu1$2je9f$1@dont-email.me> <86plyjr2br.fsf@linuxsc.com> <un3n40$37ahl$1@dont-email.me> <c2d29ce394401627a285cd67760b773f@news.novabbs.com> <un4bs7$3a8u6$1@dont-email.me> <8877b685b1d6ff730ab2d1d1f5ffc79e@news.novabbs.com> <un5lfs$3iv05$1@dont-email.me> <un5sl2$3jouh$1@dont-email.me> <86r0ixnt6q.fsf@linuxsc.com> <un8gmv$3nm5$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Info: dont-email.me; posting-host="f452586c2b0e4043db6639ac46126fbe";
logging-data="246650"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19+FiG+Tb9fQi5H4TyUpipae9YY4dm+bwg="
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux)
Cancel-Lock: sha1:zT8XmXRFCJzjmBLwPHd/HuqlwJQ=
sha1:bpRAS6hzTZkXvpikCkq+OgEbWU4=

by: Tim Rentsch - Fri, 5 Jan 2024 16:55 UTC

Terje Mathisen <terje.mathisen@tmsw.no> writes:

> Tim Rentsch wrote:
>
>> Terje Mathisen <terje.mathisen@tmsw.no> writes:
>>
>>> Terje Mathisen wrote:
>>>
>>>> Since Tim is using a 4-way unroll, I believe he would be loading
>>>> 4 values from a table of at least 7 entries to handle loop
>>>> alignement.
>>>
>>> I took another look at this, and it looks pretty nice to use 4
>>> starting entries (a..d in increasing order), then a two-cycle
>>> recurrence:
>>>
>>> a = c+d b = c+2*d c = a+b d = a+2*b
>>>
>>> which every LEA-capable compiler around could convert into
>>>
>>> lea rax,[rcx+rdx] lea rbx,[rcx+2*rdx]
>>>
>>> lea rcx,[rax+rbx] lea rdx,[rax+2*rbx]
>>>
>>> The pairs of instructions can run at the same time, so this can run
>>> in two cycles as long as back-to-back LEA can run in subsequent
>>> cycles.
>>>
>>> To run even faster than this I seem to need actual
>>> multiplications, with small fibonacci numbers as the multipliers:
>>>
>>> f(n+1) = f(n)+f(n-1) f(n+2) = 2*f(n)+f(n-1) f(n+3) =
>>> 3*f(n)+2*f(n-1) f(n+4) = 5*f(n)+3*(f(n-1) ... f(n+8) =
>>> 34*(f(n)+21*f(n-1)
>>
>> As much as I am tempted to post the actual code, let me start with
>> something more like a hint, which I fully expect you can carry
>> forward to do the whole deal.
>>
>> (forward by 0) a b (forward by 1) b a+b (forward
>> by 2) a+b a+2b (forward by 3) a+2b 2a+3b (forward by
>> 4) 2a+3b 3a+5b (forward by 5) 3a+5b 5a+8b (forward by 6)
>> 5a+8b 8a+13b
>>
>> and so forth.
>
> Isn't that exactly the same as what I posted just above, i.e. that
> subsequent levels require pairs of multiplications by smaller fib
> numbers?

I didn't really follow what you were saying, but I'm pretty
sure the two aren't the same, because I keep only two values,
a and b, no matter how many levels of x-fold-speedup is done.

> All these muls are of course independent, but since we only have a
> very small number of multipliers in current CPUS, we cannot expect to
> do more than a pair of them in each cycle (all with latency 3-5).
>
> Doing it with adds instead means that we have to generate
> a,2*a,3*a,5*a,8*a,13*a etc.
>
> We can use LEA and shifts to handle a*(2,3,5,8) and b ditto, so that
> gets us to the next 5 fib numbers all done in parallel and at latency
> 1 or 2:
>
> All these can be done in parallel on a 5-wide CPU:
> a5 = LEA(a,a*4);
> b3 = LEA(b,b*2); b5 = LEA(b,b*4);
> f(n) = a+b;
> f(n+1) = LEA(b+a*2);
>
> These next operations can run in the second cycle, so we've generated
> 5 new fib() numbers in two cycles:
>
> f(n+2) = f(n)+f(n+1);
> f(n+3) = a5+b3;
> f(n+4) = LEA(b5+a*8);
>
> In the same cycle we can generate a couple more aggregate multiples:
>
> a13 = LEA(a5,a*8); b13=LEA(b5,b*8);
>
> and then we get to spew out the 6th fib value in the third cycle
>
> f(n+5) = LEA(a13,b*8);
>
> but I get the feel that it is better to stay at 4 or 5 new numbers in
> each step?

In the code that follows, b is the current fibonacci number, and a
is the previous fibonacci number.

U64
fib_example( unsigned n ){
U64 b = n&1, a = b^1;

if( n & 2 ) a += b, b += a;
if( n & 4 ) a += b, b += a, a += b, b += a;
if( n & 8 ){
U64 na = 13*a+21*b, nb = 21*a+34*b;
a = na, b = nb;
}

n >>= 4;
while( n-- ){
U64 na = 610*a + 987*b, nb = 987*a + 1597*b;
a = na, b = nb;
}

return b;
}

Note that fib(8) is 21, and fib(16) is 987.

It should be easy to see how to modify this code to extend it one
level and get a 32-fold speedup for the last stage, which does in
fact give a significant performance improvement.

Also I have no doubt that you can see ways to make the code run
even faster. I'm interested to see what you come up with, with
the stipulation that no lookup tables be used.

Re: AMD CPU funny

<un9d0i$7kl7$1@dont-email.me>

copy mid

https://news.novabbs.org/devel/article-flat.php?id=36636&group=comp.arch#36636

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: AMD CPU funny
Date: Fri, 5 Jan 2024 11:07:27 -0600
Organization: A noiseless patient Spider
Lines: 73
Message-ID: <un9d0i$7kl7$1@dont-email.me>
References: <ulv7j4$l0o0$1@dont-email.me>
<hVh*RTmyz@news.chiark.greenend.org.uk> <um1h21$13l52$1@dont-email.me>
<e2bb329488167ada6ece600c8f28d3ed@news.novabbs.com>
<um43of$1j3jj$1@dont-email.me> <eVh*Lrxyz@news.chiark.greenend.org.uk>
<3d2dcfa36ceca1ee856b9de3ab416a01@news.novabbs.com>
<fVh*Dgyyz@news.chiark.greenend.org.uk> <um7jom$27ikh$3@dont-email.me>
<um7o72$1bj9r$1@newsreader4.netcologne.de> <umc56l$32ucu$1@dont-email.me>
<umhbqp$3um9d$1@dont-email.me> <20231227183842.00001b4f@yahoo.com>
<umn15d$svun$6@dont-email.me> <kv8gt0Fca4nU6@mid.individual.net>
<ump7l3$19ud7$1@dont-email.me> <umpn9d$1c212$1@dont-email.me>
<bc7bd58cc9f25ca592dc5c62f415ceec@news.novabbs.com>
<umq8hi$1ebuo$1@dont-email.me> <umrb9a$1m3mj$1@dont-email.me>
<ums29d$1p6cc$1@dont-email.me> <8634vgrwc2.fsf@linuxsc.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 5 Jan 2024 17:07:31 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="e2732455f8e9d56b30f52cd0013f6db2";
logging-data="250535"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18hiJd4KeSaPL3/Kn4i//VC"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:sA7ER64qy1x6I81lftY8N/Z+GVI=
Content-Language: en-US
In-Reply-To: <8634vgrwc2.fsf@linuxsc.com>

by: BGB - Fri, 5 Jan 2024 17:07 UTC

On 1/2/2024 2:08 AM, Tim Rentsch wrote:
> Daniel James <daniel@me.invalid> writes:
>
>> On 31/12/2023 09:12, Pancho wrote:
>>
>>> In fact, given that fib grows exponentially, so we only ever need to
>>> calculate it for small values of n, the only ?bad? solution is the
>>> naive double recursive one.
>>
>> Well, I might disagree.
>>
>> Given that we only ever need to calculate it for small values of n (as
>> you say, and if that's true) its efficiency doesn't matter. If you
>> need to compute it often enough that efficiency might become relevant
>> just put the results in a table and use that. Given that fib goes over
>> 10^12 after about 60 terms it needn't be a very big table.
>
> When the size of the table is more than ten times the size of
> the code, and the code is reasonably fast, using a table is a
> poor choice. Fibonacci functions don't have to be slow.
>
>> The naive doubly recursive solution has the huge merit of being
>> human-readable.
>
> But unacceptably slow.
>

Ironically, its slowness is its main use-case...

It puts function call/return through a workout, and also gives a simple
check value to detect if it has gone amiss.

There are many faster ways to calculate Fibonacci numbers, but not so
many that also happen to be a ready-made and fairly concise
microbenchmark...

Even if, yes, in premise a compiler could optimize it away.

But, say, if one wanted something less slow, say (untested):
long fasterfib(int n)
{ long c0, c1, c2;
if(n<2)
return(1);
c0=1;
c1=1;
c2=c0+c1;
while(n>2)
{
c0=c1;
c1=c2;
c2=c0+c1;
n--;
}
return(c2);
}

>> Unlike, say:
>>
>> double fib( int n )
>> {
>> return (pow( (1+sqrt(5.0))/2, n+1 )
>> - pow( (1-sqrt(5.0))/2, n+1 ))
>> / sqrt(5.0);
>> }
>>
>> ... which is great for fibs of big values, but not fast for small ones.
>
> Unacceptable for fibs of big values, because the values computed
> are wrong.

Re: AMD CPU funny

<un9jgr$8jbn$1@dont-email.me>

copy mid

https://news.novabbs.org/devel/article-flat.php?id=36639&group=comp.arch#36639

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: terje.mathisen@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: AMD CPU funny
Date: Fri, 5 Jan 2024 19:58:34 +0100
Organization: A noiseless patient Spider
Lines: 146
Message-ID: <un9jgr$8jbn$1@dont-email.me>
References: <ulv7j4$l0o0$1@dont-email.me> <umn15d$svun$6@dont-email.me>
<kv8gt0Fca4nU6@mid.individual.net> <ump7l3$19ud7$1@dont-email.me>
<umpn9d$1c212$1@dont-email.me>
<bc7bd58cc9f25ca592dc5c62f415ceec@news.novabbs.com>
<umq8hi$1ebuo$1@dont-email.me> <umrb9a$1m3mj$1@dont-email.me>
<86cyuksoc0.fsf@linuxsc.com> <umvfc1$1qq5r$1@newsreader4.netcologne.de>
<86y1d8qh1m.fsf@linuxsc.com> <un0hu1$2je9f$1@dont-email.me>
<86plyjr2br.fsf@linuxsc.com> <un3n40$37ahl$1@dont-email.me>
<c2d29ce394401627a285cd67760b773f@news.novabbs.com>
<un4bs7$3a8u6$1@dont-email.me>
<8877b685b1d6ff730ab2d1d1f5ffc79e@news.novabbs.com>
<un5lfs$3iv05$1@dont-email.me> <un5sl2$3jouh$1@dont-email.me>
<86r0ixnt6q.fsf@linuxsc.com> <un8gmv$3nm5$1@dont-email.me>
<86il47oh3l.fsf@linuxsc.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 5 Jan 2024 18:58:35 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="8beb3b537139c86d07e7244064d3bc26";
logging-data="281975"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+jcL8KP+y1PFE2bTOmIxAJRHBbPOYvo44qmQQmm5HmxA=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Firefox/91.0 SeaMonkey/2.53.18
Cancel-Lock: sha1:dYYH3pBwwRkVrVHttljMOG2h+Pc=
In-Reply-To: <86il47oh3l.fsf@linuxsc.com>

by: Terje Mathisen - Fri, 5 Jan 2024 18:58 UTC

Tim Rentsch wrote:
> Terje Mathisen <terje.mathisen@tmsw.no> writes:
>
>> Tim Rentsch wrote:
>>
>>> Terje Mathisen <terje.mathisen@tmsw.no> writes:
>>>
>>>> Terje Mathisen wrote:
>>>>
>>>>> Since Tim is using a 4-way unroll, I believe he would be loading
>>>>> 4 values from a table of at least 7 entries to handle loop
>>>>> alignement.
>>>>
>>>> I took another look at this, and it looks pretty nice to use 4
>>>> starting entries (a..d in increasing order), then a two-cycle
>>>> recurrence:
>>>>
>>>> a = c+d b = c+2*d c = a+b d = a+2*b
>>>>
>>>> which every LEA-capable compiler around could convert into
>>>>
>>>> lea rax,[rcx+rdx] lea rbx,[rcx+2*rdx]
>>>>
>>>> lea rcx,[rax+rbx] lea rdx,[rax+2*rbx]
>>>>
>>>> The pairs of instructions can run at the same time, so this can run
>>>> in two cycles as long as back-to-back LEA can run in subsequent
>>>> cycles.
>>>>
>>>> To run even faster than this I seem to need actual
>>>> multiplications, with small fibonacci numbers as the multipliers:
>>>>
>>>> f(n+1) = f(n)+f(n-1) f(n+2) = 2*f(n)+f(n-1) f(n+3) =
>>>> 3*f(n)+2*f(n-1) f(n+4) = 5*f(n)+3*(f(n-1) ... f(n+8) =
>>>> 34*(f(n)+21*f(n-1)
>>>
>>> As much as I am tempted to post the actual code, let me start with
>>> something more like a hint, which I fully expect you can carry
>>> forward to do the whole deal.
>>>
>>> (forward by 0) a b (forward by 1) b a+b (forward
>>> by 2) a+b a+2b (forward by 3) a+2b 2a+3b (forward by
>>> 4) 2a+3b 3a+5b (forward by 5) 3a+5b 5a+8b (forward by 6)
>>> 5a+8b 8a+13b
>>>
>>> and so forth.
>>
>> Isn't that exactly the same as what I posted just above, i.e. that
>> subsequent levels require pairs of multiplications by smaller fib
>> numbers?
>
> I didn't really follow what you were saying, but I'm pretty
> sure the two aren't the same, because I keep only two values,
> a and b, no matter how many levels of x-fold-speedup is done.
>
>> All these muls are of course independent, but since we only have a
>> very small number of multipliers in current CPUS, we cannot expect to
>> do more than a pair of them in each cycle (all with latency 3-5).
>>
>> Doing it with adds instead means that we have to generate
>> a,2*a,3*a,5*a,8*a,13*a etc.
>>
>> We can use LEA and shifts to handle a*(2,3,5,8) and b ditto, so that
>> gets us to the next 5 fib numbers all done in parallel and at latency
>> 1 or 2:
>>
>> All these can be done in parallel on a 5-wide CPU:
>> a5 = LEA(a,a*4);
>> b3 = LEA(b,b*2); b5 = LEA(b,b*4);
>> f(n) = a+b;
>> f(n+1) = LEA(b+a*2);
>>
>> These next operations can run in the second cycle, so we've generated
>> 5 new fib() numbers in two cycles:
>>
>> f(n+2) = f(n)+f(n+1);
>> f(n+3) = a5+b3;
>> f(n+4) = LEA(b5+a*8);
>>
>> In the same cycle we can generate a couple more aggregate multiples:
>>
>> a13 = LEA(a5,a*8); b13=LEA(b5,b*8);
>>
>> and then we get to spew out the 6th fib value in the third cycle
>>
>> f(n+5) = LEA(a13,b*8);
>>
>> but I get the feel that it is better to stay at 4 or 5 new numbers in
>> each step?
>
> In the code that follows, b is the current fibonacci number, and a
> is the previous fibonacci number.
>
> U64
> fib_example( unsigned n ){
> U64 b = n&1, a = b^1;
>
> if( n & 2 ) a += b, b += a;
> if( n & 4 ) a += b, b += a, a += b, b += a;
> if( n & 8 ){
> U64 na = 13*a+21*b, nb = 21*a+34*b;
> a = na, b = nb;
> }
>
> n >>= 4;
> while( n-- ){
> U64 na = 610*a + 987*b, nb = 987*a + 1597*b;
> a = na, b = nb;
> }
>
> return b;
> }
>
> Note that fib(8) is 21, and fib(16) is 987.

You are skipping intermediate numbers!

>
> It should be easy to see how to modify this code to extend it one
> level and get a 32-fold speedup for the last stage, which does in
> fact give a significant performance improvement.

I should have realized that you were in fact doing so since you showed
your log2(n) algorithm, but I was in fact assuming that you had to
generate all intermediate values as well.
>
> Also I have no doubt that you can see ways to make the code run
> even faster. I'm interested to see what you come up with, with
> the stipulation that no lookup tables be used.
>
The four MULs will be the limiting factor here, but the key idea is that
there will only ever be four of them, so this could go however far you
want to take it.

With four MULs you should get a new (a,b) pair every 6-7 clock cycles
(assuming two fully pipelined MUL units), this isn't that much faster
than code which generates 5 new fib values every two cycles, and then
you actually get to see all intermediate values as well.

Still very nice idea. :-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: AMD CPU funny

<86bk9yo4p8.fsf@linuxsc.com>

copy mid

https://news.novabbs.org/devel/article-flat.php?id=36658&group=comp.arch#36658

copy link Newsgroups: comp.arch

Path: i2pn2.org!rocksolid2!news.neodome.net!news.mixmin.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: tr.17687@z991.linuxsc.com (Tim Rentsch)
Newsgroups: comp.arch
Subject: Re: AMD CPU funny
Date: Sat, 06 Jan 2024 07:35:15 -0800
Organization: A noiseless patient Spider
Lines: 55
Message-ID: <86bk9yo4p8.fsf@linuxsc.com>
References: <ulv7j4$l0o0$1@dont-email.me> <umn15d$svun$6@dont-email.me> <kv8gt0Fca4nU6@mid.individual.net> <ump7l3$19ud7$1@dont-email.me> <umpn9d$1c212$1@dont-email.me> <bc7bd58cc9f25ca592dc5c62f415ceec@news.novabbs.com> <umq8hi$1ebuo$1@dont-email.me> <umrb9a$1m3mj$1@dont-email.me> <86cyuksoc0.fsf@linuxsc.com> <umvfc1$1qq5r$1@newsreader4.netcologne.de> <86y1d8qh1m.fsf@linuxsc.com> <un0hu1$2je9f$1@dont-email.me> <86plyjr2br.fsf@linuxsc.com> <un3n40$37ahl$1@dont-email.me> <c2d29ce394401627a285cd67760b773f@news.novabbs.com> <un4bs7$3a8u6$1@dont-email.me> <8877b685b1d6ff730ab2d1d1f5ffc79e@news.novabbs.com> <un5lfs$3iv05$1@dont-email.me> <un5sl2$3jouh$1@dont-email.me> <557aa299a229d08e934e98e8503aae7e@news.novabbs.com> <189a652eb8b48230eaf2ba1fe099fd9e@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Info: dont-email.me; posting-host="4696a462fee5af12b15e4f5cf7f06c2e";
logging-data="706658"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+4/sYhpIea3iDjwmsbzQylTUMcOa35pdA="
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux)
Cancel-Lock: sha1:hMIPVgVDE87Aw8MJ2wclwiRXX68=
sha1:tjZ/f+Qd0AwBk8w1Xfvdkkm8sz0=

by: Tim Rentsch - Sat, 6 Jan 2024 15:35 UTC

mitchalsup@aol.com (MitchAlsup) writes:

> MitchAlsup wrote:
>
>> Terje Mathisen wrote:
>
>>>
>>> To run even faster than this I seem to need actual multiplications,
>>> with small fibonacci numbers as the multipliers:
>>>
>>> f(n+1) = f(n)+f(n-1)
>>> f(n+2) = 2*f(n)+f(n-1)
>>> f(n+3) = 3*f(n)+2*f(n-1)
>>> f(n+4) = 5*f(n)+3*(f(n-1)
>>> ....
>>> f(n+8) = 34*(f(n)+21*f(n-1)
>>
>> Combining::
>>
>> unsigned short table[18] = { 0, 1, 1, 2, 3, 5, 8, 13, 34, 55, 89,
>> 144, 233, 377, 610, 987, 1597, 2584 };
>>
>> fib( unsigned n )
>> {
>> i = n & 15;
>> a = table[i ];
>> b = table[i+1];
>> for( n -= i; n ; n -= 16 )
>> {
>> i = n & 15;
>
> t = a*table[i ] + b*table[i+1];
>
>> b = a*table[i+1] + b*table[i+2];
>
> a = t;

After n -= i, the lower order four bits of n are zero, which means
the coefficients being used are 0, 1, and 1. Also the table is
missing the value 21.

U64
fib( unsigned n ){
static unsigned short table[17] = {
1, 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610,
};
U64 a = table[n&15], b = table[(n&15)+1];

for( n >>= 4; n--; ){
U64 na = 610*a + 987*b, nb = 987*a + 1597*b;
a = na, b = nb;
}

return b;
}

Re: AMD CPU funny

<867ckmnzo7.fsf@linuxsc.com>

copy mid

https://news.novabbs.org/devel/article-flat.php?id=36662&group=comp.arch#36662

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: tr.17687@z991.linuxsc.com (Tim Rentsch)
Newsgroups: comp.arch
Subject: Re: AMD CPU funny
Date: Sat, 06 Jan 2024 09:23:52 -0800
Organization: A noiseless patient Spider
Lines: 55
Message-ID: <867ckmnzo7.fsf@linuxsc.com>
References: <ulv7j4$l0o0$1@dont-email.me> <e2bb329488167ada6ece600c8f28d3ed@news.novabbs.com> <um43of$1j3jj$1@dont-email.me> <eVh*Lrxyz@news.chiark.greenend.org.uk> <3d2dcfa36ceca1ee856b9de3ab416a01@news.novabbs.com> <fVh*Dgyyz@news.chiark.greenend.org.uk> <um7jom$27ikh$3@dont-email.me> <um7o72$1bj9r$1@newsreader4.netcologne.de> <umc56l$32ucu$1@dont-email.me> <umhbqp$3um9d$1@dont-email.me> <20231227183842.00001b4f@yahoo.com> <umn15d$svun$6@dont-email.me> <kv8gt0Fca4nU6@mid.individual.net> <ump7l3$19ud7$1@dont-email.me> <umpn9d$1c212$1@dont-email.me> <bc7bd58cc9f25ca592dc5c62f415ceec@news.novabbs.com> <umq8hi$1ebuo$1@dont-email.me> <umrb9a$1m3mj$1@dont-email.me> <ums29d$1p6cc$1@dont-email.me> <8634vgrwc2.fsf@linuxsc.com> <un9d0i$7kl7$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Info: dont-email.me; posting-host="4696a462fee5af12b15e4f5cf7f06c2e";
logging-data="736826"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/b/JCB4VMIxWBd+Py+TmyJhHvD7VvXvRE="
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux)
Cancel-Lock: sha1:Wp93sasD/hz+2mb80oNI7AYVN3U=
sha1:w6ebMdG6TYCpEqmpZ+fyOzcgScM=

by: Tim Rentsch - Sat, 6 Jan 2024 17:23 UTC

BGB <cr88192@gmail.com> writes:

> On 1/2/2024 2:08 AM, Tim Rentsch wrote:
>
>> Daniel James <daniel@me.invalid> writes:
>>
>>> On 31/12/2023 09:12, Pancho wrote:
>>>
>>>> In fact, given that fib grows exponentially, so we only ever need to
>>>> calculate it for small values of n, the only ?bad? solution is the
>>>> naive double recursive one.
>>>
>>> Well, I might disagree.
>>>
>>> Given that we only ever need to calculate it for small values of n (as
>>> you say, and if that's true) its efficiency doesn't matter. If you
>>> need to compute it often enough that efficiency might become relevant
>>> just put the results in a table and use that. Given that fib goes over
>>> 10^12 after about 60 terms it needn't be a very big table.
>>
>> When the size of the table is more than ten times the size of
>> the code, and the code is reasonably fast, using a table is a
>> poor choice. Fibonacci functions don't have to be slow.
>>
>>> The naive doubly recursive solution has the huge merit of being
>>> human-readable.
>>
>> But unacceptably slow.

[..]

> But, say, if one wanted something less slow, say (untested):
> long fasterfib(int n)
> {
> long c0, c1, c2;
> if(n<2)
> return(1);
> c0=1;
> c1=1;
> c2=c0+c1;
> while(n>2)
> {
> c0=c1;
> c1=c2;
> c2=c0+c1;
> n--;
> }
> return(c2);
> }

The Fibonacci numbers start at 0: fib(0) == 0, fib(1) == 1, ...
The code above produces something else.

What you're showing is a standard iterative version, much like
the one posted by Thomas Koenig early on.

Re: AMD CPU funny

<86ttnqmihr.fsf@linuxsc.com>

copy mid

https://news.novabbs.org/devel/article-flat.php?id=36671&group=comp.arch#36671

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!news.neodome.net!news.mixmin.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: tr.17687@z991.linuxsc.com (Tim Rentsch)
Newsgroups: comp.arch
Subject: Re: AMD CPU funny
Date: Sat, 06 Jan 2024 10:20:16 -0800
Organization: A noiseless patient Spider
Lines: 88
Message-ID: <86ttnqmihr.fsf@linuxsc.com>
References: <ulv7j4$l0o0$1@dont-email.me> <ump7l3$19ud7$1@dont-email.me> <umpn9d$1c212$1@dont-email.me> <bc7bd58cc9f25ca592dc5c62f415ceec@news.novabbs.com> <umq8hi$1ebuo$1@dont-email.me> <umrb9a$1m3mj$1@dont-email.me> <86cyuksoc0.fsf@linuxsc.com> <umvfc1$1qq5r$1@newsreader4.netcologne.de> <86y1d8qh1m.fsf@linuxsc.com> <un0hu1$2je9f$1@dont-email.me> <86plyjr2br.fsf@linuxsc.com> <un3n40$37ahl$1@dont-email.me> <c2d29ce394401627a285cd67760b773f@news.novabbs.com> <un4bs7$3a8u6$1@dont-email.me> <8877b685b1d6ff730ab2d1d1f5ffc79e@news.novabbs.com> <un5lfs$3iv05$1@dont-email.me> <un5sl2$3jouh$1@dont-email.me> <86r0ixnt6q.fsf@linuxsc.com> <un8gmv$3nm5$1@dont-email.me> <86il47oh3l.fsf@linuxsc.com> <un9jgr$8jbn$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Info: dont-email.me; posting-host="4696a462fee5af12b15e4f5cf7f06c2e";
logging-data="752994"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18D2KrCpl7HJnISQkRMbDbdo8jiV6WzWMo="
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux)
Cancel-Lock: sha1:ybHQkyx16DTQDq/HNUM5sQj/Ny8=
sha1:Jp/a69yQP4O5sN7LCY9zy7O5hgY=

by: Tim Rentsch - Sat, 6 Jan 2024 18:20 UTC

Terje Mathisen <terje.mathisen@tmsw.no> writes:

> Tim Rentsch wrote:

[...]

>> In the code that follows, b is the current fibonacci number, and a
>> is the previous fibonacci number.
>>
>> U64
>> fib_example( unsigned n ){
>> U64 b = n&1, a = b^1;
>>
>> if( n & 2 ) a += b, b += a;
>> if( n & 4 ) a += b, b += a, a += b, b += a;
>> if( n & 8 ){
>> U64 na = 13*a+21*b, nb = 21*a+34*b;
>> a = na, b = nb;
>> }
>>
>> n >>= 4;
>> while( n-- ){
>> U64 na = 610*a + 987*b, nb = 987*a + 1597*b;
>> a = na, b = nb;
>> }
>>
>> return b;
>> }
>>
>> Note that fib(8) is 21, and fib(16) is 987.
>
> You are skipping intermediate numbers!

Oh yes. It never occurred to me to do anything but.

[...]

> Still very nice idea. :-)

Thank you. It seem obvious in retrospect, but that property
holds for lots of ideas that are not obvious ahead of time.

I have an interesting footnote. My first implementation of this
idea was written using a functional recursive style. I was doing
performance measurements using a relatively low optimization setting
(-Os). Just this morning I decided to turn up the level to -O3, and
was surprised to see that the functional version ran faster than
everything else, 25% faster than the closest imperative version.
I've seen this result before in other cases - code written in a
functional style actually runs faster than code written in an
imperative style.

Here is the code:

typedef unsigned long long U64;

static U64 ff1( U64, U64, unsigned );
static U64 ff2( U64, U64, unsigned );
static U64 ff3( U64, U64, unsigned );
static U64 ff4( U64, U64, unsigned );

U64
functional_fib( unsigned n ){
return
n < 8 ? 0xd8532110u >>(n*4) &15 :
n & 1 ? ff1( 0, 1, n/2 ) :
/*******/ ff1( 1, 0, n/2 );
}

U64
ff1( U64 a, U64 b, unsigned k ){
return k & 1 ? ff2( a+b, a+2*b, k/2 ) : ff2( a, b, k/2 );
}

U64
ff2( U64 a, U64 b, unsigned k ){
return k & 1 ? ff3( 2*a+3*b, 3*a+5*b, k/2 ) : ff3( a, b, k/2 );
}

U64
ff3( U64 a, U64 b, unsigned k ){
return k & 1 ? ff4( 13*a+21*b, 21*a+34*b, k/2 ) : ff4( a, b, k/2 );
}

U64
ff4( U64 a, U64 b, unsigned k ){
return k ? ff4( 610*a+987*b, 987*a+1597*b, k-1 ) : b;
}

Re: AMD CPU funny

<unh0kq$1ig5u$1@dont-email.me>

copy mid

https://news.novabbs.org/devel/article-flat.php?id=36728&group=comp.arch#36728

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: paaronclayton@gmail.com (Paul A. Clayton)
Newsgroups: comp.arch
Subject: Re: AMD CPU funny
Date: Mon, 8 Jan 2024 09:25:28 -0500
Organization: A noiseless patient Spider
Lines: 22
Message-ID: <unh0kq$1ig5u$1@dont-email.me>
References: <ulv7j4$l0o0$1@dont-email.me> <um1h21$13l52$1@dont-email.me>
<e2bb329488167ada6ece600c8f28d3ed@news.novabbs.com>
<um43of$1j3jj$1@dont-email.me> <eVh*Lrxyz@news.chiark.greenend.org.uk>
<3d2dcfa36ceca1ee856b9de3ab416a01@news.novabbs.com>
<fVh*Dgyyz@news.chiark.greenend.org.uk> <um7jom$27ikh$3@dont-email.me>
<um7o72$1bj9r$1@newsreader4.netcologne.de> <umc56l$32ucu$1@dont-email.me>
<umhbqp$3um9d$1@dont-email.me> <20231227183842.00001b4f@yahoo.com>
<umn15d$svun$6@dont-email.me> <kv8gt0Fca4nU6@mid.individual.net>
<ump7l3$19ud7$1@dont-email.me> <umpn9d$1c212$1@dont-email.me>
<bc7bd58cc9f25ca592dc5c62f415ceec@news.novabbs.com>
<umq8hi$1ebuo$1@dont-email.me> <umrb9a$1m3mj$1@dont-email.me>
<86cyuksoc0.fsf@linuxsc.com> <umvfc1$1qq5r$1@newsreader4.netcologne.de>
<86y1d8qh1m.fsf@linuxsc.com> <un0rpm$1rkc9$1@newsreader4.netcologne.de>
<b0a868b435300a4eb248824c47c81eb7@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 8 Jan 2024 14:25:30 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="cf6e85af31a5eeecf43aa254c353136f";
logging-data="1654974"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19oNESn4r8pwm4FrXJsbVUxT3Qv5fCIlFw="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.0
Cancel-Lock: sha1:FeBb/pYLxAsJ7VvmZ8PXCqPjo0s=
In-Reply-To: <b0a868b435300a4eb248824c47c81eb7@news.novabbs.com>

by: Paul A. Clayton - Mon, 8 Jan 2024 14:25 UTC

On 1/2/24 2:10 PM, MitchAlsup wrote:
> Thomas Koenig wrote:
>
>> Compilers often do not do a good job of replacing recursion
>> with iteration (is there a verb for that? Maybe it can be called
>> iterationizing?), so it is something to consider if an algorithm
>> is at all time critical or stack space is problematic.
>
>
> Tail recursion elimination.

I thought compilers generally were good at tail recursion
elimination but did not attempt "flattening" something like
flood-fill (which is not "tail recursion" but could avoid
some of the function call overhead). (I.e., I vaguely
remember over a decade ago noticing that this case was not
optimized — for the samegnome game which has a small board
so recursive depth and call overhead was not really significant.)

I also vaguely recall reading at some point that even
recursion that was "easy" to convert to tail recursion was
not handled by gcc (also many years ago).

Please go away.

devel / comp.arch / Re: AMD CPU funny

devel / comp.arch / Re: AMD CPU funny

Subject	Author
Re: AMD CPU funny	Theo
Re: AMD CPU funny	MitchAlsup
Re: AMD CPU funny	Vir Campestris
Re: AMD CPU funny	Scott Lurndal
Re: AMD CPU funny	Vir Campestris
Re: AMD CPU funny	Michael S
Re: AMD CPU funny	MitchAlsup
Re: AMD CPU funny	Michael S
Re: AMD CPU funny	MitchAlsup
Re: AMD CPU funny	Andy Burns
Re: AMD CPU funny	Chris M. Thomasson
Re: AMD CPU funny	Chris M. Thomasson
Re: AMD CPU funny	MitchAlsup
Re: AMD CPU funny	Vir Campestris
Re: AMD CPU funny	Theo
Re: AMD CPU funny	MitchAlsup
Re: AMD CPU funny	Theo
Re: AMD CPU funny	MitchAlsup
Re: AMD CPU funny	Vir Campestris
Re: AMD CPU funny	Thomas Koenig
Re: AMD CPU funny	David Brown
Re: AMD CPU funny	Vir Campestris
Re: AMD CPU funny	David Brown
Re: AMD CPU funny	Michael S
Re: AMD CPU funny	Vir Campestris
Re: AMD CPU funny	Andy Burns
Re: AMD CPU funny	Terje Mathisen
Re: AMD CPU funny	BGB
Re: AMD CPU funny	MitchAlsup
Re: AMD CPU funny	BGB
Re: AMD CPU funny	MitchAlsup
Re: AMD CPU funny	Pancho
Re: AMD CPU funny	Daniel James
Re: AMD CPU funny	Pancho
Re: AMD CPU funny	Terje Mathisen
Re: AMD CPU funny	Pancho
Re: AMD CPU funny	MitchAlsup
Re: AMD CPU funny	MitchAlsup
Re: AMD CPU funny	Pancho
Re: AMD CPU funny	Daniel James
Re: AMD CPU funny	Tim Rentsch
Re: AMD CPU funny	Terje Mathisen
Re: AMD CPU funny	Tim Rentsch
Re: AMD CPU funny	Pancho
Re: AMD CPU funny	Tim Rentsch
Re: AMD CPU funny	MitchAlsup
Re: AMD CPU funny	Terje Mathisen
Re: AMD CPU funny	Tim Rentsch
Re: AMD CPU funny	Terje Mathisen
Re: AMD CPU funny	MitchAlsup
Re: AMD CPU funny	Tim Rentsch
Re: AMD CPU funny	Terje Mathisen
Re: AMD CPU funny	Tim Rentsch
Re: AMD CPU funny	BGB
Re: AMD CPU funny	Tim Rentsch
Re: AMD CPU funny	Tim Rentsch
Re: AMD CPU funny	Thomas Koenig
Re: AMD CPU funny	MitchAlsup
Re: AMD CPU funny	Terje Mathisen
Re: AMD CPU funny	Tim Rentsch
Re: AMD CPU funny	Thomas Koenig
Re: AMD CPU funny	MitchAlsup
Re: AMD CPU funny	Terje Mathisen
Re: AMD CPU funny	Tim Rentsch
Re: AMD CPU funny	Terje Mathisen
Re: AMD CPU funny	Tim Rentsch
Re: AMD CPU funny	MitchAlsup
Re: AMD CPU funny	Tim Rentsch
Re: AMD CPU funny	Terje Mathisen
Re: AMD CPU funny	MitchAlsup
Re: AMD CPU funny	Terje Mathisen
Re: AMD CPU funny	MitchAlsup
Re: AMD CPU funny	Tim Rentsch
Re: AMD CPU funny	Terje Mathisen
Re: AMD CPU funny	Terje Mathisen
Re: AMD CPU funny	Tim Rentsch
Re: AMD CPU funny	Terje Mathisen
Re: AMD CPU funny	Tim Rentsch
Re: AMD CPU funny	Terje Mathisen
Re: AMD CPU funny	Tim Rentsch
Re: AMD CPU funny	MitchAlsup
Re: AMD CPU funny	MitchAlsup
Re: AMD CPU funny	Tim Rentsch
Re: AMD CPU funny	Tim Rentsch
Re: AMD CPU funny	Tim Rentsch
Re: AMD CPU funny	Terje Mathisen
Re: AMD CPU funny	Tim Rentsch
Re: AMD CPU funny	Terje Mathisen
Re: AMD CPU funny	Thomas Koenig
Re: AMD CPU funny	Tim Rentsch
Re: AMD CPU funny	Terje Mathisen
Re: AMD CPU funny	MitchAlsup
Re: AMD CPU funny	Paul A. Clayton
Re: AMD CPU funny	Anton Ertl
Re: AMD CPU funny	MitchAlsup
Re: AMD CPU funny	George Neuner
Re: AMD CPU funny	Anton Ertl
Re: AMD CPU funny	BGB-Alt
Re: AMD CPU funny	Terje Mathisen
Re: AMD CPU funny	Andy Burns
Re: AMD CPU funny	Andy Burns
Re: AMD CPU funny	MitchAlsup