Rocksolid Light

devel / comp.arch / Re: Solving the Floating-Point Conundrum

Subject -- Author
* Solving the Floating-Point Conundrum -- Quadibloc
+* Re: Solving the Floating-Point Conundrum -- Stephen Fuld
|+* Re: Solving the Floating-Point Conundrum -- Quadibloc
||+- Re: Solving the Floating-Point Conundrum -- John Levine
||`- Re: Solving the Floating-Point Conundrum -- Stephen Fuld
|`* Re: Solving the Floating-Point Conundrum -- mac
| `- Re: Solving the Floating-Point Conundrum -- Thomas Koenig
+* Re: Solving the Floating-Point Conundrum -- MitchAlsup
|+* Re: Solving the Floating-Point Conundrum -- Quadibloc
||+* Re: Solving the Floating-Point Conundrum -- MitchAlsup
|||`* Re: Solving the Floating-Point Conundrum -- Quadibloc
||| `* Re: Solving the Floating-Point Conundrum -- MitchAlsup
|||  `- Re: Solving the Floating-Point Conundrum -- Quadibloc
||`- Re: Solving the Floating-Point Conundrum -- John Dallman
|+- Re: Solving the Floating-Point Conundrum -- Scott Lurndal
|`* Re: Solving the Floating-Point Conundrum -- Quadibloc
| +* Re: Solving the Floating-Point Conundrum -- MitchAlsup
| |`* Re: Solving the Floating-Point Conundrum -- BGB
| | +* Re: Solving the Floating-Point Conundrum -- Scott Lurndal
| | |+* Re: Solving the Floating-Point Conundrum -- Quadibloc
| | ||+* Re: Solving the Floating-Point Conundrum -- MitchAlsup
| | |||`- Re: Solving the Floating-Point Conundrum -- Terje Mathisen
| | ||`* Re: Solving the Floating-Point Conundrum -- BGB
| | || `* Re: Solving the Floating-Point Conundrum -- Stephen Fuld
| | ||  `* Re: Solving the Floating-Point Conundrum -- Scott Lurndal
| | ||   `- Re: Solving the Floating-Point Conundrum -- MitchAlsup
| | |`* Re: Solving the Floating-Point Conundrum -- Thomas Koenig
| | | `* Re: memory speeds, Solving the Floating-Point Conundrum -- John Levine
| | |  +- Re: memory speeds, Solving the Floating-Point Conundrum -- Quadibloc
| | |  +* Re: memory speeds, Solving the Floating-Point Conundrum -- Scott Lurndal
| | |  |+* Re: memory speeds, Solving the Floating-Point Conundrum -- MitchAlsup
| | |  ||+* Re: memory speeds, Solving the Floating-Point Conundrum -- EricP
| | |  |||+* Re: memory speeds, Solving the Floating-Point Conundrum -- Scott Lurndal
| | |  ||||`* Re: memory speeds, Solving the Floating-Point Conundrum -- EricP
| | |  |||| `- Re: memory speeds, Solving the Floating-Point Conundrum -- Scott Lurndal
| | |  |||+- Re: memory speeds, Solving the Floating-Point Conundrum -- Quadibloc
| | |  |||+* Re: memory speeds, Solving the Floating-Point Conundrum -- John Levine
| | |  ||||`* Re: memory speeds, Solving the Floating-Point Conundrum -- EricP
| | |  |||| `- Re: memory speeds, Solving the Floating-Point Conundrum -- MitchAlsup
| | |  |||+- Re: memory speeds, Solving the Floating-Point Conundrum -- MitchAlsup
| | |  |||`- Re: memory speeds, Solving the Floating-Point Conundrum -- MitchAlsup
| | |  ||`* Re: memory speeds, Solving the Floating-Point Conundrum -- Timothy McCaffrey
| | |  || `- Re: memory speeds, Solving the Floating-Point Conundrum -- MitchAlsup
| | |  |`* Re: memory speeds, Solving the Floating-Point Conundrum -- Quadibloc
| | |  | +- Re: memory speeds, Solving the Floating-Point Conundrum -- MitchAlsup
| | |  | `- Re: memory speeds, Solving the Floating-Point Conundrum -- moi
| | |  `* Re: memory speeds, Solving the Floating-Point Conundrum -- Anton Ertl
| | |   +* Re: memory speeds, Solving the Floating-Point Conundrum -- Michael S
| | |   |+* Re: memory speeds, Solving the Floating-Point Conundrum -- John Levine
| | |   ||+- Re: memory speeds, Solving the Floating-Point Conundrum -- Lynn Wheeler
| | |   ||`* Re: memory speeds, Solving the Floating-Point Conundrum -- Anton Ertl
| | |   || +- Re: memory speeds, Solving the Floating-Point Conundrum -- EricP
| | |   || `- Re: memory speeds, Solving the Floating-Point Conundrum -- John Levine
| | |   |`* Re: memory speeds, Solving the Floating-Point Conundrum -- Anton Ertl
| | |   | `- Re: memory speeds, Solving the Floating-Point Conundrum -- Stephen Fuld
| | |   `* Re: memory speeds, Solving the Floating-Point Conundrum -- Thomas Koenig
| | |    `- Re: memory speeds, Solving the Floating-Point Conundrum -- Anton Ertl
| | +* Re: Solving the Floating-Point Conundrum -- Quadibloc
| | |`* Re: Solving the Floating-Point Conundrum -- BGB
| | | `- Re: Solving the Floating-Point Conundrum -- Stephen Fuld
| | +- Re: Solving the Floating-Point Conundrum -- MitchAlsup
| | `- Re: Solving the Floating-Point Conundrum -- MitchAlsup
| +* Re: Solving the Floating-Point Conundrum -- Quadibloc
| |`* Re: Solving the Floating-Point Conundrum -- Quadibloc
| | `* Re: Solving the Floating-Point Conundrum -- BGB
| |  `- Re: Solving the Floating-Point Conundrum -- Scott Lurndal
| `* Re: Solving the Floating-Point Conundrum -- Timothy McCaffrey
|  +- Re: Solving the Floating-Point Conundrum -- Scott Lurndal
|  +- Re: Solving the Floating-Point Conundrum -- Stephen Fuld
|  +* Re: Solving the Floating-Point Conundrum -- Quadibloc
|  |`* Re: Solving the Floating-Point Conundrum -- Quadibloc
|  | +* Re: Solving the Floating-Point Conundrum -- Quadibloc
|  | |`* Re: Solving the Floating-Point Conundrum -- Thomas Koenig
|  | | `* Re: Solving the Floating-Point Conundrum -- Quadibloc
|  | |  `* Re: Solving the Floating-Point Conundrum -- Thomas Koenig
|  | |   `* Re: Solving the Floating-Point Conundrum -- Quadibloc
|  | |    `- Re: Solving the Floating-Point Conundrum -- Thomas Koenig
|  | +* Re: Solving the Floating-Point Conundrum -- MitchAlsup
|  | |+- Re: Solving the Floating-Point Conundrum -- Terje Mathisen
|  | |`* Re: Solving the Floating-Point Conundrum -- Quadibloc
|  | | +* Re: Solving the Floating-Point Conundrum -- Thomas Koenig
|  | | |+* Re: Solving the Floating-Point Conundrum -- John Dallman
|  | | ||+- Re: Solving the Floating-Point Conundrum -- Quadibloc
|  | | ||+* Re: Solving the Floating-Point Conundrum -- Quadibloc
|  | | |||+* Re: Solving the Floating-Point Conundrum -- Michael S
|  | | ||||+* Re: Solving the Floating-Point Conundrum -- MitchAlsup
|  | | |||||`- Re: Solving the Floating-Point Conundrum -- Quadibloc
|  | | ||||`- Re: Solving the Floating-Point Conundrum -- Quadibloc
|  | | |||+* Re: Solving the Floating-Point Conundrum -- MitchAlsup
|  | | ||||`- Re: Solving the Floating-Point Conundrum -- Quadibloc
|  | | |||`* Re: Solving the Floating-Point Conundrum -- Terje Mathisen
|  | | ||| `* Re: Solving the Floating-Point Conundrum -- MitchAlsup
|  | | |||  +* Re: Solving the Floating-Point Conundrum -- robf...@gmail.com
|  | | |||  |+- Re: Solving the Floating-Point Conundrum -- Scott Lurndal
|  | | |||  |+* Re: Solving the Floating-Point Conundrum -- MitchAlsup
|  | | |||  ||`- Re: Solving the Floating-Point Conundrum -- George Neuner
|  | | |||  |+- Re: Solving the Floating-Point Conundrum -- Thomas Koenig
|  | | |||  |`* Re: Solving the Floating-Point Conundrum -- Terje Mathisen
|  | | |||  | `- Re: Solving the Floating-Point Conundrum -- BGB
|  | | |||  `* Re: Solving the Floating-Point Conundrum -- Terje Mathisen
|  | | |||   +* Re: Solving the Floating-Point Conundrum -- comp.arch
|  | | |||   `* Re: Solving the Floating-Point Conundrum -- MitchAlsup
|  | | ||`* Re: Solving the Floating-Point Conundrum -- Quadibloc
|  | | |`* Re: Solving the Floating-Point Conundrum -- John Levine
|  | | `- Re: Solving the Floating-Point Conundrum -- MitchAlsup
|  | +- Re: Solving the Floating-Point Conundrum -- Quadibloc
|  | `* Re: Solving the Floating-Point Conundrum -- Stefan Monnier
|  +* Re: Solving the Floating-Point Conundrum -- BGB
|  `- Re: Solving the Floating-Point Conundrum -- Thomas Koenig
+* Re: Solving the Floating-Point Conundrum -- MitchAlsup
`- Re: Solving the Floating-Point Conundrum -- Quadibloc

Re: Solving the Floating-Point Conundrum

<memo.20230925074837.16292U@jgd.cix.co.uk>

https://news.novabbs.org/devel/article-flat.php?id=34299&group=comp.arch#34299

From: jgd@cix.co.uk (John Dallman)
Newsgroups: comp.arch
Subject: Re: Solving the Floating-Point Conundrum
Date: Mon, 25 Sep 2023 07:48 +0100 (BST)
Organization: A noiseless patient Spider
Lines: 24
Message-ID: <memo.20230925074837.16292U@jgd.cix.co.uk>
References: <fbed57b4-1553-4b63-b39e-c130754b3aa8n@googlegroups.com>
Reply-To: jgd@cix.co.uk
 by: John Dallman - Mon, 25 Sep 2023 06:48 UTC

In article <fbed57b4-1553-4b63-b39e-c130754b3aa8n@googlegroups.com>,
jsavard@ecn.ab.ca (Quadibloc) wrote:

> hardware support for packed decimal
> hardware support for IBM System/360 hexadecimal floating point
>
> because people do run Hercules on their computers and so on.

I read the Hercules mailing list. Nobody on there uses it for serious
work. It seems to be mainly used to evoke memories of youth. Running it
on low-powered hardware, such as Raspberry Pi, attracts more notice than
running it on something powerful. Hercules is written in portable C,
because portability is considered more important than performance.

> I have now added, at the bottom of the page, a scheme, involving
> having dual-channel memory where each channel is 192 bits wide,
> that permits the operating system to allocate blocks of 384-bit
> wide memory, 288-bit wide memory, 240-bit wide memory, and 256-bit
> wide memory.

That's an interesting new way to have your system run short of the right
kind of memory.

John

Re: Solving the Floating-Point Conundrum

<a2b939de-b709-498f-ac87-52633801bd02n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=34300&group=comp.arch#34300

Newsgroups: comp.arch
Date: Mon, 25 Sep 2023 03:40:44 -0700 (PDT)
In-Reply-To: <fbed57b4-1553-4b63-b39e-c130754b3aa8n@googlegroups.com>
References: <ue788u$4u5l$1@newsreader4.netcologne.de> <memo.20230917185814.16292G@jgd.cix.co.uk>
<fbed57b4-1553-4b63-b39e-c130754b3aa8n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <a2b939de-b709-498f-ac87-52633801bd02n@googlegroups.com>
Subject: Re: Solving the Floating-Point Conundrum
From: jsavard@ecn.ab.ca (Quadibloc)
Injection-Date: Mon, 25 Sep 2023 10:40:45 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 13
 by: Quadibloc - Mon, 25 Sep 2023 10:40 UTC

On Sunday, September 24, 2023 at 11:59:41 AM UTC-6, Quadibloc wrote:

> On the page
>
> http://www.quadibloc.com/arch/per14.htm

Upon review of the contents of the two pages involved, I have now
moved the new material from that page to the bottom of this
page instead:

http://www.quadibloc.com/arch/per04.htm

John Savard

Re: Solving the Floating-Point Conundrum

<8ThQM.146454$bmw6.26202@fx10.iad>

https://news.novabbs.org/devel/article-flat.php?id=34301&group=comp.arch#34301

X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Solving the Floating-Point Conundrum
Newsgroups: comp.arch
References: <fbed57b4-1553-4b63-b39e-c130754b3aa8n@googlegroups.com> <memo.20230925074837.16292U@jgd.cix.co.uk>
Lines: 34
Message-ID: <8ThQM.146454$bmw6.26202@fx10.iad>
Organization: UsenetServer - www.usenetserver.com
Date: Mon, 25 Sep 2023 15:41:56 GMT
 by: Scott Lurndal - Mon, 25 Sep 2023 15:41 UTC

jgd@cix.co.uk (John Dallman) writes:
>In article <fbed57b4-1553-4b63-b39e-c130754b3aa8n@googlegroups.com>,
>jsavard@ecn.ab.ca (Quadibloc) wrote:
>
>> hardware support for packed decimal
>> hardware support for IBM System/360 hexadecimal floating point
>>
>> because people do run Hercules on their computers and so on.
>
>I read the Hercules mailing list. Nobody on there uses it for serious
>work. It seems to be mainly used to evoke memories of youth. Running it
>on low-powered hardware, such as Raspberry Pi, attracts more notice than
>running it on something powerful. Hercules is written in portable C,
>because portability is considered more important than performance.
>
>> I have now added, at the bottom of the page, a scheme, involving
>> having dual-channel memory where each channel is 192 bits wide,
>> that permits the operating system to allocate blocks of 384-bit
>> wide memory, 288-bit wide memory, 240-bit wide memory, and 256-bit
>> wide memory.
>
>That's an interesting new way to have your system run short of the right
>kind of memory.

Indeed. It's not the path from memory to the core complex that is
currently most interesting (although 256-bit wide (and higher) mesh or crossbars
aren't uncommon), but rather the data path widths from I/O
subsystems. 512-bit wide paths from network controllers and on-board
non-coherent (or coherent, see CXL) coprocessors have become
common. Supporting 80gbit/sec of network traffic into memory
or the networking subsystem isn't trivial.

The memory bandwidth grows by adding controllers and striping across them
for the most part.
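
As a rough back-of-envelope illustration of why that isn't trivial (my
sketch, not Scott's; the DDR5-4800 channel figure is just an assumed
example):

  /* How much of one DDR channel a sustained network stream consumes.
     80 Gbit/s and DDR5-4800 are assumed, illustrative numbers. */
  #include <stdio.h>

  int main(void)
  {
      double nic_gb_per_s = 80.0 / 8.0;          /* 80 Gbit/s -> 10 GB/s        */
      double ddr_gb_per_s = 4800e6 * 8 / 1e9;    /* 4800 MT/s * 8 B = 38.4 GB/s */

      printf("NIC stream : %.1f GB/s\n", nic_gb_per_s);
      printf("DDR channel: %.1f GB/s peak\n", ddr_gb_per_s);
      printf("Fraction   : %.0f%% of one channel (before counting the\n"
             "             write-then-read doubling once the CPU touches it)\n",
             100.0 * nic_gb_per_s / ddr_gb_per_s);
      return 0;
  }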

Re: Solving the Floating-Point Conundrum

<jwvzg1acja6.fsf-monnier+comp.arch@gnu.org>

https://news.novabbs.org/devel/article-flat.php?id=34303&group=comp.arch#34303

From: monnier@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: Solving the Floating-Point Conundrum
Date: Mon, 25 Sep 2023 12:30:11 -0400
Organization: A noiseless patient Spider
Lines: 11
Message-ID: <jwvzg1acja6.fsf-monnier+comp.arch@gnu.org>
References: <57c5e077-ac71-486c-8afa-edd6802cf6b1n@googlegroups.com>
<a0dd4fb4-d708-48ae-9764-3ce5e24aec0cn@googlegroups.com>
<5fa92a78-d27c-4dff-a3dc-35ee7b43cbfan@googlegroups.com>
<c9131381-2e9b-4008-bc43-d4df4d4d8ab4n@googlegroups.com>
<edb0d2c4-1689-44b4-ae81-5ab1ef234f8en@googlegroups.com>
<43901a10-4859-43d7-b500-70030047c8b2n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain
User-Agent: Gnus/5.13 (Gnus v5.13)
 by: Stefan Monnier - Mon, 25 Sep 2023 16:30 UTC

I think intermediate-sized FPs are unlikely to be worth the effort other
than in very niche places where the few percent gained can be justified
by a very large volume.

In memory management libraries (malloc, GCs, ...) we do try and minimize
the amount of memory that's "wasted" of course, but we generally
consider that having an upper bound of ~100% overhead is acceptable
(i.e. using twice as much memory as actually needed in the worst case).

Stefan
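
As a concrete picture of that ~100% upper bound, here is a minimal sketch
of a power-of-two size-class allocator's rounding (the size-class scheme
is an assumed example, not a claim about any particular malloc or GC):

  #include <stdio.h>

  /* Round a request up to the next power-of-two size class. */
  static size_t round_up_pow2(size_t n)
  {
      size_t c = 1;
      while (c < n)
          c <<= 1;
      return c;
  }

  int main(void)
  {
      size_t req = 1025;                   /* just over 1 KiB           */
      size_t got = round_up_pow2(req);     /* 2048: nearly 2x the ask   */
      printf("requested %zu, size class %zu\n", req, got);
      return 0;
  }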

Re: Solving the Floating-Point Conundrum

<uesdbt$20rq4$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=34304&group=comp.arch#34304

From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Solving the Floating-Point Conundrum
Date: Mon, 25 Sep 2023 11:43:38 -0500
Organization: A noiseless patient Spider
Lines: 81
Message-ID: <uesdbt$20rq4$1@dont-email.me>
References: <fbed57b4-1553-4b63-b39e-c130754b3aa8n@googlegroups.com>
<memo.20230925074837.16292U@jgd.cix.co.uk> <8ThQM.146454$bmw6.26202@fx10.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 25 Sep 2023 16:43:41 -0000 (UTC)
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.15.1
Content-Language: en-US
In-Reply-To: <8ThQM.146454$bmw6.26202@fx10.iad>
 by: BGB - Mon, 25 Sep 2023 16:43 UTC

On 9/25/2023 10:41 AM, Scott Lurndal wrote:
> jgd@cix.co.uk (John Dallman) writes:
>> In article <fbed57b4-1553-4b63-b39e-c130754b3aa8n@googlegroups.com>,
>> jsavard@ecn.ab.ca (Quadibloc) wrote:
>>
>>> hardware support for packed decimal
>>> hardware support for IBM System/360 hexadecimal floating point
>>>
>>> because people do run Hercules on their computers and so on.
>>
>> I read the Hercules mailing list. Nobody on there uses it for serious
>> work. It seems to be mainly used to evoke memories of youth. Running it
>> on low-powered hardware, such as Raspberry Pi, attracts more notice than
>> running it on something powerful. Hercules is written in portable C,
>> because portability is considered more important than performance.
>>
>>> I have now added, at the bottom of the page, a scheme, involving
>>> having dual-channel memory where each channel is 192 bits wide,
>>> that permits the operating system to allocate blocks of 384-bit
>>> wide memory, 288-bit wide memory, 240-bit wide memory, and 256-bit
>>> wide memory.
>>
>> That's an interesting new way to have your system run short of the right
>> kind of memory.
>
> Indeed. It's not the path from memory to the core complex that is
> currently most interesting (although 256-bit wide (and higher) mesh or crossbars
> aren't uncommon), but rather the data path widths from I/O
> subsystems. 512-bit wide paths from network controllers and on-board
> non-coherent (or coherent, see CXL) coprocessors has become
> common. Supporting 80gbit/sec of network traffic into memory
> or the networking subsystem isn't trivial.
>
> The memory bandwidth grows by adding controllers and striping across them
> for the most part.

?...

AFAIK, typical DIMMs have a 64-bit wide interface, and typical MOBOs
have 4 DIMM slots with DIMMs being filled in pairs.

This would seemingly imply that RAM would be mostly limited to a 128-bit
datapath (or 64-bit in unganged mode).

Similarly, PCIe slots are effectively multiple serial lanes running in
parallel, etc...

Typical onboard peripherals are connected either via PCIe lanes or an
onboard LPC bus.

Similarly, a typical motherboard only has a single CPU socket.

....

Outside of the CPU itself, it is unclear where any of these wide
interconnects would be, or where they would be going.

Similarly, most things much smaller than a PC will have 16-bit or
32-bit RAM interfaces (with typically 1 or 2 RAM chips soldered onto the
motherboard somewhere, often also with an eMMC Flash chip and other things).

Granted, on the other hand, I have some ~18-year-old Xeon based rack
servers, which have a pair of CPUs surrounded by a small island of RAM
modules, and comparatively still rather impressive multi-core "memcpy()"
performance.

Say, while each core individually is still limited to a few GB/sec, one
can have all of the cores running memcpy at the same time without any
significant drop (vs, say, on a PC where once the total exceeds around 8
to 10GB/sec or so, per-thread performance drops off).

Well, and both still significantly beat out the "memcpy()" performance
on a 20 year old laptop (as does a RasPi...).

....
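
For what it's worth, a minimal sketch of the kind of memcpy() bandwidth
test being described (buffer size and repeat count are arbitrary choices;
run one instance per core to see the aggregate behaviour):

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <time.h>

  int main(void)
  {
      size_t size = 256u << 20;            /* 256 MiB, well past L3      */
      int    reps = 16;
      char  *src  = malloc(size);
      char  *dst  = malloc(size);
      if (!src || !dst) return 1;
      memset(src, 1, size);                /* touch the pages up front   */
      memset(dst, 0, size);

      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (int i = 0; i < reps; i++)
          memcpy(dst, src, size);
      clock_gettime(CLOCK_MONOTONIC, &t1);

      double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
      printf("%.2f GB/s\n", (double)size * reps / 1e9 / secs);
      free(src);
      free(dst);
      return 0;
  }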

Re: Solving the Floating-Point Conundrum

<56jQM.245879$2ph4.169306@fx14.iad>

https://news.novabbs.org/devel/article-flat.php?id=34305&group=comp.arch#34305

X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Solving the Floating-Point Conundrum
Newsgroups: comp.arch
References: <fbed57b4-1553-4b63-b39e-c130754b3aa8n@googlegroups.com> <memo.20230925074837.16292U@jgd.cix.co.uk> <8ThQM.146454$bmw6.26202@fx10.iad> <uesdbt$20rq4$1@dont-email.me>
Lines: 74
Message-ID: <56jQM.245879$2ph4.169306@fx14.iad>
Organization: UsenetServer - www.usenetserver.com
Date: Mon, 25 Sep 2023 17:06:09 GMT
 by: Scott Lurndal - Mon, 25 Sep 2023 17:06 UTC

BGB <cr88192@gmail.com> writes:
>On 9/25/2023 10:41 AM, Scott Lurndal wrote:
>> jgd@cix.co.uk (John Dallman) writes:

>>> That's an interesting new way to have your system run short of the right
>>> kind of memory.
>>
>> Indeed. It's not the path from memory to the core complex that is
>> currently most interesting (although 256-bit wide (and higher) mesh or crossbars
>> aren't uncommon), but rather the data path widths from I/O
>> subsystems. 512-bit wide paths from network controllers and on-board
>> non-coherent (or coherent, see CXL) coprocessors has become
>> common. Supporting 80gbit/sec of network traffic into memory
>> or the networking subsystem isn't trivial.
>>
>> The memory bandwidth grows by adding controllers and striping across them
>> for the most part.
>
>?...
>
>
>AFAIK, typical DIMMs have a 64-bit wide interface, and typical MOBOs
>have 4 DIMM slots with DIMMs being filled in pairs.

"typical" in what context? Home desktops? That's certainly not
typical for the data center or cloud servers. One chip I'm aware of
has 20 dual-channel DDR5 memory controllers (one per every four
cores).

>
>This would seemingly imply that RAM would be mostly limited to a 128-bit
>datapath (or 64-bit in unganged mode).

That's sufficient for the home desktop Windows user, I suppose. It's
certainly not sufficient for cloud servers, enterprise data center servers,
high-end networking appliances, et alia.

>
>Similarly, PCIe slots are effectively multiple serial lanes running in
>parallel, etc...

Gen 6 x16 has a boatload of bandwidth (128 gigabytes per second).
Note that the serial lanes are only downstream (towards the endpoint)
from the root complex. The root complex itself uses a parallel
interconnect to the cache/memory subsystem on the host side.

>
>Typical onboard peripherals being connected either with PCIe lanes or
>via an onboard LPC bus.

Nothing useful (or high-performance) has been connected to the
LPC bus in decades. Even Intel is deprecating it in modern chipsets
with a nod to backward compatibility (i.e. supporting in/out
instructions to a subset of standard ISA peripherals like keyboard
controllers). They're even planning on getting rid of most of it entirely
in the future and booting the processor directly into long mode so
all the legacy compatibility stuff like the original PIC can
be removed.

>
>Similarly, a typical motherboard only has a single CPU socket.

Typical in what context?

>
>
>Outside of the CPU itself, unclear where any of these wide interconnects
>would be, or where they would be going.

Did you read the post you responded to? How do you get 40gbytes/sec
into your memory subsystem from an onboard 400gbit nic? Or 128 gbytes/sec
from a PCIe root complex? Or from a PCIe CXL-cache memory extender?

Re: Solving the Floating-Point Conundrum

<uesfe8$211tp$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=34306&group=comp.arch#34306

From: sfuld@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Solving the Floating-Point Conundrum
Date: Mon, 25 Sep 2023 10:19:04 -0700
Organization: A noiseless patient Spider
Lines: 19
Message-ID: <uesfe8$211tp$1@dont-email.me>
References: <fbed57b4-1553-4b63-b39e-c130754b3aa8n@googlegroups.com>
<memo.20230925074837.16292U@jgd.cix.co.uk> <8ThQM.146454$bmw6.26202@fx10.iad>
<uesdbt$20rq4$1@dont-email.me> <56jQM.245879$2ph4.169306@fx14.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 25 Sep 2023 17:19:04 -0000 (UTC)
User-Agent: Mozilla Thunderbird
Content-Language: en-US
In-Reply-To: <56jQM.245879$2ph4.169306@fx14.iad>
 by: Stephen Fuld - Mon, 25 Sep 2023 17:19 UTC

On 9/25/2023 10:06 AM, Scott Lurndal wrote:
> BGB <cr88192@gmail.com> writes:

>> AFAIK, typical DIMMs have a 64-bit wide interface, and typical MOBOs
>> have 4 DIMM slots with DIMMs being filled in pairs.
>
> "typical" in what context? Home desktops? That's certainly not
> typical for the data center or cloud servers. One chip I'm aware of
> has 20 dual-channel DDR5 memory controllers (one per every four
> cores).

Wow! How many pins on the package? It must be massive. Is there a
latency penalty for getting that many DIMMs "close" to the CPU?

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Solving the Floating-Point Conundrum

<uesg1h$21bpm$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=34307&group=comp.arch#34307

From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Solving the Floating-Point Conundrum
Date: Mon, 25 Sep 2023 12:29:18 -0500
Organization: A noiseless patient Spider
Lines: 124
Message-ID: <uesg1h$21bpm$1@dont-email.me>
References: <fbed57b4-1553-4b63-b39e-c130754b3aa8n@googlegroups.com>
<memo.20230925074837.16292U@jgd.cix.co.uk> <8ThQM.146454$bmw6.26202@fx10.iad>
<uesdbt$20rq4$1@dont-email.me> <56jQM.245879$2ph4.169306@fx14.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 25 Sep 2023 17:29:21 -0000 (UTC)
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.15.1
In-Reply-To: <56jQM.245879$2ph4.169306@fx14.iad>
Content-Language: en-US
 by: BGB - Mon, 25 Sep 2023 17:29 UTC

On 9/25/2023 12:06 PM, Scott Lurndal wrote:
> BGB <cr88192@gmail.com> writes:
>> On 9/25/2023 10:41 AM, Scott Lurndal wrote:
>>> jgd@cix.co.uk (John Dallman) writes:
>
>>>> That's an interesting new way to have your system run short of the right
>>>> kind of memory.
>>>
>>> Indeed. It's not the path from memory to the core complex that is
>>> currently most interesting (although 256-bit wide (and higher) mesh or crossbars
>>> aren't uncommon), but rather the data path widths from I/O
>>> subsystems. 512-bit wide paths from network controllers and on-board
>>> non-coherent (or coherent, see CXL) coprocessors has become
>>> common. Supporting 80gbit/sec of network traffic into memory
>>> or the networking subsystem isn't trivial.
>>>
>>> The memory bandwidth grows by adding controllers and striping across them
>>> for the most part.
>>
>> ?...
>>
>>
>> AFAIK, typical DIMMs have a 64-bit wide interface, and typical MOBOs
>> have 4 DIMM slots with DIMMs being filled in pairs.
>
> "typical" in what context? Home desktops? That's certainly not
> typical for the data center or cloud servers. One chip I'm aware of
> has 20 dual-channel DDR5 memory controllers (one per every four
> cores).
>

Mostly desktop PCs.

I don't have any data center or cloud servers, nor do I personally know
anyone around here who does...

So it seems reasonable to assume that most people don't have them, and
thus that they are not typical.

>>
>> This would seemingly imply that RAM would be mostly limited to a 128-bit
>> datapath (or 64-bit in unganged mode).
>
> That's sufficient for the home desktop windows user, I suppose. It's
> certainly not sufficient for cloud servers, enterprise data center servers,
> high-end networking appliances, et alia.
>

It seemed like the context was "common", which would mostly imply "stuff
people actually have", not necessarily high-end data-center servers,
which pretty much no one has apart from the companies running the data
centers.

But then if, say, a datacenter has 1 server per 10k users (each user with
a desktop PC or similar), this would still mean only 0.01% of the
total computers were servers.

And, your typical desktop PC is only going to have 1 CPU and 2 or 4 RAM
sticks.

>
>>
>> Similarly, PCIe slots are effectively multiple serial lanes running in
>> parallel, etc...
>
> Gen 6 x16 has a boatload of bandwidth (128 gigabytes per second).
> Note that the serial lanes are only downstream (towards the endpoint)
> from the root complex. The root complex itself uses a parallel
> interconnect to the cache/memory subsystem on the host side.
>

OK.

>>
>> Typical onboard peripherals being connected either with PCIe lanes or
>> via an onboard LPC bus.
>
> There is nothing useful (or high performance) connected to
> LPC bus in decades. Even intel is deprecating it in modern chipsets
> with a nod for backward compatability (i.e. supporting in/out
> instructions to a subset of standard ISA peripherals like keyboard
> controllers). They're even planning on getting rid of most of it entirely
> in the future and boot the processor directly into long mode so
> all the legacy compatibility stuff like the original PIC can
> be removed.
>

There are typically still an RS-232 port, PS/2 keyboard and mouse ports,
and similar...

But, yeah.

>>
>> Similarly, a typical motherboard only has a single CPU socket.
>
> Typical in what context?
>

Standard home desktop PC.

That is what most people are using, at least, excluding laptops,
tablets, and cell-phones.

>>
>>
>> Outside of the CPU itself, unclear where any of these wide interconnects
>> would be, or where they would be going.
>
> Did you read the post you responded to? How do you get 40gbytes/sec
> into your memory subsystem from an onboard 400gbit nic? Or 128 gbytes/sec
> from a PCIe root complex? Or from a PCIe CXL-cache memory extender?
>

Generally you don't...

I think, if one has a 1GbE Ethernet port, they could maybe get 120 MB/s
or similar if it is going "full tilt".
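
(For reference, a small sketch of where that "maybe 120 MB/s" figure comes
from, assuming the standard 1500-byte MTU and the usual Ethernet framing
overheads; this is my arithmetic, not a measurement:)

  #include <stdio.h>

  int main(void)
  {
      double raw_bytes = 1e9 / 8;                   /* 1 Gbit/s = 125 MB/s           */
      double on_wire   = 1500 + 14 + 4 + 8 + 12;    /* frame hdr, FCS, preamble, IFG */
      double tcp_pay   = 1500 - 20 - 20;            /* minus IP and TCP headers      */
      printf("%.1f MB/s of TCP payload\n",
             raw_bytes * tcp_pay / on_wire / 1e6);  /* ~118.7 MB/s                   */
      return 0;
  }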

Re: Solving the Floating-Point Conundrum

<uesgjo$21bpm$2@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=34308&group=comp.arch#34308

From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Solving the Floating-Point Conundrum
Date: Mon, 25 Sep 2023 12:39:04 -0500
Organization: A noiseless patient Spider
Lines: 40
Message-ID: <uesgjo$21bpm$2@dont-email.me>
References: <57c5e077-ac71-486c-8afa-edd6802cf6b1n@googlegroups.com>
<a0dd4fb4-d708-48ae-9764-3ce5e24aec0cn@googlegroups.com>
<5fa92a78-d27c-4dff-a3dc-35ee7b43cbfan@googlegroups.com>
<c9131381-2e9b-4008-bc43-d4df4d4d8ab4n@googlegroups.com>
<edb0d2c4-1689-44b4-ae81-5ab1ef234f8en@googlegroups.com>
<43901a10-4859-43d7-b500-70030047c8b2n@googlegroups.com>
<jwvzg1acja6.fsf-monnier+comp.arch@gnu.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 25 Sep 2023 17:39:05 -0000 (UTC)
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.15.1
Content-Language: en-US
In-Reply-To: <jwvzg1acja6.fsf-monnier+comp.arch@gnu.org>
 by: BGB - Mon, 25 Sep 2023 17:39 UTC

On 9/25/2023 11:30 AM, Stefan Monnier wrote:
> I think intermediate-sized FPs are unlikely to be worth the effort other
> than in very niche places where the few percents gained can be justified
> by a very large volume.
>
> In memory management libraries (malloc, GCs, ...) we do try and minimize
> the amount of memory that's "wasted" of course, but we generally
> consider that having an upper bound of ~100% overhead is acceptable
> (i.e. using twice as much memory as actually needed in the worst case).
>

I am now evaluating the possible use of a 48-bit floating-point format,
but this is (merely) in terms of memory storage (in registers, it will
still use Binary64).

It is still to be determined whether being able to save 2 bytes in struct
or array storage (at the cost of a few extra clock-cycles per load/store
operation, and a non-standard C type) will be a worthwhile tradeoff.

Would also need to determine the relative cost tradeoff between having
dedicated 48-bit load/store ops, and "faking it" via multi-instruction
sequences.

Similarly, while I was at it, I went and added a few "proper" byte-swap
instructions (to potentially help with the efficiency of endian
conversion, at least slightly...).

As for memory management, I generally aim for around a 25% space
overhead per small/medium "malloc()".

Sometimes, this does mean needing to switch between strategies, say:
Cell-oriented allocator for small objects (say, under 256 or 384B);
Memory-block-list allocator for medium objects;
Page-based allocation for larger objects.

....
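
A hypothetical software model of that kind of 48-bit storage format (keep
the top 48 bits of a binary64, zero-fill the low 16 mantissa bits on load;
simple truncation rather than rounding, little-endian host assumed; this
is only an illustration of the storage idea, not BGB's actual instruction
encoding):

  #include <stdint.h>
  #include <string.h>

  /* Store the most-significant 6 bytes of x at p. */
  static void f48_store(unsigned char *p, double x)
  {
      uint64_t bits;
      memcpy(&bits, &x, 8);
      bits >>= 16;              /* drop the low 16 mantissa bits (truncate) */
      memcpy(p, &bits, 6);      /* 6-byte element                           */
  }

  /* Load 6 bytes from p and widen back to a binary64. */
  static double f48_load(const unsigned char *p)
  {
      uint64_t bits = 0;
      double x;
      memcpy(&bits, p, 6);
      bits <<= 16;              /* low mantissa bits come back as zero      */
      memcpy(&x, &bits, 8);
      return x;
  }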

Re: Solving the Floating-Point Conundrum

<0uh3hitcusunspog484ju79deoq6tbggt3@4ax.com>

https://news.novabbs.org/devel/article-flat.php?id=34309&group=comp.arch#34309

From: gneuner2@comcast.net (George Neuner)
Newsgroups: comp.arch
Subject: Re: Solving the Floating-Point Conundrum
Date: Mon, 25 Sep 2023 13:55:30 -0400
Organization: A noiseless patient Spider
Lines: 26
Message-ID: <0uh3hitcusunspog484ju79deoq6tbggt3@4ax.com>
References: <8a5563da-3be8-40f7-bfb9-39eb5e889c8an@googlegroups.com> <f097448b-e691-424b-b121-eab931c61d87n@googlegroups.com> <ue788u$4u5l$1@newsreader4.netcologne.de> <ue7nkh$ne0$1@gal.iecc.com> <9f5be6c2-afb2-452b-bd54-314fa5bed589n@googlegroups.com> <uefkrv$ag9f$1@newsreader4.netcologne.de> <deeae38d-da7a-4495-9558-f73a9f615f02n@googlegroups.com> <9141df99-f363-4d64-9ce3-3d3aaf0f5f40n@googlegroups.com> <fc7efd6b-3efc-46c0-9493-6ecd351f9636n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
User-Agent: ForteAgent/8.00.32.1272
 by: George Neuner - Mon, 25 Sep 2023 17:55 UTC

On Fri, 22 Sep 2023 08:10:27 -0700 (PDT), Quadibloc
<jsavard@ecn.ab.ca> wrote:

> ..., if you've got enough transistors to put multiple CPUs on one die,
>and you believe parallelism is far, far inferior to speeding up the clock or
>reducing latency by other means, then you might actually think that throwing
>transistors at reducing latency in this fashion is rational.
>
>Essentially, it is true that I don't buy the idea that programmers are just
>lazy, and if they did things right they could exploit parallelism effectively.
>I will grant that they can probably do much better in many cases, but I
>also feel there are fundamental limits.

Compute resources always are limited, and most programmers are far
better at figuring out what CAN be done in parallel than what SHOULD
be done in parallel given available resources.

Then there is Amdahl's law.

There always will be a need for faster single thread serial execution.

>John Savard

YMMV,
George
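
For reference, Amdahl's law in a minimal sketch (the 95%-parallel figure
is just an illustrative choice):

  #include <stdio.h>

  /* Speedup with parallel fraction p on n cores: 1 / ((1 - p) + p / n). */
  static double amdahl(double p, double n)
  {
      return 1.0 / ((1.0 - p) + p / n);
  }

  int main(void)
  {
      printf("p=0.95, n=8:   %.2fx\n", amdahl(0.95, 8));     /* ~5.9x    */
      printf("p=0.95, n=1e6: %.2fx\n", amdahl(0.95, 1e6));   /* ~20x cap */
      return 0;
  }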

Re: Solving the Floating-Point Conundrum

<jwvil7yces1.fsf-monnier+comp.arch@gnu.org>

https://news.novabbs.org/devel/article-flat.php?id=34310&group=comp.arch#34310

From: monnier@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: Solving the Floating-Point Conundrum
Date: Mon, 25 Sep 2023 14:11:54 -0400
Organization: A noiseless patient Spider
Lines: 18
Message-ID: <jwvil7yces1.fsf-monnier+comp.arch@gnu.org>
References: <57c5e077-ac71-486c-8afa-edd6802cf6b1n@googlegroups.com>
<a0dd4fb4-d708-48ae-9764-3ce5e24aec0cn@googlegroups.com>
<5fa92a78-d27c-4dff-a3dc-35ee7b43cbfan@googlegroups.com>
<c9131381-2e9b-4008-bc43-d4df4d4d8ab4n@googlegroups.com>
<edb0d2c4-1689-44b4-ae81-5ab1ef234f8en@googlegroups.com>
<43901a10-4859-43d7-b500-70030047c8b2n@googlegroups.com>
<jwvzg1acja6.fsf-monnier+comp.arch@gnu.org>
<uesgjo$21bpm$2@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain
User-Agent: Gnus/5.13 (Gnus v5.13)
 by: Stefan Monnier - Mon, 25 Sep 2023 18:11 UTC

> I am now evaluating the possible use of a 48-bit floating-point format, but
> this is (merely) in terms of memory storage (in registers, it will still use
> Binary64).

I suspect this is indeed the only sane way to go about it.
Also, I suspect that such 48bit floats would only be worthwhile when you
have some large vectors/matrices and care about the 33% bandwidth
overhead of using 64bit rather than 48bit. So maybe the focus should be
on "load 3 chunks, then spread/turn them into 4", since the limiting
factor would presumably be the memory bandwidth.

E.g. load 3 chunks (C1, C2, and C3) of 256bits each using standard SIMD
load, and then add an instruction to turn C1+C2 into two 256bit vectors
of 4x64bit floats, and another to do the same with C2+C3 (basically, the
same instruction except it uses the other half of the bits of C2).

Stefan

Re: Solving the Floating-Point Conundrum

<4hkQM.297025$ZXz4.183665@fx18.iad>

https://news.novabbs.org/devel/article-flat.php?id=34311&group=comp.arch#34311

X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Solving the Floating-Point Conundrum
Newsgroups: comp.arch
References: <fbed57b4-1553-4b63-b39e-c130754b3aa8n@googlegroups.com> <memo.20230925074837.16292U@jgd.cix.co.uk> <8ThQM.146454$bmw6.26202@fx10.iad> <uesdbt$20rq4$1@dont-email.me> <56jQM.245879$2ph4.169306@fx14.iad> <uesg1h$21bpm$1@dont-email.me>
Lines: 32
Message-ID: <4hkQM.297025$ZXz4.183665@fx18.iad>
Organization: UsenetServer - www.usenetserver.com
Date: Mon, 25 Sep 2023 18:26:08 GMT
 by: Scott Lurndal - Mon, 25 Sep 2023 18:26 UTC

BGB <cr88192@gmail.com> writes:
>On 9/25/2023 12:06 PM, Scott Lurndal wrote:
>> BGB <cr88192@gmail.com> writes:
>>> On 9/25/2023 10:41 AM, Scott Lurndal wrote:
>
>
>It seemed like the context was "common", which would mostly imply "stuff
>people actually have", not necessarily high-end data-center servers,
>which pretty much no one has apart from the companies running the data
>centers.
>
>But, then if say, a datacenter has 1 sever per 10k users (each user with
>a desktop PC or similar); this would mean still only mean 0.01% of the
>total computers were servers.

Last year, there were 29 million servers, and 290 million PCs shipped,
so servers are about 10%.

>>
>> Did you read the post you responded to? How do you get 40gbytes/sec
>> into your memory subsystem from an onboard 400gbit nic? Or 128 gbytes/sec
>> from a PCIe root complex? Or from a PCIe CXL-cache memory extender?
>>
>
>Generally you don't...
>
>I think, if one has a 1GbE Ethernet port, they could maybe get 120 MB/s
>or similar if it is going "full tilt".

2.5Gbit Ethernet is making its way into desktop chipsets. It won't be
long before 10Gbit Ethernet is present in high-end PC mainboards.

Re: Solving the Floating-Point Conundrum

<76d818c1-ee00-456b-848f-bb43ec7739fan@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=34312&group=comp.arch#34312

Newsgroups: comp.arch
Date: Mon, 25 Sep 2023 14:30:25 -0700 (PDT)
In-Reply-To: <uesgjo$21bpm$2@dont-email.me>
References: <57c5e077-ac71-486c-8afa-edd6802cf6b1n@googlegroups.com>
<a0dd4fb4-d708-48ae-9764-3ce5e24aec0cn@googlegroups.com> <5fa92a78-d27c-4dff-a3dc-35ee7b43cbfan@googlegroups.com>
<c9131381-2e9b-4008-bc43-d4df4d4d8ab4n@googlegroups.com> <edb0d2c4-1689-44b4-ae81-5ab1ef234f8en@googlegroups.com>
<43901a10-4859-43d7-b500-70030047c8b2n@googlegroups.com> <jwvzg1acja6.fsf-monnier+comp.arch@gnu.org>
<uesgjo$21bpm$2@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <76d818c1-ee00-456b-848f-bb43ec7739fan@googlegroups.com>
Subject: Re: Solving the Floating-Point Conundrum
From: robfi680@gmail.com (robf...@gmail.com)
Injection-Date: Mon, 25 Sep 2023 21:30:25 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: robf...@gmail.com - Mon, 25 Sep 2023 21:30 UTC

On Monday, September 25, 2023 at 1:39:09 PM UTC-4, BGB wrote:
> On 9/25/2023 11:30 AM, Stefan Monnier wrote:
> > I think intermediate-sized FPs are unlikely to be worth the effort other
> > than in very niche places where the few percents gained can be justified
> > by a very large volume.
> >
> > In memory management libraries (malloc, GCs, ...) we do try and minimize
> > the amount of memory that's "wasted" of course, but we generally
> > consider that having an upper bound of ~100% overhead is acceptable
> > (i.e. using twice as much memory as actually needed in the worst case).
> >
> I am now evaluating the possible use of a 48-bit floating-point format,
> but this is (merely) in terms of memory storage (in registers, it will
> still use Binary64).
>
> Still to be determined whether being able to save 2-bytes in struct or
> array storage (at the cost of a few extra clock-cycles per load/store
> operation; and a non-standard C type) will be a worthwhile tradeoff.
>
> Would also need to determine the relative cost tradeoff between having
> dedicated 48-bit load/store ops, and "faking it" via multi-instruction
> sequences.

If your CPU supports unaligned data, you could store 64-bit floats every six bytes,
instead of eight bytes, overlapping previously stored values. But you would need
to be careful about the order in which the stores take place.

With a 48-bit load/store, scaled indexed addressing for arrays would need to
scale by six.
>
>
>
> Similarly, while I was at it, went and added a few "proper" byte-swap
> instructions (to potentially help with the efficiency of endian
> conversion, at least slightly...).
>
> As for memory management, I generally aim for around a 25% space
> overhead per small/medium "malloc()".
>
> Sometimes, this does mean needing to switch between strategies, say:
> Cell-oriented allocator for small objects (say, under 256 or 384B);
> Memory-block-list allocator for medium objects;
> Page-based allocation for larger objects.
>
> ...
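
A minimal little-endian sketch of that overlapping-store idea, reusing the
48-bit "top of a binary64" storage format discussed above (the right-shift
and the narrow copy for the final element are illustrative additions, not
robf's exact scheme; stores must run in increasing index order):

  #include <stdint.h>
  #include <string.h>

  /* Pack n doubles into 6-byte slots at dst (dst must hold 6*n bytes). */
  static void pack48_array(unsigned char *dst, const double *src, size_t n)
  {
      for (size_t i = 0; i < n; i++) {
          uint64_t bits;
          memcpy(&bits, &src[i], 8);
          bits >>= 16;                       /* keep sign/exponent/top mantissa */
          if (i + 1 < n)
              memcpy(dst + 6 * i, &bits, 8); /* full-width store; its top two
                                                (zero) bytes get overwritten by
                                                the next element's store        */
          else
              memcpy(dst + 6 * i, &bits, 6); /* tail: exactly 6 bytes           */
      }
  }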

Re: Solving the Floating-Point Conundrum

<uesutv$242ut$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=34313&group=comp.arch#34313

From: bohannonindustriesllc@gmail.com (BGB-Alt)
Newsgroups: comp.arch
Subject: Re: Solving the Floating-Point Conundrum
Date: Mon, 25 Sep 2023 16:43:26 -0500
Organization: A noiseless patient Spider
Lines: 76
Message-ID: <uesutv$242ut$1@dont-email.me>
References: <57c5e077-ac71-486c-8afa-edd6802cf6b1n@googlegroups.com>
<a0dd4fb4-d708-48ae-9764-3ce5e24aec0cn@googlegroups.com>
<5fa92a78-d27c-4dff-a3dc-35ee7b43cbfan@googlegroups.com>
<c9131381-2e9b-4008-bc43-d4df4d4d8ab4n@googlegroups.com>
<edb0d2c4-1689-44b4-ae81-5ab1ef234f8en@googlegroups.com>
<43901a10-4859-43d7-b500-70030047c8b2n@googlegroups.com>
<jwvzg1acja6.fsf-monnier+comp.arch@gnu.org> <uesgjo$21bpm$2@dont-email.me>
<jwvil7yces1.fsf-monnier+comp.arch@gnu.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 25 Sep 2023 21:43:27 -0000 (UTC)
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.15.1
In-Reply-To: <jwvil7yces1.fsf-monnier+comp.arch@gnu.org>
Content-Language: en-US
 by: BGB-Alt - Mon, 25 Sep 2023 21:43 UTC

On 9/25/2023 1:11 PM, Stefan Monnier wrote:
>> I am now evaluating the possible use of a 48-bit floating-point format, but
>> this is (merely) in terms of memory storage (in registers, it will still use
>> Binary64).
>
> I suspect this is indeed the only sane way to go about it.
> Also, I suspect that such 48bit floats would only be worthwhile when you
> have some large vectors/matrices and care about the 33% bandwidth
> overhead of using 64bit rather than 48bit. So maybe the focus should be
> on "load 3 chunks, then spread turn them into 4" since the limiting
> factor would presumably be the memory bandwidth.
>

Yeah, in my experience, memory bandwidth tends to be one of the major
limiting factors for performance in many algorithms.

This is partly why I had some wonky formats like 3x Float21 vectors (with
64-bit storage), and part of why I do a lot of stuff using Binary16
(where, in this case, both Binary16 and Binary32 have the same latency).

Well, and a lot of my 3D rendering is using RGB555 framebuffers,
16-bit Z-buffer, and texture compression...

As noted in some past examples, even desktop PCs are not immune to this,
and saving some memory via bit-twiddling can often be cheaper than using a
"less memory dense" strategy (that results in a higher number of L1 misses).

Ironically, this seems to run counter to the conventional wisdom of
saving calculations via lookup tables (depending on the algorithm, the
lookup table may only "win" if it is less than 1/4 or 1/8 the size of
the L1 cache).

Many people seem to evaluate efficiency solely in terms of how many
clock-cycles it would take to evaluate a sequence of instructions,
rather than necessarily how much memory is touched by the algorithm or
its probability of resulting in L1 misses and similar.

Granted, OoO CPUs can sort of hide away L1 miss costs to some extent
(they are a little more obvious with a strictly in-order CPU).

> E.g. load 3 chunks (C1, C2, and C3) of 256bits each using standard SIMD
> load, and then add an instruction to turn C1+C2 into two 256bit vectors
> of 4x64bit floats, and another to do the same with C2+C3 (basically, the
> same instruction except it uses the other half of the bits of C2).
>

I don't really have a good way to do 256-bit loads or work with 256-bit
vectors in my core.

But, yeah, this is how it works for things like the 3x Float21 (64-bit)
and 3x Float42 vectors (128-bit), where Float42 vectors would be split
up into three Binary64 values for doing math on them, and then repacked
later.

The Float48 format is basically special case Load/Store ops which pad
the value to 64 bits on Load, and narrow it to 48 bits on Store. These
would be more intended for scalar operations on arrays.

There is a separate penalty for using it in arrays, in that it needs an
extra LEA.W instruction to handle the 48-bit element size (well, along
with the drawback of the Disp5 direct-displacement ops only having a
60-byte range in this case).

But, emulating this without these special instructions would take a
somewhat longer instruction sequence (and ends up needing to stomp R16
and R17 for the pseudo-instructions, ...), so special ops seem
justifiable. I also added the fallback case to BGBCC as well.

>
> Stefan

Re: Solving the Floating-Point Conundrum

<jwvedil530g.fsf-monnier+comp.arch@gnu.org>

https://news.novabbs.org/devel/article-flat.php?id=34314&group=comp.arch#34314

From: monnier@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: Solving the Floating-Point Conundrum
Date: Mon, 25 Sep 2023 18:08:36 -0400
Organization: A noiseless patient Spider
Lines: 29
Message-ID: <jwvedil530g.fsf-monnier+comp.arch@gnu.org>
References: <57c5e077-ac71-486c-8afa-edd6802cf6b1n@googlegroups.com>
<a0dd4fb4-d708-48ae-9764-3ce5e24aec0cn@googlegroups.com>
<5fa92a78-d27c-4dff-a3dc-35ee7b43cbfan@googlegroups.com>
<c9131381-2e9b-4008-bc43-d4df4d4d8ab4n@googlegroups.com>
<edb0d2c4-1689-44b4-ae81-5ab1ef234f8en@googlegroups.com>
<43901a10-4859-43d7-b500-70030047c8b2n@googlegroups.com>
<jwvzg1acja6.fsf-monnier+comp.arch@gnu.org>
<uesgjo$21bpm$2@dont-email.me>
<jwvil7yces1.fsf-monnier+comp.arch@gnu.org>
<uesutv$242ut$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain
User-Agent: Gnus/5.13 (Gnus v5.13)
 by: Stefan Monnier - Mon, 25 Sep 2023 22:08 UTC

>> E.g. load 3 chunks (C1, C2, and C3) of 256bits each using standard SIMD
>> load, and then add an instruction to turn C1+C2 into two 256bit vectors
>> of 4x64bit floats, and another to do the same with C2+C3 (basically, the
>> same instruction except it uses the other half of the bits of C2).
> I don't really have a good way to do 256-bit loads or work with 256-bit
> vectors in my core.

Then do 3x 64bit loads which you then split into 4 64bit floats:

   64bit               64bit               64bit
+---------+ +-------------------------+ +---------+
A0-31 B0-31 A32-47 B32-47 C32-47 D32-47 C0-31 D0-31

It's an admittedly unusual layout, but it lets you load & store in
standard-sized chunks, and hence at full bandwidth. And it lets you
"reshuffle" things using only "2-in" operations: if your pipeline can do
"2-in 2-out" you can do it in two (parallel) instructions, and otherwise
you can do it in 4 (parallel) instructions. Of course, if your pipeline
can accommodate
"3-in 4-out", then you can use a more standard layout.

> The Float48 format is basically special case Load/Store ops which pad the
> value to 64 bits on Load, and narrow it to 48 bits on Store.

So you're wasting the extra L1 bandwidth, since you're using a load which
can fetch 64 bits but only uses 48 of those bits (you admittedly still
gain w.r.t. the bandwidth of higher levels of the memory hierarchy).

Stefan
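
A scalar model of that reshuffle, with the 48-bit values taken to be the
top 48 bits of a binary64 as in the earlier Float48 discussion (the field
order within each 64-bit word is an assumption here; the point is just
that every word feeds exactly two outputs):

  #include <stdint.h>
  #include <string.h>

  static double make_f64(uint64_t hi16, uint64_t lo32)
  {
      uint64_t bits = (hi16 << 48) | (lo32 << 16);  /* low 16 bits end up 0 */
      double x;
      memcpy(&x, &bits, 8);
      return x;
  }

  /* w0 = A0-31|B0-31, w1 = A32-47|B32-47|C32-47|D32-47, w2 = C0-31|D0-31 */
  static void unpack4(uint64_t w0, uint64_t w1, uint64_t w2, double out[4])
  {
      out[0] = make_f64( w1        & 0xFFFF, w0 & 0xFFFFFFFFu); /* A: w0+w1 */
      out[1] = make_f64((w1 >> 16) & 0xFFFF, w0 >> 32);         /* B: w0+w1 */
      out[2] = make_f64((w1 >> 32) & 0xFFFF, w2 & 0xFFFFFFFFu); /* C: w1+w2 */
      out[3] = make_f64( w1 >> 48,           w2 >> 32);         /* D: w1+w2 */
  }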

Re: Solving the Floating-Point Conundrum

<53oQM.190829$Hih7.120587@fx11.iad>

https://news.novabbs.org/devel/article-flat.php?id=34315&group=comp.arch#34315

X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Solving the Floating-Point Conundrum
Newsgroups: comp.arch
References: <fbed57b4-1553-4b63-b39e-c130754b3aa8n@googlegroups.com> <memo.20230925074837.16292U@jgd.cix.co.uk> <8ThQM.146454$bmw6.26202@fx10.iad> <uesdbt$20rq4$1@dont-email.me> <56jQM.245879$2ph4.169306@fx14.iad> <uesfe8$211tp$1@dont-email.me>
Lines: 17
Message-ID: <53oQM.190829$Hih7.120587@fx11.iad>
Organization: UsenetServer - www.usenetserver.com
Date: Mon, 25 Sep 2023 22:44:17 GMT
 by: Scott Lurndal - Mon, 25 Sep 2023 22:44 UTC

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>On 9/25/2023 10:06 AM, Scott Lurndal wrote:
>> BGB <cr88192@gmail.com> writes:
>
>>> AFAIK, typical DIMMs have a 64-bit wide interface, and typical MOBOs
>>> have 4 DIMM slots with DIMMs being filled in pairs.
>>
>> "typical" in what context? Home desktops? That's certainly not
>> typical for the data center or cloud servers. One chip I'm aware of
>> has 20 dual-channel DDR5 memory controllers (one per every four
>> cores).
>
>Wow! How many pins on the package? It must be massive. is there a
>latency penalty for getting that many DIMMs "close" to the CPU?

Further deponent sayeth not.

Re: Solving the Floating-Point Conundrum

<c6f7d78c-b586-413a-a551-9987a486a39bn@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=34316&group=comp.arch#34316

Newsgroups: comp.arch
Date: Mon, 25 Sep 2023 17:06:58 -0700 (PDT)
In-Reply-To: <uesdbt$20rq4$1@dont-email.me>
References: <fbed57b4-1553-4b63-b39e-c130754b3aa8n@googlegroups.com>
<memo.20230925074837.16292U@jgd.cix.co.uk> <8ThQM.146454$bmw6.26202@fx10.iad> <uesdbt$20rq4$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <c6f7d78c-b586-413a-a551-9987a486a39bn@googlegroups.com>
Subject: Re: Solving the Floating-Point Conundrum
From: jim.brakefield@ieee.org (JimBrakefield)
Injection-Date: Tue, 26 Sep 2023 00:06:59 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 104
 by: JimBrakefield - Tue, 26 Sep 2023 00:06 UTC

On Monday, September 25, 2023 at 11:43:45 AM UTC-5, BGB wrote:
> On 9/25/2023 10:41 AM, Scott Lurndal wrote:
> > j...@cix.co.uk (John Dallman) writes:
> >> In article <fbed57b4-1553-4b63...@googlegroups.com>,
> >> jsa...@ecn.ab.ca (Quadibloc) wrote:
> >>
> >>> hardware support for packed decimal
> >>> hardware support for IBM System/360 hexadecimal floating point
> >>>
> >>> because people do run Hercules on their computers and so on.
> >>
> >> I read the Hercules mailing list. Nobody on there uses it for serious
> >> work. It seems to be mainly used to evoke memories of youth. Running it
> >> on low-powered hardware, such as Raspberry Pi, attracts more notice than
> >> running it on something powerful. Hercules is written in portable C,
> >> because portability is considered more important than performance.
> >>
> >>> I have now added, at the bottom of the page, a scheme, involving
> >>> having dual-channel memory where each channel is 192 bits wide,
> >>> that permits the operating system to allocate blocks of 384-bit
> >>> wide memory, 288-bit wide memory, 240-bit wide memory, and 256-bit
> >>> wide memory.
> >>
> >> That's an interesting new way to have your system run short of the right
> >> kind of memory.
> >
> > Indeed. It's not the path from memory to the core complex that is
> > currently most interesting (although 256-bit wide (and higher) mesh or crossbars
> > aren't uncommon), but rather the data path widths from I/O
> > subsystems. 512-bit wide paths from network controllers and on-board
> > non-coherent (or coherent, see CXL) coprocessors has become
> > common. Supporting 80gbit/sec of network traffic into memory
> > or the networking subsystem isn't trivial.
> >
> > The memory bandwidth grows by adding controllers and striping across them
> > for the most part.
> ?...
>
>
> AFAIK, typical DIMMs have a 64-bit wide interface, and typical MOBOs
> have 4 DIMM slots with DIMMs being filled in pairs.
>
> This would seemingly imply that RAM would be mostly limited to a 128-bit
> datapath (or 64-bit in unganged mode).
>
> Similarly, PCIe slots are effectively multiple serial lanes running in
> parallel, etc...
>
> Typical onboard peripherals being connected either with PCIe lanes or
> via an onboard LPC bus.
>
> Similarly, a typical motherboard only has a single CPU socket.
>
> ...
>
>
> Outside of the CPU itself, unclear where any of these wide interconnects
> would be, or where they would be going.
>
>
> Similarly, most things much smaller than a PC, will have 16-bit or
> 32-bit RAM interfaces (with typically 1 or 2 RAM chips soldered onto the
> motherboard somewhere, often also with an eMMC Flash chip and other things).
>
>
> Granted, on the other hand, I have some ~ 18 year old Xeon based rack
> servers, which have a pair of CPUs surrounded by a small island of RAM
> modules, and comparably still rather impressive multi-core "memcpy()"
> performance.
>
> Say, while each core individually is still limited to a few GB/sec, one
> can have all of the cores running memcpy at the same time without any
> significant drop (vs, say, on a PC where once the total exceeds around 8
> to 10GB/sec or so, per-thread performance drops off).
>
> Well, and both still significantly beat out the "memcpy()" performance
> on a 20 year old laptop (as does a RasPi...).
>
> ...

Ugh,
|> This would seemingly imply that RAM would be mostly limited to a 128-bit
|> datapath (or 64-bit in unganged mode).

https://en.wikipedia.org/wiki/CAS_latency
Has a table of DIMM performance for various DIMM generations and various burst lengths.
The same applies, by extension, to directly soldered DRAM chips, each of which can have up to 16 data pins.
(e.g., two DRAM chips with 32 data pins in total and a burst length of eight = a 256-bit cache line)
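 
To make the arithmetic concrete, a small back-of-envelope sketch in C; the
configurations are just the examples mentioned in this thread, not tied to
any particular product:

#include <stdio.h>

int main(void)
{
    /* bits moved per burst = data pins * burst length */
    struct { const char *cfg; int pins; int burst; } ex[] = {
        { "one 64-bit DIMM, BL8",     64, 8 },   /* 512 bits = 64-byte line */
        { "two x16 DRAM chips, BL8",  32, 8 },   /* 256 bits = 32-byte line */
        { "one x16 DDR3 chip, BL8",   16, 8 },   /* 128 bits = 16 bytes     */
    };

    for (int i = 0; i < (int)(sizeof ex / sizeof ex[0]); i++)
        printf("%-26s -> %3d bits (%2d bytes) per burst\n",
               ex[i].cfg, ex[i].pins * ex[i].burst,
               ex[i].pins * ex[i].burst / 8);
    return 0;
}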

Re: Solving the Floating-Point Conundrum

<uetsi7$2cf8i$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=34317&group=comp.arch#34317

Path: i2pn2.org!i2pn.org!news.hispagatos.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: terje.mathisen@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Solving the Floating-Point Conundrum
Date: Tue, 26 Sep 2023 08:09:10 +0200
Organization: A noiseless patient Spider
Lines: 64
Message-ID: <uetsi7$2cf8i$1@dont-email.me>
References: <57c5e077-ac71-486c-8afa-edd6802cf6b1n@googlegroups.com>
<a0dd4fb4-d708-48ae-9764-3ce5e24aec0cn@googlegroups.com>
<5fa92a78-d27c-4dff-a3dc-35ee7b43cbfan@googlegroups.com>
<c9131381-2e9b-4008-bc43-d4df4d4d8ab4n@googlegroups.com>
<edb0d2c4-1689-44b4-ae81-5ab1ef234f8en@googlegroups.com>
<43901a10-4859-43d7-b500-70030047c8b2n@googlegroups.com>
<jwvzg1acja6.fsf-monnier+comp.arch@gnu.org> <uesgjo$21bpm$2@dont-email.me>
<jwvil7yces1.fsf-monnier+comp.arch@gnu.org> <uesutv$242ut$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 26 Sep 2023 06:09:11 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="f19b7534d597581dc6e54cf021977d65";
logging-data="2506002"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/Ce1EZe1Ux4A5oW7zRxmra4MLr/UvymXXomItcdG2FJA=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Firefox/91.0 SeaMonkey/2.53.17
Cancel-Lock: sha1:xPG2E0ZTIJnAwS1zo9vp/rwaqOw=
In-Reply-To: <uesutv$242ut$1@dont-email.me>
 by: Terje Mathisen - Tue, 26 Sep 2023 06:09 UTC

BGB-Alt wrote:
> On 9/25/2023 1:11 PM, Stefan Monnier wrote:
>>> I am now evaluating the possible use of a 48-bit floating-point
>>> format, but
>>> this is (merely) in terms of memory storage (in registers, it will
>>> still use
>>> Binary64).
>>
>> I suspect this is indeed the only sane way to go about it.
>> Also, I suspect that such 48bit floats would only be worthwhile when you
>> have some large vectors/matrices and care about the 33% bandwidth
>> overhead of using 64bit rather than 48bit.  So maybe the focus should be
>> on "load 3 chunks, then spread turn them into 4" since the limiting
>> factor would presumably be the memory bandwidth.
>>
>
> Yeah, memory bandwidth tends to be one of the major limiting factors for
> performance in my experience for many algorithms.
>
> This is partly why I had some wonk like 3x Float21 vectors (with 64-bit
> storage). And, part of why I do a lot of stuff using Binary16 (where, in
> this case, both Binary16 and Binary32 have the same latency).
>
> Well, and for a lot of my 3D rendering is using RGB555 framebuffers,
> 16-bit Z-buffer, and texture compression...
>
>
>
> As noted in some past examples, even desktop PCs are not immune to this,
> and saving some memory via bit-twiddly can often be cheaper than using a
> "less memory dense" strategy (that results in a higher number of L1
> misses).
>
> Ironically, this seems to run counter to the conventional wisdom of
> saving calculations via lookup tables (depending on the algorithm, the
> lookup table may only "win" if it is less than 1/4 or 1/8 the size of
> the L1 cache).

I used to be the master of lookup tables (i.e. a Word Count clone that
ran in 1.5 clock cycles/byte on a Pentium, using two dependent lookup
tables and zero branching), but lately vector operations and the
increasing cost of storage mean that re-calculating stuff can be faster
and/or more power-efficient.
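 
A minimal C sketch of the two-dependent-lookup shape of such a word counter;
the table contents and state encoding here are guesses at the general idea,
not Terje's original code:

#include <stddef.h>
#include <ctype.h>
#include <stdint.h>

static uint8_t char_class[256];  /* first table: 0 = whitespace, 1 = word char     */
static uint8_t step[2][2];       /* second table: [in_word][class] ->
                                    bit0 = new in_word, bit1 = starts a new word    */

static void init_tables(void)
{
    for (int c = 0; c < 256; c++)
        char_class[c] = !isspace(c);
    for (int s = 0; s < 2; s++)
        for (int cl = 0; cl < 2; cl++)
            step[s][cl] = (uint8_t)(cl | ((!s && cl) << 1));
}

static size_t count_words(const unsigned char *buf, size_t len)
{
    size_t words = 0;
    unsigned in_word = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned cl = char_class[buf[i]];  /* first lookup                   */
        unsigned e  = step[in_word][cl];   /* second, dependent lookup       */
        words  += e >> 1;                  /* branch-free word-start counter */
        in_word = e & 1;
    }
    return words;
}
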
>
> Many people seem to evaluate efficiency solely in terms of how many
> clock-cycles it would take to evaluate a sequence of instructions,
> rather than necessarily how much memory is touched by the algorithm or
> its probability of resulting in L1 misses and similar.

BTDT, lots of times!

Any serious optimization effort has to measure how each intended
improvement actually works as part of the full program, not in
isolation. I have been in a situation where I really wanted to beat the
best closed-source Ogg Vorbis decoder, and I was getting desperate near
the end: about 9 out of 10 new ideas that would all work in
theory/isolation ended up either a wash or a slowdown in the full
program.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Solving the Floating-Point Conundrum

<ueurpl$2i8gg$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=34319&group=comp.arch#34319

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: tl@none.invalid (Torbjorn Lindgren)
Newsgroups: comp.arch
Subject: Re: Solving the Floating-Point Conundrum
Date: Tue, 26 Sep 2023 15:02:13 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 37
Message-ID: <ueurpl$2i8gg$1@dont-email.me>
References: <fbed57b4-1553-4b63-b39e-c130754b3aa8n@googlegroups.com> <56jQM.245879$2ph4.169306@fx14.iad> <uesg1h$21bpm$1@dont-email.me> <4hkQM.297025$ZXz4.183665@fx18.iad>
Injection-Date: Tue, 26 Sep 2023 15:02:13 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="214e61b1d480fb2b868897654afdad6c";
logging-data="2695696"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19fSlISMPahhBcaUdHV9C1+P8qcMVRATEg="
Cancel-Lock: sha1:X5m4ObaFj5AVC4r+C3KakUGLqjQ=
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
 by: Torbjorn Lindgren - Tue, 26 Sep 2023 15:02 UTC

Scott Lurndal <slp53@pacbell.net> wrote:
>BGB <cr88192@gmail.com> writes:
>>On 9/25/2023 12:06 PM, Scott Lurndal wrote:
>>> Did you read the post you responded to? How do you get 40gbytes/sec
>>> into your memory subsystem from an onboard 400gbit nic? Or 128 gbytes/sec
>>> from a PCIe root complex? Or from a PCIe CXL-cache memory extender?
>>
>>Generally you don't...
>>
>>I think, if one has a 1GbE Ethernet port, they could maybe get 120 MB/s
>>or similar if it is going "full tilt".
>
>2.5Gbit Ethernet is making its way into desktop chipsets. It won't be
>long before 10Gbit Ethernet is present in high-end PC mainboards.

10G is and has been a common feature on high-end PC motherboards for a
few years now (as part of "differentiating from the cheap stuff").
"Creator" and certain other PC motherboard segments also often include
them.

Today I'd argue it's NOT a high-end PC motherboard if it doesn't have
at least one 10G network controller. Some even have a second network
controller; this used to be an Intel 1G (everyone has drivers), but
lately there are a lot of 10G+2.5G or 10G+5G combos, and some with 2x10G.

A single 1G controller is mostly relegated to the cheapest
motherboards which sometimes can't even provide full power to the top
processors!

Recently even some budget models (as opposed to "cheapest possible")
have 2.5G, and in the segments above that it's very common and becomes
almost a given as you move up. The 2.5G network controller has become
*cheap*.

Obviously it'll take time for all PCs to be replaced and all network
equipment to be replaced... OTOH, 2.5G will run on (almost) all
wiring/connectors that can handle 1G.

Re: Solving the Floating-Point Conundrum

<uev8ks$kk4p$1@newsreader4.netcologne.de>


https://news.novabbs.org/devel/article-flat.php?id=34320&group=comp.arch#34320

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-5334-0-63be-5579-991f-ded7.ipv6dyn.netcologne.de!not-for-mail
From: tkoenig@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Solving the Floating-Point Conundrum
Date: Tue, 26 Sep 2023 18:41:32 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <uev8ks$kk4p$1@newsreader4.netcologne.de>
References: <fbed57b4-1553-4b63-b39e-c130754b3aa8n@googlegroups.com>
<memo.20230925074837.16292U@jgd.cix.co.uk>
<8ThQM.146454$bmw6.26202@fx10.iad> <uesdbt$20rq4$1@dont-email.me>
<56jQM.245879$2ph4.169306@fx14.iad>
Injection-Date: Tue, 26 Sep 2023 18:41:32 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-5334-0-63be-5579-991f-ded7.ipv6dyn.netcologne.de:2001:4dd7:5334:0:63be:5579:991f:ded7";
logging-data="675993"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Tue, 26 Sep 2023 18:41 UTC

Scott Lurndal <scott@slp53.sl.home> schrieb:
> One chip I'm aware of
> has 20 dual-channel DDR5 memory controllers (one per every four
> cores).

That is seriously beefy.

Hm, let's see. For Power10, Wikipedia gives 410 GB/s for DDR4
and 800 GB/s for DDR6; DDR5 is probably in the middle, so let's
call it 600 GB/s.

So, Power10 seems to have more memory bandwidth per core (equivalent
to ~12 DDR5 channels for 15 cores), unless you factor in the
eight-way SMT.

Still, that system you describe is _seriously_ beefy.
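 
Putting rough numbers on that comparison, as a sketch only: it assumes
~51 GB/s per DDR5-6400 channel and reads "one controller per four cores"
as 80 cores total.

#include <stdio.h>

int main(void)
{
    double ddr5_channel = 6.4e9 * 8;     /* DDR5-6400: ~51.2 GB/s per channel */

    double p10_bw    = 600e9;            /* the ~600 GB/s guess above         */
    int    p10_cores = 15;

    int    other_channels = 20 * 2;      /* 20 dual-channel controllers       */
    int    other_cores    = 20 * 4;      /* one controller per four cores     */

    printf("Power10    : %.0f GB/s/core  (~%.0f DDR5 channels' worth)\n",
           p10_bw / p10_cores / 1e9, p10_bw / ddr5_channel);
    printf("Other chip : %.0f GB/s/core  (%d DDR5 channels)\n",
           other_channels * ddr5_channel / other_cores / 1e9, other_channels);
    return 0;
}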

Re: Solving the Floating-Point Conundrum

<d25c95b3-6d5a-4572-9cda-2ebc9500543cn@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=34321&group=comp.arch#34321

X-Received: by 2002:ad4:4e22:0:b0:65b:11fe:bdb with SMTP id dm2-20020ad44e22000000b0065b11fe0bdbmr308qvb.10.1695761027051;
Tue, 26 Sep 2023 13:43:47 -0700 (PDT)
X-Received: by 2002:a05:6808:f07:b0:3a7:392a:7405 with SMTP id
m7-20020a0568080f0700b003a7392a7405mr56823oiw.2.1695761026688; Tue, 26 Sep
2023 13:43:46 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 26 Sep 2023 13:43:46 -0700 (PDT)
In-Reply-To: <uev8ks$kk4p$1@newsreader4.netcologne.de>
Injection-Info: google-groups.googlegroups.com; posting-host=2a0d:6fc2:55b0:ca00:ec40:1ddb:d3b5:36a3;
posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 2a0d:6fc2:55b0:ca00:ec40:1ddb:d3b5:36a3
References: <fbed57b4-1553-4b63-b39e-c130754b3aa8n@googlegroups.com>
<memo.20230925074837.16292U@jgd.cix.co.uk> <8ThQM.146454$bmw6.26202@fx10.iad>
<uesdbt$20rq4$1@dont-email.me> <56jQM.245879$2ph4.169306@fx14.iad> <uev8ks$kk4p$1@newsreader4.netcologne.de>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <d25c95b3-6d5a-4572-9cda-2ebc9500543cn@googlegroups.com>
Subject: Re: Solving the Floating-Point Conundrum
From: already5chosen@yahoo.com (Michael S)
Injection-Date: Tue, 26 Sep 2023 20:43:47 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 2699
 by: Michael S - Tue, 26 Sep 2023 20:43 UTC

On Tuesday, September 26, 2023 at 9:41:36 PM UTC+3, Thomas Koenig wrote:
> Scott Lurndal <sc...@slp53.sl.home> schrieb:
> > One chip I'm aware of
> > has 20 dual-channel DDR5 memory controllers (one per every four
> > cores).
> That is seriously beefy.
>
> Hm, let's see. For Power10, Wikipedia gives 410 GB/s for DDR4
> and 800 GB/s for DDR6,

By now, DDR6 is not even finalized as a specification. The first modules
are ~2 years away. And something as slow-moving as POWER? I don't believe
we will see it even a full 3 years from now.

Wikipedia is talking about GDDR6 rather than DDR6. GDDR6 is a high-bandwidth,
high-latency, low-capacity variant of DDR4 intended primarily for GPUs.

I would think that in the case of POWER10, GDDR6 is applicable only to
rare specialty systems.

> DDR5 is probably in the middle, let's
> call it 600 GB/s.
>
> So, Power10 seems to have more memory bandwidth per core (equivalent
> to ~12 DDR5 channels for 15 cores), unless you factor in the
> eight-way SMT.
>
> Still, that system you describe is _seriously_ beefy.

If you want high bandwidth, don't care about high capacity or field upgradability,
but still want DDR5-like latency, then LDDR5 looks like a better path than GDDR7.

Re: Solving the Floating-Point Conundrum

<03c0ba95-7051-48e0-a39f-4ffd6cd3ee38n@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=34322&group=comp.arch#34322

X-Received: by 2002:a05:620a:22ce:b0:774:18ee:ca00 with SMTP id o14-20020a05620a22ce00b0077418eeca00mr95532qki.13.1695761140994;
Tue, 26 Sep 2023 13:45:40 -0700 (PDT)
X-Received: by 2002:a05:6870:9566:b0:1dc:4b32:eb14 with SMTP id
v38-20020a056870956600b001dc4b32eb14mr61827oal.4.1695761140822; Tue, 26 Sep
2023 13:45:40 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 26 Sep 2023 13:45:40 -0700 (PDT)
In-Reply-To: <d25c95b3-6d5a-4572-9cda-2ebc9500543cn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2a0d:6fc2:55b0:ca00:ec40:1ddb:d3b5:36a3;
posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 2a0d:6fc2:55b0:ca00:ec40:1ddb:d3b5:36a3
References: <fbed57b4-1553-4b63-b39e-c130754b3aa8n@googlegroups.com>
<memo.20230925074837.16292U@jgd.cix.co.uk> <8ThQM.146454$bmw6.26202@fx10.iad>
<uesdbt$20rq4$1@dont-email.me> <56jQM.245879$2ph4.169306@fx14.iad>
<uev8ks$kk4p$1@newsreader4.netcologne.de> <d25c95b3-6d5a-4572-9cda-2ebc9500543cn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <03c0ba95-7051-48e0-a39f-4ffd6cd3ee38n@googlegroups.com>
Subject: Re: Solving the Floating-Point Conundrum
From: already5chosen@yahoo.com (Michael S)
Injection-Date: Tue, 26 Sep 2023 20:45:40 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 2960
 by: Michael S - Tue, 26 Sep 2023 20:45 UTC

On Tuesday, September 26, 2023 at 11:43:48 PM UTC+3, Michael S wrote:
> On Tuesday, September 26, 2023 at 9:41:36 PM UTC+3, Thomas Koenig wrote:
> > Scott Lurndal <sc...@slp53.sl.home> schrieb:
> > > One chip I'm aware of
> > > has 20 dual-channel DDR5 memory controllers (one per every four
> > > cores).
> > That is seriously beefy.
> >
> > Hm, let's see. For Power10, Wikipedia gives 410 GB/s for DDR4
> > and 800 GB/s for DDR6,
> By now, DDR6 is not finalized even as specification. First modules are
> ~2 years away. Something as slow moving as POWER? I don't believe we
> will see something even full 3 years from now.
>
> Wikipadia is talking about GDDR6 rather than DDR6. GDDR6 is high bandwidth,
> high latency, low capacity variant of DDR4 intended primarily for GPUs.
>
> I would think that in case of POWER10 GDDR6 is applicable only for
> rare specialty systems.
> > DDR5 is probably in the middle, let's
> > call it 600 GB/s.
> >
> > So, Power10 seems to have more memory bandwidth per core (equivalent
> > to ~12 DDR5 channels for 15 cores), unless you factor in the
> > eight-way SMT.
> >
> > Still, that system you describe is _seriously_ beefy.
> If you want high bandwidth, don't care about high capacity or field upgradability,
> but still want DDR5-like latency then LDDR5 looks like better path than GDDR7.

LPDDR5, of course.

Re: Solving the Floating-Point Conundrum

<uf1h1r$35160$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=34323&group=comp.arch#34323

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Solving the Floating-Point Conundrum
Date: Wed, 27 Sep 2023 10:17:14 -0500
Organization: A noiseless patient Spider
Lines: 56
Message-ID: <uf1h1r$35160$1@dont-email.me>
References: <57c5e077-ac71-486c-8afa-edd6802cf6b1n@googlegroups.com>
<a0dd4fb4-d708-48ae-9764-3ce5e24aec0cn@googlegroups.com>
<5fa92a78-d27c-4dff-a3dc-35ee7b43cbfan@googlegroups.com>
<c9131381-2e9b-4008-bc43-d4df4d4d8ab4n@googlegroups.com>
<edb0d2c4-1689-44b4-ae81-5ab1ef234f8en@googlegroups.com>
<43901a10-4859-43d7-b500-70030047c8b2n@googlegroups.com>
<jwvzg1acja6.fsf-monnier+comp.arch@gnu.org> <uesgjo$21bpm$2@dont-email.me>
<jwvil7yces1.fsf-monnier+comp.arch@gnu.org> <uesutv$242ut$1@dont-email.me>
<jwvedil530g.fsf-monnier+comp.arch@gnu.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 27 Sep 2023 15:17:16 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="6935b40e1c6a57f58b8c92381fe63bdf";
logging-data="3310784"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+k01HEYBbz+eHGiNLE/0EK"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.15.1
Cancel-Lock: sha1:eheunIhWPHI9bjaStmwcwsLWRrM=
In-Reply-To: <jwvedil530g.fsf-monnier+comp.arch@gnu.org>
Content-Language: en-US
 by: BGB - Wed, 27 Sep 2023 15:17 UTC

On 9/25/2023 5:08 PM, Stefan Monnier wrote:
>>> E.g. load 3 chunks (C1, C2, and C3) of 256bits each using standard SIMD
>>> load, and then add an instruction to turn C1+C2 into two 256bit vectors
>>> of 4x64bit floats, and another to do the same with C2+C3 (basically, the
>>> same instruction except it uses the other half of the bits of C2).
>> I don't really have a good way to do 256-bit loads or work with 256-bit
>> vectors in my core.
>
> Then do 3x 64bit loads which you then split into 4 64bit floats:
>
> 64bit 64bit 64bit
> +---------+ +-------------------------+ +---------+
> A0-31 B0-31 A32-47 B32-47 C32-47 D32-47 C0-31 D0-31
>
> It's an admittedly unusual layout, but lets you load&store in standard
> sized chunks, and hence full-bandwidth. And lets you "reshuffle" things
> using only "2-in" operations: if your pipeline can do "2-in 2-out" you
> can do it two (parallel) instructions and otherwise you can do it in
> 4 (parallel) instructions. Of course, if your pipeline can accommodate
> "3-in 4-out", then you can use a more standard layout.
>

Possible at least.
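 
For concreteness, a C sketch of that reshuffle using three plain 64-bit
loads; the field order within the middle chunk, and the assumption that the
48-bit format is simply a truncated Binary64, are my reading of the diagram
rather than anything defined:

#include <stdint.h>
#include <string.h>

/* Widen a 48-bit (truncated Binary64) value to a double by zero-filling
   the 16 dropped mantissa bits. */
static double widen48(uint64_t bits48)
{
    uint64_t b64 = bits48 << 16;
    double d;
    memcpy(&d, &b64, sizeof d);
    return d;
}

/* chunks[0] = A(low32) | B(low32)<<32
   chunks[1] = A(hi16) | B(hi16)<<16 | C(hi16)<<32 | D(hi16)<<48
   chunks[2] = C(low32) | D(low32)<<32                              */
static void unpack4(const uint64_t chunks[3], double out[4])
{
    uint64_t a_lo =  chunks[0]        & 0xFFFFFFFFu;
    uint64_t b_lo =  chunks[0] >> 32;
    uint64_t a_hi =  chunks[1]        & 0xFFFF;
    uint64_t b_hi = (chunks[1] >> 16) & 0xFFFF;
    uint64_t c_hi = (chunks[1] >> 32) & 0xFFFF;
    uint64_t d_hi =  chunks[1] >> 48;
    uint64_t c_lo =  chunks[2]        & 0xFFFFFFFFu;
    uint64_t d_lo =  chunks[2] >> 32;

    out[0] = widen48((a_hi << 32) | a_lo);
    out[1] = widen48((b_hi << 32) | b_lo);
    out[2] = widen48((c_hi << 32) | c_lo);
    out[3] = widen48((d_hi << 32) | d_lo);
}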

>> The Float48 format is basically special case Load/Store ops which pad the
>> value to 64 bits on Load, and narrow it to 48 bits on Store.
>
> So you're wasting the extra L1 bandwidth since you're using a load which
> can fetch 64bit but only use 48 of those bits (you admittedly still
> gain w.r.t to the bandwidth of higher levels of the memory hierarchy).
>

This is little different from a normal load/store:
the L1 cache is wide enough to support 128 bits in 1 cycle,
but anything narrower than 128 bits still takes 1 cycle.

Typically, 64- and 32-bit accesses tend to dominate, followed by 128-bit,
with 8- and 16-bit accesses a little further down the list.

The Load path basically just performs an unaligned 64-bit load and then
extracts 48 bits from it. The Store path needed to add a special case to
support a 48-bit store.

Theoretically, 96-bit could also be possible, but would need some
additional logic (thus cost).

At least if moving data around with 64 or 128 bit operations, the
limiting factor tends to be L1<->L2 transfers, but if doing 8 or 16 bit
Load/Store, then the loads/stores themselves can become the bottleneck.

Doing stuff with 32-bit ops can go either way.
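 
As a software model of what such Load/Store ops do (a sketch only:
little-endian byte order and truncate-on-store are assumptions, since the
exact behaviour isn't spelled out here):

#include <stdint.h>
#include <string.h>

/* Narrow a Binary64 to 48 bits by dropping the low 16 mantissa bits. */
static void store_f48(uint8_t *p, double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    bits >>= 16;              /* keep sign, exponent, top 36 mantissa bits */
    memcpy(p, &bits, 6);      /* 6-byte (48-bit) little-endian store       */
}

/* Pad a 48-bit value back out to a Binary64, low mantissa bits zeroed. */
static double load_f48(const uint8_t *p)
{
    uint64_t bits = 0;
    memcpy(&bits, p, 6);
    bits <<= 16;
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}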

Re: Solving the Floating-Point Conundrum

<uf1hij$35160$2@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=34324&group=comp.arch#34324

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Solving the Floating-Point Conundrum
Date: Wed, 27 Sep 2023 10:26:10 -0500
Organization: A noiseless patient Spider
Lines: 113
Message-ID: <uf1hij$35160$2@dont-email.me>
References: <fbed57b4-1553-4b63-b39e-c130754b3aa8n@googlegroups.com>
<memo.20230925074837.16292U@jgd.cix.co.uk> <8ThQM.146454$bmw6.26202@fx10.iad>
<uesdbt$20rq4$1@dont-email.me>
<c6f7d78c-b586-413a-a551-9987a486a39bn@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 27 Sep 2023 15:26:11 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="6935b40e1c6a57f58b8c92381fe63bdf";
logging-data="3310784"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/PjolSl5znY9+S9XhOTQPj"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.15.1
Cancel-Lock: sha1:t2YA/GcvUHzURBa3n3uJP0u2IvA=
In-Reply-To: <c6f7d78c-b586-413a-a551-9987a486a39bn@googlegroups.com>
Content-Language: en-US
 by: BGB - Wed, 27 Sep 2023 15:26 UTC

On 9/25/2023 7:06 PM, JimBrakefield wrote:
> On Monday, September 25, 2023 at 11:43:45 AM UTC-5, BGB wrote:
>> On 9/25/2023 10:41 AM, Scott Lurndal wrote:
>>> j...@cix.co.uk (John Dallman) writes:
>>>> In article <fbed57b4-1553-4b63...@googlegroups.com>,
>>>> jsa...@ecn.ab.ca (Quadibloc) wrote:
>>>>
>>>>> hardware support for packed decimal
>>>>> hardware support for IBM System/360 hexadecimal floating point
>>>>>
>>>>> because people do run Hercules on their computers and so on.
>>>>
>>>> I read the Hercules mailing list. Nobody on there uses it for serious
>>>> work. It seems to be mainly used to evoke memories of youth. Running it
>>>> on low-powered hardware, such as Raspberry Pi, attracts more notice than
>>>> running it on something powerful. Hercules is written in portable C,
>>>> because portability is considered more important than performance.
>>>>
>>>>> I have now added, at the bottom of the page, a scheme, involving
>>>>> having dual-channel memory where each channel is 192 bits wide,
>>>>> that permits the operating system to allocate blocks of 384-bit
>>>>> wide memory, 288-bit wide memory, 240-bit wide memory, and 256-bit
>>>>> wide memory.
>>>>
>>>> That's an interesting new way to have your system run short of the right
>>>> kind of memory.
>>>
>>> Indeed. It's not the path from memory to the core complex that is
>>> currently most interesting (although 256-bit wide (and higher) mesh or crossbars
>>> aren't uncommon), but rather the data path widths from I/O
>>> subsystems. 512-bit wide paths from network controllers and on-board
>>> non-coherent (or coherent, see CXL) coprocessors has become
>>> common. Supporting 80gbit/sec of network traffic into memory
>>> or the networking subsystem isn't trivial.
>>>
>>> The memory bandwidth grows by adding controllers and striping across them
>>> for the most part.
>> ?...
>>
>>
>> AFAIK, typical DIMMs have a 64-bit wide interface, and typical MOBOs
>> have 4 DIMM slots with DIMMs being filled in pairs.
>>
>> This would seemingly imply that RAM would be mostly limited to a 128-bit
>> datapath (or 64-bit in unganged mode).
>>
>> Similarly, PCIe slots are effectively multiple serial lanes running in
>> parallel, etc...
>>
>> Typical onboard peripherals being connected either with PCIe lanes or
>> via an onboard LPC bus.
>>
>> Similarly, a typical motherboard only has a single CPU socket.
>>
>> ...
>>
>>
>> Outside of the CPU itself, unclear where any of these wide interconnects
>> would be, or where they would be going.
>>
>>
>> Similarly, most things much smaller than a PC, will have 16-bit or
>> 32-bit RAM interfaces (with typically 1 or 2 RAM chips soldered onto the
>> motherboard somewhere, often also with an eMMC Flash chip and other things).
>>
>>
>> Granted, on the other hand, I have some ~ 18 year old Xeon based rack
>> servers, which have a pair of CPUs surrounded by a small island of RAM
>> modules, and comparably still rather impressive multi-core "memcpy()"
>> performance.
>>
>> Say, while each core individually is still limited to a few GB/sec, one
>> can have all of the cores running memcpy at the same time without any
>> significant drop (vs, say, on a PC where once the total exceeds around 8
>> to 10GB/sec or so, per-thread performance drops off).
>>
>> Well, and both still significantly beat out the "memcpy()" performance
>> on a 20 year old laptop (as does a RasPi...).
>>
>> ...
>
> Ugh,
> |> This would seemingly imply that RAM would be mostly limited to a 128-bit
> |> datapath (or 64-bit in unganged mode).
>
> https://en.wikipedia.org/wiki/CAS_latency
> Has a table of DIMM performance for various DIMM generations and various burst lengths.
> And by extension to some number of directly soldered DRAM chips which each can have up to 16 data pins.
> (e.g. two chips DRAM with 32 total data pins and a burst size of eight = 256 bit cache line)

I was talking about the width of the datapath, not the size of a burst
transfer.

So, for example, with a DDR3 chip with a 16-bit interface, 8x burst, you
can move 16B in 4 cycles (from the RAM chip's POV); but this doesn't
change that the interface is only 16-bit wide.

Internally, in my case, there is a 512-bit wide interface between my DDR
RAM module and the L2 cache, but I don't really count it, as it is fairly
special-purpose: it moves data in big enough chunks that I can get full
RAM bandwidth from a DDR controller that can only issue one request at a
time at a 50MHz RAM clock (with the logic internally operating at 100MHz
and thus driving the high and low sides as separate clock cycles as far
as the FPGA is concerned).

If I were able to FIFO the requests in the RAM controller, I wouldn't
necessarily need the 512-bit bursts; but there are annoyances (and
latency) here due to crossing clock domains.

Re: Solving the Floating-Point Conundrum

<uf1ij1$35dqi$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=34325&group=comp.arch#34325

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Solving the Floating-Point Conundrum
Date: Wed, 27 Sep 2023 10:43:26 -0500
Organization: A noiseless patient Spider
Lines: 104
Message-ID: <uf1ij1$35dqi$1@dont-email.me>
References: <57c5e077-ac71-486c-8afa-edd6802cf6b1n@googlegroups.com>
<a0dd4fb4-d708-48ae-9764-3ce5e24aec0cn@googlegroups.com>
<5fa92a78-d27c-4dff-a3dc-35ee7b43cbfan@googlegroups.com>
<c9131381-2e9b-4008-bc43-d4df4d4d8ab4n@googlegroups.com>
<edb0d2c4-1689-44b4-ae81-5ab1ef234f8en@googlegroups.com>
<43901a10-4859-43d7-b500-70030047c8b2n@googlegroups.com>
<jwvzg1acja6.fsf-monnier+comp.arch@gnu.org> <uesgjo$21bpm$2@dont-email.me>
<jwvil7yces1.fsf-monnier+comp.arch@gnu.org> <uesutv$242ut$1@dont-email.me>
<uetsi7$2cf8i$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 27 Sep 2023 15:43:29 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="6935b40e1c6a57f58b8c92381fe63bdf";
logging-data="3323730"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+m6ZvrSgce62RqQ18Qj3Ae"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.15.1
Cancel-Lock: sha1:804CTbVdZOd8Af64VnCNMOPj0lE=
Content-Language: en-US
In-Reply-To: <uetsi7$2cf8i$1@dont-email.me>
 by: BGB - Wed, 27 Sep 2023 15:43 UTC

On 9/26/2023 1:09 AM, Terje Mathisen wrote:
> BGB-Alt wrote:
>> On 9/25/2023 1:11 PM, Stefan Monnier wrote:
>>>> I am now evaluating the possible use of a 48-bit floating-point
>>>> format, but
>>>> this is (merely) in terms of memory storage (in registers, it will
>>>> still use
>>>> Binary64).
>>>
>>> I suspect this is indeed the only sane way to go about it.
>>> Also, I suspect that such 48bit floats would only be worthwhile when you
>>> have some large vectors/matrices and care about the 33% bandwidth
>>> overhead of using 64bit rather than 48bit.  So maybe the focus should be
>>> on "load 3 chunks, then spread turn them into 4" since the limiting
>>> factor would presumably be the memory bandwidth.
>>>
>>
>> Yeah, memory bandwidth tends to be one of the major limiting factors
>> for performance in my experience for many algorithms.
>>
>> This is partly why I had some wonk like 3x Float21 vectors (with
>> 64-bit storage). And, part of why I do a lot of stuff using Binary16
>> (where, in this case, both Binary16 and Binary32 have the same latency).
>>
>> Well, and for a lot of my 3D rendering is using RGB555 framebuffers,
>> 16-bit Z-buffer, and texture compression...
>>
>>
>>
>> As noted in some past examples, even desktop PCs are not immune to
>> this, and saving some memory via bit-twiddly can often be cheaper than
>> using a "less memory dense" strategy (that results in a higher number
>> of L1 misses).
>>
>> Ironically, this seems to run counter to the conventional wisdom of
>> saving calculations via lookup tables (depending on the algorithm, the
>> lookup table may only "win" if it is less than 1/4 or 1/8 the size of
>> the L1 cache).
>
> I used to be the master of lookup tables (i.e. a Word Count clone that
> ran in 1.5 clock cycles/byte on a Pentium, using two dependent lookup
> tables and zero branching), but more lately vector operations and the
> increasing cost of storage means that re-calculating stuff can be faster
> and/or more power-efficient.

Yeah.

Or, in my core, because of the comparably high cost of L1 misses...

Using lookup tables to sidestep a divide operation can be a win, though:
a software divide loop is slow, and the hardware DIVx.x ops are also slow.

Except that DIVS.L / DIVU.L can optimize a similar subset of cases to the
lookup tables (internally handling it as a multiply-by-reciprocal). So,
say, for divisors in the range of 0..63, the DIVS.L op can handle it in 3
cycles (vs 36 cycles for most everything else).
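 
As a rough sketch of the multiply-by-reciprocal idea for small divisors
(plain C using GCC/Clang's __uint128_t, not what the DIVS.L / DIVU.L
hardware actually does):

#include <stdint.h>

static uint64_t recip[64];     /* recip[d] = ceil(2^64 / d), for d = 2..63 */

static void init_recip(void)
{
    for (unsigned d = 2; d < 64; d++)
        recip[d] = (uint64_t)(((((__uint128_t)1) << 64) + d - 1) / d);
}

/* Exact for 32-bit dividends and divisors 1..63. */
static uint32_t udiv_small(uint32_t x, uint32_t d)
{
    if (d == 1)
        return x;              /* ceil(2^64/1) doesn't fit in the table */
    return (uint32_t)(((__uint128_t)x * recip[d]) >> 64);
}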

>>
>> Many people seem to evaluate efficiency solely in terms of how many
>> clock-cycles it would take to evaluate a sequence of instructions,
>> rather than necessarily how much memory is touched by the algorithm or
>> its probability of resulting in L1 misses and similar.
>
> BTDT, lots of times!
>
> Any serious optimization effort have to measure how each intended
> improvement actually work as part of the full program, and not in
> isolation. I have been in a situation where I really wanted to beat the
> best closed-source ogg vorbis decoder, and I was getting desperate near
> the end: About 9 out of 10 new ideas that would all work in
> theory/isolation ended up either a wash or with a slowdown in the full
> program.
>

Yeah.

This is sort of related to some of my codecs using Rice and other
non-standard entropy coders (rather than Huffman): while Huffman is short
and fast in isolation, once one has multiple contexts (and a lot of L1
misses), performance quickly takes a hit.

So, there may be wonky encoders, like ranking all the symbols by
probability and then Rice-encoding the rank (so only ~256 bytes are needed
per context in this case), usually with the Rice coding being
length-limited (typically Q=7 as the limit, where a run of 8x 1-bits or
similar is followed by a raw symbol).

More complex logic, but fewer L1 misses, ...
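 
Very roughly, the encode side of such a scheme might look like the C sketch
below; the bit-packing, the Q=7 escape rule, and all the names are
illustrative rather than taken from an actual codec:

#include <stdint.h>

typedef struct {
    uint8_t *out;      /* output byte stream          */
    uint64_t acc;      /* bit accumulator, MSB-first  */
    int      nbits;    /* bits currently held in acc  */
} BitWriter;

/* Final flush of any leftover partial byte is omitted for brevity. */
static void put_bits(BitWriter *bw, uint32_t val, int n)
{
    bw->acc = (bw->acc << n) | (val & ((1u << n) - 1));
    bw->nbits += n;
    while (bw->nbits >= 8) {
        bw->nbits -= 8;
        *bw->out++ = (uint8_t)(bw->acc >> bw->nbits);
    }
}

/* rank = the symbol's position in this context's probability ranking,
   k    = Rice parameter */
static void rice_encode(BitWriter *bw, uint8_t sym, uint8_t rank, int k)
{
    int q = rank >> k;
    if (q >= 8) {                                /* escape: 8 one-bits + raw symbol */
        put_bits(bw, 0xFF, 8);
        put_bits(bw, sym, 8);
    } else {                                     /* q one-bits, a zero, k low bits  */
        put_bits(bw, ((1u << q) - 1) << 1, q + 1);
        put_bits(bw, rank & ((1u << k) - 1), k);
    }
}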

Though, Huffman with a 12-bit length-limit is also a promising
compromise, as it is generally easier for an 8K lookup table to fit into
the L1 cache... And, 12-bit Huffman does still lead to better
compression than the permutation-table + Rice variants.
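 
And the single-lookup decode that a 12-bit length limit buys might look
like this (a sketch; table construction is omitted and the entry layout is
an assumption):

#include <stdint.h>

/* 4096 entries * 2 bytes = 8 KB: low 8 bits = symbol, top 4 bits = code length */
static uint16_t dec_table[1 << 12];

typedef struct {
    const uint8_t *in;     /* input byte stream          */
    uint64_t acc;          /* bit accumulator, MSB-first */
    int      nbits;        /* bits currently held in acc */
} BitReader;

static int huff_decode(BitReader *br)
{
    while (br->nbits < 12) {                 /* refill so we can peek 12 bits */
        br->acc = (br->acc << 8) | *br->in++;
        br->nbits += 8;
    }
    uint32_t peek  = (uint32_t)(br->acc >> (br->nbits - 12)) & 0xFFF;
    uint16_t entry = dec_table[peek];
    br->nbits -= entry >> 12;                /* consume just the code's length */
    return entry & 0xFF;                     /* decoded symbol                 */
}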

....

> Terje
>

