Aprupt underflow mode

<u78vn4$3keqd$1@newsreader4.netcologne.de>

https://news.novabbs.org/devel/article-flat.php?id=32929&group=comp.arch#32929

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd4-f9e9-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoenig@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Aprupt underflow mode
Date: Sun, 25 Jun 2023 09:00:20 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <u78vn4$3keqd$1@newsreader4.netcologne.de>
Injection-Date: Sun, 25 Jun 2023 09:00:20 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd4-f9e9-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd4:f9e9:0:7285:c2ff:fe6c:992d";
logging-data="3816269"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)

by: Thomas Koenig - Sun, 25 Jun 2023 09:00 UTC

Fortran 2023 will support an abrupt underflow mode for those
processors that support it. It's optional, and there is a hint
that it may be faster.

So, questions: Does anybody know which CPUs support setting
underflow mode? How much faster would normal execution be
if underflow is not actually used?

Here's the text from the standard draft (note that "processor" is
Fortran-standardese for all of the CPU, operating system, library,
and compiler).

# 17.5 Underflow mode

# 1 Some processors allow control during program execution of whether
# underflow produces a subnormal number in conformance with ISO/IEC
# 60559:2020 (gradual underflow) or produces zero instead (abrupt
# underflow). On some processors, floating-point performance
# is typically better in abrupt underflow mode than in gradual
# underflow mode.

# 2 Control over the underflow mode is exercised by invocation of
# IEEE_SET_UNDERFLOW_MODE. The subroutine IEEE_GET_UNDERFLOW_MODE
# can be used to get the underflow mode. The inquiry function
# IEEE_SUPPORT_UNDERFLOW_CONTROL can be used to inquire whether this
# facility is available. The initial underflow mode is processor
# dependent. In a procedure other than IEEE_SET_UNDERFLOW_MODE or
# IEEE_SET_STATUS, the processor shall not change the underflow mode
# on entry, and on return shall ensure that the underflow mode is
# the same as it was on entry.

# 3 The underflow mode affects only floating-point calculations whose
# type is that of an X for which IEEE_SUPPORT_UNDERFLOW_CONTROL
# returns true.
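
As a rough illustration of the two modes described in the quoted draft text, here is a minimal Python sketch. It assumes the host evaluates Python floats as IEEE 754 doubles with gradual underflow (the usual case for CPython), and it only emulates abrupt underflow in software. The helper names `is_subnormal` and `abrupt` are hypothetical and are not part of any Fortran or Python API.

```python
import math
import sys

MIN_NORMAL = sys.float_info.min  # smallest positive normal double, 2**-1022


def is_subnormal(x: float) -> bool:
    """True if x is a nonzero double below the normal range."""
    return x != 0.0 and abs(x) < MIN_NORMAL


def abrupt(x: float) -> float:
    """Emulate abrupt underflow: replace a subnormal result by a signed zero."""
    return math.copysign(0.0, x) if is_subnormal(x) else x


a = 1.5 * MIN_NORMAL
b = 1.0 * MIN_NORMAL

gradual_diff = a - b         # 0.5 * MIN_NORMAL, a subnormal (host default)
abrupt_diff = abrupt(a - b)  # emulated abrupt mode: zero although a != b
```

With gradual underflow, a - b == 0 implies a == b. The emulated abrupt mode loses that property for operands near the bottom of the normal range, which is the usual argument for keeping subnormals.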

Re: Abrupt underflow mode

<memo.20230625105602.16808X@jgd.cix.co.uk>

https://news.novabbs.org/devel/article-flat.php?id=32930&group=comp.arch#32930

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: jgd@cix.co.uk (John Dallman)
Newsgroups: comp.arch
Subject: Re: Abrupt underflow mode
Date: Sun, 25 Jun 2023 10:56 +0100 (BST)
Organization: A noiseless patient Spider
Lines: 11
Message-ID: <memo.20230625105602.16808X@jgd.cix.co.uk>
References: <u78vn4$3keqd$1@newsreader4.netcologne.de>
Reply-To: jgd@cix.co.uk
Injection-Info: dont-email.me; posting-host="d63294b461d4871b4907520b12728a1b";
logging-data="488288"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18b9BuMleQZV3mqk1VM+mtLHFzmzM3TQKI="
Cancel-Lock: sha1:SJWdr/ATam6RzFcGuVmpbPoGm1Q=

by: John Dallman - Sun, 25 Jun 2023 09:56 UTC

In article <u78vn4$3keqd$1@newsreader4.netcologne.de>,
tkoenig@netcologne.de (Thomas Koenig) wrote:

> So, questions: Does anybody know which CPUs support setting
> underflow mode? How much faster would normal execution be
> if underflow is not actually used?

Intel/AMD x86 supports this for the SSE2 registers and instructions,
though not for the old-style x87 registers and instructions.

John

Re: Aprupt underflow mode

<2f013e8f-b426-4ca6-a93b-1fcd20e272a6n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=32932&group=comp.arch#32932

X-Received: by 2002:a05:620a:398d:b0:763:df32:bdc with SMTP id ro13-20020a05620a398d00b00763df320bdcmr810029qkn.10.1687699267979;
Sun, 25 Jun 2023 06:21:07 -0700 (PDT)
X-Received: by 2002:aca:f041:0:b0:39c:8220:b3c9 with SMTP id
o62-20020acaf041000000b0039c8220b3c9mr6331294oih.0.1687699267709; Sun, 25 Jun
2023 06:21:07 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 25 Jun 2023 06:21:07 -0700 (PDT)
In-Reply-To: <u78vn4$3keqd$1@newsreader4.netcologne.de>
Injection-Info: google-groups.googlegroups.com; posting-host=199.203.251.52; posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 199.203.251.52
References: <u78vn4$3keqd$1@newsreader4.netcologne.de>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <2f013e8f-b426-4ca6-a93b-1fcd20e272a6n@googlegroups.com>
Subject: Re: Aprupt underflow mode
From: already5chosen@yahoo.com (Michael S)
Injection-Date: Sun, 25 Jun 2023 13:21:07 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 3250

by: Michael S - Sun, 25 Jun 2023 13:21 UTC

On Sunday, June 25, 2023 at 12:00:24 PM UTC+3, Thomas Koenig wrote:
> Fortran 2023 will support an aprupt underflow mode for those
> processors that support it. It's optional, and there is a hint
> that it may be faster.
>
> So, questions: Does anybody know which CPUs support setting
> underflow mode?

iAMD64 (SSEn, AVX, AVX512), ARM64, ARMv7.
On PowerISA v3. it is optional, but does not appear to be
implemented on any IBM CPU.

> How much faster would normal execution be
> if underflow is not actually used?

Not faster at all on any processor that people could possibly
want to run Fortran on.
On many such processors it is not faster even when underflow
actually happens.

>
> Here's the text from the standard draft (note that "processor" is
> Fortran-standardese for all of the CPU, operating system, library,
> and compiler).
>
> # 17.5 Underflow mode
>
> # 1 Some processors allow control during program execution of whether
> # underflow produces a subnormal number in conformance with ISO/IEC
> # 60559:2020 (gradual underflow) or produces zero instead (abrupt
> # underflow). On some processors, floating-point performance
> # is typically better in abrupt underflow mode than in gradual
> # underflow mode.
>
> # 2 Control over the underflow mode is exercised by invocation of
> # IEEE_SET_UNDERFLOW_MODE. The subroutine IEEE_GET_UNDERFLOW_MODE
> # can be used to get the underflow mode. The inquiry function
> # IEEE_SUPPORT_UNDERFLOW_CONTROL can be used to inquire whether this
> # facility is available. The initial underflow mode is processor
> # dependent. In a procedure other than IEEE_SET_UNDERFLOW_MODE or
> # IEEE_SET_STATUS, the processor shall not change the underflow mode
> # on entry, and on return shall ensure that the underflow mode is
> # the same as it was on entry.
>
> # 3 The underflow mode affects only floating-point calculations whose
> # type is that of an X for which IEEE_SUPPORT_UNDERFLOW_CONTROL
> # returns true.

Re: Aprupt underflow mode

<2023Jun25.180112@mips.complang.tuwien.ac.at>

https://news.novabbs.org/devel/article-flat.php?id=32935&group=comp.arch#32935

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Aprupt underflow mode
Date: Sun, 25 Jun 2023 16:01:12 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 48
Distribution: world
Message-ID: <2023Jun25.180112@mips.complang.tuwien.ac.at>
References: <u78vn4$3keqd$1@newsreader4.netcologne.de>
Injection-Info: dont-email.me; posting-host="02424ac9374d9124883bf563c53b9e18";
logging-data="567405"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19zxN4jdeSU8Dl+6Mgn4Zf5"
Cancel-Lock: sha1:Q/IdABPdFBYIXwzHIodFslJ/OIU=
X-newsreader: xrn 10.11

by: Anton Ertl - Sun, 25 Jun 2023 16:01 UTC

Thomas Koenig <tkoenig@netcologne.de> writes:
>Fortran 2023 will support an aprupt underflow mode for those
>processors that support it. It's optional, and there is a hint
>that it may be faster.
>
>So, questions: Does anybody know which CPUs support setting
>underflow mode? How much faster would normal execution be
>if underflow is not actually used?

From <2017Dec25.171105@mips.complang.tuwien.ac.at>:

|>> (I doubt anyone has tried to run a
|>> denormal multiplication benchmark on Ryzen to see how it does... or maybe
|>> someone has, but I don't know where to look for the results.)
|>
|>Such a benchmark exists:
|>
|>https://sco.h-its.org/exelixis/web/software/fpn/index.html
|
|Ok, here are results for Sandy Bridge (gcc-4.5), and Skylake and Zen (gcc-4.9):
|
|        2.7GHz    4GHz      4GHz
|        Sandy     Skylake   Zen
|        i7-2620   i5-6600K  R5 1600X
|micro
|Large:  0.051161  0.028260  0.018399
|Small:  0.016829  0.012012  0.011826
|micro-ignore (with flush-to-zero)
|Large:  0.029944  0.012030  0.017606
|Small:  0.017007  0.012000  0.011438
|
|According to the README at
|<https://github.com/stamatak/denormalizedFloatingPointNumbers>, the
|Large data set produces denormals, but the same number of computations
|as the Small data set.
|
|So for this benchmark denormals are about 3 times slower on Sandy
|Bridge, 2.3 times slower on Skylake, and 1.56 times slower on Zen.
|Using flush-to-zero reduces this to a factor 1.8 on Sandy Bridge, 1 on
|Skylake, and 1.54 on Zen. So it looks like AMD has invested in a
|better implementation of denormals, while Intel has invested in a
|better implementation of flush-to-zero.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
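
The slowdown factors quoted in the post above can be rechecked from its table. The following is a small Python sketch that only redoes that arithmetic; the dictionary layout and the name `slowdown` are illustrative choices, and the post does not state the units of the timings, so only ratios are computed.

```python
# Times copied from the table in the post above: (Large, Small) per CPU.
# Units are not stated in the post; only the ratios are used here.
TIMES = {
    "micro": {
        "Sandy Bridge": (0.051161, 0.016829),
        "Skylake": (0.028260, 0.012012),
        "Zen": (0.018399, 0.011826),
    },
    "micro-ignore (flush-to-zero)": {
        "Sandy Bridge": (0.029944, 0.017007),
        "Skylake": (0.012030, 0.012000),
        "Zen": (0.017606, 0.011438),
    },
}


def slowdown(large: float, small: float) -> float:
    """Ratio of the denormal-producing run to the denormal-free run."""
    return large / small


factors = {
    variant: {cpu: slowdown(*pair) for cpu, pair in cpus.items()}
    for variant, cpus in TIMES.items()
}

for variant, cpus in factors.items():
    for cpu, factor in cpus.items():
        print(f"{variant:30s} {cpu:13s} {factor:.2f}")
```

The ratios come out close to the rounded figures given in the post (about 3, 2.3 and 1.56 without flush-to-zero; about 1.8, 1 and 1.54 with it).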

Re: Aprupt underflow mode

<b613de1c-7401-4e0d-959b-d8d51f75abd6n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=32937&group=comp.arch#32937

X-Received: by 2002:a05:622a:1a14:b0:400:a9f5:beed with SMTP id f20-20020a05622a1a1400b00400a9f5beedmr160qtb.9.1687712739662;
Sun, 25 Jun 2023 10:05:39 -0700 (PDT)
X-Received: by 2002:a9d:754c:0:b0:6aa:e1b1:900c with SMTP id
b12-20020a9d754c000000b006aae1b1900cmr6153433otl.7.1687712739324; Sun, 25 Jun
2023 10:05:39 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 25 Jun 2023 10:05:39 -0700 (PDT)
In-Reply-To: <2023Jun25.180112@mips.complang.tuwien.ac.at>
Injection-Info: google-groups.googlegroups.com; posting-host=199.203.251.52; posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 199.203.251.52
References: <u78vn4$3keqd$1@newsreader4.netcologne.de> <2023Jun25.180112@mips.complang.tuwien.ac.at>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <b613de1c-7401-4e0d-959b-d8d51f75abd6n@googlegroups.com>
Subject: Re: Aprupt underflow mode
From: already5chosen@yahoo.com (Michael S)
Injection-Date: Sun, 25 Jun 2023 17:05:39 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

by: Michael S - Sun, 25 Jun 2023 17:05 UTC

On Sunday, June 25, 2023 at 7:06:21 PM UTC+3, Anton Ertl wrote:
> Thomas Koenig <tko...@netcologne.de> writes:
> >Fortran 2023 will support an aprupt underflow mode for those
> >processors that support it. It's optional, and there is a hint
> >that it may be faster.
> >
> >So, questions: Does anybody know which CPUs support setting
> >underflow mode? How much faster would normal execution be
> >if underflow is not actually used?
> From <2017Dec2...@mips.complang.tuwien.ac.at>:
>
> |>> (I doubt anyone has tried to run a
> |>> denormal multiplication benchmark on Ryzen to see how it does... or maybe
> |>> someone has, but I don't know where to look for the results.)
> |>
> |>Such a benchmark exists:
> |>
> |>https://sco.h-its.org/exelixis/web/software/fpn/index.html
> |
> |Ok, here are results for Sandy Bridge (gcc-4.5), and Skylake and Zen (gcc-4.9):
> |
> |
> |        2.7GHz    4GHz      4GHz
> |        Sandy     Skylake   Zen
> |        i7-2620   i5-6600K  R5 1600X
> |micro
> |Large:  0.051161  0.028260  0.018399
> |Small:  0.016829  0.012012  0.011826
> |micro-ignore (with flush-to-zero)
> |Large:  0.029944  0.012030  0.017606
> |Small:  0.017007  0.012000  0.011438
> |
> |According to the README at
> |<https://github.com/stamatak/denormalizedFloatingPointNumbers>, the
> |Large data set produces denormals, but the same number of computations
> |as the Small data set.
> |
> |So for this benchmark denormals are about 3 times slower on Sandy
> |Bridge, 2.3 times slower on Skylake, and 1.56 times slower on Zen.
> |Using flush-to-zero reduces this to a factor 1.8 on Sandy Bridge, 1 on
> |Skylake, and 1.54 on Zen. So it looks like AMD has invested in a
> |better implementation of denormals, while Intel has invested in a
> |better implementation of flush-to-zero.
>
> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

A few months ago I looked for slow cases in the handling of subnormals
by fmul/fadd on Zen3, but didn't find any. Maybe I don't know what to look for.

On Skylake, on the other hand, all multiplications with a subnormal result were
slow. Additions with subnormal results were mostly slow too, with the exception of
the case where at least one of the inputs is also subnormal.

Re: Aprupt underflow mode

<u7b8q8$qh6r$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=32940&group=comp.arch#32940

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: terje.mathisen@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Aprupt underflow mode
Date: Mon, 26 Jun 2023 07:47:52 +0200
Organization: A noiseless patient Spider
Lines: 76
Message-ID: <u7b8q8$qh6r$1@dont-email.me>
References: <u78vn4$3keqd$1@newsreader4.netcologne.de>
<2023Jun25.180112@mips.complang.tuwien.ac.at>
<b613de1c-7401-4e0d-959b-d8d51f75abd6n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 26 Jun 2023 05:47:52 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="cecc420af91e20f52a6e6c71c097748a";
logging-data="869595"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/FOaoIvXN8tsdFbidr9W6TUilBQTeXucecXgp8vgKbgA=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Firefox/91.0 SeaMonkey/2.53.16
Cancel-Lock: sha1:v44/oABDKHPYA0ACjJW2dVWwaSo=
In-Reply-To: <b613de1c-7401-4e0d-959b-d8d51f75abd6n@googlegroups.com>

by: Terje Mathisen - Mon, 26 Jun 2023 05:47 UTC

Michael S wrote:
> On Sunday, June 25, 2023 at 7:06:21 PM UTC+3, Anton Ertl wrote:
>> Thomas Koenig <tko...@netcologne.de> writes:
>>> Fortran 2023 will support an aprupt underflow mode for those
>>> processors that support it. It's optional, and there is a hint
>>> that it may be faster.
>>>
>>> So, questions: Does anybody know which CPUs support setting
>>> underflow mode? How much faster would normal execution be
>>> if underflow is not actually used?
>> From <2017Dec2...@mips.complang.tuwien.ac.at>:
>>
>> |>> (I doubt anyone has tried to run a
>> |>> denormal multiplication benchmark on Ryzen to see how it does... or maybe
>> |>> someone has, but I don't know where to look for the results.)
>> |>
>> |>Such a benchmark exists:
>> |>
>> |>https://sco.h-its.org/exelixis/web/software/fpn/index.html
>> |
>> |Ok, here are results for Sandy Bridge (gcc-4.5), and Skylake and Zen (gcc-4.9):
>> |
>> |
>> |        2.7GHz    4GHz      4GHz
>> |        Sandy     Skylake   Zen
>> |        i7-2620   i5-6600K  R5 1600X
>> |micro
>> |Large:  0.051161  0.028260  0.018399
>> |Small:  0.016829  0.012012  0.011826
>> |micro-ignore (with flush-to-zero)
>> |Large:  0.029944  0.012030  0.017606
>> |Small:  0.017007  0.012000  0.011438
>> |
>> |According to the README at
>> |<https://github.com/stamatak/denormalizedFloatingPointNumbers>, the
>> |Large data set produces denormals, but the same number of computations
>> |as the Small data set.
>> |
>> |So for this benchmark denormals are about 3 times slower on Sandy
>> |Bridge, 2.3 times slower on Skylake, and 1.56 times slower on Zen.
>> |Using flush-to-zero reduces this to a factor 1.8 on Sandy Bridge, 1 on
>> |Skylake, and 1.54 on Zen. So it looks like AMD has invested in a
>> |better implementation of denormals, while Intel has invested in a
>> |better implementation of flush-to-zero.
>>
>> - anton
>> --
>> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
>> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>
>
> Few months ago I looked for slow cases of handling of subnormal
> by fmul/fadd on Zen3, but didn't find any. May be, I don't know what to look for.
>
> On Skylake, on the other hand all multiplications with subnormal result were
> slow. Additions with subnormal results were mostly slow too, with exception of
> the case where at least one of the inputs is also subnormal.
>
I think your last point is crucial: Intel must have decided that in the
typical case subnormals occur as part of optimization/zero-finding,
and then the error term will pass from normal to subnormal only once, so
on the next iteration the core sees this subnormal input and starts a
slightly different sequence where subnormal outputs are expected. In the
normal input case the core instead predicts normal output and saves
either power or time by doing this.

Mitch has shown us repeatedly that on any core with FMA, you can do
subnormal inputs and/or outputs with zero cycle cost and very slight
hardware cost, but it is possible that Intel looked at the same and
decided that average power would be better with a one-shot predictor in
the path?
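
A toy model of that hypothesis, as a Python sketch: an error term is halved repeatedly, and a "miss" is counted whenever a subnormal result comes from normal inputs only. This models the conjecture in the paragraphs above, not measured Intel behavior; `count_events` and the starting value are assumptions made for illustration.

```python
import sys

MIN_NORMAL = sys.float_info.min  # 2**-1022, smallest positive normal double


def is_subnormal(x: float) -> bool:
    """True if x is a nonzero double below the normal range."""
    return x != 0.0 and abs(x) < MIN_NORMAL


def count_events(x: float, steps: int):
    """Halve x `steps` times; return (subnormal results, predictor misses).

    A miss is a subnormal result produced from normal inputs only, which is
    the case the hypothetical one-shot predictor would get wrong.
    """
    subnormal_results = 0
    misses = 0
    for _ in range(steps):
        y = x * 0.5
        if is_subnormal(y):
            subnormal_results += 1
            if not is_subnormal(x):  # 0.5 is normal, so only x matters
                misses += 1
        x = y
    return subnormal_results, misses


results, misses = count_events(4.0 * MIN_NORMAL, 10)
print(results, misses)
```

Under this model only the first crossing into the subnormal range is penalized, while a core that is slow on every subnormal result would pay on each of the later iterations as well.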

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Aprupt underflow mode

<u7cde2$10aie$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=32946&group=comp.arch#32946

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Aprupt underflow mode
Date: Mon, 26 Jun 2023 11:12:48 -0500
Organization: A noiseless patient Spider
Lines: 142
Message-ID: <u7cde2$10aie$1@dont-email.me>
References: <u78vn4$3keqd$1@newsreader4.netcologne.de>
<2023Jun25.180112@mips.complang.tuwien.ac.at>
<b613de1c-7401-4e0d-959b-d8d51f75abd6n@googlegroups.com>
<u7b8q8$qh6r$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 26 Jun 2023 16:12:50 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="4b031edd091b9dccb6c70a22fda3986f";
logging-data="1059406"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18IpRxkHgEDQLBltMvkSpNy"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.11.2
Cancel-Lock: sha1:O7G8AJdRzgQOTC4BiTcXcWEqg04=
In-Reply-To: <u7b8q8$qh6r$1@dont-email.me>
Content-Language: en-US

by: BGB - Mon, 26 Jun 2023 16:12 UTC

On 6/26/2023 12:47 AM, Terje Mathisen wrote:
> Michael S wrote:
>> On Sunday, June 25, 2023 at 7:06:21 PM UTC+3, Anton Ertl wrote:
>>> Thomas Koenig <tko...@netcologne.de> writes:
>>>> Fortran 2023 will support an aprupt underflow mode for those
>>>> processors that support it. It's optional, and there is a hint
>>>> that it may be faster.
>>>>
>>>> So, questions: Does anybody know which CPUs support setting
>>>> underflow mode? How much faster would normal execution be
>>>> if underflow is not actually used?
>>> From <2017Dec2...@mips.complang.tuwien.ac.at>:
>>>
>>> |>> (I doubt anyone has tried to run a
>>> |>> denormal multiplication benchmark on Ryzen to see how it does...
>>> or maybe
>>> |>> someone has, but I don't know where to look for the results.)
>>> |>
>>> |>Such a benchmark exists:
>>> |>
>>> |>https://sco.h-its.org/exelixis/web/software/fpn/index.html
>>> |
>>> |Ok, here are results for Sandy Bridge (gcc-4.5), and Skylake and Zen
>>> (gcc-4.9):
>>> |
>>> |
>>> |        2.7GHz    4GHz      4GHz
>>> |        Sandy     Skylake   Zen
>>> |        i7-2620   i5-6600K  R5 1600X
>>> |micro
>>> |Large:  0.051161  0.028260  0.018399
>>> |Small:  0.016829  0.012012  0.011826
>>> |micro-ignore (with flush-to-zero)
>>> |Large:  0.029944  0.012030  0.017606
>>> |Small:  0.017007  0.012000  0.011438
>>> |
>>> |According to the README at
>>> |<https://github.com/stamatak/denormalizedFloatingPointNumbers>, the
>>> |Large data set produces denormals, but the same number of computations
>>> |as the Small data set.
>>> |
>>> |So for this benchmark denormals are about 3 times slower on Sandy
>>> |Bridge, 2.3 times slower on Skylake, and 1.56 times slower on Zen.
>>> |Using flush-to-zero reduces this to a factor 1.8 on Sandy Bridge, 1 on
>>> |Skylake, and 1.54 on Zen. So it looks like AMD has invested in a
>>> |better implementation of denormals, while Intel has invested in a
>>> |better implementation of flush-to-zero.
>>>
>>> - anton
>>> --
>>> 'Anyone trying for "industrial quality" ISA should avoid undefined
>>> behavior.'
>>> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>
>>
>> Few months ago I looked for slow cases of handling of subnormal
>> by fmul/fadd on Zen3, but didn't find any. May be, I don't know what
>> to look for.
>>
>> On Skylake, on the other hand all multiplications with subnormal
>> result were
>> slow. Additions with subnormal results were mostly slow too, with
>> exception of
>> the case where at least one of the inputs is also subnormal.
>>
> I think your last point is crucial: Intel must have decided that in the
> typical case subnormal occurs as part of optimalization/zero-finding,
> and then the error term will only once pass from normal to subnormal, so
> on the next iteration the core sees this subnormal input and starts a
> slightly different sequence where subnormal outputs are expected. In the
> normal input case the core instead predicts normal output and save
> either power or time from doing this.
>
> Mitch have shown us repeatedly that on any core with FMA, you can do
> subnormal inputs and/or output with zero cycle cost and very slight
> hardware cost, but it is possible that Intel looked at the same and
> decided that average power would be better with a one-shot predictor in
> the path?
>

Hmm:
I guess one possibility (for cheaper hardware) could be to not perform
subnormal handling in hardware, but then have a flag in a control
register that tells the FPU that if a subnormal result would be
generated, to raise a fault.

In this case software could handle it manually?... (Such as by emulating
the offending FPU instruction).

I guess related, could have a special exception for "An FPU instruction
was used but there is no FPU" rather than using the generic "Invalid
Opcode" code. Could maybe generate this in cases where FPU is present,
but is set to "Fault on Denormal" or similar.

Though, I would imagine most developers would probably not bother
enabling this (and just stick with DAZ/FTZ if this is the hardware's
default behavior?...).
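
The fault-and-emulate idea above could look roughly like the following Python sketch. Everything here is a model: `DenormalFault`, `hw_fmul`, `sw_fmul`, and the `fault_on_denormal` flag are hypothetical names standing in for the control-register bit and trap handler described, and host arithmetic is used only to detect when the modelled FPU would hit a subnormal.

```python
import math
import sys
from fractions import Fraction

MIN_NORMAL = sys.float_info.min  # 2**-1022, smallest positive normal double


class DenormalFault(Exception):
    """Hypothetical trap: the modelled FPU met a subnormal it cannot handle."""


def is_subnormal(x: float) -> bool:
    """True if x is a nonzero double below the normal range."""
    return x != 0.0 and abs(x) < MIN_NORMAL


def hw_fmul(a: float, b: float, fault_on_denormal: bool) -> float:
    """Toy model of a multiplier with no subnormal hardware."""
    product = a * b  # host arithmetic, used here only to detect the case
    if is_subnormal(a) or is_subnormal(b) or is_subnormal(product):
        if fault_on_denormal:
            raise DenormalFault
        return math.copysign(0.0, product)  # DAZ/FTZ-style fallback
    return product


def sw_fmul(a: float, b: float) -> float:
    """Software fallback: exact rational product, rounded once to a double."""
    return float(Fraction(a) * Fraction(b))


def fmul(a: float, b: float, fault_on_denormal: bool = True) -> float:
    """Multiply, emulating in software when the modelled FPU faults."""
    try:
        return hw_fmul(a, b, fault_on_denormal)
    except DenormalFault:
        return sw_fmul(a, b)


a = math.ldexp(1.0, -600)
b = math.ldexp(1.0, -500)
print(fmul(a, b), fmul(a, b, fault_on_denormal=False))
```

With the flag set, the rare subnormal case takes the slow software path and still yields the IEEE result; with it clear, the model behaves like DAZ/FTZ hardware.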

In other news:
In my case, have effectively ended up reviving the original WEX 2-wide
configuration (with a 4R2W register file), as while it is "not as good"
as the 3-wide, and the cost difference "isn't that big", it does have
the advantage that it can fit more easily into an XC7S50 (leaving some
more LUTs for other stuff) and can still be made to support some "other
useful features" (like Jumbo encodings, 128-bit Load/Store and most
128-bit SIMD ops).

It still has the original limitation (one of the major things that was a
selling point for 3-wide) in that it does not allow store instructions
to be used in bundles.

Eg: "ADD R4, R5, R7 | MOV.L R6, (R4, 16)".
Is basically invalid in this mode.

There are some possible hacks that could be used to free up a register
port for displacement Load/Store, but these may well end up costing more
than just using a 6R2W configuration instead (which does still allow
stuff like this).

As-is, this 2-wide core only saves around 5k LUTs vs 3-wide...
But, this is worthwhile on the "slightly less spacious" FPGA.

As-is, this allows keeping:
96-bit Jumbo Ops
Only a subset of 96-bit encodings are allowed.
128-bit "MOV.X" / Load/Store Pair
128-bit SIMD ops that follow a 2R1W pattern.
128-bit SIMD FMAC would not be possible with this regfile.

Sort of useful, though had to go change some of my ASM code to not
explode in the 2-wide configuration (and fixed some "lint tests" in my
emulator which were "not actually working", ...).

Testing with Doom, I can note that 2-wide vs 3-wide does not seem to
significantly affect overall ILP (though performance is still slightly
worse).

....

Re: Aprupt underflow mode

<c09bf599-a9c5-4927-9913-67e08244c5dcn@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=32948&group=comp.arch#32948

X-Received: by 2002:a05:6214:1865:b0:634:da72:d67e with SMTP id eh5-20020a056214186500b00634da72d67emr300058qvb.8.1687809076150;
Mon, 26 Jun 2023 12:51:16 -0700 (PDT)
X-Received: by 2002:a9d:7696:0:b0:6b5:8a87:fc79 with SMTP id
j22-20020a9d7696000000b006b58a87fc79mr5248949otl.1.1687809075795; Mon, 26 Jun
2023 12:51:15 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 26 Jun 2023 12:51:15 -0700 (PDT)
In-Reply-To: <u7b8q8$qh6r$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2a0d:6fc2:55b0:ca00:b927:8073:f891:4c45;
posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 2a0d:6fc2:55b0:ca00:b927:8073:f891:4c45
References: <u78vn4$3keqd$1@newsreader4.netcologne.de> <2023Jun25.180112@mips.complang.tuwien.ac.at>
<b613de1c-7401-4e0d-959b-d8d51f75abd6n@googlegroups.com> <u7b8q8$qh6r$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <c09bf599-a9c5-4927-9913-67e08244c5dcn@googlegroups.com>
Subject: Re: Aprupt underflow mode
From: already5chosen@yahoo.com (Michael S)
Injection-Date: Mon, 26 Jun 2023 19:51:16 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 5927

by: Michael S - Mon, 26 Jun 2023 19:51 UTC

On Monday, June 26, 2023 at 8:47:56 AM UTC+3, Terje Mathisen wrote:
> Michael S wrote:
> > On Sunday, June 25, 2023 at 7:06:21 PM UTC+3, Anton Ertl wrote:
> >> Thomas Koenig <tko...@netcologne.de> writes:
> >>> Fortran 2023 will support an aprupt underflow mode for those
> >>> processors that support it. It's optional, and there is a hint
> >>> that it may be faster.
> >>>
> >>> So, questions: Does anybody know which CPUs support setting
> >>> underflow mode? How much faster would normal execution be
> >>> if underflow is not actually used?
> >> From <2017Dec2...@mips.complang.tuwien.ac.at>:
> >>
> >> |>> (I doubt anyone has tried to run a
> >> |>> denormal multiplication benchmark on Ryzen to see how it does... or maybe
> >> |>> someone has, but I don't know where to look for the results.)
> >> |>
> >> |>Such a benchmark exists:
> >> |>
> >> |>https://sco.h-its.org/exelixis/web/software/fpn/index.html
> >> |
> >> |Ok, here are results for Sandy Bridge (gcc-4.5), and Skylake and Zen (gcc-4.9):
> >> |
> >> |
> >> |        2.7GHz    4GHz      4GHz
> >> |        Sandy     Skylake   Zen
> >> |        i7-2620   i5-6600K  R5 1600X
> >> |micro
> >> |Large:  0.051161  0.028260  0.018399
> >> |Small:  0.016829  0.012012  0.011826
> >> |micro-ignore (with flush-to-zero)
> >> |Large:  0.029944  0.012030  0.017606
> >> |Small:  0.017007  0.012000  0.011438
> >> |
> >> |According to the README at
> >> |<https://github.com/stamatak/denormalizedFloatingPointNumbers>, the
> >> |Large data set produces denormals, but the same number of computations
> >> |as the Small data set.
> >> |
> >> |So for this benchmark denormals are about 3 times slower on Sandy
> >> |Bridge, 2.3 times slower on Skylake, and 1.56 times slower on Zen.
> >> |Using flush-to-zero reduces this to a factor 1.8 on Sandy Bridge, 1 on
> >> |Skylake, and 1.54 on Zen. So it looks like AMD has invested in a
> >> |better implementation of denormals, while Intel has invested in a
> >> |better implementation of flush-to-zero.
> >>
> >> - anton
> >> --
> >> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> >> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>
> >
> > Few months ago I looked for slow cases of handling of subnormal
> > by fmul/fadd on Zen3, but didn't find any. May be, I don't know what to look for.
> >
> > On Skylake, on the other hand all multiplications with subnormal result were
> > slow. Additions with subnormal results were mostly slow too, with exception of
> > the case where at least one of the inputs is also subnormal.
> >
> I think your last point is crucial: Intel must have decided that in the
> typical case a subnormal occurs as part of optimization/zero-finding,
> and then the error term will pass from normal to subnormal only once, so
> on the next iteration the core sees this subnormal input and starts a
> slightly different sequence where subnormal outputs are expected. In the
> normal-input case the core instead predicts a normal output and saves
> either power or time by doing this.
>

I don't believe that it was a result of deep thought on Intel's part.
They just did what was easy for them to do.
Multiplication on Skylake is equally slow for almost any combination of
subnormal inputs or outputs, with the sole exception of result = zero.
And as far as I remember, there is no rule or common pattern in numeric
computations that says the result of an addition never serves as an operand
to a multiplication :(
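
A concrete instance of that pattern, as a small Python sketch on doubles. The values are picked by hand purely for illustration: two normal numbers are added, the sum is subnormal, and that sum is then an operand of a multiplication.

```python
import sys

TINY = sys.float_info.min   # smallest normal double, 2**-1022

def is_subnormal(x):
    # Nonzero and smaller in magnitude than the smallest normal double.
    return x != 0.0 and abs(x) < TINY

a = 1.5 * TINY              # normal
b = -TINY                   # normal
s = a + b                   # TINY / 2: subnormal result of an addition
p = s * 0.75                # multiplication with a subnormal input and output

if __name__ == "__main__":
    print(is_subnormal(a), is_subnormal(b), is_subnormal(s), is_subnormal(p))
```

This only shows that the data pattern is easy to produce; it says nothing about timing, which has to be measured on the core in question.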

> Mitch has shown us repeatedly that on any core with FMA, you can do
> subnormal inputs and/or outputs with zero cycle cost and very slight
> hardware cost, but it is possible that Intel looked at the same and
> decided that average power would be better with a one-shot predictor in
> the path?
>

That could be.
But it is equally possible that Haswell, Broadwell and Skylake were baked
in such quick succession that the designers had no time to rework parts
of the FPU that are commonly considered unimportant.

It would be interesting to test on Icelake/Tigerlake, where the designers had
more time, and on Alder Lake, where they not only had plenty of time but
also had a new team leader from the very beginning.
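
For anyone repeating the measurement on newer cores: the slowdown factors in the quoted summary are simply the Large/Small runtime ratios from the quoted table. A minimal Python sketch that recomputes them from those numbers (the dictionary layout and the helper name `slowdown` are my own and not part of the benchmark):

```python
# Runtimes copied from the quoted table as (Large, Small) per CPU.
# The data layout and helper name are illustrative assumptions.
micro = {
    "Sandy Bridge": (0.051161, 0.016829),
    "Skylake": (0.028260, 0.012012),
    "Zen": (0.018399, 0.011826),
}
micro_ftz = {
    "Sandy Bridge": (0.029944, 0.017007),
    "Skylake": (0.012030, 0.012000),
    "Zen": (0.017606, 0.011438),
}

def slowdown(table):
    """Return the Large/Small runtime ratio for each CPU."""
    return {cpu: large / small for cpu, (large, small) in table.items()}

if __name__ == "__main__":
    for name, table in (("denormals", micro), ("flush-to-zero", micro_ftz)):
        for cpu, ratio in slowdown(table).items():
            print(f"{name:14s} {cpu:13s} {ratio:.2f}")
```

New results for other cores can be added as further dictionary entries and compared on the same ratio.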

> Terje
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

Re: Aprupt underflow mode

<7467e4f5-f727-4d46-b491-8052af8fbcb2n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=32949&group=comp.arch#32949

Newsgroups: comp.arch
Date: Mon, 26 Jun 2023 12:57:26 -0700 (PDT)
In-Reply-To: <u7cde2$10aie$1@dont-email.me>
From: already5chosen@yahoo.com (Michael S)

by: Michael S - Mon, 26 Jun 2023 19:57 UTC

On Monday, June 26, 2023 at 7:12:54 PM UTC+3, BGB wrote:
> On 6/26/2023 12:47 AM, Terje Mathisen wrote:
> > Michael S wrote:
> >> On Sunday, June 25, 2023 at 7:06:21 PM UTC+3, Anton Ertl wrote:
> >>> Thomas Koenig <tko...@netcologne.de> writes:
> >>>> Fortran 2023 will support an abrupt underflow mode for those
> >>>> processors that support it. It's optional, and there is a hint
> >>>> that it may be faster.
> >>>>
> >>>> So, questions: Does anybody know which CPUs support setting
> >>>> underflow mode? How much faster would normal execution be
> >>>> if underflow is not actually used?
> >>> From <2017Dec2...@mips.complang.tuwien.ac.at>:
> >>>
> >>> |>> (I doubt anyone has tried to run a
> >>> |>> denormal multiplication benchmark on Ryzen to see how it does... or maybe
> >>> |>> someone has, but I don't know where to look for the results.)
> >>> |>
> >>> |>Such a benchmark exists:
> >>> |>
> >>> |>https://sco.h-its.org/exelixis/web/software/fpn/index.html
> >>> |
> >>> |Ok, here are results for Sandy Bridge (gcc-4.5), and Skylake and Zen (gcc-4.9):
> >>> |
> >>> |
> >>> |         2.7GHz    4GHz      4GHz
> >>> |         Sandy     Skylake   Zen
> >>> |         i7-2620   i5-6600K  R5 1600X
> >>> |micro
> >>> |Large:   0.051161  0.028260  0.018399
> >>> |Small:   0.016829  0.012012  0.011826
> >>> |micro-ignore (with flush-to-zero)
> >>> |Large:   0.029944  0.012030  0.017606
> >>> |Small:   0.017007  0.012000  0.011438
> >>> |
> >>> |According to the README at
> >>> |<https://github.com/stamatak/denormalizedFloatingPointNumbers>, the
> >>> |Large data set produces denormals, but the same number of computations
> >>> |as the Small data set.
> >>> |
> >>> |So for this benchmark denormals are about 3 times slower on Sandy
> >>> |Bridge, 2.3 times slower on Skylake, and 1.56 times slower on Zen.
> >>> |Using flush-to-zero reduces this to a factor 1.8 on Sandy Bridge, 1 on
> >>> |Skylake, and 1.54 on Zen. So it looks like AMD has invested in a
> >>> |better implementation of denormals, while Intel has invested in a
> >>> |better implementation of flush-to-zero.
> >>>
> >>> - anton
> >>> --
> >>> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> >>> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>
> >>
> >> A few months ago I looked for slow cases in the handling of subnormals
> >> by fmul/fadd on Zen3, but didn't find any. Maybe I don't know what
> >> to look for.
> >>
> >> On Skylake, on the other hand, all multiplications with subnormal
> >> results were slow. Additions with subnormal results were mostly slow
> >> too, with the exception of the case where at least one of the inputs
> >> is also subnormal.
> >>
> > I think your last point is crucial: Intel must have decided that in the
> > typical case a subnormal occurs as part of optimization/zero-finding,
> > and then the error term will pass from normal to subnormal only once, so
> > on the next iteration the core sees this subnormal input and starts a
> > slightly different sequence where subnormal outputs are expected. In the
> > normal-input case the core instead predicts a normal output and saves
> > either power or time by doing this.
> >
> > Mitch has shown us repeatedly that on any core with FMA, you can do
> > subnormal inputs and/or outputs with zero cycle cost and very slight
> > hardware cost, but it is possible that Intel looked at the same and
> > decided that average power would be better with a one-shot predictor in
> > the path?
> >
> Hmm:
> I guess one possibility (for cheaper hardware) could be to not perform
> subnormal handling in hardware, but then have a flag in a control
> register that tells the FPU that if a subnormal result would be
> generated, to raise a fault.
>

That was done multiple times, and not just on cheap hardware, but also
on pretty expensive HW, like the first couple of generations of DEC Alpha.
But never on x86.
On x86, designers who do not want to do all the work in HW just fire a microtrap
and complete the job in microcode.
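
As a rough software model of the scheme in the quoted text (a control flag that turns a subnormal result into a fault, so that something other than the fast datapath finishes the job), here is a Python sketch. The names `SubnormalTrap`, `fmul` and `fmul_with_emulation` are made up for illustration and no real FPU interface is implied; the "handler" simply returns the value the host already computed.

```python
import sys

class SubnormalTrap(Exception):
    """Hypothetical fault raised when a result lands in the subnormal range."""
    def __init__(self, result):
        super().__init__("subnormal result")
        self.result = result

def is_subnormal(x):
    # Nonzero and smaller in magnitude than the smallest normal double.
    return x != 0.0 and abs(x) < sys.float_info.min

def fmul(a, b, trap_on_subnormal=False):
    """Model of a multiplier whose fast path refuses to produce subnormals."""
    r = a * b
    if trap_on_subnormal and is_subnormal(r):
        raise SubnormalTrap(r)  # hand the case to software or microcode
    return r

def fmul_with_emulation(a, b):
    """Try the fast path first; fall back to a handler on the trap."""
    try:
        return fmul(a, b, trap_on_subnormal=True)
    except SubnormalTrap as trap:
        return trap.result  # stand-in for emulating the instruction
```

The point of the model is only the control flow: the common case never reaches the handler, and the rare case pays for the detour.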

>
> In this case software could handle it manually?... (Such as by emulating
> the offending FPU instruction).
>

Re: Aprupt underflow mode

<780c43c5-d79f-4bfe-91dc-6443b3397325n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=32950&group=comp.arch#32950

Newsgroups: comp.arch
Date: Mon, 26 Jun 2023 21:29:22 -0700 (PDT)
In-Reply-To: <7467e4f5-f727-4d46-b491-8052af8fbcb2n@googlegroups.com>
From: jsavard@ecn.ab.ca (Quadibloc)

by: Quadibloc - Tue, 27 Jun 2023 04:29 UTC

On Monday, June 26, 2023 at 1:57:28 PM UTC-6, Michael S wrote:
> On Monday, June 26, 2023 at 7:12:54 PM UTC+3, BGB wrote:

> > I guess one possibility (for cheaper hardware) could be to not perform
> > subnormal handling in hardware, but then have a flag in a control
> > register that tells the FPU that if a subnormal result would be
> > generated, to raise a fault.

> That was done, multiple times, and not just on cheap hardware, but also
> on pretty expensive HW, like first couple of generations of DEC Alpha.
> But never on x86.

Of course, the reason _why_ this wasn't done on x86, leaving out SSE,
is because in the 8087, floats were always internally stored in "temporary
real" form. So floating-point arithmetic didn't involve denormals; that bit
of IEEE 754 complexity was dealt with and banished during memory loads
and stores, and, indeed, it would have been difficult to _tell_ when a float
entered the denormal range.

The DEC Alpha didn't work that way, so denormals actually cost it cycles,
and could be readily detected and trapped.
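
The effect of a wider internal exponent range can be seen without an x87: a value that is denormal in single precision is an ordinary normal number once it is held in a wider format. The Python sketch below uses the host's 64-bit double as a stand-in for the 80-bit temporary real (an assumption made for convenience; the temporary real format has a still wider exponent range), and the helper names are hypothetical.

```python
import struct
import sys

FLT_MIN = 2.0 ** -126  # smallest normal single-precision value

def f32_from_bits(bits):
    """Decode a 32-bit pattern as an IEEE 754 single, widened to a double."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def is_subnormal_f32(x):
    return x != 0.0 and abs(x) < FLT_MIN

def is_subnormal_f64(x):
    return x != 0.0 and abs(x) < sys.float_info.min

smallest_f32_denormal = f32_from_bits(0x00000001)  # 2**-149

if __name__ == "__main__":
    print(is_subnormal_f32(smallest_f32_denormal))  # denormal as a single
    print(is_subnormal_f64(smallest_f32_denormal))  # normal once widened
```

Values near the bottom of the wider format's own range can still be denormal there, so this illustrates only the narrower formats being absorbed at load time.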

John Savard

Re: Aprupt underflow mode

<u7dr9h$1adv7$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=32951&group=comp.arch#32951

Newsgroups: comp.arch
Date: Tue, 27 Jun 2023 00:15:26 -0500
In-Reply-To: <780c43c5-d79f-4bfe-91dc-6443b3397325n@googlegroups.com>
From: cr88192@gmail.com (BGB)

by: BGB - Tue, 27 Jun 2023 05:15 UTC

On 6/26/2023 11:29 PM, Quadibloc wrote:
> On Monday, June 26, 2023 at 1:57:28 PM UTC-6, Michael S wrote:
>> On Monday, June 26, 2023 at 7:12:54 PM UTC+3, BGB wrote:
>
>>> I guess one possibility (for cheaper hardware) could be to not perform
>>> subnormal handling in hardware, but then have a flag in a control
>>> register that tells the FPU that if a subnormal result would be
>>> generated, to raise a fault.
>
>> That was done, multiple times, and not just on cheap hardware, but also
>> on pretty expensive HW, like first couple of generations of DEC Alpha.
>> But never on x86.
>
> Of course, the reason _why_ this wasn't done on x86, leaving out SSE,
> is because in the 8087, floats were always internally stored in "temporary
> real" form. So floating-point arithmetic didn't involve denormals; that bit
> of IEEE 754 complexity was dealt with and banished during memory loads
> and stores, and, indeed, it would have been difficult to _tell_ when a float
> entered the denormal range.
>
> The DEC Alpha didn't work that way, so denormals actually cost it cycles,
> and could be readily detected and trapped.
>

If it seemed worthwhile in my case, it wouldn't be too hard in principle to
run a signal from the FPU to somewhere where a fault could be raised.

This would probably involve running the signal over to the logic for
managing the L1 cache TLBs, as this also holds some of the relevant
fault-raising logic.

Also the relevant timing-related behavior is similar between the FPU and
L1 cache (from the perspective of the pipeline, in both cases the
exception would be raised in the EX2 stage); relevant mostly for getting
the interrupt to land on the correct instruction.

Though, the main thing is whether there is enough of a real-world use case
to justify the cost and hassle of adding it.

I guess related would be to have emulation traps for unsupported FPU
instructions. Say, if a core lacks FDIV/FSQRT/FMAC/... one could still
fake them by using traps.

....

Re: Aprupt underflow mode

<a6dfef7a-1df1-4535-afab-79852232e8acn@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=32952&group=comp.arch#32952

Newsgroups: comp.arch
Date: Tue, 27 Jun 2023 09:09:45 -0700 (PDT)
In-Reply-To: <u7dr9h$1adv7$1@dont-email.me>
From: MitchAlsup@aol.com (MitchAlsup)

by: MitchAlsup - Tue, 27 Jun 2023 16:09 UTC

On Tuesday, June 27, 2023 at 12:15:33 AM UTC-5, BGB wrote:
> On 6/26/2023 11:29 PM, Quadibloc wrote:
> > On Monday, June 26, 2023 at 1:57:28 PM UTC-6, Michael S wrote:
> >> On Monday, June 26, 2023 at 7:12:54 PM UTC+3, BGB wrote:
> >
> >>> I guess one possibility (for cheaper hardware) could be to not perform
> >>> subnormal handling in hardware, but then have a flag in a control
> >>> register that tells the FPU that if a subnormal result would be
> >>> generated, to raise a fault.
> >
> >> That was done, multiple times, and not just on cheap hardware, but also
> >> on pretty expensive HW, like first couple of generations of DEC Alpha.
> >> But never on x86.
> >
> > Of course, the reason _why_ this wasn't done on x86, leaving out SSE,
> > is because in the 8087, floats were always internally stored in "temporary
> > real" form. So floating-point arithmetic didn't involve denormals; that bit
> > of IEEE 754 complexity was dealt with and banished during memory loads
> > and stores, and, indeed, it would have been difficult to _tell_ when a float
> > entered the denormal range.
> >
> > The DEC Alpha didn't work that way, so denormals actually cost it cycles,
> > and could be readily detected and trapped.
> >
> If it seemed worthwhile in my case, wouldn't be too hard in premise to
> run a signal from the FPU to somewhere where a fault could be raised.
>
> This would probably involve running the signal over to the logic for
> managing the L1 cache TLBs, as this also holds some of the relevant
> fault-raising logic.
<
Nah, just extend the size of the result bus to carry an exception field.
When non-zero, it indicates which FP error was detected. Retire logic
uses this to context switch the front end to a handler if enabled.
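
A toy Python model of that idea, with the exception code travelling alongside the result and only the retire step acting on it. The field names, the code values and the `retire` function are assumptions for illustration, not a description of any real design.

```python
from dataclasses import dataclass
from enum import IntEnum

class FpXcp(IntEnum):
    """Hypothetical exception field; zero means no exception."""
    NONE = 0
    INVALID = 1
    DIV_BY_ZERO = 2
    OVERFLOW = 3
    UNDERFLOW = 4
    INEXACT = 5

@dataclass
class Result:
    value: float
    xcp: FpXcp = FpXcp.NONE  # extra bits carried on the result bus

def retire(result, enabled, handlers):
    """Return the next fetch target: a handler address if the exception
    is enabled, otherwise None (continue with sequential flow)."""
    if result.xcp != FpXcp.NONE and result.xcp in enabled:
        return handlers[result.xcp]
    return None
```

The function units only have to fill in the field; whether anything happens is decided in one place, at retire.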
>
> Also the relevant timing-related behavior is similar between the FPU and
> L1 cache (from the perspective of the pipeline, in both cases the
> exception would be raised in the EX2 stage); relevant mostly for getting
> the interrupt to land on the correct instruction.
>
>
> Though, the main thing is whether there is enough of a use case "in real
> world use-cases" to justify the cost and hassle of adding it.
<
Posits do not have (or need) this feature.
IEEE 754 machines will always have this.
>
> I guess related would be to have emulation traps for unsupported FPU
> instructions. Say, if a core lacks FDIV/FSQRT/FMAC/... one could still
> fake them by using traps.
<
At great expense.
>
> ...

Re: Aprupt underflow mode

<u7f4h6$1eqgv$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=32953&group=comp.arch#32953

Newsgroups: comp.arch
Date: Tue, 27 Jun 2023 11:59:16 -0500
In-Reply-To: <a6dfef7a-1df1-4535-afab-79852232e8acn@googlegroups.com>
From: cr88192@gmail.com (BGB)

by: BGB - Tue, 27 Jun 2023 16:59 UTC

On 6/27/2023 11:09 AM, MitchAlsup wrote:
> On Tuesday, June 27, 2023 at 12:15:33 AM UTC-5, BGB wrote:
>> On 6/26/2023 11:29 PM, Quadibloc wrote:
>>> On Monday, June 26, 2023 at 1:57:28 PM UTC-6, Michael S wrote:
>>>> On Monday, June 26, 2023 at 7:12:54 PM UTC+3, BGB wrote:
>>>
>>>>> I guess one possibility (for cheaper hardware) could be to not perform
>>>>> subnormal handling in hardware, but then have a flag in a control
>>>>> register that tells the FPU that if a subnormal result would be
>>>>> generated, to raise a fault.
>>>
>>>> That was done, multiple times, and not just on cheap hardware, but also
>>>> on pretty expensive HW, like first couple of generations of DEC Alpha.
>>>> But never on x86.
>>>
>>> Of course, the reason _why_ this wasn't done on x86, leaving out SSE,
>>> is because in the 8087, floats were always internally stored in "temporary
>>> real" form. So floating-point arithmetic didn't involve denormals; that bit
>>> of IEEE 754 complexity was dealt with and banished during memory loads
>>> and stores, and, indeed, it would have been difficult to _tell_ when a float
>>> entered the denormal range.
>>>
>>> The DEC Alpha didn't work that way, so denormals actually cost it cycles,
>>> and could be readily detected and trapped.
>>>
>> If it seemed worthwhile in my case, wouldn't be too hard in premise to
>> run a signal from the FPU to somewhere where a fault could be raised.
>>
>> This would probably involve running the signal over to the logic for
>> managing the L1 cache TLBs, as this also holds some of the relevant
>> fault-raising logic.
> <
> Nah, just extend the size of the result bus to carry an exception field.
> When non-zero, it indicates which FP error was detected. Retire logic
> uses this to context switch the front end to a handler if enabled.

There is not currently any logic in place for initiating faults during
the EX3 or WB stages. The closest I have is EX2, but most of the logic
is tied to the L1 caches and TLB (the main source for interrupts,
external interrupt signals are also routed via the TLB handler) so this
is where I would need to route the signal.

The logic in the main pipeline mostly manages initiating an interrupt,
but not really the generation of interrupts. The Mem/L1 ring logic also
holds some of the "keep an interrupt signal held active for
long enough that the pipeline can see and respond to it" logic
(otherwise, if an interrupt signal is generated while the pipeline is
stalled, the ISR dispatch mechanism will not trigger correctly).

So, the interrupt will need to be handled more like one from the L1 or
TLB, which tend to arrive when the pipeline is stalled (in the EX2
stage), which is also where the FPU interrupt would need to be
generated/initiated (if I want much hope of getting the "Saved PC"
register to point at the correct instruction).

But, granted, it is possible that other people have ways of
initiating/handling interrupts that are not an awful mess.

>>
>> Also the relevant timing-related behavior is similar between the FPU and
>> L1 cache (from the perspective of the pipeline, in both cases the
>> exception would be raised in the EX2 stage); relevant mostly for getting
>> the interrupt to land on the correct instruction.
>>
>>
>> Though, the main thing is whether there is enough of a use case "in real
>> world use-cases" to justify the cost and hassle of adding it.
> <
> Posits do not have (or need) this feature.
> IEEE 754 machines will always have this.

Well, for "proper" / "conforming" IEEE 754.
Which, as-is, the BJX2 core is not (and does DAZ/FTZ by default).

As-is, a fully conforming FPU would be too expensive.

Something like "fake fully conforming floating-point semantics via
software emulation" would be possible, but more expensive.

Granted, even with a denormal trap, this would still not address the
"inexact rounding hackery" issues, "which branch of a compare a NaN goes
down", ...

Well, and even then, on a few FPGAs I can barely afford an FPU at all
(or not at all: can't really fit an FPU into a core on an XC7S25 or
similar).

So, it is likely that "full IEEE semantics" would make software
emulation almost inescapable...

>>
>> I guess related would be to have emulation traps for unsupported FPU
>> instructions. Say, if a core lacks FDIV/FSQRT/FMAC/... one could still
>> fake them by using traps.
> <
> At great expense.

For FDIV/FSQRT they would be slower than using a runtime call...
But, nothing exactly new here... (as my attempts to handle these in
hardware border on "boat anchor" territory).

So, there is a reason in my case that these are disabled by default in
my compiler and typically runtime calls are used instead.

Granted, probably 1 kilocycle to handle an FDIV using a trap or similar
would be pushing it... More a sort of "can technically allow binary
compatibility, but don't use it if it can be avoided" feature.

Well, and/or invest the cost in a faster interrupt handling mechanism.

Re: Aprupt underflow mode

<2304c8fb-20cd-40e9-a9e4-0149d4d39a87n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=32954&group=comp.arch#32954

Newsgroups: comp.arch
Date: Tue, 27 Jun 2023 10:14:05 -0700 (PDT)
In-Reply-To: <780c43c5-d79f-4bfe-91dc-6443b3397325n@googlegroups.com>
From: already5chosen@yahoo.com (Michael S)

by: Michael S - Tue, 27 Jun 2023 17:14 UTC

On Tuesday, June 27, 2023 at 7:29:25 AM UTC+3, Quadibloc wrote:
> On Monday, June 26, 2023 at 1:57:28 PM UTC-6, Michael S wrote:
> > On Monday, June 26, 2023 at 7:12:54 PM UTC+3, BGB wrote:
>
> > > I guess one possibility (for cheaper hardware) could be to not perform
> > > subnormal handling in hardware, but then have a flag in a control
> > > register that tells the FPU that if a subnormal result would be
> > > generated, to raise a fault.
>
> > That was done, multiple times, and not just on cheap hardware, but also
> > on pretty expensive HW, like first couple of generations of DEC Alpha.
> > But never on x86.
> Of course, the reason _why_ this wasn't done on x86, leaving out SSE,

Of course, in the post above 'x86' means SSE and later.

x87 is irrelevant because it predates IEEE 754 and, except for the formats of the
data, never had aspirations to be fully compatible with IEEE 754.

DEC Alpha, on the other hand, is relevant because it shipped 7 years after
publication of the first edition of the IEEE 754 Standard and it did claim full
compatibility, even if that had to be achieved by cooperation of HW with
compilers and language run-time support libraries.

> is because in the 8087, floats were always internally stored in "temporary
> real" form. So floating-point arithmetic didn't involve denormals; that bit
> of IEEE 754 complexity was dealt with and banished during memory loads
> and stores, and, indeed, it would have been difficult to _tell_ when a float
> entered the denormal range.
>
> The DEC Alpha didn't work that way, so denormals actually cost it cycles,
> and could be readily detected and trapped.
>
> John Savard

Re: Aprupt underflow mode

<cdHmM.6688$KtZc.1812@fx08.iad>

https://news.novabbs.org/devel/article-flat.php?id=32955&group=comp.arch#32955

Newsgroups: comp.arch
Date: Tue, 27 Jun 2023 15:58:06 -0400
In-Reply-To: <u7dr9h$1adv7$1@dont-email.me>
From: ThatWouldBeTelling@thevillage.com (EricP)

by: EricP - Tue, 27 Jun 2023 19:58 UTC

BGB wrote:
>
> If it seemed worthwhile in my case, wouldn't be too hard in premise to
> run a signal from the FPU to somewhere where a fault could be raised.
>
> This would probably involve running the signal over to the logic for
> managing the L1 cache TLBs, as this also holds some of the relevant
> fault-raising logic.
>
> Also the relevant timing-related behavior is similar between the FPU and
> L1 cache (from the perspective of the pipeline, in both cases the
> exception would be raised in the EX2 stage); relevant mostly for getting
> the interrupt to land on the correct instruction.
>
>
> Though, the main thing is whether there is enough of a use case "in real
> world use-cases" to justify the cost and hassle of adding it.
>
> I guess related would be to have emulation traps for unsupported FPU
> instructions. Say, if a core lacks FDIV/FSQRT/FMAC/... one could still
> fake them by using traps.

Even on the simplest in-order pipelined uArch there can be multiple
exceptions in flight at once in different stages. Fetch can page fault,
Decode can detect an illegal instruction, and the FPU, ALU, MULDIV and LSQ
can each raise faults of their own.
The only one that counts is the oldest, and only when that uOp
reaches WB-Retire. At WB-Retire the retire Instruction Pointer
is up to date and synchronous with the fault.

Remember that pending exceptions in flight can be purged by
an older branch mispredict or older exception.

The trick is getting the right exception information block to coincide
with the faulting uOp when it reaches Retire. The exception block contains
the exception instruction pointer EIP and exception code ECODE,
plus optionally some amount of auxiliary information such as
the page fault address and fault syndrome bits.

The way I did this in my simulator front end is that when Fetch detects a
page fault it marks the parse buffer as faulting and copies the fetch IP
and faulting address and syndrome bits into the instruction buffer.
If Decode sees a faulting parse buffer it generates a fault uOp
with ECODE and aux info and passes it down the pipeline.

Otherwise Decode checks the instruction and, if it is illegal, generates
a fault uOp with ECODE and passes it down the pipeline.
The fault uOps pass through the pipeline as though they were NOPs.

If a normal instruction uOp encounters an exception,
such as a page fault on LD or ST or an FPU fault, then the FU marks
the uOp as such and passes it along as before.

If a faulting uOp reaches WB-Retire then the retire IP must be
the one for this instruction and the uOp has the aux info.
Retire jams an exception handler IP into Fetch and purges the
pipeline just like a mispredicted branch. (There can also be a
Super/User privilege mode change there.)

So in this simplified scenario the cost is running an extra set of
wires carrying the exception handler IP from Retire to Fetch and
triggering a full pipeline flush, plus stashing the exception
information block someplace where the handler can find it later.

If the uArch was OoO or in-order with some concurrent FU's then the main
issue is determining which of the possible multiple exception sources is
the oldest so that you only need to keep the exception info block on it.
Note that it can't use the IP to determine the oldest instruction as
loops can have multiple instances of an IP in flight at once.
(I use my OoO Instruction Queue circular buffer pointers.)

My OoO Instruction Queue entry has only 1 bit XcpPending in it.
The exception manager XcpMgr holds a single copy of the exception
info block for just the oldest faulting instruction.
When Retire sees the IQ entry has XcpPending set it uses the
exception code to calculate a handler address, jams that into Fetch
and flushes the whole pipeline.
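
A rough software sketch of the bookkeeping described above, in Python rather than HDL. The class and method names (`XcpMgr`, `report`, `age`, `retire`) and the queue depth are made up for illustration; only the 1-bit XcpPending per IQ entry, the single exception info block, and the use of circular buffer pointers for age come from the description. Branch-mispredict purging is not modelled.

```python
from dataclasses import dataclass
from typing import Optional

IQ_SIZE = 8  # assumed queue depth, purely illustrative


@dataclass
class XcpInfo:
    iq_index: int  # IQ slot of the faulting instruction
    eip: int       # exception instruction pointer
    ecode: int     # exception code
    aux: int = 0   # e.g. page fault address or syndrome bits


class XcpMgr:
    """Holds one exception info block, for the oldest faulting IQ entry."""

    def __init__(self) -> None:
        self.head = 0                      # IQ slot of the oldest in-flight entry
        self.pending = [False] * IQ_SIZE   # the XcpPending bit of each IQ entry
        self.info: Optional[XcpInfo] = None

    def age(self, iq_index: int) -> int:
        # Distance from the head pointer; a smaller value means older.
        # Comparing IPs would not work, since a loop can have the same IP
        # in flight several times.
        return (iq_index - self.head) % IQ_SIZE

    def report(self, new: XcpInfo) -> None:
        self.pending[new.iq_index] = True
        if self.info is None or self.age(new.iq_index) < self.age(self.info.iq_index):
            self.info = new                # keep only the oldest fault's info

    def retire(self) -> Optional[XcpInfo]:
        """Retire the head entry; return its info block if it faulted."""
        if self.pending[self.head]:
            taken = self.info
            # Full flush: every younger pending fault disappears with it.
            self.pending = [False] * IQ_SIZE
            self.info = None
            return taken
        self.head = (self.head + 1) % IQ_SIZE
        return None
```

When `retire` returns an info block, the caller would compute the handler address from `ecode` and redirect Fetch.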

Re: Aprupt underflow mode

<u7g0do$1hlav$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=32956&group=comp.arch#32956

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Aprupt underflow mode
Date: Tue, 27 Jun 2023 19:55:17 -0500
Organization: A noiseless patient Spider
Lines: 220
Message-ID: <u7g0do$1hlav$1@dont-email.me>
References: <u78vn4$3keqd$1@newsreader4.netcologne.de>
<2023Jun25.180112@mips.complang.tuwien.ac.at>
<b613de1c-7401-4e0d-959b-d8d51f75abd6n@googlegroups.com>
<u7b8q8$qh6r$1@dont-email.me> <u7cde2$10aie$1@dont-email.me>
<7467e4f5-f727-4d46-b491-8052af8fbcb2n@googlegroups.com>
<780c43c5-d79f-4bfe-91dc-6443b3397325n@googlegroups.com>
<u7dr9h$1adv7$1@dont-email.me> <cdHmM.6688$KtZc.1812@fx08.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 28 Jun 2023 00:55:20 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="e5173df34bc459bd426f730bbe40b986";
logging-data="1627487"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/uIm8HcVL22n3O/8m7WZHR"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.12.0
Cancel-Lock: sha1:FaBXDIw8uyVuqN31iTfckuk6uU8=
Content-Language: en-US
In-Reply-To: <cdHmM.6688$KtZc.1812@fx08.iad>

by: BGB - Wed, 28 Jun 2023 00:55 UTC

On 6/27/2023 2:58 PM, EricP wrote:
> BGB wrote:
>>
>> If it seemed worthwhile in my case, wouldn't be too hard in premise to
>> run a signal from the FPU to somewhere where a fault could be raised.
>>
>> This would probably involve running the signal over to the logic for
>> managing the L1 cache TLBs, as this also holds some of the relevant
>> fault-raising logic.
>>
>> Also the relevant timing-related behavior is similar between the FPU
>> and L1 cache (from the perspective of the pipeline, in both cases the
>> exception would be raised in the EX2 stage); relevant mostly for
>> getting the interrupt to land on the correct instruction.
>>
>>
>> Though, the main thing is whether there is enough of a use case "in
>> real world use-cases" to justify the cost and hassle of adding it.
>>
>> I guess related would be to have emulation traps for unsupported FPU
>> instructions. Say, if a core lacks FDIV/FSQRT/FMAC/... one could still
>> fake them by using traps.
>
> Even on the simplest in-order pipelined uArch there can be multiple
> exceptions in flight at once in different stages. Fetch can page fault,
> Decode detect an illegal instruction, FPU, ALU, MULDIV, LSQ.
> The only one that counts is the oldest and only when that uOp
> reaches WB-Retire. At WB-Retire the retire Instruction Pointer
> is up to date and synchronous with the fault.
>
> Remember that pending exceptions in flight can be purged by
> an older branch mispredict or older exception.
>
> The trick is getting the right exception information block to coincide
> with the faulting uOp when it reaches Retire. The exception block contains
> the exception instruction pointer EIP and exception code ECODE,
> plus optionally some amount of auxiliary information such as
> the page fault address and fault syndrome bits.
>

At least in my core, I was typically triggering interrupts relative to
the EX2 stage.

Where, pipeline looks sorta like:
PF IF ID1 ID2 EX1 EX2 EX3 WB

So, an illegal instruction will get detected during decode, and will
turn into an "Invalid Opcode" instruction, with a few sub-cases being
signaled. It hits EX1, which then emits a signal to throw an exception.
When the exception mechanism triggers, the offending instruction is in
the EX2 stage.

For TLB misses or access faults, these also arrive when the instruction
in question is in the EX2 stage. Albeit, the core is typically stalled
at this time, so a mechanism is needed to "hold" the exception active
until the dispatch mechanism can start doing its thing (clearing the
exception once the relevant bits in the Status Register change to
reflect being inside an ISR).

Handling of exceptions or interrupts is mostly "first come, first
serve", so everything following the first-arriving exception is ignored
(and no further exceptions can be handled until the handler finishes).

Once the mechanism triggers, it tries to figure out which pipeline stage
to capture the exception state from (usually EX2 or EX3); goal being
that once the handler returns, one will return to the state of the
instruction which triggered the exception.

Well, except for the SYSCALL instruction, where the handler needs to
adjust the return address to the following instruction. Otherwise,
SYSCALL would turn into an infinite loop of exception dispatching...
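
As a rough illustration of the "hold, first come first serve, then fix up SYSCALL" behaviour described above, here is a small Python sketch. The names (`ExceptionLatch`, `raise_exception`, `dispatch`, `resume_pc`) and the single `in_isr` flag standing in for the Status Register bits are assumptions made for the example, not the actual core's logic.

```python
from typing import Optional


class ExceptionLatch:
    """Holds the first-arriving exception until dispatch can start."""

    def __init__(self) -> None:
        self.held: Optional[int] = None  # exception code currently held
        self.in_isr = False              # stand-in for the Status Register ISR bits

    def raise_exception(self, code: int) -> bool:
        """Return True if the exception was accepted, False if ignored."""
        if self.held is not None or self.in_isr:
            return False                 # later arrivals are dropped
        self.held = code
        return True

    def dispatch(self) -> Optional[int]:
        """Pipeline can move again: enter the ISR and clear the held code."""
        code, self.held = self.held, None
        if code is not None:
            self.in_isr = True
        return code

    def return_from_isr(self) -> None:
        self.in_isr = False


def resume_pc(captured_pc: int, insn_len: int, is_syscall: bool) -> int:
    """Address to return to once the handler finishes.

    Faults return to the instruction that triggered them; SYSCALL has to
    skip itself or it would dispatch again forever.
    """
    return captured_pc + insn_len if is_syscall else captured_pc
```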

The mechanism is all "sort of like a branch, but a lot more evil...".
Well, branches also initiate during the EX2 stage in my case.

Branches are also sort of evil though, as now one has to invalidate
every pipeline stage "in the future" relative to the branch instruction.

The Inter-ISA Branch is a special case partway between a normal branch
and an exception handler. It needs to add a little extra latency (vs a
normal branch) to be sure that the L1 I$ can fetch the instructions in
the correct CPU mode.

Though, can note that early on, the EX3 stage did not exist. The main
reason that EX3 was added was that this allowed for Load/Store to be
pipelined.

Say:
  EX1: Calculate address and hand it off to L1 D$;
    Access L1 Array (Clock Edge)
  EX2: Check hit or miss, stall pipeline on miss;
  EX3: Produce final result (Load) or perform a Store.
    EX3 may forward its data to EX2 if they alias (optional).
      This forwarding helps performance but eats a lot of LUTs.
      Else, we need to stall the pipeline to finish the Store.

Externally, the memory exception cases all arrive while the main
pipeline is stalled. Once an Exception begins, the L1 caches begin a
process to purge whatever request they were sitting on to unstall the
pipeline such that it can start moving again (with a few extra cycles so
that the pipeline can "settle" and the L1 I$ can start fetching the
ISR's entry point).

But, alas, I don't really know how other people had implemented these
mechanisms.

Initiating branches and exceptions during the WB stage is possible, but
I had not considered it.

> The way I did this in my simulator front end is when Fetch detects a
> page fault it marks the parse buffer as faulting and copies the fetch IP
> and faulting address and syndrome bits into the instruction buffer.
> If Decode sees a fault parse buffer it generates a fault uOp
> with ECODE and aux info and passes it down the pipeline.
>
> Otherwise Decode checks the instructions and if it is illegal generates
> a fault uOp with ECODE and passes it down the pipeline.
> The fault uOps pass through the pipeline as though NOPs.
>
> If this was a normal instruction uOp that encounters an exception,
> such as a page fault on LD or ST or FPU fault then FU marks
> the uOp as such and passes it along as before.
>
> If a faulting uOp reaches WB-Retire then the retire IP must be
> the one for this instruction and the uOp has the aux info.
> Retire jams an exception handler IP into Fetch and purges the
> pipeline just like a mispredicted branch. (There can also be a
> Super/User privilege mode change there.)
>
> So in this simplified scenario the cost is running a set of extra
> jump exception handler IP address wires from Retire to Fetch and
> triggering a full pipeline flush. And stashing the exception
> information block someplace so you can find it later in the handler.
>

OK.

I handled it a little differently.

> If the uArch was OoO or in-order with some concurrent FU's then the main
> issue is determining which of the possible multiple exception sources is
> the oldest so that you only need to keep the exception info block on it.
> Note that it can't use the IP to determine the oldest instruction as
> loops can have multiple instances of an IP in flight at once.
> (I use my OoO Instruction Queue circular buffer pointers.)
>
> My OoO Instruction Queue entry has only 1 bit XcpPending in it.
> The exception manager XcpMgr holds a single copy of the exception
> info block for just the oldest faulting instruction.
> When Retire sees the IQ entry has XcpPending set it uses the
> exception code to calculate a handler address, jams that into Fetch
> and flushes the whole pipeline.
>
>

I use a strict in-order pipeline design.

Stuff either advances in lock-step at 1 stage per cycle, or the pipeline
stalls.

Well, except for interlock stalls, where EX1/2/3 and WB advance, but
PF/IF/ID1/ID2 are stalled.

Avoiding an interlock stall mechanism would, however, require that
machine-code include NOPs and similar as needed to account for
instruction latency (where the output value of each pipeline stage also
includes a flag to say whether the value is ready yet).

Well, and for a while had a bug, that I eventually realized was because
if (during a branch) the instruction in ID2 stage triggered an interlock
stall while the branch was initiating, it would cause the branch to fail
to initiate (as the IF stage simply fails to see the branch-target's
updated value for the PC register).

Needed to add logic such that instructions in flushed pipeline stages
can't trigger pipeline interlocks...

There are potentially higher performance designs, but "there be dragons"...

And, "better performance" is in a near constant struggle with "passes
timing" and "doesn't need too much LUT budget".

On the bigger FPGAs, passing timing is a problem.
And, on the smaller FPGAs, it is LUT budget.

Because, say:
XC7S50 vs XC7A100T vs XC7A200T, all at a -1 speed grade, for each time
the FPGA gets bigger, its "net delay" (and thus its ability to easily
pass timing) seemingly gets worse (more so when one makes the overall
design bigger to make use of the bigger LUT budget...).

Granted... I have no idea about the whole "floorplanning" thing, and
have not messed with it. But, apparently some people use this to try to
optimize timing (say, clustering related things close together, rather
than the FPGA tools putting it wherever, and typically failing on paths
that end up meandering across a large part of the FPGA).


Re: Aprupt underflow mode

<15c07e9f-fc84-4b25-9cc7-2234b7a153c7n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=32957&group=comp.arch#32957

X-Received: by 2002:a05:622a:1896:b0:400:9079:6e29 with SMTP id v22-20020a05622a189600b0040090796e29mr720896qtc.4.1687918636853;
Tue, 27 Jun 2023 19:17:16 -0700 (PDT)
X-Received: by 2002:a05:687c:e:b0:1b0:15ca:47b with SMTP id
yf14-20020a05687c000e00b001b015ca047bmr3983519oab.7.1687918636664; Tue, 27
Jun 2023 19:17:16 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 27 Jun 2023 19:17:16 -0700 (PDT)
In-Reply-To: <u7g0do$1hlav$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:e471:4c7c:b2b3:db66;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:e471:4c7c:b2b3:db66
References: <u78vn4$3keqd$1@newsreader4.netcologne.de> <2023Jun25.180112@mips.complang.tuwien.ac.at>
<b613de1c-7401-4e0d-959b-d8d51f75abd6n@googlegroups.com> <u7b8q8$qh6r$1@dont-email.me>
<u7cde2$10aie$1@dont-email.me> <7467e4f5-f727-4d46-b491-8052af8fbcb2n@googlegroups.com>
<780c43c5-d79f-4bfe-91dc-6443b3397325n@googlegroups.com> <u7dr9h$1adv7$1@dont-email.me>
<cdHmM.6688$KtZc.1812@fx08.iad> <u7g0do$1hlav$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <15c07e9f-fc84-4b25-9cc7-2234b7a153c7n@googlegroups.com>
Subject: Re: Aprupt underflow mode
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Wed, 28 Jun 2023 02:17:16 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 3449

by: MitchAlsup - Wed, 28 Jun 2023 02:17 UTC

On Tuesday, June 27, 2023 at 7:55:24 PM UTC-5, BGB wrote:
> On 6/27/2023 2:58 PM, EricP wrote:
> > BGB wrote:
>
> I use a strict in-order pipeline design.
>
> Stuff either advances in lock-step at 1 stage per cycle, or the pipeline
> stalls.
<
This could be a cause of one or more of your speed paths. We found
it difficult in the 1-wide and 2-wide in-order days to take register
specifiers from the instruction, CAM all of the potential results in
the pipeline, find-first the youngest one for each operand register,
and use that as unary select input to the forwarding multiplexer.
Only after you perform all the forwarding do you get to the point
where you can address whether all the potential instructions
actually issued or stalled, and exactly where to cut the pipeline.
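
A behavioural sketch of that operand-forwarding step, in Python. The function name `forward_operand`, the tuple layout, and the "value is None means not ready yet" convention are assumptions for illustration; real hardware does the compare across all entries in parallel rather than in a loop.

```python
from typing import Dict, List, Optional, Tuple

# One in-flight result: (destination register, value or None if not ready).
InFlight = Tuple[int, Optional[int]]


def forward_operand(
    src_reg: int,
    in_flight: List[InFlight],   # ordered youngest first
    regfile: Dict[int, int],
) -> Tuple[Optional[int], bool]:
    """Return (operand value, must_stall) for one source register."""
    for dest, value in in_flight:        # CAM match, then find-first = youngest
        if dest == src_reg:
            if value is None:
                return None, True        # producer found but result not ready
            return value, False          # forward from the pipeline
    return regfile[src_reg], False       # no producer in flight: use register file
```

Note that the stall decision only falls out after the match and select, which is the ordering problem described above.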
>
> Well, except for interlock stalls, where EX1/2/3 and WB advance, but
> PF/IF/ID1/ID2 are stalled.
>
> Avoiding an interlock stall mechanism would, however, require that
> machine-code include NOPs and similar as needed to account for
> instruction latency (where the output value of each pipeline stage also
> includes a flag to say whether the value is ready yet).
>
You can build a less rigid pipeline.
>
> Well, and for a while had a bug, that I eventually realized was because
> if (during a branch) the instruction in ID2 stage triggered an interlock
> stall while the branch was initiating, it would cause the branch to fail
> to initiate (as the IF stage simply fails to see the branch-target's
> updated value for the PC register).
>
> Needed to add logic such that instructions in flushed pipeline stages
> can't trigger pipeline interlocks...
>
Yep, that too.....
>
> There are potentially higher performance designs, but "there be dragons"....
>

Re: Aprupt underflow mode

<ec97396d-6f4c-4ddd-884c-bd1d0457bcc4n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=32958&group=comp.arch#32958

X-Received: by 2002:a05:6214:2dc4:b0:635:e9f6:9470 with SMTP id nc4-20020a0562142dc400b00635e9f69470mr742qvb.5.1687920669236;
Tue, 27 Jun 2023 19:51:09 -0700 (PDT)
X-Received: by 2002:a05:6808:1a06:b0:3a1:ee4f:77ce with SMTP id
bk6-20020a0568081a0600b003a1ee4f77cemr237493oib.1.1687920668894; Tue, 27 Jun
2023 19:51:08 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!1.us.feeder.erje.net!feeder.erje.net!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 27 Jun 2023 19:51:08 -0700 (PDT)
In-Reply-To: <15c07e9f-fc84-4b25-9cc7-2234b7a153c7n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=99.251.79.92; posting-account=QId4bgoAAABV4s50talpu-qMcPp519Eb
NNTP-Posting-Host: 99.251.79.92
References: <u78vn4$3keqd$1@newsreader4.netcologne.de> <2023Jun25.180112@mips.complang.tuwien.ac.at>
<b613de1c-7401-4e0d-959b-d8d51f75abd6n@googlegroups.com> <u7b8q8$qh6r$1@dont-email.me>
<u7cde2$10aie$1@dont-email.me> <7467e4f5-f727-4d46-b491-8052af8fbcb2n@googlegroups.com>
<780c43c5-d79f-4bfe-91dc-6443b3397325n@googlegroups.com> <u7dr9h$1adv7$1@dont-email.me>
<cdHmM.6688$KtZc.1812@fx08.iad> <u7g0do$1hlav$1@dont-email.me> <15c07e9f-fc84-4b25-9cc7-2234b7a153c7n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <ec97396d-6f4c-4ddd-884c-bd1d0457bcc4n@googlegroups.com>
Subject: Re: Aprupt underflow mode
From: robfi680@gmail.com (robf...@gmail.com)
Injection-Date: Wed, 28 Jun 2023 02:51:09 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 4613

by: robf...@gmail.com - Wed, 28 Jun 2023 02:51 UTC

On Tuesday, June 27, 2023 at 10:17:18 PM UTC-4, MitchAlsup wrote:
> On Tuesday, June 27, 2023 at 7:55:24 PM UTC-5, BGB wrote:
> > On 6/27/2023 2:58 PM, EricP wrote:
> > > BGB wrote:
> >
> > I use a strict in-order pipeline design.
> >
> > Stuff either advances in lock-step at 1 stage per cycle, or the pipeline
> > stalls.
> <
> This could be a cause of one or more of your speed paths. We found
> it difficult in the 1-wide and 2-wide in-order days to take register
> specifiers from the instruction, CAM all of the potential results in
> the pipeline, find-first the youngest one for each operand register,
> and use that as unary select input to the forwarding multiplexer.
> Only after you perform all the forwarding do you get to the point
> where you can address whether all the potential instructions
> actually issued or stalled, and exactly where to cut the pipeline.
> >
> > Well, except for interlock stalls, where EX1/2/3 and WB advance, but
> > PF/IF/ID1/ID2 are stalled.
> >
> > Avoiding an interlock stall mechanism would, however, require that
> > machine-code include NOPs and similar as needed to account for
> > instruction latency (where the output value of each pipeline stage also
> > includes a flag to say whether the value is ready yet).
> >
> You can build a less rigid pipeline.
> >
> > Well, and for a while had a bug, that I eventually realized was because
> > if (during a branch) the instruction in ID2 stage triggered an interlock
> > stall while the branch was initiating, it would cause the branch to fail
> > to initiate (as the IF stage simply fails to see the branch-target's
> > updated value for the PC register).
> >
> > Needed to add logic such that instructions in flushed pipeline stages
> > can't trigger pipeline interlocks...
> >
> Yep, that too.....
> >
> > There are potentially higher performance designs, but "there be dragons"...
> >

Thor avoids having to find the oldest exception by recording exceptions
in the pipeline buffers which end up in order at the commit stage.

A nine-bit cause code field in the pipeline buffers gets filled with the cause
of an exception only if it is empty. This tracks the first exception that occurred
on the instruction. The pipeline buffers propagate to the writeback stage
where any cause code set in the buffer is then processed so exception
causes are processed in order. Processing a cause code generally causes a
pipeline flush meaning subsequent exception causes are not processed.
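
A small Python sketch of the "fill only if empty, process in order at writeback" idea. The names (`PipeBuf`, `set_cause`, `writeback`) and the use of 0 as the "empty" encoding are assumptions for the example and are not taken from the Thor sources.

```python
from typing import List, Optional, Tuple

CAUSE_NONE = 0  # assumed encoding for "no exception recorded"


class PipeBuf:
    def __init__(self, pc: int) -> None:
        self.pc = pc
        self.cause = CAUSE_NONE

    def set_cause(self, cause: int) -> None:
        # Only the first exception on an instruction is recorded.
        if self.cause == CAUSE_NONE:
            self.cause = cause & 0x1FF   # nine-bit cause code field


def writeback(buffers: List[PipeBuf]) -> Tuple[List[int], Optional[Tuple[int, int]]]:
    """Process buffers in program order.

    Returns the PCs that committed and, if a cause code was hit, the
    (pc, cause) that triggered the flush. Everything younger is dropped,
    so later cause codes are never processed.
    """
    committed: List[int] = []
    for buf in buffers:
        if buf.cause != CAUSE_NONE:
            return committed, (buf.pc, buf.cause)
        committed.append(buf.pc)
    return committed, None
```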

To handle hardware interrupts, a special IRQ instruction displaces the fetched
instruction(s) then propagates down the pipeline. Until an IRQ is recognized
at the writeback stage, all incoming instructions are replaced with the IRQ
instruction. When the IRQ is processed, the pipeline is flushed, so all the extra
IRQ instructions disappear.
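
The IRQ displacement scheme can be sketched as a toy pipeline model in Python. Everything here (the `simulate` function, the pipeline depth of 4, using a string as the IRQ pseudo-instruction) is hypothetical and only meant to show why the extra IRQ copies are harmless.

```python
from collections import deque
from typing import List, Optional, Tuple

IRQ = "IRQ"  # stand-in for the special IRQ instruction


def simulate(program: List[str], irq_cycle: int, depth: int = 4
             ) -> Tuple[List[str], int, List[Optional[str]]]:
    """Run until the IRQ instruction reaches writeback.

    Returns (instructions retired, index of the next instruction that was
    never fetched, contents of the pipeline that get flushed).
    """
    pipe: deque = deque([None] * depth)   # left = fetch end, right = writeback end
    pc = 0
    retired: List[str] = []
    cycle = 0
    while True:
        if cycle >= irq_cycle:
            incoming: Optional[str] = IRQ  # displaces whatever would be fetched
        elif pc < len(program):
            incoming = program[pc]
            pc += 1
        else:
            incoming = None
        pipe.appendleft(incoming)
        at_wb = pipe.pop()
        if at_wb == IRQ:
            return retired, pc, list(pipe)  # flush: the extra IRQs vanish here
        if at_wb is not None:
            retired.append(at_wb)
        cycle += 1
```

Instructions fetched before the interrupt still retire normally, and execution after the handler would resume at the returned index.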

Re: Aprupt underflow mode

<u7gha9$1mms4$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=32959&group=comp.arch#32959

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: terje.mathisen@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Aprupt underflow mode
Date: Wed, 28 Jun 2023 07:43:36 +0200
Organization: A noiseless patient Spider
Lines: 40
Message-ID: <u7gha9$1mms4$1@dont-email.me>
References: <u78vn4$3keqd$1@newsreader4.netcologne.de>
<2023Jun25.180112@mips.complang.tuwien.ac.at>
<b613de1c-7401-4e0d-959b-d8d51f75abd6n@googlegroups.com>
<u7b8q8$qh6r$1@dont-email.me> <u7cde2$10aie$1@dont-email.me>
<7467e4f5-f727-4d46-b491-8052af8fbcb2n@googlegroups.com>
<780c43c5-d79f-4bfe-91dc-6443b3397325n@googlegroups.com>
<2304c8fb-20cd-40e9-a9e4-0149d4d39a87n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 28 Jun 2023 05:43:37 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="0fba0abdfc18a55c1a7caea16ecb86d2";
logging-data="1792900"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+oc5mOZlVlvGaCS2P3wYP71p8YuUtvLx7vuFbLZvAAWg=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Firefox/91.0 SeaMonkey/2.53.16
Cancel-Lock: sha1:Cd0P1sdq/Mowen9rOx1RpistUlo=
In-Reply-To: <2304c8fb-20cd-40e9-a9e4-0149d4d39a87n@googlegroups.com>

by: Terje Mathisen - Wed, 28 Jun 2023 05:43 UTC

Michael S wrote:
> On Tuesday, June 27, 2023 at 7:29:25 AM UTC+3, Quadibloc wrote:
>> On Monday, June 26, 2023 at 1:57:28 PM UTC-6, Michael S wrote:
>>> On Monday, June 26, 2023 at 7:12:54 PM UTC+3, BGB wrote:
>>
>>>> I guess one possibility (for cheaper hardware) could be to not perform
>>>> subnormal handling in hardware, but then have a flag in a control
>>>> register that tells the FPU that if a subnormal result would be
>>>> generated, to raise a fault.
>>
>>> That was done, multiple times, and not just on cheap hardware, but also
>>> on pretty expensive HW, like first couple of generations of DEC Alpha.
>>> But never on x86.
>> Of course, the reason _why_ this wasn't done on x86, leaving out SSE,
>
> Of course, in the post above 'x86' means SSE and later.
>
> x87 is irrelevant because it predates IEEE 754 and, except for formats of the
> data, never had aspirations to be fully compatible with IEEE 754.

Please check your history!

The 8087 was the inspiration for 754, i.e. the 754-1985 standards
development started with the '87, then added a few tweaks which got
included in versions of the FPU which were released after that point in
time, i.e. the 80387. From Wikipedia
<https://en.wikipedia.org/wiki/Intel_8087>:

IEEE floating-point standard
When Intel designed the 8087, it aimed to make a standard floating-point
format for future designs. An important aspect of the 8087 from a
historical perspective was that it became the basis for the IEEE 754
floating-point standard.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Aprupt underflow mode

<u7gmr5$1n5ge$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=32960&group=comp.arch#32960

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Aprupt underflow mode
Date: Wed, 28 Jun 2023 02:17:56 -0500
Organization: A noiseless patient Spider
Lines: 103
Message-ID: <u7gmr5$1n5ge$1@dont-email.me>
References: <u78vn4$3keqd$1@newsreader4.netcologne.de>
<2023Jun25.180112@mips.complang.tuwien.ac.at>
<b613de1c-7401-4e0d-959b-d8d51f75abd6n@googlegroups.com>
<u7b8q8$qh6r$1@dont-email.me> <u7cde2$10aie$1@dont-email.me>
<7467e4f5-f727-4d46-b491-8052af8fbcb2n@googlegroups.com>
<780c43c5-d79f-4bfe-91dc-6443b3397325n@googlegroups.com>
<u7dr9h$1adv7$1@dont-email.me> <cdHmM.6688$KtZc.1812@fx08.iad>
<u7g0do$1hlav$1@dont-email.me>
<15c07e9f-fc84-4b25-9cc7-2234b7a153c7n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 28 Jun 2023 07:17:57 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="e5173df34bc459bd426f730bbe40b986";
logging-data="1807886"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+B9n4lni/ZanqJXjsvREDU"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.12.0
Cancel-Lock: sha1:ywIJ1vF3UPRvMRqmFRNd+g0XCQU=
Content-Language: en-US
In-Reply-To: <15c07e9f-fc84-4b25-9cc7-2234b7a153c7n@googlegroups.com>

by: BGB - Wed, 28 Jun 2023 07:17 UTC

On 6/27/2023 9:17 PM, MitchAlsup wrote:
> On Tuesday, June 27, 2023 at 7:55:24 PM UTC-5, BGB wrote:
>> On 6/27/2023 2:58 PM, EricP wrote:
>>> BGB wrote:
>>
>> I use a strict in-order pipeline design.
>>
>> Stuff either advances in lock-step at 1 stage per cycle, or the pipeline
>> stalls.
> <
> This could be a cause of one or more of your speed paths. We found
> it difficult in the 1-wide and 2-wide in-order days to take register
> specifiers from the instruction, CAM all of the potential results in
> the pipeline, find-first the youngest one for each operand register,
> and use that as unary select input to the forwarding multiplexer.
> Only after you perform all the forwarding do you get to the point
> where you can address whether all the potential instructions
> actually issued or stalled, and exactly where to cut the pipeline.

The register forwarding logic seems to be one of those things that
scales poorly.

But there isn't really an obvious way around it short of adding a
similar restriction to some VLIW DSP where one can't use the value in
the register as an input until the whole bundle has passed out of the
pipeline (trying to use a register before it leaves the pipeline
resulting in a stale value).

As I saw it though, forwarding and interlocks were "necessary evils" for
general usability (short of the compiler needing to emit a bunch of NOPs
and similar).

>>
>> Well, except for interlock stalls, where EX1/2/3 and WB advance, but
>> PF/IF/ID1/ID2 are stalled.
>>
>> Avoiding an interlock stall mechanism would, however, require that
>> machine-code include NOPs and similar as needed to account for
>> instruction latency (where the output value of each pipeline stage also
>> includes a flag to say whether the value is ready yet).
>>
> You can build a less rigid pipeline.

Apparently MIPS originally went for a pipeline without interlock
handling; and then re-added it not long after...

I had noted that paths affecting the stall signals and also the SR.T bit
weigh heavily in terms of timing.

I had before considered adding "crash zones" into the pipeline, which
could allow delaying stall signals by 1 cycle.

Say:
PF IF ID CZ
RF CZ
EX1 EX2 EX3 WB

Where, say, if a stall or interlock is detected, the results of a stage
can be temporarily shunted into a crash-zone, and the crash-zone stage
is used as an input to the next stage before that stage begins moving
again (with a 1-cycle constant delay).

My guess is that (besides the effort of doing so) this would likely
result in an increase in LUT cost.
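
One way to read the crash-zone idea is as a one-entry skid register between stages. The Python model below is only my interpretation under that assumption: the upstream stage sees the stall one cycle late, so the one value it produces "too late" lands in the crash zone and is drained first when the downstream stage moves again. The function name `run` and its arguments are invented for the example.

```python
from typing import List, Optional, Set


def run(items: List[int], stall_cycles: Set[int], n_cycles: int) -> List[int]:
    """Return what the downstream stage receives, in order."""
    src = 0                       # index of the next item upstream will produce
    crash: Optional[int] = None   # the crash-zone register (one entry)
    stall_delayed = False         # stall as seen by upstream, one cycle late
    received: List[int] = []
    for cycle in range(n_cycles):
        stalled = cycle in stall_cycles
        produced: Optional[int] = None
        # Upstream acts on last cycle's stall, so it still produces one
        # value during the first cycle of a new stall.
        if not stall_delayed and src < len(items):
            produced = items[src]
            src += 1
        if stalled:
            if produced is not None:
                crash = produced          # shunt into the crash zone
        elif crash is not None:
            received.append(crash)        # drain the crash zone first
            crash = None
        elif produced is not None:
            received.append(produced)     # normal pass-through
        stall_delayed = stalled
    return received
```

In this model nothing is lost or duplicated as long as the stall reaches the upstream stage exactly one cycle later; a longer delay would need a deeper crash zone.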

>>
>> Well, and for a while had a bug, that I eventually realized was because
>> if (during a branch) the instruction in ID2 stage triggered an interlock
>> stall while the branch was initiating, it would cause the branch to fail
>> to initiate (as the IF stage simply fails to see the branch-target's
>> updated value for the PC register).
>>
>> Needed to add logic such that instructions in flushed pipeline stages
>> can't trigger pipeline interlocks...
>>
> Yep, that too.....

For a while, I thought it was a bug specific to a Load+Branch, but then
found that it could happen with other ops with a 3-cycle latency.

It took an annoyingly long time to realize that it was due to interlocks
with an instruction following the branch.

So, say:
MOV.L (SP, 40), R4
BRA lbl
ADD R4, R2, R4
Would trigger the bug.

Though, there were some ugly workaround hacks until I figured out what
was going on here and added a more proper fix...
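
Roughly, the fix amounts to gating the interlock check on whether the stage has been flushed. A Python sketch of that check follows; the function name `interlock_stall`, its arguments, and the use of register names as strings are made up for illustration and do not reflect the actual Verilog.

```python
from typing import Iterable, Set


def interlock_stall(srcs: Iterable[str], pending_dests: Set[str],
                    stage_flushed: bool) -> bool:
    """Should the instruction in ID2 hold up the front end?

    srcs: source registers of the instruction in ID2.
    pending_dests: registers whose results are still in flight in EX1..EX3.
    stage_flushed: True if ID2 holds an instruction being discarded by a branch.
    """
    if stage_flushed:
        return False              # a discarded instruction must not stall anything
    return any(reg in pending_dests for reg in srcs)


# The sequence above: the load to R4 is still in flight, the branch is
# initiating, and the ADD behind it sits in a flushed ID2.
pending = {"R4"}
add_sources = ("R4", "R2")
stall_with_fix = interlock_stall(add_sources, pending, stage_flushed=True)
stall_without_fix = interlock_stall(add_sources, pending, stage_flushed=False)
```

Without the `stage_flushed` gate the ADD requests a stall at exactly the moment the branch needs the front end to pick up the new PC.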

>>
>> There are potentially higher performance designs, but "there be dragons"...
>>
>

Re: Aprupt underflow mode

<u7hm5i$1qepg$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=32961&group=comp.arch#32961

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Aprupt underflow mode
Date: Wed, 28 Jun 2023 11:12:31 -0500
Organization: A noiseless patient Spider
Lines: 225
Message-ID: <u7hm5i$1qepg$1@dont-email.me>
References: <u78vn4$3keqd$1@newsreader4.netcologne.de>
<2023Jun25.180112@mips.complang.tuwien.ac.at>
<b613de1c-7401-4e0d-959b-d8d51f75abd6n@googlegroups.com>
<u7b8q8$qh6r$1@dont-email.me> <u7cde2$10aie$1@dont-email.me>
<7467e4f5-f727-4d46-b491-8052af8fbcb2n@googlegroups.com>
<780c43c5-d79f-4bfe-91dc-6443b3397325n@googlegroups.com>
<u7dr9h$1adv7$1@dont-email.me> <cdHmM.6688$KtZc.1812@fx08.iad>
<u7g0do$1hlav$1@dont-email.me>
<15c07e9f-fc84-4b25-9cc7-2234b7a153c7n@googlegroups.com>
<u7gmr5$1n5ge$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 28 Jun 2023 16:12:34 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="e5173df34bc459bd426f730bbe40b986";
logging-data="1915696"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/hqn2/bNSVpFH0nMKaYOdR"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.12.0
Cancel-Lock: sha1:yAF7xnnjqIr6GhYrS6TAseyK+Hw=
Content-Language: en-US
In-Reply-To: <u7gmr5$1n5ge$1@dont-email.me>

by: BGB - Wed, 28 Jun 2023 16:12 UTC

On 6/28/2023 2:17 AM, BGB wrote:
> On 6/27/2023 9:17 PM, MitchAlsup wrote:
>> On Tuesday, June 27, 2023 at 7:55:24 PM UTC-5, BGB wrote:
>>> On 6/27/2023 2:58 PM, EricP wrote:
>>>> BGB wrote:
>>>
>>> I use a strict in-order pipeline design.
>>>
>>> Stuff either advances in lock-step at 1 stage per cycle, or the pipeline
>>> stalls.
>> <
>> This could be a cause of one or more of your speed paths. We found
>> it difficult in the 1-wide and 2-wide in-order days to take register
>> specifiers from the instruction, CAM all of the potential results in
>> the pipeline, find-first the youngest one for each operand register,
>> and use that as unary select input to the forwarding multiplexer.
>> Only after you perform all the forwarding do you get to the point
>> where you can address whether all the potential instructions
>> actually issued or stalled, and exactly where to cut the pipeline.
>
> The register forwarding logic seems to be one of those things that
> scales poorly.
>
>
> But there isn't really an obvious way around it short of adding a
> similar restriction to some VLIW DSP where one can't use the value in
> the register as an input until the whole bundle has passed out of the
> pipeline (trying to use a register before it leaves the pipeline
> resulting in a stale value).
>
> As I saw it though, forwarding and interlocks were "necessary evils" for
> general usability (short of the compiler needing to emit a bunch of NOPs
> and similar).
>
>
>>>
>>> Well, except for interlock stalls, where EX1/2/3 and WB advance, but
>>> PF/IF/ID1/ID2 are stalled.
>>>
>>> Avoiding an interlock stall mechanism would, however, require that
>>> machine-code include NOPs and similar as needed to account for
>>> instruction latency (where the output value of each pipeline stage also
>>> includes a flag to say whether the value is ready yet).
>>>
>> You can build a less rigid pipeline.
>
> Apparently MIPS originally went for a pipeline without interlock
> handling; and then re-added it not long after...
>
> I had noted that paths affecting the stall signals and also the SR.T bit
> weigh heavily in terms of timing.
>

As I can note:
The SR.T bit is used to predicate instructions in the EX1 stage.
Another option would be to handle it in ID1 (along with the register
fetch), however this would require operations like CMPxx to effectively
have a 2-cycle latency (since the ALU emits a comparison-result in EX2,
and would effectively require an interlock check on updates to the SR.T
bit).

Say, if:
Predicated ops exist in ID2/RF
And, EX1 is a CMPxx or similar
Trigger an interlock stall.
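
A minimal sketch of that stall condition, written as a Python model rather than HDL. The `Op` record and its two flags are hypothetical names for illustration, not anything from the actual core:

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    predicated: bool = False  # assumed flag: op executes only if SR.T matches
    writes_srt: bool = False  # assumed flag: CMPxx-style op that updates SR.T


def needs_srt_interlock(op_in_rf: Op, op_in_ex1: Op) -> bool:
    """Stall when a predicated op sits in ID2/RF while the op in EX1
    is about to update SR.T (its result is not ready until EX2)."""
    return op_in_rf.predicated and op_in_ex1.writes_srt


cmp_op = Op("CMPEQ", writes_srt=True)
pred_add = Op("ADD (predicated)", predicated=True)
plain_add = Op("ADD")

print(needs_srt_interlock(pred_add, cmp_op))   # True
print(needs_srt_interlock(plain_add, cmp_op))  # False
```

The check only looks at two pipeline slots, which is where the hoped-for fanout reduction would come from.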

This could potentially significantly reduce SR.T related fanout, so
could potentially be worthwhile (since it wouldn't have all the logic in
the EX1 stage hanging off of it).

The EX2 and EX3 stages use the predication results from EX1 stage (the
remapped opcode is carried over to the following stages).

In this case, predication is handled by remapping the opcode based on
the status:
  Most ops: Normal op or NOP
  Branch: Branch or No-Branch
    Branch: Do a branch to destination depending on Branch Predictor
    No-Branch: Branch to the following instruction if needed.
      Branch predictor predicted a branch: Branch to next op.
      Else: NOP
      Uses same target as the new LR for function calls.

The instructions still end up triggering any interlocks as-if they had
been executed.

But, if I were doing it again, I might be more tempted to handle the
predication in ID2 along with the normal registers (well, or maybe call
this stage 'RF' instead).

Does leave open an issue for how to most efficiently handle branches in
this case (assuming the existing ALU latency):
CMPEQ + BT/BF : 2c + 2/8c
SUB + BRNE: 2c + 2/8c
(SUB latency still an issue)

One can argue for RISC-V style compare-and-branch, but this still either
has adverse effects on timing (current implementation) or requires
initiating a branch at a later pipeline stage (and thus adding branch
latency).

So, say:
  PF
    L1 I$ is given "current" PC.
  IF
    Do L1 I$ stuff.
    Figure out instruction lengths and bundling.
  ID
    Decode instructions.
    Deal with branch predictor.
    Causes a 2c minimum branch latency.
    Flags IF stage as "skipped" following a taken branch.
      (Else, branch-delay-slot behavior results)
  RF
    Fetch registers, forwarding and interlocks
    (Change) Handle predication here.
  EX1
    (As-is) Handle predication here.
    ALU and CMP do their thing
    (As-is) Branch initiated here.
  EX2
    ALU results ready.
    SR.T and SR.S updated based on ALU result.
    (As-is) Branch mechanism takes hold.
  EX3
    (Change) Branch initiated here.
  WB
    Results stored to register file.
    (Change) Branch mechanism takes hold.

Possibly, could try to make the branch and exception-handling logic more
consistent with each other (trying to treat exception dispatch more like
a more advanced form of a subroutine call).

Would still need some way for the pipeline to signal to the L1 caches
that an exception is pending and they should unstick themselves.

....

More extreme could be to try to come up with a design where one does not
need to have stall handling in the EX stages. Instead, one could drop
instructions into EX1 whenever the inputs are ready.

One idea I had imagined before is that whenever a register is being
written to (in ID2/RF) it could be marked as "in flight" and then marked
as "clean" once the final result hits the WB stage.

Though, while this idea could avoid the cost of register interlocks and
forwarding, it would have the drawback of causing all instructions to
effectively have a 3-cycle latency (short of having some mechanism for
instructions to skip directly to WB).
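
The "in flight" / "clean" marking could be sketched as a small scoreboard. This is a Python model under assumed names (`Scoreboard`, `can_issue`, a 64-entry register file), not a description of any existing core:

```python
class Scoreboard:
    """One 'in flight' bit per register (the register count is an assumption)."""

    def __init__(self, num_regs: int = 64):
        self.in_flight = [False] * num_regs

    def can_issue(self, srcs, dst=None) -> bool:
        # An op may drop into EX1 only if none of its registers is in flight.
        regs = list(srcs) + ([dst] if dst is not None else [])
        return not any(self.in_flight[r] for r in regs)

    def issue(self, dst: int) -> None:
        # Mark the destination as in flight when the op leaves ID2/RF.
        self.in_flight[dst] = True

    def writeback(self, dst: int) -> None:
        # Mark the destination as clean once the result reaches WB.
        self.in_flight[dst] = False


sb = Scoreboard()
sb.issue(4)                       # e.g. a load into R4 is issued
print(sb.can_issue([4, 2], 4))    # False: R4 is still in flight
sb.writeback(4)
print(sb.can_issue([4, 2], 4))    # True
```

With no forwarding, a dependent op waits for `writeback`, which is where the effective 3-cycle latency comes from.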

If I went this route, might make sense to allow bundled instructions to
"split up" in the RF stage, and effectively travel down a 2 or 3 wide EX
pipe, with each having a "forward skip" path at each stage (jumping
forward if the final EX stage holds a NOP or similar).

High-latency ops would either need to have a longer total number of EX
pipeline stages, or some way to "divert elsewhere" and then re-join the
EX pipeline at a later time.

This would be very different from my existing core though.

>
> I had before considered adding "crash zones" into the pipeline, which
> could allow delaying stall signals by 1 cycle.
>
> Say:
> PF IF ID CZ
>            RF CZ
>               EX1 EX2 EX3 WB
>
> Where, say, if a stall or interlock is detected, the results of a stage
> can be temporarily shunted into a crash-zone, and the crash-zone stage
> is used as an input the next stage before that stage begins moving again
> (with a 1-cycle constant delay).
>
> My guess is that (besides the effort of doing so) this would likely result in
> an increase in LUT cost.
>
>
>>>
>>> Well, and for a while had a bug, that I eventually realized was because
>>> if (during a branch) the instruction in ID2 stage triggered an interlock
>>> stall while the branch was initiating, it would cause the branch to fail
>>> to initiate (as the IF stage simply fails to see the branch-target's
>>> updated value for the PC register).
>>>
>>> Needed to add logic such that instructions in flushed pipeline stages
>>> can't trigger pipeline interlocks...
>>>
>> Yep, that too.....
>
> For a while, I thought it was a bug specific to a Load+Branch, but then
> found that it could happen with other ops with a 3-cycle latency.
>
> It took an annoyingly long time to realize that it was due to interlocks
> with an instruction following the branch.
>
> So, say:
> MOV.L (SP, 40), R4
> BRA    lbl
> ADD    R4, R2, R4
> Would trigger the bug.
>
> Though, there were some ugly workaround hacks until I figured out what
> was going on here and added a more proper fix...
>
>
>
>>>
>>> There are potentially higher performance designs, but "there be
>>> dragons"...
>>>
>>
>

Re: Aprupt underflow mode

<AkZmM.121085$Zq81.64234@fx15.iad>

https://news.novabbs.org/devel/article-flat.php?id=32962&group=comp.arch#32962

Newsgroups: comp.arch

by: EricP - Wed, 28 Jun 2023 16:35 UTC

BGB wrote:
> On 6/27/2023 2:58 PM, EricP wrote:
>> BGB wrote:
>>>
>>> If it seemed worthwhile in my case, wouldn't be too hard in premise
>>> to run a signal from the FPU to somewhere where a fault could be raised.
>>>
>>> This would probably involve running the signal over to the logic for
>>> managing the L1 cache TLBs, as this also holds some of the relevant
>>> fault-raising logic.
>>>
>>> Also the relevant timing-related behavior is similar between the FPU
>>> and L1 cache (from the perspective of the pipeline, in both cases the
>>> exception would be raised in the EX2 stage); relevant mostly for
>>> getting the interrupt to land on the correct instruction.
>>>
>>>
>>> Though, the main thing is whether there is enough of a use case "in
>>> real world use-cases" to justify the cost and hassle of adding it.
>>>
>>> I guess related would be to have emulation traps for unsupported FPU
>>> instructions. Say, if a core lacks FDIV/FSQRT/FMAC/... one could
>>> still fake them by using traps.
>>
>> Even on the simplest in-order pipelined uArch there can be multiple
>> exceptions in flight at once in different stages. Fetch can page fault,
>> Decode detect an illegal instruction, FPU, ALU, MULDIV, LSQ.
>> The only one that counts is the oldest and only when that uOp
>> reaches WB-Retire. At WB-Retire the retire Instruction Pointer
>> is up to date and synchronous with the fault.
>>
>> Remember that pending exceptions in flight can be purged by
>> an older branch mispredict or older exception.
>>
>> The trick is getting the right exception information block to coincide
>> with the faulting uOp when it reaches Retire. The exception block
>> contains the exception instruction pointer EIP and exception code ECODE,
>> plus optionally some amount of auxiliary information such as
>> the page fault address and fault syndrome bits.
>>
>
> At least in my core, I was typically triggering interrupts relative to
> the EX2 stage.

I use the terms exception, interrupt, and error handling
as distinct functions.

Exceptions are defined by the ISA, internal, synchronous and precise.
There are two kinds of exception: faults and traps.
Faults restore the instruction's starting state and leave the RIP
pointing at the faulting instruction,
while traps trigger after the state and RIP have been updated.

So a memory access violation is a page fault and syscall is a trap.

>
> Where, pipeline looks sorta like:
> PF IF ID1 ID2 EX1 EX2 EX3 WB
>
>
> So, an illegal instruction will get detected during decode, and will
> turn into an "Invalid Opcode" instruction, with a few sub-cases being
> signaled. It hits EX1, which then emits a signal to throw an exception.
> When the exception mechanism triggers, the offending instruction is in
> the EX2 stage.
>
> For TLB misses or access faults, these also arrive when the instruction
> in question is in the EX2 stage. Albeit, the core is typically stalled
> at this time, so a mechanism is needed to "hold" the exception active
> until the dispatch mechanism can start doing its thing (clearing the
> exception once the relevant bits in the Status Register change to
> reflecting being inside an ISR).
>
> Handling of exceptions or interrupts is mostly "first come, first
> serve", so everything following the first-arriving exception is ignored
> (and no further exceptions can be handled until the handler finishes).
>
> Once the mechanism triggers, it tries to figure out which pipeline stage
> to capture the exception state from (usually EX2 or EX3); goal being
> that once the handler returns, one will return to the state of the
> instruction which triggered the exception.
>
> Well, except for the SYSCALL instruction, where the handler needs to
> adjust the return address to the following instruction. Otherwise,
> SYSCALL would turn into an infinite loop of exception dispatching...

Yeah, this is the advantage of doing exceptions at WB-Retire where
the decision to update registers and increment the RIP happens.
It is one stop shopping for faults (no WB, no RIP increment) and
traps (WB (if any), RIP increment), then snapshot RIP and flags,
flush pipeline, change state, redirect fetch.
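
A rough Python sketch of that retire-time decision. The `Uop` record, the `handler_base` value and the way the handler address is derived from the exception code are all placeholders for illustration:

```python
from dataclasses import dataclass


@dataclass
class Uop:
    rip: int              # address of the instruction
    length: int           # instruction length in bytes
    kind: str = "normal"  # "normal", "fault" or "trap"
    ecode: int = 0        # exception code carried with the uOp


def retire(uop: Uop, handler_base: int = 0x1000):
    """Return (saved_rip, writeback_allowed, next_fetch_rip).

    handler_base + ecode is a made-up handler address scheme.
    """
    if uop.kind == "fault":
        # Fault: no WB, no RIP increment; the saved RIP points at the
        # faulting instruction so it can be re-executed.
        return uop.rip, False, handler_base + uop.ecode
    next_rip = uop.rip + uop.length
    if uop.kind == "trap":
        # Trap: WB (if any) and RIP increment happen first, then dispatch.
        return next_rip, True, handler_base + uop.ecode
    return None, True, next_rip  # ordinary instruction, no exception


print(retire(Uop(0x100, 4, "fault", ecode=2)))  # (256, False, 4098)
print(retire(Uop(0x100, 4, "trap", ecode=3)))   # (260, True, 4099)
```

In this model a syscall needs no special case: it is just a trap, so the saved RIP already points at the following instruction.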

It just seems to me to simplify the whole thing as you don't have to
worry about a phase delay between EX2, EX3 and WB and weird shit like
syscall needing to trigger an exception but also needing to increment RIP
so it has to both stall at EX3 and fork so that RIP increments correctly at WB.
This should be simple.

> The mechanism is all "sort of like a branch, but a lot more evil...".
> Well, also branches also initiate during the EX2 stage in my case.
>
> Branches are also sort of evil though, as now one has to invalidate
> every pipeline stage "in the future" relative to the branch instruction.

I have a Valid bit on each in-order front end uOp stage register so
a flush just resets all the earlier Valid bits. It also halts the Fetch
state machine until someone tells it what to do (continue, jump).

For example, Decode sees a "BR reg" and it knows that Fetch continued
sequentially so it issues a minor pipeline flush to invalidate the fetch
stage register and halt the fetcher. Later Branch Control Unit (BCU)
processes the "BR reg", forwards the new RIP to fetch and restarts it.

Before BCU redirects Fetch, since the fetch buffer is not-valid Decode
just spits out a not-valid uOp each clock and fills the pipeline with
a bubble.
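
As a toy Python model of the Valid-bit flush (stage names and method names here are assumptions for the sketch, not my simulator's actual code):

```python
class FrontEnd:
    STAGES = ["IF", "ID", "RF"]  # assumed in-order front-end stage registers

    def __init__(self):
        self.valid = {s: True for s in self.STAGES}
        self.fetch_halted = False
        self.rip = 0

    def flush(self, from_stage: str) -> None:
        """Reset the Valid bits of all stages earlier than from_stage
        and halt the fetcher until it is told what to do."""
        idx = self.STAGES.index(from_stage)
        for s in self.STAGES[:idx]:
            self.valid[s] = False
        self.fetch_halted = True

    def redirect(self, new_rip: int) -> None:
        """Later, the branch unit forwards the new RIP and restarts Fetch."""
        self.rip = new_rip
        self.fetch_halted = False


fe = FrontEnd()
fe.flush("ID")          # Decode saw "BR reg": invalidate the fetch stage
print(fe.valid)         # {'IF': False, 'ID': True, 'RF': True}
fe.redirect(0x2000)     # branch unit supplies the target
print(fe.fetch_halted)  # False
```

While `fetch_halted` is set, the not-valid entries simply travel down the pipeline as bubbles.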

> The Inter-ISA Branch is a special case partway between a normal branch
> and an exception handler. It needs to add a little extra latency (vs a
> normal branch) to be sure that the L1 I$ can fetch the instructions in
> the correct CPU mode.
>
>
> Though, can note that early on, the EX3 stage did not exist. The main
> reason that EX3 was added was that this allowed for Load/Store to be
> pipelined.
>
> Say:
> EX1: Calculate address and hand it off to L1 D$;
> Access L1 Array (Clock Edge)
> EX2: Check hit or miss, stall pipeline on miss;
> EX3: Produce final result (Load) or perform a Store.
> EX3 may forward its data to EX2 if they alias (optional).
> This forwarding helps performance but eats a lot of LUTs.
> Else, we need to stall the pipeline to finish the Store.

??? Why doesn't WB initiate the store?
I'm confused why EX3 even exists.

> Externally, the memory exception cases all arrive while the main
> pipeline is stalled. Once an Exception begins, the L1 caches begin a
> process to purge whatever request they were sitting on to unstall the
> pipeline such that it can start moving again (with a few extra cycles so
> that the pipeline can "settle" and the L1 I$ can start fetching the
> ISR's entry point).

Sorry but... eeewwww :-)

> But, alas, I don't really know how other people had implemented these
> mechanisms.
>
> Initiating branches and exceptions during the WB stage is possible, but
> I had not considered it.

I didn't mean branches, just exceptions.
Delaying branches to WB would increase the branch bubble.

Branches know what their new fetch address is so they don't have to wait
for a stable state. Assuming they are processed at EX1 then just reset
the stage Valid bits backwards to IF, redirect IF and restart it if halted.

>> The way I did this in my simulator front end is when Fetch detects a
>> page fault it marks the parse buffer as faulting and copies the fetch IP
>> and faulting address and syndrome bits into the instruction buffer.
>> If Decode sees a fault parse buffer it generates a fault uOp
>> with ECODE and aux info and passes it down the pipeline.
>>
>> Otherwise Decode checks the instructions and if it is illegal generates
>> a fault uOp with ECODE and passes it down the pipeline.
>> The fault uOps pass through the pipeline as though NOPs.
>>
>> If this was a normal instruction uOp that encounters an exception,
>> such as a page fault on LD or ST or FPU fault then FU marks
>> the uOp as such and passes it along as before.
>>
>> If a faulting uOp reaches WB-Retire then the retire IP must be
>> the one for this instruction and the uOp has the aux info.
>> Retire jams an exception handler IP into Fetch and purges the
>> pipeline just like a mispredicted branch. (There can also be a
>> Super/User privilege mode change there.)
>>
>> So in this simplified scenario the cost is running a set of extra
>> jump exception handler IP address wires from Retire to Fetch and
>> triggering a full pipeline flush. And stashing the exception
>> information block someplace so you can find it later in the handler.
>>
>
> OK.
>
> I handled it a little differently.
>
>
>> If the uArch was OoO or in-order with some concurrent FU's then the main
>> issue is determining which of the possible multiple exception sources is
>> the oldest so that you only need to keep the exception info block on it.
>> Note that it can't use the IP to determine the oldest instruction as
>> loops can have multiple instances of an IP in flight at once.
>> (I use my OoO Instruction Queue circular buffer pointers.)
>>
>> My OoO Instruction Queue entry has only 1 bit XcpPending in it.
>> The exception manager XcpMgr holds a single copy of the exception
>> info block for just the oldest faulting instruction.
>> When Retire sees the IQ entry has XcpPending set it uses the
>> exception code to calculate a handler address, jams that into Fetch
>> and flushes the whole pipeline.
>>
>>
>
> I use a strict in-order pipeline design.
>
> Stuff either advances in lock-step at 1 stage per cycle, or the pipeline
> stalls.
>
> Well, except for interlock stalls, where EX1/2/3 and WB advance, but
> PF/IF/ID1/ID2 are stalled.
>
> Avoiding an interlock stall mechanism would, however, require that
> machine-code include NOPs and similar as needed to account for
> instruction latency (where the output value of each pipeline stage also
> includes a flag to say whether the value is ready yet).

Re: Aprupt underflow mode

<d138dab4-a19a-4e0b-98af-780aa50ecdf4n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=32963&group=comp.arch#32963

Newsgroups: comp.arch

by: Michael S - Wed, 28 Jun 2023 16:52 UTC

On Wednesday, June 28, 2023 at 8:43:41 AM UTC+3, Terje Mathisen wrote:
> Michael S wrote:
> > On Tuesday, June 27, 2023 at 7:29:25 AM UTC+3, Quadibloc wrote:
> >> On Monday, June 26, 2023 at 1:57:28 PM UTC-6, Michael S wrote:
> >>> On Monday, June 26, 2023 at 7:12:54 PM UTC+3, BGB wrote:
> >>
> >>>> I guess one possibility (for cheaper hardware) could be to not perform
> >>>> subnormal handling in hardware, but then have a flag in a control
> >>>> register that tells the FPU that if a subnormal result would be
> >>>> generated, to raise a fault.
> >>
> >>> That was done, multiple times, and not just on cheap hardware, but also
> >>> on pretty expensive HW, like first couple of generations of DEC Alpha.
> >>> But never on x86.
> >> Of course, the reason _why_ this wasn't done on x86, leaving out SSE,
> >
> > Of course, in the post above 'x86' means SSE and later.
> >
> > x87 is irrelevant because it predates IEEE 754 and, except for formats of the
> > data, never had aspirations to be fully compatible with IEEE 754.
> Please check your history!
>
> The 8087 was the inspiration for 754, i.e. the 754-1985 standards
> development started with the '87, then added a few tweaks which got
> included in versions of the FPU which was released after that point in
> time, i.e. the 80387. From Wikipedia
> <https://en.wikipedia.org/wiki/Intel_8087>:
>
> IEEE floating-point standard
> When Intel designed the 8087, it aimed to make a standard floating-point
> format for future designs. An important aspect of the 8087 from a
> historical perspective was that it became the basis for the IEEE 754
> floating-point standard.
> Terje
>
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

There is no contradiction: x87 served as the inspiration for IEEE 754-85;
however, after the standard was published, developers of the next generations
of x87-compatible FPUs had no aspirations to provide optional standard
compliance.

According to my semi-educated estimate, providing such an option would
not have been too much of a burden on the i486 silicon budget. Despite that,
they didn't do it. Maybe because they were forward-looking and knew that in
the coming Pentium, due to its fully pipelined FPU, any complication
*would be* a significant burden. Or, more likely, because they didn't
consider it important.

Re: Aprupt underflow mode

<u7i0r1$1rhn2$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=32964&group=comp.arch#32964

Newsgroups: comp.arch

by: Terje Mathisen - Wed, 28 Jun 2023 19:14 UTC

Michael S wrote:
> On Wednesday, June 28, 2023 at 8:43:41 AM UTC+3, Terje Mathisen wrote:
>> Michael S wrote:
>>> On Tuesday, June 27, 2023 at 7:29:25 AM UTC+3, Quadibloc wrote:
>>>> On Monday, June 26, 2023 at 1:57:28 PM UTC-6, Michael S wrote:
>>>>> On Monday, June 26, 2023 at 7:12:54 PM UTC+3, BGB wrote:
>>>>
>>>>>> I guess one possibility (for cheaper hardware) could be to not perform
>>>>>> subnormal handling in hardware, but then have a flag in a control
>>>>>> register that tells the FPU that if a subnormal result would be
>>>>>> generated, to raise a fault.
>>>>
>>>>> That was done, multiple times, and not just on cheap hardware, but also
>>>>> on pretty expensive HW, like first couple of generations of DEC Alpha.
>>>>> But never on x86.
>>>> Of course, the reason _why_ this wasn't done on x86, leaving out SSE,
>>>
>>> Of course, in the post above 'x86' means SSE and later.
>>>
>>> x87 is irrelevant because it predates IEEE 754 and, except for formats of the
>>> data, never had aspirations to be fully compatible with IEEE 754.
>> Please check your history!
>>
>> The 8087 was the inspiration for 754, i.e. the 754-1985 standards
>> development started with the '87, then added a few tweaks which got
>> included in versions of the FPU which was released after that point in
>> time, i.e. the 80387. From Wikipedia
>> <https://en.wikipedia.org/wiki/Intel_8087>:
>>
>> IEEE floating-point standard
>> When Intel designed the 8087, it aimed to make a standard floating-point
>> format for future designs. An important aspect of the 8087 from a
>> historical perspective was that it became the basis for the IEEE 754
>> floating-point standard.
>> Terje
>>
>>
>> --
>> - <Terje.Mathisen at tmsw.no>
>> "almost all programming can be viewed as an exercise in caching"
>
> There is no contradiction: x87 served as the inspiration for IEEE 754-85;
> however, after the standard was published, developers of the next generations
> of x87-compatible FPUs had no aspirations to provide optional standard
> compliance.
>
> According to my semi-educated estimate, providing such an option would
> not have been too much of a burden on the i486 silicon budget. Despite that,
> they didn't do it. Maybe because they were forward-looking and knew that in
> the coming Pentium, due to its fully pipelined FPU, any complication
> *would be* a significant burden. Or, more likely, because they didn't
> consider it important.

What specifically is your beef? The 80387 was in fact fully compliant
with the 754 spec as it existed at the time. Yes, it was different from
pretty much all others due to the internal support for extended/80-bit
float, but that is still explicitly allowed by the standard.

OTOH, since SSE, the Intel/AMD FPU has reverted to something much closer
to industry norm, making things like Java's requirement for bit-by-bit
identical results from all operations much more feasible.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Aprupt underflow mode

<0fd56f3f-fd72-4dba-905a-22640fc847c6n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=32965&group=comp.arch#32965

Newsgroups: comp.arch

by: robf...@gmail.com - Wed, 28 Jun 2023 19:36 UTC

On Wednesday, June 28, 2023 at 3:18:01 AM UTC-4, BGB wrote:
> On 6/27/2023 9:17 PM, MitchAlsup wrote:
> > On Tuesday, June 27, 2023 at 7:55:24 PM UTC-5, BGB wrote:
> >> On 6/27/2023 2:58 PM, EricP wrote:
> >>> BGB wrote:
> >>
> >> I use a strict in-order pipeline design.
> >>
> >> Stuff either advances in lock-step at 1 stage per cycle, or the pipeline
> >> stalls.
> > <
> > This could be a cause on one or more of your speed paths. We found
> > it difficult in the 1-wide and 2-wide in-order days to take register
> > specifiers from the instruction, CAM all of the potential results in
> > the pipeline, find-first the youngest one for each operand register,
> > and use that as unary select input to the forwarding multiplexer.
> > Only after you perform all the forwarding do you get to the point
> > where you can address whether all the potential instructions
> > actually issued or stalled, and exactly where to cut the pipeline.
> The register forwarding logic seems to be one of those things that
> scales poorly.

If you can accept that the forwarding logic will be what limits the
design, then performance can be improved somewhat by adding
more complex instructions as long as it is within the timing limit.

For instance, three-source-operand instructions, fewer clocks for
multiplies and divides, and combo instructions such as
shift plus a second op.

Thor milestone reached: bluescreen displayed in FPGA.
>
>
> But there isn't really an obvious way around it short of adding a
> similar restriction to some VLIW DSP where one can't use the value in
> the register as an input until the whole bundle has passed out of the
> pipeline (trying to use a register before it leaves the pipeline
> resulting in a stale value).
>
> As I saw it though, forwarding and interlocks were "necessary evils" for
> general usability (short of the compiler needing to emit a bunch of NOPs
> and similar).
> >>
> >> Well, except for interlock stalls, where EX1/2/3 and WB advance, but
> >> PF/IF/ID1/ID2 are stalled.
> >>
> >> Avoiding an interlock stall mechanism would, however, require that
> >> machine-code include NOPs and similar as needed to account for
> >> instruction latency (where the output value of each pipeline stage also
> >> includes a flag to say whether the value is ready yet).
> >>
> > You can build a less rigid pipeline.
> Apparently MIPS originally went for a pipeline without interlock
> handling; and then re-added it not long after...
>
> I had noted that paths affecting the stall signals and also the SR.T bit
> weigh heavily in terms of timing.
>
>
> I had before considered adding "crash zones" into the pipeline, which
> could allow delaying stall signals by 1 cycle.
>
> Say:
> PF IF ID CZ
>            RF CZ
>               EX1 EX2 EX3 WB
>
> Where, say, if a stall or interlock is detected, the results of a stage
> can be temporarily shunted into a crash-zone, and the crash-zone stage
> is used as an input the next stage before that stage begins moving again
> (with a 1-cycle constant delay).
>
> My guess is that (besides the effort of doing so) this would likely result in
> an increase in LUT cost.
> >>
> >> Well, and for a while had a bug, that I eventually realized was because
> >> if (during a branch) the instruction in ID2 stage triggered an interlock
> >> stall while the branch was initiating, it would cause the branch to fail
> >> to initiate (as the IF stage simply fails to see the branch-target's
> >> updated value for the PC register).
> >>
> >> Needed to add logic such that instructions in flushed pipeline stages
> >> can't trigger pipeline interlocks...
> >>
> > Yep, that too.....
> For a while, I thought it was a bug specific to a Load+Branch, but then
> found that it could happen with other ops with a 3-cycle latency.
>
> It took an annoyingly long time to realize that it was due to interlocks
> with an instruction following the branch.
>
> So, say:
> MOV.L (SP, 40), R4
> BRA lbl
> ADD R4, R2, R4
> Would trigger the bug.
>
> Though, there were some ugly workaround hacks until I figured out what
> was going on here and added a more proper fix...
> >>
> >> There are potentially higher performance designs, but "there be dragons"...
> >>
> >
