Rocksolid Light - comp.arch - Re: Benefit vs cost of zero-cycle register moves

Benefit vs cost of zero-cycle register moves

<umva2r$1qlp8$1@newsreader4.netcologne.de>

https://news.novabbs.org/devel/article-flat.php?id=36471&group=comp.arch#36471

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2a0a-a540-594-0-bad0-cab6-b9f6-b3a2.ipv6dyn.netcologne.de!not-for-mail
From: tkoenig@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Benefit vs cost of zero-cycle register moves
Date: Mon, 1 Jan 2024 21:16:11 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <umva2r$1qlp8$1@newsreader4.netcologne.de>
Injection-Date: Mon, 1 Jan 2024 21:16:11 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2a0a-a540-594-0-bad0-cab6-b9f6-b3a2.ipv6dyn.netcologne.de:2a0a:a540:594:0:bad0:cab6:b9f6:b3a2";
logging-data="1922856"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)

by: Thomas Koenig - Mon, 1 Jan 2024 21:16 UTC

AFAIK, modern Intel, AMD and ARM CPUs do not forward register moves
to their execution units; they are done directly via register renaming,
up to a certain limit. This will, of course, decrease latencies,
especially on an OoO machine.
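
As a rough illustration of what "done directly via register renaming" means, here is a minimal sketch of a rename table in which a move only copies a mapping. The class and method names are made up for this example, and it leaves out everything a real renamer needs beyond the mapping itself.

```python
class Renamer:
    """Toy register renamer; all names here are illustrative."""

    def __init__(self, arch_regs):
        # Each architectural register starts out mapped to its own physical register.
        self.map = {r: i for i, r in enumerate(arch_regs)}
        self.next_phys = len(arch_regs)
        self.issued = []  # micro-ops that still need an execution unit

    def _alloc(self):
        p = self.next_phys
        self.next_phys += 1
        return p

    def op(self, name, dst, *srcs):
        # Ordinary instruction: read source mappings, allocate a fresh destination.
        phys_srcs = [self.map[s] for s in srcs]
        p = self._alloc()
        self.map[dst] = p
        self.issued.append((name, p, phys_srcs))

    def mov(self, dst, src):
        # Eliminated move: dst simply aliases src's physical register.
        # Nothing is sent to the execution units.
        self.map[dst] = self.map[src]


r = Renamer(["r0", "r1", "r2"])
r.op("add", "r1", "r0", "r0")
r.mov("r2", "r1")              # handled entirely in the rename table
r.op("add", "r0", "r2", "r2")
print(r.issued)
```

The second add reads the physical register written by the first one, so the move adds no link to the dependency chain. A real design also has to track how many architectural names share a physical register before it can free it, which this sketch ignores.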

POWER is an exception (surprising to me); a dependency in an
MR instruction will introduce two cycles of latency, the usual
latency for an arithmetic instruction (also on Power10, I measured
that today).

So, what are the tradeoffs? Will a zero-cycle register move make
the pipeline deeper?

Re: Benefit vs cost of zero-cycle register moves

<af6c6d42bc22b54ab040c442c742ca61@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=36476&group=comp.arch#36476

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Benefit vs cost of zero-cycle register moves
Date: Mon, 1 Jan 2024 23:59:35 +0000
Organization: novaBBS
Message-ID: <af6c6d42bc22b54ab040c442c742ca61@news.novabbs.com>
References: <umva2r$1qlp8$1@newsreader4.netcologne.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="1946856"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Rslight-Site: $2y$10$IRKK/fhlcPir3Ud4bj5v8eyDqABvDHL/Vn21fF5ld5GuJf3yfJo1.
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
X-Spam-Checker-Version: SpamAssassin 4.0.0

by: MitchAlsup - Mon, 1 Jan 2024 23:59 UTC

Thomas Koenig wrote:

> AFAIK, modern Intel, AMD and ARM CPUs do not forward register moves
> to their execution units; they are done directly via register renaming,
> up to a certain limit. This will, of course, decrease latencies,
> especially on an OoO machine.

> POWER is an exception (surprising to me); a dependency in an
> MR instruction will introduce two cycles of latency, the usual
> latency for an arithmetic instruction (also on Power10, I measured
> that today).

> So, what are the tradeoffs? Will a zero-cycle register move make
> the pipeline deeper?

If you have 3 stages of register rename in your pipeline you can 0-cycle
MOVs (equivalent to 4-5 stages between Fetch and Issue).

If you have a thinner Decode pipeline (say 1 cycle) you cannot.

There is also a dependency on the style of register file you have.

A CAM read decoder with a binary write decoder cannot perform MOVs in
0-cycles, whereas reading the RF after reservation station launch can.

Mostly, whether MOVs take 0 cycles or not makes little difference to
performance when the depth of the execution window is 16+ cycles, or
when calculation latency takes multiple cycles (FP) or the code incurs
memory latency (pointer chasing, high cache-miss rates).

Also note: x86 has more MOV instructions than most RISCs.

Re: Benefit vs cost of zero-cycle register moves

<2024Jan2.115140@mips.complang.tuwien.ac.at>

https://news.novabbs.org/devel/article-flat.php?id=36492&group=comp.arch#36492

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Benefit vs cost of zero-cycle register moves
Date: Tue, 02 Jan 2024 10:51:40 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 55
Distribution: world
Message-ID: <2024Jan2.115140@mips.complang.tuwien.ac.at>
References: <umva2r$1qlp8$1@newsreader4.netcologne.de>
Injection-Info: dont-email.me; posting-host="eb8a18fb12ab0cb8cb8a72ff805001d5";
logging-data="2790790"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+5JGoI131Bht/hDC4ZN0/5"
Cancel-Lock: sha1:8kpDmCpOlOAg5WWnzxJl9wKi5W4=
X-newsreader: xrn 10.11

by: Anton Ertl - Tue, 2 Jan 2024 10:51 UTC

Thomas Koenig <tkoenig@netcologne.de> writes:
>AFAIK, modern Intel, AMD and ARM CPUs do not forward register moves
>to their execution units; they are done directly via register renaming,
>up to a certain limit.

The limit in recent CPUs seems to be the width of the register renamer
(6 on Golden Cove and Zen3). For Golden Cove, that optimization
includes constant adds in the range -1024..1023 with the intermediate
sum not exceeding -4096..4095.
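
One way to picture the two ranges is a rename entry that carries a small pending offset next to the physical register number. The sketch below is only such a mental model: the two ranges are the ones given above, while the entry layout and the function name are assumptions and not a description of how Golden Cove actually implements it.

```python
IMM_MIN, IMM_MAX = -1024, 1023      # per-instruction immediate range quoted above
SUM_MIN, SUM_MAX = -4096, 4095      # range for the accumulated offset quoted above


def fold_add_imm(entry, imm):
    """Return (new_entry, folded). entry is (phys_reg, pending_offset).

    Hypothetical model: the renamer keeps a small pending offset next to the
    physical register number instead of issuing the add.
    """
    phys, offset = entry
    total = offset + imm
    if IMM_MIN <= imm <= IMM_MAX and SUM_MIN <= total <= SUM_MAX:
        return (phys, total), True
    return entry, False  # caller must issue a real ALU op instead


entry = (7, 0)
folded_count = 0
for _ in range(6):
    entry, ok = fold_add_imm(entry, 1000)
    if not ok:
        break
    folded_count += 1
print(entry, folded_count)
```

In this model a run of "add 1000" instructions folds until the accumulated offset would leave the allowed range, at which point an ordinary ALU operation is needed.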

>This will, of course, decrease latencies,
>especially on an OoO machine.
>
>POWER is an exception (surprising to me); a dependency in an
>MR instruction will introduce two cycles of latency, the usual
>latency for an arithmetic instruction (also on Power10, I measured
>that today).

Two cycles of latency for arithmetic instructions like integer adds?
Ouch!

>So, what are the tradeoffs? Will a zero-cycle register move make
>the pipeline deeper?

Pipeline depths have not been published for Intel and AMD CPUs in
recent years. ARM publishes its pipeline lengths. One could compare
the last ARM of a line without this feature to the first with this
feature, and get an indication whether it made the pipeline deeper.

The main tradeoff seems to be in putting the effort in to implement
this optimization. Even Gracemont (the current Intel E-Core) can
perform 5 dependent moves (but not constant adds) in one cycle, so it
probably does not cost much area or energy compared to its benefits.
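
For a feeling of the size of the effect on a pure chain of dependent moves, here is a back-of-the-envelope model. It is a sketch under strong assumptions: the function is hypothetical, it ignores everything else in the pipeline, and the two parameter values are simply the figures mentioned in this thread.

```python
import math


def chain_cycles(n_moves, latency=None, rename_width=None):
    """Estimate cycles for a chain of n dependent register moves.

    Exactly one of the two models applies:
      latency      -- moves go through the ALU, each waits for the previous one
      rename_width -- moves are eliminated at rename, so only rename
                      throughput limits the chain
    """
    if (latency is None) == (rename_width is None):
        raise ValueError("give exactly one of latency or rename_width")
    if latency is not None:
        return n_moves * latency
    return math.ceil(n_moves / rename_width)


executed = chain_cycles(1000, latency=2)         # 2-cycle move, as reported for POWER
eliminated = chain_cycles(1000, rename_width=5)  # 5 dependent moves/cycle, as for Gracemont
print(executed, eliminated)
```

Real code rarely consists of long move chains, so the gap seen in practice is much smaller than this extreme case suggests.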

My guess is that Power10 is designed more for throughput computing
where lots of instruction-level parallelism is available so you can
live with long latencies (fill it with independent instructions),
while Intel, AMD, ARM and Apple design also for code where latency
plays a bigger role. As expressed in the LaTeX benchmark (lower is
better) <https://www.complang.tuwien.ac.at/franz/latex-bench>:

Power 10 (3900 MHz) AlmaLinux 9.2 TeX Live 2020                  0.468
Core i3-1315U, Gracemont 2600MHz, Ub.22.04 texlive-latex-base    0.388
Apple M1 Firestorm 3000MHz Asahi Linux Debian pre12              0.27
Core i3-1315U, Golden Cove 3800MHz, Ub.22.04 texlive-latex-base  0.221
Ryzen 7 5800X, 4800MHz, Debian 11 (64-bit) texlive-latex-base    0.191
Xeon W-1370P (=Core i7-11700K), 5200MHz, Debian 11 (64-bit)      0.175

I.e., a current Intel E-Core running (for unknown reasons) 700MHz
below its nominal speed is faster on this benchmark than Power10.
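
To separate clock speed from work per cycle, the table can be multiplied out. This is a sketch that assumes the last column is a time in seconds and that the MHz figure is the clock actually reached during the run; neither is stated above, so the result is only a rough indication.

```python
# (label, clock in MHz, benchmark result) copied from the table above
results = [
    ("Power10", 3900, 0.468),
    ("Gracemont", 2600, 0.388),
    ("M1 Firestorm", 3000, 0.27),
    ("Golden Cove", 3800, 0.221),
    ("Ryzen 7 5800X", 4800, 0.191),
    ("Xeon W-1370P", 5200, 0.175),
]

# time [s] * clock [MHz] = millions of cycles, if the time is in seconds
mcycles = {name: mhz * t for name, mhz, t in results}

for name, value in sorted(mcycles.items(), key=lambda kv: kv[1]):
    print(f"{name:15s} {value:8.1f}")
```

Under those assumptions Power10 needs roughly twice as many cycles for this benchmark as the other cores listed, which fits the guess that it is tuned for throughput rather than single-thread latency.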

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Benefit vs cost of zero-cycle register moves

<un0u76$1rm4h$1@newsreader4.netcologne.de>

https://news.novabbs.org/devel/article-flat.php?id=36494&group=comp.arch#36494

Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!3.eu.feeder.erje.net!feeder.erje.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2a0a-a540-594-0-5856-a450-2160-967b.ipv6dyn.netcologne.de!not-for-mail
From: tkoenig@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Benefit vs cost of zero-cycle register moves
Date: Tue, 2 Jan 2024 12:05:58 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <un0u76$1rm4h$1@newsreader4.netcologne.de>
References: <umva2r$1qlp8$1@newsreader4.netcologne.de>
<2024Jan2.115140@mips.complang.tuwien.ac.at>
Injection-Date: Tue, 2 Jan 2024 12:05:58 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2a0a-a540-594-0-5856-a450-2160-967b.ipv6dyn.netcologne.de:2a0a:a540:594:0:5856:a450:2160:967b";
logging-data="1955985"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)

by: Thomas Koenig - Tue, 2 Jan 2024 12:05 UTC

Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
> Thomas Koenig <tkoenig@netcologne.de> writes:
>>AFAIK, modern Intel, AMD and ARM CPUs do not forward register moves
>>to their execution units; they are done directly via register renaming,
>>up to a certain limit.
>
> The limit in recent CPUs seems to be the width of the register renamer
> (6 on Golden Cove and Zen3). For Golden Cove, that optimization
> includes constant adds in the range -1024..1023 with the intermediate
> sum not exceeding -4096..4095.
>
>>This will, of course, decrease latencies,
>>especially on an OoO machine.
>>
>>POWER is an exception (surprising to me); a dependency in an
>>MR instruction will introduce two cycles of latency, the usual
>>latency for an arithmetic instruction (also on Power10, I measured
>>that today).
>
> Two cycles of latency for arithmetic instructions like integer adds?
> Ouch!

Yes, ouch. I don't know what they spend that extra cycle on.
Probably, their die just got too big, their timing too aggressive,
or rather a combination of both.

By the way, "mr ra,rb" is just an alias for "or ra,rb,rb", so they
actually do register copying through the ALU, like architectures
of old.
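
Since there is no separate move opcode, hardware that wanted to eliminate such moves at rename would first have to recognize the "or" form with two identical sources. A minimal sketch of that check on the instruction word follows; the field layout is the X-form "or" (primary opcode 31, extended opcode 444), and the function names are made up for this example.

```python
def encode_or(ra, rs, rb, rc=0):
    """Encode the Power ISA X-form 'or RA,RS,RB' (primary opcode 31, XO 444)."""
    return (31 << 26) | (rs << 21) | (ra << 16) | (rb << 11) | (444 << 1) | rc


def is_register_move(word):
    """True if word is 'or RA,RS,RB' with RS == RB and Rc == 0, i.e. 'mr RA,RS'."""
    opcode = word >> 26
    rs = (word >> 21) & 0x1F
    rb = (word >> 11) & 0x1F
    xo = (word >> 1) & 0x3FF
    rc = word & 1
    return opcode == 31 and xo == 444 and rs == rb and rc == 0


mr_r3_r4 = encode_or(3, 4, 4)       # mr r3,r4
print(hex(mr_r3_r4), is_register_move(mr_r3_r4))
```

The check is cheap, so recognizing the move is unlikely to be the obstacle; the cost is in what the renamer has to do afterwards.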

>>So, what are the tradeoffs? Will a zero-cycle register move make
>>the pipeline deeper?
>
> Pipeline depths have not been published for Intel and AMD CPUs in
> recent years. ARM publishes its pipeline lengths. One could compare
> the last ARM of a line without this feature to the first with this
> feature, and get an indication whether it made the pipeline deeper.

Does anybody (Scott?) have an indication of which chips this
might be?

Re: Benefit vs cost of zero-cycle register moves

<20240102155826.000020d4@yahoo.com>

https://news.novabbs.org/devel/article-flat.php?id=36495&group=comp.arch#36495

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: already5chosen@yahoo.com (Michael S)
Newsgroups: comp.arch
Subject: Re: Benefit vs cost of zero-cycle register moves
Date: Tue, 2 Jan 2024 15:58:26 +0200
Organization: A noiseless patient Spider
Lines: 29
Message-ID: <20240102155826.000020d4@yahoo.com>
References: <umva2r$1qlp8$1@newsreader4.netcologne.de>
<2024Jan2.115140@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Info: dont-email.me; posting-host="9f61b52514a1ac9aa5835dc58792c46b";
logging-data="2867983"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+iq2rAe9LX/FvT5rcxudfoNWKCwX345rU="
Cancel-Lock: sha1:o6QfWfpkWShYUACV+EatjYzljto=
X-Newsreader: Claws Mail 3.19.1 (GTK+ 2.24.33; x86_64-w64-mingw32)

by: Michael S - Tue, 2 Jan 2024 13:58 UTC

On Tue, 02 Jan 2024 10:51:40 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> Thomas Koenig <tkoenig@netcologne.de> writes:
> >AFAIK, modern Intel, AMD and ARM CPUs do not forward register moves
> >to their execution units; they are done directly via register
> >renaming, up to a certain limit.
>
> The limit in recent CPUs seems to be the width of the register renamer
> (6 on Golden Cove and Zen3). For Golden Cove, that optimization
> includes constant adds in the range -1024..1023 with the intermediate
> sum not exceeding -4096..4095.
>
> >This will, of course, decrease latencies,
> >especially on an OoO machine.
> >
> >POWER is an exception (surprising to me); a dependency in an
> >MR instruction will introduce two cycles of latency, the usual
> >latency for an arithmetic instruction (also on Power10, I measured
> >that today).
>
> Two cycles of latency for arithmetic instructions like integer adds?
> Ouch!
>

The same as all recent Apple 'performance' cores. Which didn't prevent
them from being pretty damn good 'latency' engines.

Re: Benefit vs cost of zero-cycle register moves

<un19b9$1rtrr$1@newsreader4.netcologne.de>

https://news.novabbs.org/devel/article-flat.php?id=36501&group=comp.arch#36501

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2a0a-a540-594-0-5856-a450-2160-967b.ipv6dyn.netcologne.de!not-for-mail
From: tkoenig@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Benefit vs cost of zero-cycle register moves
Date: Tue, 2 Jan 2024 15:15:53 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <un19b9$1rtrr$1@newsreader4.netcologne.de>
References: <umva2r$1qlp8$1@newsreader4.netcologne.de>
<2024Jan2.115140@mips.complang.tuwien.ac.at>
<20240102155826.000020d4@yahoo.com>
Injection-Date: Tue, 2 Jan 2024 15:15:53 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2a0a-a540-594-0-5856-a450-2160-967b.ipv6dyn.netcologne.de:2a0a:a540:594:0:5856:a450:2160:967b";
logging-data="1963899"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)

by: Thomas Koenig - Tue, 2 Jan 2024 15:15 UTC

Michael S <already5chosen@yahoo.com> schrieb:
> On Tue, 02 Jan 2024 10:51:40 GMT
> anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>
>> Thomas Koenig <tkoenig@netcologne.de> writes:
>> >AFAIK, modern Intel, AMD and ARM CPUs do not forward register moves
>> >to their execution units; they are done directly via register
>> >renaming, up to a certain limit.
>>
>> The limit in recent CPUs seems to be the width of the register renamer
>> (6 on Golden Cove and Zen3). For Golden Cove, that optimization
>> includes constant adds in the range -1024..1023 with the intermediate
>> sum not exceeding -4096..4095.
>>
>> >This will, of course, decrease latencies,
>> >especially on an OoO machine.
>> >
>> >POWER is an exception (surprising to me); a dependency in an
>> >MR instruction will introduce two cycles of latency, the usual
>> >latency for an arithmetic instruction (also on Power10, I measured
>> >that today).
>>
>> Two cycles of latency for arithmetic instructions like integer adds?
>> Ouch!
>>
>
> The same as all recent Apple 'performance' cores. Which didn't prevent
> them from being pretty damn good 'latency' engines.

I speak only little ARM, but if I read
https://dougallj.github.io/applecpu/firestorm-int.html correctly,
then add is only two cycles if one of the operands needs to be
extended (at least for the M1 chip). Was this changed in later
versions?

Re: Benefit vs cost of zero-cycle register moves

<20240102175704.0000205e@yahoo.com>

https://news.novabbs.org/devel/article-flat.php?id=36502&group=comp.arch#36502

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: already5chosen@yahoo.com (Michael S)
Newsgroups: comp.arch
Subject: Re: Benefit vs cost of zero-cycle register moves
Date: Tue, 2 Jan 2024 17:57:04 +0200
Organization: A noiseless patient Spider
Lines: 40
Message-ID: <20240102175704.0000205e@yahoo.com>
References: <umva2r$1qlp8$1@newsreader4.netcologne.de>
<2024Jan2.115140@mips.complang.tuwien.ac.at>
<20240102155826.000020d4@yahoo.com>
<un19b9$1rtrr$1@newsreader4.netcologne.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Info: dont-email.me; posting-host="9f61b52514a1ac9aa5835dc58792c46b";
logging-data="2867983"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18PrcWz0C8wR6juORq2ClF1g4Q5B4HV/60="
Cancel-Lock: sha1:d7l3N/JN3JopD9zsjJi20ckhHf0=
X-Newsreader: Claws Mail 3.19.1 (GTK+ 2.24.33; x86_64-w64-mingw32)

by: Michael S - Tue, 2 Jan 2024 15:57 UTC

On Tue, 2 Jan 2024 15:15:53 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:

> Michael S <already5chosen@yahoo.com> schrieb:
> > On Tue, 02 Jan 2024 10:51:40 GMT
> > anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
> >
> >> Thomas Koenig <tkoenig@netcologne.de> writes:
> >> >AFAIK, modern Intel, AMD and ARM CPUs do not forward register
> >> >moves to their execution units; they are done directly via
> >> >register renaming, up to a certain limit.
> >>
> >> The limit in recent CPUs seems to be the width of the register
> >> renamer (6 on Golden Cove and Zen3). For Golden Cove, that
> >> optimization includes constant adds in the range -1024..1023 with
> >> the intermediate sum not exceeding -4096..4095.
> >>
> >> >This will, of course, decrease latencies,
> >> >especially on an OoO machine.
> >> >
> >> >POWER is an exception (surprising to me); a dependency in an
> >> >MR instruction will introduce two cycles of latency, the usual
> >> >latency for an arithmetic instruction (also on Power10, I measured
> >> >that today).
> >>
> >> Two cycles of latency for arithmetic instructions like integer
> >> adds? Ouch!
> >>
> >
> > The same as all recent Apple 'performance' cores. Which didn't
> > prevent them from being pretty damn good 'latency' engines.
>
> I speak only little ARM, but if I read
> https://dougallj.github.io/applecpu/firestorm-int.html correctly,
> then add is only two cycles if one of the operands needs to be
> extended (at least for the M1 chip). Was this changed in later
> versions?

You are right. Somehow I misremembered.

Re: Benefit vs cost of zero-cycle register moves

<FVZkN.132605$83n7.119588@fx18.iad>

https://news.novabbs.org/devel/article-flat.php?id=36524&group=comp.arch#36524

Path: i2pn2.org!i2pn.org!newsfeed.endofthelinebbs.com!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx18.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Benefit vs cost of zero-cycle register moves
Newsgroups: comp.arch
Distribution: world
References: <umva2r$1qlp8$1@newsreader4.netcologne.de> <2024Jan2.115140@mips.complang.tuwien.ac.at> <un0u76$1rm4h$1@newsreader4.netcologne.de>
Lines: 46
Message-ID: <FVZkN.132605$83n7.119588@fx18.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Tue, 02 Jan 2024 19:58:29 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Tue, 02 Jan 2024 19:58:29 GMT
X-Received-Bytes: 2652

by: Scott Lurndal - Tue, 2 Jan 2024 19:58 UTC

Thomas Koenig <tkoenig@netcologne.de> writes:
>Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>> Thomas Koenig <tkoenig@netcologne.de> writes:
>>>AFAIK, modern Intel, AMD and ARM CPUs do not forward register moves
>>>to their execution units; they are done directly via register renaming,
>>>up to a certain limit.
>>
>> The limit in recent CPUs seems to be the width of the register renamer
>> (6 on Golden Cove and Zen3). For Golden Cove, that optimization
>> includes constant adds in the range -1024..1023 with the intermediate
>> sum not exceeding -4096..4095.
>>
>>>This will, of course, decrease latencies,
>>>especially on an OoO machine.
>>>
>>>POWER is an exception (surprising to me); a dependency in an
>>>MR instruction will introduce two cycles of latency, the usual
>>>latency for an arithmetic instruction (also on Power10, I measured
>>>that today).
>>
>> Two cycles of latency for arithmetic instructions like integer adds?
>> Ouch!
>
>Yes, ouch. I don't know what they spend that extra cycle on.
>Probably, their die just got too big, their timing too aggressive,
>or rather a combination of both.
>
>By the way, "mr ra,rb" is just an alias for "or ra,rb,rb", so they
>actually do register copying through the ALU, like architectures
>of old.
>
>>>So, what are the tradeoffs? Will a zero-cycle register move make
>>>the pipeline deeper?
>>
>> Pipeline depths have not been published for Intel and AMD CPUs in
>> recent years. ARM publishes its pipeline lengths. One could compare
>> the last ARM of a line without this feature to the first with this
>> feature, and get an indication whether it made the pipeline deeper.
>
>Does anybody (Scott?) have an indication of which chips this
>might be?

I can't speak to anything non-public. The Wikipedia page for
neoverse shows a pipeline depth of 10 cycles for the N2 family.

https://en.wikipedia.org/wiki/ARM_Neoverse#Neoverse_N2

Re: Benefit vs cost of zero-cycle register moves

<a5dfcf0e975c0f53538473e625e8edb8@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=36583&group=comp.arch#36583

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Benefit vs cost of zero-cycle register moves
Date: Thu, 4 Jan 2024 00:15:13 +0000
Organization: novaBBS
Message-ID: <a5dfcf0e975c0f53538473e625e8edb8@news.novabbs.com>
References: <umva2r$1qlp8$1@newsreader4.netcologne.de> <2024Jan2.115140@mips.complang.tuwien.ac.at> <un0u76$1rm4h$1@newsreader4.netcologne.de> <FVZkN.132605$83n7.119588@fx18.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="2183416"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Spam-Checker-Version: SpamAssassin 4.0.0
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
X-Rslight-Site: $2y$10$QT/hKQfjbRzUjtvScAOSWOL8uvWAunoiOdBZgWAjHTzsqe0o83GtO

by: MitchAlsup - Thu, 4 Jan 2024 00:15 UTC

Scott Lurndal wrote:

> I can't speak to anything non-public. The Wikipedia page for
> neoverse shows a pipeline depth of 10 cycles for the N2 family.

> https://en.wikipedia.org/wiki/ARM_Neoverse#Neoverse_N2

The 4-cycle LD-use latency indicates a high-frequency, wide-issue
design; along with that, the 10-cycle pipeline depth indicates little
time for instruction fusing or register write elision.

Re: Benefit vs cost of zero-cycle register moves

<un5td6$1uusg$1@newsreader4.netcologne.de>

https://news.novabbs.org/devel/article-flat.php?id=36602&group=comp.arch#36602

Path: i2pn2.org!i2pn.org!news.neodome.net!news.mixmin.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2a0a-a540-594-0-619c-ea84-1fa6-4d1e.ipv6dyn.netcologne.de!not-for-mail
From: tkoenig@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Benefit vs cost of zero-cycle register moves
Date: Thu, 4 Jan 2024 09:22:46 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <un5td6$1uusg$1@newsreader4.netcologne.de>
References: <umva2r$1qlp8$1@newsreader4.netcologne.de>
<2024Jan2.115140@mips.complang.tuwien.ac.at>
<un0u76$1rm4h$1@newsreader4.netcologne.de>
<FVZkN.132605$83n7.119588@fx18.iad>
<a5dfcf0e975c0f53538473e625e8edb8@news.novabbs.com>
Injection-Date: Thu, 4 Jan 2024 09:22:46 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2a0a-a540-594-0-619c-ea84-1fa6-4d1e.ipv6dyn.netcologne.de:2a0a:a540:594:0:619c:ea84:1fa6:4d1e";
logging-data="2063248"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)

by: Thomas Koenig - Thu, 4 Jan 2024 09:22 UTC

MitchAlsup <mitchalsup@aol.com> schrieb:
> Scott Lurndal wrote:
>
>> I can't speak to anything non-public. The Wikipedia page for
>> neoverse shows a pipeline depth of 10 cycles for the N2 family.
>
>> https://en.wikipedia.org/wiki/ARM_Neoverse#Neoverse_N2
>
> Along with the 4-cycle LD-use latency indicates a high frequency
> wide-issue design, the 10-cycle pipeline depth indicates little
> time for instruction fusing or register write elision.

The ARM Neoverse N2 Software Optimization Guide gives a one-cycle
execution latency for register to register moves (with four in
parallel). Constant loads take zero cycles; and simple register
moves are also listed under "Zero Latency MOVs" with the somewhat
less than illuminating caveat

"The last 3 instructions may not be executed with zero latency
under certain conditions".

https://gcc.gnu.org/git/?p=gcc.git;a=blob_plain;f=gcc/config/aarch64/tuning_models/neoversen2.h;hb=HEAD

gives that cost as 1 (so presumably these conditions happen).

They also fuse some instructions for aarch64

CMP/CMN (immediate) + B.cond
CMP/CMN (register) + B.cond
CMP (immediate) + CSEL
CMP (register) + CSEL
CMP (immediate) + CSET
CMP (register) + CSET
TST (immediate) + B.cond
TST (register) + B.cond
BICS (register) + B.cond
NOP + Any instruction

plus for both 64-bit and 32-bit

AESE + AESMC (see Section 4.6 on AES Encryption/Decryption)
AESD + AESIMC (see Section 4.6 on AES Encryption/Decryption)
CMP/CMN (immediate) + B.cond
CMP/CMN (register) + B.cond
TST (immediate) + B.cond
TST (register) + B.cond
BICS (register) + B.cond

where conditions apply which they actually spell out.
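
To make the pairing concrete, here is a small sketch that groups adjacent fusible instructions in a stream of mnemonics. It is a simplification: the table is reduced from the lists above to bare mnemonics, "NOP + Any instruction" is left out, and operand forms and the guide's extra conditions are ignored. The function name is made up.

```python
# Pairs from the lists above, reduced to mnemonics; operand forms and
# the guide's extra conditions are ignored in this sketch.
FUSIBLE = {
    ("CMP", "B.cond"), ("CMN", "B.cond"),
    ("CMP", "CSEL"), ("CMP", "CSET"),
    ("TST", "B.cond"), ("BICS", "B.cond"),
    ("AESE", "AESMC"), ("AESD", "AESIMC"),
}


def fuse(stream):
    """Greedily group adjacent fusible mnemonics into tuples."""
    out, i = [], 0
    while i < len(stream):
        if i + 1 < len(stream) and (stream[i], stream[i + 1]) in FUSIBLE:
            out.append((stream[i], stream[i + 1]))
            i += 2
        else:
            out.append((stream[i],))
            i += 1
    return out


print(fuse(["ADD", "CMP", "B.cond", "TST", "CSEL", "AESE", "AESMC"]))
```

Note that TST followed by CSEL stays unfused here because that pair is not in the lists above.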
