Rocksolid Light

devel / comp.arch / Re: Load/Store with auto-increment

Subject (Author)

* Re: Load/Store with auto-increment (EricP)
+* Re: Load/Store with auto-increment (MitchAlsup)
|`- Re: Load/Store with auto-increment (MitchAlsup)
`- Re: Load/Store with auto-increment (Anton Ertl)

Re: Load/Store with auto-increment

<n6a7M.425169$wfQc.50935@fx43.iad>


https://news.novabbs.org/devel/article-flat.php?id=32169&group=comp.arch#32169

Newsgroups: comp.arch
From: ThatWouldBeTelling@thevillage.com (EricP)
Subject: Re: Load/Store with auto-increment
Date: Thu, 11 May 2023 14:03:50 -0400
 by: EricP - Thu, 11 May 2023 18:03 UTC

Anton Ertl wrote:
> EricP <ThatWouldBeTelling@thevillage.com> writes:
>> Anton Ertl wrote:
> [..]
>> But trying to come up with a circuit that can choose the optimum
>> crossbar configuration in 1 clock proved too difficult at that time.
>
> How about taking >1 cycle? What would suffer in that case?

The scenario I'm thinking of is the one where Dispatch
(the last stage of the front end) hands off multiple uOps
to the various function-unit reservation stations.

It could take multiple clocks but obviously that bottlenecks
the whole processor. Why have 4 lanes of Decode and Rename
if Dispatch effectively knocks that down to 2 lanes?
You might as well stick with 2 lanes in the front end.

But I had some new ideas on resource management since then
so I might look at this again with new eyes.

>>> Forwarding reduced the read-port requirements and writes to the same
>>> register (with only forwarding reads and no branch or possible trap in
>>> between) may reduce the write-port requirements, but the question is
>>> how much of that is known early enough in an OoO implementation to
>>> reduce the port and renamer requirements.
>> I don't think one can reduce write ports because each dest operand must
>> be written back to the physical register and forwarded at the same time.
>> It can't optimize away the write back because, for example, a replay
>> trap might occur and cause uOps to re-read their source operands,
>> which on the second read would be sourced from the reg file.
>
> You can do it in cases where there cannot be any kind of replay
> between the first and second write. E.g.,
>
> a = load(b)
> a = a+1
>
> This sequence needs to write "a" only once, in the second instruction.
> If the load traps, "a" has the value from before the load. The
> addition cannot trap or otherwise need a replay, so there is no need
> to be able to get the architectural state between the load and the
> addition.
>
> - anton

For that limited example, yes. But with multiple loads and multiple
consumers with multiple operands, one can produce a sequence that breaks.

r1 = load(addr1)
r2 = load(addr2)
r1 = r1 + 1
r2 = r2 + 1

If load of r2 replay traps it replays all instructions after it.
If the write of r1 was optimized away then the data is lost.

For this write back optimization to be possible there are some
filter rules that must be applied.

The arch register must have been renamed a second time, so the
consumers of the original physical register are limited to those already assigned.

Then all those consumer uOps have to make it out of the front end
and into reservation stations so we are sure that all consumers are
listening on the forwarding network BEFORE the producer uOp executes
and broadcasts its result. In particular, a front end stall could
prevent consumer uOps from being passed to their R.S. so they miss
the result broadcast.

Then we must ensure there are no overlapping sequences that could
replay trap and cause a data loss like the above example.
So we need some detector for whether such a sequence is in flight.

This decision is dynamic, based on the state at the time of write back,
and all the decision inputs must be communicated to each function unit.
The register write port has to already be available
in case the optimization does not work out.

So all this optimization does is inhibit its write when not filtered out.
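Those filter rules can be condensed into one predicate. A minimal sketch of the idea (every name below is invented for illustration, not from any real design): elide only if the arch register was renamed again, every consumer of the old physical register is already parked in a reservation station, and nothing that can replay sits between the two writes.

```python
# Hypothetical sketch of the writeback-elision filter described above;
# all names and structures are invented for illustration.

def may_elide(renamed_again, consumers_in_rs, replay_between):
    """True if the producer's register writeback can be safely skipped.

    renamed_again   -- the arch register got a second rename (an overwrite exists)
    consumers_in_rs -- one flag per consumer of the old physical register,
                       True if that consumer already sits in a reservation station
    replay_between  -- a possibly-replaying uOp is in flight between the writes
    """
    return renamed_again and all(consumers_in_rs) and not replay_between

# A front-end stall that keeps one consumer out of its R.S. blocks the elision:
print(may_elide(True, [True, False], False))  # -> False
print(may_elide(True, [True, True],  False))  # -> True
```

Note that all three inputs must be gathered dynamically, which is exactly the communication cost described above.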

Re: Load/Store with auto-increment

<2d04c0bc-b1a4-4414-8f12-078b8f26e87an@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=32172&group=comp.arch#32172

Newsgroups: comp.arch
From: MitchAlsup@aol.com (MitchAlsup)
Subject: Re: Load/Store with auto-increment
Date: Thu, 11 May 2023 11:40:55 -0700 (PDT)
 by: MitchAlsup - Thu, 11 May 2023 18:40 UTC

On Thursday, May 11, 2023 at 1:04:07 PM UTC-5, EricP wrote:
> Anton Ertl wrote:
> > EricP <ThatWould...@thevillage.com> writes:
> >> Anton Ertl wrote:
> > [..]
> >> But trying to come up with a circuit that can choose the optimum
> >> crossbar configuration in 1 clock proved too difficult at that time.
> >
> > How about taking >1 cycle? What would suffer in that case?
> In the scenario I'm thinking of this is where Dispatch
> (the last stage of the front end) hands off multiple uOps
> to the various function unit reservation stations.
>
> It could take multiple clocks but obviously that bottlenecks
> the whole processor. Why have 4 lanes of Decode and Rename
> if Dispatch effectively knocks that down to 2 lanes?
> You might as well stick with 2 lanes in the front end.
<
From Polpak library:: r8_erf(): polynomial evaluation::
<
mov r3,#0x3FC7C7905A31C322 // a[4]
fmac r3,r2,r3,#0x400949FB3ED443E9 // a[0]
fmac r3,r2,r3,#0x405C774E4D365DA3 // a[1]
ldd r4,[sp,104] // a[2]
fmac r3,r2,r3,r4
fadd r4,r2,#0x403799EE342FB2DE // b[0]
fmac r4,r2,r4,#0x406E80C9D57E55B8 // b[1]
fmac r4,r2,r4,#0x40940A77529CADC8 // b[2]
fmac r3,r2,r3,#0x40A912C1535D121A // a[3]
fmul r1,r3,r1
fmac r2,r2,r4,#0x40A63879423B87AD // b[3]
fdiv r2,r1,r2
<snip>
<
We have 11 instructions that are processed in the FMAC unit
{FMUL, FMAC, FDIV} essentially in a row {And the LDD of r4
should have been a constant, too}. You would like these to get
dumped into the FMAC reservation station in 2-cycles (6-wide)
or 3 cycles (4-wide), so you need a mechanism to perform said
dumping or the front end will stall.
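As a toy model of that bottleneck (the widths and port counts below are illustrative assumptions, not any real machine's): with in-order dispatch, a run of same-unit instructions drains at the rate of that unit's reservation-station write ports, however wide the front end is.

```python
# Toy model of in-order dispatch: at most `width` uOps per cycle overall,
# and at most `ports` uOps per cycle into any single reservation station.
# All parameters are illustrative assumptions.

def dispatch_cycles(stream, width, ports):
    """Cycles needed to dispatch `stream` (a list of FU names) in order."""
    cycles, i = 0, 0
    while i < len(stream):
        sent = {}                       # uOps sent to each FU this cycle
        n = 0
        while i < len(stream) and n < width:
            fu = stream[i]
            if sent.get(fu, 0) == ports:
                break                   # that FU's R.S. ports are exhausted
            sent[fu] = sent.get(fu, 0) + 1
            i += 1
            n += 1
        cycles += 1
    return cycles

run = ["FMAC"] * 11                     # an r8_erf-style run, all one unit
print(dispatch_cycles(run, width=6, ports=6))   # -> 2
print(dispatch_cycles(run, width=6, ports=2))   # -> 6, the front end crawls
```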
>
> But I had some new ideas on resource management since then
> so I might look at this again with new eyes.
> >>> Forwarding reduced the read-port requirements and writes to the same
> >>> register (with only forwarding reads and no branch or possible trap in
> >>> between) may reduce the write-port requirements, but the question is
> >>> how much of that is known early enough in an OoO implementation to
> >>> reduce the port and renamer requirements.
> >> I don't think one can reduce write ports because each dest operand must
> >> be written back to the physical register and forwarded at the same time.
> >> It can't optimize away the write back because, for example, a replay
> >> trap might occur and cause uOps to re-read their source operands,
> >> which on the second read would be sourced from the reg file.
> >
> > You can do it in cases where there cannot be any kind of replay
> > between the first and second write. E.g.,
> >
> > a = load(b)
> > a = a+1
> >
> > This sequence needs to write "a" only once, in the second instruction.
> > If the load traps, "a" has the value from before the load. The
> > addition cannot trap or otherwise need a replay, so there is no need
> > to be able to get the architectural state between the load and the
> > addition.
> >
> > - anton
> For that limited example. But with multiple loads and multiple consumers
> with multiple operands one can produce a sequence that breaks.
>
> r1 = load(addr1)
> r2 = load(addr2)
> r1 = r1 + 1
> r2 = r2 + 1
>
> If load of r2 replay traps it replays all instructions after it.
> If the write of r1 was optimized away then the data is lost.
<
Which is why, in the pipeline, you mark the write to r1 from the LD
as "elided" but only so long as the second write to r1 from the ADD
survives. That is, you can still elide the write, but you have to be
able to see the overwrite for elision to be manifest.
>
> For this write back optimization to be possible there are some
> filter rules that must be applied.
>
> The arch register must have been renamed a second time, so the
> consumers of original phy reg is limited to those already assigned.
<
There are front-end organizations where instructions are grouped
in some fashion (packet cache, trace cache, "bundle stage in the
pipeline" ) and if any instruction in the group raises an exception,
the machine is backed up such that none of them have executed
and each one is tried in-situ until the exception is discovered.
<
{The primary reason for the grouping is to route the instructions
to function units, and to pack the register specifiers, eliding those
not necessary from the renamer. This means that the instructions
in the group have been pre-renamed:: for example, an operand might
be encoded as "I want the result from the 3rd instruction in the group",
while a result could be marked as "elided, don't allocate a register".
Such a result does not need a register to write to, and will broadcast
its result not by physical register, but by checkpoint and FU.
Reservation station entries are set up to capture by <slot,ckID>
instead of <prn>.}
<
In the former case, you mark elision in the grouping; in the latter
you have collapsed the machine to 1-wide while searching for the
exception and elide nothing.
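A toy rendition of that capture-by-<slot,ckID> idea (the data structures below are invented to show the mechanism, not a real pipeline): operands within a group may name either a physical register or "the result of slot k of this checkpoint", and an elided result is broadcast under its (ckID, slot) key without ever allocating a physical register.

```python
# Illustrative sketch of capturing operands by <slot,ckID> instead of <prn>.

def run_group(ck_id, group, regfile):
    """Execute one pre-renamed group.

    Each op is (fn, srcs, dest_prn); a source is ('prn', n) for a physical
    register or ('slot', k) for the result of the k-th op in this group.
    dest_prn of None means the result is elided: broadcast-only, no register.
    """
    broadcast = {}                              # (ckID, slot) -> value
    for slot, (fn, srcs, dest_prn) in enumerate(group):
        vals = [regfile[s[1]] if s[0] == 'prn' else broadcast[(ck_id, s[1])]
                for s in srcs]
        v = fn(*vals)
        broadcast[(ck_id, slot)] = v            # capture key is <slot, ckID>
        if dest_prn is not None:                # elided results skip the file
            regfile[dest_prn] = v
    return regfile, broadcast

# Slot 0's result is elided; slot 1 picks it up purely by slot number.
regs, bus = run_group(7, [(lambda x: x + 1, [('prn', 0)], None),
                          (lambda x: x * 2, [('slot', 0)], 1)], {0: 5})
print(regs[1])     # -> 12
```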
>
> Then all those consumer uOps have to make it out of the front end
> and into reservation stations so we are sure that all consumers are
> listening on the forwarding network BEFORE the producer uOp executes
> and broadcasts its result. In particular, a front end stall could
> prevent consumer uOps from being passed to their R.S. so they miss
> the result broadcast.
>
> Then we must ensure there are no overlapping sequences that could
> replay trap and cause a data loss like the above example.
> So we need some detector that there is such a sequence in-flight.
>
> This decision is dynamic based on the state at the time of write back
> and all the decision inputs must be communicated to each function unit.
> The register write port has to already be available
> in case the optimization does not work out.
>
> So all this optimization does is inhibit its write when not filtered out.

Re: Load/Store with auto-increment

<7caf19a3-a768-4ecd-9b40-9f7f77b90155n@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=32178&group=comp.arch#32178

Newsgroups: comp.arch
From: MitchAlsup@aol.com (MitchAlsup)
Subject: Re: Load/Store with auto-increment
Date: Thu, 11 May 2023 14:07:40 -0700 (PDT)
 by: MitchAlsup - Thu, 11 May 2023 21:07 UTC

On Thursday, May 11, 2023 at 1:40:57 PM UTC-5, MitchAlsup wrote:
> From Polpak library:: r8_erf(): polynomial evaluation::
> <
> mov r3,#0x3FC7C7905A31C322 // a[4]
> fmac r3,r2,r3,#0x400949FB3ED443E9 // a[0]
> fmac r3,r2,r3,#0x405C774E4D365DA3 // a[1]
> ldd r4,[sp,104] // a[2]
> fmac r3,r2,r3,r4
> fadd r4,r2,#0x403799EE342FB2DE // b[0]
> fmac r4,r2,r4,#0x406E80C9D57E55B8 // b[1]
> fmac r4,r2,r4,#0x40940A77529CADC8 // b[2]
> fmac r3,r2,r3,#0x40A912C1535D121A // a[3]
> fmul r1,r3,r1
> fmac r2,r2,r4,#0x40A63879423B87AD // b[3]
> fdiv r2,r1,r2
> <snip>
> <
> We have 11 instructions that are processed in the FMAC unit
> {FMUL, FMAC, FDIV} essentially in a row {And the LDD of r4
> should have been a constant, too}. You would like these to get
> dumped into the FMAC reservation station in 2-cycles (6-wide)
> or 3 cycles (4-wide), so you need a mechanism to perform said
> dumping or the front end will stall.
<
I went looking and found a stretch of code from my compiler that
has 40-odd FMAC instructions in a ROW !!
<
From Navier_Stokes_exact_2D::resid_Taylor::
<
.LBB15_2: ; =>This Inner Loop Header: Depth=1
ldd r8,[r27,r7<<3,0]
fmul r9,r8,#0x400921FB54442D18
fcos r10,r9
ldd r11,[r28,r7<<3,0]
fmul r12,r11,#0x400921FB54442D18 // first one starting #1
fsin r13,r12
fmul r14,r13,-r10
fsin r9,r9
fmul r15,r9,#0x400921FB54442D18
fmul r15,r15,r13
fmul r22,r10,#0x4023BD3CC9BE45DE
fmul r22,r22,r13
fmul r21,r10,#0x400921FB54442D18
fcos r12,r12
fmul r21,r21,r12
fmul r10,r1,r10
fmul r10,r10,r13
fmul r20,r9,r12
fmul r18,r9,#0xC023BD3CC9BE45DE
fmul r18,r18,r12
fmul r17,r9,#0xC00921FB54442D18
fmul r13,r17,r13
fmul r9,r1,r9
fmul r9,r9,r12
fmul r8,r8,#0x401921FB54442D18
fmul r11,r11,#0x401921FB54442D18
fsin r8,r8
fmul r8,r2,r8
fsin r11,r11
fmul r11,r2,r11
fmul r12,r3,r14
fmul r14,r3,r15
fmul r15,r3,r22
fmul r22,r3,r21
fmul r21,r3,r20
fmul r20,r3,r18
fmul r13,r3,r13
fmul r9,r3,r9
fmul r8,r4,r8
fmul r11,r4,r11
fmul r18,r12,r14
fmac r10,r3,r10,r18
fmac r10,r22,-r21,r10
fmac r8,r5,r8,r10 // last one ending 40.
fadd r10,r15,r15
fmac r8,r26,-r10,r8
ldd r10,[r25,r7<<3,0]
fadd r8,r8,-r10
std r8,[r29,r7<<3,0]
fmac r8,r12,r22,-r9
fmac r8,r21,r13,r8
fmac r8,r5,r11,r8
fadd r9,r20,r20
fmac r8,r26,-r9,r8
ldd r9,[r24,r7<<3,0]
fadd r8,r8,-r9
std r8,[r30,r7<<3,0]
fadd r8,r14,r13
ldd r9,[r23,r7<<3,0]
fadd r8,r8,-r9
std r8,[r19,r7<<3,0]
add r7,r7,#1
cmp r8,r7,r6
bne r8,.LBB15_2
<
This is what happens when your ISA contains uniform constants:
all the other instructions that merely support the FP calculations vanish....
{Instructions like paste-constant, then a LD based on the pasted
constant}....
<
Instructions that vanish do not waste power or other resources.
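A back-of-envelope count of that effect: the r8_erf sequence above has 12 instructions, 8 of them carrying an inline 64-bit FP constant. The per-constant costs below (one load, or a paste-constant plus a load) are assumptions about an ISA without such support, just to tally what would reappear.

```python
# Rough tally of the instructions that "vanish" with inline constants.
# The snippet figures come from the r8_erf sequence above; the
# extra_per_const values are assumed costs of materializing a constant.

def total_ops(base_ops, n_consts, extra_per_const):
    """Instruction count if each constant costs extra_per_const more ops."""
    return base_ops + n_consts * extra_per_const

base, consts = 12, 8
for extra in (1, 2):
    t = total_ops(base, consts, extra)
    print(f"{extra} extra op(s)/constant: {t} instructions ({t - base} reappear)")
```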

Re: Load/Store with auto-increment

<2023May12.093148@mips.complang.tuwien.ac.at>


https://news.novabbs.org/devel/article-flat.php?id=32181&group=comp.arch#32181

Newsgroups: comp.arch
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Subject: Re: Load/Store with auto-increment
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Date: Fri, 12 May 2023 07:31:48 GMT
 by: Anton Ertl - Fri, 12 May 2023 07:31 UTC

EricP <ThatWouldBeTelling@thevillage.com> writes:
>Anton Ertl wrote:
>> EricP <ThatWouldBeTelling@thevillage.com> writes:
>>> Anton Ertl wrote:
>> [..]
>>> But trying to come up with a circuit that can choose the optimum
>>> crossbar configuration in 1 clock proved too difficult at that time.
>>
>> How about taking >1 cycle? What would suffer in that case?
>
>In the scenario I'm thinking of this is where Dispatch
>(the last stage of the front end) hands off multiple uOps
>to the various function unit reservation stations.
>
>It could take multiple clocks but obviously that bottlenecks
>the whole processor.

Not if you manage to pipeline this part. Making the front end longer
costs on branch mispredictions, but branch mispredictions are rare, so
a longer front end is not particularly bad.
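That trade can be put in rough numbers (the rate below is an assumed illustrative figure, not a measurement): each extra front-end stage adds its depth to every misprediction recovery, so the added cost per instruction is just the mispredict rate times the added depth.

```python
# Back-of-envelope cost of a deeper front end; both inputs are assumed
# illustrative numbers, not measurements of any real machine.

def added_cpi(mispredicts_per_insn, extra_stages):
    """Extra cycles per instruction from lengthening the front end."""
    return mispredicts_per_insn * extra_stages

# At 1 mispredict per 100 instructions, one extra dispatch stage costs
# only about a hundredth of a cycle per instruction:
print(f"{added_cpi(0.01, 1):.3f} cycles/instruction")   # -> 0.010
```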

>> You can do it in cases where there cannot be any kind of replay
>> between the first and second write. E.g.,
>>
>> a = load(b)
>> a = a+1
>>
>> This sequence needs to write "a" only once, in the second instruction.
>> If the load traps, "a" has the value from before the load. The
>> addition cannot trap or otherwise need a replay, so there is no need
>> to be able to get the architectural state between the load and the
>> addition.
>>
>> - anton
>
>For that limited example. But with multiple loads and multiple consumers
>with multiple operands one can produce a sequence that breaks.
>
> r1 = load(addr1)
> r2 = load(addr2)
> r1 = r1 + 1
> r2 = r2 + 1
>
>If load of r2 replay traps it replays all instructions after it.
>If the write of r1 was optimized away then the data is lost.

Sure, that's why this writeback cannot be eliminated. But the
writeback of the second load can be eliminated. So for this whole
sequence you can eliminate one writeback. If the compiler reorders
this as

r1 = load(addr1)
r1 = r1 + 1
r2 = load(addr2)
r2 = r2 + 1

two writebacks can be eliminated.
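The effect of that reordering can be checked mechanically. A small sketch (treating loads as the only replay-capable instructions, an assumption to keep it short): a write is elidable when a later write to the same register follows with no replay-capable instruction in between.

```python
# Count elidable writebacks under the rule discussed above: skip a write
# iff a later write to the same arch register follows with no possibly-
# replaying instruction (modeled here as just loads) in between.

def count_elidable(seq):
    """seq: list of (dest_reg, is_load) in program order."""
    n = 0
    for i, (dest, _) in enumerate(seq):
        for j in range(i + 1, len(seq)):
            d2, is_load = seq[j]
            if is_load:
                break               # a replay here would lose the old value
            if d2 == dest:
                n += 1              # overwritten safely: writeback elidable
                break
    return n

interleaved = [("r1", True), ("r2", True), ("r1", False), ("r2", False)]
reordered   = [("r1", True), ("r1", False), ("r2", True), ("r2", False)]
print(count_elidable(interleaved))  # -> 1 (only the second load's write)
print(count_elidable(reordered))    # -> 2 (both loads' writes)
```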

>For this write back optimization to be possible there are some
>filter rules that must be applied.
>
>The arch register must have been renamed a second time, so the
>consumers of original phy reg is limited to those already assigned.
>
>Then all those consumer uOps have to make it out of the front end
>and into reservation stations so we are sure that all consumers are
>listening on the forwarding network BEFORE the producer uOp executes
>and broadcasts its result. In particular, a front end stall could
>prevent consumer uOps from being passed to their R.S. so they miss
>the result broadcast.

Yes, that's a major hurdle.

>Then we must ensure there are no overlapping sequences that could
>replay trap and cause a data loss like the above example.

That's easy; you see this during decoding.

>This decision is dynamic based on the state at the time of write back
>and all the decision inputs must be communicated to each function unit.

I don't think it's practical to do it at that time. The question is
whether you can manage to do it earlier, maybe even in the front end,
maybe aided by the availability of additional buffers to deal with the
missed-writeback case.

Or maybe prepare it in the front end, and have a cheap final decision
in the OoO engine.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
