Re: Load/Store with auto-increment

<GCb7M.2700303$iS99.1096933@fx16.iad>

https://news.novabbs.org/devel/article-flat.php?id=32174&group=comp.arch#32174

Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx16.iad.POSTED!not-for-mail
From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Load/Store with auto-increment
References: <u35prk$2ssbq$1@dont-email.me> <u36fd2$121nc$1@newsreader4.netcologne.de> <2023May9.111344@mips.complang.tuwien.ac.at> <UQt6M.233407$qpNc.65909@fx03.iad> <_Qu6M.539024$Olad.404121@fx35.iad> <Y7y6M.233411$qpNc.12100@fx03.iad> <2023May10.100025@mips.complang.tuwien.ac.at> <LHP6M.2840676$9sn9.1828478@fx17.iad> <2023May11.120936@mips.complang.tuwien.ac.at>
In-Reply-To: <2023May11.120936@mips.complang.tuwien.ac.at>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 62
Message-ID: <GCb7M.2700303$iS99.1096933@fx16.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Thu, 11 May 2023 19:46:46 UTC
Date: Thu, 11 May 2023 15:46:37 -0400
X-Received-Bytes: 3143

by: EricP - Thu, 11 May 2023 19:46 UTC

Anton Ertl wrote:
> EricP <ThatWouldBeTelling@thevillage.com> writes:
>> Anton Ertl wrote:
> [..]
>> But trying to come up with a circuit that can choose the optimum
>> crossbar configuration in 1 clock proved too difficult at that time.
>
> How about taking >1 cycle? What would suffer in that case?

Last week I was working on an allocator for my physical register file.
In my uArch physical registers are allocated by Rename from a
bit vector indicating which registers are free.

The problem is how to do this concurrently for multiple lanes.

Originally I used a priority selector doing a Find-First-1 to select
the first free bit. The physical register file has status bits for
each pReg, one of which is Free. The priority selector scans the free
vector producing a one-hot output, and a one-hot to binary encoder
converts that to a free phy reg number.
An FF1 can be built with a gate delay that grows as log_4 of the number
of scanned bits, so selecting a free reg from a set of 255 physical
registers should take about 4-5 NAND/NOR gate delays.
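
A rough software model of the selector described above may help; it is a sketch under stated assumptions, not the actual circuit. The free vector is held as a Python integer, bit 0 is assumed to have the highest priority, and the names `ff1_onehot`, `onehot_to_binary` and `ff1_levels` are hypothetical. `ff1_levels` only counts the levels of an idealized radix-4 tree; it says nothing about real gate sizing or wire delay.

```python
def ff1_onehot(free: int) -> int:
    """Return a one-hot mask of the lowest set bit of `free` (0 if none is set)."""
    return free & -free


def onehot_to_binary(onehot: int) -> int:
    """Convert a one-hot mask to its bit index; returns -1 for an empty mask."""
    return onehot.bit_length() - 1


def ff1_levels(nbits: int, fan_in: int = 4) -> int:
    """Count tree levels when each level combines `fan_in` inputs (idealized)."""
    levels, span = 0, 1
    while span < nbits:
        span *= fan_in
        levels += 1
    return levels


free_vector = 0b1010000                  # registers 4 and 6 are free
selected = ff1_onehot(free_vector)       # one-hot mask for register 4
reg_number = onehot_to_binary(selected)  # binary register number
```

With this idealized count, `ff1_levels(255)` comes out at 4, in line with the estimate of about 4-5 gate delays.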

Originally to rename multiple lanes I had planned that the priority
selectors would be serially chained: the one-hot output of the first
FF1 masks out that bit from the input of the second FF1, and so on.

  free bit vector
  |------------------
  v        |        |
 FF1       |        |
  |        v        |
  |------>MASK      |
  |        v        |
  |       FF1       v
  |        |------>MASK
  |        |        v
  |        |       FF1
  v        v        v
encode   encode   encode
  v        v        v
lane-0   lane-1   lane-2
 free     free     free
 reg#     reg#     reg#
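
The serial chain can be modelled in a few lines. This is a behavioural sketch only, assuming the free vector is a Python integer with bit 0 as the highest priority; `allocate_chained` is a hypothetical name. Each loop iteration depends on the mask produced by the previous one, which mirrors the lane-to-lane dependency in the chained design.

```python
def allocate_chained(free: int, lanes: int) -> list:
    """Pick one free register per lane; -1 marks a lane that found none."""
    picks = []
    for _ in range(lanes):
        onehot = free & -free                  # FF1: one-hot of lowest free bit
        picks.append(onehot.bit_length() - 1)  # encode one-hot to a reg number
        free &= ~onehot                        # MASK: hide it from later lanes
    return picks


# Registers 0, 2, 3 and 5 are free; three lanes ask for one register each.
picks = allocate_chained(0b101101, 3)
```

Here lane 0 takes register 0, lane 1 takes register 2 and lane 2 takes register 3. Lane 2 cannot start its search until both earlier selections have been masked out.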

The problem is the propagation delay for multiple lanes uses
up much of the available time in the Rename stage.
Besides the free register selector there is other gate delay overhead
in Rename so it really didn't look like this FF1 chaining approach
would be usable beyond 2 or maybe 3 lanes.

To rename 4 or more lanes, what I needed was a faster way to
allocate resources in parallel, with some log_N gate delay growth.

Continued next msg...

Re: Load/Store with auto-increment

<4745b049-0649-4744-8934-d99093c2eb68n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=32176&group=comp.arch#32176

X-Received: by 2002:a05:620a:4490:b0:74e:7d6:e85 with SMTP id x16-20020a05620a449000b0074e07d60e85mr6768906qkp.11.1683836569640;
Thu, 11 May 2023 13:22:49 -0700 (PDT)
X-Received: by 2002:a05:6871:206:b0:18f:2b14:d686 with SMTP id
t6-20020a056871020600b0018f2b14d686mr9225024oad.8.1683836569377; Thu, 11 May
2023 13:22:49 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Thu, 11 May 2023 13:22:49 -0700 (PDT)
In-Reply-To: <GCb7M.2700303$iS99.1096933@fx16.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:b141:ed72:1f40:88ff;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:b141:ed72:1f40:88ff
References: <u35prk$2ssbq$1@dont-email.me> <u36fd2$121nc$1@newsreader4.netcologne.de>
<2023May9.111344@mips.complang.tuwien.ac.at> <UQt6M.233407$qpNc.65909@fx03.iad>
<_Qu6M.539024$Olad.404121@fx35.iad> <Y7y6M.233411$qpNc.12100@fx03.iad>
<2023May10.100025@mips.complang.tuwien.ac.at> <LHP6M.2840676$9sn9.1828478@fx17.iad>
<2023May11.120936@mips.complang.tuwien.ac.at> <GCb7M.2700303$iS99.1096933@fx16.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <4745b049-0649-4744-8934-d99093c2eb68n@googlegroups.com>
Subject: Re: Load/Store with auto-increment
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Thu, 11 May 2023 20:22:49 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 4632

by: MitchAlsup - Thu, 11 May 2023 20:22 UTC

On Thursday, May 11, 2023 at 2:48:55 PM UTC-5, EricP wrote:
> Anton Ertl wrote:
> > EricP <ThatWould...@thevillage.com> writes:
> >> Anton Ertl wrote:
> > [..]
> >> But trying to come up with a circuit that can choose the optimum
> >> crossbar configuration in 1 clock proved too difficult at that time.
> >
> > How about taking >1 cycle? What would suffer in that case?
<
> Last week I was working on an allocator for my physical register file.
> In my uArch physical registers are allocated by Rename from a
> bit vector indicating which registers are free.
>
> The problem is how to do this concurrently for multiple lanes.
<
When I did this (1991 Mc88120), we had a 96-instruction execution
unit, and with 32 architectural registers this gave us a physical register
file of 128 entries; we had a 6-wide machine, so we divided the PRF
into 6 segments: 2 segments got 22 registers and 4 segments got 21.
<
Now the allocation problem becomes: if the instruction is routed to
FU[slot], then it gets FF1(PRF[slot]). This was all done in unary, and the
destination register of that instruction is written into the PRF CAM
at the allocated register. Subsequent accesses will match the CAM
and read the proper register.
<
One can make an FF1 as wide as 72 bits taking no more than 4 gate delays
using 4-input NANDs and 3-input NORs. But since we are scanning only a
22-bit field, the delay is 3 gates.
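
The segmented scheme can be sketched the same way. This is a behavioural model under assumptions, not the Mc88120 logic: it lays the segments out contiguously, gives the two larger segments to the first two lanes, and ignores the unary encoding and the CAM write. `segment_bounds` and `allocate_segmented` are hypothetical names.

```python
def segment_bounds(total: int = 128, lanes: int = 6) -> list:
    """Split `total` registers into `lanes` contiguous (base, size) segments."""
    base_size, extra = divmod(total, lanes)
    bounds, base = [], 0
    for lane in range(lanes):
        size = base_size + (1 if lane < extra else 0)  # first `extra` lanes get one more
        bounds.append((base, size))
        base += size
    return bounds


def allocate_segmented(free: int, bounds: list) -> list:
    """Each lane runs its own FF1 over its own segment; -1 if the segment is full."""
    picks = []
    for base, size in bounds:
        seg = (free >> base) & ((1 << size) - 1)  # this lane's slice of the free vector
        onehot = seg & -seg                       # FF1 within the segment only
        picks.append(base + onehot.bit_length() - 1 if onehot else -1)
    return picks


bounds = segment_bounds()            # 128 registers over 6 lanes
all_free = (1 << 128) - 1
picks = allocate_segmented(all_free, bounds)
```

No lane's selection depends on another lane's result, so the searches can run side by side. The cost is that a lane whose segment is exhausted gets nothing even if other segments still have free registers.
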
>
> Originally I used a priority selector doing a Find-First-1 to select
> the first free bit. The physical register file has status bits for
> each pReg, one of which is Free. The priority selector scans the free
> vector producing a one-hot output, and a one-hot to binary encoder
> converts that to a free phy reg number.
<
Check,
<
> An FF1 can be built with a gate delay that grows as log_4 of the number
> of scanned bits, so selecting a free reg from a set of 255 physical
> registers should take about 4-5 NAND/NOR gate delays.
>
> Originally to rename multiple lanes I had planned that the priority
> selectors would be serially chained: the one-hot output of the first
> FF1 masks out that bit from the input of the second FF1, and so on.
>
>   free bit vector
>   |------------------
>   v        |        |
>  FF1       |        |
>   |        v        |
>   |------>MASK      |
>   |        v        |
>   |       FF1       v
>   |        |------>MASK
>   |        |        v
>   |        |       FF1
>   v        v        v
> encode   encode   encode
>   v        v        v
> lane-0   lane-1   lane-2
>  free     free     free
>  reg#     reg#     reg#
>
Segmentation works easier.
>
> The problem is the propagation delay for multiple lanes uses
> up much of the available time in the Rename stage.
> Besides the free register selector there is other gate delay overhead
> in Rename so it really didn't look like this FF1 chaining approach
> would be usable beyond 2 or maybe 3 lanes.
>
> To rename 4 or more lanes, what I needed was a faster way to
> allocate resources in parallel, with some log_N gate delay growth.
>
> Continued next msg...
