comp.arch / State of the art non-linear FETCH

Subject -- Author

* State of the art non-linear FETCH -- MitchAlsup
+* Re: State of the art non-linear FETCH -- Quadibloc
|`* Re: State of the art non-linear FETCH -- MitchAlsup
| `- Re: State of the art non-linear FETCH -- robf...@gmail.com
+* Re: State of the art non-linear FETCH -- Thomas Koenig
|`- Re: State of the art non-linear FETCH -- MitchAlsup
+* Re: State of the art non-linear FETCH -- EricP
|`* Re: State of the art non-linear FETCH -- EricP
| +* Re: State of the art non-linear FETCH -- MitchAlsup
| |`* Re: State of the art non-linear FETCH -- robf...@gmail.com
| | `* Re: State of the art non-linear FETCH -- MitchAlsup
| |  `* Re: State of the art non-linear FETCH -- EricP
| |   `* Re: State of the art non-linear FETCH -- MitchAlsup
| |    `* Re: State of the art non-linear FETCH -- MitchAlsup
| |     +* Re: State of the art non-linear FETCH -- robf...@gmail.com
| |     |`* Re: State of the art non-linear FETCH -- MitchAlsup
| |     | `- Re: State of the art non-linear FETCH -- MitchAlsup
| |     `* Re: State of the art non-linear FETCH -- EricP
| |      +- Re: State of the art non-linear FETCH -- MitchAlsup
| |      +- Re: State of the art non-linear FETCH -- MitchAlsup
| |      +* Re: State of the art non-linear FETCH -- EricP
| |      |`* Re: State of the art non-linear FETCH -- MitchAlsup
| |      | `* Re: State of the art non-linear FETCH -- EricP
| |      |  `- Re: State of the art non-linear FETCH -- MitchAlsup
| |      `- Re: State of the art non-linear FETCH -- MitchAlsup
| `* Re: State of the art non-linear FETCH -- EricP
|  `- Re: State of the art non-linear FETCH -- robf...@gmail.com
`* Re: State of the art non-linear FETCH -- EricP
 `- Re: State of the art non-linear FETCH -- EricP

Re: State of the art non-linear FETCH

<e9b07a74-5193-47a1-8028-c8351663f368n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=34586&group=comp.arch#34586

 by: MitchAlsup - Wed, 18 Oct 2023 23:11 UTC

On Wednesday, October 18, 2023 at 1:21:20 PM UTC-5, EricP wrote:
> robf...@gmail.com wrote:
> > On Tuesday, October 17, 2023 at 5:05:48 PM UTC-4, EricP wrote:
> >> MitchAlsup wrote:
> >>> On Tuesday, October 10, 2023 at 1:46:36 PM UTC-5, MitchAlsup wrote:
> >>>> On Tuesday, October 10, 2023 at 12:23:49 PM UTC-5, EricP wrote:
> >>>>> MitchAlsup wrote:
> >>>>>> On Saturday, October 7, 2023 at 5:18:53 PM UTC-5, robf...@gmail.com wrote:
> >>>>>>> On Saturday, October 7, 2023 at 4:31:36 PM UTC-4, MitchAlsup wrote:
> >>>>>>>> On Saturday, October 7, 2023 at 1:46:25 PM UTC-5, robf...@gmail.com wrote:
> >>>>>>>> <
> >>>>>>>>> The primary stream I$ is likely to use up most of the available fetch spots. What
> >>>>>>>>> if there is another branch in ALT-I$? Is there going to be yet another ALT2-I$ cache?
> >>>>>>>>> How many alternate branch paths are going to be supported?
> >>>>>>>> <
> >>>>>>>> I expect the I cache to feed the sequential nature of instruction flow and the ALT cache
> >>>>>>>> to service the excursions from the sequential flow. What I don't want is to add cycles
> >>>>>>>> between the fetches and the delivery to instruction queues. My 1-wide machine has
> >>>>>>>> 2 cycles between an instruction being flopped after the SRAM sense amplifiers and being
> >>>>>>>> issued into execution. I want this 6-wide machine to have only 1 more::
> >>>>>>>> <
> >>>>>>>> FETCH--BUFFER--CHOOSE--DECODE--ISSUE--Execute--CACHE --ALIGN--WAIT --WRITE
> >>>>>>>> The output of Choose is 6 instructions,
> >>>>>>>> The output of Issue is 6 instructions with register and constant operands at their
> >>>>>>>> .....respective instruction queue.
> >>>>>>> At some point a selection between I$ and ALT-I$ clusters must be made. Is this the purpose
> >>>>>>> of choose? I would think the choice would be made after EXECUTE once the branch
> >>>>>>> outcome is determined.
> >>>>>> <
> >>>>>> In this thread we are anticipating 6-8 instructions issued per cycle. Under these circumstances
> >>>>>> prediction happens prior to Issue, and branch resolution cleans up the data-flow later. {{The 1-wide in-
> >>>>>> order µArchitecture would do as you suggest, but since it fetches way far ahead,
> >>>>>> it can use that I$ BW to prefetch branch targets it has not yet decoded (scan ahead)}}.
> >>>>>> <
> >>>>>> Once you get to 6-wide you have to be fetching and alternate fetching simultaneously.
> >>>>>> Once you get to 8-wide you might have to start doing 2 predictions per cycle....
> >>>> <
> >>>>> A 6-instruction fetch, say average 5 bytes each = 30 bytes per fetch.
> >>>>> Let's call it 32 bytes. So I$L1 needs to feed fetch an aligned 32-byte
> >>>>> fetch block per clock to sustain fetching with no bubbles.
> >>>> <
> >>>> Consider that the IP Adder creates IP+some-power-of-2 and IP+16+some-
> >>>> power-of-2. Now we are in a position to fetch 8 words as::
> >>>> a) 0-1
> >>>> b) 1-2
> >>>> c) 2-3
> >>>> d) 3-0 of the next IP
> >>>> So, for an additional carry chain, you suffer no gross alignment problems.
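
A minimal C sketch of that quadrant addressing, assuming 16-byte quadrants of
a 64-byte line; the function and variable names are illustrative, not from the
post:

#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

/* A 32B fetch window is two adjacent 16B quadrants; only when the
   window starts in quadrant 3 does the second access carry into the
   next cache line (case d above). */
void quadrant_pair(uint64_t ip, uint64_t quad[2])
{
    quad[0] = ip & ~(uint64_t)15;   /* 16B quadrant holding IP */
    quad[1] = quad[0] + 16;         /* the IP+16 adder output supplies
                                       the carry into the next line  */
}

int main(void)
{
    uint64_t q[2];
    quadrant_pair(0x1038, q);       /* IP starts in quadrant 3 */
    printf("%#" PRIx64 " %#" PRIx64 "\n", q[0], q[1]);  /* 0x1030 0x1040 */
    return 0;
}

For an IP in quadrant 3, the second quadrant address is quadrant 0 of the
next line, which is exactly case (d).
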
> >>>>> We also have to deal with the I$L1 read access latency.
> >>>>> Let's say I$L1 has a pipelined latency of 3 and a throughput of 1.
> >>>> <
> >>>> Then you made it too big !! And why should each SRAM in a cache line
> >>>> not be accessible independently (that is, throughput = 4 or some higher power
> >>>> of 2)?
> >>> <
> >>> I need to elaborate here::
> >>> <
> >>> A D$ may have a 3-cycle latency and 1 throughput per bank but this has
> >>> logic in it that the I$ does not need--that is the byte alignment logic
> >>> that takes the 3rd cycle. I$ can deliver aligned chunks in 2 cycles of
> >>> the same metrics.
> >> Ok. I thought the I$L1 read pipeline would always be 3 stages
> >> as it didn't look like there would be enough slack in stage 3 to do
> >> much of anything else (that being instruction alignment and parse).
> >> - cache address decode, latch
> >> - word line drive, bit line drive, sense amp, latch
> >> - tag match, way select mux, drive to prefetch buffer, latch
> >>
> >> So just following my train of thought, it might as well read a whole
> >> cache line into the prefetch buffers for each cache access. Out of the
> >> prefetch buffer it reads one or two 32B blocks for consecutive
> >> virtual addresses.
> >>
> >> The 1 or 2 blocks go to the block aligner which uses the FetchRIP low
> >> address bits to shift from 0 to 7 instruction words.
> >>
> >> That result feeds to the Log2 Parser which selects up to 6 instructions
> >> from those source bytes.
> >>
> >> Fetch Line Buf 0  (fully assoc index)
> >> Fetch Line Buf 1
> >> Fetch Line Buf 2
> >> Fetch Line Buf 3
> >>       v          v
> >>   32B Blk1   32B Blk0
> >>       v          v
> >>  Alignment Shifter 8:1 muxes
> >>             v
> >>  Log2 Parser and Branch Detect
> >>   v    v    v    v    v    v
> >>  I5   I4   I3   I2   I1   I0
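
A behavioral C sketch of that align-then-parse path. The 2-bit length field
below is an assumption for illustration only (the real My 66000 length decode
is not specified here); just the shift-by-0..7-words and up-to-6-selections
shape follows the post:

#include <stdint.h>

enum { WORDS_PER_BLK = 8, MAX_ISSUE = 6 };

/* Hypothetical length decode: assume the first word's top two bits
   give 0..2 extra constant words, so instructions are 1..3 words. */
static int inst_len_words(uint32_t w)
{
    int extra = (w >> 30) & 3;
    return 1 + (extra > 2 ? 2 : extra);
}

/* Concatenate the two fetched 32B blocks, start at the FetchRIP word
   offset (the aligner's 0..7 word shift), and mark up to 6 instruction
   start points for decode. Returns how many were found this cycle. */
int parse_window(const uint32_t blk0[WORDS_PER_BLK],   /* lower address  */
                 const uint32_t blk1[WORDS_PER_BLK],   /* next 32B block */
                 unsigned rip_word_off,                /* 0..7           */
                 int start_word[MAX_ISSUE])
{
    uint32_t win[2 * WORDS_PER_BLK];
    for (int i = 0; i < WORDS_PER_BLK; i++) {
        win[i] = blk0[i];
        win[i + WORDS_PER_BLK] = blk1[i];
    }

    int n = 0;
    unsigned w = rip_word_off;
    while (n < MAX_ISSUE && w < 2 * WORDS_PER_BLK) {
        int len = inst_len_words(win[w]);
        if (w + (unsigned)len > 2 * WORDS_PER_BLK)
            break;                 /* tail instruction not fully present */
        start_word[n++] = (int)w;
        w += (unsigned)len;
    }
    return n;
}
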
> >>> <
> >>> The only thing getting in the way, here, of $ size is wire delay. Data/Inst/Tag
> >>> SRAMs only need the lower order ~16(-3) bits of address which comes out
> >>> of the adder at least 4 gate delays before the final HOB resolves. Thus, the
> >>> addresses reach SRAM flip-flops by clock edge. SRAM access takes 1 full
> >>> clock. Tag comparison (hit) output multiplexing, byte alignment, and result
> >>> bus drive take the 3rd cycle. Over in I$ land, output multiplexing chooses
> >>> which data goes into the Inst Buffer, while the IB is supplying instructions
> >>> into PARSE/DECODE. While in IB, instructions are scanned for branches and
> >>> their targets fetched (this is the 1-wide machine)--in the 6-wide machine we
> >>> are getting close to the point where in spaghetti codes all instructions might
> >>> come from the ALT cache. Indeed, the Mc 88120 had its packet cache arranged
> >>> to supply all instructions and missed instructions flowed through the C$. Here
> >>> we used a 4-banked D$ and a 6-ported C$ to deal with all memory ordering
> >>> problems and would only deliver LDs that had resolved all dependencies.
> >> That was one of the reasons for suggesting a 32B block as the fetch unit.
> >> Your My66k instruction words (IWd) are 4 bytes, aligned, and instructions
> >> can be 1 to 3 IWd long. If the average is 5 bytes per instruction then a
> >> 32B block holds about 6 instructions, which is also about a basic block
> >> size and on average would contain 1 branch.
> >>
> >> So by attaching an optional next-alt-fetch physical address to each
> >> 32B block in a 64B cache line, then it can specify both the sequential
> >> and alt fetch path start to follow.
> >>
> >> That allows 1 or 2 Fetch Line Bufs to be loaded with the alt path
> >> so when the Log2 Parser sees a branch it can immediately switch to
> >> parsing instructions from the alt path on the next clock.
> >> If two cache lines from the alt path are loaded, that's 4 prefetch blocks
> >> which should be enough to cover the I$ read pipeline latency.
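
A minimal C sketch of that per-block annotation; the field names and the
valid bit are illustrative guesses:

#include <stdint.h>
#include <stdbool.h>

/* Each 32B fetch block optionally names the physical address its
   alternate (taken-branch) path starts at; cold blocks have none. */
typedef struct {
    uint8_t  bytes[32];   /* the instruction words of this block    */
    uint64_t alt_paddr;   /* physical start of the alt fetch path   */
    bool     alt_valid;   /* set once fetch has back-annotated it   */
} fetch_blk_t;

typedef struct {
    fetch_blk_t blk[2];   /* a 64B I$ line = two 32B fetch blocks   */
} icache_line_t;

When fetch reads blk[i] and alt_valid is set, the Fetch Line Bufs can begin
filling from alt_paddr, letting the Log2 Parser switch paths on the next
clock as described.
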
> >>
> >> The BTB and Return Stack Predictor are also providing virtual address
> >> prefetch predictions, so that has to be integrated in this too.
> >>> <
> >>> BUT DRAM was only 5 cycles away (50ns) so the smallish caches and lack
> >>> of associativity did not hurt all that much. This cannot be replicated now
> >>> mainly due to wire delay and balancing the several hundred cycles to DRAM
> >>> with L2 and L3 caching.
> >
> > What if only some alt paths are supported? Could alt paths be available only for
> > forward branches? How does the number of alt paths supported impact
> > performance?
> For alt paths to be prefetched requires some kind of trigger.
> Current systems seem to use the BTB as a trigger and, from what I read,
> this can make the BTB structures quite large.
> For zero bubble fetching, the trigger has to be
> set to go off far enough ahead to hide the bubble.
> More paths, farther ahead, means larger structures and slower access.
> > How does the alt-path get set in the fetch block? When things are cold the alt path
> > may not be set correctly. So, it would need to be updated in the I$; I$ line invalidate
> > required then?
> Existing Intel and AMD x86 designs have fetch mark instructions
> with parser start and stop bits, so there are pathways to back-propagate
> from fetch-parse to I$ in current designs.
>
> This might be done as follows: when fetch reads an I$ [row,set,block] entry,
> the coordinates come along with the instruction packet, and we save them in fetch.
> This saves going through the index and tag structures to update that entry.
>
> Fetch could keep a 4-entry circular buffer with the [row,set,block] of the
> 32B blocks it just fetched from. It would just write the physical address
> of the current block to the auxiliary fields of the block fetched 4 clocks ago.
>
> This update would be stored in a pending update buffer in cache
> and written back when there was an unused I$ read or write cycle.
> (And if the cache data area is also write pipelined then multiple pending
> update buffers and coordination logic would be required).
>
> This might imply frequent write updates to the I$.
> The problem is finding an unused I$ cycle. If we are fetching both the
> sequential and possibly 1 alternate path then that might fully saturate
> the I$ access bandwidth. And there are also prefetch writes to load
> cache lines into I$L1 from I$L2.
<
> > I have thought of placing the alt paths for the top four alternate
> > paths in a function header rather than for every block in the function. It would
> > reduce the I$ storage requirements. Then fetch from all four paths and use a
> > branch instruction to select the path. It would mean that only a limited number of
> > alt paths are supported but for many small functions a small number may be
> > adequate.
<
> The way I'm thinking, fetch buffers would be an expensive resource,
> fully assoc index, really a small I$L0 cache, so the buffers would
> be recycled quite quickly and not be prefetched too far ahead.
<
When big enough to cover the latency from arrival to insertion into the
reservation station, you have enough buffers.
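
A compact C sketch of that 4-entry circular history; the pending-update
buffer itself is elided, and the field widths are guesses:

#include <stdint.h>

/* I$ coordinates of a 32B block, as suggested above. */
typedef struct { uint16_t row; uint8_t set; uint8_t blk; } icoord_t;

typedef struct {
    icoord_t hist[4];   /* blocks fetched in the last 4 clocks */
    unsigned head;
} fetch_hist_t;

/* On each fetch: the entry being overwritten is the block fetched
   4 clocks ago -- its alt field should receive the current block's
   physical address via the pending-update buffer. */
icoord_t hist_push(fetch_hist_t *h, icoord_t now)
{
    icoord_t oldest = h->hist[h->head & 3];
    h->hist[h->head & 3] = now;
    h->head++;
    return oldest;      /* coordinates to back-annotate */
}
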


Re: State of the art non-linear FETCH

<s6aYM.102497$8fO.288@fx15.iad>

https://news.novabbs.org/devel/article-flat.php?id=34588&group=comp.arch#34588

 by: EricP - Thu, 19 Oct 2023 13:24 UTC

MitchAlsup wrote:
> On Wednesday, October 18, 2023 at 1:21:20 PM UTC-5, EricP wrote:

By the way, this article talks a bit about AMD and zero bubble fetching.

https://chipsandcheese.com/2023/10/08/zen-5s-leaked-slides/

They don't define what "zero bubble" means: zero clock bubbles or
zero instruction bubbles. The difference matters when one is decoding
4 instructions/clock.

Re: State of the art non-linear FETCH

<bc407bed-1125-4d50-aa16-8e28b2ce8c8cn@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=34589&group=comp.arch#34589

 by: MitchAlsup - Thu, 19 Oct 2023 18:08 UTC

On Thursday, October 19, 2023 at 8:24:45 AM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Wednesday, October 18, 2023 at 1:21:20 PM UTC-5, EricP wrote:
> By the way, this article talks a bit about AMD and zero bubble fetching.
>
> https://chipsandcheese.com/2023/10/08/zen-5s-leaked-slides/
>
> They don't define what "zero bubble" means: zero clock bubbles or
> zero instruction bubbles. The difference matters when one is decoding
> 4 instructions/clock.
<
Zero bubble means taken branches appear to take 1 cycle.
<
And AMD's front end seems to follow my general guesstimate of how
I want to push forward.
<
Thanks EricP

Re: State of the art non-linear FETCH

<51467b3d7855fc1ab414da0245bbf43d@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=35451&group=comp.arch#35451

 by: MitchAlsup - Tue, 5 Dec 2023 21:43 UTC

EricP wrote:

> MitchAlsup wrote:
>>
> That result feeds to the Log2 Parser which selects up to 6 instructions
> from those source bytes.

> Fetch Line Buf 0  (fully assoc index)
> Fetch Line Buf 1
> Fetch Line Buf 2
> Fetch Line Buf 3
>       v          v
>   32B Blk1   32B Blk0
>       v          v
>  Alignment Shifter 8:1 muxes
>             v
>  Log2 Parser and Branch Detect
>   v    v    v    v    v    v
>  I5   I4   I3   I2   I1   I0

Been thinking of this in the background the last month. It seems to me that
a small fetch-predictor is in order.

This fetch-predictor makes use of the natural organization of the ICache as
a matrix of SRAM macros (of some given size:: say 2KB), each SRAM macro having
a ¼-line access width. Let us call this the horizontal direction. In the
vertical direction we have sets (or ways if you prefer).

Each SRAM macro (2KB) is 128 bits wide by 128 words deep, so we need a 7-bit index.
Each SRAM column has {2,3,4,...} SRAM macros {4=16KB ICache}; so we need
{2,3,..}-bits of set-index.

Putting 4 of these index sets together gives us a (7+3)×4 = 40-bit fetch-
predictor entry, plus a few bits for state and control. {{We may need to
add a field used to access the fetch-predictor for the next cycle}}.
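
A C sketch of such an entry, with the 7- and 3-bit widths from above; the
width of the next field and the state bits are placeholders:

#include <stdint.h>

/* One fetch-predictor entry: four {set, index} pairs, one per
   quarter-line SRAM column, plus the next-prediction field. */
typedef struct {
    unsigned index : 7;   /* word line within the 2KB SRAM macro */
    unsigned set   : 3;   /* which macro (way) in this column    */
} quad_sel_t;

typedef struct {
    quad_sel_t q[4];      /* (7+3) x 4 = 40 bits of selection    */
    uint16_t   next;      /* next fetch-predictor entry to read  */
    uint8_t    state;     /* a few bits of state and control     */
} fetch_pred_t;
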

We are now in a position to access 4×¼ = 1 cache line (16 words) from the
matrix of SRAM macros.

Sequential access:
It is easy to see that one can access 16 words (16 potential instructions)
in a linear sequence even when the access crosses a cache line boundary.

Non-sequential access:
Given a 6-wide machine (and known instruction statistics wrt VLE utilization)
and the assumption of 1 taken branch per issue-width:: the fetch-predictor
accesses 4 SRAM macros indexing the macro with the 7-bit index, and choosing
the set from the 3-bit index. {We are accessing a set-associative cache as if
it were directly mapped.}
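
Continuing the sketch above, a hypothetical read of 16 words under one
prediction; the 4 columns x 8 sets x 128 rows SRAM shape is an assumption:

/* Each column's macros are addressed independently, so the four
   quarter-lines need not come from the same set or even the same
   line -- the "set-associative accessed as direct mapped" behavior.
   sram[col][set][idx] models one 4-word (128-bit) macro row. */
void predicted_fetch(const uint32_t sram[4][8][128][4],
                     const fetch_pred_t *p,
                     uint32_t out[16])
{
    for (int col = 0; col < 4; col++)
        for (int w = 0; w < 4; w++)
            out[col * 4 + w] = sram[col][p->q[col].set][p->q[col].index][w];
}
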

Doubly non-sequential access:
There are many occurrences where there are a number of instructions on the
sequential path, a conditional branch to a small number of instructions on
the alternate path ending with a direct branch to somewhere else. We use
the next fetch-predictor access field so that this direct branch does not
incur an additional cycle of fetch (or execute) latency. This direct branch
can be a {branch, call, or return}.

Ramifications:
When instructions are written into the ICache, they are positioned in a set
which allows the fetch-predictor to access the sequential path of instructions
and the alternate path of instructions.

All instructions are always fetched from the ICache, which has been organized
for coherence by external SNOOP activities, so there is minimal excess state
and no surgery at context switching or the like.

ICache placement ends up dependent on the instructions being written in accord
with how control flow arrived at this point (satisfying the access method
above).

This organization satisfies several "hard" cases::

a) 3 ST instructions each 5 words in size: the ICache access supplies 16 words;
all 4×¼ accesses are sequential but may span cache line boundaries and set
placements. These sequences are found in subroutine prologues setting up local
variables with static assignments on the stack. The proposed machine can only
perform 3 memory references per cycle, so this seems to be a reasonable balance.

b) One can process sequential instructions up to a call and several instructions
at the call-target in the same issue cycle. The same can transpire on return.

c) Should a return find a subsequent call (after a few instructions), both the
EXIT and the ENTER instructions can be cut short because all the preserved
registers are already where they need to be on the call/return stack; this takes
fewer cycles wandering around the call/return tree.

So:: the fetch-predictor contains 5 accesses: 4 to the ICache for instructions
and 1 to itself for the next fetch-prediction.

{ set[0] column[0]   set[1] column[1]   set[2] column[2]   set[3] column[3]   next }
    |    +-------+      |   +-------+      |   +-------+      |   +-------+     |   +-------+
    |    |       |      +-->|       |      |   |       |      |   |       |     +-->|       |
    |    +-------+          +-------+      |   +-------+      |   +-------+         +-------+
    +--> |       |          |       |      |   |       |      |   |       |
         +-------+          +-------+      |   +-------+      |   +-------+
         |       |          |       |      +-->|       |      +-->|       |
         +-------+          +-------+          +-------+          +-------+
         |       |          |       |          |       |          |       |
         +-------+          +-------+          +-------+          +-------+
             |                  |                  |                  |
             V                  V                  V                  V
          inst[0]            inst[1]            inst[2]            inst[3]

The instruction groups still have to be "routed" into some semblance of order
but this can take place over the 2 or 3 decode cycles.

All of the ICache tag checking is performed "later" in the pipeline, taking
tag-check and selection multiplexing out of the instruction delivery path.
