devel / comp.arch / State of the art non-linear FETCH

Subject -- Author
* State of the art non-linear FETCH -- MitchAlsup
+* Re: State of the art non-linear FETCH -- Quadibloc
|`* Re: State of the art non-linear FETCH -- MitchAlsup
| `- Re: State of the art non-linear FETCH -- robf...@gmail.com
+* Re: State of the art non-linear FETCH -- Thomas Koenig
|`- Re: State of the art non-linear FETCH -- MitchAlsup
+* Re: State of the art non-linear FETCH -- EricP
|`* Re: State of the art non-linear FETCH -- EricP
| +* Re: State of the art non-linear FETCH -- MitchAlsup
| |`* Re: State of the art non-linear FETCH -- robf...@gmail.com
| | `* Re: State of the art non-linear FETCH -- MitchAlsup
| |  `* Re: State of the art non-linear FETCH -- EricP
| |   `* Re: State of the art non-linear FETCH -- MitchAlsup
| |    `* Re: State of the art non-linear FETCH -- MitchAlsup
| |     +* Re: State of the art non-linear FETCH -- robf...@gmail.com
| |     |`* Re: State of the art non-linear FETCH -- MitchAlsup
| |     | `- Re: State of the art non-linear FETCH -- MitchAlsup
| |     `* Re: State of the art non-linear FETCH -- EricP
| |      +- Re: State of the art non-linear FETCH -- MitchAlsup
| |      +- Re: State of the art non-linear FETCH -- MitchAlsup
| |      +* Re: State of the art non-linear FETCH -- EricP
| |      |`* Re: State of the art non-linear FETCH -- MitchAlsup
| |      | `* Re: State of the art non-linear FETCH -- EricP
| |      |  `- Re: State of the art non-linear FETCH -- MitchAlsup
| |      `- Re: State of the art non-linear FETCH -- MitchAlsup
| `* Re: State of the art non-linear FETCH -- EricP
|  `- Re: State of the art non-linear FETCH -- robf...@gmail.com
`* Re: State of the art non-linear FETCH -- EricP
 `- Re: State of the art non-linear FETCH -- EricP

State of the art non-linear FETCH

 by: MitchAlsup - Thu, 5 Oct 2023 22:35 UTC

I am looking for modern papers on the process of fetching
non-linear sets of instructions where the first instructions
in an ISSUE group come from a sequential long fetch and
other instructions arrive from some other mechanism {such
as a branch target cache}.

In 1991 I did what we called a packet cache FETCH mechanism.
Instructions which were not in the packet cache were fetched
linearly from the unified cache, and when a branch was encountered,
the branch target was also fetched and the instructions
were pasted together and then placed en masse into the packet
cache. The instructions were pre-routed to their function unit and
register fields were expanded by 1 bit so we could directly annotate
intra-packet forwarding, use-once results, and a few other things.
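
To make the 1991 mechanism concrete, here is a minimal C sketch of what a
packet-cache entry might hold; the field names and widths are illustrative
assumptions, not the actual design:

#include <stdint.h>

/* Hypothetical packet-cache entry: 6 pre-routed instruction slots.
   Register fields are widened by one bit so the MSB can mark a
   source as an intra-packet forward (a use-once result) rather
   than an architectural register. */
typedef struct {
    uint8_t  fu;          /* function unit this slot was pre-routed to */
    uint8_t  src1, src2;  /* reg number + 1 annotation bit             */
    uint8_t  dst;
    uint8_t  use_once;    /* result consumed only inside this packet   */
    uint32_t op;          /* pre-decoded opcode bits                   */
} PacketSlot;

typedef struct {
    uint64_t   tag;       /* fetch address the packet was built from   */
    uint8_t    count;     /* valid slots (3.6 of 6 on average)         */
    PacketSlot slot[6];
} PacketEntry;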

In 2000 I did what we called a Trace Cache FETCH mechanism.
Unhappy with the instruction packing density of the packet cache
above (3.6 instructions per packet which holds 6), and here faced
with the myriad of x86-64-isms and the extreme speed (5GHz
in 65nm), we chose to fetch 8 instructions per cycle and issue
4 on the first clock and 4 on the second clock. We played a bunch
of routing games (sacrificing the vertical neighbor instruction to
be an associated long constant, and other packing games...).
This system worked rather well but still sucked for instruction
packing density. However, comparing the packet cache with
6 instructions and the trace cache with 8:: the trace cache needs
2 branch predictions per cycle while the packet cache only needs
1 branch prediction per cycle.
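
The prediction-bandwidth difference falls out of the line formats. A
schematic guess at the two layouts (not the actual designs):

#include <stdint.h>

/* Packet cache: one sequential run plus one pasted-in branch target
   per 6-slot packet, so ONE prediction steers the whole packet.      */
typedef struct { uint32_t inst[6]; uint64_t pasted_target; } Packet;

/* Trace cache: an 8-slot line can cross two predicted-taken branches,
   so following it consumes TWO predictions every cycle.              */
typedef struct { uint32_t inst[8]; uint64_t next_line[2]; } TraceLine;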

So, faced with a 6-8-wide issue machine, and wanting to avoid
having to pre-route instructions to FUs, and further wanting to
accept having several instructions target a single FU and still
be issued together in a single cycle, neither of the above mechanisms
looks like it would have traction.

Now, remember this is for a RISC machine with variable-length
instructions (but only constants extend the length of instructions),
so slightly different variables are in play.

a) So to continuously issue 6-I/C I probably have to FETCH 8
on a linear path
b) fetch from a branch target buffer for a first alternate path
set of instructions
c) fetch from a branch target target buffer for a second
alternate path set of instructions
d) there is going to be some kind of buffer between issue and
the reservation stations -- so, when a stream of instructions
all target the FMAC unit, they get issued 6 at a time and we
"go find" subsequent instructions not targeting the FMAC unit.
e) ½ of issue width will have a memory unit (3 on the 6-wide;
4 on the 8-wide) so there will be sufficient memory bandwidth
for the 30% memory instruction density of the instruction stream
and for those cases where it is higher.

f) instruction scheduling will be more like that of a Scoreboard
than of a reservation station so I can control the time of start
and the time of completion independently; but registers will
be completely renamed, making this choice a sequencing issue
rather than a correctness issue (see the sketch after this list).

g) there will be a set of Memory Buffers that are used to solve
memory ordering problems, obeying the nuanced rules of memory
order where one end has causal order and the other end has strong
ordering; this buffer is designed to perform memory references
based on the order specified in the TLB and on whether ATOMIC stuff
is happening.

h) the overall datapath is designed to perform MATRIX300 at
1M×1M as fast as the memory system can deliver data, and
in particular be stall-free for matrices which fit in the L2 cache.
When we did this in 1991 we were getting 5.9 I/C for MATRIX300
and the packet cache was getting 2.2 for XLISP ... without any
SIMD stuff.
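
For item f), a minimal sketch of scoreboard-style scheduling where start
and completion are picked independently; every structure and name here is
invented for illustration:

#include <stdint.h>

typedef struct {
    uint8_t fu;        /* function unit this op is routed to      */
    int     start;     /* cycle execution begins                  */
    int     finish;    /* cycle the result writes back, chosen    */
                       /* separately from 'start'                 */
} SbEntry;

/* With registers fully renamed, picking 'start' and 'finish' is a
   sequencing decision (avoiding FU and result-bus conflicts),
   never a correctness decision. */
int schedule(SbEntry *e, int now, int fu_free_at, int bus_free_at,
             int latency)
{
    e->start  = (now > fu_free_at) ? now : fu_free_at;
    e->finish = e->start + latency;
    if (e->finish < bus_free_at)     /* writeback slot already taken? */
        e->finish = bus_free_at;     /* slide completion later        */
    return e->finish;
}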

It is easy to fetch enough linear instructions so that when there
are no branches one can issue 6-8-wide per cycle continuously.
On the other hand, when branches ARE present the game changes
completely. Where do you fetch the instructions at the target ?
How did you get an address of the branch target so you could
fetch there ?? How do you know how many instructions from the
sequential path get issued before instructions on the non-sequential
path(s) get issued ???
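
To make the stated problem concrete: a toy next-fetch predictor in C,
indexed by nothing but the current fetch-group address, which must name
the next group before the branch bits are even visible to decode. Sizes
and fields are assumptions for illustration:

#include <stdint.h>
#include <stdbool.h>

#define BTB_ENTRIES 1024
#define GROUP_BYTES 32            /* 8 words fetched per cycle */

typedef struct {
    uint64_t tag;                 /* fetch-group address             */
    uint64_t target;              /* predicted non-sequential target */
    bool     taken;
} BtbEntry;

static BtbEntry btb[BTB_ENTRIES];

/* The branch itself is still 2-4 cycles from being decoded; the only
   input available is the address being fetched right now. */
uint64_t next_fetch(uint64_t group_addr)
{
    BtbEntry *e = &btb[(group_addr / GROUP_BYTES) % BTB_ENTRIES];
    if (e->tag == group_addr && e->taken)
        return e->target;                 /* follow the alternate path */
    return group_addr + GROUP_BYTES;      /* sequential long fetch     */
}

/* Trained after the fact, when the branch actually resolves. */
void train(uint64_t group_addr, uint64_t target, bool taken)
{
    BtbEntry *e = &btb[(group_addr / GROUP_BYTES) % BTB_ENTRIES];
    e->tag = group_addr; e->target = target; e->taken = taken;
}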

Anyone have pointers to current literature ?

Mitch

Re: State of the art non-linear FETCH

 by: Quadibloc - Fri, 6 Oct 2023 02:01 UTC

On Thursday, October 5, 2023 at 4:35:20 PM UTC-6, MitchAlsup wrote:

> On the other hand, when branches ARE present the game changes
> completely. Where do you fetch the instructions at the target ?
> How did you get an address of the branch target so you could
> fetch there ?? How do you know how many instructions from the
> sequential path get issued before instructions on the non-sequential
> path(s) get issued ???

For starters, I doubt very much any of the literature would use a term
like "non-linear fetch" to refer to this. Because "non-linear" is usually
used when the basic problem is that one is dealing with x^2 or worse
instead of ax+b, and this is a completely different problem domain.

What you seem to be asking about is current literature on minimizing the
costs associated with branching.

What I've understood is that the current state of the art is:

- if branches are unconditional, then just calculate the branch target
as far ahead of time as possible in the pipeline (i.e. using interlocks
to be sure the base register doesn't change out from under you)...

- in the case of conditional branches, speculatively doing the branch
the "most likely" way (branch prediction) or allocating extra registers
to use spare superscalar and SMT capacity to do both paths at once
has been the technique... with worrying about Meltdown and Spectre
and company the latest wrinkle.

And then there's predication as another useful technique.
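
For example, predication turns a short branch into a data dependency,
leaving the front end nothing to predict (a compiler-style illustration,
not any particular ISA):

/* Branchy form: fetch must predict which way 'a > b' goes. */
int max_branchy(int a, int b) {
    if (a > b) return a;
    return b;
}

/* Predicated form: both values exist and a select picks one,
   so there is no branch in the instruction stream at all. */
int max_predicated(int a, int b) {
    int cond = (a > b);            /* 0 or 1 */
    return cond * a + (1 - cond) * b;
}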

Has anything _new_ been discovered to help with code with branches?
That's been described in the open literature?

That's a good question, and I don't know the answer.

John Savard

Re: State of the art non-linear FETCH

 by: MitchAlsup - Fri, 6 Oct 2023 02:32 UTC

On Thursday, October 5, 2023 at 9:01:22 PM UTC-5, Quadibloc wrote:
> On Thursday, October 5, 2023 at 4:35:20 PM UTC-6, MitchAlsup wrote:
>
> > On the other hand, when branches ARE present the game changes
> > completely. Where do you fetch the instructions at the target ?
> > How did you get an address of the branch target so you could
> > fetch there ?? How do you know how many instructions from the
> > sequential path get issued before instructions on the non-sequential
> > path(s) get issued ???
> For starters, I doubt very much any of the literature would use a term
> like "non-linear fetch" to refer to this. Because "non-linear" is usually
> used when the basic problem is that one is dealing with x^2 or worse
> instead of ax+b, and this is a completely different problem domain.
>
> What you seem to be asking about is current literature on minimizing the
> costs associated with branching.
>
> What I've understood is that the current state of the art is:
>
> - if branches are unconditional, then just calculate the branch target
> as far ahead of time as possible in the pipeline (i.e. using interlocks
> to be sure the base register doesn't change out from under you)...
<
Ok, I know this, but how do you calculate the address of the branch target
in the same cycle you fetch the group of instructions containing that branch ??
That is, you are 2-4 cycles in front of where you see the branch as an instruction
and its displacement. So, you have to somehow predict the branch target addr
based on nothing but the address you are currently fetching.
<
And the conditionality does not matter as much, as you can bet both ways
in the case of a conditional--set yourself up for when the conditional is not
taken and set yourself up for when the conditional is taken, and then
predict one course of events or the other.
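
A sketch of that bet-both-ways setup, with both continuations kept alive
until the condition resolves; all names are made up for illustration:

#include <stdint.h>
#include <stdbool.h>

typedef struct { uint64_t fetch_pc; } Stream;

/* Prepare fetch state for BOTH outcomes of a conditional branch;
   prediction picks one to run ahead with, and when the branch
   resolves the losing stream is simply dropped. */
Stream choose_stream(uint64_t branch_pc, uint64_t target,
                     bool predict_taken)
{
    Stream seq = { branch_pc + 4 };    /* not-taken continuation */
    Stream alt = { target };           /* taken continuation     */
    return predict_taken ? alt : seq;
}
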
>
> - in the case of conditional branches, speculatively doing the branch
> the "most likely" way (branch prediction) or allocating extra registers
> to use spare superscalar and SMT capacity to do both paths at once
> has been the technique... with worrying about Meltdown and Spectre
> and company the latest wrinkle.
<
I have already figured out how to do the register stuff that deals with
1-way, that-way, and the other way.
>
> And then there's predication as another useful technique.
>
> Has anything _new_ been discovered to help with code with branches?
> That's been described in the open literature?
>
> That's a good question, and I don't know the answer.
>
> John Savard

Re: State of the art non-linear FETCH

 by: robf...@gmail.com - Fri, 6 Oct 2023 02:53 UTC

On Thursday, October 5, 2023 at 10:32:41 PM UTC-4, MitchAlsup wrote:
> On Thursday, October 5, 2023 at 9:01:22 PM UTC-5, Quadibloc wrote:
> > On Thursday, October 5, 2023 at 4:35:20 PM UTC-6, MitchAlsup wrote:
> >
> > > On the other hand, when branches ARE present the game changes
> > > completely. Where do you fetch the instructions at the target ?
> > > How did you get an address of the branch target so you could
> > > fetch there ?? How do you know how many instructions from the
> > > sequential path get issued before instructions on the non-sequential
> > > path(s) get issued ???
> > For starters, I doubt very much any of the literature would use a term
> > like "non-linear fetch" to refer to this. Because "non-linear" is usually
> > used when the basic problem is that one is dealing with x^2 or worse
> > instead of ax+b, and this is a completely different problem domain.
> >
> > What you seem to be asking about is current literature on minimizing the
> > costs associated with branching.
> >
> > What I've understood is that the current state of the art is:
> >
> > - if branches are unconditional, then just calculate the branch target
> > as far ahead of time as possible in the pipeline (i.e. using interlocks
> > to be sure the base register doesn't change out from under you)...
> <
> Ok, I know this, but how do you calculate the address of the branch target
> in the same cycle you fetch the group of instructions containing that branch ??
> That is, you are 2-4 cycles in front of where you see the branch as an instruction
> and its displacement. So, you have to somehow predict the branch target addr
> based on nothing but the address you are currently fetching.
> <
> And the conditionality does not matter as much, as you can bet both ways
> in the case of a conditional--set yourself up for when the conditional is not
> taken and set yourself up for when the conditional is taken, and then
> predict one course of events or the other.
> >
> > - in the case of conditional branches, speculatively doing the branch
> > the "most likely" way (branch prediction) or allocating extra registers
> > to use spare superscalar and SMT capacity to do both paths at once
> > has been the technique... with worrying about Meltdown and Spectre
> > and company the latest wrinkle.
> <
> I have already figured out how to do the register stuff that deals with
> 1-way, that-way, and the other way.
> >
> > And then there's predication as another useful technique.
> >
> > Has anything _new_ been discovered to help with code with branches?
> > That's been described in the open literature?
> >
> > That's a good question, and I don't know the answer.
> >
> > John Savard

How about using absolute addresses for branches? No need to calculate the
target then; just load the next PCs. The target addresses could be placed in an
address-specifying instruction before the branch. If fetching is going to be
done along multiple streams of instructions at the same time, there is no
need to know the branch outcome until a final select is done.
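
One way to picture that address-specifying instruction (a purely
hypothetical encoding):

#include <stdint.h>

/* The instruction before the branch carries the absolute target;
   the branch itself only supplies the condition. */
typedef struct { uint64_t absolute_target; } TargetInst;

/* Both candidate next PCs are loaded, not calculated. */
uint64_t next_pc(uint64_t branch_pc, TargetInst t, int taken)
{
    return taken ? t.absolute_target
                 : branch_pc + 4;      /* sequential fall-through */
}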

The reference to non-linear instruction fetches made me think of several
weird forms of instruction fetch such as instructions placed on the surface
of a sphere, or in some space manifold, organized like a spreadsheet
perhaps. I guess fetching along two different paths would be a step function.

Anyways, I am reminded of GPU branch-and-merge. I am also reminded of an
approach to superscalar operation that worked on basic-blocks in parallel.

Re: State of the art non-linear FETCH

 by: Thomas Koenig - Fri, 6 Oct 2023 15:59 UTC

MitchAlsup <MitchAlsup@aol.com> schrieb:
> I am looking for modern papers on the process of fetching
> non-linear sets of instructions where the first instructions
> in an ISSUE group come from a sequential long fetch and
> other instructions arrive from some other mechanism {such
> as a branch target cache}.

Here's one which pops up on Googling:

https://webs.um.es/jlaragon/papers/aragon_ACSAC05.pdf , which is
not particularly new. US9367471B2 cites it. Then there is
http://aperais.fr/papers/elastic_instruction_fetching_hpca19.pdf or
https://arxiv.org/abs/2102.01764 , but I do not really know enough
about the subject to judge if they are any good or not.

Re: State of the art non-linear FETCH

 by: MitchAlsup - Fri, 6 Oct 2023 17:22 UTC

On Friday, October 6, 2023 at 10:59:32 AM UTC-5, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
> > I am looking for modern papers on the process of fetching
> > non-linear sets of instructions where the first instructions
> > in an ISSUE group come from a sequential long fetch and
> > other instructions arrive from some other mechanism {such
> > as a branch target cache}.
> Here's one which pops up on Googling:
>
> https://webs.um.es/jlaragon/papers/aragon_ACSAC05.pdf , which is
> not particularly new. US9367471B2 cites it. Then there is
> http://aperais.fr/papers/elastic_instruction_fetching_hpca19.pdf or
> https://arxiv.org/abs/2102.01764 , but I do not really know enough
> about the subject to judge if they are any good or not.
<
Thank you for these references.

Re: State of the art non-linear FETCH

 by: EricP - Fri, 6 Oct 2023 21:41 UTC

MitchAlsup wrote:
> I am looking for modern papers on the process of fetching
> non-linear sets of instructions where the first instructions
> in an ISSUE group come from a sequential long fetch and
> other instructions arrive from some other mechanism {such
> as a branch target cache}.

It sounds like you are thinking about prefetching a cluster
(I'll call it that) of temporally and probabilistically correlated
instruction cache lines at a set of virtual addresses,
either prescribed in advance (a prefetch set) or deduced from
recent execution history.

Correct?

Re: State of the art non-linear FETCH

 by: EricP - Sat, 7 Oct 2023 15:49 UTC

MitchAlsup wrote:
> On Friday, October 6, 2023 at 4:41:24 PM UTC-5, EricP wrote:
>> MitchAlsup wrote:
>>> I am looking for modern papers on the process of fetching
>>> non-linear sets of instructions where the first instructions
>>> in an ISSUE group come from a sequential long fetch and
>>> other instructions arrive from some other mechanism {such
>>> as a branch target cache}.
> <
>> It sounds like you are thinking about prefetching a cluster
>> (I'll call it that) of temporally and probabilistically correlated
>> instruction cache lines at a set of virtual addresses,
>> either prescribed in advance (a prefetch set) or deduced from
>> recent execution history.
>>
>> Correct?
> <
> You may know more about it than I right now !!!
> <
> I see code this way:: You arrive at a known point (say subroutine entry)
> and then you know that there are sequences of instructions (IP++),
> connected by a web of branches. So, it seems to me that one has to
> be able to fetch the sequential part through the most efficient instruction
> deliver means (the I cache) and then be in a position to profit from providing
> more instructions from a predictable 'alternate' path.
> <
> {{So, I am seeing what you call a cluster as a linear sequence of instructions.}}

A linear sequence is a subset of it. I saw a cluster as potentially being
a spatially distributed set of cache lines that are likely to be touched
within a short temporal window.

Like a simd gather operation but for instructions, such that no matter
what pathways the program follows, when Fetch comes to build your uOp
packet it does not stall and can likely fill all 8 uOp lane slots.
For example

if (cond) {
    Foo(a,b,c);
    // ... more code
} else {
    Bar(d,e,f);
}

A prefetch cluster could consist of the Virtual Address Ranges (VARs)
covering the IF and call to Foo (VAR1), the alternate call to Bar (VAR2),
plus some amount of code at the entry to each of Foo (VAR3) and Bar (VAR4).

In order to fetch this code without stalling, it would have to prefetch
the virtual address translations for each possible address path VAR1..VAR4
into D$L1, possibly also setting the Accessed bit on PTE's,
load I-TLB entries, and prefetch cache lines of each range into I$L1.

On entry to either Foo or Bar a new prefetch cluster starts for that path.
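
A sketch of such a cluster descriptor under the assumptions above; the VAR
count and fields are invented for illustration:

#include <stdint.h>

/* One Virtual Address Range: a span of code likely to be touched
   within a short temporal window. */
typedef struct {
    uint64_t base;       /* virtual start address            */
    uint32_t bytes;      /* how much code to prefetch        */
} Var;

/* Cluster for the if/else example: VAR1 covers the IF and the call
   to Foo, VAR2 the alternate call to Bar, VAR3/VAR4 some code at the
   entries of Foo and Bar. Prefetch walks each range: translate
   (fill I-TLB, set Accessed bits), then pull lines into I$L1. */
typedef struct {
    Var    var[4];
    float  weight[4];    /* how strongly each path correlates */
} PrefetchCluster;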

> <
> These can be fetched with wide width (8-words/cycle), and are probabilistically
> prone to branches (every 10.86 words one "takes" a branch of some form.) An
> I cache and some buffering takes care of this part of Instruction fetching.
> <
> Remembering that we want to keep the fetch-insert pipeline short::
> <
> We have 2 strategies:: use the current fetch IP twice 1) for I cache 2) for ALT
> cache. Here you have instructions associated distinct from their own address
> so coherence is a bit tricky. The other strategy takes IP for accessing I cache
> and takes something else for accessing ALT cache.

I don't know what this ALT cache is. I was just thinking of how to
prefetch the normal I$L1 with primary and alternate pathways so that
Fetch doesn't stall on alternate branches and calls.

> <
> We are going to want the degenerate case of infinite sequence of inst to be
> completely supplied by I cache; so ALT cache provides some words (4 ?!?)
> at the target and adds some way to perform an ALT fetch next time we get here.
> Thus, the I cache should be on the order of 2×-4× the size of ALT cache--and for
> all intents and purposes can use the same TLB entry as I cache uses.
> <
> So between FETCH and DECODE is a layer that applies branch prediction
> to the stream from I buffer and ALT cache providing 6-instructions to
> decode. We know how the I cache + I buffer sequencer works, so we have
> to find a way of getting the instructions to the point of choosing efficiently,
> and that works across cycles continuously efficiently, too.

Re: State of the art non-linear FETCH

 by: MitchAlsup - Sat, 7 Oct 2023 20:31 UTC

On Saturday, October 7, 2023 at 1:46:25 PM UTC-5, robf...@gmail.com wrote:
> On Saturday, October 7, 2023 at 11:50:13 AM UTC-4, EricP wrote:
> > MitchAlsup wrote:
> > > On Friday, October 6, 2023 at 4:41:24 PM UTC-5, EricP wrote:
> > >> MitchAlsup wrote:
> > >>> I am looking for modern papers on the process of fetching
> > >>> non-linear sets of instructions where the first instructions
> > >>> in an ISSUE group come from a sequential long fetch and
> > >>> other instructions arrive from some other mechanism {such
> > >>> as a branch target cache}.
> > > <
> > >> It sounds like you are thinking about prefetching a cluster
> > >> (I'll call it that) of temporally and probabilistically correlated
> > >> instruction cache lines at a set of virtual addresses,
> > >> either prescribed in advance (a prefetch set) or deduced from
> > >> recent execution history.
> > >>
> > >> Correct?
> > > <
> > > You may know more about it than I right now !!!
> > > <
> > > I see code this way:: You arrive at a known point (say subroutine entry)
> > > and then you know that there are sequences of instructions (IP++),
> > > connected by a web of branches. So, it seems to me that one has to
> > > be able to fetch the sequential part through the most efficient instruction
> > > deliver means (the I cache) and then be in a position to profit from providing
> > > more instructions from a predictable 'alternate' path.
> > > <
> > > {{So, I am seeing what you call a cluster as a linear sequence of instructions.}}
> > A linear sequence is a subset of it. I saw a cluster as potentially being
> > spatially distributed set of cache lines that are likely to be touched
> > within a short temporal window.
> >
> > Like a simd gather operation but for instructions, such that no matter
> > what pathways the program follows, when Fetch comes to build your uOp
> > packet it does not stall and can likely fill all 8 uOp lane slots.
> > For example
> >
> > if (cond) {
> >     Foo(a,b,c);
> >     // ... more code
> > } else {
> >     Bar(d,e,f);
> > }
> >
> > A prefetch cluster could consist the Virtual Address Ranges (VAR's)
> > covering the IF and call to Foo (VAR1), the alternate call to Bar (VAR2),
> > plus some amount of code at the entry to each of Foo (VAR3) and Bar (VAR4).
> >
> > In order to fetch this code without stalling, it would have to prefetch
> > the virtual address translations for each possible address path VAR1..VAR4
> > into D$L1, possibly also setting the Accessed bit on PTE's,
> > load I-TLB entries, and prefetch cache lines of each range into I$L1.
> >
> > On entry to either Foo or Bar a new prefetch cluster starts for that path.
> > > <
> > > These can be fetched with wide width (8-words/cycle), and are probabilistically
> > > prone to branches (every 10.86 words one "takes" a branch of some form.) An
> > > I cache and some buffering takes care of this part of Instruction fetching.
> > > <
> > > Remembering that we want to keep the fetch-insert pipeline short::
> > > <
> > > We have 2 strategies:: use the current fetch IP twice 1) for I cache 2) for ALT
> > > cache. Here you have instructions associated distinct from their own address
> > > so coherence is a bit tricky. The other strategy takes IP for accessing I cache
> > > and takes something else for accessing ALT cache.
> > I don't know what this ALT cache is. I was just thinking of how to
> > prefetch the normal I$L1 with primary and alternate pathways so that
> > Fetch doesn't stall on alternate branches and calls.
<
> I think the idea is that the I$ and ALT-I$ are accessed at the same time which requires
> two caches. Or one cache with two ports. I$ being the primary path and ALT-I$ being
> for the alternate path. This is like fine-grained SMT except only a single thread is
> present.
<
ALT-cache is not necessarily indexed/addressed like the I cache. This, indeed, does
make it necessarily another cache.
>
> I think it would be very difficult to get multiple streams of execution going at the same
> time.
<
One of the things I am explicitly NOT trying to do is SMT. Solving Spectre and Meltdown
pretty much makes SMT µArchitectures untenable (for safety) anyway. If code has access
to a clock able to count instructions, you can't share caches, MMU, TLB, I/O Buffers,
predictors, prefetchers,...
<
> The primary stream I$ is likely to use up most of the available fetch spots. What
> if there is another branch in ALT-I$? Is there going to be yet another ALT2-I$ cache?
> How many alternate branch paths are going to be supported?
<
I expect the I cache to feed the sequential nature of instruction flow and the ALT cache
to service the excursions from the sequential flow. What I don't want is to add cycles
between the fetches and the delivery to instruction queues. My 1-wide machine has
2 cycles between an instruction being flopped after the SRAM sense amplifiers and being
issued into execution. I want this 6-wide machine to have only 1 more::
<
FETCH--BUFFER--CHOOSE--DECODE--ISSUE--EXECUTE--CACHE--ALIGN--WAIT--WRITE
<
The output of Choose is 6 instructions.
The output of Issue is 6 instructions with register and constant operands at their
respective instruction queues.
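
A rough model of that CHOOSE layer, merging the sequential I-buffer stream
with ALT-cache target words under a taken-branch prediction; widths and
names are assumptions:

#include <stdint.h>

typedef uint32_t Inst;

/* Deliver 6 instructions per cycle: run down the sequential buffer
   up to the predicted-taken branch, then continue with the 4 (?!?)
   words the ALT cache holds at the target. */
int choose(const Inst seq[8], int taken_slot,  /* -1: no taken branch */
           const Inst alt[4], Inst out[6])
{
    int n = 0, i;
    for (i = 0; i < 8 && n < 6; i++) {
        out[n++] = seq[i];
        if (i == taken_slot)
            break;                     /* redirect after the branch   */
    }
    for (i = 0; i < 4 && n < 6 && taken_slot >= 0; i++)
        out[n++] = alt[i];             /* target words from ALT cache */
    return n;                          /* instructions to DECODE      */
}
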
<
> If the core supports SMT too then there may be more I$ cache ports available to
> support multiple branch paths.
<
SMT is so 2001.
<
> > > <
> > > We are going to want the degenerate case of infinite sequence of inst to be
> > > completely supplied by I cache; so ALT cache provides some words (4 ?!?)
> > > at the target and adds some way to perform an ALT fetch next time we get here.
> > > Thus, the I cache should be on the order of 2×-4× the size of ALT cache--and for
> > > all intents and purposes can use the same TLB entry as I cache uses.
> > > <
> > > So between FETCH and DECODE is a layer that applies branch prediction
> > > to the stream from I buffer and ALT cache providing 6-instructions to
> > > decode. We know how the I cache + I buffer sequencer works, so we have
> > > to find a way of getting the instructions to the point of choosing efficiently,
> > > and that works across cycles continuously efficiently, too.

Re: State of the art non-linear FETCH

 by: EricP - Sat, 7 Oct 2023 21:20 UTC

robf...@gmail.com wrote:
> On Saturday, October 7, 2023 at 11:50:13 AM UTC-4, EricP wrote:
>> MitchAlsup wrote:
>>> On Friday, October 6, 2023 at 4:41:24 PM UTC-5, EricP wrote:
>>>> MitchAlsup wrote:
>>>>> I am looking for modern papers on the process of fetching
>>>>> non-linear sets of instructions where the first instructions
>>>>> in an ISSUE group come from a sequential long fetch and
>>>>> other instructions arrive from some other mechanism {such
>>>>> as a branch target cache}.
>>> <
>>>> It sounds like you are thinking about prefetching a cluster
>>>> (I'll call it that) of temporally and probabilistically correlated
>>>> instruction cache lines at a set of virtual addresses,
>>>> either prescribed in advance (a prefetch set) or deduced from
>>>> recent execution history.
>>>>
>>>> Correct?
>>> <
>>> You may know more about it than I right now !!!
>>> <
>>> I see code this way:: You arrive at a known point (say subroutine entry)
>>> and then you know that there are sequences of instructions (IP++),
>>> connected by a web of branches. So, it seems to me that one has to
>>> be able to fetch the sequential part through the most efficient instruction
>>> deliver means (the I cache) and then be in a position to profit from providing
>>> more instructions from a predictable 'alternate' path.
>>> <
>>> {{So, I am seeing what you call a cluster as a linear sequence of instructions.}}
>> A linear sequence is a subset of it. I saw a cluster as potentially being
>> spatially distributed set of cache lines that are likely to be touched
>> within a short temporal window.
>>
>> Like a simd gather operation but for instructions, such that no matter
>> what pathways the program follows, when Fetch comes to build your uOp
>> packet it does not stall and can likely fill all 8 uOp lane slots.
>> For example
>>
>> if (cond) {
>>     Foo(a,b,c);
>>     // ... more code
>> } else {
>>     Bar(d,e,f);
>> }
>>
>> A prefetch cluster could consist the Virtual Address Ranges (VAR's)
>> covering the IF and call to Foo (VAR1), the alternate call to Bar (VAR2),
>> plus some amount of code at the entry to each of Foo (VAR3) and Bar (VAR4).
>>
>> In order to fetch this code without stalling, it would have to prefetch
>> the virtual address translations for each possible address path VAR1..VAR4
>> into D$L1, possibly also setting the Accessed bit on PTE's,
>> load I-TLB entries, and prefetch cache lines of each range into I$L1.
>>
>> On entry to either Foo or Bar a new prefetch cluster starts for that path.
>>> <
>>> These can be fetched with wide width (8-words/cycle), and are probabilistically
>>> prone to branches (every 10.86 words one "takes" a branch of some form.) An
>>> I cache and some buffering takes care of this part of Instruction fetching.
>>> <
>>> Remembering that we want to keep the fetch-insert pipeline short::
>>> <
>>> We have 2 strategies:: use the current fetch IP twice 1) for I cache 2) for ALT
>>> cache. Here you have instructions associated distinct from their own address
>>> so coherence is a bit tricky. The other strategy takes IP for accessing I cache
>>> and takes something else for accessing ALT cache.
>> I don't know what this ALT cache is. I was just thinking of how to
>> prefetch the normal I$L1 with primary and alternate pathways so that
>> Fetch doesn't stall on alternate branches and calls.
>
> I think the idea is that the I$ and ALT-I$ are accessed at the same time which requires
> two caches. Or one cache with two ports. I$ being the primary path and ALT-I$ being
> for the alternate path. This is like fine-grained SMT except only a single thread is
> present.
>
> I think it would be very difficult to get multiple streams of execution going at the same
> time. The primary stream I$ is likely to use up most of the available fetch spots. What
> if there is another branch in ALT-I$? Is there going to be yet another ALT2-I$ cache?
> How many alternate branch paths are going to be supported?
>
> If the core supports SMT too then there may be more I$ cache ports available to
> support multiple branch paths.

Maybe it's an exercise in keeping cache miss buffers busy.
Say I$ and D$ L1 each have 8 miss buffers. 8 I-TLB page table walkers could
keep D$ busy, so it could be the equivalent of having 16 HW threads running.

So a question is how to generate useful work for those HW threads.
Prefetching clusters might be an answer.

Re: State of the art non-linear FETCH

 by: robf...@gmail.com - Sat, 7 Oct 2023 22:18 UTC

On Saturday, October 7, 2023 at 4:31:36 PM UTC-4, MitchAlsup wrote:
> On Saturday, October 7, 2023 at 1:46:25 PM UTC-5, robf...@gmail.com wrote:
> > On Saturday, October 7, 2023 at 11:50:13 AM UTC-4, EricP wrote:
> > > MitchAlsup wrote:
> > > > On Friday, October 6, 2023 at 4:41:24 PM UTC-5, EricP wrote:
> > > >> MitchAlsup wrote:
> > > >>> I am looking for modern papers on the process of fetching
> > > >>> non-linear sets of instructions where the first instructions
> > > >>> in an ISSUE group come from a sequential long fetch and
> > > >>> other instructions arrive from some other mechanism {such
> > > >>> as a branch target cache}.
> > > > <
> > > >> It sounds like you are thinking about prefetching a cluster
> > > >> (I'll call it that) of temporally and probabilistically correlated
> > > >> instruction cache lines at a set of virtual addresses,
> > > >> either prescribed in advance (a prefetch set) or deduced from
> > > >> recent execution history.
> > > >>
> > > >> Correct?
> > > > <
> > > > You may know more about it than I right now !!!
> > > > <
> > > > I see code this way:: You arrive at a known point (say subroutine entry)
> > > > and then you know that there are sequences of instructions (IP++),
> > > > connected by a web of branches. So, it seems to me that one has to
> > > > be able to fetch the sequential part through the most efficient instruction
> > > > deliver means (the I cache) and then be in a position to profit from providing
> > > > more instructions from a predictable 'alternate' path.
> > > > <
> > > > {{So, I am seeing what you call a cluster as a linear sequence of instructions.}}
> > > A linear sequence is a subset of it. I saw a cluster as potentially being
> > > spatially distributed set of cache lines that are likely to be touched
> > > within a short temporal window.
> > >
> > > Like a simd gather operation but for instructions, such that no matter
> > > what pathways the program follows, when Fetch comes to build your uOp
> > > packet it does not stall and can likely fill all 8 uOp lane slots.
> > > For example
> > >
> > > if (cond) {
> > >     Foo(a,b,c);
> > >     // ... more code
> > > } else {
> > >     Bar(d,e,f);
> > > }
> > >
> > > A prefetch cluster could consist the Virtual Address Ranges (VAR's)
> > > covering the IF and call to Foo (VAR1), the alternate call to Bar (VAR2),
> > > plus some amount of code at the entry to each of Foo (VAR3) and Bar (VAR4).
> > >
> > > In order to fetch this code without stalling, it would have to prefetch
> > > the virtual address translations for each possible address path VAR1..VAR4
> > > into D$L1, possibly also setting the Accessed bit on PTE's,
> > > load I-TLB entries, and prefetch cache lines of each range into I$L1.
> > >
> > > On entry to either Foo or Bar a new prefetch cluster starts for that path.
> > > > <
> > > > These can be fetched with wide width (8-words/cycle), and are probabilistically
> > > > prone to branches (every 10.86 words one "takes" a branch of some form.) An
> > > > I cache and some buffering takes care of this part of Instruction fetching.
> > > > <
> > > > Remembering that we want to keep the fetch-insert pipeline short::
> > > > <
> > > > We have 2 strategies:: use the current fetch IP twice 1) for I cache 2) for ALT
> > > > cache. Here you have instructions associated distinct from their own address
> > > > so coherence is a bit tricky. The other strategy takes IP for accessing I cache
> > > > and takes something else for accessing ALT cache.
> > > I don't know what this ALT cache is. I was just thinking of how to
> > > prefetch the normal I$L1 with primary and alternate pathways so that
> > > Fetch doesn't stall on alternate branches and calls.
> <
> > I think the idea is that the I$ and ALT-I$ are accessed at the same time which requires
> > two caches. Or one cache with two ports. I$ being the primary path and ALT-I$ being
> > for the alternate path. This is like fine-grained SMT except only a single thread is
> > present.
> <
> ALT-cache is not necessarily indexed/addressed like the I cache. This, indeed, does
> make it necessarily another cache.
> >
> > I think it would be very difficult to get multiple streams of execution going at the same
> > time.
> <
> One of the things I am explicitly NOT trying to do is SMT. Solving Spectre and Meltdown
> pretty much makes SMT µArchitectures untenable (for safety) anyway. If code has access
> to a clock able to count instructions, you can't share caches, MMU, TLB, I/O Buffers,
> predictors, prefetchers,...
> <
> > The primary stream I$ is likely to use up most of the available fetch spots. What
> > if there is another branch in ALT-I$? Is there going to be yet another ALT2-I$ cache?
> > How many alternate branch paths are going to be supported?
> <
> I expect the I cache to feed the sequential nature of instruction flow and the ALT cache
> to service the excursions from the sequential flow. What I don't want is to add cycles
> between the fetches and the delivery to instruction queues. My 1-wide machine has
> 2 cycles between an instruction being flopped after the SRAM sense amplifiers and being
> issued into execution. I want this 6-wide machine to have only 1 more::
> <
> FETCH--BUFFER--CHOOSE--DECODE--ISSUE--EXECUTE--CACHE--ALIGN--WAIT--WRITE
> <
> The output of Choose is 6 instructions.
> The output of Issue is 6 instructions with register and constant operands at their
> respective instruction queues.

At some point a selection between I$ and ALT-I$ clusters must be made. Is this the
purpose of CHOOSE? I would think the choice would be made after EXECUTE, once the
branch outcome is determined.
> <
> > If the core supports SMT too then there may be more I$ cache ports available to
> > support multiple branch paths.
> <
> SMT is so 2001.

I take that as encouragement; I am less than 20 years behind the times.

> <
> > > > <
> > > > We are going to want the degenerate case of infinite sequence of inst to be
> > > > completely supplied by I cache; so ALT cache provides some words (4 ?!?)
> > > > at the target and adds some way to perform an ALT fetch next time we get here.
> > > > Thus, the I cache should be on the order of 2×-4× the size of ALT cache--and for
> > > > all intents and purposes can use the same TLB entry as I cache uses.
> > > > <
> > > > So between FETCH and DECODE is a layer that applies branch prediction
> > > > to the stream from I buffer and ALT cache providing 6-instructions to
> > > > decode. We know how the I cache + I buffer sequencer works, so we have
> > > > to find a way of getting the instructions to the point of choosing efficiently,
> > > > and that works across cycles continuously efficiently, too.

Re: State of the art non-linear FETCH

 by: robf...@gmail.com - Sat, 7 Oct 2023 22:21 UTC

On Saturday, October 7, 2023 at 5:20:33 PM UTC-4, EricP wrote:
> robf...@gmail.com wrote:
> > On Saturday, October 7, 2023 at 11:50:13 AM UTC-4, EricP wrote:
> >> MitchAlsup wrote:
> >>> On Friday, October 6, 2023 at 4:41:24 PM UTC-5, EricP wrote:
> >>>> MitchAlsup wrote:
> >>>>> I am looking for modern papers on the process of fetching
> >>>>> non-linear sets of instructions where the first instructions
> >>>>> in an ISSUE group come from a sequential long fetch and
> >>>>> other instructions arrive from some other mechanism {such
> >>>>> as a branch target cache}.
> >>> <
> >>>> It sounds like you are thinking about prefetching a cluster
> >>>> (I'll call it that) of temporally and probabilistically correlated
> >>>> instruction cache lines at a set of virtual addresses,
> >>>> either prescribed in advance (a prefetch set) or deduced from
> >>>> recent execution history.
> >>>>
> >>>> Correct?
> >>> <
> >>> You may know more about it than I right now !!!
> >>> <
> >>> I see code this way:: You arrive at a known point (say subroutine entry)
> >>> and then you know that there are sequences of instructions (IP++),
> >>> connected by a web of branches. So, it seems to me that one has to
> >>> be able to fetch the sequential part through the most efficient instruction
> >>> deliver means (the I cache) and then be in a position to profit from providing
> >>> more instructions from a predictable 'alternate' path.
> >>> <
> >>> {{So, I am seeing what you call a cluster as a linear sequence of instructions.}}
> >> A linear sequence is a subset of it. I saw a cluster as potentially being
> >> spatially distributed set of cache lines that are likely to be touched
> >> within a short temporal window.
> >>
> >> Like a simd gather operation but for instructions, such that no matter
> >> what pathways the program follows, when Fetch comes to build your uOp
> >> packet it does not stall and can likely fill all 8 uOp lane slots.
> >> For example
> >>
> >> if (cond)
> >> Foo (a,b,c);
> >> ... more code
> >> else
> >> Bar (d,e,f);
> >>
> >> A prefetch cluster could consist of the Virtual Address Ranges (VARs)
> >> covering the IF and call to Foo (VAR1), the alternate call to Bar (VAR2),
> >> plus some amount of code at the entry to each of Foo (VAR3) and Bar (VAR4).
> >>
> >> In order to fetch this code without stalling, it would have to prefetch
> >> the virtual address translations for each possible address path VAR1..VAR4
> >> into D$L1, possibly also setting the Accessed bit on PTE's,
> >> load I-TLB entries, and prefetch cache lines of each range into I$L1.
> >>
> >> On entry to either Foo or Bar a new prefetch cluster starts for that path.
> >>> <
> >>> These can be fetched with wide width (8-words/cycle), and are probabilistically
> >>> prone to branches (every 10.86 words one "takes" a branch of some form.) An
> >>> I cache and some buffering takes care of this part of Instruction fetching.
> >>> <
> >>> Remembering that we want to keep the fetch-insert pipeline short::
> >>> <
> >>> We have 2 strategies:: use the current fetch IP twice 1) for I cache 2) for ALT
> >>> cache. Here you have instructions associated distinct from their own address
> >>> so coherence is a bit tricky. The other strategy takes IP for accessing I cache
> >>> and takes something else for accessing ALT cache.
> >> I don't know what this ALT cache is. I was just thinking of how to
> >> prefetch the normal I$L1 with primary and alternate pathways so that
> >> Fetch doesn't stall on alternate branches and calls.
> >
> > I think the idea is that the I$ and ALT-I$ are accessed at the same time which requires
> > two caches. Or one cache with two ports. I$ being the primary path and ALT-I$ being
> > for the alternate path. This is like fine-grained SMT except only a single thread is
> > present.
> >
> > I think it would be very difficult to get multiple streams of execution going at the same
> > time. The primary stream I$ is likely to use up most of the available fetch spots. What
> > if there is another branch in ALT-I$? Is there going to be yet another ALT2-I$ cache?
> > How many alternate branch paths are going to be supported?
> >
> > If the core supports SMT too then there may be more I$ cache ports available to
> > support multiple branch paths.
> Maybe it's an exercise in keeping cache miss buffers busy.
> Say I$ and D$ L1 each have 8 miss buffers. 8 I-TLB page table walkers could
> keep D$ busy so it could be the equivalent of having 16 HW threads running.

Won't the instructions already be in the I$ most of the time? It is just that they are not
referenced by the ALT-PC.

>
> So a question is how to generate useful work for those HW threads.
> Prefetching clusters might be an answer.

Re: State of the art non-linear FETCH

<5110f29f-9e95-4798-931c-7d68ad646b03n@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=34487&group=comp.arch#34487

X-Received: by 2002:a37:e113:0:b0:768:421b:a142 with SMTP id c19-20020a37e113000000b00768421ba142mr121244qkm.4.1696718610706;
Sat, 07 Oct 2023 15:43:30 -0700 (PDT)
X-Received: by 2002:a05:6808:201c:b0:3a7:7e66:2197 with SMTP id
q28-20020a056808201c00b003a77e662197mr5976680oiw.2.1696718610384; Sat, 07 Oct
2023 15:43:30 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 7 Oct 2023 15:43:30 -0700 (PDT)
In-Reply-To: <f948faaa-f1b8-40ca-b409-8570b55c5b41n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:4cb0:9ccf:1ec8:96c7;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:4cb0:9ccf:1ec8:96c7
References: <12ec2ad9-b8b6-4919-ba47-c9a86788c172n@googlegroups.com>
<4a%TM.484$3l%9.141@fx02.iad> <80c4275a-5b74-4030-a112-d34520067b59n@googlegroups.com>
<Q6fUM.22285$%wRe.19061@fx46.iad> <134498b5-3eed-42bf-8cd7-0a7622264625n@googlegroups.com>
<704323f1-b9c7-4c21-95db-9aa6f5be5aa9n@googlegroups.com> <f948faaa-f1b8-40ca-b409-8570b55c5b41n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <5110f29f-9e95-4798-931c-7d68ad646b03n@googlegroups.com>
Subject: Re: State of the art non-linear FETCH
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Sat, 07 Oct 2023 22:43:30 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 9805
 by: MitchAlsup - Sat, 7 Oct 2023 22:43 UTC

On Saturday, October 7, 2023 at 5:18:53 PM UTC-5, robf...@gmail.com wrote:
> On Saturday, October 7, 2023 at 4:31:36 PM UTC-4, MitchAlsup wrote:
> > On Saturday, October 7, 2023 at 1:46:25 PM UTC-5, robf...@gmail.com wrote:
> > > On Saturday, October 7, 2023 at 11:50:13 AM UTC-4, EricP wrote:
> > > > MitchAlsup wrote:
> > > > > On Friday, October 6, 2023 at 4:41:24 PM UTC-5, EricP wrote:
> > > > >> MitchAlsup wrote:
> > > > >>> I am looking for modern papers on the process of fetching
> > > > >>> non-linear sets of instructions where the first instructions
> > > > >>> in an ISSUE group come from a sequential long fetch and
> > > > >>> other instructions arrive from some other mechanism {such
> > > > >>> as a branch target cache}.
> > > > > <
> > > > >> It sounds like you are thinking about prefetching a cluster
> > > > >> (I'll call it that) of temporally and probabilistically correlated
> > > > >> instruction cache lines at a set of virtual addresses,
> > > > >> either prescribed in advance (a prefetch set) or deduced from
> > > > >> recent execution history.
> > > > >>
> > > > >> Correct?
> > > > > <
> > > > > You may know more about it than I right now !!!
> > > > > <
> > > > > I see code this way:: You arrive at a known point (say subroutine entry)
> > > > > and then you know that there are sequences of instructions (IP++),
> > > > > connected by a web of branches. So, it seems to me that one has to
> > > > > be able to fetch the sequential part through the most efficient instruction
> > > > > delivery means (the I cache) and then be in a position to profit from providing
> > > > > more instructions from a predictable 'alternate' path.
> > > > > <
> > > > > {{So, I am seeing what you call a cluster as a linear sequence of instructions.}}
> > > > A linear sequence is a subset of it. I saw a cluster as potentially being
> > > > spatially distributed set of cache lines that are likely to be touched
> > > > within a short temporal window.
> > > >
> > > > Like a simd gather operation but for instructions, such that no matter
> > > > what pathways the program follows, when Fetch comes to build your uOp
> > > > packet it does not stall and can likely fill all 8 uOp lane slots.
> > > > For example
> > > >
> > > > if (cond)
> > > > Foo (a,b,c);
> > > > ... more code
> > > > else
> > > > Bar (d,e,f);
> > > >
> > > > A prefetch cluster could consist of the Virtual Address Ranges (VARs)
> > > > covering the IF and call to Foo (VAR1), the alternate call to Bar (VAR2),
> > > > plus some amount of code at the entry to each of Foo (VAR3) and Bar (VAR4).
> > > >
> > > > In order to fetch this code without stalling, it would have to prefetch
> > > > the virtual address translations for each possible address path VAR1..VAR4
> > > > into D$L1, possibly also setting the Accessed bit on PTE's,
> > > > load I-TLB entries, and prefetch cache lines of each range into I$L1.
> > > >
> > > > On entry to either Foo or Bar a new prefetch cluster starts for that path.
> > > > > <
> > > > > These can be fetched with wide width (8-words/cycle), and are probabilistically
> > > > > prone to branches (every 10.86 words one "takes" a branch of some form.) An
> > > > > I cache and some buffering takes care of this part of Instruction fetching.
> > > > > <
> > > > > Remembering that we want to keep the fetch-insert pipeline short::
> > > > > <
> > > > > We have 2 strategies:: use the current fetch IP twice 1) for I cache 2) for ALT
> > > > > cache. Here you have instructions associated distinct from their own address
> > > > > so coherence is a bit tricky. The other strategy takes IP for accessing I cache
> > > > > and takes something else for accessing ALT cache.
> > > > I don't know what this ALT cache is. I was just thinking of how to
> > > > prefetch the normal I$L1 with primary and alternate pathways so that
> > > > Fetch doesn't stall on alternate branches and calls.
> > <
> > > I think the idea is that the I$ and ALT-I$ are accessed at the same time which requires
> > > two caches. Or one cache with two ports. I$ being the primary path and ALT-I$ being
> > > for the alternate path. This is like fine-grained SMT except only a single thread is
> > > present.
> > <
> > The ALT cache is not necessarily indexed/addressed like the I cache. This, indeed,
> > makes it necessarily a separate cache.
> > >
> > > I think it would be very difficult to get multiple streams of execution going at the same
> > > time.
> > <
> > One of the things I am explicitly NOT trying to do is SMT. Solving Spectre and Meltdown
> > pretty much makes SMT µArchitectures untenable (for safety) anyway. If code has access
> > to a clock able to count instructions, you can't share caches, MMU, TLB, I/O Buffers,
> > predictors, prefetchers,...
> > <
> > > The primary stream I$ is likely to use up most of the available fetch spots. What
> > > if there is another branch in ALT-I$? Is there going to be yet another ALT2-I$ cache?
> > > How many alternate branch paths are going to be supported?
> > <
> > I expect the I cache to feed the sequential nature of instruction flow and the ALT cache
> > to service the excursions from the sequential flow. What I don't want is to add cycles
> > between the fetches and the delivery to instruction queues. My 1-wide machine has
> > 2 cycles between instruction being flopped after SRAM sense amplifiers and being
> > issued into execution. I want this 6-wide machine to have only 1 more::
> > <
> > FETCH--BUFFER--CHOOSE--DECODE--ISSUE--Execute--CACHE --ALIGN--WAIT --WRITE
> > >
> > The output of Choose is 6 instructions,
> > The output of Issue is 6 instructions with register and constant operands at their
> > .....respective instruction queue.
> At some point a selection between I$ and ALT-I$ clusters must be made. Is this the purpose
> of choose? I would think the choice would be made after EXECUTE once the branch
> outcome is determined.
<
In this thread we are anticipating 6-8 instructions per cycle. Under these circumstances,
prediction happens prior to Issue; branch resolution cleans up the data-flow later. {{The 1-wide
in-order µArchitecture would do as you suggest, but after it fetches far enough ahead,
it can use spare I$ BW to prefetch branch targets it has not yet decoded (scan ahead)}}.
<
Once you get to 6-wide you have to be fetching and alternate fetching simultaneously.
Once you get to 8-wide you might have to start doing 2 predictions per cycle....
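<
A minimal C sketch of one way such a choose step could work: fill a 6-slot
issue group from the sequential fetch buffer, then switch to the ALT buffer
at a predicted-taken branch. Every name, width, and field below is an
illustrative assumption, not a My 66000 specific.

#include <stdint.h>

#define GROUP    6
#define BUFWORDS 16

typedef struct {
    uint32_t word[BUFWORDS];
    int head;                  /* next word to consume  */
    int count;                 /* valid words remaining */
} FetchBuf;

/* pred_taken_at: word slot of a predicted-taken branch in the
   sequential stream this cycle; -1 means none predicted.      */
static int choose(FetchBuf *seq, FetchBuf *alt, int pred_taken_at,
                  uint32_t group[GROUP])
{
    int n = 0;
    /* sequential words up to and including the predicted branch */
    while (n < GROUP && seq->count > 0 &&
           (pred_taken_at < 0 || n <= pred_taken_at)) {
        group[n++] = seq->word[seq->head];
        seq->head = (seq->head + 1) % BUFWORDS;
        seq->count--;
    }
    /* remaining slots come from the ALT stream */
    while (n < GROUP && pred_taken_at >= 0 && alt->count > 0) {
        group[n++] = alt->word[alt->head];
        alt->head = (alt->head + 1) % BUFWORDS;
        alt->count--;
    }
    return n;   /* instructions delivered to DECODE this cycle */
}

After the taken branch, the ALT stream would presumably become the new
sequential stream on the following cycle, with the buffers swapping roles.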
<
> > <
> > > If the core supports SMT too then there may be more I$ cache ports available to
> > > support multiple branch paths.
> > <
> > SMT is so 2001.
> I take that as encouragement; I am less than 20 years behind the times.
> > <
> > > > > <
> > > > > We are going to want the degenerate case of infinite sequence of inst to be
> > > > > completely supplied by I cache; so ALT cache provides some words (4 ?!?)
> > > > > at the target and adds some way to perform an ALT fetch next time we get here.
> > > > > Thus, the I cache should be on the order of 2×-4× the size of ALT cache--and for
> > > > > all intents and purposes can use the same TLB entry as I cache uses.
> > > > > <
> > > > > So between FETCH and DECODE is a layer that applies branch prediction
> > > > > to the stream from I buffer and ALT cache providing 6-instructions to
> > > > > decode. We know how the I cache + I buffer sequencer works, so we have
> > > > > to find a way of getting the instructions to the point of choosing efficiently,
> > > > > and that works across cycles continuously efficiently, too.

Re: State of the art non-linear FETCH

<BMfVM.35908$rbid.35777@fx18.iad>


https://news.novabbs.org/devel/article-flat.php?id=34492&group=comp.arch#34492

Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx18.iad.POSTED!not-for-mail
From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: State of the art non-linear FETCH
References: <12ec2ad9-b8b6-4919-ba47-c9a86788c172n@googlegroups.com> <4a%TM.484$3l%9.141@fx02.iad> <80c4275a-5b74-4030-a112-d34520067b59n@googlegroups.com> <Q6fUM.22285$%wRe.19061@fx46.iad> <134498b5-3eed-42bf-8cd7-0a7622264625n@googlegroups.com> <704323f1-b9c7-4c21-95db-9aa6f5be5aa9n@googlegroups.com> <f948faaa-f1b8-40ca-b409-8570b55c5b41n@googlegroups.com> <5110f29f-9e95-4798-931c-7d68ad646b03n@googlegroups.com>
In-Reply-To: <5110f29f-9e95-4798-931c-7d68ad646b03n@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 60
Message-ID: <BMfVM.35908$rbid.35777@fx18.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Tue, 10 Oct 2023 17:23:45 UTC
Date: Tue, 10 Oct 2023 13:22:49 -0400
X-Received-Bytes: 4543
 by: EricP - Tue, 10 Oct 2023 17:22 UTC

MitchAlsup wrote:
> On Saturday, October 7, 2023 at 5:18:53 PM UTC-5, robf...@gmail.com wrote:
>> On Saturday, October 7, 2023 at 4:31:36 PM UTC-4, MitchAlsup wrote:
>>> On Saturday, October 7, 2023 at 1:46:25 PM UTC-5, robf...@gmail.com wrote:
>>> <
>>>> The primary stream I$ is likely to use up most of the available fetch spots. What
>>>> if there is another branch in ALT-I$? Is there going to be yet another ALT2-I$ cache?
>>>> How many alternate branch paths are going to be supported?
>>> <
>>> I expect the I cache to feed the sequential nature of instruction flow and the ALT cache
>>> to service the excursions from the sequential flow. What I don't want is to add cycles
>>> between the fetches and the delivery to instruction queues. My 1-wide machine has
>>> 2 cycles between instruction being flopped after SRAM sense amplifiers and being
>>> issued into execution. I want this 6-wide machine to have only 1 more::
>>> <
>>> FETCH--BUFFER--CHOOSE--DECODE--ISSUE--Execute--CACHE --ALIGN--WAIT --WRITE
>>> The output of Choose is 6 instructions,
>>> The output of Issue is 6 instructions with register and constant operands at their
>>> .....respective instruction queue.
>> At some point a selection between I$ and ALT-I$ clusters must be made. Is this the purpose
>> of choose? I would think the choice would be made after EXECUTE once the branch
>> outcome is determined.
> <
> In this thread we are anticipating 6-8-wide instructions per cycle. Under these circumstances
> prediction prior to Issue, branch resolution cleans up the data-flow later. {{The 1-wide in
> order µArchitecture would do as you suggest, but you see after it fetches way far ahead,
> it can use that I $ BW to prefetch branch targets it has not yet decoded (scan ahead)}}.
> <
> Once you get to 6-wide you have to be fetching and alternate fetching simultaneously.
> Once you get to 8-wide you might have to start doing 2 predictions per cycle....

A 6-instruction fetch at, say, an average of 5 bytes each = 30 bytes per fetch.
Let's call it 32 bytes. So I$L1 needs to feed fetch an aligned 32-byte
fetch block per clock to sustain fetching with no bubbles.

We also have to deal with the I$L1 read access latency.
Let's say I$L1 has a pipelined latency of 3 and a throughput of 1.
To not stall, fetch must request an I$L1 block 3 cycles before it needs it.
If a branch wants to change from the sequential to the alternate path without stalling,
we need both blocks in the fetch buffers at the same instant.
If we wait until we parse the branch instruction to begin the I$L1 read,
then it is too late.

We need a _Fetch Block Prefetch Predictor_, as the start of alternate-path
fetching has to propagate backwards 4 clocks so that we can switch from
sequential to alternate if we want to without missing a clock or even a lane.

If the I$L1 has dual even-odd banks then on average it can supply
two 64-byte block pairs per clock. That should be more than enough to
continuously feed the fetch unit.

A Fetch Block Prefetch Predictor (FBPP) could work by having fetch attach an
alternate fetch block physical address to the block it fetched 4 clocks ago,
and store this in the I$L1 cache line, for each of the low and high blocks.
Note this is a prefetch hint to fetch that in 3 clocks the fetch direction
might change. The prefetched buffer address has to be verified as correct
when the branch instruction is seen and its target virtual address is
calculated and translated.
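
For concreteness, a rough C sketch of the per-block hint storage an FBPP
might keep in the I$L1 line, as described above. All field names and sizes
are assumptions; the 4-clock training delay appears only as a comment.

#include <stdint.h>

typedef struct {
    uint64_t tag;
    uint8_t  bytes[64];     /* two 32-byte fetch blocks (low/high) */
    uint64_t alt_paddr[2];  /* FBPP hint, one per fetch block      */
    uint8_t  alt_valid[2];
} ICacheLine;

/* On fetching block b, fire the hinted prefetch so the alternate
   block can be sitting in a fetch buffer 3 clocks later.         */
static void fetch_block(const ICacheLine *line, int b,
                        void (*prefetch)(uint64_t paddr))
{
    if (line->alt_valid[b])
        prefetch(line->alt_paddr[b]);
}

/* ~4 clocks later, once the branch is parsed and its target
   virtual address is translated, verify and train the hint.      */
static void train_hint(ICacheLine *line, int b, uint64_t actual_paddr)
{
    line->alt_paddr[b] = actual_paddr;
    line->alt_valid[b] = 1;
}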

Re: State of the art non-linear FETCH

<b882d768-420a-419c-89ba-124689674c0fn@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=34493&group=comp.arch#34493

X-Received: by 2002:a05:6214:1924:b0:66d:3d6:2cfd with SMTP id es4-20020a056214192400b0066d03d62cfdmr3474qvb.9.1696963593961;
Tue, 10 Oct 2023 11:46:33 -0700 (PDT)
X-Received: by 2002:a05:6871:6a99:b0:1e1:3152:93fc with SMTP id
zf25-20020a0568716a9900b001e1315293fcmr7269445oab.6.1696963593711; Tue, 10
Oct 2023 11:46:33 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!nntp.club.cc.cmu.edu!45.76.7.193.MISMATCH!3.us.feeder.erje.net!feeder.erje.net!border-1.nntp.ord.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 10 Oct 2023 11:46:33 -0700 (PDT)
In-Reply-To: <BMfVM.35908$rbid.35777@fx18.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:b0f9:21ff:c801:ea3a;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:b0f9:21ff:c801:ea3a
References: <12ec2ad9-b8b6-4919-ba47-c9a86788c172n@googlegroups.com>
<4a%TM.484$3l%9.141@fx02.iad> <80c4275a-5b74-4030-a112-d34520067b59n@googlegroups.com>
<Q6fUM.22285$%wRe.19061@fx46.iad> <134498b5-3eed-42bf-8cd7-0a7622264625n@googlegroups.com>
<704323f1-b9c7-4c21-95db-9aa6f5be5aa9n@googlegroups.com> <f948faaa-f1b8-40ca-b409-8570b55c5b41n@googlegroups.com>
<5110f29f-9e95-4798-931c-7d68ad646b03n@googlegroups.com> <BMfVM.35908$rbid.35777@fx18.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <b882d768-420a-419c-89ba-124689674c0fn@googlegroups.com>
Subject: Re: State of the art non-linear FETCH
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Tue, 10 Oct 2023 18:46:33 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 123
 by: MitchAlsup - Tue, 10 Oct 2023 18:46 UTC

On Tuesday, October 10, 2023 at 12:23:49 PM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Saturday, October 7, 2023 at 5:18:53 PM UTC-5, robf...@gmail.com wrote:
> >> On Saturday, October 7, 2023 at 4:31:36 PM UTC-4, MitchAlsup wrote:
> >>> On Saturday, October 7, 2023 at 1:46:25 PM UTC-5, robf...@gmail.com wrote:
> >>> <
> >>>> The primary stream I$ is likely to use up most of the available fetch spots. What
> >>>> if there is another branch in ALT-I$? Is there going to be yet another ALT2-I$ cache?
> >>>> How many alternate branch paths are going to be supported?
> >>> <
> >>> I expect the I cache to feed the sequential nature of instruction flow and the ALT cache
> >>> to service the excursions from the sequential flow. What I don't want is to add cycles
> >>> between the fetches and the delivery to instruction queues. My 1-wide machine has
> >>> 2 cycles between instruction being flopped after SRAM sense amplifiers and being
> >>> issued into execution. I want this 6-wide machine to have only 1 more::
> >>> <
> >>> FETCH--BUFFER--CHOOSE--DECODE--ISSUE--Execute--CACHE --ALIGN--WAIT --WRITE
> >>> The output of Choose is 6 instructions,
> >>> The output of Issue is 6 instructions with register and constant operands at their
> >>> .....respective instruction queue.
> >> At some point a selection between I$ and ALT-I$ clusters must be made. Is this the purpose
> >> of choose? I would think the choice would be made after EXECUTE once the branch
> >> outcome is determined.
> > <
> > In this thread we are anticipating 6-8-wide instructions per cycle. Under these circumstances
> > prediction prior to Issue, branch resolution cleans up the data-flow later. {{The 1-wide in
> > order µArchitecture would do as you suggest, but you see after it fetches way far ahead,
> > it can use that I $ BW to prefetch branch targets it has not yet decoded (scan ahead)}}.
> > <
> > Once you get to 6-wide you have to be fetching and alternate fetching simultaneously.
> > Once you get to 8-wide you might have to start doing 2 predictions per cycle....
<
> A 6 instruction fetch, say average 5 bytes each = 30 bytes per fetch.
> Lets call it 32 bytes. So I$L1 needs to feed fetch an aligned 32 byte
> fetch block per clock to sustain fetching with no bubbles.
<
Consider that the IP Adder creates IP+some-power-of-2 and IP+16+some-
power-of-2. Now we are in a position to fetch 8 words as::
a) 0-1
b) 1-2
c) 2-3
d) 3-0 of the next IP
So, for an additional carry chain, you suffer no gross alignedness problems.
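<
A small C sketch of the address side of that dual-output adder: one sum
selects the 16-byte block holding IP, the extra carry chain selects the
next block, so an 8-word window can straddle a block boundary without an
alignment stall. The constants are assumptions for illustration.

#include <stdint.h>

#define WORD   4u   /* bytes per instruction word              */
#define BLOCK 16u   /* bytes per fetch block, 4 blocks per tag */

typedef struct {
    uint64_t lo_block;  /* block containing IP                  */
    uint64_t hi_block;  /* IP + 16: may roll into the next line */
    unsigned skew;      /* word offset, selecting case a) .. d) */
} FetchWindow;

static FetchWindow fetch_window(uint64_t ip)
{
    FetchWindow w;
    w.lo_block = ip & ~(uint64_t)(BLOCK - 1);
    w.hi_block = w.lo_block + BLOCK;   /* the extra carry chain */
    w.skew     = (unsigned)((ip & (BLOCK - 1)) / WORD);
    return w;
}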
>
> We also have to deal with the I$L1 read access latency.
> Lets say I$L1 has a pipelined latency of 3 and throughput of 1.
<
Then you made it too big !! And why should each SRAM in a cache line
not be accessible independently (that is, throughput = 4 or some higher power
of 2)?
<
> To not stall fetch must request a I$L1 block 3 cycles before it needs it.
> If a branch wants to change from sequential to alternate path and not stall
> we need both blocks in the fetch buffers at the same instant.
<
Obviously
<
> If we wait until we parse the branch instruction to begin the I$L1 read
> then it is too late.
<
Obviously--which is why we scan ahead or use some predictor to identify
the branches before we get to them.
>
> We need a _Fetch Block Prefetch Predictor_ as the start of alternate path
> fetching has to propagate backwards 4 clocks such that we can switch from
> sequential to alternate if we want to without missing a clock or even a lane.
<
The other train of thought I have been working on is that we have a multiplicity
of successor addresses that are present when the sequential and Alternate
instructions are available; and a predictor predicts which one to use as
sequential and which one to use as ALT.
>
> If the I$L1 has dual even-odd banks then on average it can supply
> two 64-byte block pairs per clock. That should be more than enough to
> continuously feed the fetch unit.
<
L1 will have 4 blocks per tag each block contains 16-bytes.
>
> A Fetch Block Prefetch Predictor (FBPP) could work by having fetch attach an
> alternate fetch block physical address to the block it fetched 4 clocks ago,
> and store this in the I$L1 cache line, for each of the low and high blocks.
<
The trick is getting these things started. Once started, they tend to work rather well.
<
> Note this is a prefetch hint to fetch that in 3 clocks the fetch direction
> might change. The prefetched buffer address has to be verified as correct
> when the branch instruction is seen and its target virtual address is
> calculated and translated.

Re: State of the art non-linear FETCH

<f5199c50-a69c-47c3-9cae-2878c9a935f8n@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=34494&group=comp.arch#34494

X-Received: by 2002:a05:620a:25d:b0:775:74c4:36de with SMTP id q29-20020a05620a025d00b0077574c436demr324317qkn.3.1696964825701;
Tue, 10 Oct 2023 12:07:05 -0700 (PDT)
X-Received: by 2002:a05:6808:1809:b0:3ac:a02d:708f with SMTP id
bh9-20020a056808180900b003aca02d708fmr10015344oib.1.1696964825502; Tue, 10
Oct 2023 12:07:05 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 10 Oct 2023 12:07:05 -0700 (PDT)
In-Reply-To: <b882d768-420a-419c-89ba-124689674c0fn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:b0f9:21ff:c801:ea3a;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:b0f9:21ff:c801:ea3a
References: <12ec2ad9-b8b6-4919-ba47-c9a86788c172n@googlegroups.com>
<4a%TM.484$3l%9.141@fx02.iad> <80c4275a-5b74-4030-a112-d34520067b59n@googlegroups.com>
<Q6fUM.22285$%wRe.19061@fx46.iad> <134498b5-3eed-42bf-8cd7-0a7622264625n@googlegroups.com>
<704323f1-b9c7-4c21-95db-9aa6f5be5aa9n@googlegroups.com> <f948faaa-f1b8-40ca-b409-8570b55c5b41n@googlegroups.com>
<5110f29f-9e95-4798-931c-7d68ad646b03n@googlegroups.com> <BMfVM.35908$rbid.35777@fx18.iad>
<b882d768-420a-419c-89ba-124689674c0fn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <f5199c50-a69c-47c3-9cae-2878c9a935f8n@googlegroups.com>
Subject: Re: State of the art non-linear FETCH
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Tue, 10 Oct 2023 19:07:05 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 8604
 by: MitchAlsup - Tue, 10 Oct 2023 19:07 UTC

On Tuesday, October 10, 2023 at 1:46:36 PM UTC-5, MitchAlsup wrote:
> On Tuesday, October 10, 2023 at 12:23:49 PM UTC-5, EricP wrote:
> > MitchAlsup wrote:
> > > On Saturday, October 7, 2023 at 5:18:53 PM UTC-5, robf...@gmail.com wrote:
> > >> On Saturday, October 7, 2023 at 4:31:36 PM UTC-4, MitchAlsup wrote:
> > >>> On Saturday, October 7, 2023 at 1:46:25 PM UTC-5, robf...@gmail.com wrote:
> > >>> <
> > >>>> The primary stream I$ is likely to use up most of the available fetch spots. What
> > >>>> if there is another branch in ALT-I$? Is there going to be yet another ALT2-I$ cache?
> > >>>> How many alternate branch paths are going to be supported?
> > >>> <
> > >>> I expect the I cache to feed the sequential nature of instruction flow and the ALT cache
> > >>> to service the excursions from the sequential flow. What I don't want is to add cycles
> > >>> between the fetches and the delivery to instruction queues. My 1-wide machine has
> > >>> 2 cycles between instruction being flopped after SRAM sense amplifiers and being
> > >>> issued into execution. I want this 6-wide machine to have only 1 more::
> > >>> <
> > >>> FETCH--BUFFER--CHOOSE--DECODE--ISSUE--Execute--CACHE --ALIGN--WAIT --WRITE
> > >>> The output of Choose is 6 instructions,
> > >>> The output of Issue is 6 instructions with register and constant operands at their
> > >>> .....respective instruction queue.
> > >> At some point a selection between I$ and ALT-I$ clusters must be made. Is this the purpose
> > >> of choose? I would think the choice would be made after EXECUTE once the branch
> > >> outcome is determined.
> > > <
> > > In this thread we are anticipating 6-8-wide instructions per cycle. Under these circumstances
> > > prediction prior to Issue, branch resolution cleans up the data-flow later. {{The 1-wide in
> > > order µArchitecture would do as you suggest, but you see after it fetches way far ahead,
> > > it can use that I $ BW to prefetch branch targets it has not yet decoded (scan ahead)}}.
> > > <
> > > Once you get to 6-wide you have to be fetching and alternate fetching simultaneously.
> > > Once you get to 8-wide you might have to start doing 2 predictions per cycle....
> <
> > A 6 instruction fetch, say average 5 bytes each = 30 bytes per fetch.
> > Lets call it 32 bytes. So I$L1 needs to feed fetch an aligned 32 byte
> > fetch block per clock to sustain fetching with no bubbles.
> <
> Consider that the IP Adder creates IP+some-power-of-2 and IP+16+some-
> power-of-2. Now we are in a position to fetch 8 words as::
> a) 0-1
> b) 1-2
> c) 2-3
> d) 3-0 of the next IP
> So, for an additional carry chain, you suffer no gross alignedness problems.
> >
> > We also have to deal with the I$L1 read access latency.
> > Lets say I$L1 has a pipelined latency of 3 and throughput of 1.
> <
> Then you made it too big !! And why should each SRAM in a cache line
> not be accessible independently (that is, throughput = 4 or some higher power
> of 2)?
<
I need to elaborate here::
<
A D$ may have a 3-cycle latency and a throughput of 1 per bank, but it has
logic in it that the I$ does not need--that is, the byte-alignment logic
that takes the 3rd cycle. I$ can deliver aligned chunks in 2 cycles under
the same metrics.
<
The only thing getting in the way, here, of $ size is wire delay. Data/Inst/Tag
SRAMs only need the lower order ~16(-3) bits of address which comes out
of the adder at least 4 gate delays before the final HOB resolves. Thus, the
addresses reach SRAM flip-flops by clock edge. SRAM access takes 1 full
clock. Tag comparison (hit) output multiplexing, byte alignment, and result
bus drive take the 3rd cycle. Over in I$ land, output multiplexing chooses
which data goes into the Inst Buffer, while the IB is supplying instructions
into PARSE/DECODE. While in IB, instructions are scanned for branches and
their targets fetched (this is the 1-wide machine)--in the 6-wide machine we
are getting close to the point where in spaghetti codes all instructions might
come from the ALT cache. Indeed, the Mc 88120 had its packet cache arranged
to supply all instructions and missed instructions flowed through the C$. Here
we used a 4-banked D$ and a 6-ported C$ to deal with all memory ordering
problems and would only deliver LDs that had resolved all dependencies.
<
BUT DRAM was only 5 cycles away (50ns) so the smallish caches and lack
of associativity did not hurt all that much. This cannot be replicated now
mainly due to wire delay and balancing the several hundred cycles to DRAM
with L2 and L3 caching.
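<
A toy C sketch of the IB scan-ahead mentioned above: while the Inst Buffer
feeds PARSE/DECODE, a scanner walks the not-yet-decoded words looking for
branches and prefetches their targets. The opcode test and displacement
field below are invented stand-ins, not My 66000's actual encoding.

#include <stdint.h>

#define IB_WORDS 16

typedef struct {
    uint32_t word[IB_WORDS];
    uint64_t va[IB_WORDS];  /* virtual address of each buffered word */
    int scan;               /* scan cursor, trails the fill pointer  */
    int count;              /* valid words in the buffer             */
} InstBuf;

static int is_branch(uint32_t w)              /* assumed major opcode */
{
    return (w >> 26) == 0x1A;
}

static int64_t branch_disp_bytes(uint32_t w)  /* assumed 26-bit field */
{
    return ((int64_t)(int32_t)(w << 6) >> 6) * 4;  /* sign-extend, words to bytes */
}

static void scan_ahead(InstBuf *ib, void (*prefetch)(uint64_t va))
{
    for (; ib->scan < ib->count; ib->scan++)
        if (is_branch(ib->word[ib->scan]))
            prefetch(ib->va[ib->scan] + branch_disp_bytes(ib->word[ib->scan]));
}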
<
> <
> > To not stall fetch must request a I$L1 block 3 cycles before it needs it.
> > If a branch wants to change from sequential to alternate path and not stall
> > we need both blocks in the fetch buffers at the same instant.
> <
> Obviously
> <
> > If we wait until we parse the branch instruction to begin the I$L1 read
> > then it is too late.
> <
> Obviously--which is why we scan ahead or use some predictor to identify
> the branches before we get to them.
> >
> > We need a _Fetch Block Prefetch Predictor_ as the start of alternate path
> > fetching has to propagate backwards 4 clocks such that we can switch from
> > sequential to alternate if we want to without missing a clock or even a lane.
> <
> The other train of thought I have been working on is that we have a multiplicity
> of successor addresses that are present when the sequential and Alternate
> instructions are available; and a predictor predicts which one to use as
> sequential and which one to use as ALT.
> >
> > If the I$L1 has dual even-odd banks then on average it can supply
> > two 64-byte block pairs per clock. That should be more than enough to
> > continuously feed the fetch unit.
> <
> L1 will have 4 blocks per tag each block contains 16-bytes.
> >
> > A Fetch Block Prefetch Predictor (FBPP) could work by having fetch attach an
> > alternate fetch block physical address to the block it fetched 4 clocks ago,
> > and store this in the I$L1 cache line, for each of the low and high blocks.
> <
> The trick is getting these things started. Once started, they tend to work rather well.
> <
> > Note this is a prefetch hint to fetch that in 3 clocks the fetch direction
> > might change. The prefetched buffer address has to be verified as correct
> > when the branch instruction is seen and its target virtual address is
> > calculated and translated.

Re: State of the art non-linear FETCH

<51602944-b6c0-43c6-b2ef-3c0960403eb7n@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=34495&group=comp.arch#34495

X-Received: by 2002:a05:622a:1490:b0:418:1ae1:bc37 with SMTP id t16-20020a05622a149000b004181ae1bc37mr291436qtx.3.1696999559138; Tue, 10 Oct 2023 21:45:59 -0700 (PDT)
X-Received: by 2002:a05:6808:2001:b0:3ac:b428:844d with SMTP id q1-20020a056808200100b003acb428844dmr10351053oiw.8.1696999558893; Tue, 10 Oct 2023 21:45:58 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!feeder.usenetexpress.com!tr1.iad1.usenetexpress.com!69.80.99.18.MISMATCH!border-1.nntp.ord.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 10 Oct 2023 21:45:58 -0700 (PDT)
In-Reply-To: <f5199c50-a69c-47c3-9cae-2878c9a935f8n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2607:fea8:1dde:1d00:fe:76a8:1046:c9de; posting-account=QId4bgoAAABV4s50talpu-qMcPp519Eb
NNTP-Posting-Host: 2607:fea8:1dde:1d00:fe:76a8:1046:c9de
References: <12ec2ad9-b8b6-4919-ba47-c9a86788c172n@googlegroups.com> <4a%TM.484$3l%9.141@fx02.iad> <80c4275a-5b74-4030-a112-d34520067b59n@googlegroups.com> <Q6fUM.22285$%wRe.19061@fx46.iad> <134498b5-3eed-42bf-8cd7-0a7622264625n@googlegroups.com> <704323f1-b9c7-4c21-95db-9aa6f5be5aa9n@googlegroups.com> <f948faaa-f1b8-40ca-b409-8570b55c5b41n@googlegroups.com> <5110f29f-9e95-4798-931c-7d68ad646b03n@googlegroups.com> <BMfVM.35908$rbid.35777@fx18.iad> <b882d768-420a-419c-89ba-124689674c0fn@googlegroups.com> <f5199c50-a69c-47c3-9cae-2878c9a935f8n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <51602944-b6c0-43c6-b2ef-3c0960403eb7n@googlegroups.com>
Subject: Re: State of the art non-linear FETCH
From: robfi680@gmail.com (robf...@gmail.com)
Injection-Date: Wed, 11 Oct 2023 04:45:59 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 201
 by: robf...@gmail.com - Wed, 11 Oct 2023 04:45 UTC

On Tuesday, October 10, 2023 at 3:07:07 PM UTC-4, MitchAlsup wrote:
> On Tuesday, October 10, 2023 at 1:46:36 PM UTC-5, MitchAlsup wrote:
> > On Tuesday, October 10, 2023 at 12:23:49 PM UTC-5, EricP wrote:
> > > MitchAlsup wrote:
> > > > On Saturday, October 7, 2023 at 5:18:53 PM UTC-5, robf...@gmail.com wrote:
> > > >> On Saturday, October 7, 2023 at 4:31:36 PM UTC-4, MitchAlsup wrote:
> > > >>> On Saturday, October 7, 2023 at 1:46:25 PM UTC-5, robf...@gmail.com wrote:
> > > >>> <
> > > >>>> The primary stream I$ is likely to use up most of the available fetch spots. What
> > > >>>> if there is another branch in ALT-I$? Is there going to be yet another ALT2-I$ cache?
> > > >>>> How many alternate branch paths are going to be supported?
> > > >>> <
> > > >>> I expect the I cache to feed the sequential nature of instruction flow and the ALT cache
> > > >>> to service the excursions from the sequential flow. What I don't want is to add cycles
> > > >>> between the fetches and the delivery to instruction queues. My 1-wide machine has
> > > >>> 2 cycles between instruction being flopped after SRAM sense amplifiers and being
> > > >>> issued into execution. I want this 6-wide machine to have only 1 more::
> > > >>> <
> > > >>> FETCH--BUFFER--CHOOSE--DECODE--ISSUE--Execute--CACHE --ALIGN--WAIT --WRITE
> > > >>> The output of Choose is 6 instructions,
> > > >>> The output of Issue is 6 instructions with register and constant operands at their
> > > >>> .....respective instruction queue.
> > > >> At some point a selection between I$ and ALT-I$ clusters must be made. Is this the purpose
> > > >> of choose? I would think the choice would be made after EXECUTE once the branch
> > > >> outcome is determined.
> > > > <
> > > > In this thread we are anticipating 6-8-wide instructions per cycle. Under these circumstances
> > > > prediction prior to Issue, branch resolution cleans up the data-flow later. {{The 1-wide in
> > > > order µArchitecture would do as you suggest, but you see after it fetches way far ahead,
> > > > it can use that I $ BW to prefetch branch targets it has not yet decoded (scan ahead)}}.
> > > > <
> > > > Once you get to 6-wide you have to be fetching and alternate fetching simultaneously.
> > > > Once you get to 8-wide you might have to start doing 2 predictions per cycle....
> > <
> > > A 6 instruction fetch, say average 5 bytes each = 30 bytes per fetch.
> > > Lets call it 32 bytes. So I$L1 needs to feed fetch an aligned 32 byte
> > > fetch block per clock to sustain fetching with no bubbles.
> > <
> > Consider that the IP Adder creates IP+some-power-of-2 and IP+16+some-
> > power-of-2. Now we are in a position to fetch 8 words as::
> > a) 0-1
> > b) 1-2
> > c) 2-3
> > d) 3-0 of the next IP
> > So, for an additional carry chain, you suffer no gross alignedness problems.
> > >
> > > We also have to deal with the I$L1 read access latency.
> > > Lets say I$L1 has a pipelined latency of 3 and throughput of 1.
> > <
> > Then you made it too big !! And why should each SRAM in a cache line
> > not be accessible independently (that is, throughput = 4 or some higher power
> > of 2)?
> <
> I need to elaborate here::
> <
> A D$ may have a 3-cycle latency and 1 throughput per bank but this has
> logic in it that the I$ does not need--that is the byte alignment logic
> that takes the 3rd cycle. I$ can deliver aligned chunks in 2 cycles of
> the same metrics.
> <
> The only thing getting in the way, here, of $ size is wire delay. Data/Inst/Tag
> SRAMs only need the lower order ~16(-3) bits of address which comes out
> of the adder at least 4 gate delays before the final HOB resolves. Thus, the
> addresses reach SRAM flip-flops by clock edge. SRAM access takes 1 full
> clock. Tag comparison (hit) output multiplexing, byte alignment, and result
> bus drive take the 3rd cycle. Over in I$ land, output multiplexing chooses
> which data goes into the Inst Buffer, while the IB is supplying instructions
> into PARSE/DECODE. While in IB, instructions are scanned for branches and
> their targets fetched (this is the 1-wide machine)--in the 6-wide machine we
> are getting close to the point where in spaghetti codes all instructions might
> come from the ALT cache. Indeed, the Mc 88120 had its packet cache arranged
> to supply all instructions and missed instructions flowed through the C$. Here
> we used a 4-banked D$ and a 6-ported C$ to deal with all memory ordering
> problems and would only deliver LDs that had resolved all dependencies.
> <
> BUT DRAM was only 5 cycles away (50ns) so the smallish caches and lack
> of associativity did not hurt all that much. This cannot be replicated now
> mainly due to wire delay and balancing the several hundred cycles to DRAM
> with L2 and L3 caching.
> <
> > <
> > > To not stall fetch must request a I$L1 block 3 cycles before it needs it.
> > > If a branch wants to change from sequential to alternate path and not stall
> > > we need both blocks in the fetch buffers at the same instant.
> > <
> > Obviously
> > <
> > > If we wait until we parse the branch instruction to begin the I$L1 read
> > > then it is too late.
> > <
> > Obviously--which is why we scan ahead or use some predictor to identify
> > the branches before we get to them.
> > >
> > > We need a _Fetch Block Prefetch Predictor_ as the start of alternate path
> > > fetching has to propagate backwards 4 clocks such that we can switch from
> > > sequential to alternate if we want to without missing a clock or even a lane.
> > <
> > The other train of thought I have been working on is that we have a multiplicity
> > of successor addresses that are present when the sequential and Alternate
> > instructions are available; and a predictor predicts which one to use as
> > sequential and which one to use as ALT.
> > >
> > > If the I$L1 has dual even-odd banks then on average it can supply
> > > two 64-byte block pairs per clock. That should be more than enough to
> > > continuously feed the fetch unit.
> > <
> > L1 will have 4 blocks per tag each block contains 16-bytes.
> > >
> > > A Fetch Block Prefetch Predictor (FBPP) could work by having fetch attach an
> > > alternate fetch block physical address to the block it fetched 4 clocks ago,
> > > and store this in the I$L1 cache line, for each of the low and high blocks.
> > <
> > The trick is getting these things started. Once started, they tend to work rather well.
> > <
> > > Note this is a prefetch hint to fetch that in 3 clocks the fetch direction
> > > might change. The prefetched buffer address has to be verified as correct
> > > when the branch instruction is seen and its target virtual address is
> > > calculated and translated.

Can some parsing be done at the fetch stage, if the encoding is really simple? I have
tried this before in a simple pipelined CPU and it seemed to work, but it was with
single-cycle cache latency. I am not sure what the impact on timing would be. Maybe
too many gates? It could be used to chop one cycle off the latency.

Rather than use a hardware fetch predictor, would it be possible to use a static,
compiler-generated prediction? Seems to me that a “JMP Alt-target(s)” instruction
could be in the instruction stream sometime before the branch instruction it is
associated with. If it could be placed enough instructions before the branch, it may
be possible to hide the latency. JMP-Alt could begin fetch from multiple targets,
then have a following branch select which Alt path is the correct one.
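
A toy C model of this hypothetical JMP-Alt mechanism. The instruction, its
fields, and its timing are assumptions of the sketch, not an existing ISA
feature.

#include <stdint.h>

#define MAX_TARGETS 2

typedef struct {
    uint64_t target[MAX_TARGETS];
    int ntargets;
    int live;                 /* set by the hint, cleared by the branch */
} AltHint;

/* The compiler plants JMP-Alt several instructions ahead of the branch;
   the front end starts alternate-path fetches immediately.              */
static void on_jmp_alt(AltHint *h, const uint64_t *tgts, int n,
                       void (*alt_fetch)(uint64_t va))
{
    h->ntargets = n;
    h->live = 1;
    for (int i = 0; i < n; i++) {
        h->target[i] = tgts[i];
        alt_fetch(tgts[i]);
    }
}

/* When the associated branch resolves, keep the matching ALT stream;
   -1 means the hint missed and the ALT fetches are discarded.           */
static int on_branch(AltHint *h, uint64_t taken_va)
{
    if (!h->live) return -1;
    h->live = 0;
    for (int i = 0; i < h->ntargets; i++)
        if (h->target[i] == taken_va) return i;
    return -1;
}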

Re: State of the art non-linear FETCH

<73c687ea-b3cf-4c82-9951-48a61b5b159cn@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=34498&group=comp.arch#34498

X-Received: by 2002:ac8:7f15:0:b0:403:a91d:bfec with SMTP id f21-20020ac87f15000000b00403a91dbfecmr348851qtk.0.1697057168131;
Wed, 11 Oct 2023 13:46:08 -0700 (PDT)
X-Received: by 2002:a9d:7ad6:0:b0:6b9:5156:a493 with SMTP id
m22-20020a9d7ad6000000b006b95156a493mr6850292otn.4.1697057167906; Wed, 11 Oct
2023 13:46:07 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!1.us.feeder.erje.net!feeder.erje.net!border-1.nntp.ord.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 11 Oct 2023 13:46:07 -0700 (PDT)
In-Reply-To: <51602944-b6c0-43c6-b2ef-3c0960403eb7n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:e5f6:ea9:bc42:79cb;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:e5f6:ea9:bc42:79cb
References: <12ec2ad9-b8b6-4919-ba47-c9a86788c172n@googlegroups.com>
<4a%TM.484$3l%9.141@fx02.iad> <80c4275a-5b74-4030-a112-d34520067b59n@googlegroups.com>
<Q6fUM.22285$%wRe.19061@fx46.iad> <134498b5-3eed-42bf-8cd7-0a7622264625n@googlegroups.com>
<704323f1-b9c7-4c21-95db-9aa6f5be5aa9n@googlegroups.com> <f948faaa-f1b8-40ca-b409-8570b55c5b41n@googlegroups.com>
<5110f29f-9e95-4798-931c-7d68ad646b03n@googlegroups.com> <BMfVM.35908$rbid.35777@fx18.iad>
<b882d768-420a-419c-89ba-124689674c0fn@googlegroups.com> <f5199c50-a69c-47c3-9cae-2878c9a935f8n@googlegroups.com>
<51602944-b6c0-43c6-b2ef-3c0960403eb7n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <73c687ea-b3cf-4c82-9951-48a61b5b159cn@googlegroups.com>
Subject: Re: State of the art non-linear FETCH
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Wed, 11 Oct 2023 20:46:08 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 222
 by: MitchAlsup - Wed, 11 Oct 2023 20:46 UTC

On Tuesday, October 10, 2023 at 11:46:00 PM UTC-5, robf...@gmail.com wrote:
> On Tuesday, October 10, 2023 at 3:07:07 PM UTC-4, MitchAlsup wrote:
> > On Tuesday, October 10, 2023 at 1:46:36 PM UTC-5, MitchAlsup wrote:
> > > On Tuesday, October 10, 2023 at 12:23:49 PM UTC-5, EricP wrote:
> > > > MitchAlsup wrote:
> > > > > On Saturday, October 7, 2023 at 5:18:53 PM UTC-5, robf...@gmail.com wrote:
> > > > >> On Saturday, October 7, 2023 at 4:31:36 PM UTC-4, MitchAlsup wrote:
> > > > >>> On Saturday, October 7, 2023 at 1:46:25 PM UTC-5, robf...@gmail.com wrote:
> > > > >>> <
> > > > >>>> The primary stream I$ is likely to use up most of the available fetch spots. What
> > > > >>>> if there is another branch in ALT-I$? Is there going to be yet another ALT2-I$ cache?
> > > > >>>> How many alternate branch paths are going to be supported?
> > > > >>> <
> > > > >>> I expect the I cache to feed the sequential nature of instruction flow and the ALT cache
> > > > >>> to service the excursions from the sequential flow. What I don't want is to add cycles
> > > > >>> between the fetches and the delivery to instruction queues. My 1-wide machine has
> > > > >>> 2 cycles between instruction being flopped after SRAM sense amplifiers and being
> > > > >>> issued into execution. I want this 6-wide machine to have only 1 more::
> > > > >>> <
> > > > >>> FETCH--BUFFER--CHOOSE--DECODE--ISSUE--Execute--CACHE --ALIGN--WAIT --WRITE
> > > > >>> The output of Choose is 6 instructions,
> > > > >>> The output of Issue is 6 instructions with register and constant operands at their
> > > > >>> .....respective instruction queue.
> > > > >> At some point a selection between I$ and ALT-I$ clusters must be made. Is this the purpose
> > > > >> of choose? I would think the choice would be made after EXECUTE once the branch
> > > > >> outcome is determined.
> > > > > <
> > > > > In this thread we are anticipating 6-8-wide instructions per cycle. Under these circumstances
> > > > > prediction prior to Issue, branch resolution cleans up the data-flow later. {{The 1-wide in
> > > > > order µArchitecture would do as you suggest, but you see after it fetches way far ahead,
> > > > > it can use that I $ BW to prefetch branch targets it has not yet decoded (scan ahead)}}.
> > > > > <
> > > > > Once you get to 6-wide you have to be fetching and alternate fetching simultaneously.
> > > > > Once you get to 8-wide you might have to start doing 2 predictions per cycle....
> > > <
> > > > A 6 instruction fetch, say average 5 bytes each = 30 bytes per fetch.
> > > > Lets call it 32 bytes. So I$L1 needs to feed fetch an aligned 32 byte
> > > > fetch block per clock to sustain fetching with no bubbles.
> > > <
> > > Consider that the IP Adder creates IP+some-power-of-2 and IP+16+some-
> > > power-of-2. Now we are in a position to fetch 8 words as::
> > > a) 0-1
> > > b) 1-2
> > > c) 2-3
> > > d) 3-0 of the next IP
> > > So, for an additional carry chain, you suffer no gross alignedness problems.
> > > >
> > > > We also have to deal with the I$L1 read access latency.
> > > > Lets say I$L1 has a pipelined latency of 3 and throughput of 1.
> > > <
> > > Then you made it too big !! And why should each SRAM in a cache line
> > > not be accessible independently (that is, throughput = 4 or some higher power
> > > of 2)?
> > <
> > I need to elaborate here::
> > <
> > A D$ may have a 3-cycle latency and 1 throughput per bank but this has
> > logic in it that the I$ does not need--that is the byte alignment logic
> > that takes the 3rd cycle. I$ can deliver aligned chunks in 2 cycles of
> > the same metrics.
> > <
> > The only thing getting in the way, here, of $ size is wire delay. Data/Inst/Tag
> > SRAMs only need the lower order ~16(-3) bits of address which comes out
> > of the adder at least 4 gate delays before the final HOB resolves. Thus, the
> > addresses reach SRAM flip-flops by clock edge. SRAM access takes 1 full
> > clock. Tag comparison (hit) output multiplexing, byte alignment, and result
> > bus drive take the 3rd cycle. Over in I$ land, output multiplexing chooses
> > which data goes into the Inst Buffer, while the IB is supplying instructions
> > into PARSE/DECODE. While in IB, instructions are scanned for branches and
> > their targets fetched (this is the 1-wide machine)--in the 6-wide machine we
> > are getting close to the point where in spaghetti codes all instructions might
> > come from the ALT cache. Indeed, the Mc 88120 had its packet cache arranged
> > to supply all instructions and missed instructions flowed through the C$. Here
> > we used a 4-banked D$ and a 6-ported C$ to deal with all memory ordering
> > problems and would only deliver LDs that had resolved all dependencies.
> > <
> > BUT DRAM was only 5 cycles away (50ns) so the smallish caches and lack
> > of associativity did not hurt all that much. This cannot be replicated now
> > mainly due to wire delay and balancing the several hundred cycles to DRAM
> > with L2 and L3 caching.
> > <
> > > <
> > > > To not stall fetch must request a I$L1 block 3 cycles before it needs it.
> > > > If a branch wants to change from sequential to alternate path and not stall
> > > > we need both blocks in the fetch buffers at the same instant.
> > > <
> > > Obviously
> > > <
> > > > If we wait until we parse the branch instruction to begin the I$L1 read
> > > > then it is too late.
> > > <
> > > Obviously--which is why we scan ahead or use some predictor to identify
> > > the branches before we get to them.
> > > >
> > > > We need a _Fetch Block Prefetch Predictor_ as the start of alternate path
> > > > fetching has to propagate backwards 4 clocks such that we can switch from
> > > > sequential to alternate if we want to without missing a clock or even a lane.
> > > <
> > > The other train of thought I have been working on is that we have a multiplicity
> > > of successor addresses that are present when the sequential and Alternate
> > > instructions are available; and a predictor predicts which one to use as
> > > sequential and which one to use as ALT.
> > > >
> > > > If the I$L1 has dual even-odd banks then on average it can supply
> > > > two 64-byte block pairs per clock. That should be more than enough to
> > > > continuously feed the fetch unit.
> > > <
> > > L1 will have 4 blocks per tag each block contains 16-bytes.
> > > >
> > > > A Fetch Block Prefetch Predictor (FBPP) could work by having fetch attach an
> > > > alternate fetch block physical address to the block it fetched 4 clocks ago,
> > > > and store this in the I$L1 cache line, for each of the low and high blocks.
> > > <
> > > The trick is getting these things started. Once started, they tend to work rather well.
> > > <
> > > > Note this is a prefetch hint to fetch that in 3 clocks the fetch direction
> > > > might change. The prefetched buffer address has to be verified as correct
> > > > when the branch instruction is seen and its target virtual address is
> > > > calculated and translated.
<
> Can some parsing be done at the fetch stage, if the encoding is really simple? I have
> tried this before in a simple pipelined CPU and it seemed to work, but it was with
> single cycle cache latency. I am not sure what impact would be on timing. Maybe
> too many gates? It could be used to chop one cycle off the latency.
<
The encoding means 40 gates and 4 gate delays give you a pointer to the next
instruction, a pointer to any displacement, and a pointer to any immediate. It is
cheap enough that you can afford to perform it prior to set selection out of the
I$.
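<
A small C sketch of what such a pre-decode could compute. The flag bits
tested below are invented stand-ins; the real My 66000 encoding is not
shown here.

#include <stdint.h>

typedef struct {
    uint8_t next;  /* words to the next instruction (1, 2, or 3)  */
    int8_t  disp;  /* word offset of the displacement, -1 if none */
    int8_t  imm;   /* word offset of the immediate,    -1 if none */
} PreDecode;

static PreDecode predecode(uint32_t w)
{
    PreDecode p = { 1, -1, -1 };
    int has_disp = (w >> 31) & 1;   /* assumed flag bit */
    int has_imm  = (w >> 30) & 1;   /* assumed flag bit */
    if (has_disp) { p.disp = (int8_t)p.next; p.next++; }
    if (has_imm)  { p.imm  = (int8_t)p.next; p.next++; }
    return p;
}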
>
> Rather than use a hardware fetch predictor would it be possible to use a static
> compiler generated prediction? Seems to me that a “JMP Alt-target(s)” instruction
> could be in the instruction stream sometime before the branch instructions it is
> associated with. If it could be placed enough instructions before the branch it may
> be possible to hide the latency. JMP-Alt could begin fetch from multiple targets,
> then have a following branch select which Alt path is the correct one.
<
The entire raison d'être of the ISA is that it requires fundamentally fewer instructions
than other RISC ISAs. My 66000 only requires 70% of the number of instructions that
RISC-V requires. Adding a prepare-to-branch instruction would ruin that density.


Re: State of the art non-linear FETCH

<0ef2fe34-b067-4c9a-95db-0ae0edc0a4e3n@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=34499&group=comp.arch#34499

X-Received: by 2002:a05:6214:b2b:b0:66c:fd3a:982 with SMTP id w11-20020a0562140b2b00b0066cfd3a0982mr144172qvj.11.1697134161718;
Thu, 12 Oct 2023 11:09:21 -0700 (PDT)
X-Received: by 2002:a05:6808:20a2:b0:3ac:ac58:59f5 with SMTP id
s34-20020a05680820a200b003acac5859f5mr12393019oiw.8.1697134161495; Thu, 12
Oct 2023 11:09:21 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Thu, 12 Oct 2023 11:09:21 -0700 (PDT)
In-Reply-To: <73c687ea-b3cf-4c82-9951-48a61b5b159cn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:f4a8:b04d:b143:d24b;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:f4a8:b04d:b143:d24b
References: <12ec2ad9-b8b6-4919-ba47-c9a86788c172n@googlegroups.com>
<4a%TM.484$3l%9.141@fx02.iad> <80c4275a-5b74-4030-a112-d34520067b59n@googlegroups.com>
<Q6fUM.22285$%wRe.19061@fx46.iad> <134498b5-3eed-42bf-8cd7-0a7622264625n@googlegroups.com>
<704323f1-b9c7-4c21-95db-9aa6f5be5aa9n@googlegroups.com> <f948faaa-f1b8-40ca-b409-8570b55c5b41n@googlegroups.com>
<5110f29f-9e95-4798-931c-7d68ad646b03n@googlegroups.com> <BMfVM.35908$rbid.35777@fx18.iad>
<b882d768-420a-419c-89ba-124689674c0fn@googlegroups.com> <f5199c50-a69c-47c3-9cae-2878c9a935f8n@googlegroups.com>
<51602944-b6c0-43c6-b2ef-3c0960403eb7n@googlegroups.com> <73c687ea-b3cf-4c82-9951-48a61b5b159cn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <0ef2fe34-b067-4c9a-95db-0ae0edc0a4e3n@googlegroups.com>
Subject: Re: State of the art non-linear FETCH
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Thu, 12 Oct 2023 18:09:21 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 2742
 by: MitchAlsup - Thu, 12 Oct 2023 18:09 UTC

It occurs to me that I could invent a "flow" Cache--such a cache would supply
2 (or 3) indexes to the I$ and some "count" information--perhaps the number
of words consumed at each index and the kind of flow between the I$ fetch
and the ALT$ fetch (or perhaps 2 fetches to the I$).
<
The flow cache would have the property that it is small enough and fast enough
so that it could index itself on the successive cycle, while spewing off indexes
to the I$. Each I$ index is associated with the number of words to fetch on that
path, then the instruction buffer collects the 6-inst to enter decode and issue.
<
Since the fetches are in words and instructions are in words, there is no loss in
storage density (except for the flow cache itself) and when it is hitting we have
more than enough instructions each cycle to fill up the execution window easily.
<
Now, off to figure out how to backup from a misprediction in zero cycles.....
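<
A first-cut C sketch of such a flow-cache entry and its per-cycle use.
Every field and size here is an assumption about a design that is only
being imagined above.

#include <stdint.h>

#define FLOW_PATHS 3

typedef struct {
    uint64_t ic_index[FLOW_PATHS];  /* I$ (or ALT$) fetch indexes    */
    uint8_t  words[FLOW_PATHS];     /* words consumed at each index  */
    uint8_t  npaths;
    uint16_t next_flow;             /* self-index for the next cycle */
    uint8_t  valid;
} FlowEntry;

/* Each cycle: spew the indexes at the I$, then follow the link so the
   flow cache can index itself on the successive cycle.                */
static uint16_t flow_step(const FlowEntry *fc, uint16_t idx,
                          void (*ifetch)(uint64_t index, int words))
{
    const FlowEntry *e = &fc[idx];
    if (!e->valid) return idx;      /* fall back to normal fetch */
    for (int i = 0; i < e->npaths; i++)
        ifetch(e->ic_index[i], e->words[i]);
    return e->next_flow;
}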

Re: State of the art non-linear FETCH

<KAAXM.107255$2fS.105528@fx16.iad>


https://news.novabbs.org/devel/article-flat.php?id=34566&group=comp.arch#34566

Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx16.iad.POSTED!not-for-mail
From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: State of the art non-linear FETCH
References: <12ec2ad9-b8b6-4919-ba47-c9a86788c172n@googlegroups.com>
In-Reply-To: <12ec2ad9-b8b6-4919-ba47-c9a86788c172n@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 35
Message-ID: <KAAXM.107255$2fS.105528@fx16.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Tue, 17 Oct 2023 18:42:50 UTC
Date: Tue, 17 Oct 2023 14:42:04 -0400
X-Received-Bytes: 2065
 by: EricP - Tue, 17 Oct 2023 18:42 UTC

MitchAlsup wrote:
> I am looking for modern papers on the process of fetching
> non-linear sets of instructions where the first instructions
> in an ISSUE group come from a sequential long fetch and
> other instructions arrive from some other mechanism {such
> as a branch target cache}.

This gets difficult to follow because the word "prefetch" has been
overloaded so many times. My focus is on the instruction queue that
directly feeds the fetch-parse unit, which used to be called the
instruction prefetch queue but which some now call the Fetch Target
Queue (FTQ).
Most uses of the term prefetch refer to D$ and I$ cache prefetching -
monitoring, predicting, and generating addresses to the cache
to trigger L1/L2 prefetches (also called "address generation" or AGEN).

A decoupled front end uses a Fetch Target Queue:

Fetch Directed Instruction Prefetching, 1999
https://web.eecs.umich.edu/~taustin/papers/MICRO32-fdp.pdf

then using the BTB to drive the FTQ:

A Scalable Front-End Architecture for Fast Instruction Delivery, 1999
http://www.cs.cmu.edu/afs/cs/academic/class/15740-f03/public/doc/discussions/uniprocessors/technology/scalable-front-end-isca99.pdf

Branch Target Buffer Organizations, 2023
https://hal.science/hal-04234792/

Shooting Down the Server Front-End Bottleneck, 2022
https://folk.ntnu.no/rakeshk/pubs/Shotgun_TOCS.pdf

Re: State of the art non-linear FETCH

<%5BXM.28315$%wRe.20208@fx46.iad>


https://news.novabbs.org/devel/article-flat.php?id=34567&group=comp.arch#34567

Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx46.iad.POSTED!not-for-mail
From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: State of the art non-linear FETCH
References: <12ec2ad9-b8b6-4919-ba47-c9a86788c172n@googlegroups.com> <KAAXM.107255$2fS.105528@fx16.iad>
In-Reply-To: <KAAXM.107255$2fS.105528@fx16.iad>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 46
Message-ID: <%5BXM.28315$%wRe.20208@fx46.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Tue, 17 Oct 2023 19:18:19 UTC
Date: Tue, 17 Oct 2023 15:18:06 -0400
X-Received-Bytes: 2759
 by: EricP - Tue, 17 Oct 2023 19:18 UTC

EricP wrote:
> MitchAlsup wrote:
>> I am looking for modern papers on the process of fetching
>> non-linear sets of instructions where the first instructions
>> in an ISSUE group come from a sequential long fetch and
>> other instructions arrive from some other mechanism {such
>> as a branch target cache}.
>
> [snip]
>
> A decoupled front end uses a Fetch Target Queue:
>
> Fetch Directed Instruction Prefetching, 1999
> https://web.eecs.umich.edu/~taustin/papers/MICRO32-fdp.pdf
>
> then using the BTB to drive the FTQ:
>
> A Scalable Front-End Architecture for Fast Instruction Delivery, 1999
> http://www.cs.cmu.edu/afs/cs/academic/class/15740-f03/public/doc/discussions/uniprocessors/technology/scalable-front-end-isca99.pdf
>
>
> Branch Target Buffer Organizations, 2023
> https://hal.science/hal-04234792/
>
> Shooting Down the Server Front-End Bottleneck, 2022
> https://folk.ntnu.no/rakeshk/pubs/Shotgun_TOCS.pdf

Notice that these use a huge amount of resources to get the BTB
to generate the next fetch block address with sufficient lead time
for fetch to have zero bubbles.

I suggest this might be done more easily by attaching it to the 32B block
in the I$L1 cache line fetched 4 clocks earlier. It just requires bits to
hold a Valid bit, an AltOnly bit, and an alt physical block address.
If AltOnly=0 it would prefetch on both the sequential and alternate paths;
if AltOnly=1 it would follow just the alt physical address and proceed
sequentially from there.
Cost is maybe 120 extra bits per cache line, one set for each 32B block.
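A minimal C sketch of that per-block state and the resulting prefetch
decision; the 41-bit block number is an assumption (enough for a 46-bit
physical space in 32B blocks), and this bare version needs fewer than the
~120 bits per line guessed above:

#include <stdint.h>

typedef struct {
    uint64_t valid    : 1;   /* an alt target is known for this 32B block */
    uint64_t alt_only : 1;   /* 1 = do not prefetch the fall-through path */
    uint64_t alt_blk  : 41;  /* physical 32B-block number of the alt target */
} AltMeta;                   /* two of these per 64B cache line */

static void issue_prefetch(uint64_t blk)
{
    (void)blk;               /* stand-in: queue a demand on the I$/L2 */
}

/* Prefetch decision made when block seq_blk is fetched. */
static void prefetch_next(AltMeta m, uint64_t seq_blk)
{
    if (!m.valid) {                   /* nothing known: stay sequential */
        issue_prefetch(seq_blk + 1);
    } else if (m.alt_only) {          /* follow the alt path only... */
        issue_prefetch(m.alt_blk);    /* ...and go sequential from there */
    } else {                          /* prefetch both possible paths */
        issue_prefetch(seq_blk + 1);
        issue_prefetch(m.alt_blk);
    }
}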

Re: State of the art non-linear FETCH

<HGCXM.41729$qoC8.14869@fx40.iad>


https://news.novabbs.org/devel/article-flat.php?id=34568&group=comp.arch#34568

Path: i2pn2.org!i2pn.org!news.neodome.net!feeder1.feed.usenet.farm!feed.usenet.farm!peer01.ams4!peer.am4.highwinds-media.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx40.iad.POSTED!not-for-mail
From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: State of the art non-linear FETCH
References: <12ec2ad9-b8b6-4919-ba47-c9a86788c172n@googlegroups.com> <4a%TM.484$3l%9.141@fx02.iad> <80c4275a-5b74-4030-a112-d34520067b59n@googlegroups.com> <Q6fUM.22285$%wRe.19061@fx46.iad> <134498b5-3eed-42bf-8cd7-0a7622264625n@googlegroups.com> <704323f1-b9c7-4c21-95db-9aa6f5be5aa9n@googlegroups.com> <f948faaa-f1b8-40ca-b409-8570b55c5b41n@googlegroups.com> <5110f29f-9e95-4798-931c-7d68ad646b03n@googlegroups.com> <BMfVM.35908$rbid.35777@fx18.iad> <b882d768-420a-419c-89ba-124689674c0fn@googlegroups.com> <f5199c50-a69c-47c3-9cae-2878c9a935f8n@googlegroups.com>
In-Reply-To: <f5199c50-a69c-47c3-9cae-2878c9a935f8n@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 133
Message-ID: <HGCXM.41729$qoC8.14869@fx40.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Tue, 17 Oct 2023 21:05:43 UTC
Date: Tue, 17 Oct 2023 17:05:10 -0400
X-Received-Bytes: 8126
 by: EricP - Tue, 17 Oct 2023 21:05 UTC

MitchAlsup wrote:
> [snip]
>>> We also have to deal with the I$L1 read access latency.
>>> Lets say I$L1 has a pipelined latency of 3 and throughput of 1.
>> <
>> Then you made it too big !! And why should not each SRAM in a cache line
>> not be accessible independently (that is throughput = 4 or some higher power
>> of 2).
> <
> I need to elaborate here::
> <
> A D$ may have a 3-cycle latency and 1 throughput per bank but this has
> logic in it that the I$ does not need--that is the byte alignment logic
> that takes the 3rd cycle. I$ can deliver aligned chunks in 2 cycles of
> the same metrics.

Ok. I thought the I$L1 read pipeline would always be 3 stages
as it didn't look like there would be enough slack in stage 3 to do
much of anything else (that being instruction alignment and parse).
- cache address decode, latch
- word line drive, bit line drive, sense amp, latch
- tag match, way select mux, drive to prefetch buffer, latch

So just following my train of thought, it might as well read a whole
cache line into the prefetch buffers for each cache access. Out of the
prefetch buffer it reads one or two 32B blocks for consecutive
virtual addresses.

The 1 or 2 blocks go to the block aligner which uses the FetchRIP low
address bits to shift from 0 to 7 instruction words.

That result feeds to the Log2 Parser which selects up to 6 instructions
from those source bytes.

Fetch Line Buf 0  (fully assoc index)
Fetch Line Buf 1
Fetch Line Buf 2
Fetch Line Buf 3
          v          v
      32B Blk1   32B Blk0
          v          v
     Alignment Shifter (8:1 muxes)
               v
     Log2 Parser and Branch Detect
      v    v    v    v    v    v
     I5   I4   I3   I2   I1   I0
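The aligner in the middle of that picture is easy to sketch in C; treating
each 32B block as 8 words and FetchRIP[4:2] as the shift amount is my
reading of the description, not a given:

#include <stdint.h>
#include <string.h>

/* Concatenate Blk1:Blk0 (16 words) and shift by the FetchRIP low
 * address bits, 0..7 instruction words, to form the parse window. */
static void align_window(const uint32_t blk0[8], const uint32_t blk1[8],
                         uint64_t fetch_rip, uint32_t win[8])
{
    uint32_t cat[16];
    memcpy(&cat[0], blk0, 8 * sizeof cat[0]);
    memcpy(&cat[8], blk1, 8 * sizeof cat[0]);
    unsigned shift = (fetch_rip >> 2) & 7;  /* word offset within a block */
    for (int i = 0; i < 8; i++)
        win[i] = cat[shift + i];            /* one 8:1 mux per output word */
    /* win[] then feeds the Log2 Parser and branch detect. */
}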

> <
> The only thing getting in the way, here, of $ size is wire delay. Data/Inst/Tag
> SRAMs only need the lower order ~16(-3) bits of address which comes out
> of the adder at least 4 gate delays before the final HOB resolves. Thus, the
> addresses reach SRAM flip-flops by clock edge. SRAM access takes 1 full
> clock. Tag comparison (hit) output multiplexing, byte alignment, and result
> bus drive take the 3rd cycle. Over in I$ land, output multiplexing chooses
> which data goes into the Inst Buffer, while the IB is supplying instructions
> into PARSE/DECODE. While in IB, instructions are scanned for branches and
> their targets fetched (this is the 1-wide machine)--in the 6-wide machine we
> are getting close to the point where in spaghetti codes all instructions might
> come from the ALT cache. Indeed, the Mc 88120 had its packet cache arranged
> to supply all instructions and missed instructions flowed through the C$. Here
> we used a 4-banked D$ and a 6-ported C$ to deal with all memory ordering
> problems and would only deliver LDs that had resolved all dependencies.

That was one of the reasons for suggesting a 32B block as the fetch unit.
Your My66k instruction words (IWd) are 4 bytes, aligned, and instructions
can be 1 to 3 IWd long. If the average is 5 bytes per instruction then a
32B block holds about 6 instructions, which is also about a basic block
size and on average would contain 1 branch.

So by attaching an optional next-alt-fetch physical address to each
32B block in a 64B cache line, it can specify the starts of both the
sequential and the alt fetch paths to follow.

That allows 1 or 2 Fetch Line Bufs to be loaded with the alt path
so when the Log2 Parser sees a branch it can immediately switch to
parsing instructions from the alt path on the next clock.
If two cache lines from the alt path are loaded, that's 4 prefetch blocks
which should be enough to cover the I$ read pipeline latency.

The BTB and Return Stack Predictor are also providing virtual address
prefetch predictions, so that has to be integrated in this too.

> <
> BUT DRAM was only 5 cycles away (50ns) so the smallish caches and lack
> of associativity did not hurt all that much. This cannot be replicated now
> mainly due to wire delay and balancing the several hundred cycles to DRAM
> with L2 and L3 caching.

Re: State of the art non-linear FETCH

<1a506549-03be-4cdc-9834-d68e53cd82d8n@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=34570&group=comp.arch#34570

X-Received: by 2002:a05:620a:2cc7:b0:76f:cd2:5d10 with SMTP id tc7-20020a05620a2cc700b0076f0cd25d10mr71982qkn.5.1697590822665;
Tue, 17 Oct 2023 18:00:22 -0700 (PDT)
X-Received: by 2002:a05:6830:2b22:b0:6c6:3ea5:cdc8 with SMTP id
l34-20020a0568302b2200b006c63ea5cdc8mr1484897otv.5.1697590822416; Tue, 17 Oct
2023 18:00:22 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!border-2.nntp.ord.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 17 Oct 2023 18:00:22 -0700 (PDT)
In-Reply-To: <bb99fa89-7873-473b-90c4-201f49659f63n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:317a:59be:f167:bdb0;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:317a:59be:f167:bdb0
References: <12ec2ad9-b8b6-4919-ba47-c9a86788c172n@googlegroups.com>
<4a%TM.484$3l%9.141@fx02.iad> <80c4275a-5b74-4030-a112-d34520067b59n@googlegroups.com>
<Q6fUM.22285$%wRe.19061@fx46.iad> <134498b5-3eed-42bf-8cd7-0a7622264625n@googlegroups.com>
<704323f1-b9c7-4c21-95db-9aa6f5be5aa9n@googlegroups.com> <f948faaa-f1b8-40ca-b409-8570b55c5b41n@googlegroups.com>
<5110f29f-9e95-4798-931c-7d68ad646b03n@googlegroups.com> <BMfVM.35908$rbid.35777@fx18.iad>
<b882d768-420a-419c-89ba-124689674c0fn@googlegroups.com> <f5199c50-a69c-47c3-9cae-2878c9a935f8n@googlegroups.com>
<HGCXM.41729$qoC8.14869@fx40.iad> <bb99fa89-7873-473b-90c4-201f49659f63n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <1a506549-03be-4cdc-9834-d68e53cd82d8n@googlegroups.com>
Subject: Re: State of the art non-linear FETCH
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Wed, 18 Oct 2023 01:00:22 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 67
 by: MitchAlsup - Wed, 18 Oct 2023 01:00 UTC

On Tuesday, October 17, 2023 at 5:15:32 PM UTC-5, robf...@gmail.com wrote:
> On Tuesday, October 17, 2023 at 5:05:48 PM UTC-4, EricP wrote:
> > MitchAlsup wrote:
>
<
> What if only some alt paths are supported?
<
Essentially all loops will fit in the IBuffer of this GBOoO machine.
Any short forward branch can be treated as a non-branch, for fetch purposes.
CALL/RET targets may be supported by a different "predictor".
Jump indirect will have its own "predictor".
<
> Could alt paths be available only for forward branches?
<
The alt path is a port-reduction scheme for the I$.
<
> How does the number of alt paths supported impact performance?
<
During execution you are going to want ~8-16 (especially when switch()s are only
1 flow control instruction--rather than 2 or 3).
As an IPC support module, you probably want at least 256.
But you want them to be independent of the size and associativity of the I$.
<
>
> How does the alt-path get set in the fetch block? When things are cold the alt path
> may not be set correctly.
<
May not!?! I expect predicting compulsory misses is completely futile.
However, I expect that predicting capacity/collision misses might be useful.
<
> So, it would need to be updated in the I$;
<
Not necessarily--flow prediction will be a cache all by itself--and it predicts only
the fetch indexes--and "hits" are checked during decode.
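Roughly, in C (all names invented): fetch carries along what the flow cache
predicted, and decode confirms that the bits it actually parsed agree:

#include <stdbool.h>
#include <stdint.h>

typedef struct { uint16_t idx; uint8_t cnt; } FlowPred;

/* Decode-time verification of a flow-cache "hit". A mismatch means the
 * entry went stale (line replaced, code modified); the fetch group is
 * squashed and fetch falls back to the plain sequential path. */
static bool flow_hit_ok(FlowPred pred, uint16_t parsed_idx, uint8_t parsed_cnt)
{
    return pred.idx == parsed_idx && pred.cnt == parsed_cnt;
}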
<
> I$ line invalidate required then?
<
You want the actual executed instructions to come from cache(s) that get
invalidated when the I$ no longer holds that instruction--{so remote
invalidates work}. {{Packet caches and trace caches have these problems.}}
<
> I have thought of placing the alt paths for the top four alternate
> paths in a function header rather than for every block in the function.
<
Function header is not present on 1-wide machines and I want code compatibility.
<
> It would
> reduce the I$ storage requirements. Then fetch from all four paths and use a
> branch instruction to select the path. It would mean that only a limited number of
> alt paths are supported but for many small functions a small number may be
> adequate.
<
An interesting notion.

Re: State of the art non-linear FETCH

<f1be921d-c5b5-4216-bbdc-9ac1d563cc0cn@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=34571&group=comp.arch#34571

X-Received: by 2002:a05:620a:450c:b0:774:1c49:2d6b with SMTP id t12-20020a05620a450c00b007741c492d6bmr86975qkp.12.1697594436241;
Tue, 17 Oct 2023 19:00:36 -0700 (PDT)
X-Received: by 2002:a05:6870:6392:b0:1dd:39ce:e25c with SMTP id
t18-20020a056870639200b001dd39cee25cmr1770632oap.3.1697594436067; Tue, 17 Oct
2023 19:00:36 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 17 Oct 2023 19:00:35 -0700 (PDT)
In-Reply-To: <HGCXM.41729$qoC8.14869@fx40.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:317a:59be:f167:bdb0;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:317a:59be:f167:bdb0
References: <12ec2ad9-b8b6-4919-ba47-c9a86788c172n@googlegroups.com>
<4a%TM.484$3l%9.141@fx02.iad> <80c4275a-5b74-4030-a112-d34520067b59n@googlegroups.com>
<Q6fUM.22285$%wRe.19061@fx46.iad> <134498b5-3eed-42bf-8cd7-0a7622264625n@googlegroups.com>
<704323f1-b9c7-4c21-95db-9aa6f5be5aa9n@googlegroups.com> <f948faaa-f1b8-40ca-b409-8570b55c5b41n@googlegroups.com>
<5110f29f-9e95-4798-931c-7d68ad646b03n@googlegroups.com> <BMfVM.35908$rbid.35777@fx18.iad>
<b882d768-420a-419c-89ba-124689674c0fn@googlegroups.com> <f5199c50-a69c-47c3-9cae-2878c9a935f8n@googlegroups.com>
<HGCXM.41729$qoC8.14869@fx40.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <f1be921d-c5b5-4216-bbdc-9ac1d563cc0cn@googlegroups.com>
Subject: Re: State of the art non-linear FETCH
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Wed, 18 Oct 2023 02:00:36 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: MitchAlsup - Wed, 18 Oct 2023 02:00 UTC

On Tuesday, October 17, 2023 at 4:05:48 PM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Tuesday, October 10, 2023 at 1:46:36 PM UTC-5, MitchAlsup wrote:

> > I need to elaborate here::
> > <
> > A D$ may have a 3-cycle latency and 1 throughput per bank but this has
> > logic in it that the I$ does not need--that is the byte alignment logic
> > that takes the 3rd cycle. I$ can deliver aligned chunks in 2 cycles of
> > the same metrics.
<
> Ok. I thought the I$L1 read pipeline would always be 3 stages
> as it didn't look like there would be enough slack in stage 3 to do
> much of anything else (that being instruction alignment and parse).
<
This is where the 4 gates of delay to create unary next-instruction pointers
come in handy. The pointers can be developed while set comparison is done and
are ready when the set-multiplexer select lines are ready. You don't have to
store the information in the I$, yet it is ready by the time you first see the
bit patterns. You can redevelop that data from the instruction bit pattern at
lower cost than you can store it !!
<
> - cache address decode, latch {flip-flop}
> - word line drive, bit line drive, sense amp, latch {flip-flop}
> - tag match, way select mux, drive to prefetch buffer, latch {flip-flop}
>
> So just following my train of thought, it might as well read a whole
> cache line into the prefetch buffers for each cache access. Out of the
> prefetch buffer it reads one or two 32B blocks for consecutive
> virtual addresses.
<
Cache line is 16 words, ... I can see where being able to do four × 4-word
fetches independently might be more sound, but this depends on the
instruction buffer µarchitecture.
>
> The 1 or 2 blocks go to the block aligner which uses the FetchRIP low
> address bits to shift from 0 to 7 instruction words.
<
Note: next-instruction unary pointers are available here.
>
> That result feeds to the Log2 Parser which selects up to 6 instructions
> from those source bytes.
<
Since next-instruction pointers were already available, one can find
all candidate instruction specifiers (6-wide) in 6 gates of delay! So this
PARSE logic can be overlapped with I$ set selection and Instruction Buffer
read-out, leaving DECODE looking at individual instructions for each
DECODE slot.
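Both points in one C sketch, with len_of() standing in for the real length
decode (the actual My 66000 encoding is not given in this thread, so the
2-bit hint below is pure invention):

#include <stdint.h>

/* Stand-in length decode: NOT the real encoding. The claim above is that
 * the real one is ~4 gate delays from the raw bit pattern. */
static int len_of(uint32_t w)
{
    static const int len[4] = { 1, 2, 3, 1 };   /* 1..3 words */
    return len[(w >> 30) & 3];
}

/* Find up to 6 instruction starts in an aligned 8-word window. next[]
 * has no serial dependence--every position is computed in parallel, so
 * it can overlap I$ set selection--and chasing it afterwards is a short
 * mux chain rather than a serial length decode. */
static int parse6(const uint32_t win[8], int start[6])
{
    int next[8];
    for (int i = 0; i < 8; i++)
        next[i] = i + len_of(win[i]);
    int n = 0;
    for (int p = 0; p < 8 && n < 6; p = next[p])
        start[n++] = p;   /* an instruction overhanging the window is
                             completed from the next fetch */
    return n;
}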
>
> [diagram snipped]
<
Yes, pretty much.
<
> > [snip]
<
> That was one of the reasons for suggesting a 32B block as the fetch unit.
<
It is certainly correct within a power of 2.
<
> Your My66k instructions words (IWd) are 4 bytes, aligned, and instructions
> can be 1 to 3 IWd long. If the average is 5 bytes per instruction then a
> 32B block holds about 6 instructions, which is also about a basic block
> size and on average would contain 1 branch.
<
Yes, plus if there is no branch, I can set up the ALT index to be the sequential
index and cover all sequential instruction sequences.
<
My statistics generator says 60.8% of branches are taken and branches
are 19.2% of the instruction stream, so one takes a branch every 9.17
instructions {hand waving}. So between 8 and 10 words is the fetch length
of the sequential fetch, and another 4 on the alternate fetch should cover
everything except branch-density spikes.
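Plugging those numbers in (the 1.25 words per instruction comes from the
5-byte average mentioned earlier in the thread, not from the statistics
generator):

#include <stdio.h>

int main(void)
{
    double branch_frac = 0.192;   /* branches per instruction */
    double taken_frac  = 0.608;   /* fraction of branches taken */
    double insts = 1.0 / (branch_frac * taken_frac);
    double words = insts * 1.25;  /* ~5 bytes = 1.25 words per inst */
    printf("taken branch every %.2f instructions (~%.1f words)\n",
           insts, words);
    return 0;
}

That prints ~8.6 instructions (~10.7 words)--a bit under the 9.17 quoted,
but in the same 8-to-10-word ballpark for the sequential fetch length.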
>
> So by attaching an optional next-alt-fetch physical address to each
> 32B block in a 64B cache line, then it can specify both the sequential
> and alt fetch path start to follow.
<
Change/attaching/into/associating/ and we are on the same page.
Also change/address/into/index/ so we store fewer bits in the flow cache.
>
> That allows 1 or 2 Fetch Line Bufs to be loaded with the alt path
> so when the Log2 Parser sees a branch it can immediately switch to
> parsing instructions from the alt path on the next clock.
<
Must query the branch predictor to decide which path to follow.
<
> If two cache lines from the alt path are loaded, that's 4 prefetch blocks
> which should be enough to cover the I$ read pipeline latency.
<
Here is where 16×¼ cache lines will be superior.
>
> The BTB and Return Stack Predictor are also providing virtual address
> prefetch predictions, so that has to be integrated in this too.
<
So 1991 as a design point.

Re: State of the art non-linear FETCH

<wmVXM.21568$%WT8.17513@fx12.iad>


https://news.novabbs.org/devel/article-flat.php?id=34578&group=comp.arch#34578

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx12.iad.POSTED!not-for-mail
From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: State of the art non-linear FETCH
References: <12ec2ad9-b8b6-4919-ba47-c9a86788c172n@googlegroups.com> <4a%TM.484$3l%9.141@fx02.iad> <80c4275a-5b74-4030-a112-d34520067b59n@googlegroups.com> <Q6fUM.22285$%wRe.19061@fx46.iad> <134498b5-3eed-42bf-8cd7-0a7622264625n@googlegroups.com> <704323f1-b9c7-4c21-95db-9aa6f5be5aa9n@googlegroups.com> <f948faaa-f1b8-40ca-b409-8570b55c5b41n@googlegroups.com> <5110f29f-9e95-4798-931c-7d68ad646b03n@googlegroups.com> <BMfVM.35908$rbid.35777@fx18.iad> <b882d768-420a-419c-89ba-124689674c0fn@googlegroups.com> <f5199c50-a69c-47c3-9cae-2878c9a935f8n@googlegroups.com> <HGCXM.41729$qoC8.14869@fx40.iad> <bb99fa89-7873-473b-90c4-201f49659f63n@googlegroups.com>
In-Reply-To: <bb99fa89-7873-473b-90c4-201f49659f63n@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 180
Message-ID: <wmVXM.21568$%WT8.17513@fx12.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Wed, 18 Oct 2023 18:21:16 UTC
Date: Wed, 18 Oct 2023 14:21:00 -0400
X-Received-Bytes: 11015
 by: EricP - Wed, 18 Oct 2023 18:21 UTC

robf...@gmail.com wrote:
> [snip]
>
> What if only some alt paths are supported? Could alt paths be available only for
> forward branches? How does the number of alt paths supported impact
> performance?

For alt paths to be prefetched requires some kind of trigger.
Current systems seem to use the BTB as a trigger and, from what I read,
this can make the BTB structures quite large.
For zero-bubble fetching the trigger has to be set to go off far
enough ahead to hide the bubble.
More paths, farther ahead, means larger structures and slower access.

> How does the alt-path get set in the fetch block? When things are cold the alt path
> may not be set correctly. So, it would need to be updated in the I$; I$ line invalidate
> required then?

