Re: State of the art non-linear FETCH

From: ThatWouldBeTelling@thevillage.com (EricP)
Newsgroups: comp.arch
Subject: Re: State of the art non-linear FETCH
Date: Wed, 18 Oct 2023 14:21:00 -0400

robf...@gmail.com wrote:
> On Tuesday, October 17, 2023 at 5:05:48 PM UTC-4, EricP wrote:
>> MitchAlsup wrote:
>>> On Tuesday, October 10, 2023 at 1:46:36 PM UTC-5, MitchAlsup wrote:
>>>> On Tuesday, October 10, 2023 at 12:23:49 PM UTC-5, EricP wrote:
>>>>> MitchAlsup wrote:
>>>>>> On Saturday, October 7, 2023 at 5:18:53 PM UTC-5, robf...@gmail.com wrote:
>>>>>>> On Saturday, October 7, 2023 at 4:31:36 PM UTC-4, MitchAlsup wrote:
>>>>>>>> On Saturday, October 7, 2023 at 1:46:25 PM UTC-5, robf...@gmail.com wrote:
>>>>>>>> <
>>>>>>>>> The primary stream I$ is likely to use up most of the available fetch spots. What
>>>>>>>>> if there is another branch in ALT-I$? Is there going to be yet another ALT2-I$ cache?
>>>>>>>>> How many alternate branch paths are going to be supported?
>>>>>>>> <
>>>>>>>> I expect the I cache to feed the sequential nature of instruction flow and the ALT cache
>>>>>>>> to service the excursions from the sequential flow. What I don't want is to add cycles
>>>>>>>> between the fetches and the delivery to instruction queues. My 1-wide machine has
>>>>>>>> 2 cycles between an instruction being flopped after the SRAM sense amplifiers and being
>>>>>>>> issued into execution. I want this 6-wide machine to have only 1 more::
>>>>>>>> <
>>>>>>>> FETCH--BUFFER--CHOOSE--DECODE--ISSUE--Execute--CACHE --ALIGN--WAIT --WRITE
>>>>>>>> The output of Choose is 6 instructions,
>>>>>>>> The output of Issue is 6 instructions with register and constant operands at their
>>>>>>>> .....respective instruction queues.
>>>>>>> At some point a selection between I$ and ALT-I$ clusters must be made. Is this the purpose
>>>>>>> of choose? I would think the choice would be made after EXECUTE once the branch
>>>>>>> outcome is determined.
>>>>>> <
>>>>>> In this thread we are anticipating 6-8-wide instructions per cycle. Under these circumstances
>>>>>> prediction happens prior to Issue, and branch resolution cleans up the data-flow later. {{The 1-wide in
>>>>>> order µArchitecture would do as you suggest, but you see, after it fetches way far ahead,
>>>>>> it can use that I$ BW to prefetch branch targets it has not yet decoded (scan ahead)}}.
>>>>>> <
>>>>>> Once you get to 6-wide you have to be fetching and alternate fetching simultaneously.
>>>>>> Once you get to 8-wide you might have to start doing 2 predictions per cycle....
>>>> <
>>>>> A 6 instruction fetch, say average 5 bytes each = 30 bytes per fetch.
>>>>> Let's call it 32 bytes. So I$L1 needs to feed fetch an aligned 32-byte
>>>>> fetch block per clock to sustain fetching with no bubbles.
>>>> <
>>>> Consider that the IP Adder creates IP+some-power-of-2 and IP+16+some-
>>>> power-of-2. Now we are in a position to fetch 8 words as::
>>>> a) 0-1
>>>> b) 1-2
>>>> c) 2-3
>>>> d) 3-0 of the next IP
>>>> So, for an additional carry chain, you suffer no gross alignedness problems.
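
A minimal C sketch of the bank pairing described above (the 16B bank
width, the mod-4 wrap into the next line, and all names are assumptions
for illustration):

#include <stdint.h>

typedef struct {
    uint64_t lo_line;  /* 64B line address feeding the low bank  */
    uint64_t hi_line;  /* 64B line address feeding the high bank */
    unsigned lo_bank;  /* 16B bank index 0..3 within lo_line     */
    unsigned hi_bank;  /* 16B bank index 0..3 within hi_line     */
} FetchBanks;

/* Select the two adjacent 16B banks for a 32B fetch starting at ip. */
static FetchBanks select_banks(uint64_t ip)
{
    FetchBanks f;
    unsigned bank = (unsigned)(ip >> 4) & 3;   /* bank holding ip        */
    f.lo_bank = bank;
    f.hi_bank = (bank + 1) & 3;                /* cases a)-c): same line */
    f.lo_line = ip & ~(uint64_t)63;
    /* case d): bank 3 pairs with bank 0 of the next line, which is
       why the IP adder also produces IP+16+some-power-of-2 */
    f.hi_line = (bank == 3) ? f.lo_line + 64 : f.lo_line;
    return f;
}
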
>>>>> We also have to deal with the I$L1 read access latency.
>>>>> Let's say I$L1 has a pipelined latency of 3 and throughput of 1.
>>>> <
>>>> Then you made it too big !! And why should each SRAM in a cache line
>>>> not be accessible independently (that is, throughput = 4 or some higher power
>>>> of 2)?
>>> <
>>> I need to elaborate here::
>>> <
>>> A D$ may have a 3-cycle latency and 1 throughput per bank but this has
>>> logic in it that the I$ does not need--that is the byte alignment logic
>>> that takes the 3rd cycle. I$ can deliver aligned chunks in 2 cycles of
>>> the same metrics.
>> Ok. I thought the I$L1 read pipeline would always be 3 stages
>> as it didn't look like there would be enough slack in stage 3 to do
>> much of anything else (that being instruction alignment and parse).
>> - cache address decode, latch
>> - word line drive, bit line drive, sense amp, latch
>> - tag match, way select mux, drive to prefetch buffer, latch
>>
>> So just following my train of thought, it might as well read a whole
>> cache line into the prefetch buffers for each cache access. Out of the
>> prefetch buffer it reads one or two 32B blocks for consecutive
>> virtual addresses.
>>
>> The 1 or 2 blocks go to the block aligner which uses the FetchRIP low
>> address bits to shift from 0 to 7 instruction words.
>>
>> That result feeds to the Log2 Parser which selects up to 6 instructions
>> from those source bytes.
>>
>> Fetch Line Buf 0 (fully assoc index)
>> Fetch Line Buf 1
>> Fetch Line Buf 2
>> Fetch Line Buf 3
>>       v          v
>>   32B Blk1   32B Blk0
>>       v          v
>>  Alignment Shifter 8:1 muxes
>>            v
>>  Log2 Parser and Branch Detect
>>   v   v   v   v   v   v
>>  I5  I4  I3  I2  I1  I0
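
A rough C model of the aligner-plus-parser step (inst_len_words() is a
hypothetical stand-in for the real length decode, which the thread does
not spell out; instructions are 1 to 3 words):

#include <stdint.h>

/* Hypothetical length decode: pretend 2 bits of the first word give
   the length in words (illustration only, not the real encoding). */
static unsigned inst_len_words(uint32_t w)
{
    unsigned f = (w >> 30) & 3;
    return (f == 0) ? 1 : (f == 1) ? 2 : 3;
}

/* Two consecutive 32B blocks (16 words) in, up to 6 instructions out.
   The low FetchRIP bits give the 0..7 word alignment shift. */
static unsigned parse_blocks(const uint32_t blk[16], uint64_t fetch_rip,
                             uint32_t out[6][3])
{
    unsigned pos = (unsigned)(fetch_rip >> 2) & 7;  /* word offset */
    unsigned n = 0;
    while (n < 6 && pos < 16) {
        unsigned len = inst_len_words(blk[pos]);
        if (pos + len > 16)        /* instruction spills past Blk1:  */
            break;                 /* wait for the next fetch block  */
        for (unsigned w = 0; w < len; w++)
            out[n][w] = blk[pos + w];
        pos += len;
        n++;
    }
    return n;    /* I0..I5 delivered this clock */
}
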
>>> <
>>> The only thing getting in the way, here, of $ size is wire delay. Data/Inst/Tag
>>> SRAMs only need the lower order ~16(-3) bits of address which comes out
>>> of the adder at least 4 gate delays before the final HOB resolves. Thus, the
>>> addresses reach SRAM flip-flops by clock edge. SRAM access takes 1 full
>>> clock. Tag comparison (hit) output multiplexing, byte alignment, and result
>>> bus drive take the 3rd cycle. Over in I$ land, output multiplexing chooses
>>> which data goes into the Inst Buffer, while the IB is supplying instructions
>>> into PARSE/DECODE. While in IB, instructions are scanned for branches and
>>> their targets fetched (this is the 1-wide machine)--in the 6-wide machine we
>>> are getting close to the point where in spaghetti codes all instructions might
>>> come from the ALT cache. Indeed, the Mc 88120 had its packet cache arranged
>>> to supply all instructions and missed instructions flowed through the C$. Here
>>> we used a 4-banked D$ and a 6-ported C$ to deal with all memory ordering
>>> problems and would only deliver LDs that had resolved all dependencies.
>> That was one of the reasons for suggesting a 32B block as the fetch unit.
>> Your My66k instruction words (IWd) are 4 bytes, aligned, and instructions
>> can be 1 to 3 IWd long. If the average is 5 bytes per instruction then a
>> 32B block holds about 6 instructions, which is also about a basic block
>> size and on average would contain 1 branch.
>>
>> So by attaching an optional next-alt-fetch physical address to each
>> 32B block in a 64B cache line, it can specify both the sequential
>> path and the start of the alt fetch path to follow.
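
A data-structure sketch of such a line (field widths and names are
assumptions, not from the thread):

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint8_t  bytes[32];   /* the 32B fetch block itself               */
    uint64_t alt_paddr;   /* optional next-alt-fetch physical address */
    bool     alt_valid;   /* has the alt path been trained yet?       */
} FetchBlock32;

typedef struct {
    uint64_t     tag;
    FetchBlock32 blk[2];  /* two 32B fetch blocks per 64B I$ line     */
} ICacheLine64;
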
>>
>> That allows 1 or 2 Fetch Line Bufs to be loaded with the alt path
>> so when the Log2 Parser sees a branch it can immediately switch to
>> parsing instructions from the alt path on the next clock.
>> If two cache lines from the alt path are loaded, that's 4 prefetch blocks
>> which should be enough to cover the I$ read pipeline latency.
>>
>> The BTB and Return Stack Predictor are also providing virtual address
>> prefetch predictions, so that has to be integrated in this too.
>>> <
>>> BUT DRAM was only 5 cycles away (50ns) so the smallish caches and lack
>>> of associativity did not hurt all that much. This cannot be replicated now
>>> mainly due to wire delay and balancing the several hundred cycles to DRAM
>>> with L2 and L3 caching.
>
> What if only some alt paths are supported? Could alt paths be available only for
> forward branches? How does the number of alt paths supported impact
> performance?

Prefetching alt paths requires some kind of trigger.
Current systems seem to use the BTB as the trigger and, from what I read,
this can make the BTB structures quite large.
For zero-bubble fetching the trigger has to be
set to go off far enough ahead to hide the bubble.
More paths, farther ahead, means larger structures and slower access.
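
A back-of-envelope check with the numbers used in this thread (3-cycle
I$ read latency, one 32B block per clock, ~5 bytes per instruction):

#include <stdio.h>

int main(void)
{
    const unsigned icache_latency = 3;   /* cycles            */
    const unsigned block_bytes    = 32;  /* fetch unit        */
    const unsigned avg_inst_bytes = 5;
    unsigned lead = icache_latency * block_bytes;
    /* the trigger must fire ~96 bytes (~19 instructions, roughly
       3 basic blocks) before the taken branch to hide the bubble */
    printf("trigger lead: %u bytes, ~%u instructions\n",
           lead, lead / avg_inst_bytes);
    return 0;
}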

> How does the alt-path get set in the fetch block? When things are cold the alt path
> may not be set correctly. So, it would need to be updated in the I$; I$ line invalidate
> required then?

Existing Intel and AMD x86 designs have fetch marking instructions
with parser start and stop bits, so there are pathways to back-propagate
information from fetch-parse to I$ in current designs.

This might be done as follows: when fetch reads an I$ [row,set,block] entry,
those coordinates come along with the instruction packet, and we save them in fetch.
That saves going back through the index and tag structures to update the entry later.

Fetch could keep a 4-entry circular buffer with the [row,set,block] of
the 32B blocks it just fetched from. It would just write the physical address
of the current block to the auxiliary fields of the block fetched 4 clocks ago.
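
A sketch of that circular buffer (names and the hand-off to the pending
update buffer are illustrative):

typedef struct { unsigned row, set, blk; } ICoord;

static ICoord   hist[4];  /* coords of the last 4 fetched 32B blocks */
static unsigned head;

static void note_fetch(ICoord c) { hist[head++ & 3] = c; }
static ICoord  four_ago(void)    { return hist[head & 3]; }

/* On a taken branch: queue { four_ago(), current_block_paddr } into
   the pending update buffer; the I$ applies it whenever a read/write
   cycle goes unused. */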

This update would be stored in a pending update buffer in cache
and written back when there was an unused I$ read or write cycle.
(And if the cache data area is also write pipelined then multiple pending
update buffers and coordination logic would be required).

This might imply frequent write updates to the I$.
The problem is finding an unused I$ cycle. If we are fetching both
the sequential and possibly 1 alternate path, that might fully saturate
the I$ access bandwidth. And there are also prefetch writes to load
cache lines into I$L1 from I$L2.
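
One way to picture the contention is a fixed-priority arbiter for the
I$ port (the priority order here is an assumption):

typedef enum { REQ_NONE, REQ_FETCH, REQ_ALT, REQ_FILL, REQ_AUX } Req;

/* Demand fetch and alt-path fetch come first, then I$L2->I$L1 fill
   writes; aux-field updates drain only on otherwise-idle cycles. */
static Req arbitrate(int fetch, int alt, int fill, int aux_pending)
{
    if (fetch)       return REQ_FETCH;
    if (alt)         return REQ_ALT;
    if (fill)        return REQ_FILL;
    if (aux_pending) return REQ_AUX;
    return REQ_NONE;
}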

> I have thought of placing the alt paths for the top four alternate
> paths in a function header rather than for every block in the function. It would
> reduce the I$ storage requirements. Then fetch from all four paths and use a
> branch instruction to select the path. It would mean that only a limited number of
> alt paths are supported but for many small functions a small number may be
> adequate.

The way I'm thinking, fetch buffers would be an expensive resource,
with a fully associative index--really a small I$L0 cache--so the buffers
would be recycled quite quickly and not prefetched too far ahead.
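
A sketch of such an I$L0, assuming 4 entries, line granularity, and
round-robin recycling:

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

typedef struct {
    uint64_t vline;     /* line virtual address, the assoc tag */
    bool     valid;
    uint8_t  data[64];
} LineBuf;

static LineBuf  lb[4];
static unsigned victim;   /* round-robin recycle pointer */

static int lb_lookup(uint64_t vline)   /* entry index, or -1 on miss */
{
    for (int i = 0; i < 4; i++)
        if (lb[i].valid && lb[i].vline == vline) return i;
    return -1;
}

static void lb_fill(uint64_t vline, const uint8_t bytes[64])
{
    LineBuf *b = &lb[victim++ & 3];    /* quick recycling */
    b->vline = vline;
    b->valid = true;
    memcpy(b->data, bytes, 64);
}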
