Re: Misc: Another (possible) way to more MHz...

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Another (possible) way to more MHz...
Date: Sun, 1 Oct 2023 01:42:28 -0500
Organization: A noiseless patient Spider
Lines: 454
Message-ID: <ufb4cm$1ef70$1@dont-email.me>
References: <uf4btl$3pe5m$1@dont-email.me> <6zCRM.67038$fUu6.58754@fx47.iad>
<uf708t$cdu6$1@dont-email.me> <SbFRM.207896$Hih7.154829@fx11.iad>
<uf81c9$lv90$1@dont-email.me> <mHXRM.195304$1B%c.100057@fx09.iad>
<uf9n5q$1278n$1@dont-email.me>
<bc120c1f-9ac5-4cad-9e6a-393cc39deaa3n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 1 Oct 2023 06:42:30 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="511da60fa00ddbce36c90bd20b806bf6";
logging-data="1522912"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19+tvEAT4ZXGVk95KBf3XhO"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.15.1
Cancel-Lock: sha1:Fm6dQvkahrAXx03nm2LVJB4bkog=
In-Reply-To: <bc120c1f-9ac5-4cad-9e6a-393cc39deaa3n@googlegroups.com>
Content-Language: en-US

On 9/30/2023 1:20 PM, MitchAlsup wrote:
> On Saturday, September 30, 2023 at 12:50:55 PM UTC-5, BGB wrote:
>> On 9/30/2023 11:04 AM, EricP wrote:
>>> BGB wrote:
>>>> On 9/29/2023 2:02 PM, EricP wrote:
>>>>> BGB wrote:
>>>>>>>>
>>>>>>>> Any thoughts?...
>>>>>>>
>>>>>>> Its not just the MHz but the IPC you need to think about.
>>>>>>> If you are running at 50 MHz but an actual IPC of 0.1 due to
>>>>>>> stalls and pipeline bubbles then that's really just 5 MIPS.
>>>>>>>
>>>>>>
>>>>>> For stats from a running full simulation (predates these
>>>>>> tweaks; running GLQuake with the HW rasterizer):
>>>>>> ~ 0.48 .. 0.54 bundles/clock;
>>>>>> ~ 1.10 .. 1.40 instructions/bundle.
>>>>>>
>>>>>> Seems to be averaging around 29..32 MIPs at 50MHz (so, ~ 0.604
>>>>>> MIPs/MHz).
>>>>>
>>>>> Oh that's pretty efficient then.
>>>>> In the past you had made comments which made it sound like
>>>>> having tlb, cache, and dram controller all hung off of what
>>>>> you called your "ring bus", which sounded like a token ring,
>>>>> and that the RB consumed many cycles latency.
>>>>> That gave me the impression of frequent, large stalls to cache,
>>>>> lots of bubbles, leading to low IPC.
>>>>>
>>>>
>>>> It does diminish IPC, but not as much as my older bus...
>>>
>>> Oh I thought this was 1-wide but I see elsewhere that it is 3-wide.
>>> That's not that efficient. I was thinking you were getting an IPC
>>> of 0.5 out of ~0.7, the maximum possible with 1 register write port.
>>> A 3-wide should get an IPC > 1.0 but since you only have 1 RF write port
>>> that pretty much bottlenecks you at WB/Retire to < 1.0.
>>>
>> There are 3 write ports to the register file.
>>
>> However, they only see much use when the code actually uses them, which
>> for the most part, my C compiler doesn't. It basically emits normal
>> 1-wide RISC style code, then tries to jostle the instructions around and
>> put them in bundles.
>>
>> Results are pretty mixed, and it only really works if the code is
>> written in certain ways.
>>
>>
>> Ironically, for GLQuake, most of the ASM was in areas that dropped off
>> the map when switching to a hardware rasterizer; so the part of the
>> OpenGL pipeline that remains is mostly all the stuff that was written
>> in C (with a few random bits of ASM thrown in).
>>> I suspect those ring bus induced bubbles are likely killing your IPC.
>>> Fiddling with the internals won't help if the pipeline is mostly empty.
>>>
>> Ringbus latency doesn't matter when there are no L1 misses...
>>> I suggest the primary thing to think about for the future is getting the
>>> pipeline as full as possible. Then consider making it more efficient
>>> internally, adding more write register ports so you can retire > 1.0 IPC
>>> (there is little point in having 3 lanes if you can only retire 1/clock).
>>> Then thirdly start looking at things like forwarding buses.
>>>
>> Well, would be back to a lot more fiddling with my C compiler in this case.
>>
>> As noted, the ISA in question is statically scheduled, so depends mostly
>> on either the compiler or ASM programmer to do the work.
>>>> It seems like, if there were no memory related overheads (if the L1
>>>> always hit), as is it would be in the area of 22% faster.
>>>>
>>>> L1 misses are still not good though, but this is true even on a modern
>>>> desktop PC.
>>>
>>> The cache miss rate may not be the primary bottleneck.
>>> Are you using the ring bus to talk to TLB's, I$L1, D$L1, L2, etc?
>>>
>> L1 caches are mounted directly to the pipeline, and exist in EX1..EX3
>> stages.
>>
>> So:
>> PF IF ID1 ID2 EX1 EX2 EX3 WB
>> Or, alternately:
>> PF IF ID RF EX1 EX2 EX3 WB
>>
>> So, access is like:
>> EX1: Calculate address, send request to L1 cache;
>> EX2: Cache checks hit/miss, extracts data for load, prepare for store.
>> This is the stage where the pipeline stall is signaled on miss.
>> EX3: Data fetched for Load, final cleanup.
>> Final cleanup: Sign-extension, Binary32->Binary64 conversion, etc.
>> Data stored back into L1 arrays here (on next clock edge).
>>> Some questions about your L1 cache:
>>>
>>> In clocks, what is I$L1 D$L1 read and write hit latency,
>>> and the total access latency including ring bus overhead?
>>> And is the D$L1 store pipelined?
>>>
>> Loads and stores are pipelined.
>>
>> TLB doesn't matter yet; L1 caches are virtually indexed and tagged.
>>> Do you use the same basic design for your 2-way assoc. TLB
>>> as the L1 cache, so the same numbers apply?
>>>
>>> And do you pipeline the TLB lookup in one stage, and D$L1 access in a
>>> second?
>>>
>> TLB is a separate component external to the L1 caches, and performs
>> translation on L1 miss.
>>
>> It has a roughly 3 cycle latency.
> <
> So, you take a 2-cycle look at L1 tag and if you are going to get a miss,
> you then take a 3-cycle access of TLB so you can "get on" ring-bus.
> So, AGEN to PA is 5 cycles.
> <

Somewhere in that area.

The L1 I$ and D$ are both on the bus, and requests from the D$ travel
through the I$ to get to the TLB.

The join point between the L1 and L2 rings has an extra 1-cycle delay on
each side to deal with forward/skip handling.

Quick mental check: it is probably in the area of ~ 8 cycles between
when the AGU does its thing, and the first L1 miss request leaves the
CPU core.

Around the 9th cycle, it enters the L2 cache (which has a 5 cycle
latency IIRC, *1), 2 cycles of forward-skips for the response to get
back to the CPU core, then the response enters L1 D$ (and gets absorbed).

So, round trip, probably somewhere in the area of 17 clock cycles
(single core), or a few more if dual-core.
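
(Tallying it up: ~8 cycles to get out of the core, +1 to enter the L2, +5
in the L2 itself, +2 of forward-skips on the way back, and +1 or so for
the response to be absorbed; which lands right around 17.)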

*1: This mostly being to allow for larger block-RAM arrays.

Depending on the message and other specifics, it may also travel through
the part of the ring which deals with the Boot ROM and MMIO interfaces and
similar; but normal RAM requests and responses skip over this part of
the ring.

>> 1: Request comes in, setup for fetch from TLB arrays;
>> 2: Check for TLB hit/miss, raise exception on miss;
>> 3: Replace original request with translated request.
>> Output is on the clock-edge following the 3rd cycle.
>>> I'm suggesting that your primary objective is making that pathway from the
>>> Load Store Unit (LSU) to TLB to D$L1 as simple and efficient as possible.
>>> So a direct 1:1 connect, zero bus overhead and latency, just cache latency.
>>>
>>> Such that ideally it takes 2 pipelined stages for a cache read hit,
>>> and if the D$L1 read hit is 1 clock that the load-to-use
>>> latency is 2 clocks (or at least that is possible), pipelined.
>>>
>>> And that a store is passed to D$L1 in 1 clock,
>>> and then the LSU can continue while the cache deals with it.
>>> The cache bus handshake would go "busy" until the store is complete.
>>> Also ideally store hits would pipeline the tag and data accesses
>>> so back to back store hits take 1 clock (but that's getting fancy).
>>>
>> There is no LSU in this design, or effectively, the L1 cache itself
>> takes on this role.
>>>> I suspect ringbus overhead is diminishing the efficiency of external
>>>> RAM access, as both the L2 memcpy and DRAM stats tend to be lower than
>>>> the "raw" speed of accessing the RAM chip (in the associated unit tests).
>>>
>>> At the start this ring bus might have been a handy idea by
>>> making it easy to experiment with different configurations, but I
>>> think you should be looking at direct connections whenever possible.
>>>
>> Within the core itself, everything is bolted directly to the main pipeline.
>>
>> External to this, everything is on the ringbus.
>>
>> As noted, when there are no cache misses and no MMIO access or similar,
>> the bus isn't really involved.
>>
>>
>> But, yeah, I am left to realize that, say, driving the L2 cache with a
>> FIFO might have been better for performance (rather than just letting
>> requests circle the ring until they can be handled).
> <
> You know that this allows for un-ordered memory accesses--ACK.
> PAs get to memory banks in an order unlike that in which the misses occurred--
> CDC 6600 had these effects and CDC 7600 got rid of them.

The ringbus is kinda chaotic in this sense:
Requests and responses may happen in different orders;
Unpredictable and sometimes large delays in handling requests;
...

The L1 caches generally need to wait for all outstanding requests to
finish before execution can resume, as otherwise, dangling stores to the
L2 cache can come back to bite (in the form of memory coherence issues).
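
Conceptually, the "wait for the ring to drain" rule boils down to something
like the following sketch (signal names here are made up, not lifted from
the actual L1 front-end):

  module store_fence_sketch(
      input        clk,
      input        reset,
      input        store_sent,    // a store request left the L1 this cycle
      input        store_acked,   // an acknowledgement came back this cycle
      output       drain_stall    // hold execution until everything is acked
  );
      reg [3:0] pending;          // assumes at most 15 requests in flight

      always @(posedge clk)
          if (reset)
              pending <= 0;
          else
              pending <= pending + (store_sent  ? 1 : 0)
                                 - (store_acked ? 1 : 0);

      assign drain_stall = (pending != 0);
  endmodule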

Granted, letting the CPU core continue without waiting for stores to be
acknowledged does make things faster, but is generally cut short when
something crashes due to fetching stale memory.

In some cases, unit tests for modules that operate on the ringbus basically
need to use random number generators to decide whether or not to handle
requests, and to loop them around in various ways, etc., to try to mimic
the general behavior of the bus. Something like the sketch below.
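
Say, roughly (a made-up stub, not any of the actual test benches; the point
is just the random accept/defer decision):

  module ring_stub_tb;
      reg        clk = 0;
      reg        req_valid;
      wire       req_accept;
      reg  [1:0] rnd_bits = 0;

      always #5 clk = ~clk;

      // accept roughly 1 request in 4; otherwise the request keeps
      // circling the ring for another lap
      always @(posedge clk)
          rnd_bits <= $random;

      assign req_accept = req_valid && (rnd_bits == 2'b00);

      initial begin
          req_valid = 1'b1;
          repeat (100) @(posedge clk);
          $finish;
      end
  endmodule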

In multi-core cases, there isn't really any cache coherence between the
cores; so absent explicit flushing it is hit or miss.

There is a mechanism for volatile memory access to try to work around
this (basically, the L1 caches will try to flush the cache line as soon
as the request completes). This can allow for some semblance of
coherence, but accessing memory in this way comes at a significant
performance penalty.

> <
>>>> <snip>
>>>>>> Top ranking uses of clock-cycles (for total stall cycles):
>>>>>> L2 Miss: ~ 28% (RAM, L2 needs to access DDR chip)
>>>>>> Misc : ~ 23% (Misc uncategorized stalls)
>>>>>> IL : ~ 20% (Interlock stalls)
>>>>>> L1 I$ : ~ 18% (16K L1 I$, 1)
>>>>>> L1 D$ : ~ 9% (32K L1 D$)
>>>>>>
>>>>>> The IL (or Interlock) penalty is the main one that would be affected
>>>>>> by increasing latency.
>>>>>
>>>>> By "interlock stalls" do you mean register RAW dependency stalls?
>>>>
>>>> Yeah.
>>>>
>>>> Typically:
>>>> If an ALU operation happens, the result can't be used until 2 clock
>>>> cycles later;
>>>> If a Load happens, the result is not available for 3 clock cycles;
>>>> Trying to use the value before then stalls the frontend stages.
>>>
>>> Ok this sounds like you need more forwarding buses.
>>> Ideally this should allow back-to-back dependent operations.
>>>
>> Early on, I did try forwarding ADD results directly from the EX1 stage
>> (or, directly from the adder's combinatorial logic into the register
>> forwarding, which is more combinatorial logic feeding back into the ID2
>> stage).
>>
>> FPGA timing was not so happy with this sort of thing (it is a lot
>> happier when there are clock-edges for everything to settle out on).
>>>>> As distinct from D$L1 read access stall, if read access time > 1 clock
>>>>> or multi-cycle function units like integer divide.
>>>>>
>>>>
>>>> The L1 I$ and L1 D$ have different stats, as shown above.
>>>>
>>>> Things like DIV and FPU related stalls go in the MISC category.
>>>>
>>>> Based on emulator stats (and profiling), I can see that most of the
>>>> MISC overhead in GLQuake is due to FPU ops like FADD and FMUL and
>>>> similar.
>>>>
>>>>
>>>> So, somewhere between 5% and 9% of the total clock-cycles here are
>>>> being spent waiting for the FPU to do its thing.
>>>>
>>>> Except for the "low precision" ops, which are fully pipelined (these
>>>> will not result in any MISC penalty, but may result in an IL penalty
>>>> if the result is used too quickly).
>>>>
>>>>> IIUC you are saying your fpga takes 2 clocks at 50 MHz = 40 ns
>>>>> to do a 64-bit add, so it uses two pipeline stages for ALU.
>>>>>
>>>>
>>>> It takes roughly 1 cycle internally, so:
>>>> ID2 stage: Fetch inputs for the ADD;
>>>> EX1 stage: Do the ADD;
>>>> EX2 stage: Make result visible to the world.
>>>>
>>>> For 1-cycle ops, it would need to forward the result directly from the
>>>> adder-chain logic or similar into the register forwarding logic. I
>>>> discovered fairly early on that for things like 64-bit ADD, this is bad.
>>>
>>> It should not be bad, you just need to sort out the clock edges and
>>> forwarding. In a sense these are feedback loops so they just need
>>> to be self-reinforcing (see below).
>>>
>>>> Most operations which "actually do something" thus sort of end up
>>>> needing a clock-edge for their results to come to rest (causing them
>>>> to effectively have a 2-cycle latency as far as the running program is
>>>> concerned).
>>>
>>> Ok that shouldn't happen. If your ALU is 1 clock latency then
>>> back-to-back execution should be possible with a forwarding bus.
>>>
>>> You are taking ALU result AFTER its stage output buffer and forwarding
>>> that back to register read, rather than taking the ALU result BEFORE
>>> the stage output buffer, and this is introducing an extra clock delay.
>>>
>> But, doing it this way makes FPGA timing constraints significantly
>> happier...
>>> I'm thinking of this organization:
>>>
>>> Decode Logic
>>> |
>>> v
>>> == Decode Stage Buffer ==
>> ID2 Stage.
>>> Immediate Data
>>> |
>>> | |---< Reg File
>>> | |
>>> | | |---------------
>>> v v v |
>>> Operand source mux |
>>> | | |
>>> v v |
>>> == Reg Read Stage Buffer == |
>> EX1 Stage
>>> Operand values |
>>> | | |
>>> v v |
>>> ALU |
>>> |>-------------------
>>> v
>>> == ALU Result Stage Buffer ==
>> Start of EX2 stage, ALU result gets forwarded here...
> <
> Given that you understand the nature of the Adder->result->forwarding->
> Operand as four ¼-cycle units of work. I have seen the flip-flops put in
> 3 of the 4 possible places. The thing is a logic-loop that needs a point
> of clock synchronization. But the flip-flops can be put:: I have seen::
> 1) flop->operand->adder->result->forward->flop
> 2) operand->flop->adder->result->forward->operand
> 3) operand->adder->flop->result->forward->operand
> but nothing rules out::
> 4) operand->adder->result->flop->forward->operand
> <
> And there are other variants when you have latches.......
> <

OK.

As noted, this sort of configuration seems to work OK for timing, even if
it is not the best for instruction latency.

Say, timing prefers:
Inputs come from clocked flip-flops;
Outputs go to clocked flip-flops;
MUX'ing preferentially selects between pairs of flip-flops on clock-edges.
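
As a rough illustration of that preference (widths and names here are made
up, not the actual pipeline): the adder output only becomes visible to the
forwarding logic after it has been latched, trading a cycle of result
latency for a much shorter combinatorial path.

  module ex_stage_sketch(
      input             clk,
      input             stall,
      input      [63:0] id2_a,      // operands selected in ID2
      input      [63:0] id2_b,
      output reg [63:0] ex2_result  // registered result, used for forwarding
  );
      reg [63:0] ex1_a, ex1_b;              // EX1 input flip-flops

      wire [63:0] ex1_sum = ex1_a + ex1_b;  // combinatorial 64-bit add in EX1

      always @(posedge clk)
          if (!stall)
          begin
              ex1_a      <= id2_a;
              ex1_b      <= id2_b;
              ex2_result <= ex1_sum;  // result comes to rest on the edge;
                                      // ID2 forwards from here, hence the
                                      // 2-cycle latency seen by dependent ops
          end
  endmodule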

Comparatively, bit-reordering is cheaper than most other logic.
ADD'er carry chains, on the other hand, are comparatively slow.

OTOH, huge high-fanout signals (such as pipeline stall signals) are also
not ideal, but this is harder to avoid.

Things like register forwarding and output value selection seem to have
a cost in that the signals may need to cross a larger area of the FPGA
(and are thus more subject to "net delay").

I suspect MUX'ing and net delay were the main cost of the CONV ops.

Some previous "faster" logic had been turned into higher latency logic
for sake of loosening up timing, say:
Originally CMPxx+BT/BF or CMPxx+OP?T/OP?F would have no extra penalty
(predication was handled directly in EX1).

But, I ended up moving predication handling partly to the ID2 stage, as
this was better for timing. But this now means that operations which update
need to trigger an interlock with operations that depend on SR.T ...
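
Roughly, the interlock condition amounts to something like (all names made
up; how far back it needs to look depends on where SR.T actually gets
written):

  module srt_interlock_sketch(
      input  id2_uses_srt,   // instruction in ID2 is a BT/BF or OP?T/OP?F form
      input  ex1_sets_srt,   // instruction in EX1 will update SR.T (CMPxx, etc.)
      input  ex2_sets_srt,   // likewise for EX2
      output stall_front     // hold the front-end until SR.T has settled
  );
      assign stall_front = id2_uses_srt && (ex1_sets_srt || ex2_sets_srt);
  endmodule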

But, then this tradeoff made it easier to afford things like 32K L1
caches and similar.

So, along with LUT cost, FPGA timing constraints have been a
long-standing battle (and it is annoyingly difficult trying to achieve
or maintain clock speeds much over 50MHz or so on the Artix-7...).

Sometimes, one may resort to trickery... Say:
For a Luma signal, one need not actually calculate luma "properly".

Proper luma might be, say:
Y=(2*G+R+B)/4

But, another trick might be to interleave the bits:
ggrbgrbg
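
Say, something like (8-bit channels assumed; the exact bit assignment here
is a guess at the "ggrbgrbg" pattern, taking bits MSB-first from each
channel; the interleaved value is not numerically the same, just roughly
luma-ordered):

  module luma_sketch(
      input  [7:0] r, g, b,
      output [7:0] y_full,    // (2*G + R + B) / 4, costs two adds
      output [7:0] y_interl   // interleave trick, pure wiring, no adders
  );
      wire [9:0] sum = {1'b0, g, 1'b0} + {2'b0, r} + {2'b0, b};
      assign y_full   = sum[9:2];

      assign y_interl = { g[7], g[6], r[7], b[7], g[5], r[6], b[6], g[4] };
  endmodule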

One other trick is that one can linearly interpolate between a pair
of values using a 4-bit selector, say:
val = ((sel[3]?A:B)>>1) +
((sel[2]?A:B)>>2) +
((sel[1]?A:B)>>3) +
((sel[0]?A:B)>>3) ;
Which sort of works OK for small values (say, 8 or 12 bits).
Doesn't really scale well to larger values or larger selectors though.

This can also be roughly turned into a 3/8 + 5/8 blend:
val = ((sel[1]?A:B)>>1) +
((sel[0]?A:B)>>2) +
(A>>3) + (B>>3) ;
00: 7/8*B + 1/8*A (~=B)
01: 5/8*B + 3/8*A
10: 3/8*B + 5/8*A
11: 1/8*B + 7/8*A (~=A)

Though, one other option is to use "lookup tables" with a single adder,
which seems to have less latency for 5-bit input components mapped to an
8-bit scaled value, for 3/8 or 5/8.

Say (terse pseudocode):
sel5q8 = sel[1]?A:B;
sel3q8 = sel[0]?A:B;
sc5q8 = scale5q8tab[sel5q8]; //*1
sc3q8 = scale3q8tab[sel3q8];
C = sc5q8 + sc3q8;

*1: Actually a "case()" block in Verilog.
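
Expanded out, the idea looks something like this (my own reconstruction;
in the actual Verilog the two tables are case() blocks, the arithmetic
below just spells out plausible table contents and rounding):

  module blend38_sketch(
      input  [4:0] a, b,
      input  [1:0] sel,
      output [7:0] c
  );
      // 5-bit input scaled to 8 bits (x*255/31), then weighted by 5/8 or 3/8
      function [7:0] scale5q8; input [4:0] x;
          scale5q8 = (x * 255 * 5 + 124) / (31 * 8);
      endfunction
      function [7:0] scale3q8; input [4:0] x;
          scale3q8 = (x * 255 * 3 + 124) / (31 * 8);
      endfunction

      wire [7:0] sc5q8 = scale5q8(sel[1] ? a : b);
      wire [7:0] sc3q8 = scale3q8(sel[0] ? a : b);

      assign c = sc5q8 + sc3q8;   // the single adder on the critical path
  endmodule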

>>> Notice that the clock edge locks the ALU forwarded value into
>>> the RR output stage, unless the RR stage output is stalled in which case
>>> it holds the ALU input stable and therefore also the forwarded result.
>>>
>>> This needs to be done consistently across all stages.
>>> It also needs to forward a load value to RR stage before WB.
>>>
>>> It also should be able to forward a register read value, or ALU result,
>>> or load result, or WB exception address to fetch for branches and jumps,
>>> again with no extra clocks. So for example
>>> JMP reg
>>> can go directly from RR to Fetch and start fetching the next cycle.
>>>
>>> But this forwarding is gold plating, after the pipeline is full.
>>>
>> Yeah.
>>
>> I was experimenting with going the way of "increasing" the effective
>> latency in some cases, trying to loosen timing enough that I could
>> hopefully boost the clock speed.
>>
>>
>> Getting more stuff to flow through the pipeline could be better, but is
>> the tired old path of continuing to beat on my C compiler...
>>
>>
>>>
