comp.arch / Misc: Another (possible) way to more MHz...

Subject / Author
* Misc: Another (possible) way to more MHz...BGB
+* Re: Misc: Another (possible) way to more MHz...robf...@gmail.com
|`- Re: Misc: Another (possible) way to more MHz...BGB
+* Re: Misc: Another (possible) way to more MHz...EricP
|`* Re: Misc: Another (possible) way to more MHz...BGB
| +* Re: Misc: Another (possible) way to more MHz...Timothy McCaffrey
| |`* Re: Misc: Another (possible) way to more MHz...MitchAlsup
| | `- Re: Misc: Another (possible) way to more MHz...BGB
| +* Re: Misc: Another (possible) way to more MHz...EricP
| |`* Re: Misc: Another (possible) way to more MHz...BGB
| | `* Re: Misc: Another (possible) way to more MHz...EricP
| |  `* Re: Misc: Another (possible) way to more MHz...BGB
| |   `* Re: Misc: Another (possible) way to more MHz...MitchAlsup
| |    `* Re: Misc: Another (possible) way to more MHz...BGB
| |     `* Re: Misc: Another (possible) way to more MHz...Kent Dickey
| |      `* Re: Misc: Another (possible) way to more MHz...robf...@gmail.com
| |       `* Re: Misc: Another (possible) way to more MHz...BGB
| |        `* Re: Misc: Another (possible) way to more MHz...Kent Dickey
| |         +* Re: Misc: Another (possible) way to more MHz...BGB
| |         |+* Re: Misc: Another (possible) way to more MHz...Kent Dickey
| |         ||`- Re: Misc: Another (possible) way to more MHz...MitchAlsup
| |         |`- Re: Misc: Another (possible) way to more MHz...MitchAlsup
| |         +- Re: Misc: Another (possible) way to more MHz...MitchAlsup
| |         `* Re: Misc: Another (possible) way to more MHz...EricP
| |          +* Re: Misc: Another (possible) way to more MHz...Kent Dickey
| |          |`- Re: Misc: Another (possible) way to more MHz...BGB
| |          `* Re: Misc: Another (possible) way to more MHz...BGB
| |           `* Re: Misc: Another (possible) way to more MHz...Kent Dickey
| |            `* Re: Misc: Another (possible) way to more MHz...BGB
| |             `- Re: Misc: Another (possible) way to more MHz...robf...@gmail.com
| `* Re: Misc: Another (possible) way to more MHz...MitchAlsup
|  `- Re: Misc: Another (possible) way to more MHz...BGB
+* Re: Misc: Another (possible) way to more MHz...Michael S
|`- Re: Misc: Another (possible) way to more MHz...Michael S
`- Re: Misc: Another (possible) way to more MHz...BGB

Misc: Another (possible) way to more MHz...

<uf4btl$3pe5m$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=34329&group=comp.arch#34329

 by: BGB - Thu, 28 Sep 2023 17:08 UTC

I recently had an idea (one that small-scale testing can evaluate
without redesigning my whole pipeline):
If one delays nearly all of the operations to at least a 2-cycle
latency, then the timing seemingly gets a fair bit better.

In particular, a few 1-cycle units:
  SHAD (bitwise shift and similar)
  CONV (various bit-repacking instructions)
were delayed to 2 cycles:
  SHAD: 2-cycle latency doesn't have much obvious impact;
  CONV: minor impact, which I suspect is due to delaying MOV-2R and
    EXTx.x and similar; I could special-case these in Lane 1.

There was already a slower CONV2 path which dealt with things like FPU
format conversion and other "more complicated" format converters, so
the CONV path had mostly been left for operations that just shuffle
bits around (and the simple-case 2-register MOV instruction, etc).

Note that most ALU ops were already generally 2-cycle as well.
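
(For illustration: a minimal Verilog sketch of what "delaying to 2
cycles" amounts to, here for a shift unit; the module and signal names
are invented, not from the actual core. The unit's logic gets a full
cycle between register boundaries:)

  module shad2cyc(
      input  wire        clk,
      input  wire [63:0] srcVal,   // value to be shifted
      input  wire [5:0]  shAmt,    // shift amount
      output reg  [63:0] dstVal    // result, valid 2 cycles after inputs
  );
      reg [63:0] tSrc;
      reg [5:0]  tSh;
      always @(posedge clk) begin
          // EX1: just capture the operands into stage registers.
          tSrc <= srcVal;
          tSh  <= shAmt;
          // EX2: shift the stage-registered operands (nonblocking
          // assignments, so this sees the previous cycle's tSrc/tSh).
          dstVal <= tSrc << tSh;
      end
  endmodule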

Partly this idea was based on the observation that adding the logic for
a BSWAP.Q instruction to the CONV path had a disproportionate impact on
LUT cost and timing. The actual logic in this case is very simple
(mostly shuffling bytes around), so in theory it should not have had as
big an impact.

Testing this idea, thus far, isn't enough to get the clock boosted to
75MHz, but it did seemingly help, redirecting the "worst failing paths"
from going through the D$->EXn->RF pipeline to going through D$->ID1.

Along with paths from the input to the output side of the instruction
decoder. I might also consider disabling the (mostly unused) RISC-V
decoders and seeing if this helps.

I have also now disabled the LDTEX instruction, as it is "somewhat less
important" if TKRA-GL is mapped through a hardware rasterizer module.

And, thus far, unlike past attempts in this area, this approach doesn't
effectively ruin the performance of the L1 D$.

Seems like one could possibly design a core around this assumption,
avoiding any cases where combinatorial logic feeds into the
register-forwarding path (or, cheaper still, not having any register
forwarding; but giving every op a 3-cycle latency would be a little
steep).
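
(A hypothetical sketch of that assumption, with invented names: if all
units have at least 2-cycle latency, the forwarding muxes only ever tap
registered end-of-stage results, never a unit's combinational output:)

  module fwd_sel(
      input  wire [5:0]  idRs,               // source register number
      input  wire [5:0]  ex2IdRn, ex3IdRn,   // dest regs in EX2/EX3
      input  wire        ex2Valid, ex3Valid, // those results valid?
      input  wire [63:0] ex2Val, ex3Val,     // flop outputs only
      input  wire [63:0] rfVal,              // register-file read port
      output reg  [63:0] rsVal               // operand fed to EX1
  );
      // Forward only from registered (end-of-EX2/EX3) results; anything
      // not yet registered is handled by an interlock stall instead.
      always @* begin
          if      ((idRs == ex2IdRn) && ex2Valid) rsVal = ex2Val;
          else if ((idRs == ex3IdRn) && ex3Valid) rsVal = ex3Val;
          else                                    rsVal = rfVal;
      end
  endmodule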

Though, one possibility could be to disable register forwarding from
Lane 3, in which case only interlocks would be available.
This would work partly as Lane 3 isn't used anywhere near as often as
Lanes 1 or 2.

....

Any thoughts?...

Re: Misc: Another (possible) way to more MHz...

<99d27cfd-fd0b-452f-ba82-cb46e0e744edn@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=34330&group=comp.arch#34330

 by: robf...@gmail.com - Fri, 29 Sep 2023 05:18 UTC

On Thursday, September 28, 2023 at 1:08:10 PM UTC-4, BGB wrote:
> I recently had an idea (one that small-scale testing can evaluate
> without redesigning my whole pipeline):
> If one delays nearly all of the operations to at least a 2-cycle
> latency, then the timing seemingly gets a fair bit better.
>
> [...]
>
> Any thoughts?...

Sounds like super-pipelining. I did this sort of thing for my
PowerPC-compatible core. Each stage was multi-cycle, but it was still an
overlapped pipeline, and it did boost the clock frequency significantly.
Overall performance was not a whole lot better, though. It might be
worthwhile if a high clock frequency is desired; I prefer a lower clock
frequency, since it consumes less power.

Re: Misc: Another (possible) way to more MHz...

<6zCRM.67038$fUu6.58754@fx47.iad>

https://news.novabbs.org/devel/article-flat.php?id=34335&group=comp.arch#34335

 by: EricP - Fri, 29 Sep 2023 16:02 UTC

BGB wrote:
> I recently had an idea (one that small-scale testing can evaluate
> without redesigning my whole pipeline):
> If one delays nearly all of the operations to at least a 2-cycle
> latency, then the timing seemingly gets a fair bit better.
>
> [...]
>
> Any thoughts?...

It's not just the MHz but the IPC you need to think about.
If you are running at 50 MHz but an actual IPC of 0.1 due to
stalls and pipeline bubbles then that's really just 5 MIPS.

I don't know what diagnostic probes you have. I would want to see
what each stage is doing in real time as it single-steps each clock,
see which stage buffers are valid or empty, and where stalls are
originating and propagating.
Essentially the same info a cycle-accurate simulator would show you.

That information can be used to guide where, for example,
a limited budget of forwarding buses or extra 64-bit adders
might best be utilized to eliminate bubbles and increase the IPC.

If whole-pipeline stalls are eating your IPC then maybe it doesn't
need elastic buffers on all stages, but maybe on just one stage
after RR to decouple the MEM stalls from IF-ID-RR stages.
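
(A minimal sketch of such an elastic stage, as a one-entry skid buffer;
this is a generic construct with invented names, not EricP's or BGB's
design. Placed after RR, it absorbs one cycle of MEM-side stall before
back-pressure reaches the IF/ID/RR stages:)

  module skid #(parameter W = 64) (
      input  wire         clk, reset,
      input  wire         up_valid,   // RR has an op ready
      output wire         up_ready,   // stage can accept it
      input  wire [W-1:0] up_data,    // the op/payload itself
      output wire         dn_valid,   // op offered to MEM/EX
      input  wire         dn_ready,   // MEM/EX not stalled
      output wire [W-1:0] dn_data
  );
      reg         full;               // holding a buffered op?
      reg [W-1:0] buf_data;
      assign up_ready = !full;                  // accept unless holding one
      assign dn_valid = full || up_valid;
      assign dn_data  = full ? buf_data : up_data;
      always @(posedge clk) begin
          if (reset)
              full <= 1'b0;
          else if (full && dn_ready)
              full <= 1'b0;                     // drain the buffered op
          else if (!full && up_valid && !dn_ready) begin
              full <= 1'b1;                     // catch the op mid-stall
              buf_data <= up_data;
          end
      end
  endmodule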

Re: Misc: Another (possible) way to more MHz...

<uf708t$cdu6$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=34337&group=comp.arch#34337

 by: BGB - Fri, 29 Sep 2023 17:07 UTC

On 9/29/2023 11:02 AM, EricP wrote:
> BGB wrote:
>> [...]
>
> It's not just the MHz but the IPC you need to think about.
> If you are running at 50 MHz but an actual IPC of 0.1 due to
> stalls and pipeline bubbles then that's really just 5 MIPS.
>

For stats from a running full simulation (predating these tweaks,
running GLQuake with the HW rasterizer):
  ~ 0.48 .. 0.54 bundles/clock;
  ~ 1.10 .. 1.40 instructions/bundle.

Seems to be averaging around 29..32 MIPs at 50MHz (so, ~ 0.604 MIPs/MHz).
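
(Checking the arithmetic: 0.48..0.54 bundles/clock times 1.10..1.40
instructions/bundle gives roughly 0.53..0.76 instructions/clock, and
29..32 MIPs at 50MHz corresponds to 0.58..0.64 instructions/clock,
which falls within that range; 0.604 MIPs/MHz is just the average MIPs
divided by the 50MHz clock.)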

Top-ranking uses of clock cycles (as a share of total stall cycles):
  L2 Miss: ~ 28%  (RAM, L2 needs to access the DDR chip)
  Misc   : ~ 23%  (misc uncategorized stalls)
  IL     : ~ 20%  (interlock stalls)
  L1 I$  : ~ 18%  (16K L1 I$; see note 1 below)
  L1 D$  : ~  9%  (32K L1 D$)

The IL (interlock) penalty is the main one that would be affected by
increasing latency.

In general, the full simulation simulates pretty much all of the
hardware modules via Verilator (displaying the VGA output image, and
handling keyboard input via a PS/2 interface).

1: The bigger D$ was better for Doom and similar, but GLQuake seems to
lean a lot more on the I$. Switching to a HW rasterizer seems to have
increased this imbalance.

At the moment, Doom tends to average slightly higher in terms of MIPs
scores.

As for emulator stats, the main instructions with high interlock
penalties are:
MOV.Q, MOV.L, ADD

MOV.Q and MOV.L seem to be spending around half of their clock-cycles on
interlocks, so around ~2 cycles average.

ADD seems to be spending around 1/3 of its cycles on interlock (the main
ALU ops have had a 2-cycle cost since early on).

The SHxD.x operators were increased to 2 cycles by this change, but are
generally a bit further down the list, so they don't hurt as much.

I did notice an increase in the time reported by the millisecond
counters in the boot-up sequence, but this seems mostly due to the CONV
operations.

This appears most likely due to the "2-register MOV" instruction, which
itself uses around 3% of the total cycle budget (as a 1-cycle op), so it
likely isn't helped by being demoted to 2 cycles.

I may need to add a special case here to allow either "MOV"
specifically, or a subset of CONV ops (such as MOV and sign/zero
extension), to remain 1-cycle ops (EXTS.L and EXTU.L are often used in
place of MOV for "int" and "unsigned int", to make sure values remain
properly extended).
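
(A hypothetical sketch of such a special case; the opcode names and
encodings are invented, not the real decode values:)

  module conv_lat(
      input  wire [5:0] opUCmd,      // decoded micro-op selector
      input  wire       isLane1,     // op was assigned to Lane 1
      output wire [1:0] opLat        // latency in cycles
  );
      localparam UCMD_MOV2R = 6'h01; // plain 2-register MOV   (invented)
      localparam UCMD_EXTSL = 6'h02; // EXTS.L sign-extension  (invented)
      localparam UCMD_EXTUL = 6'h03; // EXTU.L zero-extension  (invented)
      wire isTrivial = (opUCmd == UCMD_MOV2R) ||
                       (opUCmd == UCMD_EXTSL) ||
                       (opUCmd == UCMD_EXTUL);
      // Trivial CONV ops stay 1-cycle, but only in Lane 1; everything
      // else on the CONV path takes the new 2-cycle latency.
      assign opLat = (isTrivial && isLane1) ? 2'd1 : 2'd2;
  endmodule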

> I don't know what diagnostic probes you have. I would want to see
> what each stage is doing in real time as it single-steps each clock,
> see which stage buffers are valid or empty, and where stalls are
> originating and propagating.
> Essentially the same info a cycle-accurate simulator would show you.
>

I have my emulator, which I try to keep cycle-accurate (though, it
hasn't been updated for this tweak yet).

It doesn't display any real-time information, but rather dumps a big
mess of stats at the end.

> That information can be used to guide where, for example,
> a limited budget of forwarding buses or extra 64-bit adders
> might best be utilized to eliminate bubbles and increase the IPC.
>
> If whole-pipeline stalls are eating your IPC then maybe it doesn't
> need elastic buffers on all stages, but maybe on just one stage
> after RR to decouple the MEM stalls from IF-ID-RR stages.
>

I had considered possibly redesigning the pipeline at one point to allow
a different mechanism for handling L1 misses: namely, marking registers
for missed loads as "not ready" and then stalling the Fetch/Decode
stages if the bundle would depend on a not-ready register (injecting the
fetched data back into the pipeline once the load completes).
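
(A minimal sketch of the "not ready" bookkeeping, with invented names
and one ready bit per architectural register; an illustration of the
idea, not the actual pipeline:)

  module ready_bits(
      input  wire       clk,
      input  wire       ldMiss,     // a load just missed the L1
      input  wire [5:0] ldRn,       // its destination register
      input  wire       ldFill,     // memory response has arrived
      input  wire [5:0] fillRn,     // register being filled
      input  wire [5:0] idRs, idRt, // source regs of the next bundle
      output wire       stallFe     // stall Fetch/Decode?
  );
      reg [63:0] regReady;          // one ready bit per register
      initial regReady = {64{1'b1}};
      always @(posedge clk) begin
          if (ldMiss) regReady[ldRn]   <= 1'b0;  // result now pending
          if (ldFill) regReady[fillRn] <= 1'b1;  // value injected back
      end
      // Stall only if the incoming bundle actually reads a pending reg.
      assign stallFe = !regReady[idRs] || !regReady[idRt];
  endmodule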

However, redesigning the main pipeline would not be a small task.

However, as-is, my fiddling still falls well short of being able to
boost the clock speed to 75MHz.

Though, it looks like 75 MHz would be able to boost GLQuake mostly into
double-digit territory (assuming I can do so without wrecking the L1
caches or similar... which was always the problem in the past).

If the L1 caches are dropped to 2K or so, timing gets easier, but these
have a significantly higher L1 miss rate and then most of the clock
cycles end up going into dealing with L1 misses.

....

Re: Misc: Another (possible) way to more MHz...

<b9675661-624b-4acf-93b0-5e44a55fa95fn@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=34338&group=comp.arch#34338

 by: Timothy McCaffrey - Fri, 29 Sep 2023 18:02 UTC

On Friday, September 29, 2023 at 1:07:47 PM UTC-4, BGB wrote:

> I had considered possibly redesigning the pipeline at one point to allow
> a different mechanism for handling L1 misses: namely, marking registers
> for missed loads as "not ready" and then stalling the Fetch/Decode
> stages if the bundle would depend on a not-ready register (injecting the
> fetched data back into the pipeline once the load completes).
>

I think you just re-invented the CDC 6600 register scoreboard.
- Tim

Re: Misc: Another (possible) way to more MHz...

<SbFRM.207896$Hih7.154829@fx11.iad>

https://news.novabbs.org/devel/article-flat.php?id=34339&group=comp.arch#34339

 by: EricP - Fri, 29 Sep 2023 19:02 UTC

BGB wrote:
> On 9/29/2023 11:02 AM, EricP wrote:
>> BGB wrote:
>>> [...]
>>
>> It's not just the MHz but the IPC you need to think about.
>> If you are running at 50 MHz but an actual IPC of 0.1 due to
>> stalls and pipeline bubbles then that's really just 5 MIPS.
>>
>
> For stats from a running full simulation (predating these tweaks,
> running GLQuake with the HW rasterizer):
>   ~ 0.48 .. 0.54 bundles/clock;
>   ~ 1.10 .. 1.40 instructions/bundle.
>
> Seems to be averaging around 29..32 MIPs at 50MHz (so, ~ 0.604 MIPs/MHz).

Oh, that's pretty efficient then.
In the past you had made comments which made it sound like the TLB,
cache, and DRAM controller all hung off of what you called your "ring
bus", which sounded like a token ring, and that the RB consumed many
cycles of latency.
That gave me the impression of frequent, large stalls to cache,
and lots of bubbles, leading to low IPC.

> Top-ranking uses of clock cycles (as a share of total stall cycles):
>   L2 Miss: ~ 28%  (RAM, L2 needs to access the DDR chip)
>   Misc   : ~ 23%  (misc uncategorized stalls)
>   IL     : ~ 20%  (interlock stalls)
>   L1 I$  : ~ 18%  (16K L1 I$)
>   L1 D$  : ~  9%  (32K L1 D$)
>
> The IL (interlock) penalty is the main one that would be affected by
> increasing latency.

By "interlock stalls" do you mean register RAW dependency stalls?
As distinct from D$L1 read access stall, if read access time > 1 clock
or multi-cycle function units like integer divide.

> [...]
>
> As for emulator stats, the main instructions with high interlock
> penalties are:
> MOV.Q, MOV.L, ADD
>
> MOV.Q and MOV.L seem to be spending around half of their clock-cycles on
> interlocks, so around ~2 cycles average.

I assume these are memory moves, aka LD/ST.
In this context does interlock mean a source register RAW dependency stall?

> ADD seems to be spending around 1/3 of its cycles on interlock (the main
> ALU ops have had a 2-cycle cost since early on).

IIUC you are saying your FPGA takes 2 clocks at 50 MHz = 40 ns
to do a 64-bit add, so it uses two pipeline stages for the ALU.

And by interlock you mean that if an instruction following in the next
2 slots reads that same dest register, then it must stall at RR for 1
or 2 clocks (assuming you have a forwarding bus from ALU result to RR;
otherwise more).

So 1/3 of ADD instructions have this dependency.
Is that correct?

Re: Misc: Another (possible) way to more MHz...

<7cfb10ca-e386-4d20-b9a9-a4bc1034cf5cn@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=34340&group=comp.arch#34340

 by: MitchAlsup - Fri, 29 Sep 2023 23:53 UTC

On Friday, September 29, 2023 at 1:03:02 PM UTC-5, Timothy McCaffrey wrote:
> On Friday, September 29, 2023 at 1:07:47 PM UTC-4, BGB wrote:
>
> > I had considered possibly redesigning the pipeline at one point to allow
> > a different mechanism for handling L1 misses: namely, marking registers
> > for missed loads as "not ready" and then stalling the Fetch/Decode
> > stages if the bundle would depend on a not-ready register (injecting the
> > fetched data back into the pipeline once the load completes).
> >
> I think you just re-invented the CDC 6600 register scoreboard.
> - Tim
<
Not quite:: the CDC 6600 scoreboard scheduled the beginning of instruction
execution and also the end of instruction execution {{in contrast, Tomasulo
only schedules the beginning of instruction execution}}
<
There are "all sorts of" mechanisms that provide RAW interlocking.

Re: Misc: Another (possible) way to more MHz...

<0f60c2f2-f44d-408b-806b-609aba926f03n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=34341&group=comp.arch#34341

 by: MitchAlsup - Fri, 29 Sep 2023 23:58 UTC

On Friday, September 29, 2023 at 12:07:47 PM UTC-5, BGB wrote:
> On 9/29/2023 11:02 AM, EricP wrote:
> >
> >
> For stats from a running full simulation (predating these tweaks,
> running GLQuake with the HW rasterizer):
>   ~ 0.48 .. 0.54 bundles/clock;
>   ~ 1.10 .. 1.40 instructions/bundle.
<
So, about equal to the 1-wide 1st generation RISC machines, which got
0.7 I/C {including cache misses, delay slots, interlocks, TLB misses.}
>
> Seems to be averaging around 29..32 MIPs at 50MHz (so, ~ 0.604 MIPs/MHz).
>
Probably good for a 1-wide, not so good for a 3-wide.
> [...]

Re: Misc: Another (possible) way to more MHz...

<uf81c9$lv90$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=34348&group=comp.arch#34348

 by: BGB - Sat, 30 Sep 2023 02:32 UTC

On 9/29/2023 2:02 PM, EricP wrote:
> BGB wrote:
>> On 9/29/2023 11:02 AM, EricP wrote:
>>> BGB wrote:
>>>> [...]
>>>
>>> It's not just the MHz but the IPC you need to think about.
>>> If you are running at 50 MHz but an actual IPC of 0.1 due to
>>> stalls and pipeline bubbles then that's really just 5 MIPS.
>>
>> For stats from a running full simulation (predating these tweaks,
>> running GLQuake with the HW rasterizer):
>>   ~ 0.48 .. 0.54 bundles/clock;
>>   ~ 1.10 .. 1.40 instructions/bundle.
>>
>> Seems to be averaging around 29..32 MIPs at 50MHz (so, ~ 0.604 MIPs/MHz).
>
> Oh, that's pretty efficient then.
> In the past you had made comments which made it sound like the TLB,
> cache, and DRAM controller all hung off of what you called your "ring
> bus", which sounded like a token ring, and that the RB consumed many
> cycles of latency.
> That gave me the impression of frequent, large stalls to cache,
> and lots of bubbles, leading to low IPC.
>

It does diminish IPC, but not as much as my older bus...

It seems like, if there were no memory-related overheads (if the L1
always hit), it would be in the area of 22% faster.

L1 misses are still not good though, but this is true even on a modern
desktop PC.

I suspect the ringbus is diminishing the efficiency of external RAM
access, as both the L2 memcpy and DRAM stats tend to be lower than the
"raw" speed of accessing the RAM chip (in the associated unit tests).

However, if L1 cache sizes are reduced, performance tends to go in the
toilet.

As-is, performance is "not awful", and I am frequently getting upwards
of 20 fps in Doom (where Doom seems to max out at 35 fps).

Earlier on, somehow, I had thought Doom would also get 20-30 fps on a
386, but I have since realized that this was not the case... (hence why
Doom has a feature to make the screen size smaller).

By most metrics, it seems like I am outperforming a 386; some stats
seem to imply that I am getting performance on par with early PowerPC
systems (though I don't have any way to test this).

Did see a video not too long ago where someone ran Doom on a "NeXTcube",
and it was pretty obvious that it was running at single-digit
framerates. So, it seems like I am not doing too badly.

In my childhood, a 486DX2-66 had no trouble running Doom, so I had sort
of thought this was the general experience (though I do remember that
trying to run Quake on the thing "sucked pretty bad"...).

But, in contrast, the original PlayStation had a 33 MHz MIPS R3000 CPU
and seemingly had no real difficulty running fully 3D games like
"MegaMan Legends" and similar.

Granted, I don't know what gamedev practices were like on the
PlayStation, like whether it was C or ASM, or whether people accessed
the hardware directly or went through APIs along similar lines to
Windows GDI, OpenGL, or DirectX...

As noted, in my case, I was going for an API design partly inspired by
the Windows GDI and also supporting OpenGL. Though, the design differs
slightly in that the practice is to render into a GL context, read the
raster image from the GL context, and then use the "GDI" calls to copy
it over into the screen "window" (by passing a "BITMAPINFOHEADER"
pointer along with a pointer to the buffer holding the raster image).

It is possible things could be faster if one could draw directly into
screen memory, though this is foiled somewhat by my not using a
raster-oriented framebuffer for the video memory.

A possible eventual TODO would be to add some raster-oriented display
modes. Maybe also a call along similar lines to "SwapBuffers"; but this
would require a mechanism to change the location of the screen's
framebuffer in memory (probably a better option for performance than
memcpy'ing the screen contents or similar).

I imagine the original PlayStation probably did not have these issues...

Though, I did recently add support for multitexturing, which does make
lightmap rendering a little more practical (lightmaps can be used with
less performance impact).

However, dynamic lighting effects don't work so well: every time GLQuake
tries redrawing and re-uploading the dynamic lightmaps, performance
tanks (but setting "r_dynamic" to 0 leaves things lacking in animated
lighting effects).

I guess one possible option could be to modify it to render 2 versions
of each lightmap:
  one with dynamic light sources on;
  one with dynamic light sources off;
then weight and average the applicable "lightstyle" values, and use a
threshold to select which of the two lightmap textures to use.

This would at least deal with light styles like "flourospark", though
not so much the gentle strobe or flicker effects, alas.

Then again, as-is I will probably stick with vertex lighting, as this is
faster.

Also, using the "glBegin/glEnd/glVertex" rendering strategy has some
amount of overhead (vertex arrays are faster; but the parts of the
engine that were modified to use vertex arrays don't currently support
the lightmap modes).

There is still some overhead in that the engine walks the BSP and
rebuilds the vertex arrays each frame (I had considered modifying it to
cache the vertex arrays, and only rebuild them when the player moves
into a different PVS, but hadn't gotten around to experimenting with
this).


[remainder of article truncated]
Re: Misc: Another (possible) way to more MHz...

<uf82sj$m6ta$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=34349&group=comp.arch#34349

 by: BGB - Sat, 30 Sep 2023 02:58 UTC

On 9/29/2023 6:58 PM, MitchAlsup wrote:
> On Friday, September 29, 2023 at 12:07:47 PM UTC-5, BGB wrote:
>> On 9/29/2023 11:02 AM, EricP wrote:
>>>
>>>
>> For stats from a running full simulation (predating these tweaks,
>> running GLQuake with the HW rasterizer):
>>   ~ 0.48 .. 0.54 bundles/clock;
>>   ~ 1.10 .. 1.40 instructions/bundle.
> <
> So, about equal to the 1-wide 1st generation RISC machines, which got
> 0.7 I/C {including cache misses, delay slots, interlocks, TLB misses.}
>>
>> Seems to be averaging around 29..32 MIPs at 50MHz (so, ~ 0.604 MIPs/MHz).
>>
> Probably good for a 1-wide, not so good for a 3-wide.

Plain 1-wide operation is still a little worse here...

As noted, in practice, it is only averaging around 1.1 to 1.4
instructions per bundle, which (in general) means it is mostly running
1-wide code with the occasional WEX'ed instruction glued on.

It is generally easier to make effective use of 2- or 3-wide operation
in ASM, but not so much from my C compiler output (unless the C code is
written in a way that allows the compiler to exploit it).

The 3rd lane is infrequently used in practice, so it mostly just exists
as an excuse to have a 6R3W register file (useful mainly for 128-bit
SIMD ops and similar).

For a contemplated "GPU Mode" profile, I had considered dropping to
2-lane with a 6R2W register file, but the savings from dropping the 3rd
lane were fairly small (as-is, the 3rd lane only really does CONV and
ALU ops, and optionally integer shift ops as well). Had also considered
making this profile XG2-only.

Couldn't get it cheap enough to be worthwhile, and had since
(eventually) ended up writing a fixed-function rasterizer module instead
(which gets better performance and only needs around 1/6 as many LUTs as
a CPU core).

Granted, a fixed-function module can't run fragment shaders (vertex
shaders could in theory still work, as that part already runs on the
CPU; maybe also running fragment shaders on the CPU, if I ever get
around to a GLSL compiler).

....

Re: Misc: Another (possible) way to more MHz...

<uf8424$md2b$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=34350&group=comp.arch#34350

 by: BGB - Sat, 30 Sep 2023 03:18 UTC

On 9/29/2023 6:53 PM, MitchAlsup wrote:
> On Friday, September 29, 2023 at 1:03:02 PM UTC-5, Timothy McCaffrey wrote:
>> On Friday, September 29, 2023 at 1:07:47 PM UTC-4, BGB wrote:
>>
>>> I had considered possibly redesigning the pipeline at one point to allow
>>> a different mechanism for handling L1 misses: namely, marking registers
>>> for missed loads as "not ready" and then stalling the Fetch/Decode
>>> stages if the bundle would depend on a not-ready register (injecting the
>>> fetched data back into the pipeline once the load completes).
>>>
>> I think you just re-invented the CDC 6600 register scoreboard.
>> - Tim
> <
> Not quite:: the CDC 6600 scoreboard scheduled the beginning of instruction
> execution and also the end of instruction execution {{in contrast, Tomasulo
> only schedules the beginning of instruction execution}}
> <
> There are "all sorts of" mechanisms that provide RAW interlocking.

As-is, it is a blob of logic something like (pseudocode):
  needStallRs =
    ((id2IdRs == exA1IdRn) && exA1Held) ||
    ((id2IdRs == exB1IdRn) && exB1Held) ||
    ((id2IdRs == exC1IdRn) && exC1Held) ||
    ((id2IdRs == exA2IdRn) && exA2Held) ||
    ((id2IdRs == exB2IdRn) && exB2Held) ||
    ((id2IdRs == exC2IdRn) && exC2Held) ||
    ((id2IdRs == exA3IdRn) && exA3Held) ||
    ((id2IdRs == exB3IdRn) && exB3Held) ||
    ((id2IdRs == exC3IdRn) && exC3Held) ;
  needStallRt =
    ((id2IdRt == exA1IdRn) && exA1Held) ||
    ...
  ...
  if(id2IdRs == REG_ID_ZZR)
    needStallRs = 0;
  if(id2IdRt == REG_ID_ZZR)
    needStallRt = 0;
  ...

  needStallInterlock = needStallRs || needStallRt || ...

The interlock stall mechanism causes the PF/IF/ID1/ID2 stages to stall,
with NOPs being forwarded into EX1.

The ready bit would likely be handled by internally expanding the
register fields, and then using one of the bits to signal whether the
value is ready to be used or is still waiting for an associated
operation to finish.

But, with ready flagging (and a mechanism to inject the values later),
it could be possible in theory to eliminate the need to stall the EX
stages (and possibly to "hide" some of the L1 misses).
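
(A hypothetical fragment of that expanded-field encoding, in the same
pseudocode style as above; the width and bit position are invented:)

  // Internally widen each register ID with a "ready" flag; EX stages
  // can pass a not-ready source along instead of stalling, with the L1
  // fill later injecting the value against the matching ID.
  idRsX     = { rsReady, idRs[5:0] };   // bit 6 = value is ready
  rsPending = !idRsX[6];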

....

Re: Misc: Another (possible) way to more MHz...

<uf846s$md2b$2@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=34351&group=comp.arch#34351

 by: BGB - Sat, 30 Sep 2023 03:20 UTC

On 9/29/2023 12:18 AM, robf...@gmail.com wrote:
> On Thursday, September 28, 2023 at 1:08:10 PM UTC-4, BGB wrote:
>> I recently had an idea (that small scale testing doesn't require
>> redesigning my whole pipeline):
>> If one delays nearly all of the operations to at least a 2-cycle
>> latency, then seemingly the timing gets a fair bit better.
>>
>> In particular, a few 1-cycle units:
>> SHAD (bitwise shift and similar)
>> CONV (various bit-repacking instructions)
>> Were delayed to 2 cycle:
>> SHAD:
>> 2 cycle latency doesn't have much obvious impact;
>> CONV: Minor impact
>> I suspect due to delaying MOV-2R and EXTx.x and similar.
>> I could special-case these in Lane 1.
>>
>>
>> There was already a slower CONV2 path which had mostly dealt with things
>> like FPU format conversion and other "more complicated" format
>> converters, so the CONV path had mostly been left for operations that
>> mostly involved shuffling the bits around (and the simple case
>> 2-register MOV instruction and similar, etc).
>>
>> Note that most ALU ops were already generally 2-cycle as well.
>>
>>
>> Partly this idea was based on the observation that adding the logic for
>> a BSWAP.Q instruction to the CONV path had a disproportionate impact on
>> LUT cost and timing. The actual logic in this case is very simple
>> (mostly shuffling the bytes around), so theoretically should not have
>> had as big of an impact.
>>
>>
>> Testing this idea, thus far, isn't enough to get the clock boosted to
>> 75MHz, but did seemingly help here, and has seemingly redirected the
>> "worst failing paths" from being through the D$->EXn->RF pipeline, over
>> to being D$->ID1.
>>
>> Along with paths from the input to the output side of the instruction
>> decoder. Might also consider disabling the (mostly not used for much)
>> RISC-V decoders, and see if this can help.
>>
>> Had also now disabled the LDTEX instruction, now as it is "somewhat less
>> important" if TKRA-GL is mapped through a hardware rasterizer module.
>>
>>
>> And, thus far, unlike past attempts in this area, this approach doesn't
>> effectively ruin the performance of the L1 D$.
>>
>>
>> Seems like one could possibly try to design a core around this
>> assumption, avoiding any cases where combinatorial logic feeds into the
>> register-forwarding path (or, cheaper still, not have any register
>> forwarding; but giving every op a 3-cycle latency would be a little steep).
>>
>> Though, one possibility could be to disable register forwarding from
>> Lane 3, in which case only interlocks would be available.
>> This would work partly as Lane 3 isn't used anywhere near as often as
>> Lanes 1 or 2.
>>
>> ...
>>
>>
>>
>> Any thoughts?...
>
> Sounds like super-pipelining. I did this sort of thing for my PowerPC compatible
> core. Each stage was multi-cycle, but it was still an overlapped pipeline. And it
> did boost the clock frequency significantly. Overall performance was not a
> whole lot better though. It might be good if it is desired to use a high clock
> frequency. I prefer to use a lower clock frequency; it consumes less power.

Looked stuff up:
I didn't add any more pipeline stages here; rather, the results of some
previously 1-cycle operations are now handled in the EX2 stage instead
(effectively turning them into instructions with a 2-cycle latency).

It is still partial (things like constant loads and similar are still
1-cycle ops).
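
Roughly, the change amounts to something like this (a sketch; names are
made up):

  // Before: SHAD/CONV results were forwardable out of EX1.
  // After: the EX1 result rides one more stage register and is only
  // exposed from EX2, so the combinatorial path gets a full cycle.
  always @(posedge clk) begin
      ex2ShadResult <= ex1ShadResult;   // settle on the EX1/EX2 edge
      ex2ShadValid  <= ex1ShadValid;
  end
  // Forwarding and writeback now tap ex2ShadResult instead of the
  // EX1 combinatorial output (hence the 2-cycle visible latency).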

Re: Misc: Another (possible) way to more MHz...

<mHXRM.195304$1B%c.100057@fx09.iad>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=34361&group=comp.arch#34361

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!2.eu.feeder.erje.net!feeder.erje.net!feeder1.feed.usenet.farm!feed.usenet.farm!peer01.ams4!peer.am4.highwinds-media.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx09.iad.POSTED!not-for-mail
From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Misc: Another (possible) way to more MHz...
References: <uf4btl$3pe5m$1@dont-email.me> <6zCRM.67038$fUu6.58754@fx47.iad> <uf708t$cdu6$1@dont-email.me> <SbFRM.207896$Hih7.154829@fx11.iad> <uf81c9$lv90$1@dont-email.me>
In-Reply-To: <uf81c9$lv90$1@dont-email.me>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 199
Message-ID: <mHXRM.195304$1B%c.100057@fx09.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Sat, 30 Sep 2023 16:05:38 UTC
Date: Sat, 30 Sep 2023 12:04:38 -0400
X-Received-Bytes: 8895
 by: EricP - Sat, 30 Sep 2023 16:04 UTC

BGB wrote:
> On 9/29/2023 2:02 PM, EricP wrote:
>> BGB wrote:
>>>>>
>>>>> Any thoughts?...
>>>>
>>>> Its not just the MHz but the IPC you need to think about.
>>>> If you are running at 50 MHz but an actual IPC of 0.1 due to
>>>> stalls and pipeline bubbles then that's really just 5 MIPS.
>>>>
>>>
>>> For running stats from a running full simulation (predates to these
>>> tweaks, running GLQuake with the HW rasterizer):
>>> ~ 0.48 .. 0.54 bundles clock;
>>> ~ 1.10 .. 1.40 instructions/bundle.
>>>
>>> Seems to be averaging around 29..32 MIPs at 50MHz (so, ~ 0.604
>>> MIPs/MHz).
>>
>> Oh that's pretty efficient then.
>> In the past you had made comments which made it sound like
>> having tlb, cache, and dram controller all hung off of what
>> you called your "ring bus", which sounded like a token ring,
>> and that the RB consumed many cycles latency.
>> That gave me the impression of frequent, large stalls to cache,
>> lots of bubbles, leading to low IPC.
>>
>
> It does diminish IPC, but not as much as my older bus...

Oh I thought this was 1-wide but I see elsewhere that it is 3-wide.
That's not that efficient. I was thinking you were getting an IPC
of 0.5 out of ~0.7, the maximum possible with 1 register write port.
A 3-wide should get an IPC > 1.0 but since you only have 1 RF write port
that pretty much bottlenecks you at WB/Retire to < 1.0.

I suspect those ring bus induced bubbles are likely killing your IPC.
Fiddling with the internals won't help if the pipeline is mostly empty.

I suggest the primary thing to think about for the future is getting the
pipeline as full as possible. Then consider making it more efficient
internally, adding more write register ports so you can retire > 1.0 IPC
(there is little point in having 3 lanes if you can only retire 1/clock).
Then, thirdly, start looking at things like forwarding buses.

> It seems like, if there were no memory related overheads (if the L1
> always hit), as is it would be in the area of 22% faster.
>
> L1 misses are still not good though, but this is true even on a modern
> desktop PC.

The cache miss rate may not be the primary bottleneck.
Are you using the ring bus to talk to the TLBs, I$L1, D$L1, L2, etc?

Some questions about your L1 cache:

In clocks, what are the I$L1 and D$L1 read and write hit latencies,
and the total access latency including ring bus overhead?
And is the D$L1 store pipelined?

Do you use the same basic design for your 2-way assoc. TLB
as the L1 cache, so the same numbers apply?

And do you pipeline the TLB lookup in one stage, and D$L1 access in a second?

I'm suggesting that your primary objective is making that pathway from the
Load Store Unit (LSU) to TLB to D$L1 as simple and efficient as possible.
So a direct 1:1 connect, zero bus overhead and latency, just cache latency.

Such that, ideally, a cache read hit takes 2 pipelined stages; and if
the D$L1 read hit is 1 clock, the load-to-use latency is 2 clocks
(or at least that is possible), pipelined.

And that a store is passed to D$L1 in 1 clock,
and then the LSU can continue while the cache deals with it.
The cache bus handshake would go "busy" until the store is complete.
Also ideally store hits would pipeline the tag and data accesses
so back to back store hits take 1 clock (but that's getting fancy).
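
A minimal sketch of that handshake (hypothetical signal names, LSU side
only):

  // Hand a store to the D$ in one clock, then continue; only stall
  // when a new access shows up while the cache is still busy.
  assign dcReqValid = storeValid && !dcBusy;  // accepted this cycle
  assign dcReqWrite = storeValid;
  assign dcReqAddr  = storeAddr;
  assign dcReqData  = storeData;

  assign lsuStall = storeValid && dcBusy;  // "busy" until store retires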

> I suspect ringbus efficiency is diminishing the efficiency of external
> RAM access, as both the L2 memcpy and DRAM stats tend to be lower than
> the "raw" speed of accessing the RAM chip (in the associated unit tests).

At the start this ring bus might have been a handy idea, making it easy
to experiment with different configurations, but I think you should be
looking at direct connections wherever possible.

> <snip>
>>> Top ranking uses of clock-cycles (for total stall cycles):
>>> L2 Miss: ~ 28% (RAM, L2 needs to access DDR chip)
>>> Misc : ~ 23% (Misc uncategorized stalls)
>>> IL : ~ 20% (Interlock stalls)
>>> L1 I$ : ~ 18% (16K L1 I$, 1)
>>> L1 D$ : ~ 9% (32K L1 D$)
>>>
>>> The IL (or Interlock) penalty is the main one that would be affected
>>> by increasing latency.
>>
>> By "interlock stalls" do you mean register RAW dependency stalls?
>
> Yeah.
>
> Typically:
> If an ALU operation happens, the result can't be used until 2 clock
> cycles later;
> If a Load happens, the result is not available for 3 clock cycles;
> Trying to use the value before then stalls the frontend stages.

Ok this sounds like you need more forwarding buses.
Ideally this should allow back-to-back dependent operations.

>> As distinct from D$L1 read access stall, if read access time > 1 clock
>> or multi-cycle function units like integer divide.
>>
>
> The L1 I$ and L1 D$ have different stats, as shown above.
>
> Things like DIV and FPU related stalls go in the MISC category.
>
> Based on emulator stats (and profiling), I can see that most of the MISC
> overhead in GLQuake is due to FPU ops like FADD and FMUL and similar.
>
>
> So, somewhere between 5% and 9% of the total clock-cycles here are being
> spent waiting for the FPU to do its thing.
>
> Except for the "low precision" ops, which are fully pipelined (these
> will not result in any MISC penalty, but may result in an IL penalty if
> the result is used too quickly).
>
>> IIUC you are saying your fpga takes 2 clocks at 50 MHz = 40 ns
>> to do a 64-bit add, so it uses two pipeline stages for ALU.
>>
>
> It takes roughly 1 cycle internally, so:
> ID2 stage: Fetch inputs for the ADD;
> EX1 stage: Do the ADD;
> EX2 stage: Make result visible to the world.
>
> For 1-cycle ops, it would need to forward the result directly from the
> adder-chain logic or similar into the register forwarding logic. I
> discovered fairly early on that for things like 64-bit ADD, this is bad.

It should not be bad; you just need to sort out the clock edges and
forwarding. In a sense these are feedback loops, so they just need
to be self-reinforcing (see below).

> Most operations which "actually do something" thus sort of end up
> needing a clock-edge for their results to come to rest (causing them to
> effectively have a 2-cycle latency as far as the running program is
> concerned).

Ok that shouldn't happen. If your ALU is 1 clock latency then
back-to-back execution should be possible with a forwarding bus.

You are taking the ALU result AFTER its stage output buffer and forwarding
that back to register read, rather than taking the ALU result BEFORE
the stage output buffer, and this introduces an extra clock delay.

I'm thinking of this organization:

      Decode Logic
           |
           v
  == Decode Stage Buffer ==
      Immediate Data
           |
           |   |---< Reg File
           |   |
           |   |   |---------------
           v   v   v              |
       Operand source mux         |
           |   |                  |
           v   v                  |
  == Reg Read Stage Buffer ==     |
       Operand values             |
           |   |                  |
           v   v                  |
          ALU                     |
           |>---------------------
           v
  == ALU Result Stage Buffer ==

Notice that the clock edge locks the ALU forwarded value into
the RR output stage, unless the RR stage output is stalled in which case
it holds the ALU input stable and therefore also the forwarded result.

This needs to be done consistently across all stages.
It also needs to forward a load value to RR stage before WB.

It also should be able to forward a register read value, or ALU result,
or load result, or WB exception address to fetch for branches and jumps,
again with no extra clocks. So for example
JMP reg
can go directly from RR to Fetch and start fetching the next cycle.

But this forwarding is gold plating, after the pipeline is full.
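
A sketch of that arrangement (made-up names; the operand mux sees the
ALU's combinatorial output, and the RR stage buffer's clock edge is the
one synchronization point):

  reg  [63:0] opA, opB, exResultBuf;
  wire [63:0] aluResult = opA + opB;          // combinatorial EX output

  // Operand mux at register read: forwarding beats the register file.
  wire        fwdRs   = exValid && exWrEn && (rsId == exDstId);
  wire [63:0] rsValue = fwdRs ? aluResult : rfRsValue;

  always @(posedge clk) begin
      if (!stallRR) begin
          opA <= rsValue;   // RR stage buffer locks the forwarded value
          opB <= rtValue;   // (rtValue muxed the same way as rsValue)
      end
      // On a stall, opA/opB hold, so aluResult (and thus the forwarded
      // value) stays stable: the self-reinforcing loop described above.
      exResultBuf <= aluResult;  // ALU result stage buffer
  end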

Re: Misc: Another (possible) way to more MHz...

<uf9n5q$1278n$1@dont-email.me>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=34365&group=comp.arch#34365

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Another (possible) way to more MHz...
Date: Sat, 30 Sep 2023 12:50:47 -0500
Organization: A noiseless patient Spider
Lines: 313
Message-ID: <uf9n5q$1278n$1@dont-email.me>
References: <uf4btl$3pe5m$1@dont-email.me> <6zCRM.67038$fUu6.58754@fx47.iad>
<uf708t$cdu6$1@dont-email.me> <SbFRM.207896$Hih7.154829@fx11.iad>
<uf81c9$lv90$1@dont-email.me> <mHXRM.195304$1B%c.100057@fx09.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 30 Sep 2023 17:50:50 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="ec90d748e4e3f1906128a03e290989ec";
logging-data="1121559"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/yyO+fbaVyi9fgOqWNVftT"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.15.1
Cancel-Lock: sha1:+nWb+KyJcSW8ChAuWSSJ+eNMrQk=
Content-Language: en-US
In-Reply-To: <mHXRM.195304$1B%c.100057@fx09.iad>
 by: BGB - Sat, 30 Sep 2023 17:50 UTC

On 9/30/2023 11:04 AM, EricP wrote:
> BGB wrote:
>> On 9/29/2023 2:02 PM, EricP wrote:
>>> BGB wrote:
>>>>>>
>>>>>> Any thoughts?...
>>>>>
>>>>> Its not just the MHz but the IPC you need to think about.
>>>>> If you are running at 50 MHz but an actual IPC of 0.1 due to
>>>>> stalls and pipeline bubbles then that's really just 5 MIPS.
>>>>>
>>>>
>>>> For running stats from a running full simulation (predates to these
>>>> tweaks, running GLQuake with the HW rasterizer):
>>>>   ~ 0.48 .. 0.54 bundles clock;
>>>>   ~ 1.10 .. 1.40 instructions/bundle.
>>>>
>>>> Seems to be averaging around 29..32 MIPs at 50MHz (so, ~ 0.604
>>>> MIPs/MHz).
>>>
>>> Oh that's pretty efficient then.
>>> In the past you had made comments which made it sound like
>>> having tlb, cache, and dram controller all hung off of what
>>> you called your "ring bus", which sounded like a token ring,
>>> and that the RB consumed many cycles latency.
>>> That gave me the impression of frequent, large stalls to cache,
>>> lots of bubbles, leading to low IPC.
>>>
>>
>> It does diminish IPC, but not as much as my older bus...
>
> Oh I thought this was 1-wide but I see elsewhere that it is 3-wide.
> That's not that efficient. I was thinking you were getting an IPC
> of 0.5 out ~0.7, the maximum possible with 1 register write port.
> A 3-wide should get an IPC > 1.0 but since you only have 1 RF write port
> that pretty much bottlenecks you at WB/Retire to < 1.0.
>

There are 3 write ports to the register file.

However, they only see much use when the code actually uses them, which,
for the most part, my C compiler doesn't. It basically emits normal
1-wide RISC style code, then tries to jostle the instructions around and
put them in bundles.

Results are pretty mixed, and it only really works if the code is
written in certain ways.

Ironically, for GLQuake, most of the ASM was in areas that dropped off
the map when switching to a hardware rasterizer; so the part of the
OpenGL pipeline that remains is mostly all the stuff that was written
in C (with a few random bits of ASM thrown in).

> I suspect those ring bus induced bubbles are likely killing your IPC.
> Fiddling with the internals won't help if the pipeline is mostly empty.
>

Ringbus latency doesn't matter when there are no L1 misses...

> I suggest the primary thing to think about for the future is getting the
> pipeline as full as possible. Then consider making it more efficient
> internally, adding more write register ports so you can retire > 1.0 IPC
> (there is little point in having 3 lanes if you can only retire 1/clock).
> Then thirdly start look at things like forwarding buses.
>

Well, would be back to a lot more fiddling with my C compiler in this case.

As noted, the ISA in question is statically scheduled, so depends mostly
on either the compiler or ASM programmer to do the work.

>> It seems like, if there were no memory related overheads (if the L1
>> always hit), as is it would be in the area of 22% faster.
>>
>> L1 misses are still not good though, but this is true even on a modern
>> desktop PC.
>
> The cache miss rate may not be the primary bottleneck.
> Are you using the ring bus to talk to TLB's, I$L1, D$L1, L2, etc?
>

L1 caches are mounted directly to the pipeline, and exist in EX1..EX3
stages.

So:
PF IF ID1 ID2 EX1 EX2 EX3 WB
Or, alternately:
PF IF ID RF EX1 EX2 EX3 WB

So, access is like:
EX1: Calculate address, send request to L1 cache;
EX2: Cache checks hit/miss, extracts data for load, prepare for store.
This is the stage where the pipeline stall is signaled on miss.
EX3: Data fetched for Load, final cleanup.
Final cleanup: Sign-extension, Binary32->Binary64 conversion, etc.
Data stored back into L1 arrays here (on next clock edge).
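
Sketched out, with illustrative names (a loose rendering of the above,
not the actual Verilog):

  // EX1: address generation, request presented to the L1 D$.
  always @(posedge clk) begin
      dcReqAddr  <= aguBase + aguDisp;
      dcReqValid <= isMemOp;
  end

  // EX2: tag check; this is where a miss raises the pipeline stall.
  wire dcHit     = dcTagMatch && dcLineValid;
  wire dcStallEx = dcReqValid && !dcHit;

  // EX3: final cleanup, e.g. sign-extending a 32-bit load;
  // store data goes back into the L1 arrays on the next clock edge.
  always @(posedge clk)
      ldResult <= ldSigned ? {{32{dcData[31]}}, dcData[31:0]}
                           : {32'h0, dcData[31:0]};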

> Some questions about your L1 cache:
>
> In clocks, what is I$L1 D$L1 read and write hit latency,
> and the total access latency including ring bus overhead?
> And is the D$L1 store pipelined?
>

Loads and stores are pipelined.

The TLB doesn't matter yet; the L1 caches are virtually indexed and tagged.

> Do you use the same basic design for your 2-way assoc. TLB
> as the L1 cache, so the same numbers apply?
>
> And do you pipeline the TLB lookup in one stage, and D$L1 access in a
> second?
>

TLB is a separate component external to the L1 caches, and performs
translation on L1 miss.

It has a roughly 3 cycle latency.
1: Request comes in, setup for fetch from TLB arrays;
2: Check for TLB hit/miss, raise exception on miss;
3: Replace original request with translated request.
Output is on the clock-edge following the 3rd cycle.
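
Roughly (made-up names; the index/tag split and a 2-way arrangement are
chosen only for illustration):

  // Cycle 1: latch the missed request, set up the TLB array fetch.
  always @(posedge clk) begin
      tlbIx  <= reqVA[17:6];
      vaHold <= reqVA;
      vHold  <= reqValid;
  end

  // Cycle 2: hit/miss check on both ways; exception on a miss.
  wire hitA   = vHold && (tlbTagA[tlbIx] == vaHold[47:18]);
  wire hitB   = vHold && (tlbTagB[tlbIx] == vaHold[47:18]);
  wire tlbExc = vHold && !(hitA || hitB);

  // Cycle 3: swap the translated address into the original request;
  // the output appears on the following clock edge, as noted above.
  always @(posedge clk)
      reqPA <= hitA ? {tlbPaA[tlbIx], vaHold[17:0]}
                    : {tlbPaB[tlbIx], vaHold[17:0]};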

> I'm suggesting that your primary objective is making that pathway from the
> Load Store Unit (LSU) to TLB to D$L1 as simple and efficient as possible.
> So a direct 1:1 connect, zero bus overhead and latency, just cache latency.
>
> Such that ideally it takes 2 pipelined stages for a cache read hit,
> and if the D$L1 read hit is 1 clock that the load-to-use
> latency is 2 clocks (or at least that is possible), pipelined.
>
> And that a store is passed to D$L1 in 1 clock,
> and then the LSU can continue while the cache deals with it.
> The cache bus handshake would go "busy" until the store is complete.
> Also ideally store hits would pipeline the tag and data accesses
> so back to back store hits take 1 clock (but that's getting fancy).
>

There is no LSU in this design, or effectively, the L1 cache itself
takes on this role.

>> I suspect ringbus efficiency is diminishing the efficiency of external
>> RAM access, as both the L2 memcpy and DRAM stats tend to be lower than
>> the "raw" speed of accessing the RAM chip (in the associated unit tests).
>
> At the start this ring bus might have been a handy idea by
> making it easy to experiment with different configurations, but I
> think you should be looking at direct connections whenever possible.
>

Within the core itself, everything is bolted directly to the main pipeline.

External to this, everything is on the ringbus.

As noted, when there are no cache misses and no MMIO access or similar,
the bus isn't really involved.

But, yeah, I am left to realize that, say, driving the L2 cache with a
FIFO might have been better for performance (rather than just letting
requests circle the ring until they can be handled).

>> <snip>
>>>> Top ranking uses of clock-cycles (for total stall cycles):
>>>>   L2 Miss: ~ 28%  (RAM, L2 needs to access DDR chip)
>>>>   Misc   : ~ 23%  (Misc uncategorized stalls)
>>>>   IL     : ~ 20%  (Interlock stalls)
>>>>   L1 I$  : ~ 18%  (16K L1 I$, 1)
>>>>   L1 D$  : ~  9%  (32K L1 D$)
>>>>
> >>>> The IL (or Interlock) penalty is the main one that would be affected
>>>> by increasing latency.
>>>
>>> By "interlock stalls" do you mean register RAW dependency stalls?
>>
>> Yeah.
>>
>> Typically:
>> If an ALU operation happens, the result can't be used until 2 clock
>> cycles later;
>> If a Load happens, the result is not available for 3 clock cycles;
>> Trying to use the value before then stalls the frontend stages.
>
> Ok this sounds like you need more forwarding buses.
> Ideally this should allow back-to-back dependent operations.
>

Early on, I did try forwarding ADD results directly from the EX1 stage
(or, directly from the adder's combinatorial logic into the register
forwarding, which is more combinatorial logic feeding back into the ID2
stage).

FPGA timing was not so happy with this sort of thing (it is a lot
happier when there are clock-edges for everything to settle out on).

>>> As distinct from D$L1 read access stall, if read access time > 1 clock
>>> or multi-cycle function units like integer divide.
>>>
>>
>> The L1 I$ and L1 D$ have different stats, as shown above.
>>
>> Things like DIV and FPU related stalls go in the MISC category.
>>
>> Based on emulator stats (and profiling), I can see that most of the
>> MISC overhead in GLQuake is due to FPU ops like FADD and FMUL and
>> similar.
>>
>>
>> So, somewhere between 5% and 9% of the total clock-cycles here are
>> being spent waiting for the FPU to do its thing.
>>
>> Except for the "low precision" ops, which are fully pipelined (these
>> will not result in any MISC penalty, but may result in an IL penalty
>> if the result is used too quickly).
>>
>>> IIUC you are saying your fpga takes 2 clocks at 50 MHz = 40 ns
>>> to do a 64-bit add, so it uses two pipeline stages for ALU.
>>>
>>
>> It takes roughly 1 cycle internally, so:
>>   ID2 stage: Fetch inputs for the ADD;
>>   EX1 stage: Do the ADD;
>>   EX2 stage: Make result visible to the world.
>>
>> For 1-cycle ops, it would need to forward the result directly from the
>> adder-chain logic or similar into the register forwarding logic. I
>> discovered fairly early on that for things like 64-bit ADD, this is bad.
>
> It should not be bad, you just need to sort out the clock edges and
> forwarding. In a sense these are feedback loops so they just need
> to be self re-enforcing (see below).
>
>> Most operations which "actually do something" thus sort of end up
>> needing a clock-edge for their results to come to rest (causing them
>> to effectively have a 2-cycle latency as far as the running program is
>> concerned).
>
> Ok that shouldn't happen. If your ALU is 1 clock latency then
> back-to-back execution should be possible with a forwarding bus.
>
> You are taking ALU result AFTER its stage output buffer and forwarding
> that back to register read, rather than taking the ALU result BEFORE
> the stage output buffer, and this is introducing an extra clock delay.
>


Re: Misc: Another (possible) way to more MHz...

<bc120c1f-9ac5-4cad-9e6a-393cc39deaa3n@googlegroups.com>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=34366&group=comp.arch#34366

  copy link   Newsgroups: comp.arch
X-Received: by 2002:a05:620a:8e89:b0:76c:c5bf:6af5 with SMTP id rf9-20020a05620a8e8900b0076cc5bf6af5mr91407qkn.14.1696098007502;
Sat, 30 Sep 2023 11:20:07 -0700 (PDT)
X-Received: by 2002:a05:6871:6a8a:b0:1d1:37f4:aeff with SMTP id
zf10-20020a0568716a8a00b001d137f4aeffmr2679902oab.9.1696098007248; Sat, 30
Sep 2023 11:20:07 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!border-2.nntp.ord.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 30 Sep 2023 11:20:06 -0700 (PDT)
In-Reply-To: <uf9n5q$1278n$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:d819:6090:1710:645;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:d819:6090:1710:645
References: <uf4btl$3pe5m$1@dont-email.me> <6zCRM.67038$fUu6.58754@fx47.iad>
<uf708t$cdu6$1@dont-email.me> <SbFRM.207896$Hih7.154829@fx11.iad>
<uf81c9$lv90$1@dont-email.me> <mHXRM.195304$1B%c.100057@fx09.iad> <uf9n5q$1278n$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <bc120c1f-9ac5-4cad-9e6a-393cc39deaa3n@googlegroups.com>
Subject: Re: Misc: Another (possible) way to more MHz...
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Sat, 30 Sep 2023 18:20:07 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 344
 by: MitchAlsup - Sat, 30 Sep 2023 18:20 UTC

On Saturday, September 30, 2023 at 12:50:55 PM UTC-5, BGB wrote:
> On 9/30/2023 11:04 AM, EricP wrote:
> > BGB wrote:
> >> On 9/29/2023 2:02 PM, EricP wrote:
> >>> BGB wrote:
> >>>>>>
> >>>>>> Any thoughts?...
> >>>>>
> >>>>> Its not just the MHz but the IPC you need to think about.
> >>>>> If you are running at 50 MHz but an actual IPC of 0.1 due to
> >>>>> stalls and pipeline bubbles then that's really just 5 MIPS.
> >>>>>
> >>>>
> >>>> For running stats from a running full simulation (predates to these
> >>>> tweaks, running GLQuake with the HW rasterizer):
> >>>> ~ 0.48 .. 0.54 bundles clock;
> >>>> ~ 1.10 .. 1.40 instructions/bundle.
> >>>>
> >>>> Seems to be averaging around 29..32 MIPs at 50MHz (so, ~ 0.604
> >>>> MIPs/MHz).
> >>>
> >>> Oh that's pretty efficient then.
> >>> In the past you had made comments which made it sound like
> >>> having tlb, cache, and dram controller all hung off of what
> >>> you called your "ring bus", which sounded like a token ring,
> >>> and that the RB consumed many cycles latency.
> >>> That gave me the impression of frequent, large stalls to cache,
> >>> lots of bubbles, leading to low IPC.
> >>>
> >>
> >> It does diminish IPC, but not as much as my older bus...
> >
> > Oh I thought this was 1-wide but I see elsewhere that it is 3-wide.
> > That's not that efficient. I was thinking you were getting an IPC
> > of 0.5 out ~0.7, the maximum possible with 1 register write port.
> > A 3-wide should get an IPC > 1.0 but since you only have 1 RF write port
> > that pretty much bottlenecks you at WB/Retire to < 1.0.
> >
> There are 3 write ports to the register file.
>
> However, they only see much use when the code actually uses them, which
> for the most part, my C compiler doesn't. It basically emits normal
> 1-wide RISC style code, then tries to jostle the instructions around and
> put them in bundles.
>
> Results are pretty mixed, and it only really works if the code is
> written in certain ways.
>
>
> Ironically, for GLQuake, most of the ASM was in areas that dropped off
> the map when switching to a hardware rasterizer; so the part of the
> OpenGL pipeline that remains is mostly all the stuff that was written
> in C (with a few random bits of ASM thrown in).
> > I suspect those ring bus induced bubbles are likely killing your IPC.
> > Fiddling with the internals won't help if the pipeline is mostly empty.
> >
> Ringbus latency doesn't matter when there are no L1 misses...
> > I suggest the primary thing to think about for the future is getting the
> > pipeline as full as possible. Then consider making it more efficient
> > internally, adding more write register ports so you can retire > 1.0 IPC
> > (there is little point in having 3 lanes if you can only retire 1/clock).
> > Then thirdly start look at things like forwarding buses.
> >
> Well, would be back to a lot more fiddling with my C compiler in this case.
>
> As noted, the ISA in question is statically scheduled, so depends mostly
> on either the compiler or ASM programmer to do the work.
> >> It seems like, if there were no memory related overheads (if the L1
> >> always hit), as is it would be in the area of 22% faster.
> >>
> >> L1 misses are still not good though, but this is true even on a modern
> >> desktop PC.
> >
> > The cache miss rate may not be the primary bottleneck.
> > Are you using the ring bus to talk to TLB's, I$L1, D$L1, L2, etc?
> >
> L1 caches are mounted directly to the pipeline, and exist in EX1..EX3
> stages.
>
> So:
> PF IF ID1 ID2 EX1 EX2 EX3 WB
> Or, alternately:
> PF IF ID RF EX1 EX2 EX3 WB
>
> So, access is like:
> EX1: Calculate address, send request to L1 cache;
> EX2: Cache checks hit/miss, extracts data for load, prepare for store.
> This is the stage where the pipeline stall is signaled on miss.
> EX3: Data fetched for Load, final cleanup.
> Final cleanup: Sign-extension, Binary32->Binary64 conversion, etc.
> Data stored back into L1 arrays here (on next clock edge).
> > Some questions about your L1 cache:
> >
> > In clocks, what is I$L1 D$L1 read and write hit latency,
> > and the total access latency including ring bus overhead?
> > And is the D$L1 store pipelined?
> >
> Loads and stores are pipelined.
>
> TLB doesn't matter yet, L1 caches are virtually indexed and tagged.
> > Do you use the same basic design for your 2-way assoc. TLB
> > as the L1 cache, so the same numbers apply?
> >
> > And do you pipeline the TLB lookup in one stage, and D$L1 access in a
> > second?
> >
> TLB is a separate component external to the L1 caches, and performs
> translation on L1 miss.
>
> It has a roughly 3 cycle latency.
<
So, you take a 2-cycle look at L1 tag and if you are going to get a miss,
you then take a 3-cycle access of TLB so you can "get on" ring-bus.
So, AGEN to PA is 5 cycles.
<
> 1: Request comes in, setup for fetch from TLB arrays;
> 2: Check for TLB hit/miss, raise exception on miss;
> 3: Replace original request with translated request.
> Output is on the clock-edge following the 3rd cycle.
> > I'm suggesting that your primary objective is making that pathway from the
> > Load Store Unit (LSU) to TLB to D$L1 as simple and efficient as possible.
> > So a direct 1:1 connect, zero bus overhead and latency, just cache latency.
> >
> > Such that ideally it takes 2 pipelined stages for a cache read hit,
> > and if the D$L1 read hit is 1 clock that the load-to-use
> > latency is 2 clocks (or at least that is possible), pipelined.
> >
> > And that a store is passed to D$L1 in 1 clock,
> > and then the LSU can continue while the cache deals with it.
> > The cache bus handshake would go "busy" until the store is complete.
> > Also ideally store hits would pipeline the tag and data accesses
> > so back to back store hits take 1 clock (but that's getting fancy).
> >
> There is no LSU in this design, or effectively, the L1 cache itself
> takes on this role.
> >> I suspect ringbus efficiency is diminishing the efficiency of external
> >> RAM access, as both the L2 memcpy and DRAM stats tend to be lower than
> >> the "raw" speed of accessing the RAM chip (in the associated unit tests).
> >
> > At the start this ring bus might have been a handy idea by
> > making it easy to experiment with different configurations, but I
> > think you should be looking at direct connections whenever possible.
> >
> Within the core itself, everything is bolted directly to the main pipeline.
>
> External to this, everything is on the ringbus.
>
> As noted, when there are no cache misses and no MMIO access or similar,
> the bus isn't really involved.
>
>
> But, yeah, I am left to realize that, say, driving the L2 cache with a
> FIFO might have been better for performance (rather than just letting
> requests circle the ring until they can be handled).
<
You know that this allows for un-ordered memory accesses--ACK.
PAs get to the memory banks in an order unlike that in which the misses
occurred--the CDC 6600 had these effects and the CDC 7600 got rid of them.
<
> >> <snip>
> >>>> Top ranking uses of clock-cycles (for total stall cycles):
> >>>> L2 Miss: ~ 28% (RAM, L2 needs to access DDR chip)
> >>>> Misc : ~ 23% (Misc uncategorized stalls)
> >>>> IL : ~ 20% (Interlock stalls)
> >>>> L1 I$ : ~ 18% (16K L1 I$, 1)
> >>>> L1 D$ : ~ 9% (32K L1 D$)
> >>>>
> >>>> The IL (or Interlock) penalty is the main one that would be affected
> >>>> by increasing latency.
> >>>
> >>> By "interlock stalls" do you mean register RAW dependency stalls?
> >>
> >> Yeah.
> >>
> >> Typically:
> >> If an ALU operation happens, the result can't be used until 2 clock
> >> cycles later;
> >> If a Load happens, the result is not available for 3 clock cycles;
> >> Trying to use the value before then stalls the frontend stages.
> >
> > Ok this sounds like you need more forwarding buses.
> > Ideally this should allow back-to-back dependent operations.
> >
> Early on, I did try forwarding ADD results directly from the EX1 stage
> (or, directly from the adder's combinatorial logic into the register
> forwarding, which is more combinatorial logic feeding back into the ID2
> stage).
>
> FPGA timing was not so happy with this sort of thing (it is a lot
> happier when there are clock-edges for everything to settle out on).
> >>> As distinct from D$L1 read access stall, if read access time > 1 clock
> >>> or multi-cycle function units like integer divide.
> >>>
> >>
> >> The L1 I$ and L1 D$ have different stats, as shown above.
> >>
> >> Things like DIV and FPU related stalls go in the MISC category.
> >>
> >> Based on emulator stats (and profiling), I can see that most of the
> >> MISC overhead in GLQuake is due to FPU ops like FADD and FMUL and
> >> similar.
> >>
> >>
> >> So, somewhere between 5% and 9% of the total clock-cycles here are
> >> being spent waiting for the FPU to do its thing.
> >>
> >> Except for the "low precision" ops, which are fully pipelined (these
> >> will not result in any MISC penalty, but may result in an IL penalty
> >> if the result is used too quickly).
> >>
> >>> IIUC you are saying your fpga takes 2 clocks at 50 MHz = 40 ns
> >>> to do a 64-bit add, so it uses two pipeline stages for ALU.
> >>>
> >>
> >> It takes roughly 1 cycle internally, so:
> >> ID2 stage: Fetch inputs for the ADD;
> >> EX1 stage: Do the ADD;
> >> EX2 stage: Make result visible to the world.
> >>
> >> For 1-cycle ops, it would need to forward the result directly from the
> >> adder-chain logic or similar into the register forwarding logic. I
> >> discovered fairly early on that for things like 64-bit ADD, this is bad.
> >
> > It should not be bad, you just need to sort out the clock edges and
> > forwarding. In a sense these are feedback loops so they just need
> > to be self re-enforcing (see below).
> >
> >> Most operations which "actually do something" thus sort of end up
> >> needing a clock-edge for their results to come to rest (causing them
> >> to effectively have a 2-cycle latency as far as the running program is
> >> concerned).
> >
> > Ok that shouldn't happen. If your ALU is 1 clock latency then
> > back-to-back execution should be possible with a forwarding bus.
> >
> > You are taking ALU result AFTER its stage output buffer and forwarding
> > that back to register read, rather than taking the ALU result BEFORE
> > the stage output buffer, and this is introducing an extra clock delay.
> >
> But, doing it this way makes FPGA timing constraints significantly
> happier...
> > I'm thinking of this organization:
> >
> >       Decode Logic
> >            |
> >            v
> >   == Decode Stage Buffer ==        => ID2 Stage.
> >       Immediate Data
> >            |
> >            |   |---< Reg File
> >            |   |
> >            |   |   |---------------
> >            v   v   v              |
> >        Operand source mux         |
> >            |   |                  |
> >            v   v                  |
> >   == Reg Read Stage Buffer ==     |
> EX1 Stage
> >        Operand values             |
> >            |   |                  |
> >            v   v                  |
> >           ALU                     |
> >            |>---------------------
> >            v
> >   == ALU Result Stage Buffer ==    => Start of EX2 stage, ALU result gets forwarded here...
<
Given that you understand the nature of the Adder->result->forwarding->
operand loop as four ¼-cycle units of work: the thing is a logic-loop
that needs a point of clock synchronization, and I have seen the
flip-flops put in 3 of the 4 possible places::
1) flop->operand->adder->result->forward->flop
2) operand->flop->adder->result->forward->operand
3) operand->adder->flop->adder->result->forward
but nothing rules out::
4) operand->adder->result->flop->forward->operand
<
And there are other variants when you have latches.......
<
> > > Notice that the clock edge locks the ALU forwarded value into
> > the RR output stage, unless the RR stage output is stalled in which case
> > it holds the ALU input stable and therefore also the forwarded result.
> >
> > This needs to be done consistently across all stages.
> > It also needs to forward a load value to RR stage before WB.
> >
> > It also should be able to forward a register read value, or ALU result,
> > or load result, or WB exception address to fetch for branches and jumps,
> > again with no extra clocks. So for example
> > JMP reg
> > can go directly from RR to Fetch and start fetching the next cycle.
> >
> > But this forwarding is gold plating, after the pipeline is full.
> >
> Yeah.
>
> I was experimenting with going the way of "increasing" the effective
> latency in some cases, trying to loosen timing enough that I could
> hopefully boost the clock speed.
>
>
> Getting more stuff to flow through the pipeline could be better, but is
> the tired old path of continuing to beat on my C compiler...
>
>
> >


Re: Misc: Another (possible) way to more MHz...

<ufb4cm$1ef70$1@dont-email.me>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=34374&group=comp.arch#34374

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Another (possible) way to more MHz...
Date: Sun, 1 Oct 2023 01:42:28 -0500
Organization: A noiseless patient Spider
Lines: 454
Message-ID: <ufb4cm$1ef70$1@dont-email.me>
References: <uf4btl$3pe5m$1@dont-email.me> <6zCRM.67038$fUu6.58754@fx47.iad>
<uf708t$cdu6$1@dont-email.me> <SbFRM.207896$Hih7.154829@fx11.iad>
<uf81c9$lv90$1@dont-email.me> <mHXRM.195304$1B%c.100057@fx09.iad>
<uf9n5q$1278n$1@dont-email.me>
<bc120c1f-9ac5-4cad-9e6a-393cc39deaa3n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 1 Oct 2023 06:42:30 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="511da60fa00ddbce36c90bd20b806bf6";
logging-data="1522912"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19+tvEAT4ZXGVk95KBf3XhO"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.15.1
Cancel-Lock: sha1:Fm6dQvkahrAXx03nm2LVJB4bkog=
In-Reply-To: <bc120c1f-9ac5-4cad-9e6a-393cc39deaa3n@googlegroups.com>
Content-Language: en-US
 by: BGB - Sun, 1 Oct 2023 06:42 UTC

On 9/30/2023 1:20 PM, MitchAlsup wrote:
> On Saturday, September 30, 2023 at 12:50:55 PM UTC-5, BGB wrote:
>> On 9/30/2023 11:04 AM, EricP wrote:
>>> BGB wrote:
>>>> On 9/29/2023 2:02 PM, EricP wrote:
>>>>> BGB wrote:
>>>>>>>>
>>>>>>>> Any thoughts?...
>>>>>>>
>>>>>>> Its not just the MHz but the IPC you need to think about.
>>>>>>> If you are running at 50 MHz but an actual IPC of 0.1 due to
>>>>>>> stalls and pipeline bubbles then that's really just 5 MIPS.
>>>>>>>
>>>>>>
>>>>>> For running stats from a running full simulation (predates to these
>>>>>> tweaks, running GLQuake with the HW rasterizer):
>>>>>> ~ 0.48 .. 0.54 bundles clock;
>>>>>> ~ 1.10 .. 1.40 instructions/bundle.
>>>>>>
>>>>>> Seems to be averaging around 29..32 MIPs at 50MHz (so, ~ 0.604
>>>>>> MIPs/MHz).
>>>>>
>>>>> Oh that's pretty efficient then.
>>>>> In the past you had made comments which made it sound like
>>>>> having tlb, cache, and dram controller all hung off of what
>>>>> you called your "ring bus", which sounded like a token ring,
>>>>> and that the RB consumed many cycles latency.
>>>>> That gave me the impression of frequent, large stalls to cache,
>>>>> lots of bubbles, leading to low IPC.
>>>>>
>>>>
>>>> It does diminish IPC, but not as much as my older bus...
>>>
>>> Oh I thought this was 1-wide but I see elsewhere that it is 3-wide.
>>> That's not that efficient. I was thinking you were getting an IPC
>>> of 0.5 out ~0.7, the maximum possible with 1 register write port.
>>> A 3-wide should get an IPC > 1.0 but since you only have 1 RF write port
>>> that pretty much bottlenecks you at WB/Retire to < 1.0.
>>>
>> There are 3 write ports to the register file.
>>
>> However, they only see much use when the code actually uses them, which
>> for the most part, my C compiler doesn't. It basically emits normal
>> 1-wide RISC style code, then tries to jostle the instructions around and
>> put them in bundles.
>>
>> Results are pretty mixed, and it only really works if the code is
>> written in certain ways.
>>
>>
>> Ironically, for GLQuake, most of the ASM was in areas that dropped off
>> the map when switching to a hardware rasterizer; so the part of the
>> OpenGL pipeline that remains is mostly all the stuff that was written
>> in C (with a few random bits of ASM thrown in).
>>> I suspect those ring bus induced bubbles are likely killing your IPC.
>>> Fiddling with the internals won't help if the pipeline is mostly empty.
>>>
>> Ringbus latency doesn't matter when there are no L1 misses...
>>> I suggest the primary thing to think about for the future is getting the
>>> pipeline as full as possible. Then consider making it more efficient
>>> internally, adding more write register ports so you can retire > 1.0 IPC
>>> (there is little point in having 3 lanes if you can only retire 1/clock).
>>> Then thirdly start look at things like forwarding buses.
>>>
>> Well, would be back to a lot more fiddling with my C compiler in this case.
>>
>> As noted, the ISA in question is statically scheduled, so depends mostly
>> on either the compiler or ASM programmer to do the work.
>>>> It seems like, if there were no memory related overheads (if the L1
>>>> always hit), as is it would be in the area of 22% faster.
>>>>
>>>> L1 misses are still not good though, but this is true even on a modern
>>>> desktop PC.
>>>
>>> The cache miss rate may not be the primary bottleneck.
>>> Are you using the ring bus to talk to TLB's, I$L1, D$L1, L2, etc?
>>>
>> L1 caches are mounted directly to the pipeline, and exist in EX1..EX3
>> stages.
>>
>> So:
>> PF IF ID1 ID2 EX1 EX2 EX3 WB
>> Or, alternately:
>> PF IF ID RF EX1 EX2 EX3 WB
>>
>> So, access is like:
>> EX1: Calculate address, send request to L1 cache;
>> EX2: Cache checks hit/miss, extracts data for load, prepare for store.
>> This is the stage where the pipeline stall is signaled on miss.
>> EX3: Data fetched for Load, final cleanup.
>> Final cleanup: Sign-extension, Binary32->Binary64 conversion, etc.
>> Data stored back into L1 arrays here (on next clock edge).
>>> Some questions about your L1 cache:
>>>
>>> In clocks, what is I$L1 D$L1 read and write hit latency,
>>> and the total access latency including ring bus overhead?
>>> And is the D$L1 store pipelined?
>>>
>> Loads and stores are pipelined.
>>
>> TLB doesn't matter yet, L1 caches are virtually indexed and tagged.
>>> Do you use the same basic design for your 2-way assoc. TLB
>>> as the L1 cache, so the same numbers apply?
>>>
>>> And do you pipeline the TLB lookup in one stage, and D$L1 access in a
>>> second?
>>>
>> TLB is a separate component external to the L1 caches, and performs
>> translation on L1 miss.
>>
>> It has a roughly 3 cycle latency.
> <
> So, you take a 2-cycle look at L1 tag and if you are going to get a miss,
> you then take a 3-cycle access of TLB so you can "get on" ring-bus.
> So, AGEN to PA is 5 cycles.
> <

Somewhere in that area.

The L1 I$ and D$ are both on the bus, and requests from the D$ travel
through the I$ to get to the TLB.

The join point between the L1 and L2 rings has an extra 1-cycle delay on
each side to deal with forward/skip handling.

As a quick mental check, it is probably in the area of ~8 cycles between
when the AGU does its thing and when the first L1 miss request leaves the
CPU core.

Around the 9th cycle, it enters the L2 cache (which has a 5 cycle
latency IIRC, *1), 2 cycles of forward-skips for the response to get
back to the CPU core, then the response enters L1 D$ (and gets absorbed).

So, round trip, probably somewhere in the area of 17 clock cycles
(single core), or a few more if dual-core.

*1: This is mostly to allow for larger block-RAM arrays.
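
Tallying the rough numbers above (single core, all approximate):

  AGU -> miss request leaves the core  : ~8 cycles
  hop into the L2                      : ~1 cycle  (enters on the 9th)
  L2 hit latency                       : ~5 cycles
  forward-skips on the response path   : ~2 cycles
  response absorbed into the L1 D$     : ~1 cycle
  -------------------------------------------------
  round trip                           : ~17 cycles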

Depending on the message and other specifics, it may also travel through
the part of the ring which deals with the Boot ROM and MMIO interfaces and
similar; but normal RAM requests and responses skip over this part of
the ring.

>> 1: Request comes in, setup for fetch from TLB arrays;
>> 2: Check for TLB hit/miss, raise exception on miss;
>> 3: Replace original request with translated request.
>> Output is on the clock-edge following the 3rd cycle.
>>> I'm suggesting that your primary objective is making that pathway from the
>>> Load Store Unit (LSU) to TLB to D$L1 as simple and efficient as possible.
>>> So a direct 1:1 connect, zero bus overhead and latency, just cache latency.
>>>
>>> Such that ideally it takes 2 pipelined stages for a cache read hit,
>>> and if the D$L1 read hit is 1 clock that the load-to-use
>>> latency is 2 clocks (or at least that is possible), pipelined.
>>>
>>> And that a store is passed to D$L1 in 1 clock,
>>> and then the LSU can continue while the cache deals with it.
>>> The cache bus handshake would go "busy" until the store is complete.
>>> Also ideally store hits would pipeline the tag and data accesses
>>> so back to back store hits take 1 clock (but that's getting fancy).
>>>
>> There is no LSU in this design, or effectively, the L1 cache itself
>> takes on this role.
>>>> I suspect ringbus efficiency is diminishing the efficiency of external
>>>> RAM access, as both the L2 memcpy and DRAM stats tend to be lower than
>>>> the "raw" speed of accessing the RAM chip (in the associated unit tests).
>>>
>>> At the start this ring bus might have been a handy idea by
>>> making it easy to experiment with different configurations, but I
>>> think you should be looking at direct connections whenever possible.
>>>
>> Within the core itself, everything is bolted directly to the main pipeline.
>>
>> External to this, everything is on the ringbus.
>>
>> As noted, when there are no cache misses and no MMIO access or similar,
>> the bus isn't really involved.
>>
>>
>> But, yeah, I am left to realize that, say, driving the L2 cache with a
>> FIFO might have been better for performance (rather than just letting
>> requests circle the ring until they can be handled).
> <
> You know that this allows for un-ordered memory accesses--ACK.
> PAs get to the memory banks in an order unlike that in which the misses
> occurred--the CDC 6600 had these effects and the CDC 7600 got rid of them.


Re: Misc: Another (possible) way to more MHz...

<ufd79o$2nafk$1@dont-email.me>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=34385&group=comp.arch#34385

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: kegs@provalid.com (Kent Dickey)
Newsgroups: comp.arch
Subject: Re: Misc: Another (possible) way to more MHz...
Date: Mon, 2 Oct 2023 01:44:24 -0000 (UTC)
Organization: provalid.com
Lines: 50
Message-ID: <ufd79o$2nafk$1@dont-email.me>
References: <uf4btl$3pe5m$1@dont-email.me> <uf9n5q$1278n$1@dont-email.me> <bc120c1f-9ac5-4cad-9e6a-393cc39deaa3n@googlegroups.com> <ufb4cm$1ef70$1@dont-email.me>
Injection-Date: Mon, 2 Oct 2023 01:44:24 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="6a64e876e62842bb249e853853ac543f";
logging-data="2861556"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/61oclFJIqbLViqy1506LP"
Cancel-Lock: sha1:9NpaETKHpLCnwkSkc+6sfes328o=
Originator: kegs@provalid.com (Kent Dickey)
X-Newsreader: trn 4.0-test76 (Apr 2, 2001)
 by: Kent Dickey - Mon, 2 Oct 2023 01:44 UTC

In article <ufb4cm$1ef70$1@dont-email.me>, BGB <cr88192@gmail.com> wrote:
>So, along with LUT cost, FPGA timing constraints have been a
>long-standing battle (and it is annoyingly difficult trying to achieve
>or maintain clock speeds much over 50MHz or so on the Artix-7...).

Just to be clear, you should easily be hitting 100MHz on FPGAs for
simple ALU and cache designs. That's around 10 stages of 6-input LUTs,
which is an incredible amount of work.

From the Artix-7 data sheet, the BRAM CLK to DOUT (without output
register, which is all you should ever use for performance) is 2.46ns in
-1 speedgrade. Do not use registered BRAMs--even if all you do is
register the BRAM output in slices, that's better since it pulls the
signals in closer to your other logic (slice logic can easily talk to
other slice logic--but BRAMs and DSP48s are "elsewhere" and there's
effectively a routing penalty to get to/from them). This delay can be
longer sometimes, but 1ns is a good rule of thumb. On Artix -1 speed parts,
maybe this is closer to 1.5ns, but that's still manageable. In short,
it's better to pay the 1ns routing penalty in the same clock you access
the BRAM array, not the next clock. From your main pipeline, you send
the address over to the BRAM in cycle 0, which costs about 1ns of
routing. In the next cycle, the BRAM delivers data at 2.46ns, and it costs
about 1ns to bring it back to your slices and register it. This should
take just 3.5ns, giving you about 6ns more to do whatever else you want
in this clock.
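
A minimal inference sketch of that pattern (illustrative only; depending
on tool version, the second flop may need a keep/dont_touch hint to stay
in fabric rather than being packed into the BRAM's optional output
register):

  module bram_read #(parameter AW = 12, DW = 64) (
      input  wire          clk,
      input  wire [AW-1:0] rdAddr,
      output reg  [DW-1:0] rdDataR
  );
      reg [DW-1:0] mem [0:(1<<AW)-1];
      reg [DW-1:0] rdData;   // the BRAM's own output, no DOUT register

      always @(posedge clk) begin
          // BRAM read: data is valid ~2.46ns after this edge.
          rdData  <= mem[rdAddr];
          // Re-register in slices, close to the consuming logic.
          rdDataR <= rdData;
      end
  endmodule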

The Artix-7 data sheet also says DSP48E1 in-to-out is 2.41ns. You
should be able to easily build a 64-bit ALU in under 7ns using DSP48s,
leaving lots of time for other overhead in your clock cycle. If you are
trying to synthesize Verilog like: "out[63:0] = a[63:0] + b[63:0]",
this defaults to a ripple carry, which has two downsides--it is slow (though
not unbearably so; about 32 ripple stages of 100ps each = 3.x ns),
and it requires a long stack of slices, which makes routing terrible--it
stretches your datapath to be over 32 rows, which the tools won't deal
with well, making all your slice-to-slice routing for all CPU signals
terrible. Instantiate DSP48E1 ALUs, and lots of them if needed.

You should be able to do all add/mul in a few DSP48E1 (you can CASCOUT
for simplicity, or mux them, or whatever). There are at least 40 DSPs
on the smallest Artix-7 so use them. On Xilinx FPGAs, you get one free
LUT timing-wise before a register: going straight into a slice register
has the same delay as going through a 6-input LUT and then registering.
That's a 4:1 mux, and you should definitely be using that for your
execute stage. Do one group of DSPs for add (each DSP48 can do
48-bits, so cascade two together to get 64 bits), one group for
multiply, and mux them together if you want. Using a simple CASCOUT, I
think you can do 64-bit adds in 2 stages of DSP48, which is just 3.93ns.
That makes 100MHz pretty easily, even if there's 3.0ns of routing delay.
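
For the add path, one low-effort route is Vivado's use_dsp synthesis
attribute (use_dsp48 in older tool versions); direct DSP48E1
instantiation with CASCOUT, as described above, gives more control. A
sketch:

  // Nudge the tools to map a 64-bit add onto DSP48E1 slices instead
  // of a long fabric carry chain. Registering inputs and output lets
  // the tool retime into the DSP's internal pipeline registers.
  (* use_dsp = "yes" *)
  module dsp_add64 (
      input  wire        clk,
      input  wire [63:0] a, b,
      output reg  [63:0] sum
  );
      reg [63:0] aR, bR;
      always @(posedge clk) begin
          aR  <= a;           // input register stage
          bR  <= b;
          sum <= aR + bR;     // 64-bit add: two cascaded 48-bit DSP adds
      end
  endmodule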

Kent

Re: Misc: Another (possible) way to more MHz...

<0fb82b76-a6df-42d8-8cde-d094d94db237n@googlegroups.com>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=34386&group=comp.arch#34386

  copy link   Newsgroups: comp.arch
X-Received: by 2002:a05:622a:1823:b0:412:233d:39dc with SMTP id t35-20020a05622a182300b00412233d39dcmr150179qtc.0.1696219910610;
Sun, 01 Oct 2023 21:11:50 -0700 (PDT)
X-Received: by 2002:a05:6870:e8a:b0:1dc:7909:91fa with SMTP id
mm10-20020a0568700e8a00b001dc790991famr3780535oab.2.1696219910244; Sun, 01
Oct 2023 21:11:50 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 1 Oct 2023 21:11:49 -0700 (PDT)
In-Reply-To: <ufd79o$2nafk$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=99.251.79.92; posting-account=QId4bgoAAABV4s50talpu-qMcPp519Eb
NNTP-Posting-Host: 99.251.79.92
References: <uf4btl$3pe5m$1@dont-email.me> <uf9n5q$1278n$1@dont-email.me>
<bc120c1f-9ac5-4cad-9e6a-393cc39deaa3n@googlegroups.com> <ufb4cm$1ef70$1@dont-email.me>
<ufd79o$2nafk$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <0fb82b76-a6df-42d8-8cde-d094d94db237n@googlegroups.com>
Subject: Re: Misc: Another (possible) way to more MHz...
From: robfi680@gmail.com (robf...@gmail.com)
Injection-Date: Mon, 02 Oct 2023 04:11:50 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 5632
 by: robf...@gmail.com - Mon, 2 Oct 2023 04:11 UTC

On Sunday, October 1, 2023 at 9:44:28 PM UTC-4, Kent Dickey wrote:
> In article <ufb4cm$1ef70$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
> >So, along with LUT cost, FPGA timing constraints have been a
> >long-standing battle (and it is annoyingly difficult trying to achieve
> >or maintain clock speeds much over 50MHz or so on the Artix-7...).
> Just to be clear, you should easily be hitting 100MHz on FPGAs for
> simple ALU and cache designs. That's around 10 stages of 6-input LUTs,
> which is an incredible amount of work.
>
> From the Artix-7 data sheet, the BRAM CLK to DOUT (without output
> register, which is all you should ever use for performance) is 2.46ns in
> -1 speedgrade. Do not use registered BRAMs--even if all you do is
> register the BRAM output in slices, that's better since it pulls the
> signals in closer to your other logic (slice logic can easily talk to
> other slice logic--but BRAMs and DSP48s are "elsewhere" and there's
> effectively a routing penalty to get to/from them). This delay can be
> longer sometimes, but 1ns is a good rule of thumb. On Artix -1 speed parts,
> maybe this is closer to 1.5ns, but that's still manageable. In short,
> it's better to pay the 1ns routing penalty in the same clock you access
> the BRAM array, not the next clock. From your main pipeline, you sent
> the address over to the BRAM in cycle 0, this costs about 1ns of
> routing. In the next cycle, BRAM delivers data at 2.46ns, and it costs
> about 1ns to bring it back to your slices and register it. This should
> take just 3.5ns, giving you about 6ns more to do whatever else you want
> in this clock.
>
> The Artix-7 data sheet also says DSP48E1 in-to-out is 2.41ns. You
> should be able to easily build a 64-bit ALU in under 7ns using DSP48s,
> leaving lots of time for other overhead in your clock cycle. If you are
> trying to synthesize Verilog like: "out[63:0] = a[63:0] + b[63:0]",
> this defaults to a ripple carry and it has two downsides--slow (but not
> unbearably, should be about 32 ripple stages of 100ps each = 3.x ns),
> but it requires a long stack of slices which makes routing terrible--it
> stretches your datapath to be over 32 rows, which the tools won't deal
> with well, making all your slice-to-slice routing for all CPU signals
> terrible. Instantiate DSP48E1 ALUs, and lots of them if needed.
>
> You should be able to do all add/mul in a few DSP48E1 (you can CASCOUT
> for simplicity, or mux them, or whatever). There are at least 40 DSPs
> on the smallest Artix-7 so use them. On Xilinx FPGAs, you get one free
> LUT timing-wise before a register: going straight into a slice register
> has the same delay as going through a 6-input LUT and then registering.
> That's a 4:1 mux, and you should definitely be using that for your
> execute stage. Do one group of DSP's for add (each DSP48 can do
> 48-bits, so cascade two together to get 64 bits), one group for
> multiply, and mux them together if you want. Using a simple CASCOUT, I
> think you can do 64-bit adds in 2 stages of DSP48, which is just 3.93ns.
> That makes 100MHz pretty easily, even if there's 3.0ns of routing delay.
>
> Kent

It is certainly possible to hit 100 MHz for simple designs. It is keeping the
clock rate up for more complex designs that is a challenge. I have found it
challenging to get beyond 50 MHz too. I think the average tinkerer
(non-expert) is likely to have trouble getting past 50 MHz.

I was pleased when I managed to get a 68k core running at 85 MHz
(in an Artix-7, -1 speed grade), but that was a non-overlapped pipeline taking several clocks per
instruction.

I mainly use SystemVerilog code, auto-route, and let the tools decide
what to do. Letting the tools do most of the work (the easy path) is
probably what results in ½ the potential performance.

Most of the cores I have worked on are for my own edification and so
I have not put effort into floor-planning and connecting FPGA resources
manually to get the maximum out of the cores.

I think 50 MHz is a good first target. I find a 40 MHz video dot clock rate
good to work with.

Re: Misc: Another (possible) way to more MHz...

<ufdvs9$2s0qj$1@dont-email.me>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=34388&group=comp.arch#34388

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Another (possible) way to more MHz...
Date: Mon, 2 Oct 2023 03:43:52 -0500
Organization: A noiseless patient Spider
Lines: 150
Message-ID: <ufdvs9$2s0qj$1@dont-email.me>
References: <uf4btl$3pe5m$1@dont-email.me> <uf9n5q$1278n$1@dont-email.me>
<bc120c1f-9ac5-4cad-9e6a-393cc39deaa3n@googlegroups.com>
<ufb4cm$1ef70$1@dont-email.me> <ufd79o$2nafk$1@dont-email.me>
<0fb82b76-a6df-42d8-8cde-d094d94db237n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 2 Oct 2023 08:43:54 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="9630f8d70cec24f51e89f45d4086cdef";
logging-data="3015507"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/zQu9TQ96+ED+aArcRAVqN"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.15.1
Cancel-Lock: sha1:5iRVz9QwzvVcQpWlxjf7PSGFgsY=
In-Reply-To: <0fb82b76-a6df-42d8-8cde-d094d94db237n@googlegroups.com>
Content-Language: en-US
 by: BGB - Mon, 2 Oct 2023 08:43 UTC

On 10/1/2023 11:11 PM, robf...@gmail.com wrote:
> On Sunday, October 1, 2023 at 9:44:28 PM UTC-4, Kent Dickey wrote:
>> In article <ufb4cm$1ef70$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>>> So, along with LUT cost, FPGA timing constraints have been a
>>> long-standing battle (and it is annoyingly difficult trying to achieve
>>> or maintain clock speeds much over 50MHz or so on the Artix-7...).
>> Just to be clear, you should easily be hitting 100MHz on FPGAs for
>> simple ALU and cache designs. That's around 10 stages of 6-input LUTs,
>> which is an incredible amount of work.
>>
>> From the Artix-7 data sheet, the BRAM CLK to DOUT (without output
>> register, which is all you should ever use for performance) is 2.46ns in
>> -1 speedgrade. Do not use registered BRAMs--even if all you do is
>> register the BRAM output in slices, that's better since it pulls the
>> signals in closer to your other logic (slice logic can easily talk to
>> other slice logic--but BRAMs and DSP48s are "elsewhere" and there's
>> effectively a routing penalty to get to/from them). This delay can be
>> longer sometimes, but 1ns is a good rule of thumb. On Artix -1 speed parts,
>> maybe this is closer to 1.5ns, but that's still manageable. In short,
>> it's better to pay the 1ns routing penalty in the same clock you access
>> the BRAM array, not the next clock. From your main pipeline, you sent
>> the address over to the BRAM in cycle 0, this costs about 1ns of
>> routing. In the next cycle, BRAM delivers data at 2.46ns, and it costs
>> about 1ns to bring it back to your slices and register it. This should
>> take just 3.5ns, giving you about 6ns more to do whatever else you want
>> in this clock.
>>
>> The Artix-7 data sheet also says DSP48E1 in-to-out is 2.41ns. You
>> should be able to easily build a 64-bit ALU in under 7ns using DSP48s,
>> leaving lots of time for other overhead in your clock cycle. If you are
>> trying to synthesize Verilog like: "out[63:0] = a[63:0] + b[63:0]",
>> this defaults to a ripple carry and it has two downsides--slow (but not
>> unbearably, should be about 32 ripple stages of 100ps each = 3.2 ns),
>> but it requires a long stack of slices which makes routing terrible--it
>> stretches your datapath to be over 32 rows, which the tools won't deal
>> with well, making all your slice-to-slice routing for all CPU signals
>> terrible. Instantiate DSP48E1 ALUs, and lots of them if needed.
>>
>> You should be able to do all add/mul in a few DSP48E1 (you can CASCOUT
>> for simplicity, or mux them, or whatever). There are at least 40 DSPs
>> on the smallest Artix-7 so use them. On Xilinx FPGAs, you get one free
>> LUT timing-wise before a register: going straight into a slice register
>> has the same delay as going through a 6-input LUT and then registering.
>> That's a 4:1 mux, and you should definitely be using that for your
>> execute stage. Do one group of DSP's for add (each DSP48 can do
>> 48-bits, so cascade two together to get 64 bits), one group for
>> multiply, and mux them together if you want. Using a simple CASCOUT, I
>> think you can do 64-bit adds in 2 stages of DSP48, which is just 3.93ns.
>> That makes 100MHz pretty easily, even if there's 3.0ns of routing delay.
>>
>> Kent
>
> It is certainly possible to hit 100 MHz for simple designs. It is keeping the
> clock rate up for more complex designs that is a challenge. I have found it
> challenging to get beyond 50 MHz too. I think the average tinkerer
> (non-expert) is likely to have trouble getting past 50 MHz.
>

I have gotten 100 MHz for simpler RISC-like cores.

But, getting something like my BJX2 ISA design running at 75 or 100 MHz
is a little bit more of a challenge (while also using decent-sized L1
caches, e.g. 32kB).

Where, ISA specs:

https://github.com/cr88192/bgbtech_btsr1arch/blob/master/docs/2020-04-30_BJX2D.txt

https://github.com/cr88192/bgbtech_btsr1arch/blob/master/docs/2020-04-30_BJX2_IsaDescD.txt

I am fiddling around a bit with it, and have been getting the core
"closer" to being able to boost the speed, but the "Worst Negative
Slack" is still at around 2.59ns, and is putting up a whole lot of a
fight here...

Looks like a lot of the failing paths are sort of like:
~ 12-14 levels;
~ 50-130 high-fanout;
~ 4.5 ns of logic delay;
~ 10.5 ns of net-delay.

What makes things harder is that I am trying to pull this off while
staying with 32K L1 caches, ...

> I was pleased when I managed to get a 68k core running at 85 MHz
> (in an Artix-7, -1 speed grade), but that was a non-overlapped pipeline taking several clocks per
> instruction.
>

Yeah, the FPGAs I am using are also generally at a -1 speed grade.

Although tempting, the "Nexys Video" was almost universally sold out,
and was rather expensive.

So, my test boards are still mostly the "Nexys A7" (XC7A100T) and
"QMTECH XC7A200T".

I also have an Arty S7-50 board (XC7S50), but can't fit a "full featured"
form of the BJX2 core onto it.

Like, when dealing with this FPGA, something like a SIMD unit that does
4 single-precision operations in parallel is a bit of an ask, ...

Seemingly, trying to get this unit to work at 75MHz is also difficult.
Though, I do have a fallback case, which can still run the SIMD ops but
has a 10 cycle latency (non-pipelined) rather than 3 cycles (pipelined).

So, between these two (theoretical peak):
Cheaper SIMD: 20 MFLOP/s at 50MHz (4 ops per 10-cycle non-pipelined op);
Fancier SIMD: 200 MFLOP/s at 50MHz (4 ops per cycle when pipelined).

Though, it generally requires using hand-crafted ASM to get much advantage
from this SIMD unit.

> I mainly use SystemVerilog code, auto-route, and let the tools decide
> what to do. Letting the tools do most of the work (the easy path) is
> probably what results in ½ the potential performance.
>
> Most of the cores I have worked on are for my own edification and so
> I have not put effort into floor-planning and connecting FPGA resources
> manually to get the maximum out of the cores.
>

I am mostly using "Vivado Synthesis Defaults" and similar; no
floorplanning, ...

For whatever reason, trying to change things too much here breaks
Vivado's ability to show stuff in the "Netlist" tab or in the
"Utilization" graph. It also seems like it would make it harder for
people to recreate my results.

> I think 50 MHz is a good first target. I find a 40 MHz video dot clock rate
> good to work with.
>

I can note that SweRV generally passes timing at 25 or 33 MHz, so 50MHz
doesn't seem too bad.

And, on the other side, MicroBlaze doesn't have too much difficulty with
running at 100MHz, but MicroBlaze is nowhere near as feature-rich as BJX2.

Re: Misc: Another (possible) way to more MHz...

<ufei6t$2vj62$1@dont-email.me>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=34391&group=comp.arch#34391

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: kegs@provalid.com (Kent Dickey)
Newsgroups: comp.arch
Subject: Re: Misc: Another (possible) way to more MHz...
Date: Mon, 2 Oct 2023 13:56:45 -0000 (UTC)
Organization: provalid.com
Lines: 88
Message-ID: <ufei6t$2vj62$1@dont-email.me>
References: <uf4btl$3pe5m$1@dont-email.me> <ufd79o$2nafk$1@dont-email.me> <0fb82b76-a6df-42d8-8cde-d094d94db237n@googlegroups.com> <ufdvs9$2s0qj$1@dont-email.me>
Injection-Date: Mon, 2 Oct 2023 13:56:45 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="8d2f063373c26baca1d71b479b9ce26e";
logging-data="3132610"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19cCT7mpi9yBy9zypJsjoUW"
Cancel-Lock: sha1:5aDNit71nkJywenHBQ3RCCrYO3Q=
X-Newsreader: trn 4.0-test76 (Apr 2, 2001)
Originator: kegs@provalid.com (Kent Dickey)
 by: Kent Dickey - Mon, 2 Oct 2023 13:56 UTC

In article <ufdvs9$2s0qj$1@dont-email.me>, BGB <cr88192@gmail.com> wrote:
>I am fiddling around a bit with it, and have been getting the core
>"closer" to being able to boost the speed, but the "Worst Negative
>Slack" is still at around 2.59ns, and is putting up a whole lot of a
>fight here...
>
>Looks like a lot of the failing paths are sort of like:
> ~ 12-14 levels;
> ~ 50-130 high-fanout;
> ~ 4.5 ns of logic delay;
> ~ 10.5 ns of net-delay.
>
>
>What makes things harder is that I am trying to pull this off while
>staying with 32K L1 caches, ...

A good rule of thumb is 1ns per LUT level. So if you have 14 levels of LUTs,
then you're aiming at 70MHz or so. Either reduce levels of LUTs, or just
settle for 70MHz.

Note that 14 levels of LUTs is equivalent to about 30 levels of gates. This is
a slow design independent of it being in an FPGA, and independent of
any FPGA routing issues.

If you want to not optimize your control and other logic, that's your
choice. But you're mixing things up. You're saying an ALU cannot be
done within 10ns on an FPGA, and I'm pointing out that's not true.
Definitely doing an ADD over 2 clocks is unnecessary in an FPGA, you
just need to use the DSP48 blocks. Synthesized adders in FPGA have poor
performance when they get to 32 bits or wider. And on an FPGA, bad
decisions compound--if you have a huge slow ALU, it makes everything else
slower as well (since everything gets further apart).

And as I pointed out, 32KB of cache should be no problem as long as it's
direct mapped. If you want 4-way associative, that's also possible, but
it requires care and careful logic for the way selection. Note it's
just a lot of logic to handle unaligned (if you are supporting that) and
it requires care to do it fast. Again, if you don't want to optimize
your logic, that's your choice, but you keep complaining about FPGA
speed, so you're implying you want it to go faster.

A good rule for optimizing any logic is to note what is the inherent critical
path. Take dcache: it's generating the address, sending it to the cache,
getting data back, muxing to get the correct way, aligning the data,
flopping the result. Assume all control is valid in advance, what's
the minimum logic that has to be done? And the best design does exactly
that, and no random "pipe stall" signal is a critical path.

A simple trick to remove any control signal from your critical path is
to rewrite code to do a late mux using the slow signal. For example:

wire step1 = (slow_signal) ? a : b;
wire step2 = (other_signal) ? step1 : something_else;
wire flop_d_input = (other_signal2) ? step2 : something_else2;

So if slow_signal is the slowest signal, what can happen is the synthesizer
defaults to reducing logic size, and so step1 is an early LUT, and
then there's logic after it, so you have a critical path from slow_signal
through additional LUTS. You can mechanically create two trees of logic,
one where slow_signal==0 and one where slow_signal==1, and then do
slow_signal muxing last:

wire step1_slow_is_0 = b;
wire step2_slow_is_0 = (other_signal) ? step1_slow_is_0 : something_else;
wire flop_d_slow_is_0 = (other_signal2) ? step2_slow_is_0 : something_else2;

wire step1_slow_is_1 = a;
wire step2_slow_is_1 = (other_signal) ? step1_slow_is_1 : something_else;
wire flop_d_slow_is_1 = (other_signal2) ? step2_slow_is_1 : something_else2;

wire flop_d_input = (slow_signal) ? flop_d_slow_is_1 : flop_d_slow_is_0;

This is annoying to do, and sometimes tools can help, but if you have
a control signal which keeps messing up your logic, this eliminates it.
I name signals like "xxx_if_step" or "capture_maybe" to note that the
early logic steps are not fully qualified. Note: synthesizers sometimes
catch on and undo your change to reduce LUT count. In that case, you have
to mark the last signal before the mux as dont_touch to the synthesizer.
Often, passing the intermediate signal like flop_d_slow_is_1 as a
"debug output" that then gets optimized away (so pass it to another
module not being synthesized with this block, so it gets optimized away
much later) is enough to prevent the synthesis optimization (which
reduces LUT count but hurts timing).
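
In Vivado, that dont_touch marking can be written as a synthesis
attribute on the pre-mux wires; a minimal sketch reusing the names
above (only the attribute syntax is new here):

(* dont_touch = "true" *) wire flop_d_slow_is_0;
(* dont_touch = "true" *) wire flop_d_slow_is_1;

assign flop_d_slow_is_0 = (other_signal2) ? step2_slow_is_0 : something_else2;
assign flop_d_slow_is_1 = (other_signal2) ? step2_slow_is_1 : something_else2;

//the late mux on the slow signal stays last:
wire flop_d_input = (slow_signal) ? flop_d_slow_is_1 : flop_d_slow_is_0;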

If you post some details on your ALU implementation and your data cache
implementation, I'm sure folks could provide pointers on improvements.

Kent

Re: Misc: Another (possible) way to more MHz...

<ufetbd$31mip$1@dont-email.me>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=34393&group=comp.arch#34393

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!news.swapon.de!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Another (possible) way to more MHz...
Date: Mon, 2 Oct 2023 12:06:50 -0500
Organization: A noiseless patient Spider
Lines: 276
Message-ID: <ufetbd$31mip$1@dont-email.me>
References: <uf4btl$3pe5m$1@dont-email.me> <ufd79o$2nafk$1@dont-email.me>
<0fb82b76-a6df-42d8-8cde-d094d94db237n@googlegroups.com>
<ufdvs9$2s0qj$1@dont-email.me> <ufei6t$2vj62$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 2 Oct 2023 17:06:53 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="9630f8d70cec24f51e89f45d4086cdef";
logging-data="3201625"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18rME91VmVEvxUw7pHyF2P1"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.15.1
Cancel-Lock: sha1:SNcK7h7XMT/BTP6aXan2xk8oaS0=
Content-Language: en-US
In-Reply-To: <ufei6t$2vj62$1@dont-email.me>
 by: BGB - Mon, 2 Oct 2023 17:06 UTC

On 10/2/2023 8:56 AM, Kent Dickey wrote:
> In article <ufdvs9$2s0qj$1@dont-email.me>, BGB <cr88192@gmail.com> wrote:
>> I am fiddling around a bit with it, and have been getting the core
>> "closer" to being able to boost the speed, but the "Worst Negative
>> Slack" is still at around 2.59ns, and is putting up a whole lot of a
>> fight here...
>>
>> Looks like a lot of the failing paths are sort of like:
>> ~ 12-14 levels;
>> ~ 50-130 high-fanout;
>> ~ 4.5 ns of logic delay;
>> ~ 10.5 ns of net-delay.
>>
>>
>> What makes things harder is that I am trying to pull this off while
>> staying with 32K L1 caches, ...
>
> A good rule of thumb is 1ns per LUT level. So if you have 14 levels of LUTs,
> then you're aiming at 70MHz or so. Either reduce levels of LUTs, or just
> settle for 70MHz.
>
> Note that 14 levels of LUTs is equivalent to about 30 levels of gates. This is
> a slow design independent of it being in an FPGA, and independent of
> any FPGA routing issues.
>

It seems to vary based on what sorts of clock-speeds one synthesizes at:
100MHz seems to give ~ 10 levels;
75MHz seems to give ~ 14 levels;
50MHz seems to give ~ 19 levels.

Don't really know how this part works exactly...

> If you want to not optimize your control and other logic, that's your
> choice. But you're mixing things up. You're saying an ALU cannot be
> done within 10ns on an FPGA, and I'm pointing out that's not true.
> Definitely doing an ADD over 2 clocks is unnecessary in an FPGA, you
> just need to use the DSP48 blocks. Synthesized adders in FPGA have poor
>> performance when they get to 32 bits or wider. And on an FPGA, bad
> decisions compound--if you have a huge slow ALU, it makes everything else
> slower as well (since everything gets further apart).
>

General ALU adder logic is sort of like:
== Clock edge on input ==
  tValS = tValRs;
  tValT = tValRt;
  tCarryIn = 0;
  if((opUIxt[5:0] == JX2_ALU_SUB) || (opUIxt[5:0] == JX2_ALU_SBB))
    tValT = ~tValRt;
  if(opUIxt[5:0] == JX2_ALU_SUB)
    tCarryIn = 1;
  if(opUIxt[5:0] == JX2_ALU_SBB)
    tCarryIn = !regInSR[0];
  ...

  tAddVal0p0 = { 1'b0, tValS[15: 0] } + { 1'b0, tValT[15: 0] } + 0;
  tAddVal0p1 = { 1'b0, tValS[15: 0] } + { 1'b0, tValT[15: 0] } + 1;
  tAddVal1p0 = { 1'b0, tValS[31:16] } + { 1'b0, tValT[31:16] } + 0;
  tAddVal1p1 = { 1'b0, tValS[31:16] } + { 1'b0, tValT[31:16] } + 1;
  tAddVal2p0 = { 1'b0, tValS[47:32] } + { 1'b0, tValT[47:32] } + 0;
  tAddVal2p1 = { 1'b0, tValS[47:32] } + { 1'b0, tValT[47:32] } + 1;
  tAddVal3p0 = { 1'b0, tValS[63:48] } + { 1'b0, tValT[63:48] } + 0;
  tAddVal3p1 = { 1'b0, tValS[63:48] } + { 1'b0, tValT[63:48] } + 1;

  tSelBit0 = tCarryIn;
  tSelBit1 = tSelBit0 ? tAddVal0p1[16] : tAddVal0p0[16];
  tSelBit2 = tSelBit1 ? tAddVal1p1[16] : tAddVal1p0[16];
  tSelBit3 = tSelBit2 ? tAddVal2p1[16] : tAddVal2p0[16];
  tCarryOut = tSelBit3 ? tAddVal3p1[16] : tAddVal3p0[16];

  tAddVal = {
    tSelBit3 ? tAddVal3p1[15:0] : tAddVal3p0[15:0],
    tSelBit2 ? tAddVal2p1[15:0] : tAddVal2p0[15:0],
    tSelBit1 ? tAddVal1p1[15:0] : tAddVal1p0[15:0],
    tSelBit0 ? tAddVal0p1[15:0] : tAddVal0p0[15:0]
  };

  case(opUIxt[5:0])
    //drive outputs for ALU.
  endcase
  if(opUCmd[5:0]==JX2_UCMD_ALU3)
  begin
    //present final outputs as ALU outputs.
  end
  if(opUCmd[5:0]==JX2_UCMD_CONV2)
  begin
    //present outputs from various format converters.
  end
== Clock Edge ==

There is a combiner case where the Lane 1 and 2 ALUs may combine and
perform 128-bit integer addition and similar (similar with the integer
shift units, which combine for 128-bit integer shift).
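
For contrast, the DSP48 route suggested above can be requested from
Vivado synthesis without hand-instantiating the primitive, via the
USE_DSP attribute; a minimal sketch (module and port names are made up
here, not the actual BJX2 code):

(* use_dsp = "yes" *)
module Add64Dsp(
    input             clk,
    input      [63:0] valS,
    input      [63:0] valT,
    output reg [63:0] valSum
);
    //with the attribute, the registered add maps onto DSP48E1 slices
    //(cascaded to cover 64 bits) instead of a LUT carry chain.
    always @(posedge clk)
        valSum <= valS + valT;
endmodule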

> And as I pointed out, 32KB of cache should be no problem as long as it's
> direct mapped. If you want 4-way associative, that's also possible, but
> it requires care and careful logic for the way selection. Note it's
> just a lot of logic to handle unaligned (if you are supporting that) and
> it requires care to do it fast. Again, if you don't want to optimize
> your logic, that's your choice, but you keep complaining about FPGA
> speed, so you're implying you want it to go faster.
>
> A good rule for optimizing any logic is to note what is the inherent critical
> path. Take dcache: it's generating the address, sending it to the cache,
> getting data back, muxing to get the correct way, aligning the data,
> flopping the result. Assume all control is valid in advance, what's
> the minimum logic that has to be done? And the best design does exactly
> that, and no random "pipe stall" signal is a critical path.
>

Cache design is sort of like:
Input Stage:
  tNxtReqA0 = tAddrIn[47: 4];
  tNxtReqA1 = tNxtReqA0 + 1; //may be done carry-select
  tNxtReqAddrHi = tAddrIn[95:48]; // high bits of 96-bit VA.
  tNxtReqBix = tAddrIn[4:0];

  if(tNxtReqA0[0])
  begin
    tNxtReqAxA = tNxtReqA1;
    tNxtReqAxB = tNxtReqA0;
  end
  else
  begin
    tNxtReqAxA = tNxtReqA0;
    tNxtReqAxB = tNxtReqA1;
  end
  tNxtReqAxH =
    tNxtReqAddrHi[47:32] ^
    tNxtReqAddrHi[31:16] ^
    tNxtReqAddrHi[15: 0] ^
    tKrrModeHash;
    //Hash based on keyring and current CPU mode.
    //Includes a hardware RNG value that changes each flush.
  tNxtReqIxA = tNxtReqAxA[10:1];
  tNxtReqIxB = tNxtReqAxB[10:1];
  tNxtReqIx2A = tNxtReqIxA;
  tNxtReqIx2B = tNxtReqIxB;
  if(exHold)
  begin
    tNxtReqIx2A = tReqIxA;
    tNxtReqIx2B = tReqIxB;
  end
  ...
=== Edge ===
  if(!exHold2) //only update if pipeline not stalled
  begin
    tReqAxA <= tNxtReqAxA;
    tReqAxB <= tNxtReqAxB;
    tReqIxA <= tNxtReqIxA;
    tReqIxB <= tNxtReqIxB;
    ...
  end

  tBlkDataA <= arrMemDataA[tNxtReqIx2A];
  tBlkDataB <= arrMemDataB[tNxtReqIx2B];
  tBlkAddrA <= arrMemAddrA[tNxtReqIx2A];
  tBlkAddrB <= arrMemAddrB[tNxtReqIx2B];
  tBlkIxA <= tNxtReqIx2A;
  tBlkIxB <= tNxtReqIx2B;

=== Next Cycle ===
  tAddrMissA =
    (tBlkAddrA[47:32] != tReqAxA[43:28]) ||
    (tBlkAddrA[31:16] != tReqAxA[27:12]) ||
    (tBlkAddrA[15: 5] != tReqAxA[11: 1]) ||
    (tBlkAddrA[63:48] != tReqAxH) ; //1
    //*1: Storing/comparing full 96-bit virtual addr here is expensive.
    //So, L1 caches cheat and use a hash internally.
  tAddrMissB =
    ...

  tReqMiss = tAddrMissA || tAddrMissB;
  ...

  tDcHoldOut = tReqMiss;
  if(not_ready)
    tDcHoldOut = 1;
  if(waiting_for_ram_responses)
    tDcHoldOut = 1;
  ...

The tDcHoldOut would be later OR'ed with other signals to generate the
final pipeline stall signal.
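
Schematically, that final OR is just (the other hold-out names here
are hypothetical):

//final stall: OR together the per-unit hold-outs; this feeds the
//register enables rather than the datapath muxing itself.
assign exHold = tDcHoldOut | tIcHoldOut | tFpHoldOut | tBusHoldOut;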

Block extraction logic is sort of like:
  if(tReqBix[4])
  begin
    tSelBlockData0 = { tBlkDataA, tBlkDataB };
  end
  else
  begin
    tSelBlockData0 = { tBlkDataB, tBlkDataA };
  end

  tSelBlockData1 = tSelBlockData0[127:0];
  if(tReqBix[3])
    tSelBlockData1 = tSelBlockData0[191:64];
  //tSelBlockData1 used as direct output for 128-bit Load

  tSelBlockData2 = tSelBlockData1[95:0];
  if(tReqBix[2])
    tSelBlockData2 = tSelBlockData1[127:32];

  tSelBlockData3 = tSelBlockData2[79:0];
  if(tReqBix[1])
    tSelBlockData3 = tSelBlockData2[95:16];

  tSelBlockData4 = tSelBlockData3[63:0];
  if(tReqBix[0])
    tSelBlockData4 = tSelBlockData3[71:8];

  //tSelBlockData4: Generates output for 64-bit and less.
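
Written out with explicit widths, the extraction above is a
byte-granularity funnel shift over the 32-byte pair; a condensed
sketch (wire names shortened here):

wire [255:0] tPair = tReqBix[4] ? { tBlkDataA, tBlkDataB }
                                : { tBlkDataB, tBlkDataA };
wire [127:0] tSh3  = tReqBix[3] ? tPair[191:64] : tPair[127:0];
wire [ 95:0] tSh2  = tReqBix[2] ? tSh3[127:32]  : tSh3[ 95:0];
wire [ 79:0] tSh1  = tReqBix[1] ? tSh2[ 95:16]  : tSh2[ 79:0];
wire [ 63:0] tDout = tReqBix[0] ? tSh1[ 71: 8]  : tSh1[ 63:0];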

Final value preparation mostly involves sign/zero extension to the
appropriate size (for integer loads), or misc stuff like
Binary32->Binary64 conversion.
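
Likewise, the even/odd pairing at the top of the walkthrough condenses
to (a sketch; array A always holds the even-numbered lines and array B
the odd ones, so an access that straddles a 16-byte line fetches both
halves in one go):

tNxtReqA1  = tNxtReqA0 + 1;
tNxtReqAxA = tNxtReqA0[0] ? tNxtReqA1 : tNxtReqA0; //even line of the pair
tNxtReqAxB = tNxtReqA0[0] ? tNxtReqA0 : tNxtReqA1; //odd line of the pair
tNxtReqIxA = tNxtReqAxA[10:1]; //parity bit dropped from the index
tNxtReqIxB = tNxtReqAxB[10:1];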

> A simple trick to remove any control signal from your critical path is
> to rewrite code to do a late mux using the slow signal. For example:
>
> wire step1 = (slow_signal) ? a : b;
> wire step2 = (other_signal) ? step1 : something_else;
> wire flop_d_input = (other_signal2) ? step2 : something_else2;
>
> So if slow_signal is the slowest signal, what can happen is the synthesizer
> defaults to reducing logic size, and so step1 is an early LUT, and
> then there's logic after it, so you have a critical path from slow_signal
> through additional LUTS. You can mechanically create two trees of logic,
> one where slow_signal==0 and one where slow_signal==1, and then do
> slow_signal muxing last:
>
> wire step1_slow_is_0 = b;
> wire step2_slow_is_0 = (other_signal) ? step1_slow_is_0 : something_else;
> wire flop_d_slow_is_0 = (other_signal2) ? step2_slow_is_0 : something_else2;
>
> wire step1_slow_is_1 = a;
> wire step2_slow_is_1 = (other_signal) ? step1_slow_is_1 : something_else;
> wire flop_d_slow_is_1 = (other_signal2) ? step2_slow_is_1 : something_else2;
>
> wire flop_d_input = (slow_signal) ? flop_d_slow_is_1 : flop_d_slow_is_0;
>
> This is annoying to do, and sometimes tools can help, but if you have
> a control signal which keeps messing up your logic, this eliminates it.
> I name signals like "xxx_if_step" or "capture_maybe" to note that the
> early logic steps are not fully qualified. Note: synthesizers sometimes
> catch on and undo your change to reduce LUT count. In that case, you have
> to mark the last signal before the mux as dont_touch to the synthesizer.
> Often, passing the intermediate signal like flop_d_slow_is_1 as a
> "debug output" that then gets optimized away (so pass it to another
> module not being synthesized with this block, so it gets optimized away
> much later) is enough to prevent the synthesis optimization (which
> reduces LUT count but hurts timing).
>
> If you post some details on your ALU implementation and your data cache
> implementation, I'm sure folks could provide pointers on improvements.
>


Re: Misc: Another (possible) way to more MHz...

<uff869$33v1v$1@dont-email.me>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=34399&group=comp.arch#34399

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: kegs@provalid.com (Kent Dickey)
Newsgroups: comp.arch
Subject: Re: Misc: Another (possible) way to more MHz...
Date: Mon, 2 Oct 2023 20:11:53 -0000 (UTC)
Organization: provalid.com
Lines: 125
Message-ID: <uff869$33v1v$1@dont-email.me>
References: <uf4btl$3pe5m$1@dont-email.me> <ufdvs9$2s0qj$1@dont-email.me> <ufei6t$2vj62$1@dont-email.me> <ufetbd$31mip$1@dont-email.me>
Injection-Date: Mon, 2 Oct 2023 20:11:53 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="8d2f063373c26baca1d71b479b9ce26e";
logging-data="3275839"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19+EGd4f1lwq5llrymIME/g"
Cancel-Lock: sha1:ja5nCNodXdHURHEQV4JufButYbc=
Originator: kegs@provalid.com (Kent Dickey)
X-Newsreader: trn 4.0-test76 (Apr 2, 2001)
 by: Kent Dickey - Mon, 2 Oct 2023 20:11 UTC

In article <ufetbd$31mip$1@dont-email.me>, BGB <cr88192@gmail.com> wrote:
>On 10/2/2023 8:56 AM, Kent Dickey wrote:
>> In article <ufdvs9$2s0qj$1@dont-email.me>, BGB <cr88192@gmail.com> wrote:
>>> I am fiddling around a bit with it, and have been getting the core
>>> "closer" to being able to boost the speed, but the "Worst Negative
>>> Slack" is still at around 2.59ns, and is putting up a whole lot of a
>>> fight here...
>>>
>>> Looks like a lot of the failing paths are sort of like:
>>> ~ 12-14 levels;
>>> ~ 50-130 high-fanout;
>>> ~ 4.5 ns of logic delay;
>>> ~ 10.5 ns of net-delay.
>>>
>>>
>>> What makes things harder is that I am trying to pull this off while
>>> staying with 32K L1 caches, ...
>>
>> A good rule of thumb is 1ns per LUT level. So if you have 14 levels of LUTs,
>> then you're aiming at 70MHz or so. Either reduce levels of LUTs, or just
>> settle for 70MHz.
>>
>> Note that 14 levels of LUTs is equivalent to about 30 levels of gates.
>This is
>> a slow design independent of it being in an FPGA, and independent of
>> any FPGA routing issues.
>>
>
>It seems to vary based on what sorts of clock-speeds one synthesizes at:
> 100MHz seems to give ~ 10 levels;
> 75MHz seems to give ~ 14 levels;
> 50MHz seems to give ~ 19 levels.
>
>Don't really know how this part works exactly...
>
>
>> If you want to not optimize your control and other logic, that's your
>> choice. But you're mixing things up. You're saying an ALU cannot be
>> done within 10ns on an FPGA, and I'm pointing out that's not true.
>> Definitely doing an ADD over 2 clocks is unnecessary in an FPGA, you
>> just need to use the DSP48 blocks. Synthesized adders in FPGA have poor
>> performance when they get to 32 bits or wider. And on an FPGA, bad
>> decisions compound--if you have a huge slow ALU, it makes everything else
>> slower as well (since everything gets further apart).
>>
>
>General ALU adder logic is sort of like:
> == Clock edge on input ==
> tValS = tValRs;
> tValT = tValRt;
> tCarryIn = 0;
> if((opUIxt[5:0] == JX2_ALU_SUB) || (opUIxt[5:0] == JX2_ALU_SBB))
> tValT = ~tValRt;
> if(opUIxt[5:0] == JX2_ALU_SUB)
> tCarryIn = 1;
> if(opUIxt[5:0] == JX2_ALU_SBB)
> tCarryIn = !regInSR[0];
> ...
>
> tAddVal0p0 = { 1'b0, tValS[15: 0] } + { 1'b0, tValT[15: 0] } + 0;
> tAddVal0p1 = { 1'b0, tValS[15: 0] } + { 1'b0, tValT[15: 0] } + 1;
> tAddVal1p0 = { 1'b0, tValS[31:16] } + { 1'b0, tValT[31:16] } + 0;
> tAddVal1p1 = { 1'b0, tValS[31:16] } + { 1'b0, tValT[31:16] } + 1;
> tAddVal2p0 = { 1'b0, tValS[47:32] } + { 1'b0, tValT[47:32] } + 0;
> tAddVal2p1 = { 1'b0, tValS[47:32] } + { 1'b0, tValT[47:32] } + 1;
> tAddVal3p0 = { 1'b0, tValS[63:48] } + { 1'b0, tValT[63:48] } + 0;
> tAddVal3p1 = { 1'b0, tValS[63:48] } + { 1'b0, tValT[63:48] } + 1;
>
> tSelBit0 = tCarryIn;
> tSelBit1 = tSelBit0 ? tAddVal0p1[16]:tAddVal0p0[16];
> tSelBit2 = tSelBit1 ? tAddVal1p1[16]:tAddVal1p0[16];
> tSelBit3 = tSelBit2 ? tAddVal2p1[16]:tAddVal2p0[16];
> tCarryOut = tSelBit3 ? tAddVal3p1[16]:tAddVal3p0[16];
>
> tAddVal = {
> tSelBit3 ? tAddVal3p1[15:0] : tAddVal3p0[15:0],
> tSelBit2 ? tAddVal2p1[15:0] : tAddVal2p0[15:0],
> tSelBit1 ? tAddVal1p1[15:0] : tAddVal1p0[15:0],
> tSelBit0 ? tAddVal0p1[15:0] : tAddVal0p0[15:0]
> };

You want the ALU data path to be operating on registered values.

But: you look at opUIxt[5:0], and if SUB or SBB, then you invert tValT[63:0].
This then goes in to your adders. This is slow--you have a LUT to look
at the opUIxt, and THEN you have a LUT to invert the 64 bit data path.
Then the addition is done. It would be best to push the data inversion
to the previous clock--you likely are reading the register data and then
registering it, so doing the inversion is "free" (remember, you get a free
LUT before each register). That's the least amount of logic.

But at the very least, detect that inversion is needed in the previous
clock. Then you have a single bit, inversion_needed for this cycle. Then
do:

wire [63:0] tValT_maybe_inverted = (inversion_needed) ? ~tValT[63:0] :
tValT[63:0];

and then do the addition with tValT_maybe_inverted.

It would be best to decode opUIxt[5:0] in the previous clock, and then
use those registered values in this cycle as needed.
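
A sketch of that previous-stage version, reusing the adder's names
(the _r suffix and invert_needed are invented here, and forwarding is
ignored):

//previous stage: opUIxt is already known, so fold the conditional
//invert into the operand's stage register--the "free" LUT in front
//of the flop.
wire invert_needed = (opUIxt[5:0] == JX2_ALU_SUB) ||
                     (opUIxt[5:0] == JX2_ALU_SBB);

always @(posedge clk)
begin
    tValT_r    <= invert_needed ? ~tValRt : tValRt;
    tCarryIn_r <= (opUIxt[5:0] == JX2_ALU_SUB) ? 1'b1 :
                  (opUIxt[5:0] == JX2_ALU_SBB) ? !regInSR[0] : 1'b0;
end

//the execute-stage adder then starts directly from tValT_r and
//tCarryIn_r, with no opcode LUTs ahead of it.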

As a style suggestion, I always size all Verilog variables. This way,
it's easy to tell data path from control:

wire var1 = var2 && var3; // Single bit--control logic
wire [63:0] var5 = var6[63:0] & var7[63:0]; // data-path, 64 LUTs

I'll point out your one line "tValT = ~tValRt;" is a major operation
(it's 64 LUTs) and it's kind of hidden in the code.

I don't understand your cache code, so I cannot really comment. If you
explain a little better what's going on, I can probably help. I cannot
tell which signals are registers and which are calculated, so I'm
completely lost. Maybe walk me through what you're doing over the 2
cycles of the cache access: There's the cycle you prepare the address and
send it to the BRAM; then the next cycle the BRAM uses that address (it
flops it internally), and drives the data out. With the infer style,
you should write it so the cache access uses a register as the index to
the Verilog array--this matches how the BRAM works. You also try to
avoid doing a 96-bit compare to determine cache hit--that's going to
be slow.
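
In inference terms, one BRAM-friendly shape is roughly (array and
signal widths assumed from the earlier fragments):

reg [127:0] arrMemDataA[0:1023];
reg [127:0] tBlkDataA;

//synchronous read: infers block RAM, with the index expression
//landing in the BRAM's internal address register at this clock edge
//and the data (plus routing) arriving during the next cycle.
always @(posedge clk)
    tBlkDataA <= arrMemDataA[tNxtReqIx2A];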

Kent

Re: Misc: Another (possible) way to more MHz...

<1f6d2001-21eb-493c-910b-a4f271b46d69n@googlegroups.com>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=34404&group=comp.arch#34404

  copy link   Newsgroups: comp.arch
X-Received: by 2002:a05:620a:4153:b0:76f:109e:e244 with SMTP id k19-20020a05620a415300b0076f109ee244mr225895qko.5.1696304889507;
Mon, 02 Oct 2023 20:48:09 -0700 (PDT)
X-Received: by 2002:a05:6871:a08b:b0:1e1:82c6:33e2 with SMTP id
vq11-20020a056871a08b00b001e182c633e2mr2029443oab.10.1696304889203; Mon, 02
Oct 2023 20:48:09 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 2 Oct 2023 20:48:08 -0700 (PDT)
In-Reply-To: <ufetbd$31mip$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:1139:fbbc:13b3:c98c;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:1139:fbbc:13b3:c98c
References: <uf4btl$3pe5m$1@dont-email.me> <ufd79o$2nafk$1@dont-email.me>
<0fb82b76-a6df-42d8-8cde-d094d94db237n@googlegroups.com> <ufdvs9$2s0qj$1@dont-email.me>
<ufei6t$2vj62$1@dont-email.me> <ufetbd$31mip$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <1f6d2001-21eb-493c-910b-a4f271b46d69n@googlegroups.com>
Subject: Re: Misc: Another (possible) way to more MHz...
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Tue, 03 Oct 2023 03:48:09 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 13078
 by: MitchAlsup - Tue, 3 Oct 2023 03:48 UTC

On Monday, October 2, 2023 at 12:06:57 PM UTC-5, BGB wrote:
> On 10/2/2023 8:56 AM, Kent Dickey wrote:
> > In article <ufdvs9$2s0qj$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
> >> I am fiddling around a bit with it, and have been getting the core
> >> "closer" to being able to boost the speed, but the "Worst Negative
> >> Slack" is still at around 2.59ns, and is putting up a whole lot of a
> >> fight here...
> >>
> >> Looks like a lot of the failing paths are sort of like:
> >> ~ 12-14 levels;
> >> ~ 50-130 high-fanout;
> >> ~ 4.5 ns of logic delay;
> >> ~ 10.5 ns of net-delay.
> >>
> >>
> >> What makes things harder is that I am trying to pull this off while
> >> staying with 32K L1 caches, ...
> >
> > A good rule of thumb is 1ns per LUT level. So if you have 14 levels of LUTs,
> > then you're aiming at 70MHz or so. Either reduce levels of LUTs, or just
> > settle for 70MHz.
> >
> > Note that 14 levels of LUTs is equivalent to about 30 levels of gates. This is
> > a slow design independent of it being in an FPGA, and independent of
> > any FPGA routing issues.
> >
> It seems to vary based on what sorts of clock-speeds one synthesizes at:
> 100MHz seems to give ~ 10 levels;
> 75MHz seems to give ~ 14 levels;
> 50MHz seems to give ~ 19 levels.
>
> Don't really know how this part works exactly...
> > If you want to not optimize your control and other logic, that's your
> > choice. But you're mixing things up. You're saying an ALU cannot be
> > done within 10ns on an FPGA, and I'm pointing out that's not true.
> > Definitely doing an ADD over 2 clocks is unnecessary in an FPGA, you
> > just need to use the DSP48 blocks. Synthesized adders in FPGA have poor
> > performance when they get to 32 bits or wider. And on an FPGA, bad
> > decisions compound--if you have a huge slow ALU, it makes everything else
> > slower as well (since everything gets further apart).
> >
> General ALU adder logic is sort of like:
> == Clock edge on input ==
> tValS = tValRs;
> tValT = tValRt;
> tCarryIn = 0;
> if((opUIxt[5:0] == JX2_ALU_SUB) || (opUIxt[5:0] == JX2_ALU_SBB))
> tValT = ~tValRt;
> if(opUIxt[5:0] == JX2_ALU_SUB)
> tCarryIn = 1;
> if(opUIxt[5:0] == JX2_ALU_SBB)
> tCarryIn = !regInSR[0];
> ...
>
> tAddVal0p0 = { 1'b0, tValS[15: 0] } + { 1'b0, tValT[15: 0] } + 0;
> tAddVal0p1 = { 1'b0, tValS[15: 0] } + { 1'b0, tValT[15: 0] } + 1;
> tAddVal1p0 = { 1'b0, tValS[31:16] } + { 1'b0, tValT[31:16] } + 0;
> tAddVal1p1 = { 1'b0, tValS[31:16] } + { 1'b0, tValT[31:16] } + 1;
> tAddVal2p0 = { 1'b0, tValS[47:32] } + { 1'b0, tValT[47:32] } + 0;
> tAddVal2p1 = { 1'b0, tValS[47:32] } + { 1'b0, tValT[47:32] } + 1;
> tAddVal3p0 = { 1'b0, tValS[63:48] } + { 1'b0, tValT[63:48] } + 0;
> tAddVal3p1 = { 1'b0, tValS[63:48] } + { 1'b0, tValT[63:48] } + 1;
<
This is a carry-select adder--and generally fan-out bound from
carry[16] to the next carry[16]
>
> tSelBit0 = tCarryIn;
> tSelBit1 = tSelBit0 ? tAddVal0p1[16]:tAddVal0p0[16];
> tSelBit2 = tSelBit1 ? tAddVal1p1[16]:tAddVal1p0[16];
> tSelBit3 = tSelBit2 ? tAddVal2p1[16]:tAddVal2p0[16];
> tCarryOut = tSelBit3 ? tAddVal3p1[16]:tAddVal3p0[16];
>
> tAddVal = {
> tSelBit3 ? tAddVal3p1[15:0] : tAddVal3p0[15:0],
> tSelBit2 ? tAddVal2p1[15:0] : tAddVal2p0[15:0],
> tSelBit1 ? tAddVal1p1[15:0] : tAddVal1p0[15:0],
> tSelBit0 ? tAddVal0p1[15:0] : tAddVal0p0[15:0]
> };
<
Yep. The question is what is the time between tAddVal and the == Clock
Edge == below. It should be less than ¼ cycle.
>
> case(opUIxt[5:0])
> //drive outputs for ALU.
> endcase
> if(opUCmd[5:0]==JX2_UCMD_ALU3)
> begin
> //present final outputs as ALU outputs.
> end
> if(opUCmd[5:0]==JX2_UCMD_CONV2)
> begin
> //present outputs from various format converters.
> end
> == Clock Edge ==
>
>
> There is a combiner case where the Lane 1 and 2 ALUs may combine and
> perform 128-bit integer addition and similar (similar with the integer
> shift units, which combine for 128-bit integer shift).
> > And as I pointed out, 32KB of cache should be no problem as long as it's
> > direct mapped. If you want 4-way associative, that's also possible, but
> > it requires care and careful logic for the way selection. Note it's
> > just a lot of logic to handle unaligned (if you are supporting that) and
> > it requires care to do it fast. Again, if you don't want to optimize
> > your logic, that's your choice, but you keep complaining about FPGA
> > speed, so you're implying you want it to go faster.
> >
> > A good rule for optimizing any logic is to note what is the inherent critical
> > path. Take dcache: it's generating the address, sending it to the cache,
> > getting data back, muxing to get the correct way, aligning the data,
> > flopping the result. Assume all control is valid in advance, what's
> > the minimum logic that has to be done? And the best design does exactly
> > that, and no random "pipe stall" signal is a critical path.
> >
> Cache design is sort of like:
> Input Stage:
> tNxtReqA0 = tAddrIn[47: 4];
> tNxtReqA1 = tNxtReqA0 + 1; //may be done carry-select
> tNxtReqAddrHi = tAddrIn[95:48]; // high bits of 96-bit VA.
> tNxtReqBix = tAddrIn[4:0];
<
With a 32KB cache and the TLB in the L2 loop, you only need
(15-4) == 11 bits to the SRAM decoders. Everything else is used
after tag read (comparisons). These bits are 2-3 carry selects
faster than the HOBs. Also, you can skip the other addition
using some circular SRAM select logic.
<
>
> if(tNxtReqA0[0])
> begin
> tNxtRegAxA = tNxtReqA1;
> tNxtRegAxB = tNxtReqA0;
> end
> else
> begin
> tNxtRegAxA = tNxtReqA0;
> tNxtRegAxB = tNxtReqA1;
> end
> tNxtRegAxH =
> tNxtReqAddrHi[47:32] ^
> tNxtReqAddrHi[31:16] ^
> tNxtReqAddrHi[15: 0] ^
> tKrrModeHash;
> //Hash based on keyring and current CPU mode.
> //Includes a hardware RNG value that changes each flush.
> tNxtReqIxA = tNxtRegAxA[10:1];
> tNxtReqIxB = tNxtRegAxB[10:1];
> tNxtReqIx2A = tNxtReqIxA;
> tNxtReqIx2B = tNxtReqIxB;
> if(exHold)
> begin
> tNxtReqIx2A = tReqIxA;
> tNxtReqIx2B = tReqIxB;
> end
> ...
> === Edge ===
> if(!exHold2) //only update if pipeline not stalled
> begin
> tReqAxA <= tNxtReqAxA;
> tReqAxB <= tNxtReqAxB;
> tReqIxA <= tNxtReqIxA;
> tReqIxB <= tNxtReqIxB;
> ...
> end
>
> tBlkDataA <= arrMemDataA[tNxtReqIx2A];
> tBlkDataB <= arrMemDataB[tNxtReqIx2B];
> tBlkAddrA <= arrMemAddrA[tNxtReqIx2A];
> tBlkAddrB <= arrMemAddrB[tNxtReqIx2B];
> tBlkIxA <= tNxtReqIx2A;
> tBlkIxB <= tNxtReqIx2B;
>
> === Next Cycle ===
> tAddrMissA =
> (tBlkAddrA[47:32] != tReqAxA[43:28]) ||
> (tBlkAddrA[31:16] != tReqAxA[27:12]) ||
> (tBlkAddrA[15: 5] != tReqAxA[11: 1]) ||
> (tBlkAddrA[63:48] != tReqAxH) ; //1
> //*1: Storing/comparing full 96-bit virtual addr here is expensive.
> //So, L1 caches cheat and use a hash internally.
> tAddrMissB =
> ...
>
> tReqMiss = tAddrMissA || tAddrMissB;
> ...
>
> tDcHoldOut = tReqMiss;
> if(not_ready)
> tDcHoldOut = 1;
> if(waiting_for_ram_responses)
> tDcHoldOut = 1;
> ...
>
> The tDcHoldOut would be later OR'ed with other signals to generate the
> final pipeline stall signal.
>
> Block extraction logic is sort of like:
> if(tReqBix[4])
> begin
> tSelBlockData0 = { tBlkDataA, tBlkDataB };
> end
> else
> begin
> tSelBlockData0 = { tBlkDataB, tBlkDataA };
> end
>
> tSelBlockData1 = tSelBlockData0[127:0];
> if(tReqBix[3])
> tSelBlockData1 = tSelBlockData0[191:64];
> //tSelBlockData1 used as direct output for 128-bit Load
>
> tSelBlockData2 = tSelBlockData1[95:0];
> if(tReqBix[2])
> tSelBlockData2 = tSelBlockData1[127:32];
>
> tSelBlockData3 = tSelBlockData2[79:0];
> if(tReqBix[1])
> tSelBlockData3 = tSelBlockData2[95:16];
>
> tSelBlockData4 = tSelBlockData3[63:0];
> if(tReqBix[0])
> tSelBlockData4 = tSelBlockData3[71:8];
>
> //tSelBlockData4: Generates output for 64-bit and less.
>
> Final value preparation mostly involves sign/zero extension to the
> appropriate size (for integer loads), or misc stuff like
> Binary32->Binary64 conversion.
> > A simple trick to remove any control signal from your critical path is
> > to rewrite code to do a late mux using the slow signal. For example:
> >
> > wire step1 = (slow_signal) ? a : b;
> > wire step2 = (other_signal) ? step1 : something_else;
> > wire flop_d_input = (other_signal2) ? step2 : something_else2;
> >
> > So if slow_signal is the slowest signal, what can happen is the synthesizer
> > defaults to reducing logic size, and so step1 is an early LUT, and
> > then there's logic after it, so you have a critical path from slow_signal
> > through additional LUTS. You can mechanically create two trees of logic,
> > one where slow_signal==0 and one where slow_signal==1, and then do
> > slow_signal muxing last:
> >
> > wire step1_slow_is_0 = b;
> > wire step2_slow_is_0 = (other_signal) ? step1_slow_is_0 : something_else;
> > wire flop_d_slow_is_0 = (other_signal2) ? step2_slow_is_0 : something_else2;
> >
> > wire step1_slow_is_1 = a;
> > wire step2_slow_is_1 = (other_signal) ? step1_slow_is_1 : something_else;
> > wire flop_d_slow_is_1 = (other_signal2) ? step2_slow_is_1 : something_else2;
> >
> > wire flop_d_input = (slow_signal) ? flop_d_slow_is_1 : flop_d_slow_is_0;
> >
> > This is annoying to do, and sometimes tools can help, but if you have
> > a control signal which keeps messing up your logic, this eliminates it.
> > I name signals like "xxx_if_step" or "capture_maybe" to note that the
> > early logic steps are not fully qualified. Note: synthesizers sometimes
> > catch on and undo your change to reduce LUT count. In that case, you have
> > to mark the last signal before the mux as dont_touch to the synthesizer.
> > Often, passing the intermediate signal like flop_d_slow_is_1 as a
> > "debug output" that then gets optimized away (so pass it to another
> > module not being synthesized with this block, so it gets optimized away
> > much later) is enough to prevent the synthesis optimization (which
> > reduces LUT count but hurts timing).
> >
> > If you post some details on your ALU implementation and your data cache
> > implementation, I'm sure folks could provide pointers on improvements.
> >
> I have battled with synthesis a lot in the past here...
>
>
> Mostly, the stall signal isn't used in combinatorial logic for the most
> part, but rather effects the @(posedge clock) logic.
>
> If the stall signal is active, then most of the existing flip-flop
> values are left as-is.
>
>
>
> > Kent


Re: Misc: Another (possible) way to more MHz...

<fc5da684-b07e-40a0-89d2-537298046976n@googlegroups.com>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=34405&group=comp.arch#34405

  copy link   Newsgroups: comp.arch
X-Received: by 2002:ac8:5c0c:0:b0:418:fed:c02 with SMTP id i12-20020ac85c0c000000b004180fed0c02mr196480qti.8.1696305026017;
Mon, 02 Oct 2023 20:50:26 -0700 (PDT)
X-Received: by 2002:a05:6830:16c3:b0:6c4:aa6a:c4db with SMTP id
l3-20020a05683016c300b006c4aa6ac4dbmr4502317otr.0.1696305025715; Mon, 02 Oct
2023 20:50:25 -0700 (PDT)
Path: i2pn2.org!i2pn.org!news.niel.me!glou.org!news.glou.org!fdn.fr!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 2 Oct 2023 20:50:25 -0700 (PDT)
In-Reply-To: <ufei6t$2vj62$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:1139:fbbc:13b3:c98c;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:1139:fbbc:13b3:c98c
References: <uf4btl$3pe5m$1@dont-email.me> <ufd79o$2nafk$1@dont-email.me>
<0fb82b76-a6df-42d8-8cde-d094d94db237n@googlegroups.com> <ufdvs9$2s0qj$1@dont-email.me>
<ufei6t$2vj62$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <fc5da684-b07e-40a0-89d2-537298046976n@googlegroups.com>
Subject: Re: Misc: Another (possible) way to more MHz...
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Tue, 03 Oct 2023 03:50:26 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: MitchAlsup - Tue, 3 Oct 2023 03:50 UTC

On Monday, October 2, 2023 at 8:56:49 AM UTC-5, Kent Dickey wrote:
> In article <ufdvs9$2s0qj$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
> >I am fiddling around a bit with it, and have been getting the core
> >"closer" to being able to boost the speed, but the "Worst Negative
> >Slack" is still at around 2.59ns, and is putting up a whole lot of a
> >fight here...
> >
> >Looks like a lot of the failing paths are sort of like:
> > ~ 12-14 levels;
> > ~ 50-130 high-fanout;
> > ~ 4.5 ns of logic delay;
> > ~ 10.5 ns of net-delay.
> >
> >
> >What makes things harder is that I am trying to pull this off while
> >staying with 32K L1 caches, ...
> A good rule of thumb is 1ns per LUT level. So if you have 14 levels of LUTs,
> then you're aiming at 70MHz or so. Either reduce levels of LUTs, or just
> settle for 70MHz.
>
> Note that 14 levels of LUTs is equivalent to about 30 levels of gates. This is
> a slow design independent of it being in an FPGA, and independent of
> any FPGA routing issues.
<
Thanks for jumping in here.
>
> If you want to not optimize your control and other logic, that's your
> choice. But you're mixing things up. You're saying an ALU cannot be
> done within 10ns on an FPGA, and I'm pointing out that's not true.
> Definitely doing an ADD over 2 clocks is unnecessary in an FPGA, you
> just need to use the DSP48 blocks. Synthesized adders in FPGA have poor
> performance when they get to 32 bits or wider. And on an FPGA, bad
> decisions compound--if you have a huge slow ALU, it makes everything else
> slower as well (since everything gets further apart).
>
> And as I pointed out, 32KB of cache should be no problem as long as it's
> direct mapped. If you want 4-way associative, that's also possible, but
> it requires care and careful logic for the way selection. Note it's
> just a lot of logic to handle unaligned (if you are supporting that) and
> it requires care to do it fast. Again, if you don't want to optimize
> your logic, that's your choice, but you keep complaining about FPGA
> speed, so you're implying you want it to go faster.
>
> A good rule for optimizing any logic is to note what is the inherent critical
> path. Take dcache: it's generating the address, sending it to the cache,
> getting data back, muxing to get the correct way, aligning the data,
> flopping the result. Assume all control is valid in advance, what's
> the minimum logic that has to be done? And the best design does exactly
> that, and no random "pipe stall" signal is a critical path.
>
> A simple trick to remove any control signal from your critical path is
> to rewrite code to do a late mux using the slow signal. For example:
>
> wire step1 = (slow_signal) ? a : b;
> wire step2 = (other_signal) ? step1 : something_else;
> wire flop_d_input = (other_signal2) ? step2 : something_else2;
>
> So if slow_signal is the slowest signal, what can happen is the synthesizer
> defaults to reducing logic size, and so step1 is an early LUT, and
> then there's logic after it, so you have a critical path from slow_signal
> through additional LUTS. You can mechanically create two trees of logic,
> one where slow_signal==0 and one where slow_signal==1, and then do
> slow_signal muxing last:
>
> wire step1_slow_is_0 = b;
> wire step2_slow_is_0 = (other_signal) ? step1_slow_is_0 : something_else;
> wire flop_d_slow_is_0 = (other_signal2) ? step2_slow_is_0 : something_else2;
>
> wire step1_slow_is_1 = a;
> wire step2_slow_is_1 = (other_signal) ? step1_slow_is_1 : something_else;
> wire flop_d_slow_is_1 = (other_signal2) ? step2_slow_is_1 : something_else2;
>
> wire flop_d_input = (slow_signal) ? flop_d_slow_is_1 : flop_d_slow_is_0;
<
This looks well thought out.
>
> This is annoying to do, and sometimes tools can help, but if you have
> a control signal which keeps messing up your logic, this eliminates it.
> I name signals like "xxx_if_step" or "capture_maybe" to note that the
> early logic steps are not fully qualified. Note: synthesizers sometimes
> catch on and undo your change to reduce LUT count. In that case, you have
> to mark the last signal before the mux as dont_touch to the synthesizer.
> Often, passing the intermediate signal like flop_d_slow_is_1 as a
> "debug output" that then gets optimized away (so pass it to another
> module not being synthesized with this block, so it gets optimized away
> much later) is enough to prevent the synthesis optimization (which
> reduces LUT count but hurts timing).
>
> If you post some details on your ALU implementation and your data cache
> implementation, I'm sure folks could provide pointers on improvements.
>
> Kent

Re: Misc: Another (possible) way to more MHz...

<bbe084c0-06bd-4647-9b55-c8f8208c0409n@googlegroups.com>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=34406&group=comp.arch#34406

  copy link   Newsgroups: comp.arch
X-Received: by 2002:ad4:5885:0:b0:65b:7ba:8c9e with SMTP id dz5-20020ad45885000000b0065b07ba8c9emr29493qvb.1.1696305202998;
Mon, 02 Oct 2023 20:53:22 -0700 (PDT)
X-Received: by 2002:a05:6871:70e:b0:1e1:3367:1429 with SMTP id
f14-20020a056871070e00b001e133671429mr4887263oap.10.1696305202741; Mon, 02
Oct 2023 20:53:22 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 2 Oct 2023 20:53:22 -0700 (PDT)
In-Reply-To: <uff869$33v1v$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:1139:fbbc:13b3:c98c;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:1139:fbbc:13b3:c98c
References: <uf4btl$3pe5m$1@dont-email.me> <ufdvs9$2s0qj$1@dont-email.me>
<ufei6t$2vj62$1@dont-email.me> <ufetbd$31mip$1@dont-email.me> <uff869$33v1v$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <bbe084c0-06bd-4647-9b55-c8f8208c0409n@googlegroups.com>
Subject: Re: Misc: Another (possible) way to more MHz...
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Tue, 03 Oct 2023 03:53:22 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 7742
 by: MitchAlsup - Tue, 3 Oct 2023 03:53 UTC

On Monday, October 2, 2023 at 3:11:57 PM UTC-5, Kent Dickey wrote:
> In article <ufetbd$31mip$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
> >On 10/2/2023 8:56 AM, Kent Dickey wrote:
> >> In article <ufdvs9$2s0qj$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
> >>> I am fiddling around a bit with it, and have been getting the core
> >>> "closer" to being able to boost the speed, but the "Worst Negative
> >>> Slack" is still at around 2.59ns, and is putting up a whole lot of a
> >>> fight here...
> >>>
> >>> Looks like a lot of the failing paths are sort of like:
> >>> ~ 12-14 levels;
> >>> ~ 50-130 high-fanout;
> >>> ~ 4.5 ns of logic delay;
> >>> ~ 10.5 ns of net-delay.
> >>>
> >>>
> >>> What makes things harder is that I am trying to pull this off while
> >>> staying with 32K L1 caches, ...
> >>
> >> A good rule of thumb is 1ns per LUT level. So if you have 14 levels of LUTs,
> >> then you're aiming at 70MHz or so. Either reduce levels of LUTs, or just
> >> settle for 70MHz.
> >>
> >> Note that 14 levels of LUTs is equivalent to about 30 levels of gates.
> >This is
> >> a slow design independent of it being in an FPGA, and independent of
> >> any FPGA routing issues.
> >>
> >
> >It seems to vary based on what sorts of clock-speeds one synthesizes at:
> > 100MHz seems to give ~ 10 levels;
> > 75MHz seems to give ~ 14 levels;
> > 50MHz seems to give ~ 19 levels.
> >
> >Don't really know how this part works exactly...
> >
> >
> >> If you want to not optimize your control and other logic, that's your
> >> choice. But you're mixing things up. You're saying an ALU cannot be
> >> done within 10ns on an FPGA, and I'm pointing out that's not true.
> >> Definitely doing an ADD over 2 clocks is unnecessary in an FPGA, you
> >> just need to use the DSP48 blocks. Synthesized adders in FPGA have poor
> >> performance when they get to 32 bits or wider. And on an FPGA, bad
> >> decisions compound--if you have a huge slow ALU, it makes everything else
> >> slower as well (since everything gets further apart).
> >>
> >
> >General ALU adder logic is sort of like:
> > == Clock edge on input ==
> > tValS = tValRs;
> > tValT = tValRt;
> > tCarryIn = 0;
> > if((opUIxt[5:0] == JX2_ALU_SUB) || (opUIxt[5:0] == JX2_ALU_SBB))
> > tValT = ~tValRt;
> > if(opUIxt[5:0] == JX2_ALU_SUB)
> > tCarryIn = 1;
> > if(opUIxt[5:0] == JX2_ALU_SBB)
> > tCarryIn = !regInSR[0];
> > ...
> >
> > tAddVal0p0 = { 1'b0, tValS[15: 0] } + { 1'b0, tValT[15: 0] } + 0;
> > tAddVal0p1 = { 1'b0, tValS[15: 0] } + { 1'b0, tValT[15: 0] } + 1;
> > tAddVal1p0 = { 1'b0, tValS[31:16] } + { 1'b0, tValT[31:16] } + 0;
> > tAddVal1p1 = { 1'b0, tValS[31:16] } + { 1'b0, tValT[31:16] } + 1;
> > tAddVal2p0 = { 1'b0, tValS[47:32] } + { 1'b0, tValT[47:32] } + 0;
> > tAddVal2p1 = { 1'b0, tValS[47:32] } + { 1'b0, tValT[47:32] } + 1;
> > tAddVal3p0 = { 1'b0, tValS[63:48] } + { 1'b0, tValT[63:48] } + 0;
> > tAddVal3p1 = { 1'b0, tValS[63:48] } + { 1'b0, tValT[63:48] } + 1;
> >
> > tSelBit0 = tCarryIn;
> > tSelBit1 = tSelBit0 ? tAddVal0p1[16]:tAddVal0p0[16];
> > tSelBit2 = tSelBit1 ? tAddVal1p1[16]:tAddVal1p0[16];
> > tSelBit3 = tSelBit2 ? tAddVal2p1[16]:tAddVal2p0[16];
> > tCarryOut = tSelBit3 ? tAddVal3p1[16]:tAddVal3p0[16];
> >
> > tAddVal = {
> > tSelBit3 ? tAddVal3p1[15:0] : tAddVal3p0[15:0],
> > tSelBit2 ? tAddVal2p1[15:0] : tAddVal2p0[15:0],
> > tSelBit1 ? tAddVal1p1[15:0] : tAddVal1p0[15:0],
> > tSelBit0 ? tAddVal0p1[15:0] : tAddVal0p0[15:0]
> > };
> You want the ALU data path to be operating on registered values.
<
Yes, either before forwarding or after forwarding, but this is the
time-critical clock edge. I generally used:
<
clock|| Adder Drive Forward ||clock
<
precisely because the forwarding logic is late and fan-out is high.
>
> But: you look at opUIxt[5:0], and if SUB or SBB, then you invert tValT[63:0].
> This then goes in to your adders. This is slow--you have a LUT to look
> at the opUIxt, and THEN you have a LUT to invert the 64 bit data path.
> Then the addition is done. It would be best to push the data inversion
> to the previous clock--you likely are reading the register data and then
> registering it, so doing the inversion is "free" (remember, you get a free
> LUT before each register). That's the least amount of logic.
>
> But at the very least, detect that inversion is needed in the previous
> clock. Then you have a single bit, inversion_needed for this cycle. Then
> do:
>
> wire [63:0] tValT_maybe_inverted = (inversion_needed) ? ~tValT[63:0] :
> tValT[63:0];
>
> and then do the addition with tValT_maybe_inverted.
>
> It would be best to decode opUIxt[5:0] in the previous clock, and then
> use those registered values in this cycle as needed.
>
> As a style suggestion, I always size all Verilog variables. This way,
> it's easy to tell data path from control:
>
> wire var1 = var2 && var3; // Single bit--control logic
> wire [63:0] var5 = var6[63:0] & var7[63:0]; // data-path, 64 LUTs
>
> I'll point out your one line "tValT = ~tValRt;" is a major operation
> (it's 64 LUTs) and it's kind of hidden in the code.
>
> I don't understand your cache code, so I cannot really comment. If you
> explain a little better what's going on, I can probably help. I cannot
> tell which signals are registers and which are calculated, so I'm
> completely lost. Maybe walk me through what you're doing over the 2
> cycles of the cache access: There's the cycle you prepare the address and
> send it to the BRAM; then the next cycle the BRAM uses that address (it
> flops it internally), and drives the data out. With the infer style,
> you should write it so the cache access uses a register as the index to
> the Verilog array--this matches how the BRAM works. You also try to
> avoid doing a 96-bit compare to determine cache hit--that's going to
> be slow.
>
> Kent
