Re: Misc: Preliminary (actual) performance: BJX2 vs RV64

From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Preliminary (actual) performance: BJX2 vs RV64
Date: Fri, 2 Feb 2024 03:43:08 -0600
Organization: A noiseless patient Spider
Lines: 409
Message-ID: <upidfk$2i131$1@dont-email.me>
References: <uoimv3$4ahu$1@dont-email.me>
<cdf7ed10b08f1665d4741f20c1c7dd06@www.novabbs.org>
<uok4v6$csm5$1@dont-email.me>
<959e309e1b96f7d57a5265408ca358c3@www.novabbs.org>
<uoknuo$j55p$2@dont-email.me> <uomd6h$2u82f$2@newsreader4.netcologne.de>
<uomgoh$sck8$1@dont-email.me> <2024Jan22.230946@mips.complang.tuwien.ac.at>
<uomsiu$ub0e$1@dont-email.me>
<6a35c4271ca330a56225c514686727ca@www.novabbs.org>
<upekig$1nogr$1@dont-email.me>
<e01eddaf93fc9ac8cf2ff48232f2a133@www.novabbs.com>
<upfg7b$1va2n$1@dont-email.me> <upfing$1vlj1$1@dont-email.me>
<upfp8k$20n33$1@dont-email.me> <upftsc$21dnb$1@dont-email.me>
In-Reply-To: <upftsc$21dnb$1@dont-email.me>
 by: BGB - Fri, 2 Feb 2024 09:43 UTC

On 2/1/2024 5:04 AM, Robert Finch wrote:
> On 2024-02-01 4:45 a.m., BGB wrote:
>> On 2/1/2024 1:54 AM, Robert Finch wrote:
>>> On 2024-02-01 2:11 a.m., BGB wrote:
>>>> On 1/31/2024 9:01 PM, MitchAlsup wrote:
>>>>> BGB-Alt wrote:
>>>>>
>>>>>> On 1/22/2024 6:49 PM, MitchAlsup1 wrote:
>>>>>>> BGB-Alt wrote:
>>>>>>> <snip>
>>>>>
>>>>>> Also it would appear as-if the scheduling is assuming 1-cycle ALU
>>>>>> and 2-cycle load, vs 2-cycle ALU and 3-cycle load.
>>>>>
>>>>>> So, at least part of the problem is that GCC is generating code
>>>>>> that is not ideal for my pipeline.
>>>>>
>>>>> Captain Obvious strikes again.
>>>>>
>>>>>> Tried modeling what happens if RV64 had superscalar (in my
>>>>>> emulator), and the interlock issue gets worse, as then jumps up to
>>>>>> around 23%-26% interlock penalty (mostly eating any gains that
>>>>>> superscalar would bring). Where, it seems that superscalar
>>>>>> (according to my CPU's rules) would bundle around 10-15% of the
>>>>>> RV64 ops with '-O3' (or, around 8-12% with '-Os').
>>>>>
>>>>> You are running into the reasons CPU designers went OoO after the
>>>>> 2-wide
>>>>> in-order machine generation.
>>>>>
>>>>
>>>>
>>>> At the moment, it is bad enough to make me question whether even
>>>> 2-wide superscalar makes sense for RV64.
>>>>
>>>> Like, if Instructions/Bundle jumps by 10% but Interlock-Cost jumps
>>>> by 9%, then it would only gain 1% in terms of Instructions/Clock.
>>>>
>>>> This would suck, and not worth the cost of adding all the plumbing
>>>> needed to support superscalar.
>>>>
>>>>
>>>>>> On the other hand, disabling WEX in BJX2 causes interlock
>>>>>> penalties to drop. So, it still maintains a performance advantage
>>>>>> over RV, as the drop in MIPs score is smaller.
>>>>>
>>>>> Your compiler is tuned to your pipeline.
>>>>> But how do you tune your compiler to EVERY conceivable pipeline ??
>>>>>
>>>>
>>>> Possibly so.
>>>>
>>>> Seems that since my CPU and compiler co-evolved, then they fit
>>>> together reasonably well.
>>>>
>>>> Meanwhile, GCC's output seems to assume a different-looking
>>>> CPU, and is at a natural disadvantage (independent of the respective
>>>> "goodness" of the ISA's in question).
>>>>
>>>>
>>>> So, it seems like, my ISA runs roughly 22% faster than RV64 on my
>>>> CPU design, with GCC's tuning being sub-optimal.
>>>>
>>>>
>>>> But, both would get a nice speed up if the instruction latency were
>>>> more in-tune with what GCC seems to expect (and what is apparently
>>>> delivered by many of the RV64 chips).
>>>>
>>>> So, in part, the comparably high latency values are hurting
>>>> performance it seems.
>>>>
>>>>
>>>>>> Otherwise, had started work on trying to get RV64G support
>>>>>> working, as this would support a wider variety of programs than
>>>>>> RV64IMA.
>>>>>
>>>>>
>>>>>
>>>>>> In another experiment, had added logic to fold && and || operators
>>>>>> to use bitwise arithmetic for logical expressions (in certain cases).
>>>>>> If both the LHS and RHS represent logical expressions with no side
>>>>>> effects;
>>>>>> If the LHS and RHS are not "too expensive" according to a cost
>>>>>> heuristic (past a certain size, it is cheaper to use short-circuit
>>>>>> branching rather than ALU operations).
>>>>>
>>>>>> Internally, this added various pseudo operators to the compiler:
>>>>>>    &&&, |||: Logical and expressed as bitwise.
>>>>>>    !& : !(a&b)
>>>>>>    !!&: !(!(a&b)), Normal TEST operator, with a logic result.
>>>>>>      Exists to be distinct from normal bitwise AND.
>>>>>
>>>>> For the inexpensive cases, PRED was designed to handle the && and ||
>>>>> of HLLs.
>>>>
>>>> Mine didn't handle them, so generally predication only worked with
>>>> trivial conditionals:
>>>>    if(a<0)
>>>>      a=0;
>>>> Would use predication, but more complex cases:
>>>>    if((a<0) && (b>0))
>>>>      a=0;
>>>> Would not, and would always fall back to branching.
>>>>
>>>>
>>>> In the new mechanism, the latter case can partly be folded back into
>>>> the former, and can now allow parts of the conditional expression to
>>>> be subject to shuffling and bundling.
>>>>
>>>> But, it seems, say:
>>>>    CMPxx; MOVT; CMPxx; MOVT; AND; BNE
>>>> Is more bulky than, say:
>>>>    CMPxx; BF; CMPxx; BF;
>>>> And, not always faster.
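The trade-off above (short-circuit branches vs. folding && into bitwise ops when both sides are cheap and side-effect free) can be modeled with a quick sketch. Function names here are hypothetical, not from BGB's compiler:

```python
# Model of the "&&&" folding: for side-effect-free conditions, the
# branching form and the bitwise form must compute the same result.

def shortcircuit_and(a, b):
    # Branching form: CMPxx; BF; CMPxx; BF;
    if a < 0:
        if b > 0:
            return 1
    return 0

def bitwise_and(a, b):
    # Folded form: CMPxx; MOVT; CMPxx; MOVT; AND; BNE
    return (1 if a < 0 else 0) & (1 if b > 0 else 0)

for a in (-3, 0, 5):
    for b in (-2, 0, 7):
        assert shortcircuit_and(a, b) == bitwise_and(a, b)
```

The equivalence only holds when neither side traps or has side effects, which is why the post restricts the folding to that case.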
>>>>
>>>>
>>>> The CMP-3R ops partially address this, but the usefulness of the
>>>> immediate case is severely compromised with a small value range and
>>>> only a few possibilities.
>>>>
>>>> But, don't really have the encoding space left over (in the 32-bit
>>>> space) to add "better" versions.
>>>>
>>>> Like, say:
>>>>    CMPQEQ  Rm, Imm9u, Rn
>>>>    CMPQEQ  Rm, Imm9n, Rn
>>>>    CMPQNE  Rm, Imm9u, Rn
>>>>    CMPQNE  Rm, Imm9n, Rn
>>>>    CMPQGT  Rm, Imm9u, Rn
>>>>    CMPQGT  Rm, Imm9n, Rn
>>>>    CMPQGE  Rm, Imm9u, Rn
>>>>    CMPQGE  Rm, Imm9n, Rn
>>>>    CMPQLT  Rm, Imm9u, Rn
>>>>    CMPQLT  Rm, Imm9n, Rn
>>>>    CMPQLE  Rm, Imm9u, Rn
>>>>    CMPQLE  Rm, Imm9n, Rn
>>>>
>>>> Would deal with all of the cases effectively (and with a single op),
>>>> but at present, there is no encoding space to add these in the 32-bit
>>>> space (these would be a bit of an ask, even if the space did exist).
>>>>
>>>>
>>>> More viable would be (in XG2):
>>>>    CMPQEQ  Rm, Imm6s, Rn
>>>>    CMPQNE  Rm, Imm6s, Rn
>>>>    CMPQGT  Rm, Imm6s, Rn
>>>>    CMPQGE  Rm, Imm6s, Rn
>>>>    CMPQLT  Rm, Imm6s, Rn
>>>>    CMPQLE  Rm, Imm6s, Rn
>>>>
>>>> But, this is lame, but still more than the current:
>>>>    CMPQEQ  Rm, Imm5u, Rn
>>>>    CMPQNE  Rm, Imm5u, Rn
>>>>    CMPQGT  Rm, Imm5u, Rn
>>>> But, can maybe re-add the GE case:
>>>>    CMPQGE  Rm, Imm5u, Rn
>>>>
>>>>
>>>> Theoretically, 6s could get around a 60% hit-rate (vs 40% for 5u).
>>>> The hit-rate for 6u is also pretty close. Having both 6u and 6n
>>>> cases would have a better hit-rate, but is a bit more steep in terms
>>>> of encoding space (and is unlikely to matter enough to justify
>>>> burning 12 instruction spots on it).
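The hit-rate comparison between immediate widths can be illustrated by counting which constants each field can represent. The sample constants below are purely illustrative, not measured data from BGB's compiler:

```python
# Coverage of compare immediates by field width: Imm5u covers 0..31,
# Imm6s covers -32..31, Imm6u covers 0..63.

def fits(v, lo, hi):
    return lo <= v <= hi

# Hypothetical sample of compare constants seen in code:
sample = [0, 1, 2, 3, 4, 7, 8, 10, 15, 16, 31, 32, 48, 63, -1, -2, -8, 100, 255]

imm5u = sum(fits(v, 0, 31) for v in sample)
imm6s = sum(fits(v, -32, 31) for v in sample)
imm6u = sum(fits(v, 0, 63) for v in sample)

# 6s picks up small negatives that 5u misses; 6u picks up 32..63 instead.
print(imm5u, imm6s, imm6u)
```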
>>>>
>>>> Though, there is still the option of throwing a Jumbo prefix on
>>>> these ops getting, say:
>>>>    CMPQEQ Rm, Imm29s, Rn  //EQ, Wi=0
>>>>    CMPQNE Rm, Imm29s, Rn  //NE, Wi=0
>>>>    CMPQGT Rm, Imm29s, Rn  //GT, Wi=0
>>>>    CMPQGE Rm, Imm29s, Rn  //GE, Wi=0
>>>>    CMPQLT Rm, Imm29s, Rn  //GE, Wi=1
>>>>    CMPQLE Rm, Imm29s, Rn  //GT, Wi=1
>>>>
>>>>    CMPQHI Rm, Imm29s, Rn  //EQ, Wi=1 (?)
>>>>    CMPQHS Rm, Imm29s, Rn  //NE, Wi=1 (?)
>>>>
>>>> But... These would be 64-bit encodings, so would have the usual
>>>> tradeoffs/drawbacks of using 64-bit encodings...
>>>>
>>>> Note that in XG2, the 'Wi' bit would otherwise serve as a sign
>>>> extension bit for the immediate (but, with a Jumbo-Imm prefix, the
>>>> Ei bit serves as the sign bit, and Wi would be left as a possible
>>>> opcode bit, and/or ignored...).
>>>>
>>>>
>>>> And, with WEX, would be hit/miss vs loading the values into
>>>> registers for the value-range of +/- 65535.
>>>>
>>>>
>>>> Also, main reason GE was left out of the current batch Imm5-forms
>>>> was that it seemed to have a comparably lower hit-rate than EQ/NE/GT
>>>> (though, GE does better than GT for the 2-register case, but was a
>>>> lower hit-rate for compare-with-immediate).
>>>>
>>>>
>>>> Arguably, a case could be made for the unsigned compares, these were
>>>> left out for these cases as 64-bit unsigned compare is comparably
>>>> much rarer (and, 64-bit signed-compare works for 32-bit unsigned
>>>> values, in the case where the ABI keeps these values zero-extended,
>>>> unlike the wonk that is RV64 apparently sign-extending 32-bit
>>>> unsigned values to 64 bits).
>>>>
>>>> ...
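The point about unsigned compares can be checked directly: if an ABI keeps 32-bit unsigned values zero-extended in 64-bit registers, a signed 64-bit compare gives the correct unsigned ordering, whereas RV64-style sign-extension does not. A small model:

```python
# Signed 64-bit compare on zero-extended 32-bit unsigned values gives
# the correct unsigned ordering; on sign-extended values it does not.

MASK32 = 0xFFFFFFFF

def zext32(v):            # zero-extend a 32-bit value into 64 bits
    return v & MASK32

def sext32(v):            # sign-extend, as RV64 does with 32-bit results
    v &= MASK32
    return v - (1 << 32) if v & 0x80000000 else v

a, b = 0x80000000, 0x00000001     # as unsigned 32-bit values, a > b

assert (zext32(a) > zext32(b)) is True    # signed compare: correct order
assert (sext32(a) > sext32(b)) is False   # sign-extended: order flipped
```

This is why a dedicated 64-bit unsigned compare matters less under a zero-extending ABI.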
>>>>
>>>>
>>> Sounds like you hit the 32-bit encoding crunch. I think going with a
>>> wider instruction format for a 64-bit machine is a reasonable choice.
>>> I think they got that right with the Itanium. Being limited to
>>> constants < 12 bits uses extra instructions. If a significant
>>> percentage of the constants needs extra instructions, does using
>>> 32-bits really save space? A decent compare-and-branch can be built
>>> in 40-bits. Compare-and-branch is 10% of the instructions. If one
>>> looks at all the extra bits required to use a 32-bit instruction
>>> instead of a 40-bit one, the difference in code size is likely to be
>>> much smaller than the 25% difference in instruction bit size. I have
>>> been wanting to measure this for a while. I have thought of switching
>>> to 41-bit instructions as three will fit into 128-bits and it may be
>>> possible to simplify the fetch stage if bundles of 128-bits are
>>> fetched for a three-wide machine. But the software for 41-bits is
>>> more challenging.
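The 32-bit vs. 40-bit code-size argument above reduces to simple arithmetic: if a fraction f of 32-bit instructions needs one extra 32-bit instruction to build a constant, total size is N*(1+f)*32 bits versus N*40 bits for a uniform 40-bit encoding. The fractions below are illustrative, not measurements:

```python
# Back-of-envelope version of the 32-bit vs 40-bit code-size comparison.

def bits_32(n, f):
    # f = fraction of instructions needing one extra 32-bit instruction
    return n * 32 * (1 + f)

def bits_40(n):
    return n * 40

n = 100_000
for f in (0.0, 0.1, 0.25, 0.5):
    print(f, bits_32(n, f), bits_40(n))

# Break-even point: 32*(1+f) == 40 exactly when f == 0.25.
assert bits_32(n, 0.25) == bits_40(n)
```

So the 40-bit format wins on size only if more than a quarter of the 32-bit instructions would need a constant-building helper, which is the measurement Robert says he has been wanting to make.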
>>>
>>
>> As can be noted, for 3RI Imm9 encodings in XG2, costs are:
>>    2-bits: Bundle+Predicate Mode
>>    12 bits: Rm/Rn register fields
>>    9 bits: Immediate
>>    9 bits: Remains for opcode/etc.
>>
>> For 3R instructions:
>>    2-bits: Bundle+Predicate Mode
>>    18 bits: Rm/Ro/Rn register fields
>>    12 bits: Remains for opcode/etc.
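The two field budgets quoted above can be checked to account for every bit of a 32-bit instruction word. Field names follow the post; widths are as stated there:

```python
# XG2 bit budgets: both layouts must sum to a 32-bit instruction word.

imm9_3ri = {"bundle_pred": 2, "rm_rn": 12, "imm9": 9, "opcode": 9}
r3       = {"bundle_pred": 2, "rm_ro_rn": 18, "opcode": 12}

assert sum(imm9_3ri.values()) == 32
assert sum(r3.values()) == 32
```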
>>
>> Though, given the ISA has other instructions:
>>    32 spots: Load/Store (Disp9) and JCMP
>>    16 spots: ALU 3R (Imm9)
>>      The F2 block was split in half, with half going to 2RI Imm10 ops.
>>
>> The F0 block holds all of the 3R ops, with a theoretical 9-bit opcode
>> space.
>>    Though, 1/4 of the space was carved initially for Branch ops.
>>      In the original encoding, they were Disp20.
>>      In XG2, they are effectively Disp23.
>>      Half of this space has been semi-reclaimed though.
>>        The BT/BF ops were redefined as being encoded as BRA?T / BRA?F
>>
>> Parts of the 3R space were also carved out for 2R space, etc.
>>
>>
>> The encoding space can be extended with Jumbo Prefixes.
>>    Currently defined as FE and FF, with 24 bits of payload.
>>    FE is solely "Make Disp/Imm field bigger".
>>    FF is mostly "Mostly make Opcode bigger, maybe also extend Immed".
>>
>> In XG2, there are theoretically a number of other jumbo prefixes:
>>    1E/1F/3E/3F/5E/5F/7E/7F/9E/9F/BE/BF/DE/DF
>> But, these are not yet defined for anything, and are reserved.
>>
>> There are also variants of the FA/FB block:
>>    1A/1B/3A/3B/5A/5B/7A/7B/9A/9B/BA/BB/DA/DB
>> Which are similarly reserved (each with a potential of 24 bits of
>> payload).
>>
>>
>> Status of the major blocks:
>>    F0: Mostly full (3R Space)
>>      0/1/2/3/4/5/6: Full
>>      7/8/9: Partly used.
>>      A/B: Still available
>>      C/D: BRA/BSR
>>      E/F: Semi reclaimed (former BT/BF ops)
>>    F1: Basically full (LD/ST)
>>    F2: Full as for 3RI Imm9 ops, some 2RI space remains.
>>    F3: Unused, Intended as User-Extension-Block
>>      Would likely follow same layout as F0 block.
>>    F8: 2RI Imm16 ops, 6/8 used.
>>    F9: Reserved
>>      Likely more 3R space (similar to F0 Block)
>>      May expand to F9 when F0 gets full.
>>      Beyond then, dunno.
>>      Probably no more Imm9 ops though.
>>
>> Note that:
>>    F4..F7 mirrors F0..F3 (but, with the WEX flag set)
>>    FA/FB are some niche, but used indirectly for alternative uses.
>>    FC/FD mirror F8/F9;
>>    FE/FF: Jumbo Prefixes
>> The Ez block follows a similar layout, but represents predicated ops.
>>    E0..E3: F0..F3, but Pred?T
>>    E4..E7: F0..F3, but Pred?F
>>    E8..E9: F8..F9, but Pred?T
>>    EA..EB: F0, F2, but Pred?T and WEX
>>    EC..ED: F8..F9, but Pred?F
>>    EE..EF: F0, F2, but Pred?F and WEX
>>
>> In XG2, all blocks other than Ez/Fz mirror Ez/Fz, but used to encode
>> Bit5 of the register field.
>>
>> In Baseline mode, these mostly encode 16-bit ops (where nominally,
>> everything uses 5-bit register fields, and the handling of R32..R63 is
>> hacky and only works with a limited subset of the ISA; having
>> reclaimed the 7z and 9z blocks from 16-bit land; these were reclaimed
>> from a defunct 24-bit instructions experiment, which had in turn used
>> these because initially "nothing of particular value" was in these
>> parts of the 16-bit map).
>>
>>
>> There has been a slowdown of adding new instructions, and being more
>> conservative when they are added, mostly because there isn't a whole
>> lot of encoding space left in the existing blocks.
>>
>> Apart from F3 and F9, the existing 32-bit encoding space is mostly
>> used up.
>>
>>
>> ...
>>
>>
> Put some work into the compiler and got it to optimize some expressions
> to use the dual-operation instructions. ATM it supports and_or, and_and,
> or_or, and or_and. The HelloWorld! program produces the following.
>
> integer main(integer argc, char* argv[])
> begin
>     integer x;
>
>     for (x = 1; x < 10; x++) begin
>         if (argc > 10 and argc < 12 or argc==52)
>             puts("Hello World!\n");
>     end
> end
>
>     .sdreg    29
> _main:
>   enter 2,32
>   ldo s1,32[fp]
> ; for (x = 1; x < 10; x++) begin
>   ldi s0,1
>   ldi t1,10
>   bge s0,t1,.00039
> .00038:
> ; if (argc > 10 and argc < 12 or argc==52)
>   zsgt t1,s1,10,1
>   zslt t2,s1,12,1
>   zseq t3,s1,52,1
>   and_or t0,t1,t2,t3
>   beqz t0,.00041
> ; puts("Hello World!\n");
>   sub sp,sp,8
>   lda t0,_main.00016[gp]
>   orm t0,_main.00016
>   sto t0,0[sp]
>   bsr _puts
> .00041:
> .00040:
>   ldi t1,10
>   iblt s0,t1,.00038
> .00039:
> .00037:
>   leave 2,16
>     .type    _main,@function
>     .size    _main,$-_main
>


Re: Misc: Preliminary (actual) performance: BJX2 vs RV64

From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Misc: Preliminary (actual) performance: BJX2 vs RV64
Date: Fri, 2 Feb 2024 19:39:47 +0000
Organization: novaBBS
Message-ID: <f5cd7b188a79f0778bc822df165cced2@www.novabbs.com>
 by: MitchAlsup - Fri, 2 Feb 2024 19:39 UTC

BGB wrote:

> On 2/1/2024 5:04 AM, Robert Finch wrote:
>>
>> integer main(integer argc, char* argv[])
>> begin
>>     integer x;
>>
>>     for (x = 1; x < 10; x++) begin
>>         if (argc > 10 and argc < 12 or argc==52)
>>             puts("Hello World!\n");
>>     end
>> end
>>
>>     .sdreg    29
>> _main:
>>   enter 2,32
>>   ldo s1,32[fp]
>> ; for (x = 1; x < 10; x++) begin
>>   ldi s0,1
>>   ldi t1,10
>>   bge s0,t1,.00039
>> .00038:
>> ; if (argc > 10 and argc < 12 or argc==52)
>>   zsgt t1,s1,10,1
>>   zslt t2,s1,12,1
>>   zseq t3,s1,52,1
>>   and_or t0,t1,t2,t3
>>   beqz t0,.00041
>> ; puts("Hello World!\n");
>>   sub sp,sp,8
>>   lda t0,_main.00016[gp]
>>   orm t0,_main.00016
>>   sto t0,0[sp]
>>   bsr _puts
>> .00041:
>> .00040:
>>   ldi t1,10
>>   iblt s0,t1,.00038
>> .00039:
>> .00037:
>>   leave 2,16
>>     .type    _main,@function
>>     .size    _main,$-_main
>>

> Hmm...

> Possible I guess, but 4R ALU ops isn't something my CPU can do as-is,
> and I am not sure it would be used enough to make it worthwhile.

> Though, did go and try a different strategy:
> I noted while skimming the SiFive S76 docs that it specified some
> constraints on the timing of various ops. Memory Load timing depended on
> what was being loaded, as did ALU timing.

> This gave me an idea.

> I could add a "fast path" to the L1 cache where, if the memory access
> satisfied certain requirements, it would be reduced to 2 cycle latency:
> Aligned-Only, 32 or 64 bit Load;
> Normal RAM access (not MMIO or similar);
> Does not trigger a "read-after-write" dependency;
> ...
> This case allowing for cheaper memory access logic which doesn't kill
> the timing (if the result is forwarded directly to the pipeline).

The above was a question posed to me while interviewing with HP in 1988.

The right answer is:: "Do nothing that harms the frequency of the pipeline".
{{Which you may or may not be doing to yourself}}

The second right answer is:: "Do nothing that adds 1 to the exponent
of test vector complexity". {{Which you invariably are doing to yourself}}

> Basically, in this case, the L1D$ has an alternate output that is
> directed to EX2 with a flag that encodes whether the value is valid. It
> does not replace the logic in EX3, mostly because (unless something has
> gone terribly wrong), both should always give the same output value.

> Also an alternate "fast case ALU", which reduces ALU to 1-cycle for a
> few common cases:
> ADD{S/U}L, SUB{S/U}L
> ADD/SUB if the input values fall safely into signed 32-bit range.
> Currently +/- 2^30, as this can't overflow the signed 32-bit.
> Skips 64-bit mostly because low-latency 64-bit ADD is harder.
> AND/OR/XOR
> These handle full 64-bit though.

> Currently, ignores all the other operations, and currently applies only
> to Lane 1. As with Load, it doesn't modify the logic in EX2 mostly
> because both should always produce the same result.
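The "fast case ALU" gate described above can be modeled in a few lines. Reading "fall safely into signed 32-bit range ... +/- 2^30" as a strict magnitude bound below 2^30 (an assumption on my part), the sum or difference of two eligible inputs can never overflow signed 32 bits:

```python
# Model of the 1-cycle ALU eligibility test: both inputs inside +/- 2^30
# means ADD/SUB results always fit in signed 32 bits, so the fast path
# cannot produce a different answer than the full 2-cycle ALU.

LIMIT = 1 << 30     # 2^30

def fast_path_ok(a, b):
    # Eligibility check for the 1-cycle ADD/SUB path (assumed strict).
    return abs(a) < LIMIT and abs(b) < LIMIT

def fits_s32(v):
    return -(1 << 31) <= v < (1 << 31)

# Worst cases at the edge of eligibility still fit in signed 32 bits:
for a, b in [(LIMIT - 1, LIMIT - 1), (-(LIMIT - 1), -(LIMIT - 1)),
             (LIMIT - 1, -(LIMIT - 1))]:
    assert fast_path_ok(a, b)
    assert fits_s32(a + b) and fits_s32(a - b)
```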

> ....

Re: Misc: Preliminary (actual) performance: BJX2 vs RV64

From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Preliminary (actual) performance: BJX2 vs RV64
Date: Fri, 2 Feb 2024 16:34:00 -0600
Organization: A noiseless patient Spider
Lines: 302
Message-ID: <upjql1$2pp6d$1@dont-email.me>
In-Reply-To: <f5cd7b188a79f0778bc822df165cced2@www.novabbs.com>
Content-Language: en-US
 by: BGB - Fri, 2 Feb 2024 22:34 UTC

On 2/2/2024 1:39 PM, MitchAlsup wrote:
> BGB wrote:
>
>> On 2/1/2024 5:04 AM, Robert Finch wrote:
>>>
>>> integer main(integer argc, char* argv[])
>>> begin
>>>      integer x;
>>>
>>>      for (x = 1; x < 10; x++) begin
>>>          if (argc > 10 and argc < 12 or argc==52)
>>>              puts("Hello World!\n");
>>>      end
>>> end
>>>
>>>      .sdreg    29
>>> _main:
>>>    enter 2,32
>>>    ldo s1,32[fp]
>>> ; for (x = 1; x < 10; x++) begin
>>>    ldi s0,1
>>>    ldi t1,10
>>>    bge s0,t1,.00039
>>> .00038:
>>> ; if (argc > 10 and argc < 12 or argc==52)
>>>    zsgt t1,s1,10,1
>>>    zslt t2,s1,12,1
>>>    zseq t3,s1,52,1
>>>    and_or t0,t1,t2,t3
>>>    beqz t0,.00041
>>> ; puts("Hello World!\n");
>>>    sub sp,sp,8
>>>    lda t0,_main.00016[gp]
>>>    orm t0,_main.00016
>>>    sto t0,0[sp]
>>>    bsr _puts
>>> .00041:
>>> .00040:
>>>    ldi t1,10
>>>    iblt s0,t1,.00038
>>> .00039:
>>> .00037:
>>>    leave 2,16
>>>      .type    _main,@function
>>>      .size    _main,$-_main
>>>
>
>> Hmm...
>
>> Possible I guess, but 4R ALU ops isn't something my CPU can do as-is,
>> and I am not sure it would be used enough to make it worthwhile.
>
>
>> Though, did go and try a different strategy:
>> I noted while skimming the SiFive S76 docs that it specified some
>> constraints on the timing of various ops. Memory Load timing depended
>> on what was being loaded, as did ALU timing.
>
>> This gave me an idea.
>

Basically, it had specified:
32 and 64 bit loads may be 2 or 3 cycles, depending on various stuff;
8 and 16 bit loads were 3 cycle.

Though, the SiFive cores appear to be aligned-only internally (with
unaligned cases triggering a severe performance penalty).

>
>> I could add a "fast path" to the L1 cache where, if the memory access
>> satisfied certain requirements, it would be reduced to 2 cycle latency:
>>    Aligned-Only, 32 or 64 bit Load;
>>    Normal RAM access (not MMIO or similar);
>>    Does not trigger a "read-after-write" dependency;
>>    ...
>> This case allowing for cheaper memory access logic which doesn't kill
>> the timing (if the result is forwarded directly to the pipeline).
>
> The above was a question posed to me while interviewing with HP in 1988.
>
> The right answer is:: "Do nothing that harms the frequency of the
> pipeline".
> {{Which you may or may not be doing to yourself}}
>

Reducing the latency in this way isn't ideal for LUT cost or timing, but
it's not like I can get my core much faster than 50 MHz anyway, so...

Supporting a subset of aligned-only 32/64 bit accesses with a shortcut
does at least offer a performance advantage (and is more viable than
trying to get the general case down to 2 cycles, which is almost
guaranteed to blow the timing constraints).

Though, yeah, the L1 shortcut and "fast ALU" do add roughly 4k LUTs to
the cost of the CPU core. I suspect some of this cost may be that the
register forwarding path seems to mass duplicate any combinatorial logic
which is connected to it (but, the only way to avoid doing this being to
have 2c ALU and 3c Load, so, ...).

> The second right answer is:: "Do nothing that adds 1 to the
> exponent of test vector complexity". {{Which you invariably are doing to
> yourself}}
>

Well, if there is one good point to messing around with core mechanisms
of the CPU pipeline, it is that if I screw something up, the core will
typically blow up almost immediately in simulation, making it easier to
debug.

Much harder to identify bugs which may take hours of simulation time
before they manifest (or, an unidentified bug where after several days
of running the Quake demo loop, Quake will seemingly try to jump to a
NULL address and crash; but this bug seemingly does not manifest in the
emulator).

Have also observed that the C version of my RP2 decoder breaks in both
the simulation and emulator in RV64 mode; however, I had noted that the
same bug may also appear in an x86-64 build with GCC, and seems to
depend on optimization level and if/when some variables are zeroed. I
think this may be more a case of "something in the code is playing badly
with GCC" though (but not yet identified any "smoking gun" in terms of
UB, using "memcpy()" in place of pointer derefs does not fix the issue,
but was the source of me realizing that GCC inlines memcpy on RV64 using
byte load/store).

Bug seemingly goes away with "-O0" in GCC, but then Doom is unbearably
slow (runs at single-digit speeds). Partial workaround for the RV case
for now being to use the original uncompressed Doom WADs.

But, yeah, my core could be simpler...

Now supporting the common superset of both BJX2 and RV64G (excluding
privileged spec) probably doesn't exactly help.

Though, as noted, despite now extending to RV64G, the BJX2 core still
does not have separate FPRs, but instead the decoder just sorta maps
RV64's FPR's to R32-R63 ...

Some of this does reveal things it might have made sense to do differently
in retrospect, say:
Treating plain ALU ops, compare ops, and conversion ops as 3 different
entities (as opposed to all being lumped under the ALU umbrella, which
needs separate ALU modules for Lanes 1/2/3 because the ALU has a lot of
logic in Lane 1 that is N/A for Lanes 2 and 3, ...).

Say:
  ALU: does exclusively ADD/SUB / AND/OR/XOR and closely related operations.
  CMP: does integer and FPU comparison.
    Ideally with more orthogonal handling of SR.T or GPR output.
    As-is, the output-handling part is a little messy.
  CNV: does type conversion (likely always 2 cycle).
    Don't really need 1-cycle FP-SIMD convert or RGB555 pack/unpack, ...
  MOV: does register-MOV-like operations:
    MOV Reg, Reg
    MOV Imm, Reg
    EXTS.L and EXTU.L
    These need to be 1 cycle; most other converter ops can remain 2 cycle.

As-is, probably my BJX2 ISA design is bigger and more complex than ideal.

Might have been better if some things were more orthogonal, but
eliminating some cases in favor of orthogonal alternatives requires
having an architectural zero register (with its own pros/cons).

But, my redesign attempts tend to be prone to losing PrWEX, which
although not highly used, is at least "still useful".

Some amount of the listing is used up by cruft from short-lived
experimental features.

For example, the 48-bit ALU ops turned out to be a bit of a dud:
Both unexpectedly expensive for the CPU core, and not offering much of a
performance advantage over the prior workarounds for using 64-bit ALU
ops (such as doing a 64-bit subtract and then sign-extending the result
from 48 to 64 bits).

Granted, this does still leave the annoyance that one either uses
zero-extended pointers in C, or needs to manually work-around the
tagging if bounds-checking is enabled, and leaves a mismatch between
bounds-checked and non-bounds-checked code.

Where, say, relative pointer comparison, no bounds checking:
CMPQGT R5, R4
With bounds-checking:
SUB R4, R5, R2
MOVST R2, R2 //48-bit sign extension
CMPQGT 0, R2
Vs:
CMPPGT R5, R4 //Ignoring high 16 bits

But, despite the overhead of 2 extra ops, the relative performance
impact on code seems to be fairly modest.
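The bounds-checked compare sequence above (SUB; MOVST; CMPQGT 0) works because sign-extending the difference from 48 to 64 bits discards the tag bits before the ordering test. A model of that, assuming the SUB computes R4 - R5 and that the tag occupies the high 16 bits (both assumptions on my part):

```python
# 48-bit sign extension (the MOVST step) applied to a pointer
# difference, so tag bits in the high 16 bits don't affect ordering.

def sext48(v):
    v &= (1 << 48) - 1
    return v - (1 << 48) if v & (1 << 47) else v

def ptr_gt(p4, p5):
    # Models: SUB R4,R5,R2 ; MOVST R2,R2 ; CMPQGT 0,R2
    return sext48(p4 - p5) > 0

TAG = 0x1234 << 48          # hypothetical 16-bit pointer tag
a, b = 0x10000, 0x0F000     # a > b as 48-bit addresses

assert ptr_gt(TAG | a, TAG | b) is True
assert ptr_gt(TAG | b, TAG | a) is False
```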

Though, as-is, a similar annoyance comes up if comparing function
pointers, which remain tagged even without bounds-checking. Did tweak
the rules for some ops though such that at least function pointers will
always give the same value for the same CPU mode (so == and != work as
expected).

Does mean there is wonk though if wanting to use relative comparisons of
function pointers between ISA modes, or trying to use a function-pointer
as a base address to access memory, but these are mostly non-issues in
practice.

>> Basically, in this case, the L1D$ has an alternate output that is
>> directed to EX2 with a flag that encodes whether the value is valid.
>> It does not replace the logic in EX3, mostly because (unless something
>> has gone terribly wrong), both should always give the same output value.
>
>
>> Also an alternate "fast case ALU", which reduces ALU to 1-cycle for a
>> few common cases:
>>    ADD{S/U}L, SUB{S/U}L
>>    ADD/SUB if the input values fall safely into signed 32-bit range.
>>      Currently +/- 2^30, as this can't overflow the signed 32-bit.
>>      Skips 64-bit mostly because low-latency 64-bit ADD is harder.
>>    AND/OR/XOR
>>      These handle full 64-bit though.
>

