Rocksolid Light



devel / comp.arch / Misc: Preliminary (actual) performance: BJX2 vs RV64

Subject (Author)
* Misc: Preliminary (actual) performance: BJX2 vs RV64  (BGB)
+- Re: Misc: Preliminary (actual) performance: BJX2 vs RV64  (BGB)
`* Re: Misc: Preliminary (actual) performance: BJX2 vs RV64  (MitchAlsup1)
 `* Re: Misc: Preliminary (actual) performance: BJX2 vs RV64  (BGB)
  `* Re: Misc: Preliminary (actual) performance: BJX2 vs RV64  (MitchAlsup1)
   `* Re: Misc: Preliminary (actual) performance: BJX2 vs RV64  (BGB)
    +* Re: Misc: Preliminary (actual) performance: BJX2 vs RV64  (Thomas Koenig)
    |`* Re: Misc: Preliminary (actual) performance: BJX2 vs RV64  (BGB)
    | `* Re: Misc: Preliminary (actual) performance: BJX2 vs RV64  (Anton Ertl)
    |  `* Re: Misc: Preliminary (actual) performance: BJX2 vs RV64  (BGB-Alt)
    |   `* Re: Misc: Preliminary (actual) performance: BJX2 vs RV64  (MitchAlsup1)
    |    `* Re: Misc: Preliminary (actual) performance: BJX2 vs RV64  (BGB-Alt)
    |     +* Re: Misc: Preliminary (actual) performance: BJX2 vs RV64  (Robert Finch)
    |     |+* Re: Misc: Preliminary (actual) performance: BJX2 vs RV64  (MitchAlsup)
    |     ||`- Re: Misc: Preliminary (actual) performance: BJX2 vs RV64  (Robert Finch)
    |     |`- Re: Misc: Preliminary (actual) performance: BJX2 vs RV64  (BGB)
    |     `* Re: Misc: Preliminary (actual) performance: BJX2 vs RV64  (MitchAlsup)
    |      `* Re: Misc: Preliminary (actual) performance: BJX2 vs RV64  (BGB)
    |       `* Re: Misc: Preliminary (actual) performance: BJX2 vs RV64  (Robert Finch)
    |        `* Re: Misc: Preliminary (actual) performance: BJX2 vs RV64  (BGB)
    |         `* Re: Misc: Preliminary (actual) performance: BJX2 vs RV64  (Robert Finch)
    |          `* Re: Misc: Preliminary (actual) performance: BJX2 vs RV64  (BGB)
    |           `* Re: Misc: Preliminary (actual) performance: BJX2 vs RV64  (MitchAlsup)
    |            `- Re: Misc: Preliminary (actual) performance: BJX2 vs RV64  (BGB)
    `* Re: Misc: Preliminary (actual) performance: BJX2 vs RV64  (MitchAlsup1)
     `* Re: Misc: Preliminary (actual) performance: BJX2 vs RV64  (BGB)
      `* Re: Misc: Preliminary (actual) performance: BJX2 vs RV64  (MitchAlsup1)
       `- Re: Misc: Preliminary (actual) performance: BJX2 vs RV64  (BGB)

Misc: Preliminary (actual) performance: BJX2 vs RV64

<uoimv3$4ahu$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=36984&group=comp.arch#36984

Path: i2pn2.org!i2pn.org!news.samoylyk.net!newsfeed.xs3.de!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Misc: Preliminary (actual) performance: BJX2 vs RV64
Date: Sun, 21 Jan 2024 03:08:48 -0600
Organization: A noiseless patient Spider
Lines: 133
Message-ID: <uoimv3$4ahu$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 21 Jan 2024 09:08:51 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="2c6621794b7bd50877ce9ff2ee094aea";
logging-data="141886"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+jg/eVnaVJLIfEoMDVhQW6"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:G7CyWoPJG3rMbEyQO0tsDiKLVs8=
Content-Language: en-US
 by: BGB - Sun, 21 Jan 2024 09:08 UTC

I have now gotten around to fully implementing the ability to boot BJX2
into RISC-V mode.

Though, this part wasn't the hard part; rather, it was porting most of
TestKern to be able to build on RISC-V (some parts are still stubbed
out, so using it as a kernel in RV Mode will not yet be possible, but
enough was ported at least to be able to run programs "bare metal" in
RV64 Mode).

Both are using more or less the same C library (TestKern + modified
PDPCLIB).

For the BJX2 side, things are compiled with BGBCC.
For the RISC-V side, GCC 12.2.0 (riscv64-unknown-elf, RV64IMA).

This allows more accurate comparison than, say, on paper analysis or
comparing results between different emulators.

So, first program tested was Doom, with preliminary results (average
framerate):
RV -O3 18.1
RV -Os 15.5
XG2 21.6
This is from running the first 3 demos and stopping at the same spot.

Both give "similar" MIPs values, but the mix differs:
BJX2: Dominated by memory Load/Store followed by branches;
RISC-V: Dominated by ALU operations (particularly ADD and Shift).
Load/Store, and Branches, are a little down the list.

RV64 has a lot fewer SP-relative loads/stores compared with BJX2,
despite having fewer GPRs.

Meanwhile, ADD and SLLI seem to be the top two instructions used in
RISC-V (I will still continue to blame the lack of register-indexed
load/store on this one...).

It does seem to suffer more from spending a higher percentage of its
time with interlocks, particularly with ALU operations (doesn't seem
like a great situation to have 2-cycle latency on ADD and Shift
instructions...).

I had expected RV64 to win for Dhrystone, as some earlier tests (albeit
not running in my emulator) had implied that "GCC magic" would kick in
and make Dhrystone fast.

Actual testing did not agree.

Initial tests:
XG2 : 61538 (0.70 DMIPS/MHz)
RV64: 40816 (0.46 DMIPS/MHz).

The score for BJX2 has actually dropped a fair bit for some reason.
In the past, I had gotten it up to around 79k, but it has since dropped.
I suspect this may be a case of the optimizations that work for Doom not
necessarily being best for Dhrystone (well, also, various instruction
latency values had been increased as well).

However, this was "suspiciously bad" on RV64's part. It seemed that
performance was getting wrecked pretty badly by falling back to naive
character-by-character implementations of "strcpy()" and "strcmp()".

Switched these out for some less generic logic that works 8 bytes at a time:
RV64: 50632 (0.57 DMIPS/MHz)

This is at least more in-line with the Doom results.

General speedup was based on noting that one can do:
  li=*(uint64_t *)cs;
  lj=(li|(li+0x7F7F7F7F7F7F7F7FULL))&0x8080808080808080ULL;
  while(lj==0x8080808080808080ULL)
    ...
As basically a way of detecting the presence/absence of a NUL byte, far
faster than reading each character and checking it against NUL.
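As a compilable sketch of the trick (hypothetical helper, not the post's
actual library code; it assumes plain ASCII data, since a carry out of a
byte >= 0x81 can mask a following NUL, and assumes reading up to 7 bytes
past the terminator is safe, as such library replacements effectively do):

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Hypothetical strlen variant using the word-at-a-time NUL test above.
   memcpy sidesteps alignment and strict-aliasing issues. */
static size_t strlen8(const char *cs)
{
    const char *s = cs;
    uint64_t li, lj;
    for (;;) {
        memcpy(&li, s, 8);
        lj = (li | (li + 0x7F7F7F7F7F7F7F7FULL)) & 0x8080808080808080ULL;
        if (lj != 0x8080808080808080ULL)
            break;              /* some byte in this chunk is NUL */
        s += 8;
    }
    while (*s)                  /* find the exact NUL within the chunk */
        s++;
    return (size_t)(s - cs);
}
```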

Comparing other stats (Dhrystone):
XG2:
Bundle size: 1.10
MIPs : 25.1
Interlock : 12.81%
Cache Miss : 14.4% (Cycles)
L1 Hit Rate: 96.6%
Average Trace Length: 4.9 ops.
Mem Access : 23.1% (Total Cycles)
Branch Miss: 0.1% (Total Cycles)

RV64:
Bundle Size: 1.00
MIPs : 21.7
Interlock : 12.08%
Cache Miss : 3.5% (Cycles)
L1 Hit Rate: 99.0%
Average Trace Length: 4.8 ops.
Mem Access : 6.8% (Total Cycles)
Branch Miss: 4.8% (Total Cycles)

Here, RV64 seems to be spending less of its cycles accessing memory, and
more time running ALU ops and branching. BJX2 seems to be spending more
cycles on memory access instructions.

In this case, RV64 also seems to lose a big chunk of cycles doing the
slower 64-bit multiply rather than a 32-bit widening multiply (which
doesn't exist in RV64). This is likely where a big chunk of the cycles
is going (but the stats don't currently track a "time spent in
high-latency ops" case). It also seems to be spending more cycles
running DIV ops (seemingly using multiply-by-reciprocal sparingly).

Have noted also that it tends to turn constants into memory loads rather
than encode them inline.

Granted, BJX2 does seem to still have a lot more stack spill-and-fill
than RV64 despite having twice as many GPRs. This is more an issue with
BGBCC though.

....

In any case, in some ways, closer than I would have expected.

RISC-V is still winning for a smaller ".text" section, albeit, not as
much for performance.

....

Any thoughts?...

Re: Misc: Preliminary (actual) performance: BJX2 vs RV64

<uojqpi$bbah$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=36989&group=comp.arch#36989

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Preliminary (actual) performance: BJX2 vs RV64
Date: Sun, 21 Jan 2024 13:20:16 -0600
Organization: A noiseless patient Spider
Lines: 300
Message-ID: <uojqpi$bbah$1@dont-email.me>
References: <uoimv3$4ahu$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 21 Jan 2024 19:20:18 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="2c6621794b7bd50877ce9ff2ee094aea";
logging-data="372049"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+wjOofM0IXSV+4unmOuxyp"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:4HULaSw12dUFxO/PEC1IsSYS/xQ=
Content-Language: en-US
In-Reply-To: <uoimv3$4ahu$1@dont-email.me>
 by: BGB - Sun, 21 Jan 2024 19:20 UTC

On 1/21/2024 3:08 AM, BGB wrote:
> I have now gotten around to fully implementing the ability to boot BJX2
> into RISC-V mode.
>

Technically, the "boot" part was just fixing a few bugs in the Boot
ROM's ELF loader...

Also fixed a few other bugs in the emulator:
JAL X0, Disp
was, due to an edge case elsewhere in the emulator, still behaving as a
PC+4 branch, which was incorrect.

Also:
The SLT/SLTU/SLTI/SLTIU instructions were missing;
The MULW instruction was doing the wrong kind of multiply;
...

May see what happens if I attempt a bare metal boot in the Verilog
version. Should hopefully work, unless there are more bugs there as well
(very possible, hasn't been tested much for non-trivial code sequences).

For the bare metal boot, also needed to use a copy/paste edit of the
RV64 linker script, as by default it tried to start binaries at
0x00010000, which isn't valid RAM in my memory map.

Needed to modify the script to load at:
0x01100000
Where, RAM starts at:
0x01000000
But, generally the first 1MB is used for the boot-time stack, so:
0x01000000..0x010FFFF0: Boot Stack
0x01100000..0x011xxxxx: Kernel ".text" and ".data"/".bss"
...
The RAM following ".bss" to the end of the RAM space is generally first
grabbed by the page allocator, with kernel malloc implemented on top of
this.

If virtual memory is enabled (N/A for RV64 Mode ATM), another 32MB or
64MB chunk is allocated (32MB for 128MB of RAM, 64MB for 256MB of RAM),
and then used for the pagefile backed virtual memory.

Much of the rest is left for physical or direct-mapped ranges.
Physical:
Basically just allocating raw memory pages in physical memory;
Generally only accessible in supervisor and superuser mode.
Direct Mapped:
Part of the virtual address space, but not backed by the pagefile;
Basically, like virtual memory that will not be paged out (*).

*: For "reasons", have generally ended up needing to use this for
executable sections and program stacks. Generally, all the data/bss and
heap stuff can be put into normal virtual memory without issue.

> Though, this part wasn't the hard-part, rather, more, porting most of
> TestKern to be able to build on RISC-V (some parts are still stubbed
> out, so using it as a kernel in RV Mode will not yet be possible, but
> got enough ported at least to be able to run programs "bare metal" in
> RV64 Mode).
>

The stuff for the interrupt handlers is currently missing, so this means
no task switching, TLB miss handling, or SYSCALL interrupts.

However... If these did work, there would still be a problem:
RV64 Mode won't be able to host BJX2 programs, as
RV64 Mode doesn't have all of the registers that BJX2 programs use.
Absent being able to load programs anywhere in the address space, it
still can't load RV64 images either.

Granted, I may look into switching from "riscv64-unknown-elf" (with
RV64IMA) to "riscv64-unknown-linux-gnu" with RV64G, where theoretically
telling GCC that it is building for Linux and glibc will re-enable its
use of shared-objects and PIC/PIE binaries.

Otherwise, with "unknown-elf", can tell GCC to make binaries that still
contain ELF relocs, but... Need to go up the learning curve about ELF
relocs and how to get the image relocated on load to an arbitrary
location in the address space.

Might have preferred actually if GCC had supported ELF FDPIC on RISC-V.
Or, say, if it had supported base-relocatable PE/COFF for this target.
Both apparently existed as options for SuperH, but seemingly GCC only
supports limited combinations of object/binary format and target
architecture.

The main issue is that, without the ability to dynamically rebase the
binaries, one will need to put each binary instance in its own virtual
address space, which is undesirable.

But, yeah, all this is more a software-side issue, rather than a CPU/ISA
issue...

It was mostly this issue that had put a roadblock on things in the past,
as merely booting into RV64 mode didn't seem terribly useful. But, does
at least allow verifying that this stuff works, and getting performance
measurements.

Likely running native RV64 would be better served though by a CPU
actually designed for running RISC-V.

> Both are using more or less the same C library (TestKern + modified
> PDPCLIB).
>
> For the BJX2 side, things are compiled with BGBCC.
>   For the RISC-V side, GCC 12.2.0 (riscv64-unknown-elf, RV64IMA).
>
> This allows more accurate comparison than, say, on paper analysis or
> comparing results between different emulators.
>
>
> So, first program tested was Doom, with preliminary results (average
> framerate):
>   RV -O3  18.1
>   RV -Os  15.5
>   XG2     21.6
> This is from running the first 3 demos and stopping at the same spot.
>
> Both give "similar" MIPs values, but the mix differs:
>   BJX2: Dominated by memory Load/Store followed by branches;
>   RISC-V: Dominated by ALU operations (particularly ADD and Shift).
>     Load/Store, and Branches, are a little down the list.
>
> RV64 has a lot fewer SP-relative loads/stores compared with BJX2,
> despite having fewer GPRs.
>

Part of the weaker performance is, I suspect, because the design doesn't
use any "clever trickery" to try to compensate for RV64's design
deficiencies.

The performance deltas here are generally larger than what could be
attributed to the WEXifier (or its absence in RV64 mode).

Could maybe add an experimental "what if RV64 superscalar were
supported?" option, but I suspect it won't make that big of a difference
in this case.

In any case, GCC does appear to do a good job at doing what it can with
what it has to work with.

But, generally, its performance appears not to beat an arguably "more
capable" ISA design with a comparably poor compiler (it is now looking
like the delta might be larger if I could eliminate more of the
stack spills and reduce the number of registers being saved/restored).

Though, presumably, the number of local variables and temporaries isn't
that much different between the ISAs (if starting from the same C code).

>
> Meanwhile, ADD and SLLI seem to be the top two instructions used in
> RISC-V (I will still continue to blame the lack of register-indexed
> load/store on this one...).
>
> It does seem to suffer more from spending a higher percentage of its
> time with interlocks, particularly with ALU operations (doesn't seem
> like a great situation to have 2-cycle latency on ADD and Shift
> instructions...).
>
>
> I had expected RV64 to win for Dhrystone, as some earlier tests (albeit,
> not running in my emulator) had implied that "GCC magic" would kick in
> and make Dhystone fast.
>
> Actual testing, did not agree.
>
> Initial tests:
>   XG2 : 61538 (0.70 DMIPS/MHz)
>   RV64: 40816 (0.46 DMIPS/MHz).
>
> The score for BJX2 has actually dropped a fair bit for some reason.
> In the past, had gotten it up to around 79k, but has dropped.
> I suspect this may be a case of, what optimizations work for Doom, are
> not necessarily best for Dhystone (well, also, various instruction
> latency values had been increased as well).
>
>
> However, this was "suspiciously bad" on RV64's part. It seemed that
> performance was getting wrecked pretty bad by falling back to naive
> character-by-character implementations of "strcpy()" and "strcmp()".
>
> Switched these out for some less generic logic that works 8 bytes at a
> time:
>   RV64: 50632 (0.57 DMIPs/MHz)
>
> This is at least more in-line with the Doom results.
>
> General speedup was based on noting that one can do:
>   li=(uint64_t *)cs;
>   lj=(li|(li+0x7F7F7F7F7F7F7F7FULL))&0x8080808080808080ULL;
>   while(lj==0x8080808080808080ULL)
>     ...
> As basically a way of detecting the presence/absence of a NUL byte, for
> faster than reading each character and checking against NUL.
>

Can also note that GCC is clever enough to load the constants into
registers in advance. I may need to do this optimization manually (using
variables), as BGBCC does not optimize this case, and seems to encode a
constant-load into a temporary register each time the constant is used
(so, it seems for "strcpy()", this optimization helped RV64 but slightly
hurt BJX2's score; as well as burning 12 bytes of code space for each
instance of the constant).

Have also noted that for things like MMIO addresses, etc, GCC seems to
aggregate constants across all of the functions, rather than encode them
inline. So, as a cost, it involves loading constants from memory, but as
a benefit, many of these constants are reduced to only needing a single
32-bit instruction word (and if the constant is used multiple times, it
is kept pinned in a register, rather than reloaded each time it is used).


Re: Misc: Preliminary (actual) performance: BJX2 vs RV64

<cdf7ed10b08f1665d4741f20c1c7dd06@www.novabbs.org>


https://news.novabbs.org/devel/article-flat.php?id=36996&group=comp.arch#36996

Date: Sun, 21 Jan 2024 21:22:50 +0000
Subject: Re: Misc: Preliminary (actual) performance: BJX2 vs RV64
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$S3OGHjn333G.HhwWDiCLYeOfYSZHINUoaVVdb6M4xli8ZU.KS2uJ2
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <uoimv3$4ahu$1@dont-email.me>
Organization: Rocksolid Light
Message-ID: <cdf7ed10b08f1665d4741f20c1c7dd06@www.novabbs.org>
 by: MitchAlsup1 - Sun, 21 Jan 2024 21:22 UTC

BGB wrote:

> I have now gotten around to fully implementing the ability to boot BJX2
> into RISC-V mode.

> Though, this part wasn't the hard-part, rather, more, porting most of
> TestKern to be able to build on RISC-V (some parts are still stubbed
> out, so using it as a kernel in RV Mode will not yet be possible, but
> got enough ported at least to be able to run programs "bare metal" in
> RV64 Mode).

> Both are using more or less the same C library (TestKern + modified
> PDPCLIB).

> For the BJX2 side, things are compiled with BGBCC.
> For the RISC-V side, GCC 12.2.0 (riscv64-unknown-elf, RV64IMA).

> This allows more accurate comparison than, say, on paper analysis or
> comparing results between different emulators.

> So, first program tested was Doom, with preliminary results (average
> framerate):
> RV -O3 18.1
> RV -Os 15.5
> XG2 21.6
> This is from running the first 3 demos and stopping at the same spot.

> Both give "similar" MIPs values, but the mix differs:
> BJX2: Dominated by memory Load/Store followed by branches;
> RISC-V: Dominated by ALU operations (particularly ADD and Shift).
> Load/Store, and Branches, are a little down the list.

> RV64 has a lot fewer SP-relative loads/stores compared with BJX2,
> despite having fewer GPRs.

> Meanwhile, ADD and SLLI seem to be the top two instructions used in
> RISC-V (I will still continue to blame the lack of register-indexed
> load/store on this one...).

> It does seem to suffer more from spending a higher percentage of its
> time with interlocks, particularly with ALU operations (doesn't seem
> like a great situation to have 2-cycle latency on ADD and Shift
> instructions...).

You might be the first person with a RISC-V that has 2 cycle ADDs.

Re: Misc: Preliminary (actual) performance: BJX2 vs RV64

<uok4v6$csm5$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=37005&group=comp.arch#37005

Path: i2pn2.org!rocksolid2!news.neodome.net!news.mixmin.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Preliminary (actual) performance: BJX2 vs RV64
Date: Sun, 21 Jan 2024 16:13:57 -0600
Organization: A noiseless patient Spider
Lines: 91
Message-ID: <uok4v6$csm5$1@dont-email.me>
References: <uoimv3$4ahu$1@dont-email.me>
<cdf7ed10b08f1665d4741f20c1c7dd06@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 21 Jan 2024 22:13:58 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="2c6621794b7bd50877ce9ff2ee094aea";
logging-data="422597"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX189BSEnk/uFg1SOT5Un9sjC"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:vfLuEJJWlgACZPWs2drHGaGYKaI=
Content-Language: en-US
In-Reply-To: <cdf7ed10b08f1665d4741f20c1c7dd06@www.novabbs.org>
 by: BGB - Sun, 21 Jan 2024 22:13 UTC

On 1/21/2024 3:22 PM, MitchAlsup1 wrote:
> BGB wrote:
>
>> I have now gotten around to fully implementing the ability to boot
>> BJX2 into RISC-V mode.
>
>> Though, this part wasn't the hard-part, rather, more, porting most of
>> TestKern to be able to build on RISC-V (some parts are still stubbed
>> out, so using it as a kernel in RV Mode will not yet be possible, but
>> got enough ported at least to be able to run programs "bare metal" in
>> RV64 Mode).
>
>> Both are using more or less the same C library (TestKern + modified
>> PDPCLIB).
>
>> For the BJX2 side, things are compiled with BGBCC.
>>    For the RISC-V side, GCC 12.2.0 (riscv64-unknown-elf, RV64IMA).
>
>> This allows more accurate comparison than, say, on paper analysis or
>> comparing results between different emulators.
>
>
>> So, first program tested was Doom, with preliminary results (average
>> framerate):
>>    RV -O3  18.1
>>    RV -Os  15.5
>>    XG2     21.6
>> This is from running the first 3 demos and stopping at the same spot.
>
>> Both give "similar" MIPs values, but the mix differs:
>>    BJX2: Dominated by memory Load/Store followed by branches;
>>    RISC-V: Dominated by ALU operations (particularly ADD and Shift).
>>      Load/Store, and Branches, are a little down the list.
>
>> RV64 has a lot fewer SP-relative loads/stores compared with BJX2,
>> despite having fewer GPRs.
>
>
>> Meanwhile, ADD and SLLI seem to be the top two instructions used in
>> RISC-V (I will still continue to blame the lack of register-indexed
>> load/store on this one...).
>
>> It does seem to suffer more from spending a higher percentage of its
>> time with interlocks, particularly with ALU operations (doesn't seem
>> like a great situation to have 2-cycle latency on ADD and Shift
>> instructions...).
>
> You might be the first person with a RISC-V that has 2 cycle ADDs.

Yeah, and probably not an ideal situation for RISC-V, as seemingly it is
one of the most common instructions:
MV Xd, Xs
LI Xd, Imm12s
->
ADDI Xd, Xs, 0
ADDI Xd, X0, Imm12s
....

Shift sees a lot of use as well, as it is also used both for indexed
addressing and for performing sign and zero extension.

Say:
j=(short)i;
Being, say:
SLLI X11, X10, 16
SRAI X11, X11, 16

As opposed to having dedicated instructions for a lot of these cases (as
in BJX2).
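As C on 32-bit values, the shift-pair idiom looks like this (a minimal
sketch, not from the post; the left shift goes through unsigned to avoid
signed-overflow UB, and the arithmetic right shift of a negative value
is implementation-defined in C but arithmetic on mainstream compilers):

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of i, the way the SLLI/SRAI pair does:
   shift the field to the top of the register, then arithmetic-shift
   it back down, replicating the field's sign bit. */
static int32_t sext16(int32_t i)
{
    return (int32_t)((uint32_t)i << 16) >> 16;
}
```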

Oh well...

But, if running RISC-V on the BJX2 Core, it will necessarily have the
same instruction timings as BJX2. Granted, one could argue that BJX2
would also benefit from 1-cycle ADD and Shift (where the latter was
recently relaxed to 2 cycles mostly because this increases the amount of
timing slack; and having dedicated instructions for a lot of other cases
makes the latency of these instructions less significant).

Perhaps unsurprisingly, trying to get a RISC-V build of Doom to boot on
the Verilog core is still needing a bit of debugging... (at the moment,
it is still crashing very early in start-up).

Granted, a lot of this stuff has thus far been "mostly untested" apart
from some fairly trivial code fragments...

Re: Misc: Preliminary (actual) performance: BJX2 vs RV64

<959e309e1b96f7d57a5265408ca358c3@www.novabbs.org>


https://news.novabbs.org/devel/article-flat.php?id=37008&group=comp.arch#37008

Date: Sun, 21 Jan 2024 23:40:16 +0000
Subject: Re: Misc: Preliminary (actual) performance: BJX2 vs RV64
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$Czuk8lXvPcgYopvj6m/b5O7tHLHWnCuQ58JDbWs8r9LF5ZK/KKVNS
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <uoimv3$4ahu$1@dont-email.me> <cdf7ed10b08f1665d4741f20c1c7dd06@www.novabbs.org> <uok4v6$csm5$1@dont-email.me>
Organization: Rocksolid Light
Message-ID: <959e309e1b96f7d57a5265408ca358c3@www.novabbs.org>
 by: MitchAlsup1 - Sun, 21 Jan 2024 23:40 UTC

BGB wrote:

> On 1/21/2024 3:22 PM, MitchAlsup1 wrote:
>> BGB wrote:
>>
>>> I have now gotten around to fully implementing the ability to boot
>>> BJX2 into RISC-V mode.
>>
>>> Though, this part wasn't the hard-part, rather, more, porting most of
>>> TestKern to be able to build on RISC-V (some parts are still stubbed
>>> out, so using it as a kernel in RV Mode will not yet be possible, but
>>> got enough ported at least to be able to run programs "bare metal" in
>>> RV64 Mode).
>>
>>> Both are using more or less the same C library (TestKern + modified
>>> PDPCLIB).
>>
>>> For the BJX2 side, things are compiled with BGBCC.
>>>    For the RISC-V side, GCC 12.2.0 (riscv64-unknown-elf, RV64IMA).
>>
>>> This allows more accurate comparison than, say, on paper analysis or
>>> comparing results between different emulators.
>>
>>
>>> So, first program tested was Doom, with preliminary results (average
>>> framerate):
>>>    RV -O3  18.1
>>>    RV -Os  15.5
>>>    XG2     21.6
>>> This is from running the first 3 demos and stopping at the same spot.
>>
>>> Both give "similar" MIPs values, but the mix differs:
>>>    BJX2: Dominated by memory Load/Store followed by branches;
>>>    RISC-V: Dominated by ALU operations (particularly ADD and Shift).
>>>      Load/Store, and Branches, are a little down the list.
>>
>>> RV64 has a lot fewer SP-relative loads/stores compared with BJX2,
>>> despite having fewer GPRs.
>>
>>
>>> Meanwhile, ADD and SLLI seem to be the top two instructions used in
>>> RISC-V (I will still continue to blame the lack of register-indexed
>>> load/store on this one...).
>>
>>> It does seem to suffer more from spending a higher percentage of its
>>> time with interlocks, particularly with ALU operations (doesn't seem
>>> like a great situation to have 2-cycle latency on ADD and Shift
>>> instructions...).
>>
>> You might be the first person with a RISC-V that has 2 cycle ADDs.

> Yeah, and probably not an ideal situation for RISC-V, as seemingly it is
> one of the most common instructions:
> MV Xd, Xs
> LI Xd, Imm12s
> ->
> ADDI Xd, Xs, 0
> ADDI Xd, X0, Imm12s
> ....

For move they could use OR Rd,Rs,#0 or do you have 2 cycle logicals ??

> Shift sees a lot of use as well, as it is also used for both indexed
> addressing, and for performing sign an zero extension.

> Say:
> j=(short)i;
> Being, say:
> SLLI X11, X10, 16
> SRAI X11, X11, 16

Which I do in 1 instruction
SLL R11,R10,<16,0>
{Extract the lower 16 bits at offset 0}
I started calling this a Smash -- Smash this long into a short.
This is what happens when shifts are a subset of bit manipulation.

> As opposed to having dedicated instructions for a lot of these cases (as
> in BJX2).

See; mine are not dedicated; they just as easily perform:

    struct { long i : 17,
                  j :  9,
                  k :  3,
             ... } st;
    short s = st.k;

    SLL  Rs,Rst,<3,26>
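The effect of such a <width,offset> extract can be sketched in C (a
hypothetical rendering, not the actual My 66000 semantics; note k sits
at bit offset 26 after the 17- and 9-bit fields, and the arithmetic
right shift of a negative value is implementation-defined in C):

```c
#include <stdint.h>

/* Hypothetical signed <width,offset> extract: move the field to the
   top of the register, then arithmetic-shift it back down, which
   sign-extends it. Requires 1 <= width and offset+width <= 64. */
static int64_t extract_signed(uint64_t x, unsigned offset, unsigned width)
{
    return (int64_t)(x << (64 - offset - width)) >> (64 - width);
}
```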

> Oh well...

I have the director of Northern Telecom circa 1984 to thank for this. I
BLEW the 88K implementation by putting the two 5-bit fields back to back
and used the 16-bit immediate encoding, wasting bits and tying my hands
into the future at the same time. My 66000 has essentially the same
instructions, but the immediate form is XOM7 and uses a 12-bit immediate
field. When this pattern is decoded, the two 5-bit fields are routed
onto the Rs2 operand bus at position<37..32> and position<5..0>. No
32-bit or smaller data value (replacing the immediate) can access the
extract functionality, and the 64-bitters that can are limited to
putting SANE bit patterns there when they do. The lower field is limited
to 0..63, the upper one to 0..64, and all intermediate bits are checked
for zeros.

Re: Misc: Preliminary (actual) performance: BJX2 vs RV64

<uoknuo$j55p$2@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=37021&group=comp.arch#37021

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Preliminary (actual) performance: BJX2 vs RV64
Date: Sun, 21 Jan 2024 21:38:00 -0600
Organization: A noiseless patient Spider
Lines: 164
Message-ID: <uoknuo$j55p$2@dont-email.me>
References: <uoimv3$4ahu$1@dont-email.me>
<cdf7ed10b08f1665d4741f20c1c7dd06@www.novabbs.org>
<uok4v6$csm5$1@dont-email.me>
<959e309e1b96f7d57a5265408ca358c3@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 22 Jan 2024 03:38:01 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="c3f5f7bcf3cbeca25160f5a4be788e24";
logging-data="627897"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+KYWnJcx/nfTIwuKUKgtv/"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:3co5qHtwaOi3WrgEaiKIV6vW7bU=
In-Reply-To: <959e309e1b96f7d57a5265408ca358c3@www.novabbs.org>
Content-Language: en-US
 by: BGB - Mon, 22 Jan 2024 03:38 UTC

On 1/21/2024 5:40 PM, MitchAlsup1 wrote:
> BGB wrote:
>
>> On 1/21/2024 3:22 PM, MitchAlsup1 wrote:
>>> BGB wrote:
>>>
>>>> I have now gotten around to fully implementing the ability to boot
>>>> BJX2 into RISC-V mode.
>>>
>>>> Though, this part wasn't the hard-part, rather, more, porting most
>>>> of TestKern to be able to build on RISC-V (some parts are still
>>>> stubbed out, so using it as a kernel in RV Mode will not yet be
>>>> possible, but got enough ported at least to be able to run programs
>>>> "bare metal" in RV64 Mode).
>>>
>>>> Both are using more or less the same C library (TestKern + modified
>>>> PDPCLIB).
>>>
>>>> For the BJX2 side, things are compiled with BGBCC.
>>>>    For the RISC-V side, GCC 12.2.0 (riscv64-unknown-elf, RV64IMA).
>>>
>>>> This allows more accurate comparison than, say, on paper analysis or
>>>> comparing results between different emulators.
>>>
>>>
>>>> So, first program tested was Doom, with preliminary results (average
>>>> framerate):
>>>>    RV -O3  18.1
>>>>    RV -Os  15.5
>>>>    XG2     21.6
>>>> This is from running the first 3 demos and stopping at the same spot.
>>>
>>>> Both give "similar" MIPs values, but the mix differs:
>>>>    BJX2: Dominated by memory Load/Store followed by branches;
>>>>    RISC-V: Dominated by ALU operations (particularly ADD and Shift).
>>>>      Load/Store, and Branches, are a little down the list.
>>>
>>>> RV64 has a lot fewer SP-relative loads/stores compared with BJX2,
>>>> despite having fewer GPRs.
>>>
>>>
>>>> Meanwhile, ADD and SLLI seem to be the top two instructions used in
>>>> RISC-V (I will still continue to blame the lack of register-indexed
>>>> load/store on this one...).
>>>
>>>> It does seem to suffer more from spending a higher percentage of its
>>>> time with interlocks, particularly with ALU operations (doesn't seem
>>>> like a great situation to have 2-cycle latency on ADD and Shift
>>>> instructions...).
>>>
>>> You might be the first person with a RISC-V that has 2 cycle ADDs.
>
>> Yeah, and probably not an ideal situation for RISC-V, as seemingly it
>> is one of the most common instructions:
>>    MV Xd, Xs
>>    LI Xd, Imm12s
>> ->
>>    ADDI Xd, Xs, 0
>>    ADDI Xd, X0, Imm12s
>> ....
>
> For move they could use OR  Rd,Rs,#0 or do you have 2 cycle logicals ??
>

All of the ALU ops are 2-cycle at present.

At present, the only 1-cycle ops in the BJX2 core are:
MOV Rm, Rn
LDIx Imm, Rn
EXTS.L / EXTU.L (Sign and Zero extend a 32-bit value)

Where, ironically, the RISC-V decoder doesn't use any of these.
Could potentially special-case ADDI/ORI with 0 as MOV in the decoder.

Most other ops are 2-cycle.
Things like Load, MUL, etc, are 3-cycle.

Granted, the core is pipelined, so they will behave like 1-cycle ops if
one doesn't try to use the results immediately.

Seems like GCC assumes that a lot of these ops are 1-cycle though.

There are defines that can switch various ops back to being 1-cycle,
but doing so comes at the cost of FPGA timing.

>> Shift sees a lot of use as well, as it is also used for both indexed
>> addressing, and for performing sign an zero extension.
>
>> Say:
>>    j=(short)i;
>> Being, say:
>>    SLLI X11, X10, 16
>>    SRAI X11, X11, 16
>
> Which I do in 1 instruction
>     SLL  R11,R10,<16,0>
> {Extract the lower 16 bits at offset 0}
> I started calling this a Smash -- Smash this long into a short.
> This is what happens when shifts are subset of bit manipulation
>> As opposed to having dedicated instructions for a lot of these cases
>> (as in BJX2).
>
> See; mine are not dedicated, they just as easily perform
>
>     struct { long i : 17,
>                   j : 9,
>                   k : 3,
>                  ...      } st;
>     short s = st.k;
>
>     SLL     Rs,Rst,<3,26>
>

Possibly, if one has a big enough immediate field to encode it.
Could have made sense as a use for the 12-bit Immed fields in RISC-V,
but it can be noted that they did not do so (and chose instead to use
pairs of shifts).

In my case, I used the extra bits from the 9-bit immediate fields to
encode a few extra cases, mostly overloading shuffles with shifts and
similar.

Where the shifts are basically encoded as, say:
  000..0FF: Shift
    Understood as a signed value between -63 and +63; the values
    +/-64..127 are unused at present (or 32..127 for 32-bit shifts).
  100..1FF: Packed Shuffle

For the SHADX/SHLDX instructions, the full range is used (+/- 127).

I think, early on, I had considered using a sort of unary-coding scheme
to encode shift ranges, say:
0xxxxxxxx +/- 127 (128-bit)
10xxxxxxx +/- 63 (64-bit)
110xxxxxx +/- 31 (32-bit)

This could have allowed consolidating all 3 sizes into the same opcodes,
but IIRC I decided against this as it would have made decoding more expensive.

>> Oh well...
>
> I have the director of Northern Telecom circa 1984 to thank for this. I BLEW the
> 88K implementation by putting the two 5-bit fields back to back and used
> the 16-bit immediate encoding, wasting bits and tying my hands into
> the future at the same time. My 66000 has essentially the same instrs;
> but the immediate form is XOM7 and uses a 12-bit immediate field. When
> this pattern is decoded, the two 5-bit fields are routed onto the Rs2
> operand bus at position<37..32> and position<5..0>. No 32-bit or smaller
> data value (replacing the immediate) can access the extract functionality
> and the 64-bitters that can are limited to putting SANE bit patterns
> there when they do. This Lower field is limited from 0..63 the upper one
> from 0..64, and all intermediate bits are checked for zeros.

OK.

Re: Misc: Preliminary (actual) performance: BJX2 vs RV64

https://news.novabbs.org/devel/article-flat.php?id=37033&group=comp.arch#37033

Newsgroups: comp.arch
From: tkoenig@netcologne.de (Thomas Koenig)
Subject: Re: Misc: Preliminary (actual) performance: BJX2 vs RV64
Date: Mon, 22 Jan 2024 18:46:41 -0000 (UTC)
Organization: news.netcologne.de
Message-ID: <uomd6h$2u82f$2@newsreader4.netcologne.de>
 by: Thomas Koenig - Mon, 22 Jan 2024 18:46 UTC

BGB <cr88192@gmail.com> schrieb:

> All of the ALU ops are 2-cycle at present.

You're imitating POWER, are you? :-)

Re: Misc: Preliminary (actual) performance: BJX2 vs RV64

https://news.novabbs.org/devel/article-flat.php?id=37034&group=comp.arch#37034

Newsgroups: comp.arch
From: mitchalsup@aol.com (MitchAlsup1)
Subject: Re: Misc: Preliminary (actual) performance: BJX2 vs RV64
Date: Mon, 22 Jan 2024 19:01:35 +0000
Organization: Rocksolid Light
Message-ID: <6a04abfd541e1ee200c7c38db215452e@www.novabbs.org>
 by: MitchAlsup1 - Mon, 22 Jan 2024 19:01 UTC

BGB wrote:

> On 1/21/2024 5:40 PM, MitchAlsup1 wrote:
>> BGB wrote:

>>> Shift sees a lot of use as well, as it is also used for both indexed
>>> addressing, and for performing sign an zero extension.
>>
>>> Say:
>>>    j=(short)i;
>>> Being, say:
>>>    SLLI X11, X10, 16
>>>    SRAI X11, X11, 16
>>
>> Which I do in 1 instruction
>>     SLL  R11,R10,<16,0>
>> {Extract the lower 16 bits at offset 0}
>> I started calling this a Smash -- Smash this long into a short.
>> This is what happens when shifts are subset of bit manipulation
>>> As opposed to having dedicated instructions for a lot of these cases
>>> (as in BJX2).
>>
>> See; mine are not dedicated, they just as easily perform
>>
>>     struct { long i : 17,
>>                   j : 9,
>>                   k : 3,
>>                  ...      } st;
>>     short s = st.k;
>>
>>     SLL     Rs,Rst,<3,26>
>>

> Possibly, if one has a big enough immediate field to encode it.

It is 12-bits, 2×6-bit fields.

> Could have made sense as a use for the 12-bit Immed fields in RISC-V,
> but it can be noted that they did not do so (and chose instead to use
> pairs of shifts).

Re: Misc: Preliminary (actual) performance: BJX2 vs RV64

https://news.novabbs.org/devel/article-flat.php?id=37040&group=comp.arch#37040

Newsgroups: comp.arch
From: cr88192@gmail.com (BGB)
Subject: Re: Misc: Preliminary (actual) performance: BJX2 vs RV64
Date: Mon, 22 Jan 2024 13:47:26 -0600
Organization: A noiseless patient Spider
Message-ID: <uomgoh$sck8$1@dont-email.me>
 by: BGB - Mon, 22 Jan 2024 19:47 UTC

On 1/22/2024 12:46 PM, Thomas Koenig wrote:
> BGB <cr88192@gmail.com> schrieb:
>
>> All of the ALU ops are 2-cycle at present.
>
> You're imitating POWER, are you? :-)

This makes it a lot easier to pass timing in the FPGA, and for the most
part the performance difference is "relatively minor" in the BJX2 ISA
(it was mostly the "MOV Reg,Reg" and "MOV Imm,Reg" instructions which
had a more obvious effect on performance).

However, 2-cycle ADD and Shift doesn't really help RISC-V's case, as the
ISA both uses these instructions a lot more heavily, and far more often
manages to step on the interlock penalties from the 2c latency (by using
the results directly, rather than interleaving them with other instructions).

Also, GCC seems to use "ADDI" for MV and LI, whereas the RISC-V spec
had said to use "ORI" for these. Though, if "ORI" is 2-cycle as well,
it doesn't really help much.

But, yeah, a combination of factors seems to lead to the RISC-V code
running at roughly 19 to 21 MIPs (at 50MHz) with the instruction timings
used in the BJX2 core (while, unlike BJX2 code, spending much less of
its time waiting for memory access).

But, yeah, it looks like if one were implementing a dedicated RISC-V
CPU, having 1-cycle latency on ALU ops and similar would be a priority...

Granted, yes, 1-cycle ALU ops would also help with performance for BJX2
code, but the gains would be smaller.

I suspect increasing some of the instruction latency values is why
Dhrystone had dropped from 79k to 61k for BJX2 at 50MHz, but had not
seen such an obvious drop in other contexts.

Re: Misc: Preliminary (actual) performance: BJX2 vs RV64

https://news.novabbs.org/devel/article-flat.php?id=37041&group=comp.arch#37041

Newsgroups: comp.arch
From: cr88192@gmail.com (BGB)
Subject: Re: Misc: Preliminary (actual) performance: BJX2 vs RV64
Date: Mon, 22 Jan 2024 13:58:00 -0600
Organization: A noiseless patient Spider
Message-ID: <uomhcb$sfim$1@dont-email.me>
 by: BGB - Mon, 22 Jan 2024 19:58 UTC

On 1/22/2024 1:01 PM, MitchAlsup1 wrote:
> BGB wrote:
>
>> On 1/21/2024 5:40 PM, MitchAlsup1 wrote:
>>> BGB wrote:
>
>>>> Shift sees a lot of use as well, as it is also used for both indexed
>>>> addressing, and for performing sign an zero extension.
>>>
>>>> Say:
>>>>    j=(short)i;
>>>> Being, say:
>>>>    SLLI X11, X10, 16
>>>>    SRAI X11, X11, 16
>>>
>>> Which I do in 1 instruction
>>>      SLL  R11,R10,<16,0>
>>> {Extract the lower 16 bits at offset 0}
>>> I started calling this a Smash -- Smash this long into a short.
>>> This is what happens when shifts are subset of bit manipulation
>>>> As opposed to having dedicated instructions for a lot of these cases
>>>> (as in BJX2).
>>>
>>> See; mine are not dedicated, they just as easily perform
>>>
>>>      struct { long i : 17,
>>>                    j : 9,
>>>                    k : 3,
>>>                   ...      } st;
>>>      short s = st.k;
>>>
>>>      SLL     Rs,Rst,<3,26>
>>>
>
>> Possibly, if one has a big enough immediate field to encode it.
>
> It is 12-bits, 2×6-bit fields.
>

Yes, but 12-bits was bigger than the 9-bit fields I was originally
using, or the Imm5 encodings in some other contexts.

Granted, XG2 expands these to 10 and 6 bits.

Or, could use a Jumbo encoding, or, ...

In most cases, having EXT{S/U}.{B/W/L} works well enough, and deals with
all of the common cases (and is faster than using a pair of shifts,
particularly when these shifts each have a 2 cycle latency...).

Though, I ended up doing it such that only EXTS.L and EXTU.L have
1-cycle latency, with the B and W forms being 2-cycle. Mostly because
casts involving 'int' and 'unsigned int' happened to be a lot more
common than 'signed char' and 'short' and similar.

Comparably, arbitrary bit-fields are fairly rare, vs needing to make
sure a value is still in 'int' range (and preserves the de-facto
standard "wrap on overflow" semantics).

>> Could have made sense as a use for the 12-bit Immed fields in RISC-V,
>> but it can be noted that they did not do so (and chose instead to use
>> pairs of shifts).

Re: Misc: Preliminary (actual) performance: BJX2 vs RV64

https://news.novabbs.org/devel/article-flat.php?id=37042&group=comp.arch#37042

Newsgroups: comp.arch
From: mitchalsup@aol.com (MitchAlsup1)
Subject: Re: Misc: Preliminary (actual) performance: BJX2 vs RV64
Date: Mon, 22 Jan 2024 20:15:00 +0000
Organization: Rocksolid Light
Message-ID: <9337063f3960f1f53fecb269b4eb9161@www.novabbs.org>
 by: MitchAlsup1 - Mon, 22 Jan 2024 20:15 UTC

BGB wrote:

> On 1/22/2024 1:01 PM, MitchAlsup1 wrote:
>> BGB wrote:
>>
>>> On 1/21/2024 5:40 PM, MitchAlsup1 wrote:
>>>> BGB wrote:
>>
>>>>> Shift sees a lot of use as well, as it is also used for both indexed
>>>>> addressing, and for performing sign an zero extension.
>>>>
>>>>> Say:
>>>>>    j=(short)i;
>>>>> Being, say:
>>>>>    SLLI X11, X10, 16
>>>>>    SRAI X11, X11, 16
>>>>
>>>> Which I do in 1 instruction
>>>>      SLL  R11,R10,<16,0>
>>>> {Extract the lower 16 bits at offset 0}
>>>> I started calling this a Smash -- Smash this long into a short.
>>>> This is what happens when shifts are subset of bit manipulation
>>>>> As opposed to having dedicated instructions for a lot of these cases
>>>>> (as in BJX2).
>>>>
>>>> See; mine are not dedicated, they just as easily perform
>>>>
>>>>      struct { long i : 17,
>>>>                    j : 9,
>>>>                    k : 3,
>>>>                   ...      } st;
>>>>      short s = st.k;
>>>>
>>>>      SLL     Rs,Rst,<3,26>
>>>>
>>
>>> Possibly, if one has a big enough immediate field to encode it.
>>
>> It is 12-bits, 2×6-bit fields.
>>

> Yes, but 12-bits was bigger than the 9-bit fields I was originally
> using, or the Imm5 encodings in some other contexts.

> Granted, XG2 expands these to 10 and 6 bits.

> Or, could use a Jumbo encoding, or, ...

> In most cases, having EXT{S/U}.{B/W/L} works well enough, and deals with
> all of the common cases (and is faster than using a pair of shifts,
> particularly when these shifts each have a 2 cycle latency...).

And here we have the classical chicken and egg problem.

Bit fields are not as fast as {B,H,W,D} so few people use them;
Bit fields are not well supported in ISA so few compilers optimize them;
EVEN if they are ideal for the situation at hand.

When the HW cost of properly supporting them is essentially free !!

Re: Misc: Preliminary (actual) performance: BJX2 vs RV64

https://news.novabbs.org/devel/article-flat.php?id=37052&group=comp.arch#37052

Newsgroups: comp.arch
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Subject: Re: Misc: Preliminary (actual) performance: BJX2 vs RV64
Date: Mon, 22 Jan 2024 22:09:46 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2024Jan22.230946@mips.complang.tuwien.ac.at>
 by: Anton Ertl - Mon, 22 Jan 2024 22:09 UTC

BGB <cr88192@gmail.com> writes:
>Also seemingly GCC seems to use "ADDI" for MV and LI, whereas the RISC-V
>spec had said to use "ORI" for these.

What makes you think so? According to
<https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf>,
page 13:

|ADDI rd, rs1, 0 is used to implement the MV rd, rs1 assembler
|pseudo-instruction.

And on page 76:

|C.LI expands into addi rd, x0, imm[5:0]

C.LI is a separate instruction. I did not find anything about a
non-compact LI, but given how C.LI expands (why does the ISA manual
actually specify that?), I expect that LI is a pseudo-instruction that
is actually "addi rd, x0, imm".

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Misc: Preliminary (actual) performance: BJX2 vs RV64

https://news.novabbs.org/devel/article-flat.php?id=37056&group=comp.arch#37056

Newsgroups: comp.arch
From: bohannonindustriesllc@gmail.com (BGB-Alt)
Subject: Re: Misc: Preliminary (actual) performance: BJX2 vs RV64
Date: Mon, 22 Jan 2024 17:09:18 -0600
Organization: A noiseless patient Spider
Message-ID: <uomsiu$ub0e$1@dont-email.me>
 by: BGB-Alt - Mon, 22 Jan 2024 23:09 UTC

On 1/22/2024 4:09 PM, Anton Ertl wrote:
> BGB <cr88192@gmail.com> writes:
>> Also seemingly GCC seems to use "ADDI" for MV and LI, whereas the RISC-V
>> spec had said to use "ORI" for these.
>
> What makes you think so? According to
> <https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf>,
> page 13:
>
> |ADDI rd, rs1, 0 is used to implement the MV rd, rs1 assembler
> |pseudo-instruction.
>
> And on page 76:
>
> |C.LI expands into addi rd, x0, imm[5:0]
>
> C.LI is a separate instruction. I did not find anything about a
> non-compact LI, but given how C.LI expands (why does the ISA manual
> actually specify that?), I expect that LI is a pseudo-instruction that
> is actually "addi rd, x0, imm".
>

OK.

I had thought when I had looked it up, that it had said that these
mapped to ORI.

But, if it is ADDI, then GCC is behaving according to the spec.
Either way, the end-result is the same in this case.

In theory, could hack over these in the decoder by
detecting/special-casing things when the immediate is 0 (to map these
over to the MOV logic).

I guess a more immediate priority is getting Doom to boot in the Verilog
implementation. As-is, it prints some stuff and then crashes. Need to
look at it some more.

For example, among other things, I ended up needing to tweak the
behavior of BLTU/BGEU, as (due to a minor logic issue) they seemed to
be doing LE and GT instead.

Did see mention of the possibility of using JALR rather than AUIPC to
get PC-relative addresses (though doing so is discouraged).

This was one concern as my implementation produces non-standard output
for JAL/JALR (it uses pointer tagging to encode that the return address
is in RISC-V Mode), and trying to use these values in address
calculations (for at least for non-function-pointers) may lead to
incorrect results.

Though, GCC seems to always use AUIPC for forming PC-relative addresses,
which is good in this case. Otherwise, would have needed to further
tweak some things to better hide the "weirdness" associated with how
RISC-V mode operates on the BJX2 core (eg, not using pointer tagging for
JAL/JALR; but then needing extra care for any possible inter-ISA thunking).

> - anton

Re: Misc: Preliminary (actual) performance: BJX2 vs RV64

https://news.novabbs.org/devel/article-flat.php?id=37058&group=comp.arch#37058

Newsgroups: comp.arch
From: mitchalsup@aol.com (MitchAlsup1)
Subject: Re: Misc: Preliminary (actual) performance: BJX2 vs RV64
Date: Tue, 23 Jan 2024 00:49:11 +0000
Organization: Rocksolid Light
Message-ID: <6a35c4271ca330a56225c514686727ca@www.novabbs.org>
 by: MitchAlsup1 - Tue, 23 Jan 2024 00:49 UTC

BGB-Alt wrote:

> On 1/22/2024 4:09 PM, Anton Ertl wrote:
>> BGB <cr88192@gmail.com> writes:
>>> Also seemingly GCC seems to use "ADDI" for MV and LI, whereas the RISC-V
>>> spec had said to use "ORI" for these.
>>
>> What makes you think so? According to
>> <https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf>,
>> page 13:
>>
>> |ADDI rd, rs1, 0 is used to implement the MV rd, rs1 assembler
>> |pseudo-instruction.
>>
>> And on page 76:
>>
>> |C.LI expands into addi rd, x0, imm[5:0]
>>
>> C.LI is a separate instruction. I did not find anything about a
>> non-compact LI, but given how C.LI expands (why does the ISA manual
>> actually specify that?), I expect that LI is a pseudo-instruction that
>> is actually "addi rd, x0, imm".
>>

> OK.

> I had thought when I had looked it up, that it had said that these
> mapped to ORI.

> But, if it is ADDI, then GCC is behaving according to the spec.
> Either way, the end-result is the same in this case.

> In theory, could hack over these in the decoder by
> detecting/special-casing things when the immediate is 0 (to map these
> over to the MOV logic).

I made My 66000 have a MOV OpCode for a particular reason::
{MOV, ABS, NEG, INV} can be performed in 0-cycles in the forwarding
network--if your FUs are designed to put up with this as inputs.

It would have not been "that hard" to just special case decode,
but who is going to do this when the opcode set includes FP and
SIMD that needs these ??

Re: Misc: Preliminary (actual) performance: BJX2 vs RV64

https://news.novabbs.org/devel/article-flat.php?id=37082&group=comp.arch#37082

Newsgroups: comp.arch
From: cr88192@gmail.com (BGB)
Subject: Re: Misc: Preliminary (actual) performance: BJX2 vs RV64
Date: Wed, 24 Jan 2024 04:42:57 -0600
Organization: A noiseless patient Spider
Message-ID: <uoqpji$1p5a4$1@dont-email.me>
 by: BGB - Wed, 24 Jan 2024 10:42 UTC

On 1/22/2024 2:15 PM, MitchAlsup1 wrote:
> BGB wrote:
>
>> On 1/22/2024 1:01 PM, MitchAlsup1 wrote:
>>> BGB wrote:
>>>
>>>> On 1/21/2024 5:40 PM, MitchAlsup1 wrote:
>>>>> BGB wrote:
>>>
>>>>>> Shift sees a lot of use as well, as it is also used for both
>>>>>> indexed addressing, and for performing sign an zero extension.
>>>>>
>>>>>> Say:
>>>>>>    j=(short)i;
>>>>>> Being, say:
>>>>>>    SLLI X11, X10, 16
>>>>>>    SRAI X11, X11, 16
>>>>>
>>>>> Which I do in 1 instruction
>>>>>      SLL  R11,R10,<16,0>
>>>>> {Extract the lower 16 bits at offset 0}
>>>>> I started calling this a Smash -- Smash this long into a short.
>>>>> This is what happens when shifts are subset of bit manipulation
>>>>>> As opposed to having dedicated instructions for a lot of these
>>>>>> cases (as in BJX2).
>>>>>
>>>>> See; mine are not dedicated, they just as easily perform
>>>>>
>>>>>      struct { long i : 17,
>>>>>                    j : 9,
>>>>>                    k : 3,
>>>>>                   ...      } st;
>>>>>      short s = st.k;
>>>>>
>>>>>      SLL     Rs,Rst,<3,26>
>>>>>
>>>
>>>> Possibly, if one has a big enough immediate field to encode it.
>>>
>>> It is 12-bits, 2×6-bit fields.
>>>
>
>> Yes, but 12-bits was bigger than the 9-bit fields I was originally
>> using, or the Imm5 encodings in some other contexts.
>
>> Granted, XG2 expands these to 10 and 6 bits.
>
>> Or, could use a Jumbo encoding, or, ...
>
>> In most cases, having EXT{S/U}.{B/W/L} works well enough, and deals
>> with all of the common cases (and is faster than using a pair of
>> shifts, particularly when these shifts each have a 2 cycle latency...).
>
> And here we have the classical chicken and egg problem.
>
> Bit fields are not as fast as {B,H,W,D} so few people use them;
> Bit fields are not well supported in ISA so few compilers optimize them;
> EVEN if they are ideal for the situation at hand.
>
> When the HW cost of properly supporting them is essentially free !!

Possibly, but typical C type sizes and structure layouts are not defined
in bits, but rather as aggregates of power-of-two sized types typically
also with power-of-two alignments.

So, whatever would make effective use of bitfield instructions, probably
isn't typical C code (nor any of the other commonly used languages).

Re: Misc: Preliminary (actual) performance: BJX2 vs RV64

https://news.novabbs.org/devel/article-flat.php?id=37164&group=comp.arch#37164

Newsgroups: comp.arch
From: bohannonindustriesllc@gmail.com (BGB-Alt)
Subject: Re: Misc: Preliminary (actual) performance: BJX2 vs RV64
Date: Wed, 31 Jan 2024 17:19:44 -0600
Organization: A noiseless patient Spider
Message-ID: <upekig$1nogr$1@dont-email.me>
 by: BGB-Alt - Wed, 31 Jan 2024 23:19 UTC

On 1/22/2024 6:49 PM, MitchAlsup1 wrote:
> BGB-Alt wrote:
>
>> On 1/22/2024 4:09 PM, Anton Ertl wrote:
>>> BGB <cr88192@gmail.com> writes:
>>>> Also seemingly GCC seems to use "ADDI" for MV and LI, whereas the
>>>> RISC-V
>>>> spec had said to use "ORI" for these.
>>>
>>> What makes you think so?  According to
>>> <https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf>,
>>> page 13:
>>>
>>> |ADDI rd, rs1, 0 is used to implement the MV rd, rs1 assembler
>>> |pseudo-instruction.
>>>
>>> And on page 76:
>>>
>>> |C.LI expands into addi rd, x0, imm[5:0]
>>>
>>> C.LI is a separate instruction.  I did not find anything about a
>>> non-compact LI, but given how C.LI expands (why does the ISA manual
>>> actually specify that?), I expect that LI is a pseudo-instruction that
>>> is actually "addi rd, x0, imm".
>>>
>
>> OK.
>
>> I had thought when I had looked it up, that it had said that these
>> mapped to ORI.
>
>> But, if it is ADDI, then GCC is behaving according to the spec.
>> Either way, the end-result is the same in this case.
>
>> In theory, could hack over these in the decoder by
>> detecting/special-casing things when the immediate is 0 (to map these
>> over to the MOV logic).
>
> I made My 66000 have a MOV OpCode for a particular reason::
> {MOV, ABS, NEG, INV} can be performed in 0-cycles in the forwarding
> network--if your FUs are designed to put up with this as inputs.
>
> It would have not been "that hard" to just special case decode,
> but who is going to do this when the opcode set includes FP and
> SIMD that needs these ??

In this case, it is merely 1 cycle, vs 2 cycles for the ALU ops, but, yeah...

But, yeah, looks like RV64 performance issues are partly:
  Needs to fake indexed load/store;
  Lots of interlock penalties;
  GCC seems to assume aligned-only access.

If one tries to do "memcpy(d, s, 8);", it is handled as 8 single-byte
moves, which is bad in my case. Seems GCC emits a call to the library
function for anything bigger than 8 bytes.
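For reference, the idiom in question looks like this (a generic sketch of the fixed-size `memcpy` trick; the helper names are mine, not from any compiler):

```c
#include <string.h>
#include <stdint.h>

/* A fixed-size memcpy is the portable idiom for a possibly-unaligned
 * 8-byte access; a compiler that knows unaligned loads/stores are cheap
 * can lower each call to a single 64-bit memory op, while one that
 * assumes aligned-only access falls back to 8 byte-sized moves. */
static uint64_t load_u64(const void *p) {
    uint64_t v;
    memcpy(&v, p, 8);   /* ideally one 64-bit load on the target */
    return v;
}

static void store_u64(void *p, uint64_t v) {
    memcpy(p, &v, 8);   /* ideally one 64-bit store */
}
```

Whether this becomes one instruction or eight is purely a code-generation question; the C semantics are identical either way.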

Also it would appear as if the scheduling is assuming 1-cycle ALU and
2-cycle load, vs 2-cycle ALU and 3-cycle load.

So, at least part of the problem is that GCC is generating code that is
not ideal for my pipeline.

Tried modeling what happens if RV64 had superscalar (in my emulator),
and the interlock issue gets worse: the interlock penalty then jumps up
to around 23%-26% (mostly eating any gains that superscalar would
bring). It seems that superscalar (according to my CPU's rules) would
bundle around 10-15% of the RV64 ops with '-O3' (or around 8-12% with
'-Os').

On the other hand, disabling WEX in BJX2 causes interlock penalties to
drop. So, it still maintains a performance advantage over RV, as the
drop in MIPs score is smaller.

Otherwise, had started work on trying to get RV64G support working, as
this would support a wider variety of programs than RV64IMA.

In another experiment, had added logic to fold && and || operators to
use bitwise arithmetic for logical expressions (in certain cases):
  If both the LHS and RHS represent logical expressions with no side effects;
  If the LHS and RHS are not "too expensive" according to a cost heuristic
  (past a certain size, it is cheaper to use short-circuit branching
  rather than ALU operations).
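As a minimal C-level illustration of the transform (my own sketch, not actual compiler output): when both sides are side-effect-free 0/1 expressions, the short-circuit and bitwise forms compute the same value, so the branchy form can be folded into straight-line ALU code:

```c
/* Short-circuit form: typically a compare plus a conditional branch
 * that skips the second test. */
int in_range_branchy(int x) {
    return (x >= 0) && (x < 100);
}

/* Folded form: both compares always execute and their 0/1 results
 * are ANDed bitwise; no branch, same result, since neither side has
 * side effects. */
int in_range_bitwise(int x) {
    return (x >= 0) & (x < 100);
}
```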

Internally, this added various pseudo operators to the compiler:
  &&&, |||: Logical AND/OR expressed as bitwise;
  !&  : !(a&b)
  !!& : !(!(a&b)), the normal TEST operator, with a logic result.
    Exists to be distinct from normal bitwise AND.

This did at least help some with speed, but was initially bad for code
density (each compare needs 2 operations, CMPxx+MOVT/MOVNT).

Did partly compensate for the code-size increase by adding some
experimental 3R CMPxx ops:
  CMPQEQ, CMPQNE, CMPQGT, CMPQGE

Currently only available in 64-bit forms, which can handle signed and
unsigned 32-bit values along with signed 64-bit values (unsigned 64-bit
would require a 3R CMPQHI instruction, which is likely to be needed
less often).

Where:
  CMPQEQ Rs, Rt, Rn
  CMPQNE Rs, Rt, Rn
  CMPQGT Rs, Rt, Rn
  CMPQGE Rs, Rt, Rn
Does:
  Rn = (Rs == Rt);
  Rn = (Rs != Rt);
  Rn = (Rs >  Rt);
  Rn = (Rs >= Rt);
Where, < and <= can be done by flipping the arguments.
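In C terms, an emulator-style sketch of the stated semantics (the helper names are mine):

```c
#include <stdint.h>

/* 3R compare ops: the destination register receives a 0/1 logic value. */
static int64_t cmpqeq(int64_t rs, int64_t rt) { return rs == rt; }
static int64_t cmpqne(int64_t rs, int64_t rt) { return rs != rt; }
static int64_t cmpqgt(int64_t rs, int64_t rt) { return rs >  rt; }
static int64_t cmpqge(int64_t rs, int64_t rt) { return rs >= rt; }

/* No separate less-than ops are needed: flip the arguments instead. */
static int64_t cmpqlt(int64_t rs, int64_t rt) { return cmpqgt(rt, rs); }
static int64_t cmpqle(int64_t rs, int64_t rt) { return cmpqge(rt, rs); }
```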

The CMPQ{EQ/NE/GT} cases are also available in an Imm5u form (TBD if it
will use the expansion to Imm6u or Imm6s in XG2 mode). Currently these
have a comparably lower hit rate.

It is less clear if the "better" fallback case is to load a constant
into a register and use the 3R CMPxx ops, or to fall back to the
original CMPxx+MOVT/MOVNT.

At present, the CMPxx+MOVT/MOVNT fallback strategy seems to be winning
(though, the 3R CMPxx fallback is likely to be better when the value
falls outside the range of the "CMPxx Imm10{u/n}, Rn" operations).

...

Re: Misc: Preliminary (actual) performance: BJX2 vs RV64

<upevi8$1t2hs$1@dont-email.me>

From: robfi680@gmail.com (Robert Finch)
Newsgroups: comp.arch
Subject: Re: Misc: Preliminary (actual) performance: BJX2 vs RV64
Date: Wed, 31 Jan 2024 21:27:19 -0500
Organization: A noiseless patient Spider
Lines: 158
Message-ID: <upevi8$1t2hs$1@dont-email.me>
 by: Robert Finch - Thu, 1 Feb 2024 02:27 UTC

On 2024-01-31 6:19 p.m., BGB-Alt wrote:
> On 1/22/2024 6:49 PM, MitchAlsup1 wrote:
>> BGB-Alt wrote:
>>
>>> On 1/22/2024 4:09 PM, Anton Ertl wrote:
>>>> BGB <cr88192@gmail.com> writes:
>>>>> Also seemingly GCC seems to use "ADDI" for MV and LI, whereas the
>>>>> RISC-V
>>>>> spec had said to use "ORI" for these.
>>>>
>>>> What makes you think so?  According to
>>>> <https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf>,
>>>> page 13:
>>>>
>>>> |ADDI rd, rs1, 0 is used to implement the MV rd, rs1 assembler
>>>> |pseudo-instruction.
>>>>
>>>> And on page 76:
>>>>
>>>> |C.LI expands into addi rd, x0, imm[5:0]
>>>>
>>>> C.LI is a separate instruction.  I did not find anything about a
>>>> non-compact LI, but given how C.LI expands (why does the ISA manual
>>>> actually specify that?), I expect that LI is a pseudo-instruction that
>>>> is actually "addi rd, x0, imm".
>>>>
>>
>>> OK.
>>
>>> I had thought when I had looked it up, that it had said that these
>>> mapped to ORI.
>>
>>> But, if it is ADDI, then GCC is behaving according to the spec.
>>> Either way, the end-result is the same in this case.
>>
>>> In theory, could hack over these in the decoder by
>>> detecting/special-casing things when the immediate is 0 (to map these
>>> over to the MOV logic).
>>
>> I made My 66000 have a MOV OpCode for a particular reason::
>> {MOV, ABS, NEG, INV} can be performed in 0-cycles in the forwarding
>> network--if your FUs are designed to put up with this as inputs.
>>
>> It would have not been "that hard" to just special case decode,
>> but who is going to do this when the opcode set includes FP and
>> SIMD that needs these ??
>
>
> In this case, it is merely 1 cycle, vs 2 cycle for the ALU ops, but,
> yeah...
>
> But, yeah, looks like RV64 performance issues are partly:
>   Needs to fake indexed load/store;
>   Lots of interlock penalties;
>   GCC seems to assume aligned-only access;
>
>
> If one tries to do "memcpy(d, s, 8);", it handles it as 8 byte moves,
> which is bad in my case. Seems GCC is doing a function call for anything
> bigger than 8 bytes.
>
> Also it would appear as-if the scheduling is assuming 1-cycle ALU and
> 2-cycle load, vs 2-cycle ALU and 3-cycle load.
>
> So, at least part of the problem is that GCC is generating code that is
> not ideal for my pipeline.
>
>
> Tried modeling what happens if RV64 had superscalar (in my emulator),
> and the interlock issue gets worse, as then jumps up to around 23%-26%
> interlock penalty (mostly eating any gains that superscalar would
> bring). Where, it seems that superscalar (according to my CPU's rules)
> would bundle around 10-15% of the RV64 ops with '-O3' (or, around 8-12%
> with '-Os').
>
Is that with register renaming to remove dependencies?
>
> On the other hand, disabling WEX in BJX2 causes interlock penalties to
> drop. So, it still maintains a performance advantage over RV, as the
> drop in MIPs score is smaller.
>
> Otherwise, had started work on trying to get RV64G support working, as
> this would support a wider variety of programs than RV64IMA.
>
>
>
> In another experiment, had added logic to fold && and || operators to
> use bitwise arithmetic for logical expressions (in certain cases).
> If both the LHS and RHS represent logical expressions with no side effects;
> If the LHS and RHS are not "too expensive" according to a cost heuristic
> (past a certain size, it is cheaper to use short-circuit branching
> rather than ALU operations).
>
> Internally, this added various pseudo operators to the compiler:
>   &&&, |||: Logical and expressed as bitwise.
>   !& : !(a&b)
>   !!&: !(!(a&b)), Normal TEST operator, with a logic result.
>     Exists to be distinct from normal bitwise AND.
>
The arpl (cc64) compiler has the same ops, I think, using the same
symbols. They are called 'safe-and' and 'safe-or', and can be specified
with &&& and |||.
>
> This did at least help some with speed, but was initially bad for code
> density (each compare needs 2 operations, CMPxx+MOVT/MOVNT).
>
> Did partly compensate for the code-size increase by adding some
> experimental 3R CMPxx ops:
>   CMPQEQ, CMPQNE, CMPQGT, CMPQGE
>
> Currently only available in 64-bit forms, which can handle signed and
> unsigned 32-bit values along with signed 64-bit values (unsigned 64-bit
> would require a 3R CMPQHI instruction, and is less likely to be used as
> often).
>
> Where:
>   CMPQEQ Rs, Rt, Rn
>   CMPQNE Rs, Rt, Rn
>   CMPQGT Rs, Rt, Rn
>   CMPQGE Rs, Rt, Rn
> Does:
>   Rn = (Rs == Rt);
>   Rn = (Rs != Rt);
>   Rn = (Rs >  Rt);
>   Rn = (Rs >= Rt);
> Where, < and <= can be done by flipping the arguments.
>
These instructions are also called 'set' instructions in some
architectures. Useful enough to include IMO. Q+ calls them 'ZSxx' for
zero-or-set (from the MMIX CPU) so they are not confused with
instructions that only set, which are called 'Sxx' instructions. I think
the Itanium calls them CMPxx instructions. I have been experimenting
with the option of having them accumulate values like the Itanium does.
Needs more opcode bits though.

Q+ has
  Rt = (Ra==Rb) ? Rc : 0;    // ZSEQ
  Rt = (Ra==Rb) ? Imm8 : 0;
  Rt = (Ra==Rb) ? Rc : Rt;   // SEQ
  Rt = (Ra==Rb) ? Imm8 : Rt;
Plus other ops besides ==
>
> The CMPQ{EQ/NE/GT} cases are also available in an Imm5u form (TBD if it
> will use the expansion to Imm6u or Imm6s in XG2 mode). Currently these
> have a comparably lower hit rate.
>
> It is less clear if the "better" fallback case is to load a constant
> into a register and use the 3R CMPxx ops, or to fall-back to the
> original CMPxx+MOVT/MOVNT.
>
> At present, the CMPxx+MOVT/MOVNT fallback strategy seems to be winning
> (though, the 3R CMPxx fallback is likely to be better when the value
> falls outside the range of the "CMPxx Imm10{u/n}, Rn" operations).
>
> ...
>
>

Re: Misc: Preliminary (actual) performance: BJX2 vs RV64

<e01eddaf93fc9ac8cf2ff48232f2a133@www.novabbs.com>

From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Misc: Preliminary (actual) performance: BJX2 vs RV64
Date: Thu, 1 Feb 2024 03:01:36 +0000
Organization: novaBBS
Message-ID: <e01eddaf93fc9ac8cf2ff48232f2a133@www.novabbs.com>
 by: MitchAlsup - Thu, 1 Feb 2024 03:01 UTC

BGB-Alt wrote:

> On 1/22/2024 6:49 PM, MitchAlsup1 wrote:
>> BGB-Alt wrote:
>> <snip>

> Also it would appear as-if the scheduling is assuming 1-cycle ALU and
> 2-cycle load, vs 2-cycle ALU and 3-cycle load.

> So, at least part of the problem is that GCC is generating code that is
> not ideal for my pipeline.

Captain Obvious strikes again.

> Tried modeling what happens if RV64 had superscalar (in my emulator),
> and the interlock issue gets worse, as then jumps up to around 23%-26%
> interlock penalty (mostly eating any gains that superscalar would
> bring). Where, it seems that superscalar (according to my CPU's rules)
> would bundle around 10-15% of the RV64 ops with '-O3' (or, around 8-12%
> with '-Os').

You are running into the reasons CPU designers went OoO after the 2-wide
in-order machine generation.

> On the other hand, disabling WEX in BJX2 causes interlock penalties to
> drop. So, it still maintains a performance advantage over RV, as the
> drop in MIPs score is smaller.

Your compiler is tuned to your pipeline.
But how do you tune your compiler to EVERY conceivable pipeline ??

> Otherwise, had started work on trying to get RV64G support working, as
> this would support a wider variety of programs than RV64IMA.

> In another experiment, had added logic to fold && and || operators to
> use bitwise arithmetic for logical expressions (in certain cases).
> If both the LHS and RHS represent logical expressions with no side effects;
> If the LHS and RHS are not "too expensive" according to a cost heuristic
> (past a certain size, it is cheaper to use short-circuit branching
> rather than ALU operations).

> Internally, this added various pseudo operators to the compiler:
> &&&, |||: Logical and expressed as bitwise.
> !& : !(a&b)
> !!&: !(!(a&b)), Normal TEST operator, with a logic result.
> Exists to be distinct from normal bitwise AND.

For the inexpensive cases, PRED was designed to handle the && and ||
of HLLs.

Re: Misc: Preliminary (actual) performance: BJX2 vs RV64

<3ded4cbb3a5c98e285ec80fb64f14b52@www.novabbs.com>

From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Misc: Preliminary (actual) performance: BJX2 vs RV64
Date: Thu, 1 Feb 2024 03:10:42 +0000
Organization: novaBBS
Message-ID: <3ded4cbb3a5c98e285ec80fb64f14b52@www.novabbs.com>
 by: MitchAlsup - Thu, 1 Feb 2024 03:10 UTC

Robert Finch wrote:

> On 2024-01-31 6:19 p.m., BGB-Alt wrote:
>> On 1/22/2024 6:49 PM, MitchAlsup1 wrote:
>>> BGB-Alt wrote:
>>><snip>
>>
>> Did partly compensate for the code-size increase by adding some
>> experimental 3R CMPxx ops:
>>   CMPQEQ, CMPQNE, CMPQGT, CMPQGE
>>
>> Currently only available in 64-bit forms, which can handle signed and
>> unsigned 32-bit values along with signed 64-bit values (unsigned 64-bit
>> would require a 3R CMPQHI instruction, and is less likely to be used as
>> often).
>>
>> Where:
>>   CMPQEQ Rs, Rt, Rn
>>   CMPQNE Rs, Rt, Rn
>>   CMPQGT Rs, Rt, Rn
>>   CMPQGE Rs, Rt, Rn
>> Does:
>>   Rn = (Rs == Rt);
>>   Rn = (Rs != Rt);
>>   Rn = (Rs >  Rt);
>>   Rn = (Rs >= Rt);
>> Where, < and <= can be done by flipping the arguments.
>>
> These instructions are also called 'set' instructions in some
> architectures. Useful enough to include IMO. Q+ calls the 'ZSxx' for
> zero or set (from the MMIX CPU) so they are not confused with
> instructions that only set, which are called 'Sxx' instructions. I think
> the Itanium calls them CMPxx instructions. I have been experimenting
> with the option of having them cumulate values like the Itanium does.
> Needs more opcode bits though.

> Q+ has
> Rt = (Ra==Rb) ? Rc : 0; // ZSEQ
> Rt = (Ra==Rb) ? Imm8 : 0;
> Rt = (Ra==Rb) ? Rc : Rt; // SEQ
> Rt = (Ra==Rb) ? Imm8 : Rt;
> Plus other ops besides ==

My 66000 has compare instructions that generate a bit-vector of output
conditions:: one for all forms of integer, and one for FP. In the case of
FP, it generates a set bit when NaN comparisons should go to the else-clause
and a different bit when that same comparison should deliver NaNs to the
then-clause. This enables the compiler to flip the then-else-clauses when
it chooses to do so.

In addition, the integer version has range comparisons (0 <[=] Rs1, <[=] Rs2)
for array limit comparisons. Any Byte or Any Half and Either Word comparisons
can be added later should anyone choose, but it looks like for now VVM supersedes
these needs.

Thus I have one integer CMP and one FP CMP instruction rather than a multitude.

If you want True/False, you can extract the bit you want::

CMP Rt,Rs1,Rs3
SLL Rd,Rt,<1,EQ> // {0, +1}
SLLs Re,Rt,<1,EQ> // {0, -1}

>>
>> The CMPQ{EQ/NE/GT} cases are also available in an Imm5u form (TBD if it
>> will use the expansion to Imm6u or Imm6s in XG2 mode). Currently these
>> have a comparably lower hit rate.
>>
>> It is less clear if the "better" fallback case is to load a constant
>> into a register and use the 3R CMPxx ops, or to fall-back to the
>> original CMPxx+MOVT/MOVNT.
>>
>> At present, the CMPxx+MOVT/MOVNT fallback strategy seems to be winning
>> (though, the 3R CMPxx fallback is likely to be better when the value
>> falls outside the range of the "CMPxx Imm10{u/n}, Rn" operations).
>>
>> ...
>>
>>

Re: Misc: Preliminary (actual) performance: BJX2 vs RV64

<upf40l$1tm16$1@dont-email.me>

From: robfi680@gmail.com (Robert Finch)
Newsgroups: comp.arch
Subject: Re: Misc: Preliminary (actual) performance: BJX2 vs RV64
Date: Wed, 31 Jan 2024 22:43:15 -0500
Organization: A noiseless patient Spider
Lines: 102
Message-ID: <upf40l$1tm16$1@dont-email.me>
 by: Robert Finch - Thu, 1 Feb 2024 03:43 UTC

On 2024-01-31 10:10 p.m., MitchAlsup wrote:
> Robert Finch wrote:
>
>> On 2024-01-31 6:19 p.m., BGB-Alt wrote:
>>> On 1/22/2024 6:49 PM, MitchAlsup1 wrote:
>>>> BGB-Alt wrote:
>>>> <snip>
>>>
>>> Did partly compensate for the code-size increase by adding some
>>> experimental 3R CMPxx ops:
>>>    CMPQEQ, CMPQNE, CMPQGT, CMPQGE
>>>
>>> Currently only available in 64-bit forms, which can handle signed and
>>> unsigned 32-bit values along with signed 64-bit values (unsigned
>>> 64-bit would require a 3R CMPQHI instruction, and is less likely to
>>> be used as often).
>>>
>>> Where:
>>>    CMPQEQ Rs, Rt, Rn
>>>    CMPQNE Rs, Rt, Rn
>>>    CMPQGT Rs, Rt, Rn
>>>    CMPQGE Rs, Rt, Rn
>>> Does:
>>>    Rn = (Rs == Rt);
>>>    Rn = (Rs != Rt);
>>>    Rn = (Rs >  Rt);
>>>    Rn = (Rs >= Rt);
>>> Where, < and <= can be done by flipping the arguments.
>>>
>> These instructions are also called 'set' instructions in some
>> architectures. Useful enough to include IMO. Q+ calls the 'ZSxx' for
>> zero or set (from the MMIX CPU) so they are not confused with
>> instructions that only set, which are called 'Sxx' instructions. I
>> think the Itanium calls them CMPxx instructions. I have been
>> experimenting with the option of having them cumulate values like the
>> Itanium does. Needs more opcode bits though.
>
>> Q+ has
>>     Rt = (Ra==Rb) ? Rc : 0;    // ZSEQ
>>     Rt = (Ra==Rb) ? Imm8 : 0;
>>     Rt = (Ra==Rb) ? Rc : Rt;   // SEQ
>>     Rt = (Ra==Rb) ? Imm8 : Rt;
>> Plus other ops besides ==
>
> My 66000 has compare instructions that generate a bit-vector of output
> conditions:: one for all forms of integer, and one for FP. In the case
> of FP, it generates a set bit when NaN comparisons should go to the
> else-clause
> and a different bit when that same comparison should deliver NaNs to the
> then-clause. This enables the compiler to flip the then-else-clauses when
> it chooses to do so.
>
> In addition, the integer version has range comparisons (0 <[=] Rs1, <[=]
> Rs2)
> for array limit comparisons. Any Byte or Any Half and Either Word
> comparisons
> can be added later should anyone choose, but it looks like for now VVM
> supersedes
> these needs.
>
> Thus I have 1 integer CMP and one FP CMP instruction rather than a
> multitude.
>
> If you want True/False, you can extract the bit you want::
>
>     CMP   Rt,Rs1,Rs3
>     SLL   Rd,Rt,<1,EQ>  // {0, +1}
>     SLLs  Re,Rt,<1,EQ>  // {0, -1}
>
Q+ also has the compare instructions returning bit vectors; in Q+'s
case three of them: one for signed, one for unsigned, and one for FP
ops. There are separate signed and unsigned compares so that the result
vector fits into eight bits, allowing it to be used with SIMD
instructions.

  CMP   Rt,Ra,Rb
  EXTU  Rd,Rt,EQ,EQ
  EXT   Rd,Rt,EQ,EQ

  ZSEQ  Rt,Ra,Rb,2
  ZSEQ  Rt,Ra,Rb,-2

The set instructions sometimes save an instruction over using a compare
then extract, so they increase code density. It is a little redundant,
but ALU ops are inexpensive and the opcode space was available.
>>>
>>> The CMPQ{EQ/NE/GT} cases are also available in an Imm5u form (TBD if
>>> it will use the expansion to Imm6u or Imm6s in XG2 mode). Currently
>>> these have a comparably lower hit rate.
>>>
>>> It is less clear if the "better" fallback case is to load a constant
>>> into a register and use the 3R CMPxx ops, or to fall-back to the
>>> original CMPxx+MOVT/MOVNT.
>>>
>>> At present, the CMPxx+MOVT/MOVNT fallback strategy seems to be
>>> winning (though, the 3R CMPxx fallback is likely to be better when
>>> the value falls outside the range of the "CMPxx Imm10{u/n}, Rn"
>>> operations).
>>>
>>> ...
>>>
>>>

Re: Misc: Preliminary (actual) performance: BJX2 vs RV64

<upfbn8$1umih$1@dont-email.me>

From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Preliminary (actual) performance: BJX2 vs RV64
Date: Wed, 31 Jan 2024 23:54:40 -0600
Organization: A noiseless patient Spider
Lines: 249
Message-ID: <upfbn8$1umih$1@dont-email.me>
 by: BGB - Thu, 1 Feb 2024 05:54 UTC

On 1/31/2024 8:27 PM, Robert Finch wrote:
> On 2024-01-31 6:19 p.m., BGB-Alt wrote:
>> On 1/22/2024 6:49 PM, MitchAlsup1 wrote:
>>> BGB-Alt wrote:
>>>
>>>> On 1/22/2024 4:09 PM, Anton Ertl wrote:
>>>>> BGB <cr88192@gmail.com> writes:
>>>>>> Also seemingly GCC seems to use "ADDI" for MV and LI, whereas the
>>>>>> RISC-V
>>>>>> spec had said to use "ORI" for these.
>>>>>
>>>>> What makes you think so?  According to
>>>>> <https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf>,
>>>>> page 13:
>>>>>
>>>>> |ADDI rd, rs1, 0 is used to implement the MV rd, rs1 assembler
>>>>> |pseudo-instruction.
>>>>>
>>>>> And on page 76:
>>>>>
>>>>> |C.LI expands into addi rd, x0, imm[5:0]
>>>>>
>>>>> C.LI is a separate instruction.  I did not find anything about a
>>>>> non-compact LI, but given how C.LI expands (why does the ISA manual
>>>>> actually specify that?), I expect that LI is a pseudo-instruction that
>>>>> is actually "addi rd, x0, imm".
>>>>>
>>>
>>>> OK.
>>>
>>>> I had thought when I had looked it up, that it had said that these
>>>> mapped to ORI.
>>>
>>>> But, if it is ADDI, then GCC is behaving according to the spec.
>>>> Either way, the end-result is the same in this case.
>>>
>>>> In theory, could hack over these in the decoder by
>>>> detecting/special-casing things when the immediate is 0 (to map
>>>> these over to the MOV logic).
>>>
>>> I made My 66000 have a MOV OpCode for a particular reason::
>>> {MOV, ABS, NEG, INV} can be performed in 0-cycles in the forwarding
>>> network--if your FUs are designed to put up with this as inputs.
>>>
>>> It would have not been "that hard" to just special case decode,
>>> but who is going to do this when the opcode set includes FP and
>>> SIMD that needs these ??
>>
>>
>> In this case, it is merely 1 cycle, vs 2 cycle for the ALU ops, but,
>> yeah...
>>
>> But, yeah, looks like RV64 performance issues are partly:
>>    Needs to fake indexed load/store;
>>    Lots of interlock penalties;
>>    GCC seems to assume aligned-only access;
>>
>>
>> If one tries to do "memcpy(d, s, 8);", it handles it as 8 byte moves,
>> which is bad in my case. Seems GCC is doing a function call for
>> anything bigger than 8 bytes.
>>
>> Also it would appear as-if the scheduling is assuming 1-cycle ALU and
>> 2-cycle load, vs 2-cycle ALU and 3-cycle load.
>>
>> So, at least part of the problem is that GCC is generating code that
>> is not ideal for my pipeline.
>>
>>
>> Tried modeling what happens if RV64 had superscalar (in my emulator),
>> and the interlock issue gets worse, as then jumps up to around 23%-26%
>> interlock penalty (mostly eating any gains that superscalar would
>> bring). Where, it seems that superscalar (according to my CPU's rules)
>> would bundle around 10-15% of the RV64 ops with '-O3' (or, around
>> 8-12% with '-Os').
>>
> Is that with register renaming to remove dependencies?

No.

Register renaming is far too advanced of a technology for my CPU core...

But, yeah, ideally one wants each newly used register to not conflict
with previously used registers (within a certain scope), so in my case
the register allocator uses heuristics to try to figure out which
register to use:
  It prefers registers that are reserved but not used in the current
  basic-block, if available;
  Otherwise, it uses ranking heuristics to evaluate which register to evict;
  If applicable (if it had to evict something), it evaluates whether to
  reserve additional registers in the stack-frame.
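A toy sketch of that selection heuristic (my own illustration, not the actual BGBCC allocator; the rank metric is a stand-in for whatever the real ranking heuristics compute):

```c
#include <limits.h>

#define NREGS 8

/* Per-register state: whether it is already used in the current basic
 * block, and a heuristic rank (e.g. recency or use count); a lower
 * rank means the register is cheaper to evict. */
struct reg_state { int used_in_block; int rank; };

/* Prefer a register not yet used in the current basic block;
 * otherwise evict the lowest-ranked candidate. */
int pick_register(const struct reg_state regs[NREGS]) {
    int victim = 0, victim_rank = INT_MAX;
    for (int i = 0; i < NREGS; i++) {
        if (!regs[i].used_in_block)
            return i;               /* free in this block: take it */
        if (regs[i].rank < victim_rank) {
            victim_rank = regs[i].rank;
            victim = i;
        }
    }
    return victim;                  /* all busy: evict the cheapest */
}
```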

Though, static-assigned variables will always be mapped to the same
registers.

Comparably, my compiler does seem to use a lot more registers than GCC,
but it seems to be less prone to quickly reuse the same registers.

>>
>> On the other hand, disabling WEX in BJX2 causes interlock penalties to
>> drop. So, it still maintains a performance advantage over RV, as the
>> drop in MIPs score is smaller.
>>
>> Otherwise, had started work on trying to get RV64G support working, as
>> this would support a wider variety of programs than RV64IMA.
>>

Looks like a partial workaround at least is to use "-mtune" to claim
that my CPU is a "SiFive S76", which appears to have timing values
closer to my CPU core than the "Rocket Chip" (and seems to perform at
least slightly better).

Also apparently the SiFive chip was (like mine) designed around an
8-stage pipeline rather than a 5-stage pipeline.

It also looks (in some small-scale in-emulator experiments) like, if
ALU and Load latencies could be reduced to 1 and 2 cycles, there would
be a fairly significant speed-up (both for RISC-V and BJX2 code).

But, at the moment, this would be asking a bit much.

>>
>>
>> In another experiment, had added logic to fold && and || operators to
>> use bitwise arithmetic for logical expressions (in certain cases).
>> If both the LHS and RHS represent logical expressions with no side
>> effects;
>> If the LHS and RHS are not "too expensive" according to a cost
>> heuristic (past a certain size, it is cheaper to use short-circuit
>> branching rather than ALU operations).
>>
>> Internally, this added various pseudo operators to the compiler:
>>    &&&, |||: Logical and expressed as bitwise.
>>    !& : !(a&b)
>>    !!&: !(!(a&b)), Normal TEST operator, with a logic result.
>>      Exists to be distinct from normal bitwise AND.
>>
> The arpl (cc64) compiler has the same ops I think using the same
> symbols. Called 'safe-and' and 'safe-or' which can be specified with &&&
> and |||.

These don't exist at the language level, but are generated internally in
the "reducer" stage.

So, in this case, if && or || sees that both the LHS and RHS represent a
logical operator, it may quietly turn it into &&& or ||| (keeping the
original short-circuit operators if either side contains an expression
with side-effects or represents a non-logic result).

The new operators were added because, at this stage, the compiler needs
to be able to keep track of the difference between logical and bitwise
operators (this distinction is lost in the back-end; technically only &
and | exist, as both && and || decompose into if-goto logic in the RIL3
IR stage).
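As a quick sanity check of why the fold is valid: for side-effect-free,
0/1-valued operands, the bitwise and short-circuit forms agree. A
minimal C sketch (the function names are made up, standing in for the
internal &&&/||| operators):

```c
#include <assert.h>

/* A "logic value" here is an int known to be exactly 0 or 1 (e.g. the
   result of a comparison).  For side-effect-free logic values, the
   bitwise and short-circuit operators produce identical results, which
   is what makes the &&&/||| substitution safe. */
static int fold_and(int lhs, int rhs) { return lhs & rhs; } /* &&& */
static int fold_or (int lhs, int rhs) { return lhs | rhs; } /* ||| */

/* Exhaustively check all 0/1 operand combinations. */
static int check_equivalence(void)
{
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++) {
            if (fold_and(a, b) != (a && b)) return 0;
            if (fold_or (a, b) != (a || b)) return 0;
        }
    return 1;
}
```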

>>
>> This did at least help some with speed, but was initially bad for code
>> density (each compare needs 2 operations, CMPxx+MOVT/MOVNT).
>>
>> Did partly compensate for the code-size increase by adding some
>> experimental 3R CMPxx ops:
>>    CMPQEQ, CMPQNE, CMPQGT, CMPQGE
>>
>> Currently only available in 64-bit forms, which can handle signed and
>> unsigned 32-bit values along with signed 64-bit values (unsigned
>> 64-bit would require a 3R CMPQHI instruction, and is likely to be
>> needed less often).
>>
>> Where:
>>    CMPQEQ Rs, Rt, Rn
>>    CMPQNE Rs, Rt, Rn
>>    CMPQGT Rs, Rt, Rn
>>    CMPQGE Rs, Rt, Rn
>> Does:
>>    Rn = (Rs == Rt);
>>    Rn = (Rs != Rt);
>>    Rn = (Rs >  Rt);
>>    Rn = (Rs >= Rt);
>> Where, < and <= can be done by flipping the arguments.
>>
> These instructions are also called 'set' instructions in some
> architectures. Useful enough to include IMO. Q+ calls them 'ZSxx' for
> zero or set (from the MMIX CPU) so they are not confused with
> instructions that only set, which are called 'Sxx' instructions. I think
> the Itanium calls them CMPxx instructions. I have been experimenting
> with the option of having them accumulate values like the Itanium does.
> Needs more opcode bits though.
>
> Q+ has
>    Rt = (Ra==Rb) ? Rc : 0;    // ZSEQ
>    Rt = (Ra==Rb) ? Imm8 : 0;
>    Rt = (Ra==Rb) ? Rc : Rt;   // SEQ
>    Rt = (Ra==Rb) ? Imm8 : Rt;
> Plus other ops besides ==

I called them CMPxx mostly because I already had the mnemonics, and the
distinction between a 2R op and a 3R op is "obvious enough" (and, unlike
RISC-V, I don't add new mnemonics to distinguish register from immediate
cases either).
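For reference, the 3R compare semantics above can be modeled in C
roughly as follows (a sketch; the function names are illustrative,
not actual mnemonics):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the 3R compare ops: Rn = (Rs op Rt), 0/1 result in a
   64-bit register.  Signed 64-bit compares also cover signed and
   (zero-extended) unsigned 32-bit values. */
static uint64_t cmpqeq(int64_t rs, int64_t rt) { return rs == rt; }
static uint64_t cmpqne(int64_t rs, int64_t rt) { return rs != rt; }
static uint64_t cmpqgt(int64_t rs, int64_t rt) { return rs >  rt; }
static uint64_t cmpqge(int64_t rs, int64_t rt) { return rs >= rt; }

/* < and <= need no opcodes of their own: flip the arguments. */
static uint64_t cmpqlt(int64_t rs, int64_t rt) { return cmpqgt(rt, rs); }
static uint64_t cmpqle(int64_t rs, int64_t rt) { return cmpqge(rt, rs); }
```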


Re: Misc: Preliminary (actual) performance: BJX2 vs RV64

Message-ID: <upfg7b$1va2n$1@dont-email.me>
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
 by: BGB - Thu, 1 Feb 2024 07:11 UTC

On 1/31/2024 9:01 PM, MitchAlsup wrote:
> BGB-Alt wrote:
>
>> On 1/22/2024 6:49 PM, MitchAlsup1 wrote:
>>> BGB-Alt wrote:
>>> <snip>
>
>> Also it would appear as-if the scheduling is assuming 1-cycle ALU and
>> 2-cycle load, vs 2-cycle ALU and 3-cycle load.
>
>> So, at least part of the problem is that GCC is generating code that
>> is not ideal for my pipeline.
>
> Captain Obvious strikes again.
>
>> Tried modeling what happens if RV64 had superscalar (in my emulator),
>> and the interlock issue gets worse, as then jumps up to around 23%-26%
>> interlock penalty (mostly eating any gains that superscalar would
>> bring). Where, it seems that superscalar (according to my CPU's rules)
>> would bundle around 10-15% of the RV64 ops with '-O3' (or, around
>> 8-12% with '-Os').
>
> You are running into the reasons CPU designers went OoO after the 2-wide
> in-order machine generation.
>

At the moment, it is bad enough to make me question whether even 2-wide
superscalar makes sense for RV64.

Like, if Instructions/Bundle jumps by 10% but Interlock-Cost jumps by
9%, then it would only gain 1% in terms of Instructions/Clock.

This would suck, and would not be worth the cost of adding all the
plumbing needed to support superscalar.

>> On the other hand, disabling WEX in BJX2 causes interlock penalties to
>> drop. So, it still maintains a performance advantage over RV, as the
>> drop in MIPs score is smaller.
>
> Your compiler is tuned to your pipeline.
> But how do you tune your compiler to EVERY conceivable pipeline ??
>

Possibly so.

Seems that since my CPU and compiler co-evolved, they fit together
reasonably well.

Meanwhile, GCC's output seems to assume a different-looking CPU, and is
at a natural disadvantage (independent of the respective "goodness" of
the ISAs in question).

So, it seems my ISA runs roughly 22% faster than RV64 on my CPU design,
with GCC's tuning being sub-optimal.

But, both would get a nice speed-up if the instruction latencies were
more in tune with what GCC seems to expect (and what is apparently
delivered by many of the RV64 chips).

So, in part, the comparatively high latency values seem to be hurting
performance.

>> Otherwise, had started work on trying to get RV64G support working, as
>> this would support a wider variety of programs than RV64IMA.
>
>
>
>> In another experiment, had added logic to fold && and || operators to
>> use bitwise arithmetic for logical expressions (in certain cases).
>> If both the LHS and RHS represent logical expressions with no side
>> effects;
>> If the LHS and RHS are not "too expensive" according to a cost
>> heuristic (past a certain size, it is cheaper to use short-circuit
>> branching rather than ALU operations).
>
>> Internally, this added various pseudo operators to the compiler:
>>    &&&, |||: Logical AND/OR expressed as bitwise.
>>    !& : !(a&b)
>>    !!&: !(!(a&b)), Normal TEST operator, with a logic result.
>>      Exists to be distinct from normal bitwise AND.
>
> For the inexpensive cases, PRED was designed to handle the && and ||
> of HLLs.

Mine didn't handle them, so generally predication only worked with
trivial conditionals:
  if(a<0)
    a=0;
Would use predication, but more complex cases:
  if((a<0) && (b>0))
    a=0;
Would not, and would always fall back to branching.

In the new mechanism, the latter case can partly be folded back into the
former, and can now allow parts of the conditional expression to be
subject to shuffling and bundling.

But, it seems, say:
  CMPxx; MOVT; CMPxx; MOVT; AND; BNE
is bulkier than, say:
  CMPxx; BF; CMPxx; BF;
and not always faster.
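For illustration, the fold can be seen in C terms as turning the
branching form into a masked select; a sketch (assuming side-effect-free
operands; helper names hypothetical):

```c
#include <assert.h>

/* Branching form vs. folded form of:  if ((a < 0) && (b > 0)) a = 0;
   The folded version computes both compares, ANDs the 0/1 results,
   and selects with a mask -- no branch, mirroring the
   CMPxx; MOVT; CMPxx; MOVT; AND sequence at the ISA level. */
static long clamp_if(long a, long b)
{
    if ((a < 0) && (b > 0))
        a = 0;
    return a;
}

static long clamp_folded(long a, long b)
{
    long p = (long)(a < 0) & (long)(b > 0); /* 0 or 1 */
    long m = -p;                            /* all-ones if p, else 0 */
    return a & ~m;                          /* 0 if p, else a */
}
```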

The CMP-3R ops partially address this, but the usefulness of the
immediate case is severely compromised with a small value range and only
a few possibilities.

But, don't really have the encoding space left over (in the 32-bit
space) to add "better" versions.

Like, say:
  CMPQEQ  Rm, Imm9u, Rn
  CMPQEQ  Rm, Imm9n, Rn
  CMPQNE  Rm, Imm9u, Rn
  CMPQNE  Rm, Imm9n, Rn
  CMPQGT  Rm, Imm9u, Rn
  CMPQGT  Rm, Imm9n, Rn
  CMPQGE  Rm, Imm9u, Rn
  CMPQGE  Rm, Imm9n, Rn
  CMPQLT  Rm, Imm9u, Rn
  CMPQLT  Rm, Imm9n, Rn
  CMPQLE  Rm, Imm9u, Rn
  CMPQLE  Rm, Imm9n, Rn

Would deal with all of the cases effectively (and with a single op), but
at present, there is no encoding space left to add these in the 32-bit
space (these would be a bit of an ask, even if the space did exist).

More viable would be (in XG2):
  CMPQEQ  Rm, Imm6s, Rn
  CMPQNE  Rm, Imm6s, Rn
  CMPQGT  Rm, Imm6s, Rn
  CMPQGE  Rm, Imm6s, Rn
  CMPQLT  Rm, Imm6s, Rn
  CMPQLE  Rm, Imm6s, Rn

This is lame, but still more than the current:
  CMPQEQ  Rm, Imm5u, Rn
  CMPQNE  Rm, Imm5u, Rn
  CMPQGT  Rm, Imm5u, Rn
But, can maybe re-add the GE case:
  CMPQGE  Rm, Imm5u, Rn

Theoretically, 6s could get around a 60% hit-rate (vs 40% for 5u). The
hit-rate for 6u is also pretty close. Having both 6u and 6n cases would
have a better hit-rate, but is a bit steeper in terms of encoding space
(and is unlikely to matter enough to justify burning 12 instruction
spots on it).

Though, there is still the option of throwing a Jumbo prefix on these
ops, getting, say:
  CMPQEQ Rm, Imm29s, Rn  //EQ, Wi=0
  CMPQNE Rm, Imm29s, Rn  //NE, Wi=0
  CMPQGT Rm, Imm29s, Rn  //GT, Wi=0
  CMPQGE Rm, Imm29s, Rn  //GE, Wi=0
  CMPQLT Rm, Imm29s, Rn  //GE, Wi=1
  CMPQLE Rm, Imm29s, Rn  //GT, Wi=1

  CMPQHI Rm, Imm29s, Rn  //EQ, Wi=1 (?)
  CMPQHS Rm, Imm29s, Rn  //NE, Wi=1 (?)

But... These would be 64-bit encodings, so would have the usual
tradeoffs/drawbacks of using 64-bit encodings...

Note that in XG2, the 'Wi' bit would otherwise serve as a sign extension
bit for the immediate (but, with a Jumbo-Imm prefix, the Ei bit serves
as the sign bit, and Wi would be left as a possible opcode bit, and/or
ignored...).

And, with WEX, would be hit/miss vs loading the values into registers
for the value-range of +/- 65535.

Also, the main reason GE was left out of the current batch of Imm5
forms was that it seemed to have a comparatively lower hit-rate than
EQ/NE/GT (though GE does better than GT for the 2-register case, it had
a lower hit-rate for compare-with-immediate).

Arguably, a case could be made for the unsigned compares; these were
left out as 64-bit unsigned compare is comparatively much rarer (and
64-bit signed compare works for 32-bit unsigned values where the ABI
keeps these values zero-extended, unlike the wonk that is RV64
apparently sign-extending 32-bit unsigned values to 64 bits).
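A small C sketch of why the signed 64-bit compare suffices for
zero-extended 32-bit unsigned values (illustrative, not compiler code):
both operands fall in [0, 2^32), where signed and unsigned ordering
agree.

```c
#include <assert.h>
#include <stdint.h>

/* If the ABI keeps 32-bit unsigned values zero-extended in 64-bit
   registers, both operands land in [0, 2^32), where signed and
   unsigned ordering coincide, so a signed 64-bit compare suffices. */
static int u32_gt_via_s64(uint32_t a, uint32_t b)
{
    int64_t ax = (int64_t)(uint64_t)a; /* zero-extend */
    int64_t bx = (int64_t)(uint64_t)b;
    return ax > bx;
}
```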

...

Re: Misc: Preliminary (actual) performance: BJX2 vs RV64

Message-ID: <upfing$1vlj1$1@dont-email.me>
From: robfi680@gmail.com (Robert Finch)
Newsgroups: comp.arch
 by: Robert Finch - Thu, 1 Feb 2024 07:54 UTC

On 2024-02-01 2:11 a.m., BGB wrote:
> On 1/31/2024 9:01 PM, MitchAlsup wrote:
>> BGB-Alt wrote:
<snip>
> But, don't really have the encoding space left over (in the 32-bit
> space) to add "better" versions.
<snip>
> ...
>
Sounds like you hit the 32-bit encoding crunch. I think going with a
wider instruction format for a 64-bit machine is a reasonable choice; I
think they got that right with the Itanium. Being limited to constants
under 12 bits costs extra instructions. If a significant percentage of
the constants need extra instructions, does using 32 bits really save
space? A decent compare-and-branch can be built in 40 bits, and
compare-and-branch is 10% of the instructions. If one looks at all the
extra bits required to use a 32-bit instruction instead of a 40-bit
one, the difference in code size is likely to be much smaller than the
25% difference in instruction size. I have been wanting to measure this
for a while. I have thought of switching to 41-bit instructions, as
three will fit into 128 bits and it may be possible to simplify the
fetch stage if bundles of 128 bits are fetched for a three-wide
machine. But the software for 41 bits is more challenging.
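The break-even arithmetic here can be made concrete: with N
instructions at 32 bits and a fraction f of them needing one extra
32-bit op to build a constant, total size is N*(1+f)*32 bits versus
N*40 bits, so the 40-bit format wins once f exceeds 25%. (And three
41-bit slots occupy 123 of 128 bundle bits, leaving 5 spare.) A quick
C sketch, with illustrative numbers only:

```c
#include <assert.h>

/* Total bits for n instructions in a 32-bit format, when num/den of
   them need one extra 32-bit instruction to build a constant. */
static long size32(long n, long num, long den)
{
    return n * 32 + (n * num / den) * 32;
}

/* Total bits for the same n instructions in a flat 40-bit format. */
static long size40(long n)
{
    return n * 40;
}
```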


Re: Misc: Preliminary (actual) performance: BJX2 vs RV64

Message-ID: <upfp8k$20n33$1@dont-email.me>
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
 by: BGB - Thu, 1 Feb 2024 09:45 UTC

On 2/1/2024 1:54 AM, Robert Finch wrote:
> On 2024-02-01 2:11 a.m., BGB wrote:
>> On 1/31/2024 9:01 PM, MitchAlsup wrote:
<snip>
> Sounds like you hit the 32-bit encoding crunch. I think going with a
> wider instruction format for a 64-bit machine is a reasonable choice. I
> think they got that right with the Itanium. Being limited to constants <
> 12 bits uses extra instructions. If significant percentage of the
> constants need extra instructions does using 32-bits really save space?
> A decent compare-and-branch can be built in 40-bits. Compare-and-branch
> is 10% of the instructions. If one looks at all the extra bits required
> to use a 32-bit instruction instead of a 40-bit one, the difference in
> code size is likely to be much smaller than the 25% difference in
> instruction bit size. I have been wanting to measure this for a while. I
> have thought of switching to 41-bit instructions as three will fit into
> 128-bits and it may be possible to simplify the fetch stage if bundles
> of 128-bits are fetched for a three-wide machine. But the software for
> 41-bits is more challenging.
>


Re: Misc: Preliminary (actual) performance: BJX2 vs RV64

Message-ID: <upftsc$21dnb$1@dont-email.me>
From: robfi680@gmail.com (Robert Finch)
Newsgroups: comp.arch
 by: Robert Finch - Thu, 1 Feb 2024 11:04 UTC

On 2024-02-01 4:45 a.m., BGB wrote:
> On 2/1/2024 1:54 AM, Robert Finch wrote:
>> On 2024-02-01 2:11 a.m., BGB wrote:
>>> On 1/31/2024 9:01 PM, MitchAlsup wrote:
>>>> BGB-Alt wrote:
>>>>
>>>>> On 1/22/2024 6:49 PM, MitchAlsup1 wrote:
>>>>>> BGB-Alt wrote:
>>>>>> <snip>
>>>>
>>>>> Also it would appear as-if the scheduling is assuming 1-cycle ALU
>>>>> and 2-cycle load, vs 2-cycle ALU and 3-cycle load.
>>>>
>>>>> So, at least part of the problem is that GCC is generating code
>>>>> that is not ideal for my pipeline.
>>>>
>>>> Captain Obvious strikes again.
>>>>
>>>>> Tried modeling what happens if RV64 had superscalar (in my
>>>>> emulator), and the interlock issue gets worse, as then jumps up to
>>>>> around 23%-26% interlock penalty (mostly eating any gains that
>>>>> superscalar would bring). Where, it seems that superscalar
>>>>> (according to my CPU's rules) would bundle around 10-15% of the
>>>>> RV64 ops with '-O3' (or, around 8-12% with '-Os').
>>>>
>>>> You are running into the reasons CPU designers went OoO after the
>>>> 2-wide
>>>> in-order machine generation.
>>>>
>>>
>>>
>>> At the moment, it is bad enough to make me question whether even
>>> 2-wide superscalar makes sense for RV64.
>>>
>>> Like, if Instructions/Bundle jumps by 10% but Interlock-Cost jumps by
>>> 9%, then it would only gain about 1% in terms of Instructions/Clock.
>>>
>>> This would suck, and would not be worth the cost of adding all the
>>> plumbing needed to support superscalar.
>>>
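The tradeoff quoted above can be sketched as a toy model (the 10%/9% figures are the ones from the paragraph; everything else is just the ratio):

```c
#include <assert.h>

/* Net speedup when bundling raises the issue rate but interlocks add
   stall cycles: relative instructions-per-clock is simply the ratio of
   the issue gain to the cycle-count growth. */
static double net_gain(double issue_gain, double stall_gain) {
    return issue_gain / stall_gain;
}
/* net_gain(1.10, 1.09) ~= 1.009: a +10% bundling rate against a +9%
   interlock cost nets only about a 1% IPC gain. */
```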
>>>
>>>>> On the other hand, disabling WEX in BJX2 causes interlock penalties
>>>>> to drop, so it still maintains a performance advantage over RV, as
>>>>> the drop in MIPS score is smaller.
>>>>
>>>> Your compiler is tuned to your pipeline.
>>>> But how do you tune your compiler to EVERY conceivable pipeline ??
>>>>
>>>
>>> Possibly so.
>>>
>>> Seems that since my CPU and compiler co-evolved, they fit together
>>> reasonably well.
>>>
>>> Meanwhile, GCC's output seems to assume a different-looking CPU, and
>>> is at a natural disadvantage (independent of the respective
>>> "goodness" of the ISAs in question).
>>>
>>>
>>> So, it seems like, my ISA runs roughly 22% faster than RV64 on my CPU
>>> design, with GCC's tuning being sub-optimal.
>>>
>>>
>>> But, both would get a nice speed up if the instruction latency were
>>> more in-tune with what GCC seems to expect (and what is apparently
>>> delivered by many of the RV64 chips).
>>>
>>> So, in part, the comparably high latency values seem to be hurting
>>> performance.
>>>
>>>
>>>>> Otherwise, had started work on trying to get RV64G support working,
>>>>> as this would support a wider variety of programs than RV64IMA.
>>>>
>>>>
>>>>
>>>>> In another experiment, had added logic to fold && and || operators
>>>>> to use bitwise arithmetic for logical expressions (in certain cases):
>>>>> if both the LHS and RHS are logical expressions with no side
>>>>> effects, and if neither is "too expensive" according to a cost
>>>>> heuristic (past a certain size, it is cheaper to use short-circuit
>>>>> branching rather than ALU operations).
>>>>
>>>>> Internally, this added various pseudo operators to the compiler:
>>>>>    &&&, |||: Logical AND/OR expressed as bitwise.
>>>>>    !& : !(a&b)
>>>>>    !!&: !(!(a&b)), Normal TEST operator, with a logic result.
>>>>>      Exists to be distinct from normal bitwise AND.
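A minimal C sketch of the folding being described, assuming both operands are side-effect-free (the compiler-internal "&&&" is shown here as a bitwise AND of the 0/1 comparison results):

```c
#include <assert.h>

/* Short-circuit form: typically compiles to CMPxx; BF; CMPxx; BF. */
static int f_branchy(int a, int b) {
    if ((a < 0) && (b > 0))
        return 1;
    return 0;
}

/* Folded form: both comparisons are evaluated unconditionally, and the
   0/1 results are combined with a bitwise AND (the "&&&" pseudo-op),
   leaving a single test of the combined value. */
static int f_folded(int a, int b) {
    return (a < 0) & (b > 0);
}
```

The transformation is only valid because neither comparison has side effects, matching the precondition stated above.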
>>>>
>>>> For the inexpensive cases, PRED was designed to handle the && and ||
>>>> of HLLs.
>>>
>>> Mine didn't handle them, so generally predication only worked with
>>> trivial conditionals:
>>>    if(a<0)
>>>      a=0;
>>> Would use predication, but more complex cases:
>>>    if((a<0) && (b>0))
>>>      a=0;
>>> Would not, and would always fall back to branching.
>>>
>>>
>>> In the new mechanism, the latter case can partly be folded back into
>>> the former, and can now allow parts of the conditional expression to
>>> be subject to shuffling and bundling.
>>>
>>> But, it seems, say:
>>>    CMPxx; MOVT; CMPxx; MOVT; AND; BNE
>>> Is more bulky than, say:
>>>    CMPxx; BF; CMPxx; BF;
>>> And, not always faster.
>>>
>>>
>>> The CMP-3R ops partially address this, but the usefulness of the
>>> immediate case is severely compromised with a small value range and
>>> only a few possibilities.
>>>
>>> But, don't really have the encoding space left over (in the 32-bit
>>> space) to add "better" versions.
>>>
>>> Like, say:
>>>    CMPQEQ  Rm, Imm9u, Rn
>>>    CMPQEQ  Rm, Imm9n, Rn
>>>    CMPQNE  Rm, Imm9u, Rn
>>>    CMPQNE  Rm, Imm9n, Rn
>>>    CMPQGT  Rm, Imm9u, Rn
>>>    CMPQGT  Rm, Imm9n, Rn
>>>    CMPQGE  Rm, Imm9u, Rn
>>>    CMPQGE  Rm, Imm9n, Rn
>>>    CMPQLT  Rm, Imm9u, Rn
>>>    CMPQLT  Rm, Imm9n, Rn
>>>    CMPQLE  Rm, Imm9u, Rn
>>>    CMPQLE  Rm, Imm9n, Rn
>>>
>>> Would deal with all of the cases effectively (and with a single op),
>>> but at present, there is no encoding space to add these in the 32-bit
>>> space (these would be a bit of an ask, even if the space did exist).
>>>
>>>
>>> More viable would be (in XG2):
>>>    CMPQEQ  Rm, Imm6s, Rn
>>>    CMPQNE  Rm, Imm6s, Rn
>>>    CMPQGT  Rm, Imm6s, Rn
>>>    CMPQGE  Rm, Imm6s, Rn
>>>    CMPQLT  Rm, Imm6s, Rn
>>>    CMPQLE  Rm, Imm6s, Rn
>>>
>>> But, while this is lame, it is still more than the current:
>>>    CMPQEQ  Rm, Imm5u, Rn
>>>    CMPQNE  Rm, Imm5u, Rn
>>>    CMPQGT  Rm, Imm5u, Rn
>>> But, can maybe re-add the GE case:
>>>    CMPQGE  Rm, Imm5u, Rn
>>>
>>>
>>> Theoretically, 6s could get around a 60% hit-rate (vs 40% for 5u).
>>> The hit-rate for 6u is also pretty close. Having both 6u and 6n cases
>>> would have a better hit-rate, but is a bit more steep in terms of
>>> encoding space (and is unlikely to matter enough to justify burning
>>> 12 instruction spots on it).
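The encoding ranges behind those hit-rate figures, as predicate sketches (the 40%/60% numbers themselves come from the compiler's own statistics, not from this code):

```c
#include <assert.h>

/* Does a compare-immediate value fit each candidate encoding?
   Imm5u covers 0..31, Imm6s covers -32..31, Imm6u covers 0..63. */
static int fits_imm5u(long v) { return v >= 0   && v <= 31; }
static int fits_imm6s(long v) { return v >= -32 && v <= 31; }
static int fits_imm6u(long v) { return v >= 0   && v <= 63; }

/* Every Imm5u value also fits Imm6s, and Imm6s additionally covers
   small negative constants (e.g. compares against -1), which is where
   the higher hit-rate comes from. */
```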
>>>
>>> Though, there is still the option of throwing a Jumbo prefix on these
>>> ops getting, say:
>>>    CMPQEQ Rm, Imm29s, Rn  //EQ, Wi=0
>>>    CMPQNE Rm, Imm29s, Rn  //NE, Wi=0
>>>    CMPQGT Rm, Imm29s, Rn  //GT, Wi=0
>>>    CMPQGE Rm, Imm29s, Rn  //GE, Wi=0
>>>    CMPQLT Rm, Imm29s, Rn  //GE, Wi=1
>>>    CMPQLE Rm, Imm29s, Rn  //GT, Wi=1
>>>
>>>    CMPQHI Rm, Imm29s, Rn  //EQ, Wi=1 (?)
>>>    CMPQHS Rm, Imm29s, Rn  //NE, Wi=1 (?)
>>>
>>> But... These would be 64-bit encodings, so would have the usual
>>> tradeoffs/drawbacks of using 64-bit encodings...
>>>
>>> Note that in XG2, the 'Wi' bit would otherwise serve as a sign
>>> extension bit for the immediate (but, with a Jumbo-Imm prefix, the Ei
>>> bit serves as the sign bit, and Wi would be left as a possible opcode
>>> bit, and/or ignored...).
>>>
>>>
>>> And, with WEX, would be hit/miss vs loading the values into registers
>>> for the value-range of +/- 65535.
>>>
>>>
>>> Also, the main reason GE was left out of the current batch of Imm5
>>> forms was that it seemed to have a comparably lower hit-rate than
>>> EQ/NE/GT (though GE does better than GT for the 2-register case, it
>>> had a lower hit-rate for compare-with-immediate).
>>>
>>>
>>> Arguably, a case could be made for the unsigned compares; these were
>>> left out because 64-bit unsigned compare is comparably much rarer
>>> (and 64-bit signed compare works for 32-bit unsigned values, in the
>>> case where the ABI keeps these values zero-extended, unlike the wonk
>>> that is RV64 apparently sign-extending 32-bit unsigned values to 64
>>> bits).
>>>
>>> ...
>>>
>>>
>> Sounds like you hit the 32-bit encoding crunch. I think going with a
>> wider instruction format for a 64-bit machine is a reasonable choice;
>> I think they got that right with the Itanium. Being limited to
>> constants < 12 bits costs extra instructions. If a significant
>> percentage of the constants needs extra instructions, does using
>> 32 bits really save space? A decent compare-and-branch can be built
>> in 40 bits, and compare-and-branch is about 10% of the instructions.
>> If one looks at all the extra bits required to use a 32-bit
>> instruction instead of a 40-bit one, the difference in code size is
>> likely to be much smaller than the 25% difference in instruction bit
>> size. I have been wanting to measure this for a while. I have thought
>> of switching to 41-bit instructions, as three will fit into 128 bits
>> and it may be possible to simplify the fetch stage if bundles of
>> 128 bits are fetched for a three-wide machine. But the software
>> handling for 41-bit instructions is more challenging.
>>
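The code-size argument can be made concrete with a toy model (f is the fraction of instructions needing one extra 32-bit constant-building instruction; the break-even point against uniform 40-bit instructions falls out as f = 0.25):

```c
#include <assert.h>

/* Total bits for n instructions under each scheme: 32-bit instructions
   where a fraction f each need one extra instruction for a constant,
   vs. uniform 40-bit instructions needing no extras. */
static double bits_32(double n, double f) { return 32.0 * n * (1.0 + f); }
static double bits_40(double n)           { return 40.0 * n; }

/* The 41-bit bundling observation: three 41-bit slots (123 bits) fit
   in a 128-bit fetch block, with 5 bits to spare. */
```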
>
> As can be noted, for 3RI Imm9 encodings in XG2, costs are:
>   2-bits: Bundle+Predicate Mode
>   12 bits: Rm/Rn register fields
>   9 bits: Immediate
>   9 bits: Remains for opcode/etc.
>
> For 3R instructions:
>   2-bits: Bundle+Predicate Mode
>   18 bits: Rm/Ro/Rn register fields
>   12 bits: Remains for opcode/etc.
>
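The bit budgets above as a quick check (XG2 uses 6-bit register fields, so two registers take 12 bits and three take 18):

```c
#include <assert.h>

/* XG2 32-bit encoding budgets, as itemized above. */
static int bits_3ri_imm9(void) { return 2 + 2*6 + 9 + 9; } /* mode + Rm/Rn + imm + opcode */
static int bits_3r(void)       { return 2 + 3*6 + 12;    } /* mode + Rm/Ro/Rn + opcode */
```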
> Though, given the ISA has other instructions:
>   32 spots: Load/Store (Disp9) and JCMP
>   16 spots: ALU 3R (Imm9)
>     The F2 block was split in half, with half going to 2RI Imm10 ops.
>
> The F0 block holds all of the 3R ops, with a theoretical 9-bit opcode
> space.
>   Though, 1/4 of the space was carved initially for Branch ops.
>     In the original encoding, they were Disp20.
>     In XG2, they are effectively Disp23.
>     Half of this space has been semi-reclaimed though.
>       The BT/BF ops were redefined as being encoded as BRA?T / BRA?F
>
> Parts of the 3R space were also carved out for 2R space, etc.
>
>
> The encoding space can be extended with Jumbo Prefixes.
>   Currently defined as FE and FF, with 24 bits of payload.
>   FE is solely "Make Disp/Imm field bigger".
>   FF is mostly "Make Opcode bigger, maybe also extend Immed".
>
> In XG2, there are theoretically a number of other jumbo prefixes:
>   1E/1F/3E/3F/5E/5F/7E/7F/9E/9F/BE/BF/DE/DF
> But, these are not yet defined for anything, and are reserved.
>
> There are also variants of the FA/FB block:
>   1A/1B/3A/3B/5A/5B/7A/7B/9A/9B/BA/BB/DA/DB
> Which are similarly reserved (each with a potential of 24 bits of payload).
>
>
> Status of the major blocks:
>   F0: Mostly full (3R Space)
>     0/1/2/3/4/5/6: Full
>     7/8/9: Partly used.
>     A/B: Still available
>     C/D: BRA/BSR
>     E/F: Semi reclaimed (former BT/BF ops)
>   F1: Basically full (LD/ST)
>   F2: Full as for 3RI Imm9 ops, some 2RI space remains.
>   F3: Unused, Intended as User-Extension-Block
>     Would likely follow same layout as F0 block.
>   F8: 2RI Imm16 ops, 6/8 used.
>   F9: Reserved
>     Likely more 3R space (similar to F0 Block)
>     3R ops may expand into F9 when F0 gets full.
>     Beyond then, dunno.
>     Probably no more Imm9 ops though.
>
> Note that:
>   F4..F7 mirrors F0..F3 (but, with the WEX flag set)
>   FA/FB are some niche, but used indirectly for alternative uses.
>   FC/FD mirror F8/F9;
>   FE/FF: Jumbo Prefixes
> The Ez block follows a similar layout, but represents predicated ops.
>   E0..E3: F0..F3, but Pred?T
>   E4..E7: F0..F3, but Pred?F
>   E8..E9: F8..F9, but Pred?T
>   EA..EB: F0, F2, but Pred?T and WEX
>   EC..ED: F8..F9, but Pred?F
>   EE..EF: F0, F2, but Pred?F and WEX
>
> In XG2, all blocks other than Ez/Fz mirror Ez/Fz, but used to encode
> Bit5 of the register field.
>
> In Baseline mode, these mostly encode 16-bit ops (where nominally,
> everything uses 5-bit register fields, and the handling of R32..R63 is
> hacky and only works with a limited subset of the ISA). The 7z and 9z
> blocks were reclaimed from 16-bit land; these had previously gone to a
> defunct 24-bit instruction experiment, which had in turn used them
> because initially "nothing of particular value" was in those parts of
> the 16-bit map.
>
>
> There has been a slowdown in adding new instructions, with more
> conservatism when they are added, mostly because there isn't a whole
> lot of encoding space left in the existing blocks.
>
> Apart from F3 and F9, the existing 32-bit encoding space is mostly used up.
>
>
> ...
>
>
Put some work into the compiler and got it to optimize some expressions
to use the dual-operation instructions. ATM it supports and_or, and_and,
or_or, and or_and. The HelloWorld! program produces the following.
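As a sketch of what such a dual-operation instruction fuses (the operand order and exact semantics here are assumptions for illustration, not the actual Q+ definitions):

```c
#include <assert.h>

/* Each dual-op collapses a dependent pair of logic instructions into
   one. Operand roles are hypothetical: the first two inputs feed the
   first operation, whose result feeds the second operation with c. */
static long and_or(long a, long b, long c)  { return (a & b) | c; }
static long and_and(long a, long b, long c) { return (a & b) & c; }
static long or_or(long a, long b, long c)   { return (a | b) | c; }
static long or_and(long a, long b, long c)  { return (a | b) & c; }
```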

