Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Dealing with variability
Date: Tue, 4 Jul 2023 12:31:22 -0500
Organization: A noiseless patient Spider
Lines: 328
Message-ID: <u81l1a$5inl$1@dont-email.me>
References: <u7uvlo$3oi68$1@dont-email.me>
<3fd7de29-b8d8-4faa-9509-055262fddc77n@googlegroups.com>
<u800bm$3vgok$1@dont-email.me> <u8199j$49lr$1@newsreader4.netcologne.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 4 Jul 2023 17:31:23 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="160f43a359ceee858cdc0ca906f6f698";
logging-data="183029"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/RhTXJWGg/7rHb4z3mblAh"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.12.0
Cancel-Lock: sha1:A7t89MrJn5Q3pJVm2Gk1gaPssdA=
Content-Language: en-US
In-Reply-To: <u8199j$49lr$1@newsreader4.netcologne.de>

On 7/4/2023 9:10 AM, Thomas Koenig wrote:
> BGB <cr88192@gmail.com> schrieb:
>> On 7/3/2023 4:55 PM, MitchAlsup wrote:
>>> On Monday, July 3, 2023 at 12:14:37 PM UTC-5, BGB wrote:
>>>> One annoyance with my project is that I can't run a core that can run
>>>> exactly the same code along the range of FPGA's I want to use:
>>>> XC7S25: Can only run a simplistic RISC-like core (with no FPU or MMU)
>>>> XC7S50: Currently running a 2-wide configuration with 32 GPRs
>>>> Reduced features, eg: no 128-bit ALU ops in this case, ...
>>>> XC7A100T: Can run 3-wide with 64 GPRs, no issue.
>>>> Can also run 128-bit ALU ops, ...
>>>>
>>>>
>>>>
>>>> Generally, one needs to rebuild the code for each configuration.
>>>> Code built for small configurations will perform worse on bigger
>>>> configurations. Code built for wide configurations will not run on small
>>>> configurations.
>>> <
>>> This is exactly why my architecture allows for the HW to perform the narrow
>>> to wide transformations. There is 1 ISA model for everything from 1-wide
>>> in-order to 8-wide GBOoO. Code that runs optimally on the 1-wide machine
>>> is within spitting distance of running optimally on the GBOoO.
>>> <
>>> Then the compiler can be targeted at just the ISA and be model ignorant.
>>
>> The issue is doing this effectively without causing LUT cost or timing
>> issues.
>>
>>
>> I have a possible way to execute 1-wide code as a 2-way superscalar, but
>> it would only really be practical on the larger profiles.
>
> If you want bundles, you could just execute the individual
> instructions sequentially on your smallest core. Unless you throw
> out the NOPs during decoding and give the CPU something else to
> do, you would then suffer the performance penalty for the NOPs
> you generate.
>

This part is not the main issue...

Also, luckily, my encoding scheme does not waste space with a bunch of
NOPs. Rather, the bundles are a variable number of 32-bit instruction
words with a wide-execute flag.

The pipeline also has interlock handling, so there is no real need for
"timing NOPs" either.

A much bigger issue is, for example, that the 2-wide and 3-wide profiles
assume the existence of a MOV.X instruction (Load/Store 128-bit pair),
which does not exist on 1-wide as it requires more register ports than
are available.

One would otherwise need some mechanism to decompose it in the pipeline
into a pair of MOV.Q instructions.
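
In a software model, such decode-time cracking might look roughly like
this (the micro-op struct is invented, and it assumes the pair is
Rn:Rn+1 with an 8-byte offset for the high half):

  #include <stdint.h>

  typedef struct { int op; int rd; int rb; int32_t disp; } uop;
  enum { UOP_MOVQ_LD };

  /* MOV.X Rn, (Rm, disp) -> MOV.Q Rn,   (Rm, disp)
                             MOV.Q Rn+1, (Rm, disp+8)
     i.e. the decoder emits two 64-bit micro-ops, so no more register
     ports are needed than for a plain MOV.Q. */
  static int crack_movx_load(int rn, int rm, int32_t disp, uop out[2])
  {
      out[0] = (uop){ UOP_MOVQ_LD, rn,     rm, disp     };
      out[1] = (uop){ UOP_MOVQ_LD, rn + 1, rm, disp + 8 };
      return 2;   /* number of micro-ops produced */
  }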

Though, MOV.X isn't an issue for WEX-3W code on WEX-2W.

Rather, it is that code built for WEX-3W tends to assume other
extensions (such as MULQ or ALUX). So, stuff breaks as soon as it hits
an attempt to use an Integer Divide instruction or 128-bit ALU op or
similar.

Trying to run 3W code on 2W will cause the core to fall back to scalar
operation, as the 3W bundling rules aren't valid for 2W.

One could require that the compiler not use these.

Jumbo operations were originally a problem for 1-wide, but I ended up
using a hack apparently similar to what MicroBlaze uses: the
Jumbo prefix itself behaves like a NOP as far as the EX stages are
concerned, but captures its state in a special internal register, which
is then combined with the following instruction. One has to flag
the prefixes as "No, an interrupt can't land here.", similar to the
bubbles generated by interlock stalls, so it is a "special" type of NOP.
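
In software-model terms, this looks something like the following (state
and names are invented; the actual pipeline keeps the equivalent in
latches):

  #include <stdint.h>

  typedef struct {
      uint64_t jumbo_bits;   /* payload latched from FE/FF prefixes  */
      int      jumbo_count;  /* number of pending prefixes           */
      int      no_irq;       /* "interrupt can't land here" flag     */
  } decode_state;

  static void decode_word(decode_state *st, uint32_t w)
  {
      unsigned blk = (w >> 24) & 0xFF;
      if (blk == 0xFE || blk == 0xFF) {
          /* NOP as far as EX is concerned, but latch the 24-bit
             payload and suppress interrupts at this boundary. */
          st->jumbo_bits = (st->jumbo_bits << 24) | (w & 0xFFFFFF);
          st->jumbo_count++;
          st->no_irq = 1;
          return;
      }
      /* ...pending jumbo_bits widen the immediate/opcode of 'w'
         here, then the prefix state is cleared... */
      st->jumbo_bits = 0; st->jumbo_count = 0; st->no_irq = 0;
  }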

For the 2-wide, the Jumbo prefixes are decoded the same as for 3-wide.

As-is, the 3rd decoder is limited to F8-block instructions, e.g.:
FEab-cdef-FE01-2345-F811-6789 MOV 0xABCDEF0123456789, R17
So, it can handle 96-bit 2RI MOV and ADD and similar, but not much else.

Limiting the 3rd decoder to the F8 block serves to shave off some LUTs
in this case.
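
Reading the example above, each FE prefix contributes 24 immediate bits
and the F8-block op the low 16, so the 64-bit value reassembles as
follows (where the register-extension bits come from is not shown here):

  #include <stdint.h>
  #include <assert.h>

  static uint64_t imm64_from_96bit(uint32_t p1, uint32_t p2, uint32_t op)
  {
      uint64_t hi  = p1 & 0xFFFFFF;  /* first  FE prefix -> bits 63..40 */
      uint64_t mid = p2 & 0xFFFFFF;  /* second FE prefix -> bits 39..16 */
      uint64_t lo  = op & 0x00FFFF;  /* F8-block Imm16   -> bits 15..0  */
      return (hi << 40) | (mid << 16) | lo;
  }

  int main(void)
  {
      /* FEab-cdef FE01-2345 F811-6789 -> MOV 0xABCDEF0123456789, R17 */
      assert(imm64_from_96bit(0xFEABCDEF, 0xFE012345, 0xF8116789)
             == 0xABCDEF0123456789ull);
      return 0;
  }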

For the most part, Imm57s ALU ops or Disp57s Load/Store, while
theoretically possible to encode, are "not a thing" at present.

Mostly because the number of times these cases come up (where an
Imm33s or Disp33s is insufficient) is "vanishingly small", so it is
debatable whether they are worth the cost of having them enabled in
the decoder (with the closest thing to a "killer app" being to use them
as immediate values for SIMD instructions).

Also, currently Disp33s is the largest format supported natively by the
AGU (going any bigger requires the use of ALU instructions).
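
On the compiler side, the test for whether a value still fits is just a
signed-33-bit range check before falling back to a constant load,
something like:

  #include <stdint.h>

  /* Signed 33-bit range: what an Imm33s / Disp33s can hold. */
  static int fits_simm33(int64_t v)
  {
      return v >= -((int64_t)1 << 32) && v < ((int64_t)1 << 32);
  }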

> Or, if you have a bit to spare, you can put in a hint that an
> instruction can be executed consecutively with the next (or the
> preceding) one. This would save you the logic needed to analyze
> register dependencies. Low-end implementations could just ignore
> the bit, middle-end implementations would use it and high-end
> implementations would not need it, wasting a bit.
>

This is pretty much how the encoding scheme works already, and was
designed to operate.

Achieving this scalability in practice isn't as easy.

One would need to also ensure that all of the cores have the same basic
feature set in terms of scalar instructions as well, which is where the
main problem lies with the profiles.

> Or, you could structure your ISA so that the dependency analysis
> becomes simple - always put the destination register and source
> register(s) in the same place, and have a opcode simple bit pattern
> to show which instructions have which registers. Then, having
> a two-wide in-order implementation would become simpler.
>

Luckily, register fields don't move around that much.

Going superscalar or OoO should be entirely possible with BJX2.

The bigger question is whether one can do a "good enough" superscalar
or OoO core on a Spartan or Artix FPGA to make it worthwhile.

The fastest (in terms of MHz) options thus far tend to be 1-wide RISC
cores (which one can run at ~ 100 MHz).

Or, I can run a VLIW core at 50 MHz, with enough "extra" to mostly
compensate for the lower clock-speed.

Also, at 50 and 100 MHz, performance is more limited by memory bandwidth
in most of my test cases than by clock-speed.

Also, relative performance will tank if the higher clock-speed is
achieved by reducing the L1 cache size (so, maintaining 16K or 32K of
L1 cache seems to be "fairly important" for performance).

An intermediate option is 75 MHz, but it is "very painful" trying to get
or keep the core passing timing at 75 MHz (with both 3-wide VLIW and
"decent" L1 caches).

If one drops to 25 MHz, the loss of clock-speed ends up hurting more
than anything one could likely gain from doing so.

Say, for example, even with minimal memory and instruction latency, Doom
(and similar) at 25 MHz would still run slower than at 50 MHz, shifting
from being memory-bound to being limited by how quickly it can execute
instructions (and it would rarely go much over 12 fps or so).

One can also do a 16-bit core and run it at around 200 MHz, but... this
also isn't terribly useful...

> Think of what a Boolean function "Can the instruction sequence a,b
> be executed consecutively" would look like. The simpler this is,
> the better for your analysis and your LUT count.
>
> A highly regular structure, like the RISC designs had, would certainly
> make this easier.

Most of my instructions follow a few forms (for 32-bit ops):
FZnm-ZeoZ //3R
FZnm-ZeZZ //2R
FZnm-Zeii //3RI (Imm9)
FZnZ-Zeii //2RI (Imm10)
FZZZ-ZeoZ //1R ("OP Ro", but Ro understood as Rn)
FZZn-iiii //2RI (Imm16)
FZii-Ziii //IMM (BRA and BSR)

Where Rn, Rm, and Ro are the 3 registers; the 'e' field adds bit 4 for
each register (laid out as Znmo).
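
So field extraction stays cheap and uniform; reading the nibble
positions off the notation above (and taking the 'e' nibble's lower
three bits as the n/m/o extension bits, which is my assumption):

  #include <stdint.h>

  /* Nibble positions per the FZnm-ZeoZ notation; bit 4 of each
     register comes from the 'e' nibble (order assumed: -,n,m,o). */
  static int reg_n(uint32_t w)
      { return ((w >> 20) & 15) | (((w >> 10) & 1) << 4); }
  static int reg_m(uint32_t w)
      { return ((w >> 16) & 15) | (((w >>  9) & 1) << 4); }
  static int reg_o(uint32_t w)
      { return ((w >>  4) & 15) | (((w >>  8) & 1) << 4); }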

This leaves the Imm16 block as an outlier with a different register
encoding from all the other ops. But, there was no good way around this
short of dog-chewing the immediate field, and in this case I prioritized
having a contiguous immediate over a consistent register field.

Unlike RISC-V, I tried to minimize dog-chewing the immediate fields.

I also prioritized a layout where I could express encodings sensibly in
a hexadecimal-based notation (rather than binary), and where there was
sufficient detail in the notation to be able to unambiguously encode or
decode the instructions (without needing a big blob of text or diagrams
to try to express how a given instruction in question laid out its fields).

There are also 16-bit ops, mostly following a pattern:
ZZnm //2R
ZZni //1RI (Imm4)
ZZnZ //1R
ZZii //Imm8 / Disp8
Znii //1RI (Imm8) (MOV and ADD)
Ziii //Imm12 (MOV Imm12, R0)

The 16-bit ops being mostly limited to R0..R15 (some Load/Store ops also
have variants to encode R16..R31). There are no 3R encodings in 16-bit land.

In the first 16-bit word, the high-byte ranges are:
F0: Mostly 3R, 2R, and 1R ops.
F1: Load/Store Disp9 ops
F2: ALU Imm9 ops.
F3: Reserved / Implementation-Extension
F4..F7: Repeat F0..F3, flagged as bundled.
F8: 2RI Imm16 ops
F9: Reserved
FA: MOV Imm24u, R0
FB: MOV Imm24n, R0
FC/FD: Repeat F8/F9, bundled
FE: Jumbo Prefix (Immed Extension)
FF: Jumbo Prefix (Opcode Extension)

The EZ block has a similar layout:
E0..E3: Repeat F0..F3, but "Execute if True"
E4..E7: Repeat F0..F3, but "Execute if False"
E8/E9: Repeat F8/F9, but "Execute if True"
EA/EB: Repeat F0 and F2, both "Execute if True" and Bundled
EC/ED: Repeat F8/F9, but "Execute if False"
EE/EF: Repeat F0 and F2, both "Execute if False" and Bundled

In the baseline ISA (no XGPR):
If the top 3 bits are not 111, it is a 16-bit op.

With XGPR enabled (in Baseline mode):
7ynm-ZeoZ: Repeat F0, but extend register fields to 6 bits.
9ynm-Zeii: Repeat F1/F2, but extend register fields to 6 bits.
The would-be Ro extension bit selects F2 or F1.
This was ugly, but minimized breakage to the existing ISA.
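
The length rule stays cheap to evaluate, roughly like the following
(ignoring jumbo prefixes, which extend the nominal fetch length):

  #include <stdint.h>

  /* Baseline-mode rule: Ez/Fz (top 3 bits 111) are 32-bit ops;
     with XGPR enabled, the 7y/9y escapes are 32-bit as well. */
  static int insn_length_bytes(uint16_t first_hword, int xgpr_enabled)
  {
      unsigned top4 = (first_hword >> 12) & 15;
      if ((top4 & 0xE) == 0xE)
          return 4;
      if (xgpr_enabled && (top4 == 0x7 || top4 == 0x9))
          return 4;
      return 2;
  }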

FA and FB are special.
FAii-iiii MOV Imm24u, R0
FBii-iiii MOV Imm24n, R0
FEii-iiii-FAii-iiii MOV Imm48u, R0
FEii-iiii-FBii-iiii MOV Imm48n, R0
FFii-iiii-FAii-iiii BRA Abs48
FFii-iiii-FBii-iiii BSR Abs48

These may not be predicated or bundled; the dominant use-case is
loading a constant to be used as an immediate for another instruction
when not using a Jumbo encoding (or when the instruction in question
doesn't actually have a form that takes an immediate).

As noted, trying to predicate or bundle these blocks leads to different
encodings (such as the Jumbo prefix).

The Abs48 branch mostly takes over the role where one would otherwise
need to do a 64-bit constant load and branch (though, this is still
needed in cases where the branch is also an Inter-ISA branch).
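
From the assembler side, and assuming (by analogy with the 96-bit case)
that the prefix carries the high 24 bits and the FA/FB word the low 24,
emitting these looks roughly like:

  #include <stdint.h>
  #include <stdio.h>

  static void emit(uint32_t w) { printf("  %08X\n", w); }

  static void emit_mov_imm48u_r0(uint64_t imm48)   /* FE + FA */
  {
      emit(0xFE000000 | (uint32_t)((imm48 >> 24) & 0xFFFFFF));
      emit(0xFA000000 | (uint32_t)( imm48        & 0xFFFFFF));
  }

  static void emit_bra_abs48(uint64_t target)      /* FF + FA */
  {
      emit(0xFF000000 | (uint32_t)((target >> 24) & 0xFFFFFF));
      emit(0xFA000000 | (uint32_t)( target        & 0xFFFFFF));
  }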

There is another, newer "XG2 Mode", which drops all the 16-bit ops,
using the high 3 bits to instead add a bit to each of the register
fields (though these bits are inverted). This is basically a "Native
64 GPR" encoding (but has the tradeoff of "slightly worse code density",
mostly due to the loss of 16-bit ops).

There is another, older subset called Fix32, which drops all the 16-bit
ops. I can note that Fix32 code can be decoded directly in both XG2
Mode and Baseline mode.

Implicitly, XG2 requires XGPR to be enabled, otherwise it "does not make
sense" (XG2 without XGPR being equivalent to Fix32).

For a possible dedicated "GPU Core", I had considered the possibility of
hard-wiring it into XG2 Mode (as hard-wired XG2 leads to a smaller
decoder than one which supports the baseline ISA).

Where, high 4 bits in XG2 mode:
0/2/4/6/8/A/C/E: Blocks with predicated instructions.
1/3/5/7/9/B/D/F: Blocks with unconditional and bundled instructions.

E and F represent the blocks where only R0..R31 are used; the rest of
the encoding space represents blocks where R32..R63 are used in one or
more of the register fields.
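
In decode terms (with the caveat that which inverted bit goes to which
register field is not spelled out above):

  #include <stdint.h>

  /* XG2 mode: even high nibble = predicated block, odd high nibble =
     unconditional/bundled; the inverted high 3 bits supply the
     R32..R63 extension bits, so E/F (111 -> 000) stay within R0..R31. */
  static int xg2_is_predicated(uint32_t w)
      { return ((w >> 28) & 1) == 0; }

  static int xg2_uses_high_regs(uint32_t w)
      { return ((~(w >> 29)) & 7) != 0; }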

I would have liked an encoding where it was possible to encode both
predication and bundling for the entire ISA, but it wasn't really
possible to fit this into 32 bits.

In practice, this isn't too big of an issue though. F0/F1/F2 are the
most heavily used blocks, and the F1 block is pretty much never used
with the WEX flag set (since Load/Store is only allowed in "Lane 1",
which in this case is as the last instruction in the bundle).

In cores where both XG2 and the alternate RISC-V decoder are enabled,
there is also an XG2RV mode which uses the XG2 encoding scheme, but
operating within RISC-V's register space and using RISC-V's C ABI (just
with F0..F31 mapped to R32..R63).

Though, the intended use-case for this was rendered effectively moot by
GCC's RISC-V port seemingly lacking support for both PIE/PIC binaries
and Shared Objects, so technically, there is not currently a way to run
native RISC-V binaries in such a way that XG2RV is "actually useful".

In effect, if I wanted to use GCC here, I would need to write my own
linker (and/or implement a proper RISC-V backend for BGBCC), which
mostly defeats the whole point...

And, cold-booting into RISC-V mode on the BJX2 core is still "kinda
useless"...
