Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Encoding saturating arithmetic
Date: Thu, 18 May 2023 12:59:49 -0500
Organization: A noiseless patient Spider
Lines: 258
Message-ID: <u45p2n$bbcu$2@dont-email.me>
References: <tl4k53$3f0sn$1@newsreader4.netcologne.de>
<tmhtv4$3lm4g$1@dont-email.me>
<32de14fd-1bbd-4dec-8689-46395ff2fe2en@googlegroups.com>
<u40gpi$3i77i$1@dont-email.me>
<13f7bf2d-fbb8-4b06-a5c4-6715add40c8an@googlegroups.com>
<u43523$3uml6$1@dont-email.me>
<3afa49f5-ce89-4a94-9aa1-d1c12cac5da7n@googlegroups.com>
<u447cg$5vgt$1@dont-email.me>
<66edf634-cc68-47ae-90f2-d4d422feca26n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 18 May 2023 17:59:51 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="2b7f7e0b92dbf5449a07a2cc8713764a";
logging-data="372126"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18rxNDZsTkcTOXbMCbVnmaS"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.10.1
Cancel-Lock: sha1:SNqZDUTN/CAL9KmrVKEKcniV6XY=
Content-Language: en-US
In-Reply-To: <66edf634-cc68-47ae-90f2-d4d422feca26n@googlegroups.com>

On 5/18/2023 4:08 AM, robf...@gmail.com wrote:
> On Wednesday, May 17, 2023 at 11:51:49 PM UTC-4, BGB wrote:
>> On 5/17/2023 3:13 PM, MitchAlsup wrote:
>>> On Wednesday, May 17, 2023 at 1:07:37 PM UTC-5, BGB wrote:
>>>> On 5/17/2023 10:51 AM, luke.l...@gmail.com wrote:
>>>>> On Tuesday, May 16, 2023 at 7:09:05 PM UTC+1, BGB wrote:
>>>>>> On 5/15/2023 8:46 PM, luke.l...@gmail.com wrote:
>>>>>>> chapter 7. after seeing how out of control this can get
>>>>>>> in the AndesSTAR DSP ISA i always feel uneasy whenever
>>>>>>> i see explicit saturation opcodes added to an ISA that
>>>>>>> only has 32 bits available for the instruction format.
>>>>>>>
>>>>>> I can note that I still don't have any dedicated saturating ops, but
>>>>>> this is partly for cost and timing concerns (and I haven't yet
>>>>>> encountered a case where I "strongly needed" saturating ops).
>>>>>
>>>>> if you are doing Video Encode/Decode (try AV1 for example)
>>>>> you'll need them to stand a chance of any kind of power-efficient
>>>>> operation.
>>>>>
>>>> There are usually workarounds, say, using the SIMD ops as 2.14 rather
>>>> than 0.16, and then clamping after the fact.
>>>> Say: High 2 bits:
>>>> 00: Value in range
>>>> 01: Value out of range on positive side, clamp to 3FFF
>>>> 11: Value out of range on negative side, clamp to 0000
>>>> 10: Ambiguous, shouldn't happen.
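
(As a rough C sketch of the clamp step above; this assumes unsigned
16-bit lanes holding 2.14 intermediates, with names of my own choosing:)

  #include <stdint.h>

  static inline uint16_t clamp_2_14(uint16_t v)
  {
      switch (v >> 14) {
      case 0:  return v;       /* 00: value in range */
      case 1:  return 0x3FFF;  /* 01: positive overflow, clamp to 3FFF */
      case 3:  return 0x0000;  /* 11: negative, clamp to 0000 */
      default: return v;       /* 10: ambiguous, shouldn't happen */
      }
  }
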
>>> <
>>> This brings to mind:: the application:::
>>> <
>>> CPUs try to achieve the highest frequency of operation and pipeline
>>> away logic delay problems--LDs are now 4 and 5 cycles rather than
>>> 2 (MIPS R3000), because that is where the performance is, as there is
>>> rarely enough parallelism to utilize more than a "few" cores.
>>> <
>> I have 3-cycle memory access.
>>
>> Early on, load/store was not pipelined (and would always take 3 clock
>> cycles), but slow memory ops were not ideal for performance. I had
>> extended the pipeline to 3 execute stages mostly as this allowed for
>> pipelining both load/store and also integer multiply.
>>
>>
>> If the pipeline were extended to 6 execute stages, this would also allow
>> for things like pipelined double-precision ops, or single-precision
>> multiply-accumulate.
>>
>> But, this would also require more complicated register forwarding, would
>> make branch mispredict slower, etc, so didn't seem worthwhile. In all,
>> it would likely end up hurting performance more than it would help.
>>
>>
>> As can be noted, current pipeline is roughly:
>> PF IF ID1 ID2 EX1 EX2 EX3 WB
>> Or:
>> PF IF ID RF EX1 EX2 EX3 WB
>>
>> Since ID2 doesn't actually decode anything, just fetches and forwards
>> register values in preparation for EX1.
>>
>> From what I can gather, it seems a fair number of other RISCs had also
>> ended up with a similar pipeline (which seems somewhat more common than
>> the classic 5-stage pipeline).
>>> GPUs on the other hand, seem to be content to stay near 1 GHz
>>> and just throw shader cores at the problem rather than fight for
>>> frequency. Since GPUs process embarrassingly parallel applications
>>> one can freely trade cores for frequency (and vice versa).
>>> <
>>> So, in GPUs, there are arithmetic designs that can fully absorb the
>>> delays of saturation, whereas in CPUs it is not so simple.
>>> <merciful snip>
>> For many use-cases, running at a lower clock speed and focusing more on
>> shoveling stuff through the pipeline may make more sense than trying to
>> run at a higher clock speed.
>>
>>
>> As noted before, I could get 100MHz, but (sadly), it would mean a 1-wide
>> RISC with fairly small L1 caches. Didn't really seem like a win, and I
>> can't really make the RAM any faster.
>>
>>
>> Though, it is very possible that programs like Doom and similar might do
>> better with a 100MHz RISC than a 50MHz VLIW.
>>
>> Things like "spin in a tight loop executing a relatively small number of
>> serially dependent instructions" is something where a 100MHz 1-wide core
>> has an obvious advantage over a 50MHz 3-wide core.
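
(Quick arithmetic: a chain of N serially dependent single-cycle ops takes
N cycles regardless of issue width, so it runs in N/100MHz vs N/50MHz,
and the narrower-but-faster core finishes in half the time.)
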
>>>>> and that's what i warned about: when you get down to it,
>>>>> saturation turns out to need to be applied to such a vast
>>>>> number of operations that it amounts to about half a bit's
>>>>> worth of encoding space.
>>>>>
>>>> OK.
>>>>
>>>> Doesn't mean I intend to add general saturation.
>>> <
>>> Your application is mid-way between CPUs and GPUs.
>>>
>> Probably true, and it seems like I am getting properties that at times
>> seem more GPU-like than CPU-like.
>>
>>
>> Then, I am still off trying to get RISC-V code running on top of BJX2 as
>> well.
>>
>> But, at the moment, the issue isn't so much with the RISC-V ISA per se,
>> so much as trying to get GCC to produce output that I can really use in
>> TestKern...
>>
>> Turns out that neither FDPIC nor PIE is supported on RISC-V; rather it
>> only really supports fixed-address binaries (with the libraries
>> apparently being statically linked into the binaries).
>>
>> People had apparently argued back and forth about whether to enable
>> shared objects and similar, but tended to leave it off
>> because dynamic linking is prone to breaking stuff.
>>
>> I hadn't imagined the situation would be anywhere near this weak...
>>
>>
>> I had sort of thought being able to have shared objects, PIE
>> executables, etc, was sort of the whole point of ELF.
>>
>> Also, the toolchain doesn't support PE/COFF for this target either
>> (apparently PE/COFF only being available for x86/ARM/SH4/etc).
>>
>> Where, typically, PE/COFF binaries have a base-relocation table, ...
>>
>>
>>
>> Most strategies for giving a program its own logical address space would
>> be kind of a pain for TestKern.
>>
>> I would need to decide between having multiple 48-bit address spaces, or
>> make use of the 96-bit address space; say, loading a RV64 process at,
>> say, 0000_0000xxxx_0000_0xxxxxxx or similar...
>>
>> Though, at least the 96-bit address space option means that the kernel
>> can still have pointers into the program's space (but, would mean that
>> stuff servicing system calls would need to start working with 128-bit
>> pointers).
>>
>> Well, at least short of other address space hacks, say:
>> 0000_00000123_0000_0xxxxxxx
>> Is mirrored at, say:
>> 7123_0xxxxxxx
>>
>> So that syscall handlers don't need to use bigger pointers, but the
>> program can still pretend to have its own virtual address space.
>>
>> Well, this or add some addressing hacks (say, a mode allowing
>> 0000_xxxxxxxx to be remapped within the larger 48-bit space).
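
(Sketching that mirroring in C, with field widths guessed from the hex
patterns above; names and exact widths are illustrative only:)

  #include <stdint.h>

  /* 96-bit form: high 48 bits hold the per-process value (e.g. 0x123),
     low 48 bits hold a 32-bit offset. The mirrored 48-bit alias packs
     a 0x7 top nibble, the low 12 bits of that value, and the offset. */
  static inline uint64_t mirror48(uint64_t asid, uint32_t ofs)
  {
      return (0x7ULL << 44) | ((asid & 0xFFF) << 32) | (uint64_t)ofs;
  }
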
>>
>>
>> I would rather have had PIE binaries or similar and not need to deal
>> with any of this...
>>
>>
>> Somehow, I didn't expect "Yeah, you can load stuff anywhere in the
>> address space" to actually be an issue...
>>
>>
>> I can note, by extension, that BGBCC's PEL4 output can be loaded
>> anywhere in the address space.
>>
>> Still mostly static-linking everything, but (unexpectedly) I am not
>> actually behind on this front (and the DLLs do actually exist, sort of;
>> even if at present they are more used as loadable modules than as OS
>> libraries).
>>
>> ...
>
> I think BJX2 is doing very well if data access is only three cycles.
>
> I think Thor is sitting at six-cycle data memory access; I$ access is single
> cycle. Data: 1 cycle to load the memory request queue, 1 to pull from the
> queue to the data cache, two to access the data cache, 1 to put the response
> into a response FIFO, and 1 to offload the response back into the CPU. I
> think there may also be an AGEN cycle happening too. There is probably at
> least one cycle that could be eliminated, but eliminating it would improve
> performance by about 4% overall and would likely cost clock cycle time.
> ATM, writes go all the way through to memory and therefore take a
> horrendous number of clock cycles, e.g. 30. Writes to some of the SoC
> devices are much faster.
>

Load/Store is hard-locked to the pipeline in my case (runs in lock-step
with everything else).

Pretty much everything that operates directly within the pipeline is
lock-stepped to the pipeline (and the core stalls entirely to handle L1
misses or similar).
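
In C-like pseudocode, the control logic amounts to something like the
following (signal and function names made up for illustration):

  /* whole-pipeline stall: nothing advances until the L1 answers */
  if (ex_has_mem_op && !l1_hit)
      hold_all_stages();      /* every stage keeps its current state */
  else
      advance_all_stages();   /* PF..WB all step together */
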

...

> I really need a larger FPGA for my designs; any suggestions? I broke 500k
> LUTs again and had to trim cores. I scrapped the wonderful register file
> that could load four registers at a time when I realized it looked like a
> 16-read-port file: 25,000 LUTs. A four-port register file is used now, with
> serial reads/writes for multi-register access: 2k LUTs. Same ISA,
> implemented differently.
>

Yeah...

With "all the features", my core is closer to 40k LUT.
A dual core setup currently uses 66% of an XC7A200T.
Single core fits on an XC7A100T at around 70%.

If I trim it down, it can fit on an XC7S50.

For these, this is with a core that does:
3-wide pipeline;
6R3W register file;
...

A simple RISC-like subset can fit onto an XC7S25.
But, at this point, may as well just use RV32I or similar...

Where, device capacities are, roughly:
  XC7A200T: 135k LUTs
  XC7A100T:  68k LUTs
  XC7S50:    34k LUTs
  XC7S25:    17k LUTs

Don't have a Kintex mostly because, even if I got the device itself, the
license needed for Vivado is expensive... (similar issue for Virtex).

And, I am not really a fan of piracy, and there are no FOSS alternatives
for Xilinx chips.

I had at one point built the design in Quartus for an Altera chip, but
didn't buy one.

Did at one point get a Lattice ECP5 based device, but didn't end up
using it, as:
Board turned out to have far less usable IO than I thought;
Couldn't make as much sense of the toolchain.

The minimal case for BJX2 requires only a 2R1W register file, but at that
point lacks much obvious advantage over RISC-V: it is like RISC-V with
predicated instructions and smaller immediate fields. Jumbo prefixes
don't work there, so one needs a multi-op sequence to load large
constants.
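
(Expressed in C, what such a sequence has to compute; 16-bit immediate
chunks are assumed here, though the exact ops and widths vary:)

  /* build a 64-bit constant from four 16-bit immediates, one
     instruction per step, vs. a single jumbo-prefixed constant load */
  uint64_t r;
  uint16_t imm0, imm1, imm2, imm3;  /* pieces of the constant */

  r = imm3;                /* top 16 bits */
  r = (r << 16) | imm2;
  r = (r << 16) | imm1;
  r = (r << 16) | imm0;    /* 4 ops instead of 1 */
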

There are also ISA design tradeoffs from BJX2 lacking a zero register:
various instructions (such as NEG, etc.) become unnecessary if one has
a zero register. OTOH, the number of usable GPRs is slightly bigger
(in 32-GPR mode; never mind that I currently have 64 GPRs), so it
roughly balances out.
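
(For a concrete case: in RISC-V, NEG is just a pseudo-instruction, with
"neg rd, rs" assembling as "sub rd, x0, rs"; a machine with a zero
register gets it for free, whereas BJX2 has to encode it separately.)
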

...
