Rocksolid Light

Re: Encoding saturating arithmetic


https://news.novabbs.org/computers/article-flat.php?id=32290&group=comp.arch#32290

Newsgroups: comp.arch
Date: Thu, 18 May 2023 16:12:39 -0700 (PDT)
In-Reply-To: <u45p2n$bbcu$2@dont-email.me>
References: <tl4k53$3f0sn$1@newsreader4.netcologne.de> <tmhtv4$3lm4g$1@dont-email.me>
<32de14fd-1bbd-4dec-8689-46395ff2fe2en@googlegroups.com> <u40gpi$3i77i$1@dont-email.me>
<13f7bf2d-fbb8-4b06-a5c4-6715add40c8an@googlegroups.com> <u43523$3uml6$1@dont-email.me>
<3afa49f5-ce89-4a94-9aa1-d1c12cac5da7n@googlegroups.com> <u447cg$5vgt$1@dont-email.me>
<66edf634-cc68-47ae-90f2-d4d422feca26n@googlegroups.com> <u45p2n$bbcu$2@dont-email.me>
Message-ID: <04b1b406-9ba8-43f1-a0bf-5f3a18f23162n@googlegroups.com>
Subject: Re: Encoding saturating arithmetic
From: luke.leighton@gmail.com (luke.l...@gmail.com)
Injection-Date: Thu, 18 May 2023 23:12:40 +0000

On Thursday, May 18, 2023 at 7:01:48 PM UTC+1, BGB wrote:
> On 5/18/2023 4:08 AM, robf...@gmail.com wrote:
> > On Wednesday, May 17, 2023 at 11:51:49 PM UTC-4, BGB wrote:
> >> On 5/17/2023 3:13 PM, MitchAlsup wrote:
> >>> On Wednesday, May 17, 2023 at 1:07:37 PM UTC-5, BGB wrote:
> >>>> On 5/17/2023 10:51 AM, luke.l...@gmail.com wrote:
> >>>>> On Tuesday, May 16, 2023 at 7:09:05 PM UTC+1, BGB wrote:
> >>>>>> On 5/15/2023 8:46 PM, luke.l...@gmail.com wrote:
> >>>>>>> chapter 7. after seeing how out of control this can get
> >>>>>>> in the AndesSTAR DSP ISA i always feel uneasy whenever
> >>>>>>> i see explicit saturation opcodes added to an ISA that
> >>>>>>> only has 32-bit available for instruction format.
> >>>>>>>
> >>>>>> I can note that I still don't have any dedicated saturating ops, but
> >>>>>> this is partly for cost and timing concerns (and I haven't yet
> >>>>>> encountered a case where I "strongly needed" saturating ops).
> >>>>>
> >>>>> if you are doing Video Encode/Decode (try AV1 for example)
> >>>>> you'll need them to stand a chance of any kind of power-efficient
> >>>>> operation.
> >>>>>
> >>>> There are usually workarounds, say, using the SIMD ops as 2.14 rather
> >>>> than 0.16, and then clamping after the fact.
> >>>> Say: High 2 bits:
> >>>> 00: Value in range
> >>>> 01: Value out of range on positive side, clamp to 3FFF
> >>>> 11: Value out of range on negative side, clamp to 0000
> >>>> 10: Ambiguous, shouldn't happen.
> >>> <
> >>> This brings to mind:: the application:::
> >>> <
> >>> CPUs try to achieve highest frequency of operation and pipeline
> >>> away logic delay problems--LDs are now 4 or 5 cycles rather than
> >>> 2 (MIPS R3000)--because that is where the performance is, as there
> >>> is rarely enough parallelism to utilize more than a "few" cores.
> >>> <
> >> I have 3-cycle memory access.
> >>
> >> Early on, load/store was not pipelined (and would always take 3 clock
> >> cycles), but slow memory ops were not ideal for performance. I had
> >> extended the pipeline to 3 execute stages mostly as this allowed for
> >> pipelining both load/store and also integer multiply.
> >>
> >>
> >> If the pipeline were extended to 6 execute stages, this would also allow
> >> for things like pipelined double-precision ops, or single-precision
> >> multiply-accumulate.
> >>
> >> But, this would also require more complicated register forwarding, would
> >> make branch mispredict slower, etc, so didn't seem worthwhile. In all,
> >> it would likely end up hurting performance more than it would help.
> >>
> >>
> >> As can be noted, current pipeline is roughly:
> >> PF IF ID1 ID2 EX1 EX2 EX3 WB
> >> Or:
> >> PF IF ID RF EX1 EX2 EX3 WB
> >>
> >> Since ID2 doesn't actually decode anything, just fetches and forwards
> >> register values in preparation for EX1.
> >>
> >> From what I can gather, it seems a fair number of other RISCs had also
> >> ended up with a similar pipeline (somewhat more so than the 5-stage
> >> pipeline).
> >>> GPUs on the other hand, seem to be content to stay near 1 GHz
> >>> and just throw shader cores at the problem rather than fight for
> >>> frequency. Since GPUs process embarrassingly parallel applications
> >>> one can freely trade cores for frequency (and vice versa).
> >>> <
> >>> So, in GPUs, there are arithmetic designs that can fully absorb the
> >>> delays of saturation, whereas in CPUs it is not so simple.
> >>> <merciful snip>
> >> For many use-cases, running at a lower clock speed and focusing more on
> >> shoveling stuff through the pipeline may make more sense than trying to
> >> run at a higher clock speed.
> >>
> >>
> >> As noted before, I could get 100MHz, but (sadly), it would mean a 1-wide
> >> RISC with fairly small L1 caches. Didn't really seem like a win, and I
> >> can't really make the RAM any faster.
> >>
> >>
> >> Though, it is very possible that programs like Doom and similar might do
> >> better with a 100MHz RISC than a 50MHz VLIW.
> >>
> >> Things like "spin in a tight loop executing a relatively small number of
> >> serially dependent instructions" is something where a 100MHz 1-wide core
> >> has an obvious advantage over a 50MHz 3-wide core.
> >>>>> and that's what i warned about: when you get down to it,
> >>>>> saturation turns out to need to be applied to such a vast
> >>>>> number of operations that it is about 0.5 of a bit's worth
> >>>>> of encoding needed.
> >>>>>
> >>>> OK.
> >>>>
> >>>> Doesn't mean I intend to add general saturation.
> >>> <
> >>> Your application is mid-way between CPUs and GPUs.
> >>>
> >> Probably true, and it seems like I am getting properties that at times
> >> seem more GPU-like than CPU-like.
> >>
> >>
> >> Then, I am still off trying to get RISC-V code running on top of BJX2 as
> >> well.
> >>
> >> But, at the moment, the issue isn't so much with the RISC-V ISA per se,
> >> so much as trying to get GCC to produce output that I can really use in
> >> TestKern...
> >>
> >> Turns out that neither FDPIC nor PIE is supported on RISC-V; rather it
> >> only really supports fixed-address binaries (with the libraries
> >> apparently being static linked into the binaries).
> >>
> >> People had apparently argued back and forth between whether to enable
> >> shared-objects and similar, but apparently tended to leave it off
> >> because dynamic linking is prone to breaking stuff.
> >>
> >> I hadn't imagined the situation would be anywhere near this weak...
> >>
> >>
> >> I had sort of thought being able to have shared objects, PIE
> >> executables, etc, was sort of the whole point of ELF.
> >>
> >> Also, the toolchain doesn't support PE/COFF for this target either
> >> (apparently PE/COFF only being available for x86/ARM/SH4/etc).
> >>
> >> Where, typically, PE/COFF binaries have a base-relocation table, ...
> >>
> >>
> >>
> >> Most strategies for giving a program its own logical address space would
> >> be kind of a pain for TestKern.
> >>
> >> I would need to decide between having multiple 48-bit address spaces, or
> >> make use of the 96-bit address space; say, loading a RV64 process at,
> >> say, 0000_0000xxxx_0000_0xxxxxxx or similar...
> >>
> >> Though, at least the 96-bit address space option means that the kernel
> >> can still have pointers into the program's space (but, would mean that
> >> stuff servicing system calls would need to start working with 128-bit
> >> pointers).
> >>
> >> Well, at least short of other address space hacks, say:
> >> 0000_00000123_0000_0xxxxxxx
> >> Is mirrored at, say:
> >> 7123_0xxxxxxx
> >>
> >> So that syscall handlers don't need to use bigger pointers, but the
> >> program can still pretend to have its own virtual address space.
> >>
> >> Well, this or add some addressing hacks (say, a mode allowing
> >> 0000_xxxxxxxx to be remapped within the larger 48-bit space).
> >>
> >>
> >> I would rather have had PIE binaries or similar and not need to deal
> >> with any of this...
> >>
> >>
> >> Somehow, I didn't expect "Yeah, you can load stuff anywhere in the
> >> address space" to actually be an issue...
> >>
> >>
> >> I can note, by extension, that BGBCC's PEL4 output can be loaded
> >> anywhere in the address space.
> >>
> >> Still mostly static-linking everything, but (unexpectedly) I am not
> >> actually behind on this front (and the DLLs do actually exist, sort of;
> >> even if at present they are more used as loadable modules than as OS
> >> libraries).
> >>
> >> ...
> >
> > I think BJX2 is doing very well if data access is only three cycles.
> >
> > I think Thor is sitting at six cycle data memory access, I$ access is single
> > cycle. Data: 1 to load the memory request queue. 1 to pull from the queue
> > to the data cache, two to access the data cache, 1 to put the response into
> > a response fifo and 1 to off load the response back in the cpu. I think
> > there may also be an agen cycle happening too. There is probably at least
> > one cycle that could be eliminated, but eliminating it would improve
> > performance by about 4% overall and likely cost clock cycle time. ATM
> > writes write all the way through to memory and therefore take a
> > horrendous number of clock cycles eg. 30. Writes to some of the SoC
> > devices are much faster.
> >
> Load/Store is hard-locked to the pipeline in my case (runs in lock-step
> with everything else).
>
> Pretty much everything that operates directly within the pipeline is
> lock-stepped to the pipeline (and the core stalls entirely to handle L1
> misses or similar).
>
> ...
> > I really need a larger FPGA for my designs, any suggestions? I broke 500k
> > LUTs again and had to trim cores. I scrapped the wonderful register file
> > that could load four registers at a time, when I realized it looked like a
> > 16-read port file. 25,000 LUTs. A four-port register file is used now with
> > serial reads / writes for multi-register access. 2k LUTs. Same ISA,
> > implemented differently.
> >
> Yeah...
>
>
> With "all the features", my core is closer to 40k LUT.
> A dual core setup currently uses 66% of an XC7A200T.
> Single core fits on an XC7A100T at around 70%.
>
> If I trim it down, it can fit on an XC7S50.
>
> For these, this is with a core that does:
> 3-wide pipeline;
> 6R3W register file;
> ...
>
> A simple RISC-like subset can fit onto an XC7S25.
> But, at this point, may as well just use RV32I or similar...
>
> Where, device capacities are, roughly:
> XC7A200T 135k LUTs
> XC7A100T 68k LUTs
> XC7S50 34k LUTs
> XC7S25 17k LUTs
>
> Don't have a Kintex mostly because, even if I got the device itself, the
> license needed for Vivado is expensive... (similar issue for Virtex).

wake me up again when you've installed this :)
https://github.com/openXC7

>
> And, I am not really a fan of piracy,

very sensible. you never know what gets downloaded

> and there are no FOSS alternatives for Xilinx chips.

yes there are. symbiflow is about 5-10x slower than
nextpnr-xilinx so don't bother with it, but nextpnr-xilinx
is pretty good

> Board turned out to have far less usable IO than I thought;
> Couldn't make as much sense of the toolchain.

when you use nextpnr-ecp5 it's effectively exactly the same
as using nextpnr-xilinx with a few FPGA-specific additional
(optional) parameters.

you *do* need an IO connection file; if you haven't
got a pre-built one you really *really* need to find one, or
use a (separate) HDL library/tool that will auto-build one for you.
i use nmigen(tm) so it is not a problem.

i've been using both of these FPGA toolchains - all entirely
FOSS - no proprietary licenses AT ALL because i don't need them -
since 2021.

i have a Digilent Arty A7-100t, a ULX3S-85F, a VERSA-ECP5-45,
and my next target (when i get round to creating the Platform
file) is the (now pretty much unobtainable) Digilent Nexys4 Video
(the one with the XC7A200T)

i am extremely surprised that nobody here has told you
about them: the nextpnr FOSS toolchain and associated
scripts are extremely common knowledge on the Libera.Chat IRC
channels that i frequent.

i also maintain pages and scripts that will allow you to
install all of that software from scratch into a chroot environment
if you so desire, but please bear in mind that although they
are supposed to be "reproducible build" scripts, it has been
over.... 6 months since we had a report from someone
following them.

https://libre-soc.org/HDL_workflow/nextpnr/
https://libre-soc.org/HDL_workflow/nextpnr-xilinx/

if those don't "work" you are at this
immediate time better off finding someone else's
pre-compiled FOSS package(s) and/or build-system(s),
such as the openXC7 ones. for the sole reason that i am
completely overloaded and cannot give you even paid
support let alone unpaid support, i sincerely apologise.

l.
