Rocksolid Light

devel / comp.arch / Re: Encoding saturating arithmetic

Subject  Author
* Re: Encoding saturating arithmetic  luke.l...@gmail.com
+* Re: Encoding saturating arithmetic  BGB
|+* Re: Encoding saturating arithmetic  MitchAlsup
||`- Re: Encoding saturating arithmetic  BGB
|`* Re: Encoding saturating arithmetic  luke.l...@gmail.com
| `* Re: Encoding saturating arithmetic  BGB
|  +* Re: Encoding saturating arithmetic  luke.l...@gmail.com
|  |`* Re: Encoding saturating arithmetic  MitchAlsup
|  | `* Re: Encoding saturating arithmetic  luke.l...@gmail.com
|  |  `* Re: Encoding saturating arithmetic  MitchAlsup
|  |   `- Re: Encoding saturating arithmetic  luke.l...@gmail.com
|  `* Re: Encoding saturating arithmetic  MitchAlsup
|   `* Re: Encoding saturating arithmetic  BGB
|    +* Re: Encoding saturating arithmetic  robf...@gmail.com
|    |+* Re: Encoding saturating arithmetic  luke.l...@gmail.com
|    ||`- Re: Encoding saturating arithmetic  Marcus
|    |`* Re: Encoding saturating arithmetic  BGB
|    | `* Re: Encoding saturating arithmetic  luke.l...@gmail.com
|    |  `- Re: Encoding saturating arithmetic  BGB
|    +* Re: Encoding saturating arithmetic  luke.l...@gmail.com
|    |`* Re: Encoding saturating arithmetic  BGB
|    | `- Re: Encoding saturating arithmetic  MitchAlsup
|    +* Re: Encoding saturating arithmetic  Scott Lurndal
|    |`* Re: Encoding saturating arithmetic  BGB
|    | `* Re: Encoding saturating arithmetic  MitchAlsup
|    |  `- Re: Encoding saturating arithmetic  BGB
|    `- Re: Encoding saturating arithmetic  MitchAlsup
`* Re: Encoding saturating arithmetic  Marcus
 +* Re: Encoding saturating arithmetic  luke.l...@gmail.com
 |+- Re: Encoding saturating arithmetic  Marcus
 |+- Re: Encoding saturating arithmetic  MitchAlsup
 |`* Re: Encoding saturating arithmetic  BGB
 | `* Re: Encoding saturating arithmetic  Brett
 |  `- Re: Encoding saturating arithmetic  BGB
 `* Re: Encoding saturating arithmetic  MitchAlsup
  +- Re: Encoding saturating arithmetic  luke.l...@gmail.com
  `* Re: Encoding saturating arithmetic  Marcus
   `* Re: Encoding saturating arithmetic  MitchAlsup
    `- Re: Encoding saturating arithmetic  MitchAlsup

Re: Encoding saturating arithmetic

<3cf555cc-6d7e-4c5f-9191-f7ecd84ae426n@googlegroups.com>
https://news.novabbs.org/devel/article-flat.php?id=32284&group=comp.arch#32284
Newsgroups: comp.arch
Date: Thu, 18 May 2023 10:58:54 -0700 (PDT)
In-Reply-To: <14e172b0-5b31-4f53-b4b5-a3f0b7d2a3ben@googlegroups.com>
Message-ID: <3cf555cc-6d7e-4c5f-9191-f7ecd84ae426n@googlegroups.com>
Subject: Re: Encoding saturating arithmetic
From: MitchAlsup@aol.com (MitchAlsup)
Lines: 27
 by: MitchAlsup - Thu, 18 May 2023 17:58 UTC

On Thursday, May 18, 2023 at 12:07:54 PM UTC-5, luke.l...@gmail.com wrote:
> On Thursday, May 18, 2023 at 5:52:18 PM UTC+1, Marcus wrote:
>
> > So, first of all I am a novice when it comes to CPU and ISA design, so I
> > don't claim that I've made even close to perfect decisions... ;-)
<
> pfhh, you an me both :) only been at this 4 years.
<
I have only been at this since 1980.......
<
One of the strangities of computer architecture is that by the time
you have been exposed to enough <ahem> "computer architecture"
to design your own, it is already time to retire........
<
> > In MRISC32, each packed SIMD unit is 1x32 bits, 2x16 bits or 4x8 bits,
> > for all eternity. (In MRISC64 the packed SIMD width would be doubled,
> > but that is another ISA and another story - no binary compatibility is
> > planned etc).
> discuss under new comp.arch thread? (please just not a reply-to
> with change-of-subject, google groups seriously creaking under
> the load). your ISA: you start it?
>
> l.

Re: Encoding saturating arithmetic

<u45p2n$bbcu$2@dont-email.me>
https://news.novabbs.org/devel/article-flat.php?id=32285&group=comp.arch#32285
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Encoding saturating arithmetic
Date: Thu, 18 May 2023 12:59:49 -0500
Organization: A noiseless patient Spider
Lines: 258
Message-ID: <u45p2n$bbcu$2@dont-email.me>
In-Reply-To: <66edf634-cc68-47ae-90f2-d4d422feca26n@googlegroups.com>
 by: BGB - Thu, 18 May 2023 17:59 UTC

On 5/18/2023 4:08 AM, robf...@gmail.com wrote:
> On Wednesday, May 17, 2023 at 11:51:49 PM UTC-4, BGB wrote:
>> On 5/17/2023 3:13 PM, MitchAlsup wrote:
>>> On Wednesday, May 17, 2023 at 1:07:37 PM UTC-5, BGB wrote:
>>>> On 5/17/2023 10:51 AM, luke.l...@gmail.com wrote:
>>>>> On Tuesday, May 16, 2023 at 7:09:05 PM UTC+1, BGB wrote:
>>>>>> On 5/15/2023 8:46 PM, luke.l...@gmail.com wrote:
>>>>>>> chapter 7. after seeing how out of control this can get
>>>>>>> in the AndesSTAR DSP ISA i always feel uneasy whenever
>>>>>>> i see explicit saturation opcodes added to an ISA that
>>>>>>> only has 32-bit available for instruction format.
>>>>>>>
>>>>>> I can note that I still don't have any dedicated saturating ops, but
>>>>>> this is partly for cost and timing concerns (and I haven't yet
>>>>>> encountered a case where I "strongly needed" saturating ops).
>>>>>
>>>>> if you are doing Video Encode/Decode (try AV1 for example)
>>>>> you'll need them to stand a chance of any kind of power-efficient
>>>>> operation.
>>>>>
>>>> There are usually workarounds, say, using the SIMD ops as 2.14 rather
>>>> than 0.16, and then clamping after the fact.
>>>> Say: High 2 bits:
>>>> 00: Value in range
>>>> 01: Value out of range on positive side, clamp to 3FFF
>>>> 11: Value out of range on negative side, clamp to 0000
>>>> 10: Ambiguous, shouldn't happen.
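[Editor's note: the 2.14-with-headroom workaround quoted above can be sketched per 16-bit lane in C. `clamp_2_14` is a hypothetical helper for illustration, not an actual BJX2 operation.]

```c
#include <stdint.h>

/* Arithmetic is done in 2.14 fixed point (2 bits of headroom) rather
 * than 0.16; afterward each 16-bit lane is clamped based on its high
 * 2 bits, reproducing the effect of a saturating op. */
static inline uint16_t clamp_2_14(uint16_t v)
{
    switch (v >> 14) {
    case 0:  return v;       /* 00: value in range */
    case 1:  return 0x3FFF;  /* 01: overflowed on the positive side */
    case 3:  return 0x0000;  /* 11: went negative, clamp to zero */
    default: return v;       /* 10: ambiguous, shouldn't happen */
    }
}
```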
>>> <
>>> This brings to mind:: the application:::
>>> <
>>> CPUs try to achieve highest frequency of operation and pipeline
>>> away logic delay problems--LDs are now 4 and 5 cycles rather than
>>> 2 (MIPS R3000); because that is where performance is as there is
>>> rarely enough parallelism to utilize more than a "few" cores.
>>> <
>> I have 3-cycle memory access.
>>
>> Early on, load/store was not pipelined (and would always take 3 clock
>> cycles), but slow memory ops were not ideal for performance. I had
>> extended the pipeline to 3 execute stages mostly as this allowed for
>> pipelining both load/store and also integer multiply.
>>
>>
>> If the pipeline were extended to 6 execute stages, this would also allow
>> for things like pipelined double-precision ops, or single-precision
>> multiply-accumulate.
>>
>> But, this would also require more complicated register forwarding, would
>> make branch mispredict slower, etc, so didn't seem worthwhile. In all,
>> it would likely end up hurting performance more than it would help.
>>
>>
>> As can be noted, current pipeline is roughly:
>> PF IF ID1 ID2 EX1 EX2 EX3 WB
>> Or:
>> PF IF ID RF EX1 EX2 EX3 WB
>>
>> Since ID2 doesn't actually decode anything, just fetches and forwards
>> register values in preparation for EX1.
>>
>> From what I can gather, it seems a fair number of other RISC's had also
>> ended up with a similar pipeline (somewhat more so than the 5-stage
>> pipeline).
>>> GPUs on the other hand, seem to be content to stay near 1 GHz
>>> and just throw shader cores at the problem rather than fight for
>>> frequency. Since GPUs process embarrassingly parallel applications
>>> one can freely trade cores for frequency (and vice versa).
>>> <
>>> So, in GPUs, there are arithmetic designs that can fully absorb the
>>> delays of saturation, whereas in CPUs it is not so simple.
>>> <merciful snip>
>> For many use-cases, running at a lower clock-cycle and focusing more on
>> shoveling stuff through the pipeline may make more sense than trying to
>> run at a higher clock speed.
>>
>>
>> As noted before, I could get 100MHz, but (sadly), it would mean a 1-wide
>> RISC with fairly small L1 caches. Didn't really seem like a win, and I
>> can't really make the RAM any faster.
>>
>>
>> Though, it is very possible that programs like Doom and similar might do
>> better with a 100MHz RISC than a 50MHz VLIW.
>>
>> Things like "spin in a tight loop executing a relatively small number of
>> serially dependent instructions" is something where a 100MHz 1-wide core
>> has an obvious advantage over a 50MHz 3-wide core.
>>>>> and that's what i warned about: when you get down to it,
>>>>> saturation turns out to need to be applied to such a vast
>>>>> number of operations that it is about 0.5 of a bit's worth
>>>>> of encoding needed.
>>>>>
>>>> OK.
>>>>
>>>> Doesn't mean I intend to add general saturation.
>>> <
>>> Your application is mid-way between CPUs and GPUs.
>>>
>> Probably true, and it seems like I am getting properties that at times
>> seem more GPU-like than CPU-like.
>>
>>
>> Then, I am still off trying to get RISC-V code running on top of BJX2 as
>> well.
>>
>> But, at the moment, the issue isn't so much with the RISC-V ISA per se,
>> so much as trying to get GCC to produce output that I can really use in
>> TestKern...
>>
>> Turns out that neither FDPIC nor PIE is supported on RISC-V; rather it
>> only really supports fixed-address binaries (with the libraries
>> apparently being static linked into the binaries).
>>
>> People had apparently argued back and forth between whether to enable
>> shared-objects and similar, but apparently tended to leave it off
>> because dynamic linking is prone to breaking stuff.
>>
>> I hadn't imagined the situation would be anywhere near this weak...
>>
>>
>> I had sort of thought being able to have shared objects, PIE
>> executables, etc, was sort of the whole point of ELF.
>>
>> Also, the toolchain doesn't support PE/COFF for this target either
>> (apparently PE/COFF only being available for x86/ARM/SH4/etc).
>>
>> Where, typically, PE/COFF binaries have a base-relocation table, ...
>>
>>
>>
>> Most strategies for giving a program its own logical address space would
>> be kind of a pain for TestKern.
>>
>> I would need to decide between having multiple 48-bit address spaces, or
>> make use of the 96-bit address space; say, loading a RV64 process at,
>> say, 0000_0000xxxx_0000_0xxxxxxx or similar...
>>
>> Though, at least the 96-bit address space option means that the kernel
>> can still have pointers into the program's space (but, would mean that
>> stuff servicing system calls would need to start working with 128-bit
>> pointers).
>>
>> Well, at least short of other address space hacks, say:
>> 0000_00000123_0000_0xxxxxxx
>> Is mirrored at, say:
>> 7123_0xxxxxxx
>>
>> So that syscall handlers don't need to use bigger pointers, but the
>> program can still pretend to have its own virtual address space.
>>
>> Well, this or add some addressing hacks (say, a mode allowing
>> 0000_xxxxxxxx to be remapped within the larger 48-bit space).
>>
>>
>> I would rather have had PIE binaries or similar and not need to deal
>> with any of this...
>>
>>
>> Somehow, I didn't expect "Yeah, you can load stuff anywhere in the
>> address space" to actually be an issue...
>>
>>
>> I can note, by extension, that BGBCC's PEL4 output can be loaded
>> anywhere in the address space.
>>
>> Still mostly static-linking everything, but (unexpectedly) I am not
>> actually behind on this front (and the DLLs do actually exist, sort of;
>> even if at present they are more used as loadable modules than as OS
>> libraries).
>>
>> ...
>
> I think BJX2 is doing very well if data access is only three cycles.
>
> I think Thor is sitting at six cycle data memory access, I$ access is single
> cycle. Data: 1 to load the memory request queue. 1 to pull from the queue
> to the data cache, two to access the data cache, 1 to put the response into
> a response fifo and 1 to off load the response back in the cpu. I think
> there may also be an agen cycle happening too. There is probably at least
> one cycle that could be eliminated, but eliminating it would improve
> performance by about 4% overall and likely cost clock cycle time. ATM
> writes write all the way through to memory and therefore take a
> horrendous number of clock cycles eg. 30. Writes to some of the SoC
> devices are much faster.
>


[remainder of article truncated by the archive]
Re: Encoding saturating arithmetic

<b8e4a413-ed43-4f4a-8638-353ff44c573an@googlegroups.com>
https://news.novabbs.org/devel/article-flat.php?id=32286&group=comp.arch#32286
Newsgroups: comp.arch
Date: Thu, 18 May 2023 11:12:57 -0700 (PDT)
In-Reply-To: <u45o06$b96e$1@dont-email.me>
Message-ID: <b8e4a413-ed43-4f4a-8638-353ff44c573an@googlegroups.com>
Subject: Re: Encoding saturating arithmetic
From: MitchAlsup@aol.com (MitchAlsup)
 by: MitchAlsup - Thu, 18 May 2023 18:12 UTC

On Thursday, May 18, 2023 at 12:43:06 PM UTC-5, BGB wrote:
> On 5/18/2023 8:48 AM, Scott Lurndal wrote:
> > BGB <cr8...@gmail.com> writes:

> >> I have 3-cycle memory access.
> >
> > To L1? Virtually indexed?
> >
> Yes, both.
>
> L1 D$ access has a 3-cycle latency, 1-cycle throughput (so, one memory
> access every clock-cycle in most cases).
>
> The L1 is indexed based on virtual address, though in this case it is a
> modulo-mapped direct-mapped cache, so as long as the virtual and
> physical pages have the same alignment, the difference becomes
> insignificant.
>
>
> With a 16K L1 D$ and 16K page size, there is no difference.
> Was using 32K for a while, but this makes timing more difficult.
> A 64K L1 D$ basically explodes timing.
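[Editor's note: the alias-free property BGB describes falls out of the arithmetic: with a direct-mapped cache no larger than the page size, every index bit lies inside the page offset, so the virtual and physical index always agree. A sketch with assumed parameters (16K cache; the 32-byte line size is a guess, not stated in the thread):]

```c
#include <stdint.h>

#define LINE_BITS  5                        /* assumed 32-byte cache lines */
#define DC_SIZE    (16 * 1024)              /* 16K direct-mapped L1 D$ */
#define DC_LINES   (DC_SIZE >> LINE_BITS)   /* 512 lines */

/* Index comes from address bits [13:5], entirely within a 16K page
 * offset, so indexing by virtual address is equivalent to indexing
 * by physical address. */
static inline unsigned dc_index(uint64_t addr)
{
    return (unsigned)((addr >> LINE_BITS) & (DC_LINES - 1));
}
```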
>
>
> Ideally, to support 32K and (possibly) 64K L1 caches, a 64K alignment is
> recommended. However, strict 64K alignment and/or a 64K page size is
> undesirable as it reduces memory efficiency (mostly in terms of the
> amount of padding space needed for mmap()/VirtualAlloc() and large
> object "malloc()", *).
>
> I ended up going with 16K pages mostly as this significantly reduced TLB
> miss rate without suffering the same adverse effects as 64K pages (and,
> in my testing, there was very little difference between 16K and 64K in
> terms of nominal TLB miss rate for a given size of TLB).
<
Conversely, I went with 8K pages to cut down on the number of layers in
the MMU tables. My sequence is easy to remember:: 8K, 8M, 8G, 8T, 8P, 8E.
And so are the bit positions: 13, 23, 33, 43, 53, 63.
<
Then (before actually) I put in level skipping in the tables and put a level in
the ROOT pointer. One can map an 8 Mbyte application using 1 page of MMU
tables.
<
Now, once one can skip levels in the tables and terminate the walk at any
level, one has "large pages" drop out for free. In these large table entries
there are unused address bits. So I used these as a limit on the number of
pages the large PTE points at. This enables one to use an 8G entry and only
map one 8K page (should anyone want to do something like that).
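[Editor's note: the 8K/8M/8G/8T progression follows from an 8K page holding 1024 8-byte PTEs: each level adds 10 index bits on top of the 13-bit page offset. A sketch of the index extraction; the field layout is illustrative, not Mitch's actual table format.]

```c
#include <stdint.h>

/* Level 0 indexes the leaf tables (8K pages); each higher level
 * indexes 10 more bits, giving level boundaries at bits
 * 13, 23, 33, 43, 53, 63. */
static inline unsigned pt_index(uint64_t va, int level)
{
    return (unsigned)((va >> (13 + 10 * level)) & 0x3FF);
}
```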
>
> Meanwhile: 8K was merely intermediate between 4K and 16K (still fairly
> high miss rate, but lower than 4K). 32K had basically similar miss rates
> to 16K and 64K, but worse memory overhead properties than 16K.
>
>
> *: There is an issue of basically how big of objects can be handled by
> allocating a memory block within a larger shared memory chunk, and when
> one effectively needs to invoke a "mmap()" call to allocate it in terms
> of pages. With 64K pages, one either needs to set this limit fairly high
> (in turn potentially wasting memory by the heap chunks being larger than
> ideal), or waste memory by the page-alloc cases near these transition
> points having a significant amount of the total object size just in the
> "wasted" memory at the end of the final page.
<
This is where using the unused bits in a PTE allows for limits.
>
> Say, if you want to malloc 67K and get 128K, this is a waste. If you get
> 80K, this is less of a waste.
<
72K is even less of a waste (first multiple of 8K larger than 67K)
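[Editor's note: the 67K-to-72K arithmetic above is just rounding up to a power-of-two page size; a one-line sketch:]

```c
#include <stdint.h>

/* Round an allocation size up to a power-of-two page size. */
static inline uint64_t round_up_to_page(uint64_t size, uint64_t page)
{
    return (size + page - 1) & ~(page - 1);
}
```

With 8K pages a 67K request rounds to 72K; with 16K pages it rounds to 80K; with 64K pages the same request rounds to 128K, which is exactly the waste being discussed.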

Re: Encoding saturating arithmetic

<3bdce9ab-8289-423b-bf63-e8061facdd3en@googlegroups.com>
https://news.novabbs.org/devel/article-flat.php?id=32287&group=comp.arch#32287
Newsgroups: comp.arch
Date: Thu, 18 May 2023 11:21:22 -0700 (PDT)
In-Reply-To: <u45orh$bbcu$1@dont-email.me>
Message-ID: <3bdce9ab-8289-423b-bf63-e8061facdd3en@googlegroups.com>
Subject: Re: Encoding saturating arithmetic
From: MitchAlsup@aol.com (MitchAlsup)
 by: MitchAlsup - Thu, 18 May 2023 18:21 UTC

On Thursday, May 18, 2023 at 12:57:46 PM UTC-5, BGB wrote:
> On 5/18/2023 7:43 AM, luke.l...@gmail.com wrote:
>
> > did you by any chance Micro-code it? did you put in some
> > internal re-writing into BJX2 internal operations, in some
> > fashion? i would be interested to hear if you did so, and how.
> > or, if it is a shared Micro-coding back-end with two disparate
> > front-end ISAs.
> >
> No micro-code, just an alternate decoder.
> There is no micro-code in my core at all, rather direct-logic for
> everything.
<
If you have a sequencer that could be compiled into a NOR-plane
or PLA, it could be called micro-code !! But Should it ??
<
If, on the other hand, it got compiled into gates, could it be called
micro-code ??
<
That is, I think the Wilkes term micro-code has run its course and
no longer represents a sequencer that is ultimately programmed.
And, by and large, there is scant support at the fabrication level
for programming the ROM that micro-code uses as its program.
{We used to program with Diffusion, later, Poly, later contacts,
then M1--all of which require new mask sets. Now all we get is
gates.}
<
But if you use "just gates" to define the ROM, is it still micro-code ??
<
Sorry for the thread Hijack.

Re: Encoding saturating arithmetic

<ffebbcee-0c0a-4a5e-8177-0a68802ed7f6n@googlegroups.com>
https://news.novabbs.org/devel/article-flat.php?id=32289&group=comp.arch#32289
Newsgroups: comp.arch
Date: Thu, 18 May 2023 15:57:29 -0700 (PDT)
In-Reply-To: <37b7d936-4ee2-4939-93cd-347d9427f773n@googlegroups.com>
Message-ID: <ffebbcee-0c0a-4a5e-8177-0a68802ed7f6n@googlegroups.com>
Subject: Re: Encoding saturating arithmetic
From: luke.leighton@gmail.com (luke.l...@gmail.com)
 by: luke.l...@gmail.com - Thu, 18 May 2023 22:57 UTC

On Thursday, May 18, 2023 at 6:56:42 PM UTC+1, MitchAlsup wrote:
> On Thursday, May 18, 2023 at 11:52:18 AM UTC-5, Marcus wrote:

> Like you, I prefer that SIMD-style calculations use natural register
> widths. Unlike you, I left SIMD out of my ISA and found (what I consider)
> a better alternative than {calculations}×{widths}×{special-properties}
> that accompany SIMD. Since memory references come with {widths}
> (and signed, unsigned semantics), and calculations are self describing,
> AND you have predication in the ISA, then synthesizing SIMD using VVM
> is actually straightforward.

by having an incredibly simple translation layer that goes "oh you asked
for a Vector of 13 8-bit operations? hmm, my registers and pipelines are all
32-bit wide, let me just subtract 4 from that 13 *automatically* for ya"

this is actually implemented in Broadcom VideoCore IV (something
like it), they call it "Virtual Vectors" or something, using 4 consecutive
cycles to pump *16* FP operations into *only a 4-SIMD-wide* FP32
back-end.

if you do not have a multiple of {elwidth}*{num_elements_left}=32
remaining SO WHAT!! just mask them out, and pass the very same
mask directly through to regfile byte-level Write-Enable lines.
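[Editor's note: the scheme luke describes, pumping a 13-element 8-bit vector through a 32-bit-wide (4-lane) backend and masking the tail lanes, can be sketched scalar-style. Saturating unsigned byte add is chosen as the example operation; the function name is illustrative.]

```c
#include <stdint.h>

/* Process n 8-bit elements through a notional 32-bit backend, 4 lanes
 * per pass. The final pass runs with only (n % 4) lanes enabled; the
 * per-lane mask corresponds to byte-level register-file write-enable
 * lines in hardware. */
static void vadd_sat_u8(uint8_t *dst, const uint8_t *a,
                        const uint8_t *b, int n)
{
    for (int i = 0; i < n; i += 4) {          /* one 32-bit group per pass */
        for (int lane = 0; lane < 4; lane++) {
            if (i + lane >= n)                /* tail element: lane masked */
                continue;
            unsigned s = a[i + lane] + b[i + lane];
            dst[i + lane] = (s > 255) ? 255 : (uint8_t)s;
        }
    }
}
```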

i was going to put this under a separate thread, oh well :)

> Sooner or later the R in RISC should stand for "reduced".
> It is my contention that any ISA with more than 200-ish instructions
> ceases to be RISC.

Power ISA SFFS Compliancy Subset (aka "Power ISA without the SIMD
hell") only barely misses that by 14 instructions. and Simple-V
only adds 5 more (Management Instructions I call them).

the reason it misses that target is because the Power ISA Architects
went for a RISC *internal* Micro-architecture, most notable from
an examination of the add instructions, which may be micro-coded
as a single actual operation, where the front-end extracts
(encodes room for) an invert-in, a carry-in= 0/1/XER.CA and
invert-out.

> I don't know of an architecture with SIMD instructions that fits under 200
> total instructions.

(none - but there do exist RISC ISAs that can exploit a
SIMD *backend* or multiple FUs thereof - behind the
scenes. named SVP64).

l.

Re: Encoding saturating arithmetic

<04b1b406-9ba8-43f1-a0bf-5f3a18f23162n@googlegroups.com>
https://news.novabbs.org/devel/article-flat.php?id=32290&group=comp.arch#32290
Newsgroups: comp.arch
Date: Thu, 18 May 2023 16:12:39 -0700 (PDT)
In-Reply-To: <u45p2n$bbcu$2@dont-email.me>
Message-ID: <04b1b406-9ba8-43f1-a0bf-5f3a18f23162n@googlegroups.com>
Subject: Re: Encoding saturating arithmetic
From: luke.leighton@gmail.com (luke.l...@gmail.com)
 by: luke.l...@gmail.com - Thu, 18 May 2023 23:12 UTC

On Thursday, May 18, 2023 at 7:01:48 PM UTC+1, BGB wrote:
> On 5/18/2023 4:08 AM, robf...@gmail.com wrote:
> > On Wednesday, May 17, 2023 at 11:51:49 PM UTC-4, BGB wrote:
> >> On 5/17/2023 3:13 PM, MitchAlsup wrote:
> >>> On Wednesday, May 17, 2023 at 1:07:37 PM UTC-5, BGB wrote:
> >>>> On 5/17/2023 10:51 AM, luke.l...@gmail.com wrote:
> >>>>> On Tuesday, May 16, 2023 at 7:09:05 PM UTC+1, BGB wrote:
> >>>>>> On 5/15/2023 8:46 PM, luke.l...@gmail.com wrote:
> >>>>>>> chapter 7. after seeing how out of control this can get
> >>>>>>> in the AndesSTAR DSP ISA i always feel uneasy whenever
> >>>>>>> i see explicit saturation opcodes added to an ISA that
> >>>>>>> only has 32-bit available for instruction format.
> >>>>>>>
> >>>>>> I can note that I still don't have any dedicated saturating ops, but
> >>>>>> this is partly for cost and timing concerns (and I haven't yet
> >>>>>> encountered a case where I "strongly needed" saturating ops).
> >>>>>
> >>>>> if you are doing Video Encode/Decode (try AV1 for example)
> >>>>> you'll need them to stand a chance of any kind of power-efficient
> >>>>> operation.
> >>>>>
> >>>> There are usually workarounds, say, using the SIMD ops as 2.14 rather
> >>>> than 0.16, and then clamping after the fact.
> >>>> Say: High 2 bits:
> >>>> 00: Value in range
> >>>> 01: Value out of range on positive side, clamp to 3FFF
> >>>> 11: Value out of range on negative side, clamp to 0000
> >>>> 10: Ambiguous, shouldn't happen.
> >>> <
> >>> This brings to mind:: the application:::
> >>> <
> >>> CPUs try to achieve highest frequency of operation and pipeline
> >>> away logic delay problems--LDs are now 4 and 5 cycles rather than
> >>> 2 (MIPS R3000); because that is where performance is as there is
> >>> rarely enough parallelism to utilize more than a "few" cores.
> >>> <
> >> I have 3-cycle memory access.
> >>
> >> Early on, load/store was not pipelined (and would always take 3 clock
> >> cycles), but slow memory ops were not ideal for performance. I had
> >> extended the pipeline to 3 execute stages mostly as this allowed for
> >> pipelining both load/store and also integer multiply.
> >>
> >>
> >> If the pipeline were extended to 6 execute stages, this would also allow
> >> for things like pipelined double-precision ops, or single-precision
> >> multiply-accumulate.
> >>
> >> But, this would also require more complicated register forwarding, would
> >> make branch mispredict slower, etc, so didn't seem worthwhile. In all,
> >> it would likely end up hurting performance more than it would help.
> >>
> >>
> >> As can be noted, current pipeline is roughly:
> >> PF IF ID1 ID2 EX1 EX2 EX3 WB
> >> Or:
> >> PF IF ID RF EX1 EX2 EX3 WB
> >>
> >> Since ID2 doesn't actually decode anything, just fetches and forwards
> >> register values in preparation for EX1.
> >>
> >> From what I can gather, it seems a fair number of other RISC's had also
> >> ended up with a similar pipeline (somewhat more so than the 5-stage
> >> pipeline).
> >>> GPUs on the other hand, seem to be content to stay near 1 GHz
> >>> and just throw shader cores at the problem rather than fight for
> >>> frequency. Since GPUs process embarrassingly parallel applications
> >>> one can freely trade cores for frequency (and vice versa).
> >>> <
> >>> So, in GPUs, there are arithmetic designs that can fully absorb the
> >>> delays of saturation, whereas in CPUs it is not so simple.
> >>> <merciful snip>
> >> For many use-cases, running at a lower clock speed and focusing more on
> >> shoveling stuff through the pipeline may make more sense than trying to
> >> run at a higher clock speed.
> >>
> >>
> >> As noted before, I could get 100MHz, but (sadly), it would mean a 1-wide
> >> RISC with fairly small L1 caches. Didn't really seem like a win, and I
> >> can't really make the RAM any faster.
> >>
> >>
> >> Though, it is very possible that programs like Doom and similar might do
> >> better with a 100MHz RISC than a 50MHz VLIW.
> >>
> >> Things like "spin in a tight loop executing a relatively small number of
> >> serially dependent instructions" is something where a 100MHz 1-wide core
> >> has an obvious advantage over a 50MHz 3-wide core.
> >>>>> and that's what i warned about: when you get down to it,
> >>>>> saturation turns out to need to be applied to such a vast
> >>>>> number of operations that it is about 0.5 of a bit's worth
> >>>>> of encoding needed.
> >>>>>
> >>>> OK.
> >>>>
> >>>> Doesn't mean I intend to add general saturation.
> >>> <
> >>> Your application is mid-way between CPUs and GPUs.
> >>>
> >> Probably true, and it seems like I am getting properties that at times
> >> seem more GPU-like than CPU-like.
> >>
> >>
> >> Then, I am still off trying to get RISC-V code running on top of BJX2 as
> >> well.
> >>
> >> But, at the moment, the issue isn't so much with the RISC-V ISA per se,
> >> so much as trying to get GCC to produce output that I can really use in
> >> TestKern...
> >>
> >> Turns out that neither FDPIC nor PIE is supported on RISC-V; rather it
> >> only really supports fixed-address binaries (with the libraries
> >> apparently being statically linked into the binaries).
> >>
> >> People had apparently argued back and forth between whether to enable
> >> shared-objects and similar, but apparently tended to leave it off
> >> because dynamic linking is prone to breaking stuff.
> >>
> >> I hadn't imagined the situation would be anywhere near this weak...
> >>
> >>
> >> I had sort of thought being able to have shared objects, PIE
> >> executables, etc, was sort of the whole point of ELF.
> >>
> >> Also, the toolchain doesn't support PE/COFF for this target either
> >> (apparently PE/COFF only being available for x86/ARM/SH4/etc).
> >>
> >> Where, typically, PE/COFF binaries have a base-relocation table, ...
> >>
> >>
> >>
> >> Most strategies for giving a program its own logical address space would
> >> be kind of a pain for TestKern.
> >>
> >> I would need to decide between having multiple 48-bit address spaces, or
> >> make use of the 96-bit address space; say, loading a RV64 process at,
> >> say, 0000_0000xxxx_0000_0xxxxxxx or similar...
> >>
> >> Though, at least the 96-bit address space option means that the kernel
> >> can still have pointers into the program's space (but, would mean that
> >> stuff servicing system calls would need to start working with 128-bit
> >> pointers).
> >>
> >> Well, at least short of other address space hacks, say:
> >> 0000_00000123_0000_0xxxxxxx
> >> Is mirrored at, say:
> >> 7123_0xxxxxxx
> >>
> >> So that syscall handlers don't need to use bigger pointers, but the
> >> program can still pretend to have its own virtual address space.
> >>
> >> Well, this or add some addressing hacks (say, a mode allowing
> >> 0000_xxxxxxxx to be remapped within the larger 48-bit space).
> >>
> >>
> >> I would rather have had PIE binaries or similar and not need to deal
> >> with any of this...
> >>
> >>
> >> Somehow, I didn't expect "Yeah, you can load stuff anywhere in the
> >> address space" to actually be an issue...
> >>
> >>
> >> I can note, by extension, that BGBCC's PEL4 output can be loaded
> >> anywhere in the address space.
> >>
> >> Still mostly static-linking everything, but (unexpectedly) I am not
> >> actually behind on this front (and the DLLs do actually exist, sort of;
> >> even if at present they are more used as loadable modules than as OS
> >> libraries).
> >>
> >> ...
> >
> > I think BJX2 is doing very well if data access is only three cycles.
> >
> > I think Thor is sitting at six-cycle data memory access; I$ access is
> > single-cycle. Data: 1 to load the memory request queue, 1 to pull from the
> > queue to the data cache, two to access the data cache, 1 to put the
> > response into a response FIFO, and 1 to offload the response back into the
> > CPU. I think there may also be an agen cycle happening too. There is
> > probably at least one cycle that could be eliminated, but eliminating it
> > would improve performance by about 4% overall and likely cost clock cycle
> > time. ATM, writes go all the way through to memory and therefore take a
> > horrendous number of clock cycles, e.g. 30. Writes to some of the SoC
> > devices are much faster.
> >
> Load/Store is hard-locked to the pipeline in my case (runs in lock-step
> with everything else).
>
> Pretty much everything that operates directly within the pipeline is
> lock-stepped to the pipeline (and the core stalls entirely to handle L1
> misses or similar).
>
> ...
> > I really need a larger FPGA for my designs, any suggestions? I broke 500k
> > LUTs again and had to trim cores. I scrapped the wonderful register file
> > that could load four registers at a time, when I realized it looked like a
> > 16-read port file. 25,000 LUTs. A four-port register file is used now with
> > serial reads / writes for multi-register access. 2k LUTs. Same ISA,
> > implemented differently.
> >
> Yeah...
>
>
> With "all the features", my core is closer to 40k LUT.
> A dual core setup currently uses 66% of an XC7A200T.
> Single core fits on an XC7A100T at around 70%.
>
> If I trim it down, it can fit on an XC7S50.
>
> For these, this is with a core that does:
> 3-wide pipeline;
> 6R3W register file;
> ...
>
> A simple RISC-like subset can fit onto an XC7S25.
> But, at this point, may as well just use RV32I or similar...
>
> Where, device capacities are, roughly:
> XC7A200T 135k LUTs
> XC7A100T 68k LUTs
> XC7S50 34k LUTs
> XC7S25 17k LUTs
>
> Don't have a Kintex mostly because, even if I got the device itself, the
> license needed for Vivado is expensive... (similar issue for Virtex).


Re: Encoding saturating arithmetic

<u46kr7$htgc$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=32291&group=comp.arch#32291

From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Encoding saturating arithmetic
Date: Thu, 18 May 2023 20:53:40 -0500
Message-ID: <u46kr7$htgc$1@dont-email.me>
References: <tl4k53$3f0sn$1@newsreader4.netcologne.de>
<tmhtv4$3lm4g$1@dont-email.me>
<32de14fd-1bbd-4dec-8689-46395ff2fe2en@googlegroups.com>
<u40gpi$3i77i$1@dont-email.me>
<13f7bf2d-fbb8-4b06-a5c4-6715add40c8an@googlegroups.com>
<u43523$3uml6$1@dont-email.me>
<3afa49f5-ce89-4a94-9aa1-d1c12cac5da7n@googlegroups.com>
<u447cg$5vgt$1@dont-email.me>
<66edf634-cc68-47ae-90f2-d4d422feca26n@googlegroups.com>
<u45p2n$bbcu$2@dont-email.me>
<04b1b406-9ba8-43f1-a0bf-5f3a18f23162n@googlegroups.com>
In-Reply-To: <04b1b406-9ba8-43f1-a0bf-5f3a18f23162n@googlegroups.com>
 by: BGB - Fri, 19 May 2023 01:53 UTC

On 5/18/2023 6:12 PM, luke.l...@gmail.com wrote:
> On Thursday, May 18, 2023 at 7:01:48 PM UTC+1, BGB wrote:
>> On 5/18/2023 4:08 AM, robf...@gmail.com wrote:
>>> On Wednesday, May 17, 2023 at 11:51:49 PM UTC-4, BGB wrote:
>>>> On 5/17/2023 3:13 PM, MitchAlsup wrote:
>>>>> On Wednesday, May 17, 2023 at 1:07:37 PM UTC-5, BGB wrote:
>>>>>> On 5/17/2023 10:51 AM, luke.l...@gmail.com wrote:
>>>>>>> On Tuesday, May 16, 2023 at 7:09:05 PM UTC+1, BGB wrote:
>>>>>>>> On 5/15/2023 8:46 PM, luke.l...@gmail.com wrote:
>>>>>>>>> chapter 7. after seeing how out of control this can get
>>>>>>>>> in the AndesSTAR DSP ISA i always feel uneasy whenever
>>>>>>>>> i see explicit saturation opcodes added to an ISA that
>>>>>>>>> only has 32-bit available for instruction format.
>>>>>>>>>
>>>>>>>> I can note that I still don't have any dedicated saturating ops, but
>>>>>>>> this is partly for cost and timing concerns (and I haven't yet
>>>>>>>> encountered a case where I "strongly needed" saturating ops).
>>>>>>>
>>>>>>> if you are doing Video Encode/Decode (try AV1 for example)
>>>>>>> you'll need them to stand a chance of any kind of power-efficient
>>>>>>> operation.
>>>>>>>
>>>>>> There are usually workarounds, say, using the SIMD ops as 2.14 rather
>>>>>> than 0.16, and then clamping after the fact.
>>>>>> Say: High 2 bits:
>>>>>> 00: Value in range
>>>>>> 01: Value out of range on positive side, clamp to 3FFF
>>>>>> 11: Value out of range on negative side, clamp to 0000
>>>>>> 10: Ambiguous, shouldn't happen.
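The clamping scheme quoted above can be sketched in a few lines of C: the 16-bit lanes are treated as 2.14 values with two bits of headroom, and the top two bits of each intermediate result select the clamp. This is a scalar sketch of the idea only (the function name is illustrative; the real version would be applied per SIMD lane).

```c
#include <stdint.h>

/* Clamp a 2.14 intermediate back into the 0..0x3FFF range.
 * The top two bits act as an overflow tag:
 *   00 -> value in range, keep as-is
 *   01 -> out of range on the positive side, clamp to 0x3FFF
 *   11 -> out of range on the negative side, clamp to 0x0000
 *   10 -> ambiguous, shouldn't happen for a single add/sub step
 */
static uint16_t clamp_2_14(uint16_t v)
{
    switch (v >> 14) {
    case 0:  return v;       /* 00: in range */
    case 1:  return 0x3FFF;  /* 01: positive overflow */
    default: return 0x0000;  /* 11 (and 10): negative */
    }
}
```

Two in-range values can then be added or subtracted freely in the headroom and clamped once after the fact, instead of needing a dedicated saturating op.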
>>>>> <
>>>>> This brings to mind:: the application:::
>>>>> <
>>>>> CPUs try to achieve highest frequency of operation and pipeline
>>>>> away logic delay problems--LDs are now 4 and 5 cycles rather than
>>>>> 2 (MIPS R3000); because that is where performance is as there is
>>>>> rarely enough parallelism to utilize more than a "few" cores.
>>>>> <
>>>> I have 3-cycle memory access.
>>>>
>>>> Early on, load/store was not pipelined (and would always take 3 clock
>>>> cycles), but slow memory ops were not ideal for performance. I had
>>>> extended the pipeline to 3 execute stages mostly as this allowed for
>>>> pipelining both load/store and also integer multiply.
>>>>
>>>>
>>>> If the pipeline were extended to 6 execute stages, this would also allow
>>>> for things like pipelined double-precision ops, or single-precision
>>>> multiply-accumulate.
>>>>
>>>> But, this would also require more complicated register forwarding, would
>>>> make branch mispredict slower, etc, so didn't seem worthwhile. In all,
>>>> it would likely end up hurting performance more than it would help.
>>>>
>>>>
>>>> As can be noted, current pipeline is roughly:
>>>> PF IF ID1 ID2 EX1 EX2 EX3 WB
>>>> Or:
>>>> PF IF ID RF EX1 EX2 EX3 WB
>>>>
>>>> Since ID2 doesn't actually decode anything, just fetches and forwards
>>>> register values in preparation for EX1.
>>>>
> >>>> From what I can gather, it seems a fair number of other RISCs had also
>>>> ended up with a similar pipeline (somewhat more so than the 5-stage
>>>> pipeline).
>>>>> GPUs on the other hand, seem to be content to stay near 1 GHz
>>>>> and just throw shader cores at the problem rather than fight for
>>>>> frequency. Since GPUs process embarrassingly parallel applications
>>>>> one can freely trade cores for frequency (and vice versa).
>>>>> <
> >>>>> So, in GPUs, there are arithmetic designs that can fully absorb
> >>>>> the delays of saturation, whereas in CPUs it is not so simple.
>>>>> <merciful snip>
> >>>> For many use-cases, running at a lower clock speed and focusing more on
>>>> shoveling stuff through the pipeline may make more sense than trying to
>>>> run at a higher clock speed.
>>>>
>>>>
>>>> As noted before, I could get 100MHz, but (sadly), it would mean a 1-wide
>>>> RISC with fairly small L1 caches. Didn't really seem like a win, and I
>>>> can't really make the RAM any faster.
>>>>
>>>>
>>>> Though, it is very possible that programs like Doom and similar might do
>>>> better with a 100MHz RISC than a 50MHz VLIW.
>>>>
>>>> Things like "spin in a tight loop executing a relatively small number of
>>>> serially dependent instructions" is something where a 100MHz 1-wide core
>>>> has an obvious advantage over a 50MHz 3-wide core.
>>>>>>> and that's what i warned about: when you get down to it,
>>>>>>> saturation turns out to need to be applied to such a vast
>>>>>>> number of operations that it is about 0.5 of a bit's worth
>>>>>>> of encoding needed.
>>>>>>>
>>>>>> OK.
>>>>>>
>>>>>> Doesn't mean I intend to add general saturation.
>>>>> <
>>>>> Your application is mid-way between CPUs and GPUs.
>>>>>
>>>> Probably true, and it seems like I am getting properties that at times
>>>> seem more GPU-like than CPU-like.
>>>>
>>>>
>>>> Then, I am still off trying to get RISC-V code running on top of BJX2 as
>>>> well.
>>>>
>>>> But, at the moment, the issue isn't so much with the RISC-V ISA per se,
>>>> so much as trying to get GCC to produce output that I can really use in
>>>> TestKern...
>>>>
>>>> Turns out that neither FDPIC nor PIE is supported on RISC-V; rather it
>>>> only really supports fixed-address binaries (with the libraries
> >>>> apparently being statically linked into the binaries).
>>>>
>>>> People had apparently argued back and forth between whether to enable
>>>> shared-objects and similar, but apparently tended to leave it off
>>>> because dynamic linking is prone to breaking stuff.
>>>>
>>>> I hadn't imagined the situation would be anywhere near this weak...
>>>>
>>>>
>>>> I had sort of thought being able to have shared objects, PIE
>>>> executables, etc, was sort of the whole point of ELF.
>>>>
>>>> Also, the toolchain doesn't support PE/COFF for this target either
>>>> (apparently PE/COFF only being available for x86/ARM/SH4/etc).
>>>>
>>>> Where, typically, PE/COFF binaries have a base-relocation table, ...
>>>>
>>>>
>>>>
>>>> Most strategies for giving a program its own logical address space would
>>>> be kind of a pain for TestKern.
>>>>
>>>> I would need to decide between having multiple 48-bit address spaces, or
>>>> make use of the 96-bit address space; say, loading a RV64 process at,
>>>> say, 0000_0000xxxx_0000_0xxxxxxx or similar...
>>>>
>>>> Though, at least the 96-bit address space option means that the kernel
>>>> can still have pointers into the program's space (but, would mean that
>>>> stuff servicing system calls would need to start working with 128-bit
>>>> pointers).
>>>>
>>>> Well, at least short of other address space hacks, say:
>>>> 0000_00000123_0000_0xxxxxxx
>>>> Is mirrored at, say:
>>>> 7123_0xxxxxxx
>>>>
>>>> So that syscall handlers don't need to use bigger pointers, but the
>>>> program can still pretend to have its own virtual address space.
>>>>
>>>> Well, this or add some addressing hacks (say, a mode allowing
>>>> 0000_xxxxxxxx to be remapped within the larger 48-bit space).
>>>>
>>>>
>>>> I would rather have had PIE binaries or similar and not need to deal
>>>> with any of this...
>>>>
>>>>
>>>> Somehow, I didn't expect "Yeah, you can load stuff anywhere in the
>>>> address space" to actually be an issue...
>>>>
>>>>
>>>> I can note, by extension, that BGBCC's PEL4 output can be loaded
>>>> anywhere in the address space.
>>>>
>>>> Still mostly static-linking everything, but (unexpectedly) I am not
>>>> actually behind on this front (and the DLLs do actually exist, sort of;
>>>> even if at present they are more used as loadable modules than as OS
>>>> libraries).
>>>>
>>>> ...
>>>
>>> I think BJX2 is doing very well if data access is only three cycles.
>>>
> >>> I think Thor is sitting at six-cycle data memory access; I$ access is
> >>> single-cycle. Data: 1 to load the memory request queue, 1 to pull from the
> >>> queue to the data cache, two to access the data cache, 1 to put the
> >>> response into a response FIFO, and 1 to offload the response back into the
> >>> CPU. I think there may also be an agen cycle happening too. There is
> >>> probably at least one cycle that could be eliminated, but eliminating it
> >>> would improve performance by about 4% overall and likely cost clock cycle
> >>> time. ATM, writes go all the way through to memory and therefore take a
> >>> horrendous number of clock cycles, e.g. 30. Writes to some of the SoC
> >>> devices are much faster.
>>>
>> Load/Store is hard-locked to the pipeline in my case (runs in lock-step
>> with everything else).
>>
>> Pretty much everything that operates directly within the pipeline is
>> lock-stepped to the pipeline (and the core stalls entirely to handle L1
>> misses or similar).
>>
>> ...
>>> I really need a larger FPGA for my designs, any suggestions? I broke 500k
>>> LUTs again and had to trim cores. I scrapped the wonderful register file
>>> that could load four registers at a time, when I realized it looked like a
>>> 16-read port file. 25,000 LUTs. A four-port register file is used now with
>>> serial reads / writes for multi-register access. 2k LUTs. Same ISA,
>>> implemented differently.
>>>
>> Yeah...
>>
>>
>> With "all the features", my core is closer to 40k LUT.
>> A dual core setup currently uses 66% of an XC7A200T.
>> Single core fits on an XC7A100T at around 70%.
>>
>> If I trim it down, it can fit on an XC7S50.
>>
>> For these, this is with a core that does:
>> 3-wide pipeline;
>> 6R3W register file;
>> ...
>>
>> A simple RISC-like subset can fit onto an XC7S25.
>> But, at this point, may as well just use RV32I or similar...
>>
>> Where, device capacities are, roughly:
>> XC7A200T 135k LUTs
>> XC7A100T 68k LUTs
>> XC7S50 34k LUTs
>> XC7S25 17k LUTs
>>
>> Don't have a Kintex mostly because, even if I got the device itself, the
>> license needed for Vivado is expensive... (similar issue for Virtex).
>
> wake me up again when you've installed this :)
> https://github.com/openXC7
>


Re: Encoding saturating arithmetic

<u46ojo$i9ml$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=32293&group=comp.arch#32293

From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Encoding saturating arithmetic
Date: Thu, 18 May 2023 21:57:58 -0500
Message-ID: <u46ojo$i9ml$1@dont-email.me>
References: <tl4k53$3f0sn$1@newsreader4.netcologne.de>
<tmhtv4$3lm4g$1@dont-email.me>
<32de14fd-1bbd-4dec-8689-46395ff2fe2en@googlegroups.com>
<u40gpi$3i77i$1@dont-email.me>
<13f7bf2d-fbb8-4b06-a5c4-6715add40c8an@googlegroups.com>
<u43523$3uml6$1@dont-email.me>
<3afa49f5-ce89-4a94-9aa1-d1c12cac5da7n@googlegroups.com>
<u447cg$5vgt$1@dont-email.me> <41q9M.259746$qpNc.185276@fx03.iad>
<u45o06$b96e$1@dont-email.me>
<b8e4a413-ed43-4f4a-8638-353ff44c573an@googlegroups.com>
In-Reply-To: <b8e4a413-ed43-4f4a-8638-353ff44c573an@googlegroups.com>
 by: BGB - Fri, 19 May 2023 02:57 UTC

On 5/18/2023 1:12 PM, MitchAlsup wrote:
> On Thursday, May 18, 2023 at 12:43:06 PM UTC-5, BGB wrote:
>> On 5/18/2023 8:48 AM, Scott Lurndal wrote:
>>> BGB <cr8...@gmail.com> writes:
>
>>>> I have 3-cycle memory access.
>>>
>>> To L1? Virtually indexed?
>>>
>> Yes, both.
>>
>> L1 D$ access has a 3-cycle latency, 1-cycle throughput (so, one memory
>> access every clock-cycle in most cases).
>>
>> The L1 is indexed based on virtual address, though in this case it is a
>> modulo-mapped direct-mapped cache, so as long as the virtual and
>> physical pages have the same alignment, the difference becomes
>> insignificant.
>>
>>
>> With a 16K L1 D$ and 16K page size, there is no difference.
>> Was using 32K for a while, but this makes timing more difficult.
>> A 64K L1 D$ basically explodes timing.
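The virtually-indexed, modulo-mapped lookup quoted above can be sketched as follows, assuming a 16K direct-mapped cache with 32-byte lines (the line size is an illustrative assumption, not stated in the post). The point is that the index bits [13:5] lie entirely within the 14-bit page offset, so virtual and physical indexing agree and no aliasing can occur.

```c
#include <stdint.h>

/* Direct-mapped, virtually-indexed L1 lookup (sketch).
 * 16 KiB cache / 32-byte lines -> 512 sets, index = vaddr bits [13:5].
 * With a 16 KiB page size those bits sit inside the page offset, so the
 * virtual and physical index are identical.
 */
#define L1_LINE_BITS 5
#define L1_SETS      (16384 >> L1_LINE_BITS)   /* 512 sets */

static uint32_t l1_index(uint64_t vaddr)
{
    return (uint32_t)(vaddr >> L1_LINE_BITS) & (L1_SETS - 1);
}
```

A 32K or 64K cache would pull index bits from above the page offset, which is exactly why the larger-than-page alignment question in the following paragraphs arises.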
>>
>>
>> Ideally, to support 32K and (possibly) 64K L1 caches, a 64K alignment is
>> recommended. However, strict 64K alignment and/or a 64K page size is
>> undesirable as it reduces memory efficiency (mostly in terms of the
>> amount of padding space needed for mmap()/VirtualAlloc() and large
>> object "malloc()", *).
>>
>> I ended up going with 16K pages mostly as this significantly reduced TLB
>> miss rate without suffering the same adverse effects as 64K pages (and,
>> in my testing, there was very little difference between 16K and 64K in
>> terms of nominal TLB miss rate for a given size of TLB).
> <
> Conversely, I went with 8K pages to cut down on the number of layers in
> the MMU tables. My sequence is easy to remember:: 8K, 8M, 8G, 8T, 8P, 8E.
> And so are the bit positions 13, 23, 33, 43, 53, 63.
> <
> Then (before actually) I put in level skipping in the tables and put a level in
> the ROOT pointer. One can map a 8Mbyte application using 1 page of MMU
> tables.
> <
> Now, once one can skip levels in the tables and terminate the walk at any
> level, one has "large pages" drop out for free. In these large table entries
> there are unused address bits. So I used these as a limit on the number of
> pages the large PTE points at. This enables one to use an 8G entry and only
> map one 8K page (should anyone want to do something like that).

With 16K pages, I can cover 47 bits in a 3-level page table.

But, a full 96-bit space would require 8 levels.

One option could be a 5-level page table, which would cover 69 bits; or,
using hybrid tables.

It is unclear whether there is a good intermediate option (in terms of
performance) between B-Trees and nested page tables.

Though, one hybrid option is:
Lower 2 levels are page-tables (covering the low 36 bits);
Upper level(s) are B-Tree (covering the high 60 bits).

This structure could fit ~ 1023 entries per node (if 16B each), or 1365
(if 12-byte entries are used), where one needs a little bit of space
typically to hold information about the node-depth and the number of
filled entries in the node.
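The level counts above follow from simple arithmetic: a 16K page gives 14 offset bits, and with 8-byte PTEs each table level indexes 16384/8 = 2048 entries, i.e. 11 bits per level. A small sketch (names are illustrative):

```c
/* Virtual-address bits resolvable by an N-level page table,
 * assuming 16K pages (14 offset bits) and 8-byte PTEs
 * (11 index bits per level).
 */
static int coverage_bits(int levels)
{
    return 14 + 11 * levels;
}

/* Smallest level count whose coverage reaches addr_bits. */
static int levels_needed(int addr_bits)
{
    int levels = 0;
    while (coverage_bits(levels) < addr_bits)
        levels++;
    return levels;
}
```

This reproduces the figures in the text: 3 levels cover 47 bits, 5 levels cover 69 bits, and a full 96-bit space needs 8 levels.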

>>
>> Meanwhile: 8K was merely intermediate between 4K and 16K (still fairly
>> high miss rate, but lower than 4K). 32K had basically similar miss rates
>> to 16K and 64K, but worse memory overhead properties than 16K.
>>
>>
>> *: There is an issue of basically how big of objects can be handled by
>> allocating a memory block within a larger shared memory chunk, and when
>> one effectively needs to invoke a "mmap()" call to allocate it in terms
>> of pages. With 64K pages, one either needs to set this limit fairly high
>> (in turn potentially wasting memory by the heap chunks being larger than
>> ideal), or waste memory by the page-alloc cases near these transition
>> points having a significant amount of the total object size just in the
>> "wasted" memory at the end of the final page.
> <
> This is where using the unused bits in a PTE allows for limits.
>>
>> Say, if you want to malloc 67K and get 128K, this is a waste. If you get
>> 80K, this is less of a waste.
> <
> 72K is even less of a waste (first multiple of 8K larger than 67K)

Possibly, but as noted, it was more about TLB miss rate, where
increasing the page size to 16K significantly reduced TLB miss rate
(more so than would be gained by increasing the TLB size).
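The sizes traded in this exchange (a 67K request becoming 72K, 80K, or 128K) fall out of round-up-to-page arithmetic; a minimal helper for checking them:

```c
/* Round an allocation up to a whole number of pages; the difference
 * from the requested size is the tail waste being discussed above.
 * page must be a power of two.
 */
static unsigned long round_to_pages(unsigned long size, unsigned long page)
{
    return (size + page - 1) & ~(page - 1);
}
```

For a 67K request: 8K pages give 72K, 16K pages give 80K, and 64K pages give 128K, matching the numbers quoted in the thread.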

Re: Encoding saturating arithmetic

<u4773g$jpac$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=32296&group=comp.arch#32296

From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Encoding saturating arithmetic
Date: Fri, 19 May 2023 02:05:18 -0500
Message-ID: <u4773g$jpac$1@dont-email.me>
References: <tl4k53$3f0sn$1@newsreader4.netcologne.de>
<tmhtv4$3lm4g$1@dont-email.me>
<32de14fd-1bbd-4dec-8689-46395ff2fe2en@googlegroups.com>
<u45l0h$asnb$1@dont-email.me>
<14e172b0-5b31-4f53-b4b5-a3f0b7d2a3ben@googlegroups.com>
In-Reply-To: <14e172b0-5b31-4f53-b4b5-a3f0b7d2a3ben@googlegroups.com>
 by: BGB - Fri, 19 May 2023 07:05 UTC

On 5/18/2023 12:07 PM, luke.l...@gmail.com wrote:
> On Thursday, May 18, 2023 at 5:52:18 PM UTC+1, Marcus wrote:
>
>> So, first of all I am a novice when it comes to CPU and ISA design, so I
>> don't claim that I've made even close to perfect decisions... ;-)
>
> pfhh, you an me both :) only been at this 4 years.
>

My case, 7 years...

I got started at this by tinkering around with the SH-2 and SH-4 ISA
designs.

Where, tinkering with (and extending) SH-4, this eventually became BJX1,
which was partially rebooted into BJX2 (initially, cleaning up the
encoding, and dropping a few of the more troublesome ISA features).

>> In MRISC32, each packed SIMD unit is 1x32 bits, 2x16 bits or 4x8 bits,
>> for all eternity. (In MRISC64 the packed SIMD width would be doubled,
>> but that is another ISA and another story - no binary compatibility is
>> planned etc).
>

In my case:
64b: 1x64, 2x32, 4x16
128b: 1x128, 2x64, 4x32

No 8-wide vectors, and no direct packed-byte operations, ...

There are a bunch of formats that are supported exclusively using
converter ops.
4x Byte <-> 4x Int16 (several variants)
RGB555 <-> 4x Int16
3x FP10 <-> 4x Binary16
4x Fp8 <-> 4x Binary16
4x A-Law <-> 4x Binary16
...

I am mostly using a non-standard RGB555 variant:
0-rrrrr-ggggg-bbbbb    //Opaque Pixel
1-rrrr-a-gggg-a-bbbb-a //Translucent Pixel

This only gives 8 alpha levels, and alpha comes at the cost of color,
but it works.

Generally, image quality is higher than RGBA4444, since the common case
(fully opaque pixel) is encoded as full RGB555. Though, since the alpha
bits displace color bits, the alpha level does interfere with color.
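A sketch of unpacking this format to 8-bit channels, assuming the three alpha bits sit in positions 10, 5, and 0 as the layout string suggests (the helper name and channel widening by bit replication are illustrative, not from the actual implementation):

```c
#include <stdint.h>

/* Unpack the nonstandard RGB555 variant described above.
 * Bit 15 clear:  0 rrrrr ggggg bbbbb      (opaque, RGB555)
 * Bit 15 set:    1 rrrr a gggg a bbbb a   (translucent, RGB444 + 3 alpha bits)
 */
static void unpack_px(uint16_t px,
                      uint8_t *r, uint8_t *g, uint8_t *b, uint8_t *a)
{
    if (!(px & 0x8000)) {                   /* opaque RGB555 */
        uint8_t r5 = (px >> 10) & 31;
        uint8_t g5 = (px >>  5) & 31;
        uint8_t b5 =  px        & 31;
        *r = (uint8_t)((r5 << 3) | (r5 >> 2));
        *g = (uint8_t)((g5 << 3) | (g5 >> 2));
        *b = (uint8_t)((b5 << 3) | (b5 >> 2));
        *a = 255;
    } else {                                /* translucent RGB444 + A3 */
        uint8_t r4 = (px >> 11) & 15;
        uint8_t g4 = (px >>  6) & 15;
        uint8_t b4 = (px >>  1) & 15;
        uint8_t a3 = (uint8_t)((((px >> 10) & 1) << 2) |
                               (((px >>  5) & 1) << 1) |
                               ( px         & 1));
        *r = (uint8_t)((r4 << 4) | r4);
        *g = (uint8_t)((g4 << 4) | g4);
        *b = (uint8_t)((b4 << 4) | b4);
        *a = (uint8_t)((a3 << 5) | (a3 << 2) | (a3 >> 1));
    }
}
```

The translucent case shows the trade-off directly: each color channel drops from 5 bits to 4 to make room for the 3 scattered alpha bits (8 alpha levels).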

> discuss under new comp.arch thread? (please just not a reply-to
> with change-of-subject, google groups seriously creaking under
> the load). your ISA: you start it?
>
> l.

Re: Encoding saturating arithmetic

<u48itm$ouqk$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=32300&group=comp.arch#32300

From: m.delete@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: Encoding saturating arithmetic
Date: Fri, 19 May 2023 21:33:09 +0200
Message-ID: <u48itm$ouqk$1@dont-email.me>
References: <tl4k53$3f0sn$1@newsreader4.netcologne.de>
<tmhtv4$3lm4g$1@dont-email.me>
<32de14fd-1bbd-4dec-8689-46395ff2fe2en@googlegroups.com>
<u45l0h$asnb$1@dont-email.me>
<37b7d936-4ee2-4939-93cd-347d9427f773n@googlegroups.com>
In-Reply-To: <37b7d936-4ee2-4939-93cd-347d9427f773n@googlegroups.com>
 by: Marcus - Fri, 19 May 2023 19:33 UTC

On 2023-05-18, MitchAlsup wrote:
> On Thursday, May 18, 2023 at 11:52:18 AM UTC-5, Marcus wrote:
>> On 2023-05-16, luke.l...@gmail.com wrote:
>>> On Sunday, December 4, 2022 at 10:49:11 AM UTC, Marcus wrote:
>>
>>
>> So, first of all I am a novice when it comes to CPU and ISA design, so I
>> don't claim that I've made even close to perfect decisions... ;-)
>>
>> With that said, the reasoning roughly went like this:
>>
>> I wanted a way to *easily* saturate the memory interface (i.e. use it
>> to its full potential) when working with byte-sized elements. With
> <
> A good starting point.
> <
>> vector operations, I could only utilize the full memory bandwidth when
>> all 32 bits of the vector elements are loaded/stored. With byte-sized
>> vector load/store I only got 1/4th of the bandwidth, and I could not
>> figure out a simple way to quadruple vector register file write/read
>> traffic when doing byte-sized loads/stores.
> <
> This is where multiple lanes are used to consume more bandwidth
> when you know the memory reference pattern is "dense".
> <
> The vector alternative is to use gather/scatter memory references
> and perform multiple AGENs per cycle--a much more costly alternative.
>>
>> I also realized that since I had a scalable solution for implementation
>> defined vector register sizes, there would be little harm in fixating
>> the "packed SIMD width" to 32 bits (unlike traditional packed SIMD
> solutions where you need to alter the ISA and change the
> SIMD/register width every time you wish to increase parallelism).
>>
>> In MRISC32, each packed SIMD unit is 1x32 bits, 2x16 bits or 4x8 bits,
>> for all eternity. (In MRISC64 the packed SIMD width would be doubled,
>> but that is another ISA and another story - no binary compatibility is
>> planned etc).
> <
> Like you, I prefer that SIMD-style calculations use natural register
> widths. Unlike you, I left SIMD out of my ISA and found (what I consider)
> a better alternative than {calculations}×{widths}×{special-properties}
> that accompany SIMD. Since memory references come with {widths}
> (and signed, unsigned semantics), and calculations are self describing,
> AND you have predication in the ISA, then synthesizing SIMD using VVM
> is actually straightforward.
> <
> This, then, gives the HW freedom to implement the SIMD width appropriate
> for that implementation, and preserve code-compatibility across all SIMD
> widths and across all implementations.
> <
> It also eliminates 1280 = ({16}×{4})×{4}×{5} instructions from ISA. (More
> if you support 8-bit and 16-bit FP in SIMD.)

It's also a question of what you define as "one instruction".

When I laid out the instruction encoding puzzle I ended up having 2+2
bits left in the 32-bit wide instruction format. These turned into a
"Vector Mode" field ("V") + a "Data Type" field ("T").

The V field is consumed fairly early in the pipeline (during decode, to
properly address the RF etc), while the T field is effectively an
additional argument that is passed on to the EU:s, and each individual
EU interprets what to do with it.

For most arithmetic operations, the T-field is interpreted as:

* 00 - 1x32 bits (regular scalar, no packed SIMD)
* 01 - 4x8 bits (packed SIMD, "byte")
* 10 - 2x16 bits (packed SIMD, "half-word")
* 11 - Reserved (will come to use in MRISC64)

For load/store instructions the T-field is used for the index scaling
factor (*1, *2, *4 and *8).

For bitwise operations (xor, or, and) the T-field is used as a couple of
operand binary negation flags (e.g. R2 = ~R5 | R6), similar to how
My 66000 supports arithmetic operand negation AFAICT.

...and then there are a few more specialized interpretations of the T
field.
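The per-EU interpretations described above can be sketched as two small decode helpers in C (the function and type names here are illustrative, not taken from the MRISC32 manual):

```c
/* Sketch of how individual execution units might interpret the
 * 2-bit T field, following the tables above.
 */
typedef enum { T_00 = 0, T_01 = 1, T_10 = 2, T_11 = 3 } tfield;

/* Arithmetic EUs: T selects the packed-SIMD lane count. */
static int simd_lanes(tfield t)
{
    switch (t) {
    case T_00: return 1;    /* 1x32 bits, regular scalar */
    case T_01: return 4;    /* 4x8 bits, packed "byte" */
    case T_10: return 2;    /* 2x16 bits, packed "half-word" */
    default:   return -1;   /* reserved (MRISC64) */
    }
}

/* Load/store: the same two bits select the index scaling factor. */
static int index_scale(tfield t)
{
    return 1 << (int)t;     /* *1, *2, *4, *8 */
}
```

The point of the design is visible here: the same two bits cost nothing extra in the encoding, and each EU gives them its own meaning.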

Anyway, so what I have is almost a 16x fold of instructions, but should
they be counted as individual instructions, or should they be counted as
variants? I choose to see it as the latter (I even count immediate
operand versions of an instruction as a variant rather than a separate
instruction, unlike some other RISC ISA:s). E.g. the "ADD" instruction
has the following 14 variants (see chapter 7.3 in the ISA manual,
https://mrisc32.bitsnbites.eu/doc/mrisc32-instruction-set-manual.pdf):

ADD Ra,Rb,Rc
ADD.B Ra,Rb,Rc
ADD.H Ra,Rb,Rc
ADD Va,Vb,Rc
ADD.B Va,Vb,Rc
ADD.H Va,Vb,Rc
ADD Va,Vb,Vc
ADD.B Va,Vb,Vc
ADD.H Va,Vb,Vc
ADD/F Va,Vb,Vc
ADD.B/F Va,Vb,Vc
ADD.H/F Va,Vb,Vc
ADD Ra,Rb,#imm
ADD Va,Vb,#imm

R-operands are scalar registers
.B/.H means packed BYTE / HALF-WORD
V-operands are vector registers
/F means "fold" (combine upper/lower halves of two source operands)
#imm is a 14+1-bit immediate (the 1 extra bit is a hi/lo shift flag)

> <
> Sooner or later the R in RISC should stand for "reduced".
> It is my contention that any ISA with more than 200-ish instructions
> ceases to be RISC.
> <
> I don't know of an architecture with SIMD instructions that fits under 200
> total instructions.

If I count as described above, I currently have 103 instructions
(including floating-point, SIMD packing/unpacking, saturating arithmetic
etc).
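
As an illustration of the kind of packed saturating operation counted
here, a 4x8-bit unsigned saturating add can be sketched as follows (not
actual MRISC32 code; the name is invented):

```c
#include <stdint.h>

/* 4x8-bit unsigned saturating add: each byte lane clamps at 255
 * instead of wrapping around. */
static uint32_t padd_b_saturating(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;
        uint32_t y = (b >> (8 * lane)) & 0xFFu;
        uint32_t s = x + y;
        if (s > 0xFFu)
            s = 0xFFu;          /* saturate rather than wrap */
        r |= s << (8 * lane);
    }
    return r;
}
```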

...if I count all the variants as separate instructions, it's just north
of 1000 instructions.

Your pick.

>>
>> I have also noticed that the uint8x4_t type (i.e. vec4<byte>) can be
>> quite useful when working with ARGB, for instance, and it's also quite
>> convenient to be able to perform packed SIMD on both vector and scalar
>> registers.
>>
>> Now, I am not 100% happy with the solution, but at least it's much nicer
>> to work with than ISA:s such as SSE or NEON, and it's much more future
>> proof.
>>
>> /Marcus

Re: Encoding saturating arithmetic

<1a3f40b3-a344-4914-82ea-7ec858311e6en@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=32301&group=comp.arch#32301
Newsgroups: comp.arch
From: MitchAlsup@aol.com (MitchAlsup)
Date: Fri, 19 May 2023 21:11 UTC

On Friday, May 19, 2023 at 2:33:14 PM UTC-5, Marcus wrote:
> On 2023-05-18, MitchAlsup wrote:
> > On Thursday, May 18, 2023 at 11:52:18 AM UTC-5, Marcus wrote:
> >> On 2023-05-16, luke.l...@gmail.com wrote:
> >>> On Sunday, December 4, 2022 at 10:49:11 AM UTC, Marcus wrote:
> >>
> >>
> >> So, first of all I am a novice when it comes to CPU and ISA design, so I
> >> don't claim that I've made even close to perfect decisions... ;-)
> >>
> >> With that said, the reasoning roughly went like this:
> >>
> >> I wanted a way to *easily* saturate the memory interface (i.e. use it
> >> to its full potential) when working with byte-sized elements. With
> > <
> > A good starting point.
> > <
> >> vector operations, I could only utilize the full memory bandwidth when
> >> all 32 bits of the vector elements are loaded/stored. With byte-sized
> >> vector load/store I only got 1/4th of the bandwidth, and I could not
> >> figure out a simple way to quadruple vector register file write/read
> >> traffic when doing byte-sized loads/stores.
> > <
> > This is where multiple lanes are used to consume more bandwidth
> > when you know the memory reference pattern is "dense".
> > <
> > The vector alternative is to use gather/scatter memory references
> > and perform multiple AGENs per cycle--a much more costly alternative.
> >>
> >> I also realized that since I had a scalable solution for implementation
> >> defined vector register sizes, there would be little harm in fixating
> >> the "packed SIMD width" to 32 bits (unlike traditional packed SIMD
> >> solutions where you need to alter the ISA and change the register
> >> SIMD/register width every time you wish to increase parallelism).
> >>
> >> In MRISC32, each packed SIMD unit is 1x32 bits, 2x16 bits or 4x8 bits,
> >> for all eternity. (In MRISC64 the packed SIMD width would be doubled,
> >> but that is another ISA and another story - no binary compatibility is
> >> planned etc).
> > <
> > Like you, I prefer that SIMD-style calculations use natural register
> > widths. Unlike you, I left SIMD out of my ISA and found (what I consider)
> > a better alternative than {calculations}×{widths}×{special-properties}
> > that accompany SIMD. Since memory references come with {widths}
> > (and signed, unsigned semantics), and calculations are self describing,
> > AND you have predication in the ISA, then synthesizing SIMD using VVM
> > is actually straightforward.
> > <
> > This, then, gives the HW freedom to implement the SIMD width appropriate
> > for that implementation, and preserve code-compatibility across all SIMD
> > widths and across all implementations.
> > <
> > It also eliminates 1280 = ({16}×{4})×{4}×{5} instructions from the ISA. (More
> > if you support 8-bit and 16-bit FP in SIMD.)
<
> It's also a question of what you define as "one instruction".
<
Loosely:: it is a spelling in assembly language. (see below)
>
> When I laid out the instruction encoding puzzle I ended up having 2+2
> bits left in the 32-bit wide instruction format. These turned into a
> "Vector Mode" field ("V") + a "Data Type" field ("T").
>
> The V field is consumed fairly early in the pipeline (during decode, to
> properly address the RF etc), while the T field is effectively an
> additional argument that is passed on to the EU:s, and each individual
> EU interprets what to do with it.
>
> For most arithmetic operations, the T-field is interpreted as:
>
> * 00 - 1x32 bits (regular scalar, no packed SIMD)
> * 01 - 4x8 bits (packed SIMD, "byte")
> * 10 - 2x16 bits (packed SIMD, "half-word")
> * 11 - Reserved (will be put to use in MRISC64)
>
> For load/store instructions the T-field is used for the index scaling
> factor (*1, *2, *4 and *8).
>
> For bitwise operations (xor, or, and) the T-field is used as a couple of
> operand binary negation flags (e.g. R2 = ~R5 | R6), similar to how
> My 66000 supports arithmetic operand negation AFAICT.
>
> ...and then there are a few more specialized interpretations of the T
> field.
>
> Anyway, what I have is almost a 16-fold multiplication of the instruction
> count, but should these be counted as individual instructions, or as
> variants? I choose to see it as the latter (I even count immediate
> operand versions of an instruction as a variant rather than a separate
> instruction, unlike some other RISC ISA:s). E.g. the "ADD" instruction
> has the following 14 variants (see chapter 7.3 in the ISA manual,
> https://mrisc32.bitsnbites.eu/doc/mrisc32-instruction-set-manual.pdf):
>
> ADD Ra,Rb,Rc
> ADD.B Ra,Rb,Rc
> ADD.H Ra,Rb,Rc
> ADD Va,Vb,Rc
> ADD.B Va,Vb,Rc
> ADD.H Va,Vb,Rc
> ADD Va,Vb,Vc
> ADD.B Va,Vb,Vc
> ADD.H Va,Vb,Vc
> ADD/F Va,Vb,Vc
> ADD.B/F Va,Vb,Vc
> ADD.H/F Va,Vb,Vc
> ADD Ra,Rb,#imm
> ADD Va,Vb,#imm
<
My 66000 ADD has an IMM16 format and a 2-operand format.
<
The 2-operand format can encode 32 individual variations: sign control,
substitution of a {32-bit or 64-bit} immediate in the Rs1 or Rs2
position, substitution of the 5-bit register field by a 5-bit immediate,
and arithmetic family {{signed, unsigned}, {float, double}}.
<
I count these as 2 instructions since there are 2 OpCodes.
{1 Major OpCode (6-bit) holding the IMM16 field, one minor
OpCode (6-bit) that uses a 5-bit field to denote variations
of sign control, immediates and their position, and family.}
<
That is, I do not count instruction modifiers as instructions.
{But I have to be really careful as instruction-modifier is
an instruction. The hyphen is the visible difference.}
>
> R-operands are scalar registers
> .B/.H means packed BYTE / HALF-WORD
> V-operands are vector registers
> /F means "fold" (combine upper/lower halves of two source operands)
> #imm is a 14+1-bit immediate (the 1 extra bit is a hi/lo shift flag)
> > <
> > Sooner or later the R in RISC should stand for "reduced".
> > It is my contention that any ISA with more than 200-ish instructions
> > ceases to be RISC.
> > <
> > I don't know of an architecture with SIMD instructions that fits under 200
> > total instructions.
<
> If I count as described above, I currently have 103 instructions
> (including floating-point, SIMD packing/unpacking, saturating arithmetic
> etc).
<
I will grant you a "well done" at this point.
>
> ...if I count all the variants as separate instructions, it's just north
> of 1000 instructions.
<
>
> Your pick.
<
The real question is how many individual units of work does it take
to encode an application, and how many units of work does it take
to execute an application. {where "individual unit of work" corresponds
to the "it's an instruction" determination the DECODE stage of the pipeline makes.}
<
My 66000 is currently running 75% average, 70% geomean, 69% harmonic
mean of RISC-V in the encode and execute metrics from the LLVM compiler
(same flags, optimizations, etc.).
<
> >>
> >> I have also noticed that the uint8x4_t type (i.e. vec4<byte>) can be
> >> quite useful when working with ARGB, for instance, and it's also quite
> >> convenient to be able to perform packed SIMD on both vector and scalar
> >> registers.
> >>
> >> Now, I am not 100% happy with the solution, but at least it's much nicer
> >> to work with than ISA:s such as SSE or NEON, and it's much more future
> >> proof.
> >>
> >> /Marcus

Re: Encoding saturating arithmetic

<a7a7dc88-d60f-4e9f-8478-da4a0ccb27bbn@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=32302&group=comp.arch#32302
Newsgroups: comp.arch
From: MitchAlsup@aol.com (MitchAlsup)
Date: Fri, 19 May 2023 23:33 UTC

On Friday, May 19, 2023 at 4:11:57 PM UTC-5, MitchAlsup wrote:
> On Friday, May 19, 2023 at 2:33:14 PM UTC-5, Marcus wrote:
>
> > Your pick.
> <
> The real question is how many individual units of work does it take
> to encode an application, and how many units of work does it take
> to execute an application. {where "individual unit of work" corresponds
> to the "it's an instruction" determination the DECODE stage of the pipeline makes.}
> <
> My 66000 is currently running 75% average, 70% geomean, 69% harmonic
> mean of RISC-V in the encode and execute metrics from the LLVM compiler
> (same flags, optimizations, etc.).
Adding::
<
In My 66000 ISA there are 5-bits that are used by the path from the RF to
the operand flip-flops in the Function unit--that is, these bits determine how
instruction operands arrive at the function unit--and are independent of what
function unit does the calculation, and are independent of what calculation
transpires. I do not consider these 5-bits to "denote" instructions but to denote
the route the operands take to arrive at their calculation.

Re: Encoding saturating arithmetic

<u49plu$10gi6$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=32303&group=comp.arch#32303
Newsgroups: comp.arch
From: ggtgp@yahoo.com (Brett)
Date: Sat, 20 May 2023 06:34 UTC

BGB <cr88192@gmail.com> wrote:
> On 5/18/2023 12:07 PM, luke.l...@gmail.com wrote:
>> On Thursday, May 18, 2023 at 5:52:18 PM UTC+1, Marcus wrote:
>>
>>> So, first of all I am a novice when it comes to CPU and ISA design, so I
>>> don't claim that I've made even close to perfect decisions... ;-)
>>
>> pfhh, you and me both :) only been at this 4 years.
>>
>
> My case, 7 years...
>
> I got started at this by tinkering around with the SH-2 and SH-4 ISA
> designs.
>
> Where, tinkering with (and extending) SH-4, this eventually became BJX1,
> which was partially rebooted into BJX2 (initially, cleaning up the
> encoding, and dropping a few of the more troublesome ISA features).
>
>
>>> In MRISC32, each packed SIMD unit is 1x32 bits, 2x16 bits or 4x8 bits,
>>> for all eternity. (In MRISC64 the packed SIMD width would be doubled,
>>> but that is another ISA and another story - no binary compatibility is
>>> planned etc).
>>
>
> In my case:
> 64b: 1x64, 2x32, 4x16
> 128b: 1x128, 2x64, 4x32
>
> No 8-wide vectors, and no direct packed-byte operations, ...
>
>
> There are a bunch of formats that are supported exclusively using
> converter ops.
> 4x Byte <-> 4x Int16 (several variants)
> RGB555 <-> 4x Int16
> 3x FP10 <-> 4x Binary16
> 4x Fp8 <-> 4x Binary16
> 4x A-Law <-> 4x Binary16
> ...
>
>
> I am mostly using a non-standard RGB555 variant:
> 0rrrrrgg-gggb-bbbb //Opaque Pixel
> 1rrrragg-ggab-bbba //Translucent Pixel
>
> This only gives 8 alpha levels, and alpha comes at the cost of color,
> but it works.
>
> Generally, image quality is higher than RGBA4444, since the common case
> (fully opaque pixel) is encoded as full RGB555. Though, for reasons,
> alpha level does interfere with color.

Alpha does not interfere with color unless you are a bad artist who uses
black as the alpha color, which causes a black fringe on foliage.

Solved by using an alpha mask and spreading the color past the edge and into
the alpha mask area.

>> discuss under new comp.arch thread? (please just not a reply-to
>> with change-of-subject, google groups seriously creaking under
>> the load). your ISA: you start it?
>>
>> l.
>
>

Re: Encoding saturating arithmetic

<u4a0jt$1167m$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=32304&group=comp.arch#32304
Newsgroups: comp.arch
From: cr88192@gmail.com (BGB)
Date: Sat, 20 May 2023 08:32 UTC

On 5/20/2023 1:34 AM, Brett wrote:
> BGB <cr88192@gmail.com> wrote:
>> On 5/18/2023 12:07 PM, luke.l...@gmail.com wrote:
>>> On Thursday, May 18, 2023 at 5:52:18 PM UTC+1, Marcus wrote:
>>>
>>>> So, first of all I am a novice when it comes to CPU and ISA design, so I
>>>> don't claim that I've made even close to perfect decisions... ;-)
>>>
>>> pfhh, you and me both :) only been at this 4 years.
>>>
>>
>> My case, 7 years...
>>
>> I got started at this by tinkering around with the SH-2 and SH-4 ISA
>> designs.
>>
>> Where, tinkering with (and extending) SH-4, this eventually became BJX1,
>> which was partially rebooted into BJX2 (initially, cleaning up the
>> encoding, and dropping a few of the more troublesome ISA features).
>>
>>
>>>> In MRISC32, each packed SIMD unit is 1x32 bits, 2x16 bits or 4x8 bits,
>>>> for all eternity. (In MRISC64 the packed SIMD width would be doubled,
>>>> but that is another ISA and another story - no binary compatibility is
>>>> planned etc).
>>>
>>
>> In my case:
>> 64b: 1x64, 2x32, 4x16
>> 128b: 1x128, 2x64, 4x32
>>
>> No 8-wide vectors, and no direct packed-byte operations, ...
>>
>>
>> There are a bunch of formats that are supported exclusively using
>> converter ops.
>> 4x Byte <-> 4x Int16 (several variants)
>> RGB555 <-> 4x Int16
>> 3x FP10 <-> 4x Binary16
>> 4x Fp8 <-> 4x Binary16
>> 4x A-Law <-> 4x Binary16
>> ...
>>
>>
>> I am mostly using a non-standard RGB555 variant:
>> 0rrrrrgg-gggb-bbbb //Opaque Pixel
>> 1rrrragg-ggab-bbba //Translucent Pixel
>>
>> This only gives 8 alpha levels, and alpha comes at the cost of color,
>> but it works.
>>
>> Generally, image quality is higher than RGBA4444, since the common case
>> (fully opaque pixel) is encoded as full RGB555. Though, for reasons,
>> alpha level does interfere with color.
>
> Alpha does not interfere with color unless you are a bad artist who uses
> black as the alpha color, which causes a black fringe on foliage.
>

This is on the BJX2 Core, not on a GPU...

> Solved by using an alpha mask and spreading the color past the edge and into
> the alpha mask area.
>

It does if one is using RGB555 and the same bits are reused for both
Alpha and the LSB of each color component.

But, this was a tradeoff...

The main alternative would be to repeat the high-bit...

The unpacking scheme from 5 to 8 bits being:
ABCDE -> ABCDEABC
Or, to 16-bit components:
ABCDEABC-ABCDEABC
Or, alternately:
ABCDEABC-10000000

An alternate scheme would have been, if Bit 15 is set:
ABCDE -> ABCDAABC
But, there were tradeoffs either way, and always unpacking the RGB
values as in "normal" RGB555 seemed like the "less bad" option for the
RGB ops (avoids penalizing using these ops for "normal" RGB555).
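
The ABCDE -> ABCDEABC expansion above can be sketched as follows (the
function name is invented for illustration):

```c
#include <stdint.h>

/* Expand a 5-bit color component to 8 bits by replicating the top
 * three bits into the low bits: ABCDE -> ABCDEABC. */
static uint8_t expand5to8(uint8_t c5)
{
    c5 &= 0x1F;
    return (uint8_t)((c5 << 3) | (c5 >> 2));
}
```

This replication maps 0 to 0 and 31 to 255 exactly, so full black and
full white survive the round trip.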

But, as noted, it was "generally better" in average-case use than using
RGBA with 4-bit components; since images with opaque pixels are more
common than ones with translucent pixels, and generally for translucent
pixels one doesn't care quite as much about the color.

32-bit RGBA is mostly not used, as it eats twice the memory bandwidth;
both uncompressed textures and the framebuffer use RGB555, along with a
16-bit Z-Buffer, etc.

A related scheme is used for UTX2, which can mimic both DXT1 and DXT5
behavior, with several modes:
00: Opaque, 2-bit linear interpolation
00=ColorB, 01=5/8*A+3/8*B, 10=3/8*A+5/8*B, 11=ColorA
01: 1-bit color selector, 1-bit alpha selector
10: Mimic DXT1's transparent mode:
00=ColorB, 01=1/2*A+1/2*B, 10=Transparent, 11=ColorA
11: Translucent, 2-bit linear interpolation
Alpha is encoded as in RGB555A.
Alpha is interpolated along with the color.
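
The mode-00 blend above (ColorA weights of 0, 5/8, 3/8, and 1 for
selectors 00..11) can be sketched per 8-bit component as follows (the
function name and the integer rounding are invented for illustration):

```c
#include <stdint.h>

/* Mode-00 interpolation: a 2-bit selector picks one of four fixed
 * blends of the two endpoint colors, per component. */
static uint8_t utx2_lerp2(uint8_t a, uint8_t b, uint32_t sel)
{
    static const uint32_t wa[4] = { 0, 5, 3, 8 };  /* weight of A, in eighths */
    uint32_t w = wa[sel & 3];
    return (uint8_t)((a * w + b * (8 - w)) / 8);
}
```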

There are decoder ops for texture-fetch, and some helper ops for the
encoding process.

There is also UTX1, which uses 32-bit blocks. It didn't end up being used
as much, as it only encodes opaque textures and has worse quality.

And, UTX3, which is effectively (128-bit blocks):
2x 32 bits: RGBA endpoints as FP8 microfloats.
32-bits: RGB interpolation
32-bits: Alpha Interpolation

UTX3 being a little funky in that it interpolates the FP8 values and then
unpacks them to Binary16 (so the interpolation space is non-linear).

Some of this being because hardware decoders for the "standard" formats
would have been more expensive, and it is easy enough to transcode DXT1
or DXT5 to UTX2 on texture upload (with UTX3 taking the role of BC6H or
BC7).

>>> discuss under new comp.arch thread? (please just not a reply-to
>>> with change-of-subject, google groups seriously creaking under
>>> the load). your ISA: you start it?
>>>
>>> l.
>>
>>
>
>
>
