Rocksolid Light



devel / comp.arch / Re: Encoding saturating arithmetic

Subject  Author
* Re: Encoding saturating arithmetic  luke.l...@gmail.com
+* Re: Encoding saturating arithmetic  BGB
|+* Re: Encoding saturating arithmetic  MitchAlsup
||`- Re: Encoding saturating arithmetic  BGB
|`* Re: Encoding saturating arithmetic  luke.l...@gmail.com
| `* Re: Encoding saturating arithmetic  BGB
|  +* Re: Encoding saturating arithmetic  luke.l...@gmail.com
|  |`* Re: Encoding saturating arithmetic  MitchAlsup
|  | `* Re: Encoding saturating arithmetic  luke.l...@gmail.com
|  |  `* Re: Encoding saturating arithmetic  MitchAlsup
|  |   `- Re: Encoding saturating arithmetic  luke.l...@gmail.com
|  `* Re: Encoding saturating arithmetic  MitchAlsup
|   `* Re: Encoding saturating arithmetic  BGB
|    +* Re: Encoding saturating arithmetic  robf...@gmail.com
|    |+* Re: Encoding saturating arithmetic  luke.l...@gmail.com
|    ||`- Re: Encoding saturating arithmetic  Marcus
|    |`* Re: Encoding saturating arithmetic  BGB
|    | `* Re: Encoding saturating arithmetic  luke.l...@gmail.com
|    |  `- Re: Encoding saturating arithmetic  BGB
|    +* Re: Encoding saturating arithmetic  luke.l...@gmail.com
|    |`* Re: Encoding saturating arithmetic  BGB
|    | `- Re: Encoding saturating arithmetic  MitchAlsup
|    +* Re: Encoding saturating arithmetic  Scott Lurndal
|    |`* Re: Encoding saturating arithmetic  BGB
|    | `* Re: Encoding saturating arithmetic  MitchAlsup
|    |  `- Re: Encoding saturating arithmetic  BGB
|    `- Re: Encoding saturating arithmetic  MitchAlsup
`* Re: Encoding saturating arithmetic  Marcus
 +* Re: Encoding saturating arithmetic  luke.l...@gmail.com
 |+- Re: Encoding saturating arithmetic  Marcus
 |+- Re: Encoding saturating arithmetic  MitchAlsup
 |`* Re: Encoding saturating arithmetic  BGB
 | `* Re: Encoding saturating arithmetic  Brett
 |  `- Re: Encoding saturating arithmetic  BGB
 `* Re: Encoding saturating arithmetic  MitchAlsup
  +- Re: Encoding saturating arithmetic  luke.l...@gmail.com
  `* Re: Encoding saturating arithmetic  Marcus
   `* Re: Encoding saturating arithmetic  MitchAlsup
    `- Re: Encoding saturating arithmetic  MitchAlsup

Re: Encoding saturating arithmetic

<32de14fd-1bbd-4dec-8689-46395ff2fe2en@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=32224&group=comp.arch#32224

Newsgroups: comp.arch
Date: Mon, 15 May 2023 18:46:31 -0700 (PDT)
References: <tl4k53$3f0sn$1@newsreader4.netcologne.de> <tmhtv4$3lm4g$1@dont-email.me>
Message-ID: <32de14fd-1bbd-4dec-8689-46395ff2fe2en@googlegroups.com>
Subject: Re: Encoding saturating arithmetic
From: luke.leighton@gmail.com (luke.l...@gmail.com)
 by: luke.l...@gmail.com - Tue, 16 May 2023 01:46 UTC

On Sunday, December 4, 2022 at 10:49:11 AM UTC, Marcus wrote:

> MRISC32 has saturating arithmetic:
>
> https://github.com/mrisc32/mrisc32/releases/latest/download/mrisc32-instruction-set-manual.pdf

chapter 7. after seeing how out of control this can get
in the AndesSTAR DSP ISA i always feel uneasy whenever
i see explicit saturation opcodes added to an ISA that
only has 32-bit available for instruction format.

> The way I solved the size issue (8, 16, 32 bits in my case) is that I
> have dedicated two bits of the instruction word for specifying the size.
> See section 1.4 "Instruction encoding" (the "T" field).

like that. room to expand to 64 later (even in a 32-bit ISA)

> This implies packed SIMD in a 32-bit register (or a 32-bit element of a
> vector register), as specified in chapter 4.

i *really* don't understand why you would add fantastic Vector
capability then irrevocably damage the ISA by adding PackedSIMD.
if it was vec2/3/4 on *top* of the Vector capability (vec3 being
the really important one as far as 3D is concerned) i would get it.

PackedSIMD only works successfully where the data encountered
is *exactly* matched to the ISA. vec2 for Left and Right Audio.
vec3 for RGB. vec4 for ARGB or Quaternions XYZW.

l.

Re: Encoding saturating arithmetic

<u40gpi$3i77i$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=32237&group=comp.arch#32237

Newsgroups: comp.arch
From: cr88192@gmail.com (BGB)
Subject: Re: Encoding saturating arithmetic
Date: Tue, 16 May 2023 13:07:44 -0500
Message-ID: <u40gpi$3i77i$1@dont-email.me>
References: <tl4k53$3f0sn$1@newsreader4.netcologne.de>
<tmhtv4$3lm4g$1@dont-email.me>
<32de14fd-1bbd-4dec-8689-46395ff2fe2en@googlegroups.com>
 by: BGB - Tue, 16 May 2023 18:07 UTC

On 5/15/2023 8:46 PM, luke.l...@gmail.com wrote:
> On Sunday, December 4, 2022 at 10:49:11 AM UTC, Marcus wrote:
>
>> MRISC32 has saturating arithmetic:
>>
>> https://github.com/mrisc32/mrisc32/releases/latest/download/mrisc32-instruction-set-manual.pdf
>
> chapter 7. after seeing how out of control this can get
> in the AndesSTAR DSP ISA i always feel uneasy whenever
> i see explicit saturation opcodes added to an ISA that
> only has 32-bit available for instruction format.
>

I can note that I still don't have any dedicated saturating ops, but
this is partly for cost and timing concerns (and I haven't yet
encountered a case where I "strongly needed" saturating ops).

>
>> The way I solved the size issue (8, 16, 32 bits in my case) is that I
>> have dedicated two bits of the instruction word for specifying the size.
>> See section 1.4 "Instruction encoding" (the "T" field).
>
> like that. room to expand to 64 later (even in a 32-bit ISA)
>
>> This implies packed SIMD in a 32-bit register (or a 32-bit element of a
>> vector register), as specified in chapter 4.
>
> i *really* don't understand why you would add fantastic Vector
> capability then irrevocably damage the ISA by adding PackedSIMD.
> if it was vec2/3/4 on *top* of the Vector capability (vec3 being
> the really important one as far as 3D is concerned) i would get it.
>

In my case, I only had SIMD.

Things like (non SIMD) vector ops seemed too complicated and like a
worse fit to most use-cases, whereas SIMD was straightforward and
reasonably cost-effective to implement.

Or, at least it is in a simple form; when one wants "fast-ish" FP-SIMD
(say, 4-wide Binary32 with a 3L/1T timing, vs. say 10 cycles), it gets
more expensive...

> PackedSIMD only works successfully where the data encountered
> is *exactly* matched to the ISA. vec2 for Left and Right Audio.
> vec3 for RGB. vec4 for ARGB or Quaternions XYZW.
>

2/3/4 wide vectors are *very* common.
Wider vectors are much less common.

This is partly why I have thus far not bothered with wider vectors, and
also why BGBCC's SIMD vector extensions were partly modeled after GLSL.

Emphasis is more on trying to fit stuff that already used 2/3/4 wide
vectors to them (but would have otherwise needed to be expressed with
scalar operations), rather than trying to shoe-horn scalar loops into
vectors (seemingly the assumption with many vector and SIMD ISA designs).

Things like RGBA vectors are already a potentially fairly large use-case
(though TKRA-GL was internally mostly written around fixed-point SIMD,
as a fast Binary16 FP-SIMD unit didn't exist when development started).
If I were doing it now, I might be tempted to make more use of
floating-point vectors (and possibly also perspective-correct texturing).

( Though, this still doesn't go as far as me having any idea how to make
GLSL fragment shaders viable in terms of performance... ).

Partly this was because the performance balance ended up being skewed
differently than originally imagined. I had originally assumed that
span-drawing would be the main bottleneck, rather than the frontend
application and geometry transform.

The dynamic tessellation, meanwhile, allows for cheaper span-drawing at
the cost of a more expensive transform stage.

Early on, I also didn't expect that programs like Quake would be
spending quite as much of the total CPU time in the BSP walk and
similar. Granted, I have a possible workaround (namely building
triangle/quad lists and only rebuilding the list when the PVS changes),
but haven't gotten around to this (this would be a non-trivial change to
GLQuake; well, more so than some of the stuff that has been done
already; *1).

Some tweaks, like moving some of the "mathlib" functions over to SIMD,
have already been done, but heavy use of "float *" and function calls
limits how much one can do here (and a more significant rewrite of the
engine would make it "not really Quake anymore").

....

*1: This port having already been moved over to vertex-lighting rather
than lightmaps (could be better), the use of DDS textures, a sort of
"poor man's LOD" (alias models switch to using a low-res sprite of the
model after a certain distance), ... In my case, this means additional
non-standard "pak's", which is potentially an issue (I can't really
distribute them legally, but the port won't really work as-intended
otherwise).

Also a whole lot of extra complexities pop up the moment one even dares
look at 8-wide SIMD vectors, like some design gremlin lies in wait to
pop out and be like, "So, you have considered one of the forbidden
features...".

Re: Encoding saturating arithmetic

<2212bd32-ef71-4201-8182-530abcbcbbcan@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=32238&group=comp.arch#32238

Newsgroups: comp.arch
Date: Tue, 16 May 2023 14:15:30 -0700 (PDT)
References: <tl4k53$3f0sn$1@newsreader4.netcologne.de> <tmhtv4$3lm4g$1@dont-email.me>
<32de14fd-1bbd-4dec-8689-46395ff2fe2en@googlegroups.com> <u40gpi$3i77i$1@dont-email.me>
Message-ID: <2212bd32-ef71-4201-8182-530abcbcbbcan@googlegroups.com>
Subject: Re: Encoding saturating arithmetic
From: MitchAlsup@aol.com (MitchAlsup)
 by: MitchAlsup - Tue, 16 May 2023 21:15 UTC

On Tuesday, May 16, 2023 at 1:09:05 PM UTC-5, BGB wrote:
> On 5/15/2023 8:46 PM, luke.l...@gmail.com wrote:

> >
> > i *really* don't understand why you would add fantastic Vector
> > capability then irrevocably damage the ISA by adding PackedSIMD.
> > if it was vec2/3/4 on *top* of the Vector capability (vec3 being
> > the really important one as far as 3D is concerned) i would get it.
> >
> In my case, I only had SIMD.
>
> Things like (non SIMD) vector ops seemed too complicated and like a
> worse fit to most use-cases, whereas SIMD was straightforward and
> reasonably cost-effective to implement.
<
A CRAY 1 is merely a DMA device that happens to mangle the data on the
way through. A CRAY YMP is merely a 4-wide DMA device that happens to
mangle the data on the way through.
>
> Or, at least in a simple form, when one wants "fast-ish" FP-SIMD (say,
> 4-wide Binary32 with a 3L/1T timing; vs, say 10 cycles), it gets more
> expensive...
<
> > PackedSIMD only works successfully where the data encountered
> > is *exactly* matched to the ISA. vec2 for Left and Right Audio.
> > vec3 for RGB. vec4 for ARGB or Quaternions XYZW.
<
Technically {XYZW} is not a quaternion but 3 linear directions and a
weight W = sqrt(X^2 + Y^2 + Z^2). A quaternion has 1 real dimension
and 3 imaginary dimensions. Physicists use quaternions when they
don't want to worry about sign control in the arithmetic (i*j = k and
j*i = -k).
<
> >
> 2/3/4 wide vectors are *very* common.
> Wider vectors are much less common.
<
Perhaps you should look at BLAS, not a short vector to be found;
or any of the other algebra/physics support packages.
>
> This is partly why I have thus far not bothered with wider vectors, and
> also why BGBCC's SIMD vector extensions were partly modeled after GLSL.
<
The converse is also true:: My 66000 does not need SIMD because one can
synthesize SIMD with VVM and also long vectors with VVM. However, VVM
does not require the programmer or compiler to solve hard to do memory
aliasing problems that SIMD and long vectors CRAY-style do.
>
> Emphasis is more on trying to fit stuff that already used 2/3/4 wide
> vectors to them (but would have otherwise needed to be expressed with
> scalar operations), rather than trying to shoe-horn scalar loops into
> vectors (seemingly the assumption with many vector and SIMD ISA designs).
>
> Things like RGBA vectors are already a potentially fairly large use-case
> (though TKRA-GL was internally mostly written around fixed-point SIMD,
> as a fast Binary16 FP-SIMD unit didn't exist when development started).
> If I were doing it now, I might be tempted to make more use floating
> point vectors (and possibly also perspective-correct texturing).
>
> ( Though, this still doesn't go as far as me having any idea how to make
> GLSL fragment shaders viable in terms of performance... ).
>
>
> Partly this was because the performance balance ended up being skewed
> differently than originally imagined. I had originally assumed that
> span-drawing would be the main bottleneck, rather than the frontend
> application and geometry transform.
<
Are these helped big-time(t) by SIMD ???
>
> The dynamic tessellation, meanwhile, allows for cheaper span-drawing at
> the cost of a more expensive transform stage.
>
> Early on, I also didn't expect that programs like Quake would be
> spending quite as much of the total CPU time in the BSP walk and
> similar. Granted, I have a possible workaround (namely building
> triangle/quad lists and only rebuilding the list when the PVS changes),
> but haven't gotten around to this (this would be a non-trivial change to
> GLQuake; well, more-so than some of the stuff that has been done
> already, 1).
>
> Some tweaks, like moving some of the "mathlib" functions over to SIMD,
> has already been done, but Quake is limited as heavy use of "float *"
> and function calls limits how much one can do here (and a more
> significant rewrite of the engine would make it "not really Quake anymore").
>
> ...
>
> *1: This port having already been moved over to vertex-lighting rather
> than lightmaps (could be better), the use of DDS textures, a sort of
> "poor man's LOD" (alias models switch to using a low-res sprite of the
> model after a certain distance), ... In my case, this means additional
> non-standard "pak's", which is potentially an issue (I can't really
> distribute them legally, but the port wont really work as-intended
> otherwise).
>
>
>
> Also a whole lot of extra complexities pop up the moment one even dares
> look at 8-wide SIMD vectors, like some design gremlin lays in wait to
> pop out and be like, "So, you have considered one of the forbidden
> features...".

Re: Encoding saturating arithmetic

<u41i80$3p9of$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=32243&group=comp.arch#32243

Newsgroups: comp.arch
From: cr88192@gmail.com (BGB)
Subject: Re: Encoding saturating arithmetic
Date: Tue, 16 May 2023 22:38:39 -0500
Message-ID: <u41i80$3p9of$1@dont-email.me>
References: <tl4k53$3f0sn$1@newsreader4.netcologne.de>
<tmhtv4$3lm4g$1@dont-email.me>
<32de14fd-1bbd-4dec-8689-46395ff2fe2en@googlegroups.com>
<u40gpi$3i77i$1@dont-email.me>
<2212bd32-ef71-4201-8182-530abcbcbbcan@googlegroups.com>
 by: BGB - Wed, 17 May 2023 03:38 UTC

On 5/16/2023 4:15 PM, MitchAlsup wrote:
> On Tuesday, May 16, 2023 at 1:09:05 PM UTC-5, BGB wrote:
>> On 5/15/2023 8:46 PM, luke.l...@gmail.com wrote:
>
>>>
>>> i *really* don't understand why you would add fantastic Vector
>>> capability then irrevocably damage the ISA by adding PackedSIMD.
>>> if it was vec2/3/4 on *top* of the Vector capability (vec3 being
>>> the really important one as far as 3D is concerned) i would get it.
>>>
>> In my case, I only had SIMD.
>>
>> Things like (non SIMD) vector ops seemed too complicated and like a
>> worse fit to most use-cases, whereas SIMD was straightforward and
>> reasonably cost-effective to implement.
> <
> A CRAY 1 is merely a DMA device that happens to mangle the data on the
> way through. A CRAY YMP is merely a 4-wide DMA device that happens to
> mangle the data on the way through.

One needs a mechanism to load values, operate on them, and store them
again, in a pipelined fashion.

This makes things significantly more complicated than not doing vectors
this way...

SIMD requires no real change to the design of the pipeline beyond having
logic to operate on it.

Originally, the FPU did internally pipeline the vector elements through
the FADD or FMUL unit for SIMD ops in my case, but this was still
simpler as it didn't require getting any memory access in the mix. All
this happened with the main pipeline stalled.

>>
>> Or, at least in a simple form, when one wants "fast-ish" FP-SIMD (say,
>> 4-wide Binary32 with a 3L/1T timing; vs, say 10 cycles), it gets more
>> expensive...
> <
>>> PackedSIMD only works successfully where the data encountered
>>> is *exactly* matched to the ISA. vec2 for Left and Right Audio.
>>> vec3 for RGB. vec4 for ARGB or Quaternions XYZW.
> <
> Technically {XYZW} is not a quaternion but 3 linear directions and a
> weight W = sqrt(X^2 + Y^2 + Z^2). A quaternion has 1 real dimension
> and 3 imaginary dimensions. Physicists use quaternions when they
> don't want to worry about sign control in the arithmetic (i*j = k and
> j*i = -k).
> <

Yeah, a quaternion would actually be labeled IJKR rather than XYZW, and
faces ambiguity between IJKR ordering (consistent with other vectors)
and RIJK ordering (consistent with _Complex; where typically the real
component comes before the imaginary component).

Other than this, a quaternion product is fairly complicated, so is
handled in my case with a runtime call rather than inline (essentially
it is a modified/extended form of a vector cross product).
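[For reference, the full Hamilton product looks like the sketch below; this is a generic C illustration, not BGB's runtime code. The i/j/k sign pattern mentioned earlier (i*j = k but j*i = -k) falls directly out of the cross-product-like terms:]

```c
/* Hamilton product of quaternions q = r + i*I + j*J + k*K.
   Non-commutative: i*j = k but j*i = -k. */
typedef struct { double r, i, j, k; } Quat;

static Quat qmul(Quat a, Quat b)
{
    Quat q;
    q.r = a.r*b.r - a.i*b.i - a.j*b.j - a.k*b.k;
    q.i = a.r*b.i + a.i*b.r + a.j*b.k - a.k*b.j;
    q.j = a.r*b.j - a.i*b.k + a.j*b.r + a.k*b.i;
    q.k = a.r*b.k + a.i*b.j - a.j*b.i + a.k*b.r;
    return q;
}
```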

Not really a commonly used operator though.

In my experience, they are mostly used for things like 3D rotation math
(where they are better behaved albeit less "intuitive" than the
"yaw/pitch/roll" system).

Though, yaw/pitch/roll is generally better for user input as it
implicitly maintains a "this way is up" concept, which rapidly breaks
down with unconstrained quaternions.

Well, unless the setting is in zero-G, where they would likely be more
realistic for the whole zero-G experience.

>>>
>> 2/3/4 wide vectors are *very* common.
>> Wider vectors are much less common.
> <
> Perhaps you should look at BLAS, not a short vector to be found;
> or any of the other algebra/physics support packages.
>>
>> This is partly why I have thus far not bothered with wider vectors, and
>> also why BGBCC's SIMD vector extensions were partly modeled after GLSL.
> <
> The converse is also true:: My 66000 does not need SIMD because one can
> synthesize SIMD with VVM and also long vectors with VVM. However, VVM
> does not require the programmer or compiler to solve hard to do memory
> aliasing problems that SIMD and long vectors CRAY-style do.

Possibly. In my use, vectors were generally treated more like value
types than as memory arrays.

And, aliasing semantics "aren't really a thing" for value types in most
cases (except when one takes the address of a vector).

>>
>> Emphasis is more on trying to fit stuff that already used 2/3/4 wide
>> vectors to them (but would have otherwise needed to be expressed with
>> scalar operations), rather than trying to shoe-horn scalar loops into
>> vectors (seemingly the assumption with many vector and SIMD ISA designs).
>>
>> Things like RGBA vectors are already a potentially fairly large use-case
>> (though TKRA-GL was internally mostly written around fixed-point SIMD,
>> as a fast Binary16 FP-SIMD unit didn't exist when development started).
>> If I were doing it now, I might be tempted to make more use floating
>> point vectors (and possibly also perspective-correct texturing).
>>
>> ( Though, this still doesn't go as far as me having any idea how to make
>> GLSL fragment shaders viable in terms of performance... ).
>>
>>
>> Partly this was because the performance balance ended up being skewed
>> differently than originally imagined. I had originally assumed that
>> span-drawing would be the main bottleneck, rather than the frontend
>> application and geometry transform.
> <
> Are these helped big-time(t) by SIMD ???

The span drawing is fairly dense packed-integer SIMD in this case.

The vertex projection stage also uses SIMD (mostly 4x Binary32).

However, drawing each primitive also involves pushing it onto a stack,
popping it off, projecting it, and if it is too big, breaking it apart
(along the midpoint of each edge) and pushing each piece back onto the
stack, only drawing pieces that are below the size limit. The primitive
is fully drawn once this stack is empty.

However, this process isn't particularly efficient.

Absent perspective-correct texturing, subdividing primitives is
basically required for stuff to not look like broken crap.

However, with perspective-correct texturing, one would not need to
dynamically subdivide the primitives into smaller pieces, but does need
to recalculate "{S1, T1} = {S, T} / W" (and the ST step vector) pretty
much every 8 or 16 pixels or so.
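[That recalculation strategy can be sketched roughly as follows; this is hypothetical C, assuming s/w and 1/w are interpolated linearly across the span, with the true perspective divide done only at chunk boundaries and linear interpolation inside each chunk:]

```c
/* Perspective-correct texcoord recovery, recomputed every CHUNK pixels:
   s/w and 1/w vary linearly across the span; the true s = (s/w)/(1/w)
   is evaluated at chunk edges and linearly interpolated between them. */
#define CHUNK 8

static void span_texcoords(double sw0, double sw1,  /* s/w at span ends */
                           double iw0, double iw1,  /* 1/w at span ends */
                           int n, double *s_out)
{
    for (int x0 = 0; x0 < n; x0 += CHUNK) {
        int x1 = (x0 + CHUNK < n) ? x0 + CHUNK : n;
        double t0 = (double)x0 / n, t1 = (double)x1 / n;
        /* perspective divide only at the two chunk boundaries */
        double sa = (sw0 + (sw1 - sw0) * t0) / (iw0 + (iw1 - iw0) * t0);
        double sb = (sw0 + (sw1 - sw0) * t1) / (iw0 + (iw1 - iw0) * t1);
        for (int x = x0; x < x1; x++)  /* cheap lerp within the chunk */
            s_out[x] = sa + (sb - sa) * (double)(x - x0) / (x1 - x0);
    }
}
```

With constant depth (1/w the same at both ends) this degenerates to plain linear interpolation, which is why affine spans only break down under perspective.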

It is hit or miss which strategy is faster. Most modern GPUs use
perspective-correct texturing, whereas the original PlayStation and Sega
Saturn used software-managed tessellation.

Though, sadly, it still falls a bit short of the 3D performance of the
PlayStation. But, OTOH, the PlayStation had a dedicated graphics
processor (and probably wouldn't have been quite so hot if people were
doing software rendering on its MIPS CPU).

The OpenGL frontend API (and the Quake engine) is, however, almost
entirely scalar code.

On x86 and ARM, these can be mapped partly to xmmintrin and the GCC
vectors. However, on these targets, the span drawing tends to be
significantly slower (and thus usually ends up being the primary
bottleneck).

Granted, BJX2 also has some specialized helper ops which can help a fair
bit with span-drawing performance (along with the SIMD, and predicated
ops for things like depth and alpha testing, ...).

Comparably, something like Doom uses a much simpler span and column
drawing strategy, which would not benefit much from SIMD. Basically, it
fetches bytes from memory, feeds them through a color-map table, and
then sticks these values into the framebuffer.
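[That inner loop is essentially one table lookup per pixel; a schematic sketch, not Doom's actual source:]

```c
#include <stdint.h>

/* Doom-style span/column drawing: each source byte is a palette index
   remapped through a 256-entry color map (lighting), then stored. */
static void draw_colormapped(uint8_t *dst, const uint8_t *src,
                             const uint8_t *colormap, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = colormap[src[i]];
}
```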

Quake 3 would almost be promising, except that it seemingly has a
comparably much more expensive BSP walk and front-end drawing process
(and a layer-oriented "shader" system; where each surface may be drawn
multiple times in a row with different textures and dynamically
recalculated texture coordinates).

So, effectively, it would end up bottlenecked in the front-end 3D engine
rather than in "actually drawing stuff".

But, I have noted that, in terms of clock-cycles per pixel, BJX2 is
seemingly nearly 5x faster than my Ryzen, and roughly 20x faster than a
Cortex-A53 (which, despite its huge clock-speed advantage, still can't
seem to get playable frame-rates in software-emulated OpenGL; and at
this task seems to take roughly 4x as many clock-cycles on the A53 vs
the Ryzen).

But, in both of these cases, the CPUs seemingly get "totally owned" by
the span-drawing.

Also, the usability of Packed-Integer SIMD on x86-64 is kinda hosed by
the XMM/GPR split, unpredictable branches "suck real hard", ...

Also, x86-64 and ARM are hurt by the lack of special-purpose helper ops.

But, it is things like:
  Packing/unpacking RGB555:
    x86 and ARM: Shifts and masking;
    BJX2: There are dedicated helper ops.
  Texel fetch:
    x86 and ARM: Whole involved process;
    BJX2: Dedicated helper ops.
  BJX2 can use predication for depth and alpha test;
  ...
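[For the x86/ARM side of that comparison, RGB555 packing really is just shifts and masks; generic C, not the BJX2 helper ops:]

```c
#include <stdint.h>

/* Pack 8-bit-per-channel RGB into RGB555 (and back) using only shifts
   and masks -- the "no dedicated helper op" path described above. */
static uint16_t rgb555_pack(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 10) | ((g >> 3) << 5) | (b >> 3));
}

static void rgb555_unpack(uint16_t c, uint8_t *r, uint8_t *g, uint8_t *b)
{
    *r = (uint8_t)(((c >> 10) & 31) << 3);
    *g = (uint8_t)(((c >>  5) & 31) << 3);
    *b = (uint8_t)(( c        & 31) << 3);
}
```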

Granted, maybe one can argue that ASM and helper ops is cheating, but
alas...

Similarly, I had also noted that with some helper ops, it is able to run
a color-cell encoder at roughly 5 megapixels/second, ... (vs. around 1.5
for the plain C version).


Re: Encoding saturating arithmetic

<13f7bf2d-fbb8-4b06-a5c4-6715add40c8an@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=32250&group=comp.arch#32250

Newsgroups: comp.arch
Date: Wed, 17 May 2023 08:51:44 -0700 (PDT)
References: <tl4k53$3f0sn$1@newsreader4.netcologne.de> <tmhtv4$3lm4g$1@dont-email.me>
<32de14fd-1bbd-4dec-8689-46395ff2fe2en@googlegroups.com> <u40gpi$3i77i$1@dont-email.me>
Message-ID: <13f7bf2d-fbb8-4b06-a5c4-6715add40c8an@googlegroups.com>
Subject: Re: Encoding saturating arithmetic
From: luke.leighton@gmail.com (luke.l...@gmail.com)
 by: luke.l...@gmail.com - Wed, 17 May 2023 15:51 UTC

On Tuesday, May 16, 2023 at 7:09:05 PM UTC+1, BGB wrote:
> On 5/15/2023 8:46 PM, luke.l...@gmail.com wrote:
> > chapter 7. after seeing how out of control this can get
> > in the AndesSTAR DSP ISA i always feel uneasy whenever
> > i see explicit saturation opcodes added to an ISA that
> > only has 32-bit available for instruction format.
> >
> I can note that I still don't have any dedicated saturating ops, but
> this is partly for cost and timing concerns (and I haven't yet
> encountered a case where I "strongly needed" saturating ops).

if you are doing Video Encode/Decode (try AV1 for example)
you'll need them to stand a chance of any kind of power-efficient
operation.

in SVP64, Saturation is a *prefix* Mode-bit:
https://libre-soc.org/openpower/sv/normal/
<pre>
| 0-1   | 2   | 3 4    | description                    |
| ----- | --- | ------ | ------------------------------ |
| 0 0   | 0   | dz sz  | simple mode                    |
| 0 0   | 1   | RG 0   | scalar reduce mode (mapreduce) |
| 0 0   | 1   | /  1   | reserved                       |
| 1 0   | N   | dz sz  | sat mode: N=0/1 u/s            | <------ here
| VLi 1 | inv | CR-bit | Rc=1: ffirst CR sel            |
| VLi 1 | inv | zz RC1 | Rc=0: ffirst z/nonz            |
</pre>

so that means that *all* Prefixable (Vectoriseable) instructions
may have arithmetic saturation. for Logical operations we
choose a different meaning (yet to decide what that is,
probably involving testing of bit zero => extend all 1s)

yes an entire bit of the Prefix (ok under a Mode) dedicated
to whether Saturation is enabled or not.

this has the advantage that we do not end up poisoning
the encoding of the Suffixes with difficult choices about
whether to prioritise this or prioritise that: *all* Arithmetic
and Logical instructions have Signed/Unsigned Saturation.

if you look at the Power ISA v3.1 VSX subset you find there
are an awful lot of saturation instructions: add subtract
mul-add sum-across, pack (over 10 instructions alone),
convert fixed-point, it's a *lot* - that's just the integer set!

and that's what i warned about: when you get down to it,
saturation turns out to need to be applied to such a vast
number of operations that it amounts to about half a bit's
worth of encoding.

l.

Re: Encoding saturating arithmetic

<u43523$3uml6$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=32256&group=comp.arch#32256

Newsgroups: comp.arch
From: cr88192@gmail.com (BGB)
Subject: Re: Encoding saturating arithmetic
Date: Wed, 17 May 2023 13:05:52 -0500
Message-ID: <u43523$3uml6$1@dont-email.me>
References: <tl4k53$3f0sn$1@newsreader4.netcologne.de>
<tmhtv4$3lm4g$1@dont-email.me>
<32de14fd-1bbd-4dec-8689-46395ff2fe2en@googlegroups.com>
<u40gpi$3i77i$1@dont-email.me>
<13f7bf2d-fbb8-4b06-a5c4-6715add40c8an@googlegroups.com>
 by: BGB - Wed, 17 May 2023 18:05 UTC

On 5/17/2023 10:51 AM, luke.l...@gmail.com wrote:
> On Tuesday, May 16, 2023 at 7:09:05 PM UTC+1, BGB wrote:
>> On 5/15/2023 8:46 PM, luke.l...@gmail.com wrote:
>>> chapter 7. after seeing how out of control this can get
>>> in the AndesSTAR DSP ISA i always feel uneasy whenever
>>> i see explicit saturation opcodes added to an ISA that
>>> only has 32-bit available for instruction format.
>>>
>> I can note that I still don't have any dedicated saturating ops, but
>> this is partly for cost and timing concerns (and I haven't yet
>> encountered a case where I "strongly needed" saturating ops).
>
> if you are doing Video Encode/Decode (try AV1 for example)
> you'll need them to stand a chance of any kind of power-efficient
> operation.
>

There are usually workarounds, say, using the SIMD ops as 2.14 rather
than 0.16, and then clamping after the fact.
Say: High 2 bits:
00: Value in range
01: Value out of range on positive side, clamp to 3FFF
11: Value out of range on negative side, clamp to 0000
10: Ambiguous, shouldn't happen.

Though, this doesn't deal with stuff going significantly out of range.
But, then one can go 4.12 if they need more range for overflow, ...
0001..0111: Positive overflow.
1000..1111: Negative overflow.
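The clamp step for the 2.14 case can be sketched in C (helper name is
made up here), switching on the high 2 bits per the table above:

```c
#include <stdint.h>

/* Clamp a 2.14 intermediate back into 0000..3FFF using the high
 * 2 bits. The "ambiguous" 10 case is folded into the negative
 * side here, since it shouldn't occur in practice. */
static inline uint16_t clamp_2_14(uint16_t v) {
    switch (v >> 14) {
    case 0:  return v;        /* 00: value in range */
    case 1:  return 0x3FFF;   /* 01: positive overflow */
    default: return 0x0000;   /* 11 (and 10): negative overflow */
    }
}
```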

Or, for some other cases, it is sufficient merely to tweak the value
ranges slightly, such as mapping RGB values onto
0080..FF80 rather than 0000..FFFF,
which then allows the arithmetic to go "very slightly" out of range
without resulting in a visible overflow (TKRA-GL had used this strategy;
otherwise one could get ugly artifacts when interpolating colors).

So, the lack of saturating packed integer operations hasn't usually been
a huge issue thus far; as there are usually workarounds.

> in SVP64, Saturation is a *prefix* Mode-bit:
> https://libre-soc.org/openpower/sv/normal/
> <pre>
> | 0-1 | 2 | 3 4 | description |
> | ------ | --- |---------|-------------------------- |
> | 0 0 | 0 | dz sz | simple mode |
> | 0 0 | 1 | RG 0 | scalar reduce mode (mapreduce) |
> | 0 0 | 1 | / 1 | reserved |
> | 1 0 | N | dz sz | sat mode: N=0/1 u/s | <------ here
> | VLi 1 | inv | CR-bit | Rc=1: ffirst CR sel |
> | VLi 1 | inv | zz RC1 | Rc=0: ffirst z/nonz |
> </pre>
>
> so that means that *all* Prefixable (Vectoriseable) instructions
> may have arithmetic saturation. for Logical operations we
> choose a different meaning (yet to decide what that is,
> probably involving testing of bit zero => extend all 1s)
>
> yes an entire bit of the Prefix (ok under a Mode) dedicated
> to whether Saturation is enabled or not.
>
> this has the advantage that we do not end up poisoning
> the encoding of the Suffixes with difficult choices about
> whether to prioritise this or prioritise that: *all* Arithmetic
> and Logical instructions have Signed/Unsigned Saturation.
>
> if you look at the Power ISA v3.1 VSX subset you find there
> are an awful lot of saturation instructions: add subtract
> mul-add sum-across, pack (over 10 instructions alone),
> convert fixed-point, it's a *lot* - that's just the integer set!
>
> and that's what i warned about: when you get down to it,
> saturation turns out to need to be applied to such a vast
> number of operations that it is about 0.5 of a bit's worth
> of encoding needed.
>

OK.

Doesn't mean I intend to add general saturation.

If anything, probably either things like:
PADDUS.W Rm, Ro, Rn //Packed Add with Unsigned Saturate.
PADDSS.W Rm, Ro, Rn //Packed Add with Signed Saturate.
PSUBUS.W Rm, Ro, Rn //Packed Sub with Unsigned Saturate.
PSUBSS.W Rm, Ro, Rn //Packed Sub with Signed Saturate.
Then probably leave it at that...

Unless I used 64-bit ops, doing anything much beyond this would burn too
much encoding space.

Or, define lazy multi-part ops:
PSHR2U.W //Packed Shift right, 2 bits unsigned.
PADD.W Rm, Ro, Rn //Normal packed ADD
PSHL2US.W Rm, Rn //Rn[i] = USatW(Rm[i]<<2) (New)
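One 16-bit lane of that three-op sequence can be modeled in C (function
name invented here), showing both the saturation and the 2-bit
precision loss:

```c
#include <stdint.h>

/* One 16-bit lane of the sequence above: pre-shift both operands
 * right by 2 (PSHR2U.W), add normally (PADD.W, which can no longer
 * overflow), then shift left by 2 with unsigned saturate
 * (PSHL2US.W). The low 2 bits of each input are discarded. */
static inline uint16_t addu16_sat_3op(uint16_t a, uint16_t b) {
    uint16_t sa  = (uint16_t)(a >> 2);
    uint16_t sb  = (uint16_t)(b >> 2);
    uint16_t sum = (uint16_t)(sa + sb);
    uint32_t w   = (uint32_t)sum << 2;
    return (w > 0xFFFFu) ? 0xFFFFu : (uint16_t)w;
}
```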

Note that there are currently no general packed-shift operators either,
but this hasn't been a huge loss (packed shift by non-constant values
being "not really a thing").

Granted, one could argue that this "isn't as good" (needs 3 ops and
loses 2 bits of precision), but it does avoid needing to burn 3R space
(and is easier to generalize to other contexts).

But, as noted, this is a corner of things I had been mostly ignoring
thus far...

Re: Encoding saturating arithmetic

<d8a85e5c-03ae-463e-9e63-78d8bc0b9bb5n@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=32258&group=comp.arch#32258

Newsgroups: comp.arch
Date: Wed, 17 May 2023 12:25:19 -0700 (PDT)
In-Reply-To: <u43523$3uml6$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=92.19.80.230; posting-account=soFpvwoAAADIBXOYOBcm_mixNPAaxW9p
NNTP-Posting-Host: 92.19.80.230
References: <tl4k53$3f0sn$1@newsreader4.netcologne.de> <tmhtv4$3lm4g$1@dont-email.me>
<32de14fd-1bbd-4dec-8689-46395ff2fe2en@googlegroups.com> <u40gpi$3i77i$1@dont-email.me>
<13f7bf2d-fbb8-4b06-a5c4-6715add40c8an@googlegroups.com> <u43523$3uml6$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <d8a85e5c-03ae-463e-9e63-78d8bc0b9bb5n@googlegroups.com>
Subject: Re: Encoding saturating arithmetic
From: luke.leighton@gmail.com (luke.l...@gmail.com)
Injection-Date: Wed, 17 May 2023 19:25:20 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 22
 by: luke.l...@gmail.com - Wed, 17 May 2023 19:25 UTC

On Wednesday, May 17, 2023 at 7:07:37 PM UTC+1, BGB wrote:

> There are usually workarounds, say, using the SIMD ops as 2.14 rather
> than 0.16, and then clamping after the fact.

i picked AV1 as a specific example because one of my associates
pointed me just last night at the convolve.c code, which works
on 16-bit RGB and in NEON assembler required converting
that to 32-bit sign-extended values precisely because of the
lack of an appropriate sign-extended 16-bit add.

these are routines so heavily used in AV1 that, once optimised,
they gave a whopping 40% reduction in completion time: 90 minutes
dropped down to 55.

it sounds great in theory to sacrifice 2 bits (the other example given
here was 12-bits) but in practice the real-world A/V DSP usage does
not fit with that assumption. YUV decode maybe. ARGB encode
most certainly not.

l.

Re: Encoding saturating arithmetic

<3afa49f5-ce89-4a94-9aa1-d1c12cac5da7n@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=32259&group=comp.arch#32259

Newsgroups: comp.arch
Date: Wed, 17 May 2023 13:13:52 -0700 (PDT)
In-Reply-To: <u43523$3uml6$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:6517:4b00:b56e:240c;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:6517:4b00:b56e:240c
References: <tl4k53$3f0sn$1@newsreader4.netcologne.de> <tmhtv4$3lm4g$1@dont-email.me>
<32de14fd-1bbd-4dec-8689-46395ff2fe2en@googlegroups.com> <u40gpi$3i77i$1@dont-email.me>
<13f7bf2d-fbb8-4b06-a5c4-6715add40c8an@googlegroups.com> <u43523$3uml6$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <3afa49f5-ce89-4a94-9aa1-d1c12cac5da7n@googlegroups.com>
Subject: Re: Encoding saturating arithmetic
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Wed, 17 May 2023 20:13:53 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: MitchAlsup - Wed, 17 May 2023 20:13 UTC

On Wednesday, May 17, 2023 at 1:07:37 PM UTC-5, BGB wrote:
> On 5/17/2023 10:51 AM, luke.l...@gmail.com wrote:
> > On Tuesday, May 16, 2023 at 7:09:05 PM UTC+1, BGB wrote:
> >> On 5/15/2023 8:46 PM, luke.l...@gmail.com wrote:
> >>> chapter 7. after seeing how out of control this can get
> >>> in the AndesSTAR DSP ISA i always feel uneasy whenever
> >>> i see explicit saturation opcodes added to an ISA that
> >>> only has 32-bit available for instruction format.
> >>>
> >> I can note that I still don't have any dedicated saturating ops, but
> >> this is partly for cost and timing concerns (and I haven't yet
> >> encountered a case where I "strongly needed" saturating ops).
> >
> > if you are doing Video Encode/Decode (try AV1 for example)
> > you'll need them to stand a chance of any kind of power-efficient
> > operation.
> >
> There are usually workarounds, say, using the SIMD ops as 2.14 rather
> than 0.16, and then clamping after the fact.
> Say: High 2 bits:
> 00: Value in range
> 01: Value out of range on positive side, clamp to 3FFF
> 11: Value out of range on negative side, clamp to 0000
> 10: Ambiguous, shouldn't happen.
<
This brings to mind:: the application:::
<
CPUs try to achieve the highest frequency of operation and pipeline
away logic-delay problems--LDs are now 4 and 5 cycles rather than
2 (MIPS R3000)--because that is where the performance is, as there is
rarely enough parallelism to utilize more than a "few" cores.
<
GPUs on the other hand, seem to be content to stay near 1 GHz
and just throw shader cores at the problem rather than fight for
frequency. Since GPUs process embarrassingly parallel applications
one can freely trade cores for frequency (and vice versa).
<
So, in GPUs, there are arithmetic designs that can fully absorb the
delays of saturation, whereas in CPUs it is not so simple.
<merciful snip>
> > and that's what i warned about: when you get down to it,
> > saturation turns out to need to be applied to such a vast
> > number of operations that it is about 0.5 of a bit's worth
> > of encoding needed.
> >
> OK.
>
> Doesn't mean I intend to add general saturation.
<
Your application is mid-way between CPUs and GPUs.

Re: Encoding saturating arithmetic

<1a3265b2-2641-44be-842a-24c97bee4d0dn@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=32260&group=comp.arch#32260

Newsgroups: comp.arch
Date: Wed, 17 May 2023 13:19:59 -0700 (PDT)
In-Reply-To: <d8a85e5c-03ae-463e-9e63-78d8bc0b9bb5n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:6517:4b00:b56e:240c;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:6517:4b00:b56e:240c
References: <tl4k53$3f0sn$1@newsreader4.netcologne.de> <tmhtv4$3lm4g$1@dont-email.me>
<32de14fd-1bbd-4dec-8689-46395ff2fe2en@googlegroups.com> <u40gpi$3i77i$1@dont-email.me>
<13f7bf2d-fbb8-4b06-a5c4-6715add40c8an@googlegroups.com> <u43523$3uml6$1@dont-email.me>
<d8a85e5c-03ae-463e-9e63-78d8bc0b9bb5n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <1a3265b2-2641-44be-842a-24c97bee4d0dn@googlegroups.com>
Subject: Re: Encoding saturating arithmetic
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Wed, 17 May 2023 20:20:00 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: MitchAlsup - Wed, 17 May 2023 20:19 UTC

On Wednesday, May 17, 2023 at 2:25:22 PM UTC-5, luke.l...@gmail.com wrote:
> On Wednesday, May 17, 2023 at 7:07:37 PM UTC+1, BGB wrote:
>
> > There are usually workarounds, say, using the SIMD ops as 2.14 rather
> > than 0.16, and then clamping after the fact.
> i picked AV1 as a specific example because one of my associates
> pointed me only just night at the convolve.c code which works
> on 16-bit RGB and in NEON assembler required conversion of
> that to 32-bit sign-extended values precisely because of the
> lack of appropriate sign-extended 16-bit add.
<
One of the interesting things about VVM is that the size of the SIMD
operands does not have to be the same as the size of the SIMD results.
One can have operands of size byte and results of size halfword; or any
other cross product.
>
> these are routines that are so heavily used in AV1 that once optimised
> resulted in a whopping 40% reduction in completion time: 90 minutes
> dropped down to 55.
>
> it sounds great in theory to sacrifice 2 bits (the other example given
> here was 12-bits) but in practice the real-world A/V DSP usage does
> not fit with that assumption. YUV decode maybe. ARGB encode
> most certainly not.
<
And thanks for pointing out that existing applications utilize the full
container range of the operands and results.
>
> l.

Re: Encoding saturating arithmetic

<d459241b-8720-4f0d-9bca-d7e74301e1b7n@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=32262&group=comp.arch#32262

Newsgroups: comp.arch
Date: Wed, 17 May 2023 15:13:34 -0700 (PDT)
In-Reply-To: <1a3265b2-2641-44be-842a-24c97bee4d0dn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=92.19.80.230; posting-account=soFpvwoAAADIBXOYOBcm_mixNPAaxW9p
NNTP-Posting-Host: 92.19.80.230
References: <tl4k53$3f0sn$1@newsreader4.netcologne.de> <tmhtv4$3lm4g$1@dont-email.me>
<32de14fd-1bbd-4dec-8689-46395ff2fe2en@googlegroups.com> <u40gpi$3i77i$1@dont-email.me>
<13f7bf2d-fbb8-4b06-a5c4-6715add40c8an@googlegroups.com> <u43523$3uml6$1@dont-email.me>
<d8a85e5c-03ae-463e-9e63-78d8bc0b9bb5n@googlegroups.com> <1a3265b2-2641-44be-842a-24c97bee4d0dn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <d459241b-8720-4f0d-9bca-d7e74301e1b7n@googlegroups.com>
Subject: Re: Encoding saturating arithmetic
From: luke.leighton@gmail.com (luke.l...@gmail.com)
Injection-Date: Wed, 17 May 2023 22:13:35 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 75
 by: luke.l...@gmail.com - Wed, 17 May 2023 22:13 UTC

On Wednesday, May 17, 2023 at 9:20:03 PM UTC+1, MitchAlsup wrote:
> On Wednesday, May 17, 2023 at 2:25:22 PM UTC-5, luke.l...@gmail.com wrote:
> > On Wednesday, May 17, 2023 at 7:07:37 PM UTC+1, BGB wrote:
> >
> > > There are usually workarounds, say, using the SIMD ops as 2.14 rather
> > > than 0.16, and then clamping after the fact.
> > i picked AV1 as a specific example because one of my associates
> > pointed me only just night at the convolve.c code which works
> > on 16-bit RGB and in NEON assembler required conversion of
> > that to 32-bit sign-extended values precisely because of the
> > lack of appropriate sign-extended 16-bit add.
> <
> One of the interesting things about VVM is that the size of the SIMD
> operands does not have to be the same as the size of the SIMD results.
> One can have operands of size byte and results of size halfword; or any
> other cross product.

likewise in SVP64 although the suffix (Power ISA Scalar ops) may
have been originally designed as 64-bit only, the SVP64 *Prefix*
dedicates a whopping 4 out of its (eyebrow-raising) 24 bits to
override source and destination widths, allowing them *each*
to be 8/16/32/64.

thus, because the Vector Length is in *elements* - not bit-sizes,
not size-of-register - you get the same freedom as VVM.
back-end hardware gets told "source is 32-bit, result is 64-bit"
and it just ends up putting results into twice as many
*scalar* regfile entries as there were source regfile entries.
big deal.

the "price" for this much flexibility in SVP64 is those 24 bits'
worth of Prefix:
* 2 for src-width
* 2 for dest-width
* 1+3+3 for source *and* destination Predicate Masks
* 9 for marking registers as Scalar/Vector and extending to 128
* 5 for "Modes" (Saturate, Reduction, Zeroing, Fail-First)

it adds up pretty damn fast!

> And thanks for pointing out that existing applications utilize the full
> container range of the operands and results.

https://aomedia.googlesource.com/aom/+/refs/heads/m72/av1/common/convolve.c#453

here, i believe, is the function my associate was discussing last
night. it performs an *unsigned* 32-bit addition of a 16-bit *sign-extended*
operand with a 16-bit *non*-extended operand (!?!?!?!) because
this is apparently how you do the clipping (saturation). notice
that the temporaries *had* to be 32-bit - not 16-bit - even though
the pixels are 15-bit 5R5G5B. bit 16 is the "sign" bit used for
saturation-overflow detection and subsequent clipping.
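a rough scalar C sketch of that clipping idiom (names invented here,
not convolve.c's actual code): accumulate in a 32-bit temporary so
the overflow direction survives above bit 15, then clip back down:

```c
#include <stdint.h>

/* clip a 32-bit intermediate back to an unsigned 16-bit pixel range:
 * a negative accumulator (the "sign" above bit 15) clips to 0,
 * anything past 0xFFFF clips to 0xFFFF */
static inline uint16_t clip_pixel_u16(int32_t acc) {
    if (acc < 0)      return 0;
    if (acc > 0xFFFF) return 0xFFFF;
    return (uint16_t)acc;
}
```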

using a single temporary *scalar* register, not a problem.

using temporary SIMD or Vector registers at double the width,
now not only do you have a mismatched size/width problem
(fixed-width SIMD registers can only hold half the number of
elements at double-element size) but you have a regfile resource
allocation problem as well.

indeed VVM (and SVP64, and the Mill) free you from this
otherwise-intractable problem.

also VVM would solve the temporary-allocation problem by
eliding(?) the double-width temporary scalar registers used
within the loop entirely into Reservation Stations that never
actually hit the regfile at all.

l.

Re: Encoding saturating arithmetic

<e545abd0-7a12-49be-b238-b97261921765n@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=32263&group=comp.arch#32263

Newsgroups: comp.arch
Date: Wed, 17 May 2023 15:58:07 -0700 (PDT)
In-Reply-To: <d459241b-8720-4f0d-9bca-d7e74301e1b7n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:6517:4b00:b56e:240c;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:6517:4b00:b56e:240c
References: <tl4k53$3f0sn$1@newsreader4.netcologne.de> <tmhtv4$3lm4g$1@dont-email.me>
<32de14fd-1bbd-4dec-8689-46395ff2fe2en@googlegroups.com> <u40gpi$3i77i$1@dont-email.me>
<13f7bf2d-fbb8-4b06-a5c4-6715add40c8an@googlegroups.com> <u43523$3uml6$1@dont-email.me>
<d8a85e5c-03ae-463e-9e63-78d8bc0b9bb5n@googlegroups.com> <1a3265b2-2641-44be-842a-24c97bee4d0dn@googlegroups.com>
<d459241b-8720-4f0d-9bca-d7e74301e1b7n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <e545abd0-7a12-49be-b238-b97261921765n@googlegroups.com>
Subject: Re: Encoding saturating arithmetic
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Wed, 17 May 2023 22:58:08 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 5445
 by: MitchAlsup - Wed, 17 May 2023 22:58 UTC

On Wednesday, May 17, 2023 at 5:13:38 PM UTC-5, luke.l...@gmail.com wrote:
> On Wednesday, May 17, 2023 at 9:20:03 PM UTC+1, MitchAlsup wrote:
> > On Wednesday, May 17, 2023 at 2:25:22 PM UTC-5, luke.l...@gmail.com wrote:
> > > On Wednesday, May 17, 2023 at 7:07:37 PM UTC+1, BGB wrote:
> > >
> > > > There are usually workarounds, say, using the SIMD ops as 2.14 rather
> > > > than 0.16, and then clamping after the fact.
> > > i picked AV1 as a specific example because one of my associates
> > > pointed me only just night at the convolve.c code which works
> > > on 16-bit RGB and in NEON assembler required conversion of
> > > that to 32-bit sign-extended values precisely because of the
> > > lack of appropriate sign-extended 16-bit add.
> > <
> > One of the interesting things about VVM is that the size of the SIMD
> > operands does not have to be the same as the size of the SIMD results.
> > One can have operands of size byte and results of size halfword; or any
> > other cross product.
> likewise in SVP64 although the suffix (Power ISA Scalar ops) may
> have been originally designed as 64-bit only, the SVP64 *Prefix*
> dedicates a whopping 4 out of its (eyebrow-raising) 24 bits to
> override source and destination widths, allowing them *each*
> to be 8/16/32/64.
>
> thus, because the Vector Length is in *elements* - not bit-sizes
> not size-of-register, you get the same freedom as VVM.
<
Very good.
<
> back-end hardware gets told "source is 32-bit, result is 64-bit"
> and it just ends up putting results into twice as many
> *scalar* regfile entries as there were source regfile entries.
> big deal.
<
So the vector length is limited by register count.
>
> the "price" for this much flexibility in SVP64 is that 24-bits
> worth of Prefix.
> * 2 for src-width
> * 2 for dest-width
> * 1+3+3 for source *and* destination Predicate Masks
> * 9 for marking registers as Scalar/Vector and extending to 128
> * 5 for "Modes" (Saturate, Reduction, Zeroing, Fail-First)
<
And VVM pays 0-bits {src-width, dst-width, predicate masks}
but, realistically, I don't have any of those modes.
>
> it adds up pretty damn fast!
<
0×any reasonable number == 0
<
> > And thanks for pointing out that existing applications utilize the full
> > container range of the operands and results.
> https://aomedia.googlesource.com/aom/+/refs/heads/m72/av1/common/convolve.c#453
>
> here i believe was the function my associate was discussing last
> night, it performs an *unsigned* 32-bit addition of a 16-bit *sign-extended*
> operand with a 16-bit *non*-extended operand (!?!?!?!) because
> this is how apparently you do the clipping (saturation). notice
> that the temporaries *had* to be 32-bit - not 16-bit - even though
> the pixels are 15-bit 5R5G5B. bit 16 is the "sign" bit used for
> saturation-overflow detection and subsequent clipping.
>
> using a single temporary *scalar* register, not a problem.
>
> using temporary SIMD or Vector registers at double the width,
> now not only do you have a mismatched size/width problem
> (fixed-width SIMD registers can only hold half the number of
> elements at double-element size) but you have a regfile resource
> allocation problem as well.
>
> indeed VVM (and SVP64, and the Mill) free you from this
> otherwise-intractable problem.
>
> also VVM would solve the temporary-allocation problem by
> eliding(?) the double-width temporary scalar registers used
> within the loop entirely into Reservation Stations that never
> actually hit the regfile at all.
<
Yes.
>
> l.

Re: Encoding saturating arithmetic

<u447cg$5vgt$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=32266&group=comp.arch#32266

From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Encoding saturating arithmetic
Date: Wed, 17 May 2023 22:51:41 -0500
Organization: A noiseless patient Spider
Lines: 177
Message-ID: <u447cg$5vgt$1@dont-email.me>
References: <tl4k53$3f0sn$1@newsreader4.netcologne.de>
<tmhtv4$3lm4g$1@dont-email.me>
<32de14fd-1bbd-4dec-8689-46395ff2fe2en@googlegroups.com>
<u40gpi$3i77i$1@dont-email.me>
<13f7bf2d-fbb8-4b06-a5c4-6715add40c8an@googlegroups.com>
<u43523$3uml6$1@dont-email.me>
<3afa49f5-ce89-4a94-9aa1-d1c12cac5da7n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 18 May 2023 03:51:44 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="2b7f7e0b92dbf5449a07a2cc8713764a";
logging-data="196125"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19EUNuJp08qLb7vVMd0DvdO"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.10.1
Cancel-Lock: sha1:vj2uHPWyaV3+JWLDFjuLvXvJEVM=
In-Reply-To: <3afa49f5-ce89-4a94-9aa1-d1c12cac5da7n@googlegroups.com>
Content-Language: en-US
 by: BGB - Thu, 18 May 2023 03:51 UTC

On 5/17/2023 3:13 PM, MitchAlsup wrote:
> On Wednesday, May 17, 2023 at 1:07:37 PM UTC-5, BGB wrote:
>> On 5/17/2023 10:51 AM, luke.l...@gmail.com wrote:
>>> On Tuesday, May 16, 2023 at 7:09:05 PM UTC+1, BGB wrote:
>>>> On 5/15/2023 8:46 PM, luke.l...@gmail.com wrote:
>>>>> chapter 7. after seeing how out of control this can get
>>>>> in the AndesSTAR DSP ISA i always feel uneasy whenever
>>>>> i see explicit saturation opcodes added to an ISA that
>>>>> only has 32-bit available for instruction format.
>>>>>
>>>> I can note that I still don't have any dedicated saturating ops, but
>>>> this is partly for cost and timing concerns (and I haven't yet
>>>> encountered a case where I "strongly needed" saturating ops).
>>>
>>> if you are doing Video Encode/Decode (try AV1 for example)
>>> you'll need them to stand a chance of any kind of power-efficient
>>> operation.
>>>
>> There are usually workarounds, say, using the SIMD ops as 2.14 rather
>> than 0.16, and then clamping after the fact.
>> Say: High 2 bits:
>> 00: Value in range
>> 01: Value out of range on positive side, clamp to 3FFF
>> 11: Value out of range on negative side, clamp to 0000
>> 10: Ambiguous, shouldn't happen.
> <
> This brings to mind:: the application:::
> <
> CPUs try to achieve highest frequency of operation and pipeline
> away logic delay problems--LDs are now 4 and 5 cycles rather than
> 2 (MIPS R3000); because that is where performance is as there is
> rarely enough parallelism to utilize more than a "few" cores.
> <

I have 3-cycle memory access.

Early on, load/store was not pipelined (and would always take 3 clock
cycles), but slow memory ops were not ideal for performance. I had
extended the pipeline to 3 execute stages mostly because this allowed
pipelining both load/store and integer multiply.

If the pipeline were extended to 6 execute stages, this would also allow
for things like pipelined double-precision ops, or single-precision
multiply-accumulate.

But, this would also require more complicated register forwarding,
would make branch mispredicts slower, etc., so it didn't seem
worthwhile. In all, it would likely end up hurting performance more
than it would help.

As can be noted, current pipeline is roughly:
PF IF ID1 ID2 EX1 EX2 EX3 WB
Or:
PF IF ID RF EX1 EX2 EX3 WB

Since ID2 doesn't actually decode anything, just fetches and forwards
register values in preparation for EX1.

From what I can gather, it seems a fair number of other RISCs had also
ended up with a similar pipeline (somewhat more so than the classic
5-stage pipeline).

> GPUs on the other hand, seem to be content to stay near 1 GHz
> and just throw shader cores at the problem rather than fight for
> frequency. Since GPUs process embarrassingly parallel applications
> one can freely trade cores for frequency (and vice versa).
> <
> So, in GPUs, there are arithmetic designs can fully absorb the
> delays of saturation, whereas in CPUs it is not so simple.
> <merciful snip>

For many use-cases, running at a lower clock rate and focusing more on
shoveling stuff through the pipeline may make more sense than fighting
for a higher clock speed.

As noted before, I could get 100MHz, but (sadly), it would mean a 1-wide
RISC with fairly small L1 caches. Didn't really seem like a win, and I
can't really make the RAM any faster.

Though, it is very possible that programs like Doom and similar might do
better with a 100MHz RISC than a 50MHz VLIW.

Things like "spin in a tight loop executing a relatively small number of
serially dependent instructions" is something where a 100MHz 1-wide core
has an obvious advantage over a 50MHz 3-wide core.

>>> and that's what i warned about: when you get down to it,
>>> saturation turns out to need to be applied to such a vast
>>> number of operations that it is about 0.5 of a bit's worth
>>> of encoding needed.
>>>
>> OK.
>>
>> Doesn't mean I intend to add general saturation.
> <
> Your application is mid-way between CPUs and GPUs.
>

Probably true, and it seems like I am getting properties that at times
seem more GPU-like than CPU-like.

Then, I am still off trying to get RISC-V code running on top of BJX2 as
well.

But, at the moment, the issue isn't so much with the RISC-V ISA per se,
so much as trying to get GCC to produce output that I can really use in
TestKern...

Turns out that neither FDPIC nor PIE is supported on RISC-V; rather, it
only really supports fixed-address binaries (with the libraries
apparently being statically linked into the binaries).

People had apparently argued back and forth about whether to enable
shared objects and similar, but tended to leave it off because dynamic
linking is prone to breaking stuff.

I hadn't imagined the situation would be anywhere near this weak...

I had sort of thought being able to have shared objects, PIE
executables, etc, was sort of the whole point of ELF.

Also, the toolchain doesn't support PE/COFF for this target either
(apparently PE/COFF only being available for x86/ARM/SH4/etc).

Where, typically, PE/COFF binaries have a base-relocation table, ...

Most strategies for giving a program its own logical address space would
be kind of a pain for TestKern.

I would need to decide between having multiple 48-bit address spaces, or
make use of the 96-bit address space; say, loading a RV64 process at,
say, 0000_0000xxxx_0000_0xxxxxxx or similar...

Though, at least the 96-bit address space option means that the kernel
can still have pointers into the program's space (but, would mean that
stuff servicing system calls would need to start working with 128-bit
pointers).

Well, at least short of other address space hacks, say:
0000_00000123_0000_0xxxxxxx
Is mirrored at, say:
7123_0xxxxxxx

So that syscall handlers don't need to use bigger pointers, but the
program can still pretend to have its own virtual address space.

Well, this or add some addressing hacks (say, a mode allowing
0000_xxxxxxxx to be remapped within the larger 48-bit space).
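The mirroring hack above can be sketched as a pair of address-forming helpers. This is purely illustrative: the field layout (a 12-bit space ID, an offset with a zero top nibble, and a '7' tag nibble) is read off the example addresses in the post, and the function names are invented here, not taken from any actual TestKern code.

```python
# Hypothetical model of the address-space mirroring hack described above:
# a process's 96-bit region is aliased into the 48-bit space under the
# '7' tag nibble, so syscall handlers can keep using plain 64-bit
# pointers. Field widths inferred from the example values only.

def to_addr96(space_id: int, offset: int) -> int:
    """Full 96-bit form: space ID in the high half, offset in the low half."""
    assert 0 <= space_id < (1 << 12) and 0 <= offset < (1 << 28)
    return (space_id << 48) | offset

def to_mirror48(space_id: int, offset: int) -> int:
    """48-bit alias: tag nibble 7, then the space ID, then the offset."""
    assert 0 <= space_id < (1 << 12) and 0 <= offset < (1 << 28)
    return (0x7 << 44) | (space_id << 32) | offset
```

For space ID 0x123, to_addr96 produces the 0000_00000123_0000_0xxxxxxx form and to_mirror48 produces the 7123_0xxxxxxx alias, matching the example values above.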

I would rather have had PIE binaries or similar and not need to deal
with any of this...

Somehow, I didn't expect "Yeah, you can load stuff anywhere in the
address space" to actually be an issue...

I can note, by extension, that BGBCC's PEL4 output can be loaded
anywhere in the address space.

Still mostly static-linking everything, but (unexpectedly) I am not
actually behind on this front (and the DLLs do actually exist, sort of;
even if at present they are more used as loadable modules than as OS
libraries).

....

Re: Encoding saturating arithmetic

<66edf634-cc68-47ae-90f2-d4d422feca26n@googlegroups.com>

 by: robf...@gmail.com - Thu, 18 May 2023 09:08 UTC

On Wednesday, May 17, 2023 at 11:51:49 PM UTC-4, BGB wrote:
> On 5/17/2023 3:13 PM, MitchAlsup wrote:
> > On Wednesday, May 17, 2023 at 1:07:37 PM UTC-5, BGB wrote:
> >> On 5/17/2023 10:51 AM, luke.l...@gmail.com wrote:
> >>> On Tuesday, May 16, 2023 at 7:09:05 PM UTC+1, BGB wrote:
> >>>> On 5/15/2023 8:46 PM, luke.l...@gmail.com wrote:
> >>>>> chapter 7. after seeing how out of control this can get
> >>>>> in the AndesSTAR DSP ISA i always feel uneasy whenever
> >>>>> i see explicit saturation opcodes added to an ISA that
> >>>>> only has 32-bit available for instruction format.
> >>>>>
> >>>> I can note that I still don't have any dedicated saturating ops, but
> >>>> this is partly for cost and timing concerns (and I haven't yet
> >>>> encountered a case where I "strongly needed" saturating ops).
> >>>
> >>> if you are doing Video Encode/Decode (try AV1 for example)
> >>> you'll need them to stand a chance of any kind of power-efficient
> >>> operation.
> >>>
> >> There are usually workarounds, say, using the SIMD ops as 2.14 rather
> >> than 0.16, and then clamping after the fact.
> >> Say: High 2 bits:
> >> 00: Value in range
> >> 01: Value out of range on positive side, clamp to 3FFF
> >> 11: Value out of range on negative side, clamp to 0000
> >> 10: Ambiguous, shouldn't happen.
> > <
> > This brings to mind:: the application:::
> > <
> > CPUs try to achieve highest frequency of operation and pipeline
> > away logic delay problems--LDs are now 4 and 5 cycles rather than
> > 2 (MIPS R3000); because that is where performance is as there is
> > rarely enough parallelism to utilize more than a "few" cores.
> > <
> I have 3-cycle memory access.
>
> Early on, load/store was not pipelined (and would always take 3 clock
> cycles), but slow memory ops were not ideal for performance. I had
> extended the pipeline to 3 execute stages mostly as this allowed for
> pipelining both load/store and also integer multiply.
>
>
> If the pipeline were extended to 6 execute stages, this would also allow
> for things like pipelined double-precision ops, or single-precision
> multiply-accumulate.
>
> But, this would also require more complicated register forwarding, would
> make branch mispredict slower, etc, so didn't seem worthwhile. In all,
> it would likely end up hurting performance more than it would help.
>
>
> As can be noted, current pipeline is roughly:
> PF IF ID1 ID2 EX1 EX2 EX3 WB
> Or:
> PF IF ID RF EX1 EX2 EX3 WB
>
> Since ID2 doesn't actually decode anything, just fetches and forwards
> register values in preparation for EX1.
>
> From what I can gather, it seems a fair number of other RISC's had also
> ended up with a similar pipeline (somewhat more so than the 5-stage
> pipeline).
> > GPUs on the other hand, seem to be content to stay near 1 GHz
> > and just throw shader cores at the problem rather than fight for
> > frequency. Since GPUs process embarrassingly parallel applications
> > one can freely trade cores for frequency (and vice versa).
> > <
> > So, in GPUs, there are arithmetic designs can fully absorb the
> > delays of saturation, whereas in CPUs it is not so simple.
> > <merciful snip>
> For many use-cases, running at a lower clock-cycle and focusing more on
> shoveling stuff through the pipeline may make more sense than trying to
> run at a higher clock speed.
>
>
> As noted before, I could get 100MHz, but (sadly), it would mean a 1-wide
> RISC with fairly small L1 caches. Didn't really seem like a win, and I
> can't really make the RAM any faster.
>
>
> Though, it is very possible that programs like Doom and similar might do
> better with a 100MHz RISC than a 50MHz VLIW.
>
> [...]

I think BJX2 is doing very well if data access is only three cycles.

I think Thor is sitting at six-cycle data memory access; I$ access is single
cycle. Data: 1 cycle to load the memory request queue, 1 to pull from the
queue into the data cache, 2 to access the data cache, 1 to put the response
into a response FIFO, and 1 to unload the response back into the CPU. I think
there may also be an agen cycle happening too. There is probably at least
one cycle that could be eliminated, but eliminating it would improve
performance by only about 4% overall and would likely cost clock cycle time.
ATM writes go all the way through to memory and therefore take a
horrendous number of clock cycles, e.g. 30. Writes to some of the SoC
devices are much faster.

I really need a larger FPGA for my designs; any suggestions? I broke 500k
LUTs again and had to trim cores. I scrapped the wonderful register file
that could load four registers at a time when I realized it looked like a
16-read-port file: 25,000 LUTs. A four-port register file is used now, with
serial reads / writes for multi-register access: 2k LUTs. Same ISA,
implemented differently.


Re: Encoding saturating arithmetic

<b63a3eee-27f5-4e46-ada0-e6a05ccb601bn@googlegroups.com>

 by: luke.l...@gmail.com - Thu, 18 May 2023 12:40 UTC

On Wednesday, May 17, 2023 at 11:58:11 PM UTC+1, MitchAlsup wrote:

> > thus, because the Vector Length is in *elements* - not bit-sizes
> > not size-of-register, you get the same freedom as VVM.
> <
> Very good.

the simplicity that results when converting algorithms written
in a high-level language into SVP64 assembler is so profound
that we are running into a "disbelief and it-must-be-snake-oil"
problem rather than any kind of technical or design issue.

part of that disbelief stems from the fact that there *are* no
other ISAs that have a Horizontal-First Mode where the ISA
has its Vector Length in elements.

context (important for the next para): the "Vector" length
(which is not a Vector at all, it is a loop-construct that
*looks* like a Vector ISA) very deliberately allows
arbitrary unlimited-length "spill" into consecutive regs:

    for i in range(VL): GPR(RT+i) = GPR(RA+i) + GPR(RB+i)

i found *one* Academic paper that explores the concept
of allowing for a similar type of "spill":
https://www.tdx.cat/bitstream/handle/10803/674224/TCRL1de1.pdf

but this paper explores the concept of redefining the
*fixed-width Vector* registers (example) normally defined as
QTY 32of 256-bit to instead be (example) defined as
QTY 8of 1024-bit registers.

which is *not* the same - at all - as the Simple-V Loop
Construct, but unfortunately may be confused for such
because of the overlapping redefinition of the registers.

the key difference: that Academic Paper still defines
"Vector" registers as *fixed-bit-width* and consequently
forces the Programmer to *divide* that fixed-bit-width by
the number of desired elements.

i think one of the reasons why nobody has considered
this before is because they did not think through a solution
to the Register Hazard Management, in full, to its logical
conclusion. i heard of *one* other team (led by Peter Hsu,
designer of the MIPS R8000) back around 1994, who gave
serious consideration to the Vector-Loop Concept, but
because MIPS did not have an OoO Micro-Architecture at
the time they had to abandon it.

> > back-end hardware gets told "source is 32-bit, result is 64-bit"
> > and it just ends up putting results into twice as many
> > *scalar* regfile entries as there were source regfile entries.
> > big deal.
> <
> So the vector length is limited by register count.

partly/mostly/conceptually correct: the devil-in-the-details
is that the compiler (statically, compile-time) must set a
hard limit known as "MAXVL", which (statically) prohibits
over-run.

if the programmer (or, the compiler) forgets that it is
physically impossible to over-run the end of the register file,
then hardware kindly informs them of their forgetfulness by
throwing an Illegal Instruction trap :)

the reason i mention that is because we do run into
conflation issues between the meaning and purpose of
MAXVL and VL.

* MAXVL is a static (immediate-only) hard-limit on VL
* VL is *dynamic* (register operand settable) but it is
still *not* possible to directly set VL. VL *must* go
through the equation "RT = VL = MIN(RA, MAXVL)"

thus you can see, where you (we) say "Vector Length",
i apologise for over-simplifying, it was unintentionally
misleading.
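the rule above is small enough to state as a behavioural sketch (my own model of the equation quoted here, not the actual SVP64 pseudocode):

```python
# Behavioural sketch of the VL-setting rule described above: VL cannot
# be written directly, it must go through RT = VL = MIN(RA, MAXVL),
# where MAXVL is a static compile-time immediate.

def setvl(ra: int, maxvl: int) -> int:
    """Return the new VL (the real instruction also copies it into RT)."""
    if maxvl < 1:
        raise ValueError("MAXVL is a static hard limit; must be at least 1")
    return min(ra, maxvl)
```

so a request for 10 elements under MAXVL=4 yields VL=4, while a request for 3 yields VL=3: the hard limit only kicks in on over-run.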

bringing us back to the topic at hand (saturation), you can
see that it is just as critical in SVP64 to have Saturated
instructions as it is for *any* ISA, because if SVP64 did
not have Saturation then it would suffer from exactly
the same problem as all other SIMD/Vector ISAs:
double the register allocation required due to having
only power-of-two (actual) register sizes, you would
indeed end up trying to over-run the end of the regfile.

RISC-V RVV *claims* to solve this through having similar
"RT = VL = MIN(RA, MAXVL(bits))" behaviour
but in RVV *MAXVL is a hard Architectural limit*
and that limit is - you guessed it - *BIT* based.

thus in RVV programmers are forced in every single
algorithm without fail without exception to use a
Cray-style Vector-Loop Construct. including the branch.

loop:
    setvl r5, r3      # copy of VL gets put into RT (r5)
    vadd r0, r1, imm2
    subf r3, r5       # subtract VL (r5) from r3
    bz r3, loop       # test whether r3 has reached zero yet

if you have hardware that will allow VL to be set to 4,
and you have r3=4, the loop gets executed once.

if however on *completely different* hardware you have
enough Lanes to allow up to, say, 6 elements, then the
first time through the Lanes run 100% full, and on the
2nd loop they run 50% empty!
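a tiny model of the strip-mined loop makes the occupancy point concrete (a sketch; the element count of nine is chosen here purely to make the 100%/50% figures work out on six-lane hardware):

```python
# Model of a Cray-style strip-mined vector loop: each trip sets
# VL = min(remaining, VLMAX), processes VL elements, and repeats
# until no elements remain.

def strip_mine(n: int, vlmax: int):
    """Yield (vl, lane_occupancy) for each trip around the loop."""
    while n > 0:
        vl = min(n, vlmax)
        yield vl, vl / vlmax
        n -= vl

# 4 elements on VL<=4 hardware: one pass, lanes 100% full.
# 9 elements on VL<=6 hardware: first pass 100% full, second pass 50%.
```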

and - again bringing it back on-topic - if you want to do
Saturated Add (without actual Saturated Add instructions)
you *CANNOT* SIMD-allocate half-half, you cannot
over-allocate (physically impossible to overrun SRAM) - your
ONLY option is to work with a 32-bit allocation for the 16-bit
values, wasting half of the Vector Register SRAM in the process,
exactly like any SIMD ISA would have to [if it also did not
have Saturated Add]
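as a scalar illustration of the workaround being ruled out here: without a saturated-add instruction, 16-bit saturating arithmetic has to be computed in a wider element and clamped afterwards, which is exactly the allocate-32-bits-for-16-bit-values cost described above. a minimal sketch:

```python
# Signed 16-bit saturating add done "the expensive way": compute in a
# wider temporary, then clamp the result to the int16 range.

I16_MIN, I16_MAX = -0x8000, 0x7FFF

def sat_add_i16(a: int, b: int) -> int:
    wide = a + b                           # needs >16 bits of headroom
    return max(I16_MIN, min(I16_MAX, wide))
```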

that needs a little explanation: back at the RVV setvl
instruction i missed out that there is an extra parameter
(or an extra instruction, i forget which) that specifies
*how the Vector SRAM is to be subdivided*.

you can say "the elements are all 32-bit", or you can say
"the elements are all 64-bit" or you can say "the elements
are all 16-bit". it appears that it is down to the programmer
and/or compiler to modify the (global) size and work out where
the hell everything is... wait... i *think* this is intended to
"fix" this particular problem: vlmul "grouping":
https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#342-vector-register-grouping-vlmul20

but i am having a hard time reading it and understanding,
because i lack both experience and context with this version
of RVV.
> > the "price" for this much flexibility in SVP64 is that 24-bits
> > worth of Prefix.
> > * 2 for src-width
> > * 2 for dest-width
> > * 1+3+3 for source *and* destination Predicate Masks
> > * 9 for marking registers as Scalar/Vector and extending to 128
> > * 5 for "Modes" (Saturate, Reduction, Zeroing, Fail-First)
> <
> And VVM pays 0-bits {src-width, dst-width, predicate masks}
> but, realistically, I don't have any of those modes.

i'm not sure it's even conceptually possible to introduce
Twin-Predication onto even a Scalar ISA, for similar
reasons that a "mv.x" instruction in a Scalar ISA is
a no-no total-nightmare [mv.x: GPR(GPR(RT)) = RA,
or : GPR(RT) = GPR(GPR(RA))]

it would seem completely ridiculous to any programmer
to *appear* to waste *two* predicate bits for a *Scalar*
instruction: one for reading and one for writing! what
on earth would you need to do that in a Scalar ISA for??!!

so it literally makes no sense at all.... *until* (like
mv.x) you Vectorise it. then, "ohhhhh", it means
that the "Lanes" can cross over! you can have one
time round the loop use the read-operands
*of any previous loop*!

as in: you are literally linking the registers marked as "write"
(from a PREVIOUS loop) directly to the registers marked
as "read" from the CURRENT loop.

or - and this is the bit i find really hilarious - vice-versa!

and that's not limited to just the previous loop: the "connection"
is from *any* loop to any loop! and you do so by making
the Predicate Masks a multi-bit field (just as they are in
all other Vector ISAs). the following will connect Loop 1 (read)
with Loop *SEVEN* (write)

read-operands-mask: 0b0000001
write-operands-mask: 0b1000000
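a behavioural sketch of what that mask pairing means (my own model of twin-predication for a vectorised move, not the SVP64 spec pseudocode): the set bits of the read mask and write mask are paired up in order, so with the masks above, element 0 on the read side feeds element 6 on the write side:

```python
# Twin-predicated move: source elements selected by the read mask are
# paired, in order, with destination slots selected by the write mask.

def twin_pred_mv(regs, rs, rd, rmask, wmask, vl):
    srcs = [i for i in range(vl) if (rmask >> i) & 1]
    dsts = [j for j in range(vl) if (wmask >> j) & 1]
    out = list(regs)
    for i, j in zip(srcs, dsts):
        out[rd + j] = regs[rs + i]   # loop i (read) linked to loop j (write)
    return out

# rmask=0b0000001, wmask=0b1000000: loop 1 (read) feeds loop 7 (write).
```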

even that makes no sense until you also have scalar operands
that are *unpredicated* which for example increment a scalar
counter (it should be obvious that un-predicated operations
would continue to be executed on each loop-operation
regardless of what is in the two Predicates, and that it is
the interaction between *all three* that results in the
desired behaviour).

example: you have LD operations that are Read-Predicated,
where the inputs to the Effective Address are *not* predicated,
and you have similarly ST operations that are Write-Predicated
where the EA regs are *also* not Predicated.

in my mind it is pretty obvious that as a programming
paradigm this is so mind-meltingly complex (and fragile) that
to me it is unthinkable except as an academic exercise.
but you may have a different perspective and know of a
way to make this workable.

(in Simple-V i had to impose some extremely drastic limitations
and warnings to anyone attempting to use Twin-Predication
with Vertical-First Mode - the conceptual explicit equivalent
of VVM)

in Hardware-terms it is effectively a way to "Lane-cross" in the
internally- SIMD-ified Reservation Stations! which makes it
tantalisingly appealing (if it weren't for the programming
complexity that comes with it).

> 0×any reasonable number == 0

indeed :)

> > also VVM would solve the temporary-allocation problem by
> > eliding(?) the double-width temporary scalar registers used
> > within the loop entirely into Reservation Stations that never
> > actually hit the regfile at all.
> <
> Yes.


Re: Encoding saturating arithmetic

<e5e8be3e-c05d-43c2-9a19-22a1a852b9e8n@googlegroups.com>

 by: luke.l...@gmail.com - Thu, 18 May 2023 12:43 UTC

On Thursday, May 18, 2023 at 4:51:49 AM UTC+1, BGB wrote:
> On 5/17/2023 3:13 PM, MitchAlsup wrote:
> > Your application is mid-way between CPUs and GPUs.
> >
> Probably true, and it seems like I am getting properties that at times
> seem more GPU-like than CPU-like.
>
>
> Then, I am still off trying to get RISC-V code running on top of BJX2 as
> well.

did you by any chance Micro-code it? did you put in some
internal re-writing into BJX2 internal operations, in some
fashion? i would be interested to hear if you did so, and how.
or, if it is a shared Micro-coding back-end with two disparate
front-end ISAs.

l.

Re: Encoding saturating arithmetic

<e157bbc6-8c3a-4a68-a5c2-829cf42c6fa2n@googlegroups.com>

 by: luke.l...@gmail.com - Thu, 18 May 2023 12:55 UTC

On Thursday, May 18, 2023 at 10:08:36 AM UTC+1, robf...@gmail.com wrote:

> I really need a larger FPGA for my designs, any suggestions? I broke 500k
> LUTs again and had to trim cores.

yyeahh, million-LUT FPGAs are in the USD 10,000 and above
bracket, and they make the U.S. Military so nervous that their
access is usually restricted to very large Corporates (Intel,
AMD, Apple, Texas Instruments - companies that will take
a BXPA Munitions-Grade Classification seriously) and to
U.S. Academic institutions (likewise).

if you get a "no" when asking any Sales Rep about them, please
for your own peace of mind take that "no" as a "Hard no", ok?

if your design was Open Source or was exclusively Academic-related
and/or implemented the Power ISA then i could potentially help
point you in the right hint-of-a-direction (off-list only). but if it
isn't (FOSS, Academic, PowerISA) then the answer would be no,
even if you asked. sorry.

an alternative idea that you might like to consider is to get FPGAs
with high-speed inter-connect (some FPGAs have 25 gigabit SERDES
now), then develop - or find - a "Coherent Bus Inter-Connect" protocol,
and get two (or more) of the (smaller) FPGAs.

if you're genuinely interested to go that route i have been considering
putting in an NLnet Grant Request to cover it - but you would be
required to create the entirety of the high-speed inter-connect HDL
and associated documentation under FOSS Licenses as an inviolate
pre-condition of the work.

l.

Re: Encoding saturating arithmetic

<41q9M.259746$qpNc.185276@fx03.iad>

 by: Scott Lurndal - Thu, 18 May 2023 13:48 UTC

BGB <cr88192@gmail.com> writes:
>[...]
>
>I have 3-cycle memory access.

To L1? Virtually indexed?

Re: Encoding saturating arithmetic

<b126728b-3684-4a5b-a211-5a1f5f6badfbn@googlegroups.com>

 by: MitchAlsup - Thu, 18 May 2023 16:00 UTC

On Wednesday, May 17, 2023 at 10:51:49 PM UTC-5, BGB wrote:
> On 5/17/2023 3:13 PM, MitchAlsup wrote:
> > On Wednesday, May 17, 2023 at 1:07:37 PM UTC-5, BGB wrote:
> >> On 5/17/2023 10:51 AM, luke.l...@gmail.com wrote:
> >>> On Tuesday, May 16, 2023 at 7:09:05 PM UTC+1, BGB wrote:
> >>>> On 5/15/2023 8:46 PM, luke.l...@gmail.com wrote:
> >>>>> chapter 7. after seeing how out of control this can get
> >>>>> in the AndesSTAR DSP ISA i always feel uneasy whenever
> >>>>> i see explicit saturation opcodes added to an ISA that
> >>>>> only has 32-bit available for instruction format.
> >>>>>
> >>>> I can note that I still don't have any dedicated saturating ops, but
> >>>> this is partly for cost and timing concerns (and I haven't yet
> >>>> encountered a case where I "strongly needed" saturating ops).
> >>>
> >>> if you are doing Video Encode/Decode (try AV1 for example)
> >>> you'll need them to stand a chance of any kind of power-efficient
> >>> operation.
> >>>
> >> There are usually workarounds, say, using the SIMD ops as 2.14 rather
> >> than 0.16, and then clamping after the fact.
> >> Say: High 2 bits:
> >> 00: Value in range
> >> 01: Value out of range on positive side, clamp to 3FFF
> >> 11: Value out of range on negative side, clamp to 0000
> >> 10: Ambiguous, shouldn't happen.
> > <
> > This brings to mind:: the application:::
> > <
> > CPUs try to achieve highest frequency of operation and pipeline
> > away logic delay problems--LDs are now 4 and 5 cycles rather than
> > 2 (MIPS R3000); because that is where performance is as there is
> > rarely enough parallelism to utilize more than a "few" cores.
> > <
> I have 3-cycle memory access.
>
> Early on, load/store was not pipelined (and would always take 3 clock
> cycles), but slow memory ops were not ideal for performance. I had
> extended the pipeline to 3 execute stages mostly as this allowed for
> pipelining both load/store and also integer multiply.
>
>
> If the pipeline were extended to 6 execute stages, this would also allow
> for things like pipelined double-precision ops, or single-precision
> multiply-accumulate.
>
> But, this would also require more complicated register forwarding, would
> make branch mispredict slower, etc, so didn't seem worthwhile. In all,
> it would likely end up hurting performance more than it would help.
>
>
> As can be noted, current pipeline is roughly:
> PF IF ID1 ID2 EX1 EX2 EX3 WB
> Or:
> PF IF ID RF EX1 EX2 EX3 WB
>
> Since ID2 doesn't actually decode anything, just fetches and forwards
> register values in preparation for EX1.
>
> From what I can gather, it seems a fair number of other RISC's had also
> ended up with a similar pipeline (somewhat more so than the 5-stage
> pipeline).
<
The 5-stage "classical" RISC pipeline requires ½-cycle cache access
{both I and D}; by expanding to 7 stages, one can accommodate SRAMs
that take 1 full cycle to access. And by expanding to an 8-stage
pipeline, one can accommodate FP in the same pipeline. I term this::
<
Main side
Fetch-Parse-Decode-Execute-Cache-Align-Wait-Write.
FP side
Fetch-Parse-Decode-Exec1-Exec2-Exec3-Exec4-Write.
<
> > GPUs on the other hand, seem to be content to stay near 1 GHz
> > and just throw shader cores at the problem rather than fight for
> > frequency. Since GPUs process embarrassingly parallel applications
> > one can freely trade cores for frequency (and vice versa).
> > <
> > So, in GPUs, there are arithmetic designs that can fully absorb the
> > delays of saturation, whereas in CPUs it is not so simple.
> > <merciful snip>
<
> For many use-cases, running at a lower clock-cycle and focusing more on
> shoveling stuff through the pipeline may make more sense than trying to
> run at a higher clock speed.
>
>
> As noted before, I could get 100MHz, but (sadly), it would mean a 1-wide
> RISC with fairly small L1 caches. Didn't really seem like a win, and I
> can't really make the RAM any faster.
>
>
> Though, it is very possible that programs like Doom and similar might do
> better with a 100MHz RISC than a 50MHz VLIW.
>
> Things like "spin in a tight loop executing a relatively small number of
> serially dependent instructions" is something where a 100MHz 1-wide core
> has an obvious advantage over a 50MHz 3-wide core.
<

Re: Encoding saturating arithmetic

<u45l0h$asnb$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=32277&group=comp.arch#32277

 by: Marcus - Thu, 18 May 2023 16:50 UTC

On 2023-05-16, luke.l...@gmail.com wrote:
> On Sunday, December 4, 2022 at 10:49:11 AM UTC, Marcus wrote:
>
>> MRISC32 has saturating arithmetic:
>>
>> https://github.com/mrisc32/mrisc32/releases/latest/download/mrisc32-instruction-set-manual.pdf
>
> chapter 7. after seeing how out of control this can get
> in the AndesSTAR DSP ISA i always feel uneasy whenever
> i see explicit saturation opcodes added to an ISA that
> only has 32-bit available for instruction format.
>
>
>> The way I solved the size issue (8, 16, 32 bits in my case) is that I
>> have dedicated two bits of the instruction word for specifying the size.
>> See section 1.4 "Instruction encoding" (the "T" field).
>
> like that. room to expand to 64 later (even in a 32-bit ISA)
>
>> This implies packed SIMD in a 32-bit register (or a 32-bit element of a
>> vector register), as specified in chapter 4.
>
> i *really* don't understand why you would add fantastic Vector
> capability then irrevocably damage the ISA by adding PackedSIMD.
> if it was vec2/3/4 on *top* of the Vector capability (vec3 being
> the really important one as far as 3D is concerned) i would get it.
>
> PackedSIMD only works successfully where the data encountered
> is *exactly* matched to the ISA. vec2 for Left and Right Audio.
> vec3 for RGB. vec4 for ARGB or Quaternions XYZW.
>
> l.
>

So, first of all I am a novice when it comes to CPU and ISA design, so I
don't claim that I've made even close to perfect decisions... ;-)

With that said, the reasoning roughly went like this:

I wanted a way to *easily* saturate the memory interface (i.e. use it
to its full potential) when working with byte-sized elements. With
vector operations, I could only utilize the full memory bandwidth when
all 32 bits of the vector elements are loaded/stored. With byte-sized
vector load/store I only got 1/4th of the bandwidth, and I could not
figure out a simple way to quadruple vector register file write/read
traffic when doing byte-sized loads/stores.

I also realized that since I had a scalable solution for implementation
defined vector register sizes, there would be little harm in fixating
the "packed SIMD width" to 32 bits (unlike traditional packed SIMD
solutions, where you need to alter the ISA and change the
SIMD/register width every time you wish to increase parallelism).

In MRISC32, each packed SIMD unit is 1x32 bits, 2x16 bits or 4x8 bits,
for all eternity. (In MRISC64 the packed SIMD width would be doubled,
but that is another ISA and another story - no binary compatibility is
planned etc).

I have also noticed that the uint8x4_t type (i.e. vec4<byte>) can be
quite useful when working with ARGB, for instance, and it's also quite
convenient to be able to perform packed SIMD on both vector and scalar
registers.
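A scalar model of what one such 4x8-bit packed operation does within a 32-bit word (a hypothetical helper, not from the MRISC32 toolchain; lanes wrap modulo 256 here, i.e. no saturation):

```c
#include <stdint.h>

/* Scalar model of an MRISC32-style packed operation: a 4x8-bit add
 * inside a single 32-bit word. Each lane is masked off before and
 * after the add so no carries cross lane boundaries. */
static uint32_t add_u8x4(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i += 8) {
        uint32_t lane = (((a >> i) & 0xFFu) + ((b >> i) & 0xFFu)) & 0xFFu;
        r |= lane << i;
    }
    return r;
}
```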

Now, I am not 100% happy with the solution, but at least it's much nicer
to work with than ISAs such as SSE or NEON, and it's much more
future-proof.

/Marcus

Re: Encoding saturating arithmetic

<14e172b0-5b31-4f53-b4b5-a3f0b7d2a3ben@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=32278&group=comp.arch#32278

 by: luke.l...@gmail.com - Thu, 18 May 2023 17:07 UTC

On Thursday, May 18, 2023 at 5:52:18 PM UTC+1, Marcus wrote:

> So, first of all I am a novice when it comes to CPU and ISA design, so I
> don't claim that I've made even close to perfect decisions... ;-)

pfhh, you and me both :) only been at this 4 years.

> In MRISC32, each packed SIMD unit is 1x32 bits, 2x16 bits or 4x8 bits,
> for all eternity. (In MRISC64 the packed SIMD width would be doubled,
> but that is another ISA and another story - no binary compatibility is
> planned etc).

discuss under new comp.arch thread? (please just not a reply-to
with change-of-subject, google groups seriously creaking under
the load). your ISA: you start it?

l.

Re: Encoding saturating arithmetic

<u45m2d$asnb$2@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=32279&group=comp.arch#32279

 by: Marcus - Thu, 18 May 2023 17:08 UTC

On 2023-05-18, luke.l...@gmail.com wrote:
> On Thursday, May 18, 2023 at 10:08:36 AM UTC+1, robf...@gmail.com wrote:
>
>> I really need a larger FPGA for my designs, any suggestions? I broke 500k
>> LUTs again and had to trim cores.
>
> yyeahh, million-LUT FPGAs are in the USD 10,000 and above
> bracket, and they make the U.S. Military so nervous that their
> access is usually restricted to very large Corporates (Intel,
> AMD, Apple, Texas Instruments - companies that will take
> a BXPA Munitions-Grade Classification seriously) and to
> U.S. Academic institutions (likewise).
>
> if you get a "no" when asking any Sales Rep about them, please
> for your own peace of mind take that "no" as a "Hard no", ok?
>
> if your design was Open Source or was exclusively Academic-related
> and/or implemented the Power ISA then i could potentially help
> point you in the right hint-of-a-direction (off-list only). but if it
> isn't (FOSS, Academic, PowerISA) then the answer would be no,
> even if you asked. sorry.
>
> an alternative idea that you might like to consider is to get FPGAs
> with high-speed inter-connect (some FPGAs have 25 gigabit SERDES
> now), then develop - or find - a "Coherent Bus Inter-Connect" protocol,
> and get two (or more) of the (smaller) FPGAs.

When I did my MSc thesis work (late 90's) we used two Xilinx FPGAs in
tandem (goes looking... Virtex XCV400, pretty beefy at the time IIRC).

I guess one way to partition the problem would be to do memory I/O
and caches in one FPGA, and the CPU pipeline and execution units etc
in the other FPGA.

>
> if you're genuinely interested to go that route i have been considering
> putting in an NLnet Grant Request to cover it - but you would be
> required to create the entirety of the high-speed inter-connect HDL
> and associated documentation under FOSS Licenses as an inviolate
> pre-condition of the work.
>
> l.
>

Re: Encoding saturating arithmetic

<u45msk$asnb$3@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=32280&group=comp.arch#32280

 by: Marcus - Thu, 18 May 2023 17:22 UTC

On 2023-05-18, luke.l...@gmail.com wrote:
> On Thursday, May 18, 2023 at 5:52:18 PM UTC+1, Marcus wrote:
>
>> So, first of all I am a novice when it comes to CPU and ISA design, so I
>> don't claim that I've made even close to perfect decisions... ;-)
>
> pfhh, you and me both :) only been at this 4 years.
>
>> In MRISC32, each packed SIMD unit is 1x32 bits, 2x16 bits or 4x8 bits,
>> for all eternity. (In MRISC64 the packed SIMD width would be doubled,
>> but that is another ISA and another story - no binary compatibility is
>> planned etc).
>
> discuss under new comp.arch thread? (please just not a reply-to
> with change-of-subject, google groups seriously creaking under
> the load). your ISA: you start it?
>

So here's the reply-to... ;-)

I don't really have the time or energy to start that thread right now.

There are a few starting points:

* https://mrisc32.bitsnbites.eu/
* https://www.bitsnbites.eu/category/hardware-development/mrisc32/
* https://github.com/mrisc32

The MRISC64 ISA is not even started. I have a GitHub project where I
note down ideas every now and then:

* https://github.com/mbitsnbites/mrisc64

If you have any specific questions or discussion that you'd like to
have on the topic, feel free to start a new thread (I'll try to keep an
eye on comp.arch).

/Marcus

Re: Encoding saturating arithmetic

<u45o06$b96e$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=32281&group=comp.arch#32281

 by: BGB - Thu, 18 May 2023 17:41 UTC

On 5/18/2023 8:48 AM, Scott Lurndal wrote:
> BGB <cr88192@gmail.com> writes:
>> On 5/17/2023 3:13 PM, MitchAlsup wrote:
>>> On Wednesday, May 17, 2023 at 1:07:37 PM UTC-5, BGB wrote:
>>>> On 5/17/2023 10:51 AM, luke.l...@gmail.com wrote:
>>>>> On Tuesday, May 16, 2023 at 7:09:05 PM UTC+1, BGB wrote:
>>>>>> On 5/15/2023 8:46 PM, luke.l...@gmail.com wrote:
>>>>>>> chapter 7. after seeing how out of control this can get
>>>>>>> in the AndesSTAR DSP ISA i always feel uneasy whenever
>>>>>>> i see explicit saturation opcodes added to an ISA that
>>>>>>> only has 32-bit available for instruction format.
>>>>>>>
>>>>>> I can note that I still don't have any dedicated saturating ops, but
>>>>>> this is partly for cost and timing concerns (and I haven't yet
>>>>>> encountered a case where I "strongly needed" saturating ops).
>>>>>
>>>>> if you are doing Video Encode/Decode (try AV1 for example)
>>>>> you'll need them to stand a chance of any kind of power-efficient
>>>>> operation.
>>>>>
>>>> There are usually workarounds, say, using the SIMD ops as 2.14 rather
>>>> than 0.16, and then clamping after the fact.
>>>> Say: High 2 bits:
>>>> 00: Value in range
>>>> 01: Value out of range on positive side, clamp to 3FFF
>>>> 11: Value out of range on negative side, clamp to 0000
>>>> 10: Ambiguous, shouldn't happen.
>>> <
>>> This brings to mind:: the application:::
>>> <
>>> CPUs try to achieve highest frequency of operation and pipeline
>>> away logic delay problems--LDs are now 4 and 5 cycles rather than
>>> 2 (MIPS R3000); because that is where performance is as there is
>>> rarely enough parallelism to utilize more than a "few" cores.
>>> <
>>
>> I have 3-cycle memory access.
>
> To L1? Virtually indexed?
>

Yes, both.

L1 D$ access has a 3-cycle latency, 1-cycle throughput (so, one memory
access every clock-cycle in most cases).

The L1 is indexed based on virtual address, though in this case it is a
modulo-mapped direct-mapped cache, so as long as the virtual and
physical pages have the same alignment, the difference becomes
insignificant.

With a 16K L1 D$ and 16K page size, there is no difference.
Was using 32K for a while, but this makes timing more difficult.
A 64K L1 D$ basically explodes timing.
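Why the 16K-cache/16K-page case is exactly equivalent to physical indexing: the index bits lie entirely within the page offset, so translation cannot change them. A sketch assuming 32-byte lines (the line size is not stated above):

```c
#include <stdint.h>

/* A 16K virtually-indexed, direct-mapped L1 with 16K pages behaves as
 * if physically indexed: the index bits sit inside the page offset.
 * 32-byte line size is an assumption for illustration. */
#define LINE_BITS  5u                    /* 32-byte cache lines     */
#define CACHE_SIZE (16u * 1024u)         /* 16K direct-mapped L1 D$ */
#define PAGE_SIZE  (16u * 1024u)         /* 16K pages               */
#define NUM_LINES  (CACHE_SIZE >> LINE_BITS)

static inline uint32_t l1_index(uint32_t addr)
{
    return (addr >> LINE_BITS) & (NUM_LINES - 1u);
}
```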

Ideally, to support 32K and (possibly) 64K L1 caches, a 64K alignment is
recommended. However, strict 64K alignment and/or a 64K page size is
undesirable as it reduces memory efficiency (mostly in terms of the
amount of padding space needed for mmap()/VirtualAlloc() and large
object "malloc()", *).

I ended up going with 16K pages mostly as this significantly reduced TLB
miss rate without suffering the same adverse effects as 64K pages (and,
in my testing, there was very little difference between 16K and 64K in
terms of nominal TLB miss rate for a given size of TLB).

Meanwhile: 8K was merely intermediate between 4K and 16K (still fairly
high miss rate, but lower than 4K). 32K had basically similar miss rates
to 16K and 64K, but worse memory overhead properties than 16K.

*: There is an issue of basically how big of objects can be handled by
allocating a memory block within a larger shared memory chunk, and when
one effectively needs to invoke a "mmap()" call to allocate it in terms
of pages. With 64K pages, one either needs to set this limit fairly high
(in turn potentially wasting memory by the heap chunks being larger than
ideal), or waste memory by the page-alloc cases near these transition
points having a significant amount of the total object size just in the
"wasted" memory at the end of the final page.

Say, if you want to malloc 67K and get 128K, this is a waste. If you get
80K, this is less of a waste.
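The 67K example works out as follows once a request is rounded up to page granularity (a hypothetical helper, illustrating only the arithmetic):

```c
#include <stddef.h>

/* Requests that bypass the heap and go straight to page allocation
 * get rounded up to page granularity; bigger pages mean more waste
 * near the transition points. Assumes page_size is a power of two. */
static size_t round_up_to_page(size_t n, size_t page_size)
{
    return (n + page_size - 1) & ~(page_size - 1);
}
```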

As for timing:
L1: 3L/1T
L2: ~ 10 cycles per cache line average.
DRAM: +34 cycles per 64B / 512b, ~ 45 cycles per cache line average.

L1 RAM access (benchmarks, 50MHz):
~ 440 MB/s unidirectional Load/Store
~ 281 MB/s memcpy
L2:
~ 70 MB/s memcpy, 110 MB/s unidirectional.
DRAM:
~ 18 MB/s at present.

L2 and DRAM were faster before, but L2 and DRAM speeds were negatively
affected by going dual-core, and by "fixing" a timing issue with the
ring-bus (needed to add an extra delay cycle where the CPU core connects
to the outside world, but with dual core this adds an extra 2-cycles to
the total ring latency).

Also, the RAM is slower on my newer board due to the DDR3 chip having a
higher minimum CAS latency than the DDR2 chip, ... (Ideally, would be
using SERDES and running the RAM at a more proper clock speed, but...
alas...).

But, on the XC7A200T, I can use a 512K L2 cache, which partly
compensates for the slower RAM...

Where, with the ringbus design, L2 and DRAM performance depends a fair
bit on how long it takes memory requests and responses to make a
round-trip around the ring (although, there are a few "shortcut" paths
to reduce this latency at various points, such as allowing DRAM requests
to more tightly cycle around the L2 cache rather than take a full trip
around the ring every time, or messages may skip over the ROM and MMIO
areas if the request isn't directed to them, etc).

Theoretical hard limits:
64-bit Load: 400MB/s
128-bit Load: 800MB/s
64-bit Copy: 200MB/s
128-bit Copy: 400MB/s
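Those hard limits follow from one L1 access per cycle at 50 MHz times the access width; a copy needs both a load and a store, so it gets half the throughput. A small sanity check of the arithmetic (hypothetical helper):

```c
/* One access per cycle at `mhz` MHz moves `bytes_per_access` bytes;
 * a copy needs two accesses (load + store) per byte transferred. */
static unsigned mb_per_sec(unsigned mhz, unsigned bytes_per_access,
                           unsigned accesses_per_transfer)
{
    return mhz * bytes_per_access / accesses_per_transfer;
}
```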

For the L1 tests, I seem to be within a factor of 2 of the theoretical
hard-limits.

Most other tests are within a factor of 2 of the hard-limits.

Have noted that, curiously, L1 memcpy speeds at 50MHz are within a
factor of 3x of the numbers I can get from a laptop from 2003 (it only
gets ~ 800 MB/s when using memcpy() to copy small items in RAM).

Though, DRAM memcpy speed is around 8x slower...

Ironically, even a simple RasPi stomps all over the memory-speed numbers
from the laptop... (Despite the theoretically slower CPU).

For purely CPU based tasks, the laptop holds up fairly well.

For "general performance", the laptop is roughly matched with a RasPi2
or RasPi3... (and for some of my compression and codec tasks, the
RasPi's are ahead).

Though, for some intensive floating point tasks...

Yeah... I have beaten the laptop with my BJX2 core via the powers of
FP-SIMD (vs x87).

It seems like x87 is a pretty severe bottleneck even if the operations
themselves aren't too slow when taken in isolation.

Like, if one does a single 1-cycle FP-SIMD op at 50MHz...

Or 16 x87 ops, and each FADD/FMUL/etc takes multiple clock cycles...
This does not bode particularly well for x87...

However, if I use a slightly newer Vista era Core 2 based laptop, it is
no contest... GHz only slightly higher, but overall performance is
significantly faster.

Laptop does suffer though from its "Intel GMA" chip "totally sucking"...
Like, it has 3D hardware acceleration that is barely much faster than
using a software renderer (and very prone to breaking in stupid ways).

Like, despite being nearly a decade newer, about the fanciest 3D games
it could run at acceptable framerates on its GPU were Quake 2/3 and
Half-Life...

But, fast-ish CPU at least...

Re: Encoding saturating arithmetic

<u45orh$bbcu$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=32282&group=comp.arch#32282

 by: BGB - Thu, 18 May 2023 17:55 UTC

On 5/18/2023 7:43 AM, luke.l...@gmail.com wrote:
> On Thursday, May 18, 2023 at 4:51:49 AM UTC+1, BGB wrote:
>> On 5/17/2023 3:13 PM, MitchAlsup wrote:
>>> Your application is mid-way between CPUs and GPUs.
>>>
>> Probably true, and it seems like I am getting properties that at times
>> seem more GPU-like than CPU-like.
>>
>>
>> Then, I am still off trying to get RISC-V code running on top of BJX2 as
>> well.
>
> did you by any chance Micro-code it? did you put in some
> internal re-writing into BJX2 internal operations, in some
> fashion? i would be interested to hear if you did so, and how.
> or, if it is a shared Micro-coding back-end with two disparate
> front-end ISAs.
>

No micro-code, just an alternate decoder.
There is no micro-code in my core at all, rather direct-logic for
everything.

The pipeline design for the BJX2 core was able to accommodate RISC-V
with only minor alterations, e.g.:
  JAL needs to have a flexible link register
    (BJX2's BSR had used a fixed link register);
  RISC-V branches are relative to the base PC of the instruction
    (BJX2 branches are relative to the following instruction);
  RISC-V needs Compare-and-Branch
    (mostly just annoying, as this is more expensive; a fixed
    'compare with 0 and branch' would have been cheaper);
  ...
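The branch-offset difference can be made concrete (illustrative helpers, not code from either decoder):

```c
#include <stdint.h>

/* RISC-V branch/JAL offsets are relative to the PC of the branch
 * instruction itself; BJX2 offsets are relative to the instruction
 * that follows the branch. */
static uint64_t riscv_target(uint64_t pc, int64_t off)
{
    return pc + off;
}

static uint64_t bjx2_target(uint64_t pc, int insn_len, int64_t off)
{
    return pc + insn_len + off;
}
```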

The 'M' extension requires hardware divide and modulo, this was added
and back-ported as an optional feature to BJX2.

The 'A' extension requires Load-Op and Store-Op for some ALU ops and
similar, this was also back-ported to BJX2 as an 'LdOp' extension.

The 'F' and 'D' extensions are still not 1:1 at present.

For RISC-V, there needed to be some internal remapping of the register
space relative to BJX2.

There is also now an XG2RV mode, which uses a variant of BJX2's
encoding, but RISC-V's register numbering.

So, basically, two front-end decoders on a shared pipeline.

Would have been a lot harder for many other ISA designs though.

....

> l.

Re: Encoding saturating arithmetic

<37b7d936-4ee2-4939-93cd-347d9427f773n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=32283&group=comp.arch#32283

 by: MitchAlsup - Thu, 18 May 2023 17:56 UTC

On Thursday, May 18, 2023 at 11:52:18 AM UTC-5, Marcus wrote:
> On 2023-05-16, luke.l...@gmail.com wrote:
> > On Sunday, December 4, 2022 at 10:49:11 AM UTC, Marcus wrote:
>
>
> So, first of all I am a novice when it comes to CPU and ISA design, so I
> don't claim that I've made even close to perfect decisions... ;-)
>
> With that said, the reasoning roughly went like this:
>
> I wanted a way to *easily* saturate the memory interface (i.e. use it
> to its full potential) when working with byte-sized elements. With
<
A good starting point.
<
> vector operations, I could only utilize the full memory bandwidth when
> all 32 bits of the vector elements are loaded/stored. With byte-sized
> vector load/store I only got 1/4th of the bandwidth, and I could not
> figure out a simple way to quadruple vector register file write/read
> traffic when doing byte-sized loads/stores.
<
This is where multiple lanes are used to consume more bandwidth
when you know the memory reference pattern is "dense".
<
The vector alternative is to use gather/scatter memory references
and perform multiple AGENs per cycle--a much more costly alternative.
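The cost difference is visible even at the source level: a dense (unit-stride) reference computes its addresses as base + i, while a gather carries an independent index per element and therefore needs an address generation per lane. A rough illustrative sketch in C (hypothetical function names, not any particular ISA's primitives):

```c
#include <stdint.h>
#include <stddef.h>

/* Dense, unit-stride copy: successive addresses are base + i, so one
   AGEN can cover a whole cache line's worth of lanes. */
void dense_copy(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Gather: every element's address is base + idx[i], independent of the
   others, so each lane needs its own AGEN (and potentially its own
   cache/TLB access). */
void gather_copy(uint8_t *dst, const uint8_t *src,
                 const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[idx[i]];
}
```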
>
> I also realized that since I had a scalable solution for implementation
> defined vector register sizes, there would be little harm in fixing
> the "packed SIMD width" at 32 bits (unlike traditional packed SIMD
> solutions where you need to alter the ISA and change the SIMD/register
> width every time you wish to increase parallelism).
>
> In MRISC32, each packed SIMD unit is 1x32 bits, 2x16 bits or 4x8 bits,
> for all eternity. (In MRISC64 the packed SIMD width would be doubled,
> but that is another ISA and another story - no binary compatibility is
> planned etc).
<
Like you, I prefer that SIMD-style calculations use natural register
widths. Unlike you, I left SIMD out of my ISA and found (what I consider)
a better alternative than {calculations}×{widths}×{special-properties}
that accompany SIMD. Since memory references come with {widths}
(and signed, unsigned semantics), and calculations are self describing,
AND you have predication in the ISA, then synthesizing SIMD using VVM
is actually straightforward.
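To make the argument concrete: in a plain scalar loop like the one below, the uint8_t loads/stores already carry the element width, the add is self-describing, and predication covers the loop tail. A VVM-style implementation can therefore vectorize it at whatever SIMD width the hardware has, with no packed-SIMD opcodes in the ISA. This is illustrative C, not My 66000 code:

```c
#include <stdint.h>
#include <stddef.h>

/* Saturating unsigned byte add, written as an ordinary scalar loop.
   The element width comes from the uint8_t memory references, not
   from a dedicated SIMD opcode. */
void sat_add_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned s = (unsigned)a[i] + (unsigned)b[i];
        dst[i] = (uint8_t)(s > 255 ? 255 : s);  /* clamp to 255 */
    }
}
```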
<
This, then, gives the HW the freedom to implement the SIMD width appropriate
for that implementation, and preserves code-compatibility across all SIMD
widths and across all implementations.
<
It also eliminates 1280 = ({16}×{4})×{4}×{5} instructions from the ISA. (More
if you support 8-bit and 16-bit FP in SIMD.)
<
Sooner or later the R in RISC should stand for "reduced".
It is my contention that any ISA with more than 200-ish instructions
ceases to be RISC.
<
I don't know of an architecture with SIMD instructions that fits under 200
total instructions.
>
> I have also noticed that the uint8x4_t type (i.e. vec4<byte>) can be
> quite useful when working with ARGB, for instance, and it's also quite
> convenient to be able to perform packed SIMD on both vector and scalar
> registers.
>
> Now, I am not 100% happy with the solution, but at least it's much nicer
> to work with than ISAs such as SSE or NEON, and it's much more
> future-proof.
>
> /Marcus
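For concreteness, the kind of operation Marcus's uint8x4_t type covers, per-byte saturating arithmetic on a packed 32-bit ARGB word, can be spelled out lane by lane in portable C. A single packed instruction would perform all four lanes at once; this sketch only shows what it computes:

```c
#include <stdint.h>

/* Per-lane saturating add over a packed 4x8-bit (ARGB-style) word.
   Each 8-bit lane is extracted, added, clamped to 0xFF, and repacked. */
uint32_t argb_sat_add(uint32_t x, uint32_t y)
{
    uint32_t r = 0;
    for (int sh = 0; sh < 32; sh += 8) {
        unsigned s = ((x >> sh) & 0xFF) + ((y >> sh) & 0xFF);
        if (s > 0xFF)
            s = 0xFF;                 /* saturate this lane */
        r |= (uint32_t)s << sh;
    }
    return r;
}
```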
