devel / comp.arch / Re: "Mini" tags to reduce the number of op codes

Re: "Mini" tags to reduce the number of op codes

<uv9ahu$1r74h$1@dont-email.me>
https://news.novabbs.org/devel/article-flat.php?id=38274&group=comp.arch#38274
 by: BGB - Thu, 11 Apr 2024 18:35 UTC

On 4/11/2024 6:13 AM, Michael S wrote:
> On Wed, 10 Apr 2024 23:30:02 +0000
> mitchalsup@aol.com (MitchAlsup1) wrote:
>
>> Scott Lurndal wrote:
>>
>>> mitchalsup@aol.com (MitchAlsup1) writes:
>>>> BGB wrote:
>>>>
>>>>
>>>> In My 66000 case, the constant is the word following the
>>>> instruction. Easy to find, easy to access, no register pollution,
>>>> no DCache pollution.
>>
>>> It does occupy some icache space, however; have you boosted the
>>> icache size to compensate?
>>
>> The space occupied in the ICache is freed up from being in the DCache
>> so the overall hit rate goes up !! At typical sizes, ICache miss rate
>> is about ¼ the miss rate of DCache.
>>
>> Besides:: if you had to LD the constant from memory, you use a LD
>> instruction and 1 or 2 words in DCache, while consuming a GPR. So,
>> overall, it takes fewer cycles, fewer GPRs, and fewer instructions.
>>
>> Alternatively:: if you paste constants together (LUI, AUPIC) you have
>> no direct route to either 64-bit constants or 64-bit address spaces.
>>
>> It looks to be a win-win !!
>
> Win-win under constraints of Load-Store Arch. Otherwise, it depends.
>

FWIW:
The LDSH / SHORI mechanism does provide a way to get 64-bit constants,
and needs less encoding space than the LUI route.

MOV Imm16, Rn
SHORI Imm16, Rn
SHORI Imm16, Rn
SHORI Imm16, Rn

Granted, if each is a 1-cycle instruction, this still takes 4 clock cycles.
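
As a rough C model of the sequence above (assuming SHORI keeps its
SHmedia-style semantics, Rn = (Rn << 16) | Imm16; illustrative only):

  #include <stdint.h>

  /* MOV Imm16, Rn  : load a zero-extended 16-bit immediate              */
  /* SHORI Imm16, Rn: shift Rn left 16 and OR in the next 16-bit chunk   */
  static uint64_t mov_imm16(uint16_t imm)          { return imm; }
  static uint64_t shori(uint64_t rn, uint16_t imm) { return (rn << 16) | imm; }

  /* Building 0x0123456789ABCDEF, most-significant chunk first: */
  uint64_t build_const64(void)
  {
      uint64_t rn = mov_imm16(0x0123);   /* MOV   0x0123, Rn */
      rn = shori(rn, 0x4567);            /* SHORI 0x4567, Rn */
      rn = shori(rn, 0x89AB);            /* SHORI 0x89AB, Rn */
      rn = shori(rn, 0xCDEF);            /* SHORI 0xCDEF, Rn -> 0x0123456789ABCDEF */
      return rn;
  }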

An encoding that can MOV a 64-bit constant in 96 bits (12 bytes) and
1 cycle would be preferable....

In misc news:

Some compiler fiddling has now dropped the ".text" overhead (vs RV64G)
from 10% to 5%.

This was mostly in the form of adding dependency tracking logic to ASM
code (albeit in a form where it needs to use ".global" and ".extern"
statements for things to work correctly), and no longer giving it a free
pass.

This in turn allowed it to effectively cull some parts of the dynamic
typesystem runtime and a bunch of the Binary128 support code (shaving
roughly 14K off of the Doom build).

This does have a non-zero source-level impact (mostly in the form of
requiring ".global" and ".extern" lines to be added to the ASM code in
some cases where they were absent).
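
The cull itself is basically just a reachability walk over the symbol
graph once the ".global"/".extern" edges are known; a generic sketch
(data structures and names are illustrative, not BGBCC's actual ones):

  #include <stdbool.h>
  #include <stddef.h>

  /* Each symbol records what it references (the ".extern" edges); the
     ".global" entry points act as roots. */
  typedef struct sym {
      const char   *name;
      struct sym  **refs;    /* symbols this one references */
      size_t        nrefs;
      bool          keep;
  } sym;

  static void mark(sym *s)
  {
      if (!s || s->keep)
          return;
      s->keep = true;
      for (size_t i = 0; i < s->nrefs; i++)
          mark(s->refs[i]);
  }

  /* Mark everything reachable from the roots; whatever is left unmarked
     (e.g. unused runtime helpers) can be dropped from ".text". */
  void cull_unreachable(sym **roots, size_t nroots)
  {
      for (size_t i = 0; i < nroots; i++)
          mark(roots[i]);
  }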

Looks like a fair chunk of the dynamic types runtime is still present
though, which appears to be culled in the GCC build (since GCC doesn't
use the dynamic typesystem at all). Theoretically, Doom should not need
it, as Doom is entirely "plain old C".

Main part that ended up culled with this change was seemingly most of
the code for ex-nihilo objects and similar (which does not seem to be
reachable from any of the Doom code).

There is a printf extension for printing variant types, but this is
still present in the RV64G build (this would mostly include code needed
for the "toString" operation). I guess, one could debate whether printf
actually needs support for variant types (as can be noted, most normal C
code will not use it).

Though, I guess one option could be to modify it to call toString via a
function pointer which is only set if other parts of the dynamic
typesystem are initialized (could potentially save several kB off the
size of the binary, it looks like). Might break stuff though if one tries
to printf a variant but had not used any types much beyond fixnum and
flonum, which would not have triggered the typesystem to initialize itself.
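
A minimal sketch of that option (names are made up, not the actual
runtime API): the printf extension goes through a hook pointer that
stays NULL unless the typesystem initializes, with a harmless fallback
otherwise:

  #include <stddef.h>

  typedef unsigned long long variant_t;                 /* placeholder type */

  static const char *(*variant_tostring_hook)(variant_t) = NULL;

  /* Called from the dynamic typesystem's init path (if it ever runs). */
  void variant_set_tostring(const char *(*fn)(variant_t))
  {
      variant_tostring_hook = fn;
  }

  /* Used by the printf extension when it sees a variant argument. */
  const char *variant_tostring(variant_t v)
  {
      if (variant_tostring_hook)
          return variant_tostring_hook(v);
      /* Typesystem never initialized: degrade gracefully rather than
         dragging the whole runtime into the binary. */
      return "#<variant>";
  }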

Probably doesn't matter too much, as this code is not likely a factor in
the delta between the ISAs.

Note that if the size of Doom's ".text" section dropped by another 15K,
it would reach parity with the RV64G build (which was around 290K in the
relevant build ATM; goal being to keep the code fairly close to parity
in this case, with the differences mostly allowed for ISA specific stuff).

Though, this is ignoring that roughly 11K of this delta are Jumbo
prefixes (so the delta in instruction count is now roughly 1.3% at the
moment); and RV64G has an additional 24K in its ".rodata" section
(beyond what could be accounted for in string literals and similar).

So, in terms of text+rodata (+strtab *), my stuff is smaller at the moment.

*: Where GCC rolls its string literals into '.rodata', vs BGBCC having a
dedicated section for string literals.

....

Re: "Mini" tags to reduce the number of op codes

<0b785ebc54c76e3a10316904c3febba5@www.novabbs.org>
https://news.novabbs.org/devel/article-flat.php?id=38275&group=comp.arch#38275
 by: MitchAlsup1 - Thu, 11 Apr 2024 18:46 UTC

BGB wrote:

> On 4/11/2024 6:13 AM, Michael S wrote:
>> On Wed, 10 Apr 2024 23:30:02 +0000
>> mitchalsup@aol.com (MitchAlsup1) wrote:
>>
>>>
>>>> It does occupy some icache space, however; have you boosted the
>>>> icache size to compensate?
>>>
>>> The space occupied in the ICache is freed up from being in the DCache
>>> so the overall hit rate goes up !! At typical sizes, ICache miss rate
>>> is about ¼ the miss rate of DCache.
>>>
>>> Besides:: if you had to LD the constant from memory, you use a LD
>>> instruction and 1 or 2 words in DCache, while consuming a GPR. So,
>>> overall, it takes fewer cycles, fewer GPRs, and fewer instructions.
>>>
>>> Alternatively:: if you paste constants together (LUI, AUPIC) you have
>>> no direct route to either 64-bit constants or 64-bit address spaces.
>>>
>>> It looks to be a win-win !!
>>
>> Win-win under constraints of Load-Store Arch. Otherwise, it depends.

Never seen a LD-OP architecture where the inbound memory can be in the
Rs1 position of the instruction.

>>

> FWIW:
> The LDSH / SHORI mechanism does provide a way to get 64-bit constants,
> and needs less encoding space than the LUI route.

> MOV Imm16. Rn
> SHORI Imm16, Rn
> SHORI Imm16, Rn
> SHORI Imm16, Rn

> Granted, if each is a 1-cycle instruction, this still takes 4 clock cycles.

As compared to::

CALK Rd,Rs1,#imm64

Which takes 3 words (12 bytes) and executes in CALK cycles, the loading
of the constant is free !! (0 cycles) !! {{The above example uses at least
5 cycles to use the loaded/built constant.}}
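
Purely as an illustration (the actual My 66000 encoding details are not
spelled out here), pulling a trailing 64-bit immediate out of a
32-bit-word instruction stream is as simple as:

  #include <stdint.h>
  #include <stddef.h>

  /* Assuming a little-endian stream of 32-bit instruction words, where an
     instruction flagged as carrying #imm64 is followed by two constant words. */
  uint64_t fetch_trailing_imm64(const uint32_t *stream, size_t pc)
  {
      uint32_t lo = stream[pc + 1];
      uint32_t hi = stream[pc + 2];
      return ((uint64_t)hi << 32) | lo;   /* no load, no register consumed */
  }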

> An encoding that can MOV a 64-bit constant in 96-bits (12 bytes) and
> 1-cycle, is preferable....

A consuming instruction where you don't even use a register is better
still !!

Re: "Mini" tags to reduce the number of op codes

<uv9i0i$1srig$1@dont-email.me>
https://news.novabbs.org/devel/article-flat.php?id=38276&group=comp.arch#38276
 by: BGB-Alt - Thu, 11 Apr 2024 20:42 UTC

On 4/11/2024 1:46 PM, MitchAlsup1 wrote:
> BGB wrote:
>
>> On 4/11/2024 6:13 AM, Michael S wrote:
>>> On Wed, 10 Apr 2024 23:30:02 +0000
>>> mitchalsup@aol.com (MitchAlsup1) wrote:
>>>
>>>>
>>>>> It does occupy some icache space, however; have you boosted the
>>>>> icache size to compensate?
>>>>
>>>> The space occupied in the ICache is freed up from being in the DCache
>>>> so the overall hit rate goes up !! At typical sizes, ICache miss rate
>>>> is about ¼ the miss rate of DCache.
>>>>
>>>> Besides:: if you had to LD the constant from memory, you use a LD
>>>> instruction and 1 or 2 words in DCache, while consuming a GPR. So,
>>>> overall, it takes fewer cycles, fewer GPRs, and fewer instructions.
>>>>
>>>> Alternatively:: if you paste constants together (LUI, AUPIC) you have
>>>> no direct route to either 64-bit constants or 64-bit address spaces.
>>>>
>>>> It looks to be a win-win !!
>>>
>>> Win-win under constraints of Load-Store Arch. Otherwise, it depends.
>
> Never seen a LD-OP architecture where the inbound memory can be in the
> Rs1 position of the instruction.
>
>>>
>
>> FWIW:
>> The LDSH / SHORI mechanism does provide a way to get 64-bit constants,
>> and needs less encoding space than the LUI route.
>
>>    MOV Imm16. Rn
>>    SHORI Imm16, Rn
>>    SHORI Imm16, Rn
>>    SHORI Imm16, Rn
>
>> Granted, if each is a 1-cycle instruction, this still takes 4 clock
>> cycles.
>
> As compared to::
>
>     CALK   Rd,Rs1,#imm64
>
> Which takes 3 words (12 bytes) and executes in CALK cycles, the loading
> of the constant is free !! (0 cycles) !! {{The above example uses at least
> 5 cycles to use the loaded/built constant.}}
>

The main reason one might want SHORI is that it can fit into a
fixed-length 32-bit encoding. It also could technically be retrofitted onto
RISC-V without any significant change, unlike some other options (as
noted, I don't argue for adding Jumbo prefixes to RV on the basis
that there is no real viable way to add them to RV, *).

Sadly, the closest thing to a viable option for RV would be to add the
SHORI instruction and optionally pattern-match it in the fetch/decode.

Or, say:
LUI Xn, Imm20
ADD Xn, Xn, Imm12
SHORI Xn, Imm16
SHORI Xn, Imm16

Then, combine LUI+ADD into a 32-bit load in the decoder (though probably
only if the Imm12 is positive), and 2x SHORI into a combined
"Xn=(Xn<<32)|Imm32" operation.

This could potentially get it down to 2 clock cycles.
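
In C terms, the value this fused sequence builds would be something like
(sketch only):

  #include <stdint.h>

  /* LUI+ADD collapse to one 32-bit constant load (simplest when imm12 is
     non-negative, so there is no borrow into the LUI part), and the two
     SHORIs collapse to Xn = (Xn << 32) | Imm32. */
  uint64_t build_const_rv64(uint32_t imm20, int32_t imm12,
                            uint16_t imm16_hi, uint16_t imm16_lo)
  {
      int64_t  hi32 = (int64_t)(int32_t)(imm20 << 12) + imm12;  /* LUI + ADD */
      uint64_t xn   = (uint64_t)hi32;
      xn = (xn << 16) | imm16_hi;     /* SHORI #1 */
      xn = (xn << 16) | imm16_lo;     /* SHORI #2 */
      return xn;
  }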

*: To add a jumbo prefix, one needs an encoding that:
Uses up a really big chunk of encoding space;
Is otherwise illegal and unused.
RISC-V doesn't have anything here.

Ironically, in XG2 mode, I still have 28x 24-bit chunks of encoding
space that aren't yet used for anything, but aren't usable as normal
encoding space mostly because if I put instructions in there (with the
existing encoding schemes), I couldn't use all the registers (and they
would not have predication or similar either). Annoyingly, the only
types of encodings that would fit in there at present are 2RI Imm16 ops
or similar (or maybe 3R 128-bit SIMD ops, where these ops only use
encodings for R0..R31 anyways, interpreting the LSB of the register
field as encoding R32..R63).

Though, 14x of these spaces would likely be alternate forms of Jumbo
prefix (with another 14 in unconditional-scalar-op land). No immediate
need to re-add an equivalent of the 40x2 encoding (from Baseline mode),
as most of what 40x2 addressed can be encoded natively in XG2 Mode.

Technically, I also have 2 unused bits in the Imm16 ops in XG2 Mode.
I "could", in theory, use them to extend the:
MOV Imm17s, Rn
case to:
MOV Imm19s, Rn
Though, the other option is to leave them reserved in case I later want
more Imm16 ops.

For now, current plan is to leave this stuff as reserved.

>> An encoding that can MOV a 64-bit constant in 96-bits (12 bytes) and
>> 1-cycle, is preferable....
>
> A consuming instruction where you don't even use a register is better
> still !!

Can be done, but thus far only for 33-bit immediate values. Luckily, Imm33s
seems to address around 99% of uses (for normal ALU ops and similar).
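
(For reference, "fits in Imm33s" just means the value survives
sign-extension from bit 32; a trivial check in C:)

  #include <stdbool.h>
  #include <stdint.h>

  /* True if v is representable as a sign-extended 33-bit immediate,
     i.e. v is in [-2^32, 2^32 - 1]. */
  static bool fits_imm33s(int64_t v)
  {
      return v >= -((int64_t)1 << 32) && v <= ((int64_t)1 << 32) - 1;
  }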

I had considered allowing an Imm57s case for SIMD immediates (4x S.E5.F8
or 2x S.E8.F19), which would have indirectly provided an Imm57s case more
generally. By itself though, the difference doesn't seem enough to justify
the cost.
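
A sketch of what expanding one such lane might look like, assuming the
S.E5.F8 lane is packed sign-high with a binary16-style bias-15 exponent
(the real layout may differ; zero/Inf/NaN/subnormals are ignored here):

  #include <stdint.h>
  #include <string.h>

  static float expand_s_e5_f8(uint16_t lane)   /* low 14 bits used */
  {
      uint32_t s = (lane >> 13) & 0x1;
      uint32_t e = (lane >> 8)  & 0x1F;
      uint32_t f =  lane        & 0xFF;
      /* re-bias exponent from 15 to 127, pad the 8-bit fraction to 23 bits */
      uint32_t bits = (s << 31) | ((e + (127 - 15)) << 23) | (f << 15);
      float out;
      memcpy(&out, &bits, sizeof out);
      return out;
  }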

Don't have enough bits in the encoding scheme to pull off a 3RI Imm64 in
12 bytes (and allowing a 16-byte encoding would have too steep of a cost
increase to be worthwhile).

So, alas...

Re: "Mini" tags to reduce the number of op codes

<uv9kh2$1tcks$1@dont-email.me>
https://news.novabbs.org/devel/article-flat.php?id=38277&group=comp.arch#38277
 by: BGB-Alt - Thu, 11 Apr 2024 21:25 UTC

On 4/11/2024 9:30 AM, Scott Lurndal wrote:
> "Paul A. Clayton" <paaronclayton@gmail.com> writes:
>> On 4/9/24 8:28 PM, MitchAlsup1 wrote:
>>> BGB-Alt wrote:
>> [snip]
>>>> Things like memcpy/memmove/memset/etc, are function calls in
>>>> cases when not directly transformed into register load/store
>>>> sequences.
>>>
>>> My 66000 does not convert them into LD-ST sequences, MM is a
>>> single instruction.
>>
>> I wonder if it would be useful to have an immediate count form of
>> memory move. Copying fixed-size structures would be able to use an
>> immediate. Aside from not having to load an immediate for such
>> cases, there might be microarchitectural benefits to using a
>> constant. Since fixed-sized copies would likely be limited to
>> smaller regions (with the possible exception of 8 MiB page copies)
>> and the overhead of loading a constant for large sizes would be
>> tiny, only providing a 16-bit immediate form might be reasonable.
>
> It seems to me that an offloaded DMA engine would be a far
> better way to do memmove (over some threshhold, perhaps a
> cache line) without trashing the caches. Likewise memset.
>

Probably.
One could argue that setting up a DMA'ed memmove would likely be
expensive enough to make it impractical for small copies (the category
where I am using inline Ld/St sequences or slides).

And, larger copies (where it is most likely to bring benefit) at present
mostly seem to be bus/memory bound.

Sort of reminds me of the thing with the external rasterizer module:
the module itself draws stuff quickly, but setting it up is still
expensive enough to limit its benefit. So the main benefit it
could bring is seemingly just using it to pull off multi-textured
lightmap rendering, which in this case can run at similar speeds to
vertex lighting (lightmapped rendering being a somewhat slower option
for the software rasterizer).

Well, along with me recently realizing a trick to mimic the look of
trilinear filtering without increasing the number of texture fetches
(mostly by distorting the interpolation coords, *). This trick could
potentially be added to the rasterizer module.

*: Traditional bilinear needs 4 texel fetches and 3 lerps (or, a poor
man's approximation with 3 fetches and 2 lerps). Traditional trilinear
needs 8 fetches and 7 lerps. The "cheap trick" version needs only the
same as bilinear.
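
For reference, the standard bilinear case being counted here (4 fetches,
3 lerps), in simplified C with texture addressing details omitted (the
"texel" type and helpers are illustrative):

  typedef struct { float r, g, b, a; } texel;

  static texel lerp(texel a, texel b, float t)
  {
      texel o = { a.r + (b.r - a.r) * t, a.g + (b.g - a.g) * t,
                  a.b + (b.b - a.b) * t, a.a + (b.a - a.a) * t };
      return o;
  }

  /* 4 texel fetches + 3 lerps; tex is a w-wide raster, (u,v) in texel units. */
  static texel bilinear(const texel *tex, int w, float u, float v)
  {
      int   x0 = (int)u,  y0 = (int)v;
      float fu = u - x0,  fv = v - y0;
      texel t00 = tex[y0 * w + x0],       t10 = tex[y0 * w + x0 + 1];
      texel t01 = tex[(y0 + 1) * w + x0], t11 = tex[(y0 + 1) * w + x0 + 1];
      return lerp(lerp(t00, t10, fu), lerp(t01, t11, fu), fv);
  }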

One thing that is still needed is a good, fast, and semi-accurate way to
pull off the Z=1.0/Z' calculation, as needed for perspective-correct
rasterization (affine requires subdivision, which adds cost to the
front-end, and interpolating Z directly adds significant distortion for
geometry near the near plane).
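
One standard "semi-accurate" route (not necessarily what would go in
hardware here) is a Newton-Raphson reciprocal; a cheap seed plus two
refinement steps already lands around 1e-5 relative error:

  #include <math.h>

  /* Newton-Raphson reciprocal for z > 0 (e.g. a depth value): each step
     y <- y*(2 - d*y) squares the relative error. The seed is the classic
     48/17 - 32/17*d linear fit over d in [0.5, 1), giving error < 1/17. */
  static float approx_recip(float z)
  {
      int   e;
      float d = frexpf(z, &e);          /* z = d * 2^e, with d in [0.5, 1) */
      float y = 48.0f/17.0f - (32.0f/17.0f) * d;
      y = y * (2.0f - d * y);           /* ~3.5e-3 relative error */
      y = y * (2.0f - d * y);           /* ~1.2e-5 relative error */
      return ldexpf(y, -e);             /* 1/z = (1/d) * 2^-e */
  }

Each extra iteration squares the error, so a third step would reach
roughly full single precision.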

Granted, this would almost seem to create a need for an OpenGL
implementation designed around the assumption of a hardware rasterizer
module rather than software span drawing.

Rasterizer module also has its own caching, where it sometimes may be
needed to signal it to perform a cache flush (such as when updating the
contents of a texture, or needing to access the framebuffer for some
other reason, ...).

Potentially, the module could be used to copy/transform images in a
framebuffer (such as for GUI rendering), but would need to be somewhat
generalized for this (such as supporting using non-power-of-2
raster-images as textures).

Though, another possibility could be adding a dedicated DMA module, or
DMA+Image module, or glue dedicated DMA and Raster-Copy functionality
onto the rasterizer module (as a separate thing from its normal "walk
edges and blend pixels" functionality).

>
>>
>>>> Did end up with an intermediate "memcpy slide", which can handle
>>>> medium size memcpy and memset style operations by branching into
>>>> a slide.
>>>
>>> MMs and MSs that do not cross page boundaries are ATOMIC. The
>>> entire system
>>> sees only the before or only the after state and nothing in
>>> between.
>
> One might wonder how that atomicity is guaranteed in a
> SMP processor...
>

Dunno there.

My stuff doesn't guarantee atomicity in general.

The only way to ensure that both parties agree on the contents of memory
is for both to flush their L1 caches or similar.

Or use "No Cache" memory accesses, which is basically implemented as the
L1 cache auto-flushing the line as soon as the request finishes; for
good effect one also needs to add a few NOPs after the memory access to
be sure the L1 has a chance to auto-flush it. Though, another
possibility could be to add dedicated non-caching memory access
instructions.

Re: "Mini" tags to reduce the number of op codes

<f4d64e33b721ff6c5bd37f01f2705316@www.novabbs.org>
https://news.novabbs.org/devel/article-flat.php?id=38278&group=comp.arch#38278
 by: MitchAlsup1 - Thu, 11 Apr 2024 23:06 UTC

BGB-Alt wrote:

> On 4/11/2024 1:46 PM, MitchAlsup1 wrote:
>> BGB wrote:
>>
>>>>
>>>> Win-win under constraints of Load-Store Arch. Otherwise, it depends.
>>
>> Never seen a LD-OP architecture where the inbound memory can be in the
>> Rs1 position of the instruction.
>>
>>>>
>>
>>> FWIW:
>>> The LDSH / SHORI mechanism does provide a way to get 64-bit constants,
>>> and needs less encoding space than the LUI route.
>>
>>>    MOV Imm16. Rn
>>>    SHORI Imm16, Rn
>>>    SHORI Imm16, Rn
>>>    SHORI Imm16, Rn
>>
>>> Granted, if each is a 1-cycle instruction, this still takes 4 clock
>>> cycles.
>>
>> As compared to::
>>
>>     CALK   Rd,Rs1,#imm64
>>
>> Which takes 3 words (12 bytes) and executes in CALK cycles, the loading
>> of the constant is free !! (0 cycles) !! {{The above example uses at least
>> 5 cycles to use the loaded/built constant.}}
>>

> The main reason one might want SHORI is that it can fit into a
> fixed-length 32-bit encoding.

While 32-bit encoding is RISC mantra, it has NOT been shown to be best,
just simplest. Then, once you start widening the microarchitecture, it
is better to fetch wider than you decode-issue so that you suffer least
from boundary conditions. Once you start fetching wide OR have wide
decode-issue, you have ALL the infrastructure to do variable length
instructions. Thus, the complaint that VLE is hard has already been
eradicated.

> Also technically could be retrofitted onto
> RISC-V without any significant change, unlike some other options (as
> noted, I don't argue for adding Jumbo prefixes to RV under the basis
> that there is no real viable way to add them to RV, *).

The issue is that once you do VLE, RISC-V's ISA is no longer helping you
get the job done, especially when you have to execute 40% more instructions.

> Sadly, the closest option to viable for RV would be to add the SHORI
> instruction and optionally pattern match it in the fetch/decode.

> Or, say:
> LUI Xn, Imm20
> ADD Xn, Xn, Imm12
> SHORI Xn, Imm16
> SHORI Xn, Imm16

> Then, combine LUI+ADD into a 32-bit load in the decoder (though probably
> only if the Imm12 is positive), and 2x SHORI into a combined
> "Xn=(Xn<<32)|Imm32" operation.

> This could potentially get it down to 2 clock cycles.

Universal constants gets this down to 0 cycles......

> *: To add a jumbo prefix, one needs an encoding that:
> Uses up a really big chunk of encoding space;
> Is otherwise illegal and unused.
> RISC-V doesn't have anything here.

Which is WHY you should not jump ship from SH to RV, but jump to an
ISA without these problems.

> Ironically, in XG2 mode, I still have 28x 24-bit chunks of encoding
> space that aren't yet used for anything, but aren't usable as normal
> encoding space mostly because if I put instructions in there (with the
> existing encoding schemes), I couldn't use all the registers (and they
> would not have predication or similar either). Annoyingly, the only
> types of encodings that would fit in there at present are 2RI Imm16 ops
> or similar (or maybe 3R 128-bit SIMD ops, where these ops only use
> encodings for R0..R31 anyways, interpreting the LSB of the register
> field as encoding R32..R63).

Just another reason not to stay with what you have developed.

In comparison, I reserve 6 major OpCodes so that a control transfer into
data is highly likely to get Undefined OpCode exceptions rather than an
attempt to execute what is in that data. Then, as it is, I still have 21 slots
in the major OpCode group free (27 if you count the permanently reserved).

Much of this comes from side effects of Universal Constants.

>>> An encoding that can MOV a 64-bit constant in 96-bits (12 bytes) and
>>> 1-cycle, is preferable....
>>
>> A consuming instruction where you don't even use a register is better
>> still !!

> Can be done, but thus far 33-bit immediate values. Luckily, Imm33s seems
> to addresses around 99% of uses (for normal ALU ops and similar).

What do you do when accessing data that the linker knows is more than 4GB
away from IP ?? or known to be outside of 0-4GB ?? externs, GOT, PLT, ...

> Had considered allowing an Imm57s case for SIMD immediates (4x S.E5.F8
> or 2x S.E8.F19), which would have indirectly allowed the Imm57s case. By
> themselves though, the difference doesn't seem enough to justify the cost.

While I admit that <basically> anything bigger than 50-bits will be fine
as displacements, they are not fine for constants and especially FP
constants and many bit twiddling constants.

> Don't have enough bits in the encoding scheme to pull off a 3RI Imm64 in
> 12 bytes (and allowing a 16-byte encoding would have too steep of a cost
> increase to be worthwhile).

And yet I did.

> So, alas...

Yes, alas..........

Re: "Mini" tags to reduce the number of op codes

<e4443c417f7145d65b04bec48160c629@www.novabbs.org>
https://news.novabbs.org/devel/article-flat.php?id=38279&group=comp.arch#38279
 by: MitchAlsup1 - Thu, 11 Apr 2024 23:12 UTC

Scott Lurndal wrote:

> "Paul A. Clayton" <paaronclayton@gmail.com> writes:
>>On 4/9/24 8:28 PM, MitchAlsup1 wrote:
>>> BGB-Alt wrote:
>>[snip]
>>>> Things like memcpy/memmove/memset/etc, are function calls in
>>>> cases when not directly transformed into register load/store
>>>> sequences.
>>>
>>> My 66000 does not convert them into LD-ST sequences, MM is a
>>> single instruction.
>>
>>I wonder if it would be useful to have an immediate count form of
>>memory move. Copying fixed-size structures would be able to use an
>>immediate. Aside from not having to load an immediate for such
>>cases, there might be microarchitectural benefits to using a
>>constant. Since fixed-sized copies would likely be limited to
>>smaller regions (with the possible exception of 8 MiB page copies)
>>and the overhead of loading a constant for large sizes would be
>>tiny, only providing a 16-bit immediate form might be reasonable.

> It seems to me that an offloaded DMA engine would be a far
> better way to do memmove (over some threshhold, perhaps a
> cache line) without trashing the caches. Likewise memset.

Effectively, that is what the HW does, even on the lower-end machines:
the AGEN unit of the cache-access pipeline is repeatedly cycled,
and data is read and/or written. One can execute instructions not
needing memory references while LDM, STM, ENTER, EXIT, MM, and MS
are in progress.

Moving this sequencer farther out would still require it to consume
all L1 BW in any event (snooping) for memory consistency reasons.
{Note: cache accesses are performed line-wide, not register-width wide.}

>>
>>>> Did end up with an intermediate "memcpy slide", which can handle
>>>> medium size memcpy and memset style operations by branching into
>>>> a slide.
>>>
>>> MMs and MSs that do not cross page boundaries are ATOMIC. The
>>> entire system
>>> sees only the before or only the after state and nothing in
>>> between.

> One might wonder how that atomicity is guaranteed in a
> SMP processor...

The entire chunk of data traverses the interconnect as a single
transaction. All interested 3rd parties (neither originator nor
recipient) see either the memory state before the transfer or
the state after the transfer.

Re: "Mini" tags to reduce the number of op codes

<20240412021904.000074f8@yahoo.com>
https://news.novabbs.org/devel/article-flat.php?id=38280&group=comp.arch#38280
 by: Michael S - Thu, 11 Apr 2024 23:19 UTC

On Thu, 11 Apr 2024 18:46:54 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:

>
> > On 4/11/2024 6:13 AM, Michael S wrote:
> >> On Wed, 10 Apr 2024 23:30:02 +0000
> >> mitchalsup@aol.com (MitchAlsup1) wrote:
> >>
> >>>
> >>>> It does occupy some icache space, however; have you boosted the
> >>>> icache size to compensate?
> >>>
> >>> The space occupied in the ICache is freed up from being in the
> >>> DCache so the overall hit rate goes up !! At typical sizes,
> >>> ICache miss rate is about ¼ the miss rate of DCache.
> >>>
> >>> Besides:: if you had to LD the constant from memory, you use a LD
> >>> instruction and 1 or 2 words in DCache, while consuming a GPR. So,
> >>> overall, it takes fewer cycles, fewer GPRs, and fewer
> >>> instructions.
> >>>
> >>> Alternatively:: if you paste constants together (LUI, AUPIC) you
> >>> have no direct route to either 64-bit constants or 64-bit address
> >>> spaces.
> >>>
> >>> It looks to be a win-win !!
> >>
> >> Win-win under constraints of Load-Store Arch. Otherwise, it
> >> depends.
>
> Never seen a LD-OP architecture where the inbound memory can be in
> the Rs1 position of the instruction.
>

Maybe. But out of the 6 major integer OPs it matters only for SUB.
By now I don't remember for sure, but I think I had seen an LD-OP
architecture that had a SUBR instruction. Maybe the TI TMS320C30?
It was 30 years ago and my memory is not what it used to be.

Re: "Mini" tags to reduce the number of op codes

<6b732f05c47dfb8bb9aa2df8d6a68b38@www.novabbs.org>
https://news.novabbs.org/devel/article-flat.php?id=38281&group=comp.arch#38281
 by: MitchAlsup1 - Thu, 11 Apr 2024 23:22 UTC

BGB-Alt wrote:

> On 4/11/2024 9:30 AM, Scott Lurndal wrote:
>> "Paul A. Clayton" <paaronclayton@gmail.com> writes:
>>>

> One thing that is still needed is a good, fast, and semi-accurate way to
> pull off the Z=1.0/Z' calculation, as needed for perspective-correct
> rasterization (affine requires subdivision, which adds cost to the
> front-end, and interpolating Z directly adds significant distortion for
> geometry near the near plane).

I saw a 10-cycle latency, 1-cycle throughput divider at Samsung::
10 stages of a 3-bit-at-a-time SRT divider with some exponent stuff
on the side. 1.0/z is a lot simpler than that (float only). A lot
of these great big complicated calculations can be beaten into
submission with a clever attack of brute-force HW.....FMUL and FMAC
being the most often cited cases.

Re: "Mini" tags to reduce the number of op codes

<uva1fu$2010o$1@dont-email.me>
https://news.novabbs.org/devel/article-flat.php?id=38282&group=comp.arch#38282
 by: BGB - Fri, 12 Apr 2024 01:07 UTC

On 4/11/2024 6:06 PM, MitchAlsup1 wrote:
> BGB-Alt wrote:
>
>> On 4/11/2024 1:46 PM, MitchAlsup1 wrote:
>>> BGB wrote:
>>>
>>>>>
>>>>> Win-win under constraints of Load-Store Arch. Otherwise, it depends.
>>>
>>> Never seen a LD-OP architecture where the inbound memory can be in
>>> the Rs1 position of the instruction.
>>>
>>>>>
>>>
>>>> FWIW:
>>>> The LDSH / SHORI mechanism does provide a way to get 64-bit
>>>> constants, and needs less encoding space than the LUI route.
>>>
>>>>    MOV Imm16. Rn
>>>>    SHORI Imm16, Rn
>>>>    SHORI Imm16, Rn
>>>>    SHORI Imm16, Rn
>>>
>>>> Granted, if each is a 1-cycle instruction, this still takes 4 clock
>>>> cycles.
>>>
>>> As compared to::
>>>
>>>      CALK   Rd,Rs1,#imm64
>>>
>>> Which takes 3 words (12 bytes) and executes in CALK cycles, the loading
>>> of the constant is free !! (0 cycles) !! {{The above example uses at
>>> least
>>> 5 cycles to use the loaded/built constant.}}
>>>
>
>> The main reason one might want SHORI is that it can fit into a
>> fixed-length 32-bit encoding.
>
> While 32-bit encoding is RISC mantra, it has NOT been shown to be best
> just simplest. Then, once you start widening the microarchitecture, it
> is better to fetch wider than decode-issue so that you suffer least from
> boundary conditions. Once you start fetching wide OR have wide
> decode-issue, you have ALL the infrastructure to do variable length
> instructions. Thus, complaining that VLE is hard has already been
> eradicated.
>

As noted, BJX2 is effectively VLE.
Just now split into two sub-variants.

So, as for lengths:
Baseline: 16/32/64/96
XG2: 32/64/96
Original version was 16/32/48.

But, the original 48-bit encoding was dropped, mostly to make the rest
of the encoding more orthogonal, and these were replaced with Jumbo
prefixes. An encoding space exists where 48-bit ops could in theory be
re-added to Baseline, but I have not done so, as it does not seem to be
justifiable in a cost/benefit sense (and would still have some of the
same drawbacks as the original 48-bit ops).

Had also briefly experimented with 24-bit ops, but these were quickly
dropped due to "general suckage" (though, an alternate 16/24/32/48
encoding scheme could have theoretically given better code-density).

However, RISC-V is either 32-bit only, or 16/32.

For now, I am not bothering with the 16-bit C extension, not so much
because of the difficulty of dealing with VLE (the core can already deal
with VLE), but more because the 'C' encodings are such a dog-chewed mess
that I don't feel terribly inclined to bother with them.

But, like, I can't really compare BJX2 Baseline with RV64G in terms of
code density, because this wouldn't be a fair comparison. Would need to
compare code-density between Baseline and RV64GC, which would imply
needing to actually support the C extension.

I could already claim a "win" here if I wanted, but as I see it, doing
so would not be valid.

Theoretically, encoding space exists for bigger ops in RISC-V, but no
one has defined ops there yet as far as I know. Also, the way RISC-V
represents larger ops is very different.

However, comparing fixed-length against VLE when the VLE only has larger
instructions, is still acceptable as I see it (even if larger
instructions can still allow a more compact encoding in some cases).

Say, for example, as I see it, SuperH vs Thumb2 would still be a fair
comparison, as would Thumb2 vs RV32GC, but Thumb2 vs RV32G would not.

Unless one only cares about "absolute code density" irrespective of
keeping parity in terms of feature-set.

>>                               Also technically could be retrofitted
>> onto RISC-V without any significant change, unlike some other options
>> (as noted, I don't argue for adding Jumbo prefixes to RV under the
>> basis that there is no real viable way to add them to RV, *).
>
> The issue is that once you do VLE RISC-Vs ISA is no longer helping you
> get the job done, especially when you have to execute 40% more instructions
>

Yeah.

As noted, I had already been beating RISC-V in terms of performance,
only there was a shortfall in terms of ".text" size (for the XG2 variant).

Initially this was around a 16% delta, now down to around 5%. Nearly all
of the size reduction thus far has been due to fiddling with stuff in
my compiler.

In theory, BJX2 (XG2) should be able to win in terms of code-density, as
the only cases where RISC-V has an advantage do not appear to be
statistically significant.

As also noted, I am using "-ffunction-sections" and similar (to allow
GCC to prune unreachable functions), otherwise there is "no contest"
(easier to win against 540K than 290K...).

>> Sadly, the closest option to viable for RV would be to add the SHORI
>> instruction and optionally pattern match it in the fetch/decode.
>
>> Or, say:
>>    LUI Xn, Imm20
>>    ADD Xn, Xn, Imm12
>>    SHORI Xn, Imm16
>>    SHORI Xn, Imm16
>
>> Then, combine LUI+ADD into a 32-bit load in the decoder (though
>> probably only if the Imm12 is positive), and 2x SHORI into a combined
>> "Xn=(Xn<<32)|Imm32" operation.
>
>> This could potentially get it down to 2 clock cycles.
>
> Universal constants gets this down to 0 cycles......
>

Possibly.

>> *: To add a jumbo prefix, one needs an encoding that:
>>    Uses up a really big chunk of encoding space;
>>    Is otherwise illegal and unused.
>> RISC-V doesn't have anything here.
>
> Which is WHY you should not jump ship from SH to RV, but jump to an
> ISA without these problems.
>

Of the options that were available at the time:
SuperH: Simple encoding and decent code density;
RISC-V: Seemed like it would have had worse code density.
Though, it seems that RV beats SH in this area.
Thumb: Uglier encoding and some more awkward limitations vs SH.
Also, condition codes, etc.
Thumb2: Was still patent-encumbered at the time.
PowerPC: Bleh.
...

The main reason for RISC-V support is not due to "betterness", but
rather because RISC-V is at least semi-popular (and not as bad as I
initially thought, in retrospect).

>> Ironically, in XG2 mode, I still have 28x 24-bit chunks of encoding
>> space that aren't yet used for anything, but aren't usable as normal
>> encoding space mostly because if I put instructions in there (with the
>> existing encoding schemes), I couldn't use all the registers (and they
>> would not have predication or similar either). Annoyingly, the only
>> types of encodings that would fit in there at present are 2RI Imm16
>> ops or similar (or maybe 3R 128-bit SIMD ops, where these ops only use
>> encodings for R0..R31 anyways, interpreting the LSB of the register
>> field as encoding R32..R63).
>
> Just another reason not to stay with what you have developed.
>
> In comparison, I reserve 6-major OpCodes so that a control transfer into
> data is highly likely to get Undefined OpCode exceptions rather than a
> try to execute what is in that data. Then, as it is, I still have 21-slots
> in the major OpCode group free (27 if you count the permanently reserved).
>
> Much of this comes from side effects of Universal Constants.
>
>
>>>> An encoding that can MOV a 64-bit constant in 96-bits (12 bytes) and
>>>> 1-cycle, is preferable....
>>>
>>> A consuming instruction where you don't even use a register is better
>>> still !!
>
>
>> Can be done, but thus far 33-bit immediate values. Luckily, Imm33s
>> seems to addresses around 99% of uses (for normal ALU ops and similar).
>
> What do you do when accessing data that the linker knows is more than
> 4GB away from IP ?? or known to be outside of 0-4GB ?? externs, GOT,
> PLT, ...
>
>> Had considered allowing an Imm57s case for SIMD immediates (4x S.E5.F8
>> or 2x S.E8.F19), which would have indirectly allowed the Imm57s case.
>> By themselves though, the difference doesn't seem enough to justify
>> the cost.
>
> While I admit that <basically> anything bigger than 50-bits will be fine
> as displacements, they are not fine for constants and especially FP
> constants and many bit twiddling constants.
>

The number of cases where this comes up is not statistically significant
enough to have a meaningful impact on performance.


Re: "Mini" tags to reduce the number of op codes

<eb501756bd1502b3ea65998c8d100c8e@www.novabbs.org>
https://news.novabbs.org/devel/article-flat.php?id=38283&group=comp.arch#38283
 by: MitchAlsup1 - Fri, 12 Apr 2024 01:40 UTC

BGB wrote:

> On 4/11/2024 6:06 PM, MitchAlsup1 wrote:
>>
>>
>> While I admit that <basically> anything bigger than 50-bits will be fine
>> as displacements, they are not fine for constants and especially FP
>> constants and many bit twiddling constants.
>>

> The number of cases where this comes up is not statistically significant
> enough to have a meaningful impact on performance.

> Fraction of a percent edge-cases are not deal-breakers, as I see it.

Idle speculation::

.globl r8_erf ; -- Begin function r8_erf
.type r8_erf,@function
r8_erf: ; @r8_erf
; %bb.0:
add sp,sp,#-128
std #4614300636657501161,[sp,88] // a[0]
std #4645348406721991307,[sp,104] // a[2]
std #4659275911028085274,[sp,112] // a[3]
std #4595861367557309218,[sp,120] // a[4]
std #4599171895595656694,[sp,40] // p[0]
std #4593699784569291823,[sp,56] // p[2]
std #4580293056851789237,[sp,64] // p[3]
std #4559215111867327292,[sp,72] // p[4]
std #4580359811580069319,[sp,80] // p[4]
std #4612966212090462427,[sp] // q[0]
std #4602930165995154489,[sp,16] // q[2]
std #4588882433176075751,[sp,24] // q[3]
std #4567531038595922641,[sp,32] // q[4]
fabs r2,r1
fcmp r3,r2,#0x3EF00000 // thresh
bnlt r3,.LBB141_6
; %bb.1:
fcmp r3,r2,#4 // xabs <= 4.0
bnlt r3,.LBB141_7
; %bb.2:
fcmp r3,r2,#0x403A8B020C49BA5E // xbig
bngt r3,.LBB141_11
; %bb.3:
fmul r3,r1,r1
fdiv r3,#1,r3
mov r4,#0x3F90B4FB18B485C7 // p[5]
fmac r4,r3,r4,#0x3FD38A78B9F065F6 // p[0]
fadd r5,r3,#0x40048C54508800DB // q[0]
fmac r6,r3,r4,#0x3FD70FE40E2425B8 // p[1]
fmac r4,r3,r5,#0x3FFDF79D6855F0AD // q[1]
fmul r4,r3,r4
fmul r6,r3,r6
mov r5,#2
add r7,sp,#40 // p[*]
add r8,sp,#0 // q[*]
.LBB141_4: ; %._crit_edge11
; =>This Inner Loop Header: Depth=1
vec r9,{r4,r6}
ldd r10,[r7,r5<<3,0] // p[*]
ldd r11,[r8,r5<<3,0] // q[*]
fadd r6,r6,r10
fadd r4,r4,r11
fmul r4,r3,r4
fmul r6,r3,r6
loop ne,r5,#4,#1
; %bb.5:
fadd r5,r6,#0x3F4595FD0D71E33C // p[4]
fmul r3,r3,r5
fadd r4,r4,#0x3F632147A014BAD1 // q[4]
fdiv r3,r3,r4
fadd r3,#0x3FE20DD750429B6D,-r3 // c[0]
fdiv r3,r3,r2
br .LBB141_10 // common tail
.LBB141_6: ; %._crit_edge
fmul r3,r1,r1
fcmp r2,r2,#0x3C9FFE5AB7E8AD5E // xsmall
sra r2,r2,<1:13>
cvtsd r4,#0
mux r2,r2,r3,r4
mov r3,#0x3FC7C7905A31C322 // a[4]
fmac r3,r2,r3,#0x400949FB3ED443E9 // a[0]
fmac r3,r2,r3,#0x405C774E4D365DA3 // a[1]
ldd r4,[sp,104] // a[2]
fmac r3,r2,r3,r4
fadd r4,r2,#0x403799EE342FB2DE // b[0]
fmac r4,r2,r4,#0x406E80C9D57E55B8 // b[1]
fmac r4,r2,r4,#0x40940A77529CADC8 // b[2]
fmac r3,r2,r3,#0x40A912C1535D121A // a[3]
fmul r1,r3,r1
fmac r2,r2,r4,#0x40A63879423B87AD // b[3]
fdiv r2,r1,r2
mov r1,r2
add sp,sp,#128
ret // 68
.LBB141_7:
fmul r3,r2,#0x3E571E703C5F5815 // c[8]
mov r5,#0
mov r4,r2
.LBB141_8: ; =>This Inner Loop Header: Depth=1
vec r6,{r3,r4}
ldd r7,[ip,r5<<3,.L__const.r8_erf.c]// c[*]
fadd r3,r3,r7
fmul r3,r2,r3
ldd r7,[ip,r5<<3,.L__const.r8_erf.d]// d[*]
fadd r4,r4,r7
fmul r4,r2,r4
loop ne,r5,#7,#1
; %bb.9:
fadd r3,r3,#0x4093395B7FD2FC8E // c[7]
fadd r4,r4,#0x4093395B7FD35F61 // d[7]
fdiv r3,r3,r4
.LBB141_10: // common tail
fmul r4,r2,#0x41800000 // 16.0
fmul r4,r4,#0x3D800000 // 1/16.0
cvtds r4,r4 // (signed)double
cvtsd r4,r4 // (double)signed
fadd r5,r2,-r4
fadd r2,r2,r4
fmul r4,r4,-r4
fexp r4,r4 // exp()
fmul r2,r2,-r5
fexp r2,r2 // exp()
fmul r2,r4,r2
fadd r2,#0,-r2
fmac r2,r2,r3,#0x3F000000 // 0.5
fadd r2,r2,#0x3F000000 // 0.5
pflt r1,0,T
fadd r2,#0,-r2
mov r1,r2
add sp,sp,#128
ret
.LBB141_11:
fcmp r1,r1,#0
sra r1,r1,<1:13>
cvtsd r2,#-1 // (double)-1
cvtsd r3,#1 // (double)+1
mux r2,r1,r3,r2
mov r1,r2
add sp,sp,#128
ret
.Lfunc_end141:
.size r8_erf, .Lfunc_end141-r8_erf
; -- End function
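
Aside (not from the post): the std immediates above are raw binary64 bit
patterns written out in decimal, while the shorter fcmp immediates such as
#0x3EF00000 read naturally as binary32 patterns. A small C decode of two of
them, purely for illustration:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    uint64_t a0_bits = 4614300636657501161ull;   // a[0], i.e. 0x400949FB3ED443E9
    double   a0;
    memcpy(&a0, &a0_bits, sizeof a0);

    uint32_t th_bits = 0x3EF00000u;              // the "thresh" fcmp immediate
    float    th;
    memcpy(&th, &th_bits, sizeof th);

    printf("a[0]   = %.17g\n", a0);   // about 3.1611...
    printf("thresh = %g\n", th);      // 0.46875
    return 0;
}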

Re: "Mini" tags to reduce the number of op codes

<RQaSN.654618$c3Ea.13093@fx10.iad>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38284&group=comp.arch#38284

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!nntp.comgw.net!weretis.net!feeder8.news.weretis.net!news.neodome.net!npeer.as286.net!npeer-ng0.as286.net!peer01.ams1!peer.ams1.xlned.com!news.xlned.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx10.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: "Mini" tags to reduce the number of op codes
Newsgroups: comp.arch
References: <uuk100$inj$1@dont-email.me> <a81256dbd4f121a9345b151b1280162f@www.novabbs.org> <uv4ghh$gfsv$1@dont-email.me> <8e61b7c856aff15374ab3cc55956be9d@www.novabbs.org> <uv5err$ql29$1@dont-email.me> <e43623eb10619eb28a68b2bd7af93390@www.novabbs.org> <S%zRN.162255$_a1e.120745@fx16.iad> <8b6bcc78355b8706235b193ad2243ad0@www.novabbs.org> <20240411141324.0000090d@yahoo.com> <uv9ahu$1r74h$1@dont-email.me> <0b785ebc54c76e3a10316904c3febba5@www.novabbs.org> <20240412021904.000074f8@yahoo.com>
Lines: 19
Message-ID: <RQaSN.654618$c3Ea.13093@fx10.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Fri, 12 Apr 2024 13:40:01 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Fri, 12 Apr 2024 13:40:01 GMT
X-Received-Bytes: 1775
 by: Scott Lurndal - Fri, 12 Apr 2024 13:40 UTC

Michael S <already5chosen@yahoo.com> writes:
>On Thu, 11 Apr 2024 18:46:54 +0000
>mitchalsup@aol.com (MitchAlsup1) wrote:

>> >>> It looks to be a win-win !!
>> >>
>> >> Win-win under constraints of Load-Store Arch. Otherwise, it
>> >> depends.
>>
>> Never seen a LD-OP architecture where the inbound memory can be in
>> the Rs1 position of the instruction.
>>
>
>May be. But out of 6 major integer OPs it matters only for SUB.
>By now I don't remember for sure, but I think that I had seen LD-OP
>architecture that had SUBR instruction. May be, TI TMS320C30?

ARM has LDADD - negate one argument and it becomes a subtract.

Re: "Mini" tags to reduce the number of op codes

<20240412180833.000035b3@yahoo.com>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38285&group=comp.arch#38285

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: already5chosen@yahoo.com (Michael S)
Newsgroups: comp.arch
Subject: Re: "Mini" tags to reduce the number of op codes
Date: Fri, 12 Apr 2024 18:08:33 +0300
Organization: A noiseless patient Spider
Lines: 26
Message-ID: <20240412180833.000035b3@yahoo.com>
References: <uuk100$inj$1@dont-email.me>
<a81256dbd4f121a9345b151b1280162f@www.novabbs.org>
<uv4ghh$gfsv$1@dont-email.me>
<8e61b7c856aff15374ab3cc55956be9d@www.novabbs.org>
<uv5err$ql29$1@dont-email.me>
<e43623eb10619eb28a68b2bd7af93390@www.novabbs.org>
<S%zRN.162255$_a1e.120745@fx16.iad>
<8b6bcc78355b8706235b193ad2243ad0@www.novabbs.org>
<20240411141324.0000090d@yahoo.com>
<uv9ahu$1r74h$1@dont-email.me>
<0b785ebc54c76e3a10316904c3febba5@www.novabbs.org>
<20240412021904.000074f8@yahoo.com>
<RQaSN.654618$c3Ea.13093@fx10.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 12 Apr 2024 17:08:41 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="af7207a860320046fa1f9156acfe0411";
logging-data="2551691"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+OK/heuO96IwxAmquLA+Ye64eHG/VF/5Q="
Cancel-Lock: sha1:UOOrbRQO7V3XoII0Tg2MkB5ehFk=
X-Newsreader: Claws Mail 4.1.1 (GTK 3.24.34; x86_64-w64-mingw32)
 by: Michael S - Fri, 12 Apr 2024 15:08 UTC

On Fri, 12 Apr 2024 13:40:01 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:

> Michael S <already5chosen@yahoo.com> writes:
> >On Thu, 11 Apr 2024 18:46:54 +0000
> >mitchalsup@aol.com (MitchAlsup1) wrote:
>
> >> >>> It looks to be a win-win !!
> >> >>
> >> >> Win-win under constraints of Load-Store Arch. Otherwise, it
> >> >> depends.
> >>
> >> Never seen a LD-OP architecture where the inbound memory can be in
> >> the Rs1 position of the instruction.
> >>
> >
> >May be. But out of 6 major integer OPs it matters only for SUB.
> >By now I don't remember for sure, but I think that I had seen LD-OP
> >architecture that had SUBR instruction. May be, TI TMS320C30?
>
> ARM has LDADD - negate one argument and it becomes a subtract.
>

ARM LDADD is not a LD-OP instruction. It is RMW.
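
Aside (not from the thread): a minimal C sketch of the point above. With a
load-op form "OP rd, rs, [mem]" the memory operand lands in the second source
slot; for the commutative integer ops that does not matter, and only SUB needs
either the memory operand in the Rs1 position or a reverse-subtract (SUBR).

#include <stdint.h>

uint64_t ldop_example(uint64_t r, const uint64_t *m)
{
    uint64_t a = r + *m;   // same as *m + r: operand position irrelevant
    uint64_t s = r - *m;   // fine: the memory value is the subtrahend
    uint64_t t = *m - r;   // the case that wants memory in Rs1, or SUBR
    return a ^ s ^ t;
}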

Re: "Mini" tags to reduce the number of op codes

<uvbtif$2gat0$1@dont-email.me>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38286&group=comp.arch#38286

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: "Mini" tags to reduce the number of op codes
Date: Fri, 12 Apr 2024 13:12:28 -0500
Organization: A noiseless patient Spider
Lines: 373
Message-ID: <uvbtif$2gat0$1@dont-email.me>
References: <uuk100$inj$1@dont-email.me>
<lf441jt9i2lv7olvnm9t7bml2ib19eh552@4ax.com> <uuv1ir$30htt$1@dont-email.me>
<d71c59a1e0342d0d01f8ce7c0f449f9b@www.novabbs.org>
<uv02dn$3b6ik$1@dont-email.me> <uv415n$ck2j$1@dont-email.me>
<uv46rg$e4nb$1@dont-email.me>
<a81256dbd4f121a9345b151b1280162f@www.novabbs.org>
<uv4ghh$gfsv$1@dont-email.me>
<8e61b7c856aff15374ab3cc55956be9d@www.novabbs.org>
<uv5err$ql29$1@dont-email.me>
<e43623eb10619eb28a68b2bd7af93390@www.novabbs.org>
<S%zRN.162255$_a1e.120745@fx16.iad>
<8b6bcc78355b8706235b193ad2243ad0@www.novabbs.org>
<20240411141324.0000090d@yahoo.com> <uv9ahu$1r74h$1@dont-email.me>
<0b785ebc54c76e3a10316904c3febba5@www.novabbs.org>
<uv9i0i$1srig$1@dont-email.me>
<f4d64e33b721ff6c5bd37f01f2705316@www.novabbs.org>
<uva1fu$2010o$1@dont-email.me>
<eb501756bd1502b3ea65998c8d100c8e@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 12 Apr 2024 20:12:32 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="5521f5d032488c5ad7ae13ff64f338b6";
logging-data="2632608"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18s5PjZeAD3HaKXoLAS7g7rRci10T7eRnA="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:QRO0TBS8IlL3p8SRG/rLF/HZ05w=
In-Reply-To: <eb501756bd1502b3ea65998c8d100c8e@www.novabbs.org>
Content-Language: en-US
 by: BGB - Fri, 12 Apr 2024 18:12 UTC

On 4/11/2024 8:40 PM, MitchAlsup1 wrote:
> BGB wrote:
>
>> On 4/11/2024 6:06 PM, MitchAlsup1 wrote:
>>>
>>>
>>> While I admit that <basically> anything bigger than 50-bits will be fine
>>> as displacements, they are not fine for constants and especially FP
>>> constants and many bit twiddling constants.
>>>
>
>> The number of cases where this comes up is not statistically
>> significant enough to have a meaningful impact on performance.
>
>> Fraction of a percent edge-cases are not deal-breakers, as I see it.
>
> Idle speculation::
>
> <snip r8_erf listing, quoted in full above>

These patterns seem rather unusual...
Don't really know the ABI.

The patterns don't really fit what I observe in typical compiler output,
though; the mismatch is mostly in the FP constants, and in particular,
constants that fall outside what can be represented exactly as Binary16 or
similar are rare.
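
Aside (not from the post): one way to check that observation is to test whether
a binary64 constant round-trips exactly through binary16, which needs the
unbiased exponent in [-14, 15] and the low 42 mantissa bits clear. A minimal C
sketch, ignoring subnormals, NaNs, and infinities:

#include <stdint.h>
#include <string.h>

static int fits_binary16(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    int64_t  exp = (int64_t)((bits >> 52) & 0x7FF) - 1023;  // unbiased exponent
    uint64_t man = bits & 0xFFFFFFFFFFFFFull;               // 52-bit mantissa

    if (d == 0.0) return 1;
    if (exp < -14 || exp > 15) return 0;        // outside binary16 normal range
    return (man & ((1ull << 42) - 1)) == 0;     // low 42 mantissa bits must be 0
}

The threshold 0.46875 passes (it is 0x3780 as binary16, the FLDH immediate used
below), while the polynomial coefficients in the listing do not.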

> .globl r8_erf ; -- Begin function r8_erf
> .type r8_erf,@function
> r8_erf: ; @r8_erf
> ; %bb.0:
> add sp,sp,#-128
ADD -128, SP
> std #4614300636657501161,[sp,88] // a[0]
MOV 0x400949FB3ED443E9, R3
MOV.Q R3, (SP, 88)
> std #4645348406721991307,[sp,104] // a[2]
MOV 0x407797C38897528B, R3
MOV.Q R3, (SP, 104)
> std #4659275911028085274,[sp,112] // a[3]
> std #4595861367557309218,[sp,120] // a[4]
> std #4599171895595656694,[sp,40] // p[0]
> std #4593699784569291823,[sp,56] // p[2]
> std #4580293056851789237,[sp,64] // p[3]
> std #4559215111867327292,[sp,72] // p[4]
> std #4580359811580069319,[sp,80] // p[4]
> std #4612966212090462427,[sp] // q[0]
> std #4602930165995154489,[sp,16] // q[2]
> std #4588882433176075751,[sp,24] // q[3]
> std #4567531038595922641,[sp,32] // q[4]
.... pattern is obvious enough.
Each constant needs 12 bytes, so 16 bytes/store.

> fabs r2,r1
> fcmp r3,r2,#0x3EF00000 // thresh
> bnlt r3,.LBB141_6
FABS R5, R6
FLDH 0x3780, R3 //A
FCMPGT R3, R6 //A
BT .LBB141_6 //A

Or (FP-IMM extension):

FABS R5, R6
FCMPGE 0x0DE, R6 //B (FP-IMM)
BF .LBB141_6 //B

> ; %bb.1:
> fcmp r3,r2,#4 // xabs <= 4.0
> bnlt r3,.LBB141_7


Re: "Mini" tags to reduce the number of op codes

https://news.novabbs.org/devel/article-flat.php?id=38287&group=comp.arch#38287
 by: MitchAlsup1 - Fri, 12 Apr 2024 23:46 UTC

BGB wrote:

> On 4/11/2024 8:40 PM, MitchAlsup1 wrote:
>> BGB wrote:
>>
>>> On 4/11/2024 6:06 PM, MitchAlsup1 wrote:
>>>>
>>>>
>>>> While I admit that <basically> anything bigger than 50-bits will be fine
>>>> as displacements, they are not fine for constants and especially FP
>>>> constants and many bit twiddling constants.
>>>>
>>
>>> The number of cases where this comes up is not statistically
>>> significant enough to have a meaningful impact on performance.
>>
>>> Fraction of a percent edge-cases are not deal-breakers, as I see it.
>>
>> Idle speculation::
>>
>> <snip r8_erf listing, quoted in full above>

> These patterns seem rather unusual...
> Don't really know the ABI.

> Patterns don't really fit observations for typical compiler output
> though (mostly in the FP constants, and particular ones that fall
> outside the scope of what can be exactly represented as Binary16 or
> similar, are rare).

> > .globl r8_erf ; -- Begin function r8_erf
> > .type r8_erf,@function
> > r8_erf: ; @r8_erf
> > ; %bb.0:
> > add sp,sp,#-128
> ADD -128, SP
> > std #4614300636657501161,[sp,88] // a[0]
> MOV 0x400949FB3ED443E9, R3
> MOV.Q R3, (SP, 88)
> > std #4645348406721991307,[sp,104] // a[2]
> MOV 0x407797C38897528B, R3
> MOV.Q R3, (SP, 104)
> > std #4659275911028085274,[sp,112] // a[3]
> > std #4595861367557309218,[sp,120] // a[4]
> > std #4599171895595656694,[sp,40] // p[0]
> > std #4593699784569291823,[sp,56] // p[2]
> > std #4580293056851789237,[sp,64] // p[3]
> > std #4559215111867327292,[sp,72] // p[4]
> > std #4580359811580069319,[sp,80] // p[4]
> > std #4612966212090462427,[sp] // q[0]
> > std #4602930165995154489,[sp,16] // q[2]
> > std #4588882433176075751,[sp,24] // q[3]
> > std #4567531038595922641,[sp,32] // q[4]
> .... pattern is obvious enough.
> Each constant needs 12 bytes, so 16 bytes/store.


Re: "Mini" tags to reduce the number of op codes

https://news.novabbs.org/devel/article-flat.php?id=38288&group=comp.arch#38288
 by: MitchAlsup1 - Sat, 13 Apr 2024 03:17 UTC

BGB wrote:

> On 4/11/2024 8:40 PM, MitchAlsup1 wrote:
>> BGB wrote:
>>
>>> On 4/11/2024 6:06 PM, MitchAlsup1 wrote:
>>>>
>>>>
>>>> While I admit that <basically> anything bigger than 50-bits will be fine
>>>> as displacements, they are not fine for constants and especially FP
>>>> constants and many bit twiddling constants.
>>>>
>>
>>> The number of cases where this comes up is not statistically
>>> significant enough to have a meaningful impact on performance.
>>
>>> Fraction of a percent edge-cases are not deal-breakers, as I see it.
>>
>> Idle speculation::
>>
>> <snip r8_erf listing, quoted in full above>

> These patterns seem rather unusual...
> Don't really know the ABI.

> Patterns don't really fit observations for typical compiler output
> though (mostly in the FP constants, and particular ones that fall
> outside the scope of what can be exactly represented as Binary16 or
> similar, are rare).

You are N E V E R going to find the coefficients of a Chebyshev
polynomial to fit in a small FP container; excepting the very
occasional C0 or C1 term {which are mostly 1.0 and 0.0}
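
Aside (not from the post): this can be seen concretely by counting trailing
zero bits in the 52-bit mantissas of the coefficients in the listing above; a
constant fits an n-mantissa-bit immediate form only if at least 52-n of them
are zero (42 for binary16, 29 for binary32). A small C check on two of them:

#include <stdio.h>
#include <stdint.h>

static int mantissa_trailing_zeros(uint64_t bits)
{
    uint64_t man = bits & 0xFFFFFFFFFFFFFull;   // 52-bit mantissa
    if (man == 0) return 52;
    int n = 0;
    while ((man & 1) == 0) { man >>= 1; n++; }
    return n;
}

int main(void)
{
    uint64_t p0 = 0x3FD38A78B9F065F6ull;   // p[0] from the listing
    uint64_t q0 = 0x40048C54508800DBull;   // q[0] from the listing
    printf("p[0]: %d trailing zero mantissa bits\n", mantissa_trailing_zeros(p0));
    printf("q[0]: %d trailing zero mantissa bits\n", mantissa_trailing_zeros(q0));
    return 0;
}

Both print 0 or 1, i.e. these coefficients need essentially the full 52-bit
mantissa, so no 10- or 23-bit immediate form can hold them exactly.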

Re: "Mini" tags to reduce the number of op codes

<uvd7p8$2s5mf$1@dont-email.me>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38289&group=comp.arch#38289

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: "Mini" tags to reduce the number of op codes
Date: Sat, 13 Apr 2024 01:12:53 -0500
Organization: A noiseless patient Spider
Lines: 243
Message-ID: <uvd7p8$2s5mf$1@dont-email.me>
References: <uuk100$inj$1@dont-email.me>
<lf441jt9i2lv7olvnm9t7bml2ib19eh552@4ax.com> <uuv1ir$30htt$1@dont-email.me>
<d71c59a1e0342d0d01f8ce7c0f449f9b@www.novabbs.org>
<uv02dn$3b6ik$1@dont-email.me> <uv415n$ck2j$1@dont-email.me>
<uv46rg$e4nb$1@dont-email.me>
<a81256dbd4f121a9345b151b1280162f@www.novabbs.org>
<uv4ghh$gfsv$1@dont-email.me>
<8e61b7c856aff15374ab3cc55956be9d@www.novabbs.org>
<uv5err$ql29$1@dont-email.me>
<e43623eb10619eb28a68b2bd7af93390@www.novabbs.org>
<S%zRN.162255$_a1e.120745@fx16.iad>
<8b6bcc78355b8706235b193ad2243ad0@www.novabbs.org>
<20240411141324.0000090d@yahoo.com> <uv9ahu$1r74h$1@dont-email.me>
<0b785ebc54c76e3a10316904c3febba5@www.novabbs.org>
<uv9i0i$1srig$1@dont-email.me>
<f4d64e33b721ff6c5bd37f01f2705316@www.novabbs.org>
<uva1fu$2010o$1@dont-email.me>
<eb501756bd1502b3ea65998c8d100c8e@www.novabbs.org>
<uvbtif$2gat0$1@dont-email.me>
<d6914a412e2e25961ea4ef88c1206bfe@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 13 Apr 2024 08:12:57 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="687c92c8c852cc790a82e4f97769c2d1";
logging-data="3020495"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18/W90S3zKG6z6HBpFr8L2exvryMQSqbd8="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:8vaF7HRX0PwKQF45rFDLI6vufDc=
Content-Language: en-US
In-Reply-To: <d6914a412e2e25961ea4ef88c1206bfe@www.novabbs.org>
 by: BGB - Sat, 13 Apr 2024 06:12 UTC

On 4/12/2024 10:17 PM, MitchAlsup1 wrote:
> BGB wrote:
>
>> On 4/11/2024 8:40 PM, MitchAlsup1 wrote:
>>> BGB wrote:
>>>
>>>> On 4/11/2024 6:06 PM, MitchAlsup1 wrote:
>>>>>
>>>>>
>>>>> While I admit that <basically> anything bigger than 50-bits will be
>>>>> fine
>>>>> as displacements, they are not fine for constants and especially FP
>>>>> constants and many bit twiddling constants.
>>>>>
>>>
>>>> The number of cases where this comes up is not statistically
>>>> significant enough to have a meaningful impact on performance.
>>>
>>>> Fraction of a percent edge-cases are not deal-breakers, as I see it.
>>>
>>> Idle speculation::
>>>
>>> <snip r8_erf listing, quoted in full above>
>
>> These patterns seem rather unusual...
>> Don't really know the ABI.
>
>> Patterns don't really fit observations for typical compiler output
>> though (mostly in the FP constants, and particular ones that fall
>> outside the scope of what can be exactly represented as Binary16 or
>> similar, are rare).
>
> You are N E V E R going to find the coefficients of a Chebyshev
> polynomial to fit in a small FP container; excepting the very
> occasional C0 or C1 term {which are mostly 1.0 and 0.0}


Re: "Mini" tags to reduce the number of op codes

https://news.novabbs.org/devel/article-flat.php?id=38290&group=comp.arch#38290
 by: Brian G. Lucas - Sat, 13 Apr 2024 17:51 UTC

On 4/9/24 13:24, Thomas Koenig wrote:
> I wrote:
>
>> MitchAlsup1 <mitchalsup@aol.com> schrieb:
>>> Thomas Koenig wrote:
>>>
>>>> John Savard <quadibloc@servername.invalid> schrieb:
>>>
>>>>> Thus, instead of having mode bits, one _could_ do the following:
>>>>>
>>>>> Usually, have 28 bit instructions that are shorter because there's
>>>>> only one opcode for each floating and integer operation. The first
>>>>> four bits in a block give the lengths of data to be used.
>>>>>
>>>>> But have one value for the first four bits in a block that indicates
>>>>> 36-bit instructions instead, which do include type information, so
>>>>> that very occasional instructions for rarely-used types can be mixed
>>>>> in which don't fill a whole block.
>>>>>
>>>>> While that's a theoretical possibility, I don't view it as being
>>>>> worthwhile in practice.
>>>
>>>> I played around a bit with another scheme: Encoding things into
>>>> 128-bit blocks, with either 21-bit or 42-bit or longer instructions
>>>> (or a block header with six bits, and 20 or 40 bits for each
>>>> instruction).
>>>
>>> Not having seen said encoding scheme:: I suspect you used the Rd=Rs1
>>> destructive operand model for the 21-bit encodings. Yes :: no ??
>>
>> It was not very well developed, I gave it up when I saw there wasn't
>> much to gain.
>
> Maybe one more thing: In order to justify the more complex encoding,
> I was going for 64 registers, and that didn't work out too well
> (missing bits).
>
> Having learned about M-Core in the meantime, pure 32-register,
> 21-bit instruction ISA might actually work better.
If you want to know more about MCore, you can contact me.
I was the initial designer of the mcore ISA. It was targeted
at embedded processors, particularly control processors in phones
and radios. It was extended and found its way into GPS receivers and
set top boxes. Motorola licensed it to the Chinese and there it is
known as CSky ISAv1 (there is a different ISAv2). There is even a
supported Linux port of CSky v1.

brian

Re: "Mini" tags to reduce the number of op codes

<c7e4e0b12d9847b05ca17306768a3aba@www.novabbs.org>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38291&group=comp.arch#38291

  copy link   Newsgroups: comp.arch
Date: Sun, 14 Apr 2024 22:58:22 +0000
Subject: Re: "Mini" tags to reduce the number of op codes
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$CGfXetj44/OmtNbFlBZAZuXXkAflztHUuIl2ZRSTsAALHjZCiA3V.
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <uuk100$inj$1@dont-email.me>
Organization: Rocksolid Light
Message-ID: <c7e4e0b12d9847b05ca17306768a3aba@www.novabbs.org>
 by: MitchAlsup1 - Sun, 14 Apr 2024 22:58 UTC

Stephen Fuld wrote:
<snip>

> I think this all works fine for a single compilation unit, as the
> compiler certainly knows the type of the data. But what happens with
> separate compilations? The called function probably doesn’t know the
> tag value for callee saved registers. Fortunately, the My 66000
> architecture comes to the rescue here. You would modify the Enter and
> Exit instructions to save/restore the tag bits of the registers they are
> saving or restoring in the same data structure it uses for the registers
> (yes, it adds 32 bits to that structure – minimal cost). The same
> mechanism works for interrupts that take control away from a running
> process.

I had missed this until now:: The stack remains 64-bit aligned at all times,
so if you add 32-bits to the stack you actually add 64-bits to the stack.

Given this, you can effectively use a 2-bit tag {integral, floating, pointing,
describing}. The difference between pointing and describing is that pointing
is C-like, while describing is dope-vector-like. {{Although others may find
something else to put in the 4th slot.}}
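
A sketch of how such a save area could look (names and layout are illustrative,
not from the My 66000 spec): 32 registers times a 2-bit tag packs into one
64-bit word, which is the extra 64 bits of stack mentioned above.

#include <stdint.h>

enum reg_tag { TAG_INTEGRAL = 0, TAG_FLOATING = 1, TAG_POINTING = 2, TAG_DESCRIBING = 3 };

struct save_area {
    uint64_t gpr[32];   // registers saved by ENTER / restored by EXIT
    uint64_t tags;      // 32 x 2-bit tags, packed low to high; keeps 64-bit alignment
};

static inline enum reg_tag get_tag(const struct save_area *sa, int r)
{
    return (enum reg_tag)((sa->tags >> (2 * r)) & 3u);
}

static inline void set_tag(struct save_area *sa, int r, enum reg_tag t)
{
    sa->tags = (sa->tags & ~(3ull << (2 * r))) | ((uint64_t)t << (2 * r));
}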

> Any comments are welcome.

Re: "Mini" tags to reduce the number of op codes

<86d1dd03deee83e339afa725524ab259@www.novabbs.org>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38292&group=comp.arch#38292

  copy link   Newsgroups: comp.arch
Date: Sun, 14 Apr 2024 23:25:52 +0000
Subject: Re: "Mini" tags to reduce the number of op codes
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$4WbE3sD03P9oHvX9iAsE6OqNUAFYafIG/lCsiGqGzoxZEXDXolrE.
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <uuk100$inj$1@dont-email.me> <2024Apr3.192405@mips.complang.tuwien.ac.at>
Organization: Rocksolid Light
Message-ID: <86d1dd03deee83e339afa725524ab259@www.novabbs.org>
 by: MitchAlsup1 - Sun, 14 Apr 2024 23:25 UTC

Anton Ertl wrote:

> I have a similar problem for the carry and overflow bits in
> < http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf >, and chose to
> let those bits not survive across calls; if there was a cheap solution
> for the problem, it would eliminate this drawback of my idea.

My 66000 ISA can encode the mpn_add_n() inner loop in 5 instructions,
whereas RISC-V encodes the inner loop in 11 instructions.

Source code:

void mpn_add_n( uint64_t sum[], uint64_t a[], uint64_t b[], int n )
{
    uint64_t c = 0;
    for( int i = 0; i < n; i++ )
    {
        {c, sum[i]} = a[i] + b[i] + c;   // {carry out, 64-bit sum}
    }
    return;
}

Assembly code::

.global mpn_add_n
mpn_add_n:
MOV R5,#0 // c
MOV R6,#0 // i

VEC R7,{}
LDD R8,[R2,Ri<<3]
LDD R9,[R3,Ri<<3]
CARRY R5,{{IO}}
ADD R10,R8,R9
STD R10,[R1,Ri<<3]
LOOP LT,R6,#1,R4
RET

So, adding a few "bells and whistles" to RISC-V does give you a
performance gain (1.38×); using a well designed ISA gives you a
performance gain of 2.00× !! {{moral: don't stop too early}}

Note that all the register bookkeeping has disappeared !! because
of the indexed memory reference form.

As I count executing instructions, VEC does not execute, nor does
CARRY -- CARRY causes the subsequent ADD to take C as its carry input,
and the carry produced by the ADD goes back into C. LOOP performs the
ADD-CMP-BC sequence in a single instruction and in a single clock.
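
For reference (not from the post), a plain-C rendering of the {c, sum[i]}
notation, recovering the carry with unsigned compares -- roughly what a
compiler has to emit without a CARRY-like facility or add-with-carry
intrinsics:

#include <stdint.h>

void mpn_add_n_c(uint64_t sum[], const uint64_t a[], const uint64_t b[], int n)
{
    uint64_t c = 0;
    for (int i = 0; i < n; i++) {
        uint64_t t  = a[i] + c;
        uint64_t c1 = (t < c);          // carry out of adding the old carry
        sum[i] = t + b[i];
        uint64_t c2 = (sum[i] < t);     // carry out of adding b[i]
        c = c1 | c2;                    // at most one of the two can be set
    }
}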

Re: "Mini" tags to reduce the number of op codes

<uvimv7$629s$1@dont-email.me>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38293&group=comp.arch#38293

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: terje.mathisen@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: "Mini" tags to reduce the number of op codes
Date: Mon, 15 Apr 2024 10:02:46 +0200
Organization: A noiseless patient Spider
Lines: 89
Message-ID: <uvimv7$629s$1@dont-email.me>
References: <uuk100$inj$1@dont-email.me>
<2024Apr3.192405@mips.complang.tuwien.ac.at>
<86d1dd03deee83e339afa725524ab259@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable
Injection-Date: Mon, 15 Apr 2024 10:02:47 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="9d6aa58f39643660529f6affbcde0704";
logging-data="198972"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18Q8obOmFV0bmQWPLwD+xHwmwY2B6ywk06mpuem5bpdhg=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Firefox/91.0 SeaMonkey/2.53.18.2
Cancel-Lock: sha1:iik6fZTZy6fBNvgzpmfOpcp6JM4=
In-Reply-To: <86d1dd03deee83e339afa725524ab259@www.novabbs.org>
 by: Terje Mathisen - Mon, 15 Apr 2024 08:02 UTC

MitchAlsup1 wrote:
> Anton Ertl wrote:
>
>> I have a similar problem for the carry and overflow bits in
>> < http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf >, and chose to
>> let those bits not survive across calls; if there was a cheap solution
>> for the problem, it would eliminate this drawback of my idea.
>
> My 66000 ISA can encode the mpn_add_n() inner loop in 5-instructions
> whereas RISC-V encodes the inner loop in 11 instructions.
>
> Source code:
>
> void mpn_add_n( uint64_t sum, uint64_t a, unit64_t b, int n )
> {
>     uint64_t c = 0;
>     for( int i = 0; i < n; i++ )
>     {
>          {c, sum[i]} = a[i] + b[i] + c;
>     }
>     return
> }
>
> Assembly code::
>
>     .global mpn_add_n
> mpn_add_n:
>     MOV   R5,#0     // c
>     MOV   R6,#0     // i
>
>     VEC   R7,{}
>     LDD   R8,[R2,Ri<<3]
>     LDD   R9,[R3,Ri<<3]
>     CARRY R5,{{IO}}
>     ADD   R10,R8,R9
>     STD   R10,[R1,Ri<<3]
>     LOOP  LT,R6,#1,R4
>     RET
>
> So, adding a few "bells and whistles" to RISC-V does give you a
> performance gain (1.38×); using a well designed ISA gives you a
> performance gain of 2.00× !! {{moral: don't stop too early}}
>
> Note that all the register bookkeeping has disappeared !! because
> of the indexed memory reference form.
>
> As I count executing instructions, VEC does not execute, nor does
> CARRY--CARRY causes the subsequent ADD to take C input as carry and
> the carry produced by ADD goes back in C. Loop performs the ADD-CMP-
> BC sequence in a single instruction and in a single clock.

; RSI->a[n], RDX->b[n], RDI->sum[n], RCX=-n
xor rax,rax ;; Clear carry
next:
mov rax,[rsi+rcx*8]
adc rax,[rdx+rcx*8]
mov [rdi+rcx*8],rax
inc rcx
jnz next

The code above is 5 instructions, or 6 if we avoid the load-op, doing
two loads and one store, so it should only be limited by the latency of
the ADC, i.e. one or two cycles.

In the non-OoO (i.e Pentium) days, I would have inverted the loop in
order to hide the latencies as much as possible, resulting in an inner
loop something like this:

next:
adc eax,ebx
mov ebx,[edx+ecx*4] ; First cycle

mov [edi+ecx*4],eax
mov eax,[esi+ecx*4] ; Second cycle

inc ecx
jnz next ; Third cycle

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: "Mini" tags to reduce the number of op codes

<uvir8v$6ua2$1@dont-email.me>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38294&group=comp.arch#38294

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: terje.mathisen@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: "Mini" tags to reduce the number of op codes
Date: Mon, 15 Apr 2024 11:16:15 +0200
Organization: A noiseless patient Spider
Lines: 127
Message-ID: <uvir8v$6ua2$1@dont-email.me>
References: <uuk100$inj$1@dont-email.me>
<2024Apr3.192405@mips.complang.tuwien.ac.at>
<86d1dd03deee83e339afa725524ab259@www.novabbs.org>
<uvimv7$629s$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable
Injection-Date: Mon, 15 Apr 2024 11:16:15 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="9d6aa58f39643660529f6affbcde0704";
logging-data="227650"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+XOWurrERRSJPBlkLLdKolSOd5ddwLJK6xGdRMMxwg7Q=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Firefox/91.0 SeaMonkey/2.53.18.2
Cancel-Lock: sha1:N481Kg8dt1yuHmQ5US66IxCdhlQ=
In-Reply-To: <uvimv7$629s$1@dont-email.me>
 by: Terje Mathisen - Mon, 15 Apr 2024 09:16 UTC

Terje Mathisen wrote:
> MitchAlsup1 wrote:
>> <snip My 66000 mpn_add_n example, quoted in full above>
>
>   ; RSI->a[n], RDX->b[n], RDI->sum[n], RCX=-n
>   xor rax,rax ;; Clear carry
> next:
>   mov rax,[rsi+rcx*8]
>   adc rax,[rdx+rcx*8]
>   mov [rdi+rcx*8],rax
>   inc rcx
>    jnz next
>
> The code above is 5 instructions, or 6 if we avoid the load-op, doing
> two loads and one store, so it should only be limited by the latency of
> the ADC, i.e. one or two cycles.
>
> In the non-OoO (i.e Pentium) days, I would have inverted the loop in
> order to hide the latencies as much as possible, resulting in an inner
> loop something like this:
>
>  next:
>   adc eax,ebx
>   mov ebx,[edx+ecx*4]    ; First cycle
>
>   mov [edi+ecx*4],eax
>   mov eax,[esi+ecx*4]    ; Second cycle
>
>   inc ecx
>    jnz next        ; Third cycle
>

In the same bad old days, the standard way to speed this up would have
been unrolling, but until we got more registers it would have run out of
registers very quickly. With AVX2 we could use 4 64-bit slots in a 32-byte
register, but then we would have needed to handle the carry propagation
manually, and that would take longer than a series of ADC/ADX instructions.

next4:
mov eax,[esi]
adc eax,[esi+edx]
mov [esi+edi],eax
mov eax,[esi+4]
adc eax,[esi+edx+4]
mov [esi+edi+4],eax
mov eax,[esi+8]
adc eax,[esi+edx+8]
mov [esi+edi+8],eax
mov eax,[esi+12]
adc eax,[esi+edx+12]
mov [esi+edi+12],eax
lea esi,[esi+16]
dec ecx
jnz next4

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: "Mini" tags to reduce the number of op codes

<199b6b43601e431845e8f520863bcf85@www.novabbs.org>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=38295&group=comp.arch#38295

  copy link   Newsgroups: comp.arch
Date: Mon, 15 Apr 2024 19:03:34 +0000
Subject: Re: "Mini" tags to reduce the number of op codes
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$NpIl1nlNYAV7ldHX6T4h2OeH1DZA0Ctiz0Iq5DjlHrXQgqUY3ON.u
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <uuk100$inj$1@dont-email.me> <lf441jt9i2lv7olvnm9t7bml2ib19eh552@4ax.com> <uuv1ir$30htt$1@dont-email.me> <d71c59a1e0342d0d01f8ce7c0f449f9b@www.novabbs.org> <uv02dn$3b6ik$1@dont-email.me> <uv415n$ck2j$1@dont-email.me> <uv46rg$e4nb$1@dont-email.me> <a81256dbd4f121a9345b151b1280162f@www.novabbs.org> <uv4ghh$gfsv$1@dont-email.me> <8e61b7c856aff15374ab3cc55956be9d@www.novabbs.org> <uv5err$ql29$1@dont-email.me> <e43623eb10619eb28a68b2bd7af93390@www.novabbs.org> <S%zRN.162255$_a1e.120745@fx16.iad> <8b6bcc78355b8706235b193ad2243ad0@www.novabbs.org> <20240411141324.0000090d@yahoo.com> <uv9ahu$1r74h$1@dont-email.me> <0b785ebc54c76e3a10316904c3febba5@www.novabbs.org> <20240412021904.000074f8@yahoo.com>
Organization: Rocksolid Light
Message-ID: <199b6b43601e431845e8f520863bcf85@www.novabbs.org>
 by: MitchAlsup1 - Mon, 15 Apr 2024 19:03 UTC

Michael S wrote:

> On Thu, 11 Apr 2024 18:46:54 +0000
> mitchalsup@aol.com (MitchAlsup1) wrote:

>>
>> > On 4/11/2024 6:13 AM, Michael S wrote:
>> >> On Wed, 10 Apr 2024 23:30:02 +0000
>> >> mitchalsup@aol.com (MitchAlsup1) wrote:
>> >>
>> >>>
>> >>>> It does occupy some icache space, however; have you boosted the
>> >>>> icache size to compensate?
>> >>>
>> >>> The space occupied in the ICache is freed up from being in the
>> >>> DCache so the overall hit rate goes up !! At typical sizes,
>> >>> ICache miss rate is about ¼ the miss rate of DCache.
>> >>>
>> >>> Besides:: if you had to LD the constant from memory, you use a LD
>> >>> instruction and 1 or 2 words in DCache, while consuming a GPR. So,
>> >>> overall, it takes fewer cycles, fewer GPRs, and fewer
>> >>> instructions.
>> >>>
>> >>> Alternatively:: if you paste constants together (LUI, AUIPC) you
>> >>> have no direct route to either 64-bit constants or 64-bit address
>> >>> spaces.
>> >>>
>> >>> It looks to be a win-win !!
>> >>
>> >> Win-win under constraints of Load-Store Arch. Otherwise, it
>> >> depends.
>>
>> Never seen a LD-OP architecture where the inbound memory can be in
>> the Rs1 position of the instruction.
>>

> Maybe. But out of the 6 major integer OPs it matters only for SUB.
> By now I don't remember for sure, but I think I once saw a LD-OP
> architecture that had a SUBR instruction. Maybe the TI TMS320C30?
> It was 30 years ago and my memory is not what it used to be.

That a SUBR instruction exists does not refute my statement that
the inbound memory reference was never in the Rs1 position.
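
A tiny C illustration (mine, with hypothetical function names, merely
restating the point under discussion): with the memory operand of a
load-op instruction conventionally landing in the Rs2 slot, the
commutative operations do not care about operand order, and a reverse
subtract is what restores the missing mem - reg form.

#include <stdint.h>

/* r1 plays the register operand, *m the inbound memory operand. */
uint64_t op_add (uint64_t r1, const uint64_t *m) { return r1 + *m; } /* commutative, order irrelevant   */
uint64_t op_sub (uint64_t r1, const uint64_t *m) { return r1 - *m; } /* reg - mem, the usual LD-OP form */
/* mem - reg: the form a SUBR provides, since memory cannot occupy Rs1 */
uint64_t op_subr(uint64_t r1, const uint64_t *m) { return *m - r1; }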

Re: "Mini" tags to reduce the number of op codes

<983c789e7c6d9f3ca4ffe40fdc3aa709@www.novabbs.org>

https://news.novabbs.org/devel/article-flat.php?id=38296&group=comp.arch#38296

 by: MitchAlsup1 - Mon, 15 Apr 2024 20:55 UTC

Terje Mathisen wrote:

> MitchAlsup1 wrote:
>>

> In the non-OoO (i.e. Pentium) days, I would have inverted the loop in
> order to hide the latencies as much as possible, resulting in an inner
> loop something like this:

> next:
> adc eax,ebx
> mov ebx,[edx+ecx*4] ; First cycle

> mov [edi+ecx*4],eax
> mov eax,[esi+ecx*4] ; Second cycle

> inc ecx
> jnz next ; Third cycle

> Terje

As opposed to::

    .global mpn_add_n
mpn_add_n:
    MOV   R5,#0               // c
    MOV   R6,#0               // i

    VEC   R7,{}
    LDD   R8,[R2,Ri<<3]       // Load 128-to-512 bits
    LDD   R9,[R3,Ri<<3]       // Load 128-to-512 bits
    CARRY R5,{{IO}}
    ADD   R10,R8,R9           // Add pair to add octal
    STD   R10,[R1,Ri<<3]      // Store 128-to-512 bits
    LOOP  LT,R6,#1,R4         // increment 2-to-8 times
    RET

--------------------------------------------------------

    LDD   R8,[R2,Ri<<3]       // AGEN cycle 1
    LDD   R9,[R3,Ri<<3]       // AGEN cycle 2 data cycle 4
    CARRY R5,{{IO}}
    ADD   R10,R8,R9           // cycle 4
    STD   R10,[R1,Ri<<3]      // AGEN cycle 3 write cycle 5
    LOOP  LT,R6,#1,R4         // cycle 3

OR

LDD LDd
LDD LDd
ADD
ST STd
LOOP
LDD LDd
LDD LDd
ADD
ST STd
LOOP

10 instructions (2 iterations) in 4 clocks on a 64-bit 1-wide VVM machine !!
without code scheduling heroics.

40 instructions (8 iterations) in 4 clocks on a 512 wide SIMD VVM machine !!
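
Working out the throughput those numbers imply (my own arithmetic,
assuming each iteration retires one 64-bit word of the sum), which is
exactly the figure questioned in the follow-up below:

#include <stdio.h>

int main(void)
{
    /* 2 iterations in 4 clocks on the 1-wide machine,
       8 iterations in 4 clocks on the 512-bit SIMD machine */
    printf("1-wide VVM : %.1f clocks per 64-bit add\n", 4.0 / 2.0);
    printf("512-b SIMD : %.1f clocks per 64-bit add\n", 4.0 / 8.0);
    return 0;
}

That is one 64-bit add every two clocks in the 1-wide case and one every
half clock in the SIMD case.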

Re: "Mini" tags to reduce the number of op codes

<uvk5bf$gb3v$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=38297&group=comp.arch#38297

 by: BGB-Alt - Mon, 15 Apr 2024 21:14 UTC

On 4/12/2024 10:17 PM, MitchAlsup1 wrote:
> BGB wrote:
>
>> On 4/11/2024 8:40 PM, MitchAlsup1 wrote:
>>> BGB wrote:
>>>
>>>> On 4/11/2024 6:06 PM, MitchAlsup1 wrote:
>>>>>
>>>>>
>>>>> While I admit that <basically> anything bigger than 50-bits will be
>>>>> fine
>>>>> as displacements, they are not fine for constants and especially FP
>>>>> constants and many bit twiddling constants.
>>>>>
>>>
>>>> The number of cases where this comes up is not statistically
>>>> significant enough to have a meaningful impact on performance.
>>>
>>>> Fraction of a percent edge-cases are not deal-breakers, as I see it.
>>>
>>> Idle speculation::
>>>
>>>      .globl    r8_erf                          ; -- Begin function r8_erf
>>>      .type    r8_erf,@function
>>> r8_erf:                                 ; @r8_erf
>>> ; %bb.0:
>>>      add    sp,sp,#-128
>>>      std    #4614300636657501161,[sp,88]    // a[0]
>>>      std    #4645348406721991307,[sp,104]    // a[2]
>>>      std    #4659275911028085274,[sp,112]    // a[3]
>>>      std    #4595861367557309218,[sp,120]    // a[4]
>>>      std    #4599171895595656694,[sp,40]    // p[0]
>>>      std    #4593699784569291823,[sp,56]    // p[2]
>>>      std    #4580293056851789237,[sp,64]    // p[3]
>>>      std    #4559215111867327292,[sp,72]    // p[4]
>>>      std    #4580359811580069319,[sp,80]    // p[4]
>>>      std    #4612966212090462427,[sp]    // q[0]
>>>      std    #4602930165995154489,[sp,16]    // q[2]
>>>      std    #4588882433176075751,[sp,24]    // q[3]
>>>      std    #4567531038595922641,[sp,32]    // q[4]
>>>      fabs    r2,r1
>>>      fcmp    r3,r2,#0x3EF00000        // thresh
>>>      bnlt    r3,.LBB141_6
>>> ; %bb.1:
>>>      fcmp    r3,r2,#4            // xabs <= 4.0
>>>      bnlt    r3,.LBB141_7
>>> ; %bb.2:
>>>      fcmp    r3,r2,#0x403A8B020C49BA5E    // xbig
>>>      bngt    r3,.LBB141_11
>>> ; %bb.3:
>>>      fmul    r3,r1,r1
>>>      fdiv    r3,#1,r3
>>>      mov    r4,#0x3F90B4FB18B485C7        // p[5]
>>>      fmac    r4,r3,r4,#0x3FD38A78B9F065F6    // p[0]
>>>      fadd    r5,r3,#0x40048C54508800DB    // q[0]
>>>      fmac    r6,r3,r4,#0x3FD70FE40E2425B8    // p[1]
>>>      fmac    r4,r3,r5,#0x3FFDF79D6855F0AD    // q[1]
>>>      fmul    r4,r3,r4
>>>      fmul    r6,r3,r6
>>>      mov    r5,#2
>>>      add    r7,sp,#40            // p[*]
>>>      add    r8,sp,#0            // q[*]
>>> LBB141_4:                              ; %._crit_edge11
>>>                                         ; =>This Inner Loop Header: Depth=1
>>>      vec    r9,{r4,r6}
>>>      ldd    r10,[r7,r5<<3,0]        // p[*]
>>>      ldd    r11,[r8,r5<<3,0]        // q[*]
>>>      fadd    r6,r6,r10
>>>      fadd    r4,r4,r11
>>>      fmul    r4,r3,r4
>>>      fmul    r6,r3,r6
>>>      loop    ne,r5,#4,#1
>>> ; %bb.5:
>>>      fadd    r5,r6,#0x3F4595FD0D71E33C    // p[4]
>>>      fmul    r3,r3,r5
>>>      fadd    r4,r4,#0x3F632147A014BAD1    // q[4]
>>>      fdiv    r3,r3,r4
>>>      fadd    r3,#0x3FE20DD750429B6D,-r3    // c[0]
>>>      fdiv    r3,r3,r2
>>>      br    .LBB141_10            // common tail
>>> LBB141_6:                              ; %._crit_edge
>>>      fmul    r3,r1,r1
>>>      fcmp    r2,r2,#0x3C9FFE5AB7E8AD5E    // xsmall
>>>      sra    r2,r2,<1:13>
>>>      cvtsd    r4,#0
>>>      mux    r2,r2,r3,r4
>>>      mov    r3,#0x3FC7C7905A31C322        // a[4]
>>>      fmac    r3,r2,r3,#0x400949FB3ED443E9    // a[0]
>>>      fmac    r3,r2,r3,#0x405C774E4D365DA3    // a[1]
>>>      ldd    r4,[sp,104]            // a[2]
>>>      fmac    r3,r2,r3,r4
>>>      fadd    r4,r2,#0x403799EE342FB2DE    // b[0]
>>>      fmac    r4,r2,r4,#0x406E80C9D57E55B8    // b[1]
>>>      fmac    r4,r2,r4,#0x40940A77529CADC8    // b[2]
>>>      fmac    r3,r2,r3,#0x40A912C1535D121A    // a[3]
>>>      fmul    r1,r3,r1
>>>      fmac    r2,r2,r4,#0x40A63879423B87AD    // b[3]
>>>      fdiv    r2,r1,r2
>>>      mov    r1,r2
>>>      add    sp,sp,#128
>>>      ret                // 68
>>> LBB141_7:
>>>      fmul    r3,r2,#0x3E571E703C5F5815    // c[8]
>>>      mov    r5,#0
>>>      mov    r4,r2
>>> LBB141_8:                              ; =>This Inner Loop Header: Depth=1
>>>      vec    r6,{r3,r4}
>>>      ldd    r7,[ip,r5<<3,.L__const.r8_erf.c]// c[*]
>>>      fadd    r3,r3,r7
>>>      fmul    r3,r2,r3
>>>      ldd    r7,[ip,r5<<3,.L__const.r8_erf.d]// d[*]
>>>      fadd    r4,r4,r7
>>>      fmul    r4,r2,r4
>>>      loop    ne,r5,#7,#1
>>> ; %bb.9:
>>>      fadd    r3,r3,#0x4093395B7FD2FC8E    // c[7]
>>>      fadd    r4,r4,#0x4093395B7FD35F61    // d[7]
>>>      fdiv    r3,r3,r4
>>> LBB141_10:                // common tail
>>>      fmul    r4,r2,#0x41800000        // 16.0
>>>      fmul    r4,r4,#0x3D800000        // 1/16.0
>>>      cvtds    r4,r4                // (signed)double
>>>      cvtsd    r4,r4                // (double)signed
>>>      fadd    r5,r2,-r4
>>>      fadd    r2,r2,r4
>>>      fmul    r4,r4,-r4
>>>      fexp    r4,r4                // exp()
>>>      fmul    r2,r2,-r5
>>>      fexp    r2,r2                // exp()
>>>      fmul    r2,r4,r2
>>>      fadd    r2,#0,-r2
>>>      fmac    r2,r2,r3,#0x3F000000        // 0.5
>>>      fadd    r2,r2,#0x3F000000        // 0.5
>>>      pflt    r1,0,T
>>>      fadd    r2,#0,-r2
>>>      mov    r1,r2
>>>      add    sp,sp,#128
>>>      ret
>>> LBB141_11:
>>>      fcmp    r1,r1,#0
>>>      sra    r1,r1,<1:13>
>>>      cvtsd    r2,#-1                // (double)-1
>>>      cvtsd    r3,#1                // (double)+1
>>>      mux    r2,r1,r3,r2
>>>      mov    r1,r2
>>>      add    sp,sp,#128
>>>      ret
>>> Lfunc_end141:
>>>      .size    r8_erf, .Lfunc_end141-r8_erf
>>>                                         ; -- End function
>
>> These patterns seem rather unusual...
>> Don't really know the ABI.
>
>> Patterns don't really fit observations for typical compiler output
>> though (mostly in the FP constants, and particular ones that fall
>> outside the scope of what can be exactly represented as Binary16 or
>> similar, are rare).
>
> You are N E V E R going to find the coefficients of a Chebyshev
> polynomial to fit in a small FP container; excepting the very
> occasional C0 or C1 term {which are mostly 1.0 and 0.0}
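
A quick check of that claim in C (my own throwaway test, not from the
thread), using the p[0] constant from the r8_erf listing above: a
full-precision polynomial coefficient generally needs all 52 significand
bits, so it does not survive a round trip through a small FP container
such as binary32, let alone binary16.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    uint64_t bits = 0x3FD38A78B9F065F6ULL;  /* p[0] from the listing above */
    double d;
    memcpy(&d, &bits, sizeof d);            /* reinterpret the bit pattern */

    float  f    = (float)d;                 /* squeeze into binary32 ...   */
    double back = (double)f;                /* ... and widen it again      */

    printf("double    : %.17g\n", d);
    printf("via float : %.17g\n", back);
    printf("exact round trip: %s\n", (d == back) ? "yes" : "no");
    return 0;
}

On an IEEE-754 machine this prints "no": the low 29 significand bits of
that constant are not all zero, so binary32 cannot represent it exactly,
which is why such coefficients want a full 64-bit immediate (or a memory
load).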


Re: "Mini" tags to reduce the number of op codes

<uvl6oa$qbkb$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=38300&group=comp.arch#38300

 by: Terje Mathisen - Tue, 16 Apr 2024 06:44 UTC

MitchAlsup1 wrote:
> Terje Mathisen wrote:
>
>> MitchAlsup1 wrote:
>>>
>
>> In the non-OoO (i.e. Pentium) days, I would have inverted the loop in
>> order to hide the latencies as much as possible, resulting in an inner
>> loop something like this:
>
>>   next:
>>    adc eax,ebx
>>    mov ebx,[edx+ecx*4]    ; First cycle
>
>>    mov [edi+ecx*4],eax
>>    mov eax,[esi+ecx*4]    ; Second cycle
>
>>    inc ecx
>>    jnz next        ; Third cycle
>
>> Terje
>
> As opposed to::
>
>     .global mpn_add_n
> mpn_add_n:
>     MOV   R5,#0     // c
>     MOV   R6,#0     // i
>
>     VEC   R7,{}
>     LDD   R8,[R2,Ri<<3]       // Load 128-to-512 bits
>     LDD   R9,[R3,Ri<<3]       // Load 128-to-512 bits
>     CARRY R5,{{IO}}
>     ADD   R10,R8,R9           // Add pair to add octal
>     STD   R10,[R1,Ri<<3]      // Store 128-to-512 bits
>     LOOP  LT,R6,#1,R4         // increment 2-to-8 times
>     RET
>
> --------------------------------------------------------
>
>     LDD   R8,[R2,Ri<<3]       // AGEN cycle 1
>     LDD   R9,[R3,Ri<<3]       // AGEN cycle 2 data cycle 4
>     CARRY R5,{{IO}}
>     ADD   R10,R8,R9           // cycle 4
>     STD   R10,[R1,Ri<<3]      // AGEN cycle 3 write cycle 5
>     LOOP  LT,R6,#1,R4         // cycle 3
>
> OR
>
>     LDD       LDd
>          LDD       LDd                    ADD
>               ST        STd
>               LOOP
>                    LDD       LDd
>                         LDD       LDd
> ADD
>                              ST        STd
>                              LOOP
>
> 10 instructions (2 iterations) in 4 clocks on a 64-bit 1-wide VVM
> machine !!
> without code scheduling heroics.
>
> 40 instructions (8 iterations) in 4 clocks on a 512 wide SIMD VVM
> machine !!

It all comes down to the carry propagation, right?

The way I understood the original code, you are doing a very wide
unsigned add, so you need a carry to propagate from each and every block
to the next, right?

If you can do that at half a clock cycle per 64 bit ADD, then consider
me very impressed!

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

