devel / comp.arch / Re: "Mini" tags to reduce the number of op codes

Subject -- Author
* "Mini" tags to reduce the number of op codes -- Stephen Fuld
+* Re: "Mini" tags to reduce the number of op codes -- Anton Ertl
|+* Re: "Mini" tags to reduce the number of op codes -- MitchAlsup1
||`* Re: "Mini" tags to reduce the number of op codes -- Terje Mathisen
|| +- Re: "Mini" tags to reduce the number of op codes -- Terje Mathisen
|| `* Re: "Mini" tags to reduce the number of op codes -- MitchAlsup1
||  `* Re: "Mini" tags to reduce the number of op codes -- Terje Mathisen
||   `- Re: "Mini" tags to reduce the number of op codes -- MitchAlsup1
|`- Re: "Mini" tags to reduce the number of op codes -- Stephen Fuld
+* Re: "Mini" tags to reduce the number of op codes -- EricP
|`* Re: "Mini" tags to reduce the number of op codes -- Stephen Fuld
| `- Re: "Mini" tags to reduce the number of op codes -- MitchAlsup1
+* Re: "Mini" tags to reduce the number of op codes -- Thomas Koenig
|`* Re: "Mini" tags to reduce the number of op codes -- Stephen Fuld
| `- Re: "Mini" tags to reduce the number of op codes -- BGB-Alt
+* Re: "Mini" tags to reduce the number of op codes -- BGB-Alt
|+* Re: "Mini" tags to reduce the number of op codes -- MitchAlsup1
||+* Re: "Mini" tags to reduce the number of op codes -- Terje Mathisen
|||+* Re: "Mini" tags to reduce the number of op codes -- Michael S
||||`* Re: "Mini" tags to reduce the number of op codes -- Terje Mathisen
|||| `- Re: "Mini" tags to reduce the number of op codes -- Michael S
|||`* Re: "Mini" tags to reduce the number of op codes -- BGB-Alt
||| `* Re: "Mini" tags to reduce the number of op codes -- MitchAlsup1
|||  `- Re: "Mini" tags to reduce the number of op codes -- BGB
||`- Re: "Mini" tags to reduce the number of op codes -- Stephen Fuld
|`* Re: "Mini" tags to reduce the number of op codes -- MitchAlsup1
| +- Re: "Mini" tags to reduce the number of op codes -- Scott Lurndal
| `- Re: "Mini" tags to reduce the number of op codes -- BGB
+* Re: "Mini" tags to reduce the number of op codes -- John Savard
|+- Re: "Mini" tags to reduce the number of op codes -- BGB-Alt
|`* Re: "Mini" tags to reduce the number of op codes -- MitchAlsup1
| `* Re: "Mini" tags to reduce the number of op codes -- John Savard
|  +* Re: "Mini" tags to reduce the number of op codes -- MitchAlsup1
|  |`* Re: "Mini" tags to reduce the number of op codes -- John Savard
|  | +* Re: "Mini" tags to reduce the number of op codes -- Thomas Koenig
|  | |`- Re: "Mini" tags to reduce the number of op codes -- John Savard
|  | `* Re: "Mini" tags to reduce the number of op codes -- MitchAlsup1
|  |  `- Re: "Mini" tags to reduce the number of op codes -- John Savard
|  `* Re: "Mini" tags to reduce the number of op codes -- Thomas Koenig
|   `* Re: "Mini" tags to reduce the number of op codes -- MitchAlsup1
|    `* Re: "Mini" tags to reduce the number of op codes -- Thomas Koenig
|     +- Re: "Mini" tags to reduce the number of op codes -- Anton Ertl
|     `* Re: "Mini" tags to reduce the number of op codes -- Thomas Koenig
|      +* Re: "Mini" tags to reduce the number of op codes -- BGB
|      |`* Re: "Mini" tags to reduce the number of op codes -- MitchAlsup1
|      | `* Re: "Mini" tags to reduce the number of op codes -- BGB-Alt
|      |  +* Re: "Mini" tags to reduce the number of op codes -- MitchAlsup1
|      |  |+* Re: "Mini" tags to reduce the number of op codes -- BGB
|      |  ||`* Re: "Mini" tags to reduce the number of op codes -- MitchAlsup1
|      |  || +* Re: "Mini" tags to reduce the number of op codes -- Scott Lurndal
|      |  || |+- Re: "Mini" tags to reduce the number of op codes -- BGB-Alt
|      |  || |+* Re: "Mini" tags to reduce the number of op codes -- MitchAlsup1
|      |  || ||`* Re: "Mini" tags to reduce the number of op codes -- Michael S
|      |  || || `* Re: "Mini" tags to reduce the number of op codes -- BGB
|      |  || ||  `* Re: "Mini" tags to reduce the number of op codes -- MitchAlsup1
|      |  || ||   +* Re: "Mini" tags to reduce the number of op codes -- BGB-Alt
|      |  || ||   |`* Re: "Mini" tags to reduce the number of op codes -- MitchAlsup1
|      |  || ||   | `* Re: "Mini" tags to reduce the number of op codes -- BGB
|      |  || ||   |  `* Re: "Mini" tags to reduce the number of op codes -- MitchAlsup1
|      |  || ||   |   `* Re: "Mini" tags to reduce the number of op codes -- BGB
|      |  || ||   |    +- Re: "Mini" tags to reduce the number of op codes -- MitchAlsup1
|      |  || ||   |    `* Re: "Mini" tags to reduce the number of op codes -- MitchAlsup1
|      |  || ||   |     +- Re: "Mini" tags to reduce the number of op codes -- BGB
|      |  || ||   |     `- Re: "Mini" tags to reduce the number of op codes -- BGB-Alt
|      |  || ||   `* Re: "Mini" tags to reduce the number of op codes -- Michael S
|      |  || ||    +* Re: "Mini" tags to reduce the number of op codes -- Scott Lurndal
|      |  || ||    |`- Re: "Mini" tags to reduce the number of op codes -- Michael S
|      |  || ||    `- Re: "Mini" tags to reduce the number of op codes -- MitchAlsup1
|      |  || |`- Re: "Mini" tags to reduce the number of op codes -- Terje Mathisen
|      |  || `* Re: "Mini" tags to reduce the number of op codes -- BGB-Alt
|      |  ||  `* Re: "Mini" tags to reduce the number of op codes -- MitchAlsup1
|      |  ||   `- Re: "Mini" tags to reduce the number of op codes -- BGB
|      |  |`* Re: "Mini" tags to reduce the number of op codes -- Paul A. Clayton
|      |  | +- Re: "Mini" tags to reduce the number of op codes -- BGB
|      |  | `* Re: "Mini" tags to reduce the number of op codes -- Scott Lurndal
|      |  |  +* Re: "Mini" tags to reduce the number of op codes -- BGB-Alt
|      |  |  |`- Re: "Mini" tags to reduce the number of op codes -- MitchAlsup1
|      |  |  +* Re: "Mini" tags to reduce the number of op codes -- MitchAlsup1
|      |  |  |`- Re: "Mini" tags to reduce the number of op codes -- Paul A. Clayton
|      |  |  `- Re: "Mini" tags to reduce the number of op codes -- Paul A. Clayton
|      |  `* Re: "Mini" tags to reduce the number of op codes -- Chris M. Thomasson
|      |   `* Re: "Mini" tags to reduce the number of op codes -- BGB
|      |    `* Re: "Mini" tags to reduce the number of op codes -- Chris M. Thomasson
|      |     `- Re: "Mini" tags to reduce the number of op codes -- BGB-Alt
|      `- Re: "Mini" tags to reduce the number of op codes -- Brian G. Lucas
`- Re: "Mini" tags to reduce the number of op codes -- MitchAlsup1

Re: "Mini" tags to reduce the number of op codes

https://news.novabbs.org/devel/article-flat.php?id=38249&group=comp.arch#38249

Newsgroups: comp.arch
From: quadibloc@servername.invalid (John Savard)
Newsgroups: comp.arch
Subject: Re: "Mini" tags to reduce the number of op codes
Date: Mon, 08 Apr 2024 07:05:35 -0600
Organization: A noiseless patient Spider
Lines: 34
Message-ID: <e8q71jljlep537vm7tbue7ch37o9q66l8k@4ax.com>
References: <uuk100$inj$1@dont-email.me> <6mqu0j1jf5uabmm6r2cb2tqn6ng90mruvd@4ax.com> <15d1f26c4545f1dbae450b28e96e79bd@www.novabbs.org> <lf441jt9i2lv7olvnm9t7bml2ib19eh552@4ax.com> <9280b28665576d098af53a9416604e36@www.novabbs.org>
 by: John Savard - Mon, 8 Apr 2024 13:05 UTC

On Sun, 7 Apr 2024 20:41:45 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

>How do you attach 32-bit or 64-bit constants to 28-bit instructions ??

Yes, that's a problem. Presumably, I would have to do without
immediates.

An option would be to reserve some 16-bit codes to indicate a block
consisting of one 28-bit instruction and seven 32-bit instructions,
but that means a third instruction set.

>How do you switch from 64-bit to Byte to 32-bit to 16-bit in one
>set of 256-bit instruction decodes ??

By using 36-bit instructions instead of 28-bit instructions.

>In complicated if-then-else codes (and switches) I often see one inst-
>ruction followed by a branch to a common point. Does your encoding deal
>with these efficiently ?? That is:: what happens when you jump to the
>middle of a block of 36-bit instructions ??

Well, when the computer fetches a 256-bit block of code, the first
four bits indicate whether it is composed of 36-bit instructions or
28-bit instructions. So the computer knows where the instructions are,
and thus a convention can be applied, such as addressing each 36-bit
instruction by the addresses of the first seven 32-bit positions in
the block.

In the case of 28-bit instructions, the first eight correspond to the
32-bit positions, the ninth corresponds to the last 16 bits of the
block.
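[Editor's note: the addressing convention sketched above can be illustrated roughly as follows. The tag value, helper name, and exact field layout are assumptions for illustration only, not part of the actual proposal.]

```c
#include <stdint.h>

/* Sketch of the described convention: a 256-bit fetch block whose
   first 4 bits select the format. 4 + 7*36 = 256 for the 36-bit
   format; 4 + 9*28 = 256 for the 28-bit format. The tag value
   FMT_36BIT is an assumed placeholder. */

#define FMT_36BIT 0xFu

/* Map a 32-bit word position (0..7) within a block to an instruction
   slot; -1 means no instruction is addressable at that position. */
static int slot_for_word(unsigned fmt_tag, unsigned word_idx)
{
    if (fmt_tag == FMT_36BIT) {
        /* Seven 36-bit instructions, addressed by the first seven
           32-bit positions in the block. */
        return word_idx < 7 ? (int)word_idx : -1;
    }
    /* 28-bit format: nine instructions; the first eight correspond
       to the 32-bit positions, and the ninth is addressed by the
       last 16 bits of the block (outside this helper's view). */
    return word_idx < 8 ? (int)word_idx : -1;
}
```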

John Savard

Re: "Mini" tags to reduce the number of op codes

https://news.novabbs.org/devel/article-flat.php?id=38250&group=comp.arch#38250

Newsgroups: comp.arch
From: tkoenig@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: "Mini" tags to reduce the number of op codes
Date: Mon, 8 Apr 2024 17:25:38 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 8
Message-ID: <uv19ai$3kitp$1@dont-email.me>
References: <uuk100$inj$1@dont-email.me>
<6mqu0j1jf5uabmm6r2cb2tqn6ng90mruvd@4ax.com>
<15d1f26c4545f1dbae450b28e96e79bd@www.novabbs.org>
<lf441jt9i2lv7olvnm9t7bml2ib19eh552@4ax.com>
<9280b28665576d098af53a9416604e36@www.novabbs.org>
<e8q71jljlep537vm7tbue7ch37o9q66l8k@4ax.com>
 by: Thomas Koenig - Mon, 8 Apr 2024 17:25 UTC

John Savard <quadibloc@servername.invalid> schrieb:

> Well, when the computer fetches a 256-bit block of code, the first
> four bits indicates whether it is composed of 36-bit instructions or
> 28-bit instructions.

Do you think that instructions which require a certain size (almost)
always happen to be situated together so they fit in a block?

Re: "Mini" tags to reduce the number of op codes

https://news.novabbs.org/devel/article-flat.php?id=38251&group=comp.arch#38251

Newsgroups: comp.arch
Date: Mon, 8 Apr 2024 19:56:27 +0000
Subject: Re: "Mini" tags to reduce the number of op codes
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
References: <uuk100$inj$1@dont-email.me> <6mqu0j1jf5uabmm6r2cb2tqn6ng90mruvd@4ax.com> <15d1f26c4545f1dbae450b28e96e79bd@www.novabbs.org> <lf441jt9i2lv7olvnm9t7bml2ib19eh552@4ax.com> <9280b28665576d098af53a9416604e36@www.novabbs.org> <e8q71jljlep537vm7tbue7ch37o9q66l8k@4ax.com>
Organization: Rocksolid Light
Message-ID: <ab4e76f2dc47f737941f9c385220f2a8@www.novabbs.org>
 by: MitchAlsup1 - Mon, 8 Apr 2024 19:56 UTC

John Savard wrote:

> On Sun, 7 Apr 2024 20:41:45 +0000, mitchalsup@aol.com (MitchAlsup1)
> wrote:

>>In complicated if-then-else codes (and switches) I often see one inst-
>>ruction followed by a branch to a common point. Does your encoding deal
>>with these efficiently ?? That is:: what happens when you jump to the
>>middle of a block of 36-bit instructions ??

> Well, when the computer fetches a 256-bit block of code, the first
> four bits indicates whether it is composed of 36-bit instructions or
> 28-bit instructions. So the computer knows where the instructions are;
> and thus a convention can be applied, such as addressing each 36-bit
> instruction by the addresses of the first seven 32-bit positions in
> the block.

So, instead of using the branch target address, one rounds it down to
a 256-bit boundary, reads 256 bits, looks at the first 4 bits to
determine the format, and then uses the branch offset to pick a
container which will become the first instruction executed.

Sounds more complicated than necessary.
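[Editor's note: the steps just described can be sketched as below. Byte addressing and a tag in the top nibble of the first byte are assumptions for illustration.]

```c
#include <stdint.h>

/* Step 1: round the branch target down to a 256-bit (32-byte)
   boundary to find the block to fetch. */
static uint64_t block_base(uint64_t target)
{
    return target & ~(uint64_t)31;
}

/* Step 2: once the block is fetched, its first 4 bits give the
   format (assumed here to sit in the top nibble of byte 0). */
static unsigned block_format(const uint8_t block[32])
{
    return block[0] >> 4;
}

/* Step 3: the leftover offset selects the 32-bit container that
   holds the first instruction to execute; the format determines how
   that container maps onto an instruction. */
static unsigned container_index(uint64_t target)
{
    return (unsigned)((target & 31) / 4);
}
```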

> In the case of 28-bit instructions, the first eight correspond to the
> 32-bit positions, the ninth corresponds to the last 16 bits of the
> block.

> John Savard

Re: "Mini" tags to reduce the number of op codes

https://news.novabbs.org/devel/article-flat.php?id=38252&group=comp.arch#38252

Newsgroups: comp.arch
From: tkoenig@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: "Mini" tags to reduce the number of op codes
Date: Tue, 9 Apr 2024 18:24:55 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 38
Message-ID: <uv415n$ck2j$1@dont-email.me>
References: <uuk100$inj$1@dont-email.me>
<6mqu0j1jf5uabmm6r2cb2tqn6ng90mruvd@4ax.com>
<15d1f26c4545f1dbae450b28e96e79bd@www.novabbs.org>
<lf441jt9i2lv7olvnm9t7bml2ib19eh552@4ax.com> <uuv1ir$30htt$1@dont-email.me>
<d71c59a1e0342d0d01f8ce7c0f449f9b@www.novabbs.org>
<uv02dn$3b6ik$1@dont-email.me>
 by: Thomas Koenig - Tue, 9 Apr 2024 18:24 UTC

I wrote:

> MitchAlsup1 <mitchalsup@aol.com> schrieb:
>> Thomas Koenig wrote:
>>
>>> John Savard <quadibloc@servername.invalid> schrieb:
>>
>>>> Thus, instead of having mode bits, one _could_ do the following:
>>>>
>>>> Usually, have 28 bit instructions that are shorter because there's
>>>> only one opcode for each floating and integer operation. The first
>>>> four bits in a block give the lengths of data to be used.
>>>>
>>>> But have one value for the first four bits in a block that indicates
>>>> 36-bit instructions instead, which do include type information, so
>>>> that very occasional instructions for rarely-used types can be mixed
>>>> in which don't fill a whole block.
>>>>
>>>> While that's a theoretical possibility, I don't view it as being
>>>> worthwhile in practice.
>>
>>> I played around a bit with another scheme: Encoding things into
>>> 128-bit blocks, with either 21-bit or 42-bit or longer instructions
>>> (or a block header with six bits, and 20 or 40 bits for each
>>> instruction).
>>
>> Not having seen said encoding scheme:: I suspect you used the Rd=Rs1
>> destructive operand model for the 21-bit encodings. Yes :: no ??
>
> It was not very well developed, I gave it up when I saw there wasn't
> much to gain.

Maybe one more thing: In order to justify the more complex encoding,
I was going for 64 registers, and that didn't work out too well
(missing bits).

Having learned about M-Core in the meantime, a pure 32-register,
21-bit instruction ISA might actually work better.

Re: "Mini" tags to reduce the number of op codes

https://news.novabbs.org/devel/article-flat.php?id=38253&group=comp.arch#38253

Newsgroups: comp.arch
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: "Mini" tags to reduce the number of op codes
Date: Tue, 9 Apr 2024 15:01:50 -0500
Organization: A noiseless patient Spider
Lines: 104
Message-ID: <uv46rg$e4nb$1@dont-email.me>
References: <uuk100$inj$1@dont-email.me>
<6mqu0j1jf5uabmm6r2cb2tqn6ng90mruvd@4ax.com>
<15d1f26c4545f1dbae450b28e96e79bd@www.novabbs.org>
<lf441jt9i2lv7olvnm9t7bml2ib19eh552@4ax.com> <uuv1ir$30htt$1@dont-email.me>
<d71c59a1e0342d0d01f8ce7c0f449f9b@www.novabbs.org>
<uv02dn$3b6ik$1@dont-email.me> <uv415n$ck2j$1@dont-email.me>
 by: BGB - Tue, 9 Apr 2024 20:01 UTC

On 4/9/2024 1:24 PM, Thomas Koenig wrote:
> I wrote:
>
>> MitchAlsup1 <mitchalsup@aol.com> schrieb:
>>> Thomas Koenig wrote:
>>>
>>>> John Savard <quadibloc@servername.invalid> schrieb:
>>>
>>>>> Thus, instead of having mode bits, one _could_ do the following:
>>>>>
>>>>> Usually, have 28 bit instructions that are shorter because there's
>>>>> only one opcode for each floating and integer operation. The first
>>>>> four bits in a block give the lengths of data to be used.
>>>>>
>>>>> But have one value for the first four bits in a block that indicates
>>>>> 36-bit instructions instead, which do include type information, so
>>>>> that very occasional instructions for rarely-used types can be mixed
>>>>> in which don't fill a whole block.
>>>>>
>>>>> While that's a theoretical possibility, I don't view it as being
>>>>> worthwhile in practice.
>>>
>>>> I played around a bit with another scheme: Encoding things into
>>>> 128-bit blocks, with either 21-bit or 42-bit or longer instructions
>>>> (or a block header with six bits, and 20 or 40 bits for each
>>>> instruction).
>>>
>>> Not having seen said encoding scheme:: I suspect you used the Rd=Rs1
>>> destructive operand model for the 21-bit encodings. Yes :: no ??
>>
>> It was not very well developed, I gave it up when I saw there wasn't
>> much to gain.
>
> Maybe one more thing: In order to justify the more complex encoding,
> I was going for 64 registers, and that didn't work out too well
> (missing bits).
>
> Having learned about M-Core in the meantime, pure 32-register,
> 21-bit instruction ISA might actually work better.

For 32-bit instructions at least, 64 GPRs can work out OK.

Though, the gain of 64 over 32 seems to be fairly small for most
"typical" code, mostly bringing a benefit if one is spending a lot of
CPU time in functions that have large numbers of local variables all
being used at the same time.

Seemingly:
16/32/48 bit instructions, with 32 GPRs, seems likely optimal for code
density;
32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
performance.

Where, 16 GPRs isn't really enough (lots of register spills), and 128
GPRs is wasteful (would likely need lots of monster functions with 250+
local variables to make effective use of this, *, which probably isn't
going to happen).

*: Where, it appears it is most efficient (for non-leaf functions) if
the number of local variables is roughly twice that of the number of CPU
registers. If more local variables than this, then spill/fill rate goes
up significantly, and if less, then the registers aren't utilized as
effectively.

Well, except in "tiny leaf" functions, where the criterion is instead
that the number of local variables be less than the number of scratch
registers. However, for many/most small leaf functions, the total number
of variables isn't all that large either.

Where, function categories:
Tiny Leaf:
Everything fits in scratch registers, no stack frame, no calls.
Leaf:
No function calls (either explicit or implicit);
Will have a stack frame.
Non-Leaf:
May call functions, has a stack frame.
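[Editor's note: the three categories above can be expressed as a toy classifier. The struct fields and the "implicit calls count as calls" reading are assumptions for illustration, not BGBCC's actual logic.]

```c
/* Toy classifier for the function categories described above. */

enum FuncKind { TINY_LEAF, LEAF, NON_LEAF };

struct FuncInfo {
    int makes_calls;   /* explicit or implicit (e.g. runtime helpers) */
    int live_values;   /* locals + temporaries needing registers */
};

enum FuncKind classify(struct FuncInfo f, int scratch_regs)
{
    if (f.makes_calls)
        return NON_LEAF;              /* may call; has a stack frame */
    if (f.live_values <= scratch_regs)
        return TINY_LEAF;             /* fits in scratch; no frame */
    return LEAF;                      /* no calls, but needs a frame */
}
```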

There is a "static assign everything" case in my case, where all of the
variables are statically assigned to registers (for the scope of the
function). This case typically requires that everything fit into callee
save registers, so (like the "tiny leaf" category) it requires that the
number of local variables is less than the available registers.

On a 32 register machine, if there are 14 available callee-save
registers, the limit is 14 variables. On a 64 register machine, this
limit might be 30 instead. This seems to have good coverage.

In the non-static case, the top N variables might be static-assigned,
and the remaining variables dynamically assigned. Though, it appears
this is more an artifact of my naive register allocator, and might not
be as effective of a strategy with an "actually clever" register
allocator (like those in GCC or LLVM), where purely dynamic allocation
may be better (they are able to carry dynamic assignments across basic
block boundaries, rather than needing to spill/fill everything whenever
a branch or label is encountered).

....

Re: "Mini" tags to reduce the number of op codes

https://news.novabbs.org/devel/article-flat.php?id=38254&group=comp.arch#38254

Newsgroups: comp.arch
Date: Tue, 9 Apr 2024 21:05:44 +0000
Subject: Re: "Mini" tags to reduce the number of op codes
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
References: <uuk100$inj$1@dont-email.me> <6mqu0j1jf5uabmm6r2cb2tqn6ng90mruvd@4ax.com> <15d1f26c4545f1dbae450b28e96e79bd@www.novabbs.org> <lf441jt9i2lv7olvnm9t7bml2ib19eh552@4ax.com> <uuv1ir$30htt$1@dont-email.me> <d71c59a1e0342d0d01f8ce7c0f449f9b@www.novabbs.org> <uv02dn$3b6ik$1@dont-email.me> <uv415n$ck2j$1@dont-email.me> <uv46rg$e4nb$1@dont-email.me>
Organization: Rocksolid Light
Message-ID: <a81256dbd4f121a9345b151b1280162f@www.novabbs.org>
 by: MitchAlsup1 - Tue, 9 Apr 2024 21:05 UTC

BGB wrote:

> On 4/9/2024 1:24 PM, Thomas Koenig wrote:
>> I wrote:
>>
>>> MitchAlsup1 <mitchalsup@aol.com> schrieb:
>>>> Thomas Koenig wrote:
>>>>
>> Maybe one more thing: In order to justify the more complex encoding,
>> I was going for 64 registers, and that didn't work out too well
>> (missing bits).
>>
>> Having learned about M-Core in the meantime, pure 32-register,
>> 21-bit instruction ISA might actually work better.

> For 32-bit instructions at least, 64 GPRs can work out OK.

> Though, the gain of 64 over 32 seems to be fairly small for most
> "typical" code, mostly bringing a benefit if one is spending a lot of
> CPU time in functions that have large numbers of local variables all
> being used at the same time.

> Seemingly:
> 16/32/48 bit instructions, with 32 GPRs, seems likely optimal for code
> density;
> 32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
> performance.

> Where, 16 GPRs isn't really enough (lots of register spills), and 128
> GPRs is wasteful (would likely need lots of monster functions with 250+
> local variables to make effective use of this, *, which probably isn't
> going to happen).

16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not part of
GPRs AND you have good access to constants.

> *: Where, it appears it is most efficient (for non-leaf functions) if
> the number of local variables is roughly twice that of the number of CPU
> registers. If more local variables than this, then spill/fill rate goes
> up significantly, and if less, then the registers aren't utilized as
> effectively.

> Well, except in "tiny leaf" functions, where the criteria is instead
> that the number of local variables be less than the number of scratch
> registers. However, for many/most small leaf functions, the total number
> of variables isn't all that large either.

The vast majority of leaf functions use less than 16 GPRs, given one has
a SP not part of GPRs {including arguments and return values}. Once one
starts placing things like memmove(), memset(), sin(), cos(), exp(), log()
in the ISA, it goes up even more.

> Where, function categories:
> Tiny Leaf:
> Everything fits in scratch registers, no stack frame, no calls.
> Leaf:
> No function calls (either explicit or implicit);
> Will have a stack frame.
> Non-Leaf:
> May call functions, has a stack frame.

You are forgetting about FP, GOT, TLS, and whatever resources are required
to do try-throw-catch stuff as demanded by the source language.

> There is a "static assign everything" case in my case, where all of the
> variables are statically assigned to registers (for the scope of the
> function). This case typically requires that everything fit into callee
> save registers, so (like the "tiny leaf" category, requires that the
> number of local variables is less than the available registers).

> On a 32 register machine, if there are 14 available callee-save
> registers, the limit is 14 variables. On a 64 register machine, this
> limit might be 30 instead. This seems to have good coverage.

The apparent number of registers goes up when one does not waste a register
to hold a use-once constant.

Re: "Mini" tags to reduce the number of op codes

https://news.novabbs.org/devel/article-flat.php?id=38255&group=comp.arch#38255

Newsgroups: comp.arch
From: bohannonindustriesllc@gmail.com (BGB-Alt)
Newsgroups: comp.arch
Subject: Re: "Mini" tags to reduce the number of op codes
Date: Tue, 9 Apr 2024 17:47:13 -0500
Organization: A noiseless patient Spider
Lines: 210
Message-ID: <uv4ghh$gfsv$1@dont-email.me>
References: <uuk100$inj$1@dont-email.me>
<6mqu0j1jf5uabmm6r2cb2tqn6ng90mruvd@4ax.com>
<15d1f26c4545f1dbae450b28e96e79bd@www.novabbs.org>
<lf441jt9i2lv7olvnm9t7bml2ib19eh552@4ax.com> <uuv1ir$30htt$1@dont-email.me>
<d71c59a1e0342d0d01f8ce7c0f449f9b@www.novabbs.org>
<uv02dn$3b6ik$1@dont-email.me> <uv415n$ck2j$1@dont-email.me>
<uv46rg$e4nb$1@dont-email.me>
<a81256dbd4f121a9345b151b1280162f@www.novabbs.org>
 by: BGB-Alt - Tue, 9 Apr 2024 22:47 UTC

On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
> BGB wrote:
>
>> On 4/9/2024 1:24 PM, Thomas Koenig wrote:
>>> I wrote:
>>>
>>>> MitchAlsup1 <mitchalsup@aol.com> schrieb:
>>>>> Thomas Koenig wrote:
>>>>>
>>> Maybe one more thing: In order to justify the more complex encoding,
>>> I was going for 64 registers, and that didn't work out too well
>>> (missing bits).
>>>
>>> Having learned about M-Core in the meantime, pure 32-register,
>>> 21-bit instruction ISA might actually work better.
>
>
>> For 32-bit instructions at least, 64 GPRs can work out OK.
>
>> Though, the gain of 64 over 32 seems to be fairly small for most
>> "typical" code, mostly bringing a benefit if one is spending a lot of
>> CPU time in functions that have large numbers of local variables all
>> being used at the same time.
>
>
>> Seemingly:
>> 16/32/48 bit instructions, with 32 GPRs, seems likely optimal for code
>> density;
>> 32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
>> performance.
>
>> Where, 16 GPRs isn't really enough (lots of register spills), and 128
>> GPRs is wasteful (would likely need lots of monster functions with
>> 250+ local variables to make effective use of this, *, which probably
>> isn't going to happen).
>
> 16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not part
> of GPRs AND you have good access to constants.
>

On the main ISA's I had tried to generate code for, 16 GPRs was kind of
a pain as it resulted in fairly high spill rates.

Though, it would probably be less bad if the compiler was able to use
all of the registers at the same time without stepping on itself (such
as dealing with register allocation involving scratch registers while
also not conflicting with the use of function arguments, ...).

My code generators had typically only used callee save registers for
variables in basic blocks which ended in a function call (in my compiler
design, both function calls and branches terminating the current
basic-block).

On SH, the main way of getting constants (larger than 8 bits) was via
PC-relative memory loads, which kinda sucked.

This is slightly less bad on x86-64, since one can use memory operands
with most instructions, and the CPU tends to deal fairly well with code
that has lots of spill-and-fill. This along with instructions having
access to 32-bit immediate values.

>> *: Where, it appears it is most efficient (for non-leaf functions) if
>> the number of local variables is roughly twice that of the number of
>> CPU registers. If more local variables than this, then spill/fill rate
>> goes up significantly, and if less, then the registers aren't utilized
>> as effectively.
>
>> Well, except in "tiny leaf" functions, where the criteria is instead
>> that the number of local variables be less than the number of scratch
>> registers. However, for many/most small leaf functions, the total
>> number of variables isn't all that large either.
>
> The vast majority of leaf functions use less than 16 GPRs, given one has
> a SP not part of GPRs {including arguments and return values}. Once one
> starts placing things like memove(), memset(), sin(), cos(), exp(), log()
> in the ISA, it goes up even more.
>

Yeah.

Things like memcpy/memmove/memset/etc, are function calls in cases when
not directly transformed into register load/store sequences.

Did end up with an intermediate "memcpy slide", which can handle medium
size memcpy and memset style operations by branching into a slide.
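[Editor's note: the "slide" idea can be roughly illustrated in C, using switch fallthrough in place of a computed branch into an unrolled run of copies. This is a sketch of the general technique, not BGB's actual implementation.]

```c
#include <stddef.h>
#include <stdint.h>

/* One unrolled run of word copies, entered partway through according
   to the remaining count, so a medium-sized copy needs only a single
   dispatch rather than a per-word loop. */
void copy_words_slide(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t i = 0;
    while (n - i >= 8) {              /* full 8-word groups first */
        dst[i+0]=src[i+0]; dst[i+1]=src[i+1];
        dst[i+2]=src[i+2]; dst[i+3]=src[i+3];
        dst[i+4]=src[i+4]; dst[i+5]=src[i+5];
        dst[i+6]=src[i+6]; dst[i+7]=src[i+7];
        i += 8;
    }
    switch (n - i) {                  /* "branch into the slide" */
    case 7: dst[i+6] = src[i+6]; /* fall through */
    case 6: dst[i+5] = src[i+5]; /* fall through */
    case 5: dst[i+4] = src[i+4]; /* fall through */
    case 4: dst[i+3] = src[i+3]; /* fall through */
    case 3: dst[i+2] = src[i+2]; /* fall through */
    case 2: dst[i+1] = src[i+1]; /* fall through */
    case 1: dst[i+0] = src[i+0]; /* fall through */
    case 0: break;
    }
}
```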

As noted, on a 32 GPR machine, most leaf functions can fit entirely in
scratch registers. On a 64 GPR machine, this percentage is slightly
higher (but, not significantly, since there are few leaf functions
remaining at this point).

If one had a 16 GPR machine with 6 usable scratch registers, it is a
little harder though (as typically these need to cover both any
variables used by the function, and any temporaries used, ...). There
are a whole lot more leaf functions that exceed a limit of 6 than of 14.

But, say, a 32 GPR machine could still do well here.

Note that there are reasons why I don't claim 64 GPRs as a large
performance advantage:
On programs like Doom, the difference is small at best.

It mostly affects things like GLQuake in my case, mostly because TKRA-GL
has a lot of functions with a large number of local variables (some
exceeding 100 local variables).

Partly though this is due to code that is highly inlined and unrolled
and uses lots of variables tending to perform better in my case (and
tightly looping code, with lots of small functions, not so much...).

>
>> Where, function categories:
>>    Tiny Leaf:
>>      Everything fits in scratch registers, no stack frame, no calls.
>>    Leaf:
>>      No function calls (either explicit or implicit);
>>      Will have a stack frame.
>>    Non-Leaf:
>>      May call functions, has a stack frame.
>
> You are forgetting about FP, GOT, TLS, and whatever resources are required
> to do try-throw-catch stuff as demanded by the source language.
>

Yeah, possibly true.

In my case:
  There is no frame pointer, as BGBCC doesn't use one;
  All stack-frames are fixed size, VLA's and alloca use the heap;
  GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
  TLS, accessed via TBR.

Try/throw/catch:
Mostly N/A for leaf functions.

Any function that can "throw" is, in effect, no longer a leaf function.
Implicitly, any function which uses "variant" or similar is also no
longer a leaf function.

Need for GBR save/restore effectively excludes a function from being
tiny-leaf. This may happen, say, if a function accesses global variables
and may be called as a function pointer.

>> There is a "static assign everything" case in my case, where all of
>> the variables are statically assigned to registers (for the scope of
>> the function). This case typically requires that everything fit into
>> callee save registers, so (like the "tiny leaf" category, requires
>> that the number of local variables is less than the available registers).
>
>> On a 32 register machine, if there are 14 available callee-save
>> registers, the limit is 14 variables. On a 64 register machine, this
>> limit might be 30 instead. This seems to have good coverage.
>
> The apparent number of registers goes up when one does not waste a register
> to hold a use-once constant.

Possibly true. In the "static assign everything" case, each constant
used is also assigned a register.

One "TODO" here would be to merge constants with the same "actual" value
into the same register. At present, they will be duplicated if the types
are sufficiently different (such as integer 0 vs NULL).
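One way to sketch that TODO: key the constant pool on the constant's 64-bit bit pattern rather than its source-level type, so integer 0 and NULL land in the same register. A minimal illustration (names and structure hypothetical, not BGBCC's internals):

```c
#include <stdint.h>

/* Hypothetical per-function constant pool: constants are merged by
   their 64-bit bit pattern, not by source type, so integer 0 and
   NULL share one slot (register). */
#define POOL_MAX 64

static uint64_t pool[POOL_MAX];
static int pool_count;

/* Return the pool slot for a constant, reusing an existing slot when
   the bit pattern matches; -1 if the pool is full. */
int pool_slot_for(uint64_t bits)
{
    for (int i = 0; i < pool_count; i++)
        if (pool[i] == bits)
            return i;
    if (pool_count == POOL_MAX)
        return -1;
    pool[pool_count] = bits;
    return pool_count++;
}
```

With this, a later request for NULL (all-zero bits) returns the same slot as an earlier integer 0, rather than duplicating it.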

For functions with dynamic assignment, immediate values are more likely
to be used. If the code-generator were clever, potentially it could
exclude assigning registers to constants which are only used by
instructions which can encode them directly as an immediate. Currently,
BGBCC is not that clever.

Or, say:
  y=x+31;  // 31 only being used here, and fits easily in an Imm9.
Ideally, the compiler could realize 31 does not need a register here.
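The check the compiler would need here can be sketched as a simple predicate; the range below assumes an unsigned 9-bit field (0..511), which is a guess rather than BJX2's actual encoding rule:

```c
#include <stdint.h>

/* Does a constant fit a hypothetical 9-bit unsigned immediate field?
   (Assumed range; the real ISA may use signed or scaled forms.) */
static int fits_imm9(int64_t v)
{
    return v >= 0 && v <= 511;
}

/* A constant needs a register only if some use of it cannot take it
   as an immediate directly. */
static int const_needs_register(int64_t v, int all_uses_take_imm)
{
    return !(all_uses_take_imm && fits_imm9(v));
}
```

So in the y=x+31 example, with the only use being an add that takes an Imm9, const_needs_register(31, 1) would come back false.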

Well, and another weakness is with temporaries that exist as function
arguments:
If statically assigned, the "target variable directly to argument
register" optimization can't be used (the value ends up needing to go
into a callee-save register and then be MOV'ed into the argument
register; otherwise the compiler breaks...).

Though, I guess the compiler could try to partition temporaries that are
used exclusively as function arguments into a different category from
"normal" temporaries (or those whose values may cross a basic-block
boundary), and then avoid statically assigning them (without somehow
causing this to effectively break the full-static-assignment scheme in
the process).

Though, IIRC, I had also considered the possibility of a temporary
"virtual assignment", allowing the argument value to be temporarily
assigned to a function argument register, then going "poof" and
disappearing when the function is called. Hadn't yet thought of a good
way to add this logic to the register allocator though.

But, yeah, compiler stuff is really fiddly...

Re: "Mini" tags to reduce the number of op codes

<8e61b7c856aff15374ab3cc55956be9d@www.novabbs.org>


https://news.novabbs.org/devel/article-flat.php?id=38256&group=comp.arch#38256

Date: Wed, 10 Apr 2024 00:28:02 +0000
Subject: Re: "Mini" tags to reduce the number of op codes
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$lnRaWvNXx4w8Y0Qr9mQiIegFGtVH1zO2p6VKGU.UNPtEmXAUqC8ei
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <uuk100$inj$1@dont-email.me> <6mqu0j1jf5uabmm6r2cb2tqn6ng90mruvd@4ax.com> <15d1f26c4545f1dbae450b28e96e79bd@www.novabbs.org> <lf441jt9i2lv7olvnm9t7bml2ib19eh552@4ax.com> <uuv1ir$30htt$1@dont-email.me> <d71c59a1e0342d0d01f8ce7c0f449f9b@www.novabbs.org> <uv02dn$3b6ik$1@dont-email.me> <uv415n$ck2j$1@dont-email.me> <uv46rg$e4nb$1@dont-email.me> <a81256dbd4f121a9345b151b1280162f@www.novabbs.org> <uv4ghh$gfsv$1@dont-email.me>
Organization: Rocksolid Light
Message-ID: <8e61b7c856aff15374ab3cc55956be9d@www.novabbs.org>
 by: MitchAlsup1 - Wed, 10 Apr 2024 00:28 UTC

BGB-Alt wrote:

> On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
>> BGB wrote:
>>
>>> Seemingly:
>>> 16/32/48 bit instructions, with 32 GPRs, seems likely optimal for code
>>> density;
>>> 32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
>>> performance.
>>
>>> Where, 16 GPRs isn't really enough (lots of register spills), and 128
>>> GPRs is wasteful (would likely need lots of monster functions with
>>> 250+ local variables to make effective use of this, *, which probably
>>> isn't going to happen).
>>
>> 16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not part
>> of GPRs AND you have good access to constants.
>>

> On the main ISA's I had tried to generate code for, 16 GPRs was kind of
> a pain as it resulted in fairly high spill rates.

> Though, it would probably be less bad if the compiler was able to use
> all of the registers at the same time without stepping on itself (such
> as dealing with register allocation involving scratch registers while
> also not conflicting with the use of function arguments, ...).

> My code generators had typically only used callee save registers for
> variables in basic blocks which ended in a function call (in my compiler
> design, both function calls and branches terminating the current
> basic-block).

> On SH, the main way of getting constants (larger than 8 bits) was via
> PC-relative memory loads, which kinda sucked.

> This is slightly less bad on x86-64, since one can use memory operands
> with most instructions, and the CPU tends to deal fairly well with code
> that has lots of spill-and-fill. This along with instructions having
> access to 32-bit immediate values.

Yes, x86 and any architecture (IBM 360, S.E.L., Interdata, ...) that has
LD-Ops acts as if it has 4-6 more registers than it really has. x86
with 16 GPRs acts like a RISC with 20-24 GPRs, as does the 360. This does
not really take the place of universal constants, but it goes a long way.

>>
>> The vast majority of leaf functions use less than 16 GPRs, given one has
>> a SP not part of GPRs {including arguments and return values}. Once one
>> starts placing things like memmove(), memset(), sin(), cos(), exp(), log()
>> in the ISA, it goes up even more.
>>

> Yeah.

> Things like memcpy/memmove/memset/etc, are function calls in cases when
> not directly transformed into register load/store sequences.

My 66000 does not convert them into LD-ST sequences; MM is a single
instruction.

> Did end up with an intermediate "memcpy slide", which can handle medium
> size memcpy and memset style operations by branching into a slide.

MMs and MSs that do not cross page boundaries are ATOMIC. The entire system
sees only the before or only the after state and nothing in between. This
means one can start (queue up) a SATA disk access without obtaining a lock
to the device--simply because one can fill in all the data of a command in
a single instruction which smells ATOMIC to all interested 3rd parties.

> As noted, on a 32 GPR machine, most leaf functions can fit entirely in
> scratch registers.

Which is why one can blow GPRs for SP, FP, GOT, TLS, ... without getting
totally screwed.

> On a 64 GPR machine, this percentage is slightly
> higher (but, not significantly, since there are few leaf functions
> remaining at this point).

> If one had a 16 GPR machine with 6 usable scratch registers, it is a
> little harder though (as typically these need to cover both any
> variables used by the function, and any temporaries used, ...). There
> are a whole lot more leaf functions that exceed a limit of 6 than of 14.

The data back in the R2000-3000 days indicated that 32 GPRs had a 15%+
advantage over 16 GPRs, while 64 had only a 3% advantage.

> But, say, a 32 GPR machine could still do well here.

> Note that there are reasons why I don't claim 64 GPRs as a large
> performance advantage:
> On programs like Doom, the difference is small at best.

> It mostly effects things like GLQuake in my case, mostly because TKRA-GL
> has a lot of functions with a large numbers of local variables (some
> exceeding 100 local variables).

> Partly though this is due to code that is highly inlined and unrolled
> and uses lots of variables tending to perform better in my case (and
> tightly looping code, with lots of small functions, not so much...).

>>
>>> Where, function categories:
>>>    Tiny Leaf:
>>>      Everything fits in scratch registers, no stack frame, no calls.
>>>    Leaf:
>>>      No function calls (either explicit or implicit);
>>>      Will have a stack frame.
>>>    Non-Leaf:
>>>      May call functions, has a stack frame.
>>
>> You are forgetting about FP, GOT, TLS, and whatever resources are required
>> to do try-throw-catch stuff as demanded by the source language.
>>

> Yeah, possibly true.

> In my case:
> There is no frame pointer, as BGBCC doesn't use one;

Can't do PASCAL and other ALGOL-derived languages with block structure.

> All stack-frames are fixed size, VLA's and alloca use the heap;

longjmp() is at a serious disadvantage here.
Destructors are sometimes hard to position on the stack.

> GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
> TLS, accessed via TBR.

> Try/throw/catch:
> Mostly N/A for leaf functions.

> Any function that can "throw", is in effect no longer a leaf function.
> Implicitly, any function which uses "variant" or similar is also, no
> longer a leaf function.

You do realize that there is a set of #define-s that can implement
try-throw-catch without requiring any subroutines ?!?
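For reference, one possible reading of that claim is a setjmp/longjmp macro set along these lines; this sketch supports only a single non-nested handler and does no unwind cleanup, so it is illustrative rather than a real implementation:

```c
#include <setjmp.h>

/* try/throw/catch from plain #define-s: setjmp() records the handler,
   longjmp() transfers control back to it with a nonzero code. */
static jmp_buf __catch_buf;

#define TRY         if (!setjmp(__catch_buf))
#define CATCH       else
#define THROW(code) longjmp(__catch_buf, (code))

/* Demo: doubles non-negative inputs, "throws" on negative ones. */
static int demo(int x)
{
    TRY {
        if (x < 0)
            THROW(1);
        return x * 2;
    } CATCH {
        return -1;
    }
}
```

A real version would keep a stack of jmp_bufs so TRY blocks can nest, but even then no subroutine calls are strictly required.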

> Need for GBR save/restore effectively excludes a function from being
> tiny-leaf. This may happen, say, if a function accesses global variables
> and may be called as a function pointer.

------------------------------------------------------

> One "TODO" here would be to merge constants with the same "actual" value
> into the same register. At present, they will be duplicated if the types
> are sufficiently different (such as integer 0 vs NULL).

In practice, the upper 48 bits of an extern variable's address are
completely shared, whereas the lower 16 bits are unique.

> For functions with dynamic assignment, immediate values are more likely
> to be used. If the code-generator were clever, potentially it could
> exclude assigning registers to constants which are only used by
> instructions which can encode them directly as an immediate. Currently,
> BGBCC is not that clever.

And then there are languages like PL/1 and FORTRAN where the compiler
has to figure out how big an intermediate array is, allocate it, perform
the math, and then deallocate it.

> Or, say:
> y=x+31; //31 only being used here, and fits easily in an Imm9.
> Ideally, compiler could realize 31 does not need a register here.

> Well, and another weakness is with temporaries that exist as function
> arguments:
> If static assigned, the "target variable directly to argument register"
> optimization can't be used (it ends up needing to go into a callee-save
> register and then be MOV'ed into the argument register; otherwise the
> compiler breaks...).

> Though, I guess possible could be that the compiler could try to
> partition temporaries that are used exclusively as function arguments
> into a different category from "normal" temporaries (or those whose
> values may cross a basic-block boundary), and then avoid
> statically-assigning them (and somehow not cause this to effectively
> break the full-static-assignment scheme in the process).

Brian's compiler finds the largest argument list and the largest return
value list and merges them into a single area on the stack used only
for passing arguments and results across the call interface. And the
<static> SP points at this area.

> Though, IIRC, I had also considered the possibility of a temporary
> "virtual assignment", allowing the argument value to be temporarily
> assigned to a function argument register, then going "poof" and
> disappearing when the function is called. Hadn't yet thought of a good
> way to add this logic to the register allocator though.

> But, yeah, compiler stuff is really fiddly...

More orthogonality helps.

Re: "Mini" tags to reduce the number of op codes

<uv56ec$ooj6$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=38257&group=comp.arch#38257

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: "Mini" tags to reduce the number of op codes
Date: Tue, 9 Apr 2024 22:01:00 -0700
Organization: A noiseless patient Spider
Lines: 146
Message-ID: <uv56ec$ooj6$1@dont-email.me>
References: <uuk100$inj$1@dont-email.me>
<6mqu0j1jf5uabmm6r2cb2tqn6ng90mruvd@4ax.com>
<15d1f26c4545f1dbae450b28e96e79bd@www.novabbs.org>
<lf441jt9i2lv7olvnm9t7bml2ib19eh552@4ax.com> <uuv1ir$30htt$1@dont-email.me>
<d71c59a1e0342d0d01f8ce7c0f449f9b@www.novabbs.org>
<uv02dn$3b6ik$1@dont-email.me> <uv415n$ck2j$1@dont-email.me>
<uv46rg$e4nb$1@dont-email.me>
<a81256dbd4f121a9345b151b1280162f@www.novabbs.org>
<uv4ghh$gfsv$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 10 Apr 2024 05:01:01 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="1e0154287d270c974cd6798ddf950547";
logging-data="811622"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/CyFgGugYkm242dRVEyXCRk9xl9KrwAQs="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:LgSgyQw7ccsy/Cp+bxtc2xFNgWY=
Content-Language: en-US
In-Reply-To: <uv4ghh$gfsv$1@dont-email.me>
 by: Chris M. Thomasson - Wed, 10 Apr 2024 05:01 UTC

On 4/9/2024 3:47 PM, BGB-Alt wrote:
> On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
>> BGB wrote:
>>
>>> On 4/9/2024 1:24 PM, Thomas Koenig wrote:
>>>> I wrote:
>>>>
>>>>> MitchAlsup1 <mitchalsup@aol.com> schrieb:
>>>>>> Thomas Koenig wrote:
>>>>>>
>>>> Maybe one more thing: In order to justify the more complex encoding,
>>>> I was going for 64 registers, and that didn't work out too well
>>>> (missing bits).
>>>>
>>>> Having learned about M-Core in the meantime, pure 32-register,
>>>> 21-bit instruction ISA might actually work better.
>>
>>
>>> For 32-bit instructions at least, 64 GPRs can work out OK.
>>
>>> Though, the gain of 64 over 32 seems to be fairly small for most
>>> "typical" code, mostly bringing a benefit if one is spending a lot of
>>> CPU time in functions that have large numbers of local variables all
>>> being used at the same time.
>>
>>
>>> Seemingly:
>>> 16/32/48 bit instructions, with 32 GPRs, seems likely optimal for
>>> code density;
>>> 32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
>>> performance.
>>
>>> Where, 16 GPRs isn't really enough (lots of register spills), and 128
>>> GPRs is wasteful (would likely need lots of monster functions with
>>> 250+ local variables to make effective use of this, *, which probably
>>> isn't going to happen).
>>
>> 16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not part
>> of GPRs AND you have good access to constants.
>>
>
> On the main ISA's I had tried to generate code for, 16 GPRs was kind of
> a pain as it resulted in fairly high spill rates.
>
> Though, it would probably be less bad if the compiler was able to use
> all of the registers at the same time without stepping on itself (such
> as dealing with register allocation involving scratch registers while
> also not conflicting with the use of function arguments, ...).
>
>
> My code generators had typically only used callee save registers for
> variables in basic blocks which ended in a function call (in my compiler
> design, both function calls and branches terminating the current
> basic-block).
>
> On SH, the main way of getting constants (larger than 8 bits) was via
> PC-relative memory loads, which kinda sucked.
>
>
> This is slightly less bad on x86-64, since one can use memory operands
> with most instructions, and the CPU tends to deal fairly well with code
> that has lots of spill-and-fill. This along with instructions having
> access to 32-bit immediate values.
>
>
>>> *: Where, it appears it is most efficient (for non-leaf functions) if
>>> the number of local variables is roughly twice that of the number of
>>> CPU registers. If more local variables than this, then spill/fill
>>> rate goes up significantly, and if less, then the registers aren't
>>> utilized as effectively.
>>
>>> Well, except in "tiny leaf" functions, where the criteria is instead
>>> that the number of local variables be less than the number of scratch
>>> registers. However, for many/most small leaf functions, the total
>>> number of variables isn't all that large either.
>>
>> The vast majority of leaf functions use less than 16 GPRs, given one has
>> a SP not part of GPRs {including arguments and return values}. Once
>> one starts placing things like memmove(), memset(), sin(), cos(),
>> exp(), log()
>> in the ISA, it goes up even more.
>>
>
> Yeah.
>
> Things like memcpy/memmove/memset/etc, are function calls in cases when
> not directly transformed into register load/store sequences.
>
> Did end up with an intermediate "memcpy slide", which can handle medium
> size memcpy and memset style operations by branching into a slide.
>
>
>
> As noted, on a 32 GPR machine, most leaf functions can fit entirely in
> scratch registers. On a 64 GPR machine, this percentage is slightly
> higher (but, not significantly, since there are few leaf functions
> remaining at this point).
>
>
> If one had a 16 GPR machine with 6 usable scratch registers, it is a
> little harder though (as typically these need to cover both any
> variables used by the function, and any temporaries used, ...). There
> are a whole lot more leaf functions that exceed a limit of 6 than of 14.
>
> But, say, a 32 GPR machine could still do well here.
>
>
> Note that there are reasons why I don't claim 64 GPRs as a large
> performance advantage:
> On programs like Doom, the difference is small at best.
>
>
> It mostly effects things like GLQuake in my case, mostly because TKRA-GL
> has a lot of functions with a large numbers of local variables (some
> exceeding 100 local variables).
>
> Partly though this is due to code that is highly inlined and unrolled
> and uses lots of variables tending to perform better in my case (and
> tightly looping code, with lots of small functions, not so much...).
>
>
>>
>>> Where, function categories:
>>>    Tiny Leaf:
>>>      Everything fits in scratch registers, no stack frame, no calls.
>>>    Leaf:
>>>      No function calls (either explicit or implicit);
>>>      Will have a stack frame.
>>>    Non-Leaf:
>>>      May call functions, has a stack frame.
>>
>> You are forgetting about FP, GOT, TLS, and whatever resources are
>> required
>> to do try-throw-catch stuff as demanded by the source language.
>>
>
> Yeah, possibly true.
>
> In my case:
>   There is no frame pointer, as BGBCC doesn't use one;
>     All stack-frames are fixed size, VLA's and alloca use the heap;
>   GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
>   TLS, accessed via TBR.[...]

alloca using the heap? Strange to me...

Re: "Mini" tags to reduce the number of op codes

<uv5err$ql29$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=38258&group=comp.arch#38258

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: "Mini" tags to reduce the number of op codes
Date: Wed, 10 Apr 2024 02:24:40 -0500
Organization: A noiseless patient Spider
Lines: 492
Message-ID: <uv5err$ql29$1@dont-email.me>
References: <uuk100$inj$1@dont-email.me>
<6mqu0j1jf5uabmm6r2cb2tqn6ng90mruvd@4ax.com>
<15d1f26c4545f1dbae450b28e96e79bd@www.novabbs.org>
<lf441jt9i2lv7olvnm9t7bml2ib19eh552@4ax.com> <uuv1ir$30htt$1@dont-email.me>
<d71c59a1e0342d0d01f8ce7c0f449f9b@www.novabbs.org>
<uv02dn$3b6ik$1@dont-email.me> <uv415n$ck2j$1@dont-email.me>
<uv46rg$e4nb$1@dont-email.me>
<a81256dbd4f121a9345b151b1280162f@www.novabbs.org>
<uv4ghh$gfsv$1@dont-email.me>
<8e61b7c856aff15374ab3cc55956be9d@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 10 Apr 2024 07:24:44 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="adf62e5ff09325073b660a4ffaf2aa0c";
logging-data="873545"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+XcTq8Bbep+LFzBo54YMr2vY2TJUox4IA="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:BNBs5YLjqtiIzvptgv2Tsmi4qTQ=
In-Reply-To: <8e61b7c856aff15374ab3cc55956be9d@www.novabbs.org>
Content-Language: en-US
 by: BGB - Wed, 10 Apr 2024 07:24 UTC

On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
> BGB-Alt wrote:
>
>> On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
>>> BGB wrote:
>>>
>>>> Seemingly:
>>>> 16/32/48 bit instructions, with 32 GPRs, seems likely optimal for
>>>> code density;
>>>> 32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
>>>> performance.
>>>
>>>> Where, 16 GPRs isn't really enough (lots of register spills), and
>>>> 128 GPRs is wasteful (would likely need lots of monster functions
>>>> with 250+ local variables to make effective use of this, *, which
>>>> probably isn't going to happen).
>>>
>>> 16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not
>>> part of GPRs AND you have good access to constants.
>>>
>
>> On the main ISA's I had tried to generate code for, 16 GPRs was kind
>> of a pain as it resulted in fairly high spill rates.
>
>> Though, it would probably be less bad if the compiler was able to use
>> all of the registers at the same time without stepping on itself (such
>> as dealing with register allocation involving scratch registers while
>> also not conflicting with the use of function arguments, ...).
>
>
>> My code generators had typically only used callee save registers for
>> variables in basic blocks which ended in a function call (in my
>> compiler design, both function calls and branches terminating the
>> current basic-block).
>
>> On SH, the main way of getting constants (larger than 8 bits) was via
>> PC-relative memory loads, which kinda sucked.
>

Also the blob of constants needed to be within 512 bytes of the load
instruction, which was also kind of an evil mess for branch handling
(and extra bad if one needed to spill the constants in the middle of a
basic block and then branch over it).

Usually they were spilled between basic-blocks, with the basic-block
needing to branch to the following basic-block in these cases.

Also 8-bit branch displacements are kinda lame, ...

And, if one wanted a 16-bit branch:
    MOV.W   (PC, 4), R0    // load a 16-bit branch displacement
    BRA/F   R0
.L0:
    NOP                    // delay slot
    .WORD   $(Label - .L0)

Also kinda bad...

>
>> This is slightly less bad on x86-64, since one can use memory operands
>> with most instructions, and the CPU tends to deal fairly well with
>> code that has lots of spill-and-fill. This along with instructions
>> having access to 32-bit immediate values.
>
> Yes, x86 and any architecture (IBM 360, S.E.L. , Interdata, ...) that have
> LD-Ops act as if they have 4-6 more registers than they really have. x86
> with 16 GPRs acts like a RISC with 20-24 GPRs as does 360. Does not really
> take the place of universal constants, but goes a long way.
>

Yeah.

>>>
>>> The vast majority of leaf functions use less than 16 GPRs, given one has
>>> a SP not part of GPRs {including arguments and return values}. Once
>>> one starts placing things like memmove(), memset(), sin(), cos(),
>>> exp(), log()
>>> in the ISA, it goes up even more.
>>>
>
>> Yeah.
>
>> Things like memcpy/memmove/memset/etc, are function calls in cases
>> when not directly transformed into register load/store sequences.
>
> My 66000 does not convert them into LD-ST sequences; MM is a single
> instruction.
>

I have no high-level memory move/copy/set instructions.
Only loads/stores...

For small copies, can encode them inline, but past a certain size this
becomes too bulky.

A copy loop makes more sense for bigger copies, but has a high overhead
for small-to-medium copies.

So, there is a size range where doing it inline would be too bulky, but
a loop carries an undesirable level of overhead.

Ended up doing these with "slides", which end up eating roughly several
kB of code space, but was more compact than using larger inline copies.

Say (IIRC):
  128 bytes or less: inline Ld/St sequence;
  129 bytes to 512B: slide;
  over 512B:         call "memcpy()" or similar.
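Those thresholds amount to a three-way dispatch; a sketch of the selection logic (illustrative only; in BGBCC this decision happens at codegen time, and the exact cutoffs are as recalled above):

```c
#include <stddef.h>

/* Which strategy to emit for a fixed-size memcpy of n bytes. */
enum copy_strategy { COPY_INLINE, COPY_SLIDE, COPY_CALL };

static enum copy_strategy pick_copy_strategy(size_t n)
{
    if (n <= 128) return COPY_INLINE;  /* inline Ld/St sequence */
    if (n <= 512) return COPY_SLIDE;   /* branch into the slide */
    return COPY_CALL;                  /* call memcpy() or similar */
}
```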

The slide generally has entry points in multiples of 32 bytes, and
operates in reverse order. So, if not a multiple of 32 bytes, the last
bytes need to be handled externally prior to branching into the slide.

Though, this is only used for fixed-size copies (or "memcpy()" when
value is constant).

Say:

__memcpy64_512_ua:
    MOV.Q   (R5, 480), R20
    MOV.Q   (R5, 488), R21
    MOV.Q   (R5, 496), R22
    MOV.Q   (R5, 504), R23
    MOV.Q   R20, (R4, 480)
    MOV.Q   R21, (R4, 488)
    MOV.Q   R22, (R4, 496)
    MOV.Q   R23, (R4, 504)

__memcpy64_480_ua:
    MOV.Q   (R5, 448), R20
    MOV.Q   (R5, 456), R21
    MOV.Q   (R5, 464), R22
    MOV.Q   (R5, 472), R23
    MOV.Q   R20, (R4, 448)
    MOV.Q   R21, (R4, 456)
    MOV.Q   R22, (R4, 464)
    MOV.Q   R23, (R4, 472)

    ....

__memcpy64_32_ua:
    MOV.Q   (R5), R20
    MOV.Q   (R5, 8), R21
    MOV.Q   (R5, 16), R22
    MOV.Q   (R5, 24), R23
    MOV.Q   R20, (R4)
    MOV.Q   R21, (R4, 8)
    MOV.Q   R22, (R4, 16)
    MOV.Q   R23, (R4, 24)
    RTS

>> Did end up with an intermediate "memcpy slide", which can handle
>> medium size memcpy and memset style operations by branching into a slide.
>
> MMs and MSs that do not cross page boundaries are ATOMIC. The entire system
> sees only the before or only the after state and nothing in between. This
> means one can start (queue up) a SATA disk access without obtaining a lock
> to the device--simply because one can fill in all the data of a command in
> a single instruction which smells ATOMIC to all interested 3rd parties.
>

My case, non-atomic, polling IO.

Code fragment:
    while(ct<cte)
    {
        P_SPI_QDATA=0xFFFFFFFFFFFFFFFFULL;
        P_SPI_CTRL=tkspi_ctl_status|SPICTRL_XMIT8X;
        v=P_SPI_CTRL;
        while(v&SPICTRL_BUSY)
            v=P_SPI_CTRL;
        *(u64 *)ct=P_SPI_QDATA;
        ct+=8;
    }

Where the MMIO interface allows sending/receiving 8 bytes at a time to
avoid bogging down at around 500 K/s or so (with 8B transfers, could
theoretically do 4 MB/s; though it is only ~ 1.5 MB/s with 12.5 MHz SPI).

Though, this is part of why I had ended up LZ compressing damn near
everything (LZ4 or RP2 being faster than sending ~ 3x as much data over
the SPI interface).

Hadn't generally used Huffman as the additional compression wasn't worth
the fairly steep performance cost (with something like Deflate, it would
barely be much faster than the bare SPI interface).

Did recently come up with a "pseudo entropic" coding that seems
promising in some testing:
  Rank symbols by probability, sending the most common 128 symbols;
  Send the encoded symbols as table indices via bytes, say:
    00..78: pair of symbol indices, each in 00..0A;
    7F:     escaped byte;
    80..FF: symbol index.

While it seems like this would likely fail to do much of anything, it
"sorta works", and is much faster to unpack than Huffman.

Though, if the distribution is "too flat", one needs to be able to fall
back to raw bytes.

Had experimentally written a compressor based around this scheme, and
while not as fast as LZ4, it did give compression much closer to Deflate.

Where, IME, on my current main PC:
  LZMA:    ~  35 MB/s
    Bitwise range coder.
  Deflate: ~ 200 MB/s
    Huffman based, symbols limited to 15 bits.
  TKuLZ:   ~ 350 MB/s
    Resembles a Deflate / LZ4 hybrid.
    Huffman based, symbols limited to 12 bits.
  TKFLZH:  ~ 500 MB/s
    Similar to a more elaborate version of TKuLZ.
    Huffman symbols limited to 13 bits.
  TKDELZ:  ~ 700 MB/s
    Similar to the prior, but:
      Splits symbols into separately-coded blocks;
      Uses an interleaved encoding scheme, decoding 4 symbols at a time.
  PSELZ:   ~ 1.0 GB/s
    Uses separate symbol blocks, with the "pseudo entropic" encoding.
  RP2:     ~ 1.8 GB/s
    Byte oriented.
  LZ4:     ~ 2.1 GB/s

Though, RP2 and LZ4 switch places on BJX2, where RP2 is both slightly
faster and gives slightly better compression.

I suspect this is likely because of differences in the relative cost of
byte loads and branch mispredicts.

Note that TKuLZ/TKFLZH/TKDELZ/PSELZ used a similar structure for
encoding LZ matches:
  TAG
  (Raw Length)
  (Match Length)
  Match Distance
  (Literal Bytes)
Where TAG has a structure like:
  (7:5): Raw Length (0..6, 7 = separate length)
  (4:0): Match Length (3..33, 34 = separate length)
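The TAG byte unpacks with a shift and a mask; a sketch (struct and names hypothetical, field semantics as described above):

```c
#include <stdint.h>

/* Decoded view of the match TAG byte: raw (literal) length in bits
   7:5, match length in bits 4:0; the *_sep flags mean a separate
   length field follows in the stream. */
typedef struct {
    int raw_len, raw_sep;
    int match_len, match_sep;
} lz_tag;

static lz_tag lz_unpack_tag(uint8_t tag)
{
    lz_tag t;
    t.raw_len   = tag >> 5;            /* 0..6; 7 = separate field */
    t.raw_sep   = (t.raw_len == 7);
    t.match_len = (tag & 0x1F) + 3;    /* 3..33; 34 = separate field */
    t.match_sep = (t.match_len == 34);
    return t;
}
```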

Though, the former 3 were using a combination of nybble-stream and bitstream.

Had considered a nybble stream for PSELZ, but ended up using bytes as
bytes are faster.


Re: "Mini" tags to reduce the number of op codes

<uv5fqf$qs8a$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=38259&group=comp.arch#38259

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: "Mini" tags to reduce the number of op codes
Date: Wed, 10 Apr 2024 02:41:01 -0500
Organization: A noiseless patient Spider
Lines: 167
Message-ID: <uv5fqf$qs8a$1@dont-email.me>
References: <uuk100$inj$1@dont-email.me>
<6mqu0j1jf5uabmm6r2cb2tqn6ng90mruvd@4ax.com>
<15d1f26c4545f1dbae450b28e96e79bd@www.novabbs.org>
<lf441jt9i2lv7olvnm9t7bml2ib19eh552@4ax.com> <uuv1ir$30htt$1@dont-email.me>
<d71c59a1e0342d0d01f8ce7c0f449f9b@www.novabbs.org>
<uv02dn$3b6ik$1@dont-email.me> <uv415n$ck2j$1@dont-email.me>
<uv46rg$e4nb$1@dont-email.me>
<a81256dbd4f121a9345b151b1280162f@www.novabbs.org>
<uv4ghh$gfsv$1@dont-email.me> <uv56ec$ooj6$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 10 Apr 2024 07:41:04 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="adf62e5ff09325073b660a4ffaf2aa0c";
logging-data="880906"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+vDGnq4NyCa+j10Fl8rn+XdoHfd36L0gM="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:xc3CRBHath7ojB8TlAXH03gRI6w=
Content-Language: en-US
In-Reply-To: <uv56ec$ooj6$1@dont-email.me>
 by: BGB - Wed, 10 Apr 2024 07:41 UTC

On 4/10/2024 12:01 AM, Chris M. Thomasson wrote:
> On 4/9/2024 3:47 PM, BGB-Alt wrote:
>> On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
>>> BGB wrote:
>>>
>>>> On 4/9/2024 1:24 PM, Thomas Koenig wrote:
>>>>> I wrote:
>>>>>
>>>>>> MitchAlsup1 <mitchalsup@aol.com> schrieb:
>>>>>>> Thomas Koenig wrote:
>>>>>>>
>>>>> Maybe one more thing: In order to justify the more complex encoding,
>>>>> I was going for 64 registers, and that didn't work out too well
>>>>> (missing bits).
>>>>>
>>>>> Having learned about M-Core in the meantime, pure 32-register,
>>>>> 21-bit instruction ISA might actually work better.
>>>
>>>
>>>> For 32-bit instructions at least, 64 GPRs can work out OK.
>>>
>>>> Though, the gain of 64 over 32 seems to be fairly small for most
>>>> "typical" code, mostly bringing a benefit if one is spending a lot
>>>> of CPU time in functions that have large numbers of local variables
>>>> all being used at the same time.
>>>
>>>
>>>> Seemingly:
>>>> 16/32/48 bit instructions, with 32 GPRs, seems likely optimal for
>>>> code density;
>>>> 32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
>>>> performance.
>>>
>>>> Where, 16 GPRs isn't really enough (lots of register spills), and
>>>> 128 GPRs is wasteful (would likely need lots of monster functions
>>>> with 250+ local variables to make effective use of this, *, which
>>>> probably isn't going to happen).
>>>
>>> 16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not
>>> part of GPRs AND you have good access to constants.
>>>
>>
>> On the main ISA's I had tried to generate code for, 16 GPRs was kind
>> of a pain as it resulted in fairly high spill rates.
>>
>> Though, it would probably be less bad if the compiler was able to use
>> all of the registers at the same time without stepping on itself (such
>> as dealing with register allocation involving scratch registers while
>> also not conflicting with the use of function arguments, ...).
>>
>>
>> My code generators had typically only used callee save registers for
>> variables in basic blocks which ended in a function call (in my
>> compiler design, both function calls and branches terminating the
>> current basic-block).
>>
>> On SH, the main way of getting constants (larger than 8 bits) was via
>> PC-relative memory loads, which kinda sucked.
>>
>>
>> This is slightly less bad on x86-64, since one can use memory operands
>> with most instructions, and the CPU tends to deal fairly well with
>> code that has lots of spill-and-fill. This along with instructions
>> having access to 32-bit immediate values.
>>
>>
>>>> *: Where, it appears it is most efficient (for non-leaf functions)
>>>> if the number of local variables is roughly twice that of the number
>>>> of CPU registers. If more local variables than this, then spill/fill
>>>> rate goes up significantly, and if less, then the registers aren't
>>>> utilized as effectively.
>>>
>>>> Well, except in "tiny leaf" functions, where the criteria is instead
>>>> that the number of local variables be less than the number of
>>>> scratch registers. However, for many/most small leaf functions, the
>>>> total number of variables isn't all that large either.
>>>
>>> The vast majority of leaf functions use less than 16 GPRs, given one has
>>> a SP not part of GPRs {including arguments and return values}. Once
>>> one starts placing things like memmove(), memset(), sin(), cos(),
>>> exp(), log()
>>> in the ISA, it goes up even more.
>>>
>>
>> Yeah.
>>
>> Things like memcpy/memmove/memset/etc, are function calls in cases
>> when not directly transformed into register load/store sequences.
>>
>> Did end up with an intermediate "memcpy slide", which can handle
>> medium size memcpy and memset style operations by branching into a slide.
>>
>>
>>
>> As noted, on a 32 GPR machine, most leaf functions can fit entirely in
>> scratch registers. On a 64 GPR machine, this percentage is slightly
>> higher (but, not significantly, since there are few leaf functions
>> remaining at this point).
>>
>>
>> If one had a 16 GPR machine with 6 usable scratch registers, it is a
>> little harder though (as typically these need to cover both any
>> variables used by the function, and any temporaries used, ...). There
>> are a whole lot more leaf functions that exceed a limit of 6 than of 14.
>>
>> But, say, a 32 GPR machine could still do well here.
>>
>>
>> Note that there are reasons why I don't claim 64 GPRs as a large
>> performance advantage:
>> On programs like Doom, the difference is small at best.
>>
>>
>> It mostly affects things like GLQuake in my case, mostly because
>> TKRA-GL has a lot of functions with a large number of local variables
>> (some exceeding 100 local variables).
>>
>> Partly though this is due to code that is highly inlined and unrolled
>> and uses lots of variables tending to perform better in my case (and
>> tightly looping code, with lots of small functions, not so much...).
>>
>>
>>>
>>>> Where, function categories:
>>>>    Tiny Leaf:
>>>>      Everything fits in scratch registers, no stack frame, no calls.
>>>>    Leaf:
>>>>      No function calls (either explicit or implicit);
>>>>      Will have a stack frame.
>>>>    Non-Leaf:
>>>>      May call functions, has a stack frame.
>>>
>>> You are forgetting about FP, GOT, TLS, and whatever resources are
>>> required
>>> to do try-throw-catch stuff as demanded by the source language.
>>>
>>
>> Yeah, possibly true.
>>
>> In my case:
>>    There is no frame pointer, as BGBCC doesn't use one;
>>      All stack-frames are fixed size, VLA's and alloca use the heap;
>>    GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
>>    TLS, accessed via TBR.[...]
>
> alloca using the heap? Strange to me...
>

Well, in this case:
The alloca calls are turned into calls which allocate the memory blob
and add it to a linked list;
when the function returns, everything in the linked list is freed;
Then, it internally pulls this off via malloc and free.

Also the typical default stack size in this case is 128K, so trying to
put big allocations on the stack is more liable to result in a stack
overflow.

A bigger stack needs more memory, so it is not ideal for NOMMU use.
Luckily, heap allocation is not too slow in this case.

Though, at the same time, ideally one limits use of language features
where the code-generation degenerates into a mess of hidden runtime
calls. These cases are not ideal for performance...
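To make the lowering described above concrete, here is a minimal C sketch of heap-backed alloca. The helper names (__alloca_push, __alloca_pop) and the hidden per-frame list variable are invented for illustration; the actual BGBCC runtime calls may differ.

```c
#include <stdlib.h>

/* Hypothetical runtime support for heap-backed alloca, along the lines
 * described above. Names and calling convention are invented for
 * illustration only. */

typedef struct AllocaNode AllocaNode;
struct AllocaNode {
    AllocaNode *next;           /* payload follows this header */
};

/* Emitted where the source used alloca(n); 'list' is a hidden
 * per-frame variable the compiler introduces. */
static void *__alloca_push(AllocaNode **list, size_t n)
{
    AllocaNode *node = malloc(sizeof(AllocaNode) + n);
    if (!node)
        return NULL;
    node->next = *list;
    *list = node;
    return node + 1;            /* user memory starts after the header */
}

/* Emitted on every function exit path: frees the whole list. */
static void __alloca_pop(AllocaNode **list)
{
    while (*list) {
        AllocaNode *dead = *list;
        *list = dead->next;
        free(dead);
    }
}
```

The compiler would then rewrite each alloca(n) into __alloca_push(&frame_list, n) and plant __alloca_pop(&frame_list) on every return path.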

Re: "Mini" tags to reduce the number of op codes

<e43623eb10619eb28a68b2bd7af93390@www.novabbs.org>


https://news.novabbs.org/devel/article-flat.php?id=38260&group=comp.arch#38260

Date: Wed, 10 Apr 2024 17:12:47 +0000
Subject: Re: "Mini" tags to reduce the number of op codes
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <uuk100$inj$1@dont-email.me> <6mqu0j1jf5uabmm6r2cb2tqn6ng90mruvd@4ax.com> <15d1f26c4545f1dbae450b28e96e79bd@www.novabbs.org> <lf441jt9i2lv7olvnm9t7bml2ib19eh552@4ax.com> <uuv1ir$30htt$1@dont-email.me> <d71c59a1e0342d0d01f8ce7c0f449f9b@www.novabbs.org> <uv02dn$3b6ik$1@dont-email.me> <uv415n$ck2j$1@dont-email.me> <uv46rg$e4nb$1@dont-email.me> <a81256dbd4f121a9345b151b1280162f@www.novabbs.org> <uv4ghh$gfsv$1@dont-email.me> <8e61b7c856aff15374ab3cc55956be9d@www.novabbs.org> <uv5err$ql29$1@dont-email.me>
Organization: Rocksolid Light
Message-ID: <e43623eb10619eb28a68b2bd7af93390@www.novabbs.org>
 by: MitchAlsup1 - Wed, 10 Apr 2024 17:12 UTC

BGB wrote:

> On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
>> BGB-Alt wrote:
>>

> Also the blob of constants needed to be within 512 bytes of the load
> instruction, which was also kind of an evil mess for branch handling
> (and extra bad if one needed to spill the constants in the middle of a
> basic block and then branch over it).

In My 66000 case, the constant is the word following the instruction.
Easy to find, easy to access, no register pollution, no DCache pollution.

> Usually they were spilled between basic-blocks, with the basic-block
> needing to branch to the following basic-block in these cases.

> Also 8-bit branch displacements are kinda lame, ...

Why do that to yourself ??

> And, if one wanted a 16-bit branch:
> MOV.W (PC, 4), R0 //load a 16-bit branch displacement
> BRA/F R0
> .L0:
> NOP // delay slot
> .WORD $(Label - .L0)

> Also kinda bad...

Can you say Yech !!

>>> Things like memcpy/memmove/memset/etc, are function calls in cases
>>> when not directly transformed into register load/store sequences.
>>
>> My 66000 does not convert them into LD-ST sequences, MM is a single inst-
>> ruction.
>>

> I have no high-level memory move/copy/set instructions.
> Only loads/stores...

You have the power to fix it.........

> For small copies, can encode them inline, but past a certain size this
> becomes too bulky.

> A copy loop makes more sense for bigger copies, but has a high overhead
> for small to medium copy.

> So, there is a size range where doing it inline would be too bulky, but
> a loop carries an undesirable level of overhead.

All the more reason to put it (a highly useful unit of work) into an
instruction.

> Ended up doing these with "slides", which end up eating roughly several
> kB of code space, but was more compact than using larger inline copies.

> Say (IIRC):
> 128 bytes or less: Inline Ld/St sequence
> 129 bytes to 512B: Slide
> Over 512B: Call "memcpy()" or similar.

Versus::
1-infinity: use MM instruction.

> The slide generally has entry points in multiples of 32 bytes, and
> operates in reverse order. So, if not a multiple of 32 bytes, the last
> bytes need to be handled externally prior to branching into the slide.

Does this remain sequentially consistent ??

> Though, this is only used for fixed-size copies (or "memcpy()" when
> value is constant).

> Say:

> __memcpy64_512_ua:
> MOV.Q (R5, 480), R20
> MOV.Q (R5, 488), R21
> MOV.Q (R5, 496), R22
> MOV.Q (R5, 504), R23
> MOV.Q R20, (R4, 480)
> MOV.Q R21, (R4, 488)
> MOV.Q R22, (R4, 496)
> MOV.Q R23, (R4, 504)

> __memcpy64_480_ua:
> MOV.Q (R5, 448), R20
> MOV.Q (R5, 456), R21
> MOV.Q (R5, 464), R22
> MOV.Q (R5, 472), R23
> MOV.Q R20, (R4, 448)
> MOV.Q R21, (R4, 456)
> MOV.Q R22, (R4, 464)
> MOV.Q R23, (R4, 472)

> ....

> __memcpy64_32_ua:
> MOV.Q (R5), R20
> MOV.Q (R5, 8), R21
> MOV.Q (R5, 16), R22
> MOV.Q (R5, 24), R23
> MOV.Q R20, (R4)
> MOV.Q R21, (R4, 8)
> MOV.Q R22, (R4, 16)
> MOV.Q R23, (R4, 24)
> RTS

Duff's device by any other name.

Re: "Mini" tags to reduce the number of op codes

<S%zRN.162255$_a1e.120745@fx16.iad>


https://news.novabbs.org/devel/article-flat.php?id=38261&group=comp.arch#38261

Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: "Mini" tags to reduce the number of op codes
Newsgroups: comp.arch
References: <uuk100$inj$1@dont-email.me> <lf441jt9i2lv7olvnm9t7bml2ib19eh552@4ax.com> <uuv1ir$30htt$1@dont-email.me> <d71c59a1e0342d0d01f8ce7c0f449f9b@www.novabbs.org> <uv02dn$3b6ik$1@dont-email.me> <uv415n$ck2j$1@dont-email.me> <uv46rg$e4nb$1@dont-email.me> <a81256dbd4f121a9345b151b1280162f@www.novabbs.org> <uv4ghh$gfsv$1@dont-email.me> <8e61b7c856aff15374ab3cc55956be9d@www.novabbs.org> <uv5err$ql29$1@dont-email.me> <e43623eb10619eb28a68b2bd7af93390@www.novabbs.org>
Lines: 17
Message-ID: <S%zRN.162255$_a1e.120745@fx16.iad>
Organization: UsenetServer - www.usenetserver.com
Date: Wed, 10 Apr 2024 17:29:22 GMT
 by: Scott Lurndal - Wed, 10 Apr 2024 17:29 UTC

mitchalsup@aol.com (MitchAlsup1) writes:
>BGB wrote:
>
>> On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
>>> BGB-Alt wrote:
>>>
>
>> Also the blob of constants needed to be within 512 bytes of the load
>> instruction, which was also kind of an evil mess for branch handling
>> (and extra bad if one needed to spill the constants in the middle of a
>> basic block and then branch over it).
>
>In My 66000 case, the constant is the word following the instruction.
>Easy to find, easy to access, no register pollution, no DCache pollution.

It does occupy some icache space, however; have you boosted the icache
size to compensate?

Re: "Mini" tags to reduce the number of op codes

<uv6nea$14d6r$2@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=38262&group=comp.arch#38262

From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: "Mini" tags to reduce the number of op codes
Date: Wed, 10 Apr 2024 11:57:14 -0700
Organization: A noiseless patient Spider
Lines: 177
Message-ID: <uv6nea$14d6r$2@dont-email.me>
References: <uuk100$inj$1@dont-email.me>
<6mqu0j1jf5uabmm6r2cb2tqn6ng90mruvd@4ax.com>
<15d1f26c4545f1dbae450b28e96e79bd@www.novabbs.org>
<lf441jt9i2lv7olvnm9t7bml2ib19eh552@4ax.com> <uuv1ir$30htt$1@dont-email.me>
<d71c59a1e0342d0d01f8ce7c0f449f9b@www.novabbs.org>
<uv02dn$3b6ik$1@dont-email.me> <uv415n$ck2j$1@dont-email.me>
<uv46rg$e4nb$1@dont-email.me>
<a81256dbd4f121a9345b151b1280162f@www.novabbs.org>
<uv4ghh$gfsv$1@dont-email.me> <uv56ec$ooj6$1@dont-email.me>
<uv5fqf$qs8a$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 10 Apr 2024 20:57:15 +0200 (CEST)
User-Agent: Mozilla Thunderbird
Content-Language: en-US
In-Reply-To: <uv5fqf$qs8a$1@dont-email.me>
 by: Chris M. Thomasson - Wed, 10 Apr 2024 18:57 UTC

On 4/10/2024 12:41 AM, BGB wrote:
> On 4/10/2024 12:01 AM, Chris M. Thomasson wrote:
>> On 4/9/2024 3:47 PM, BGB-Alt wrote:
>>> On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
>>>> BGB wrote:
>>>>
>>>>> On 4/9/2024 1:24 PM, Thomas Koenig wrote:
>>>>>> I wrote:
>>>>>>
>>>>>>> MitchAlsup1 <mitchalsup@aol.com> schrieb:
>>>>>>>> Thomas Koenig wrote:
>>>>>>>>
>>>>>> Maybe one more thing: In order to justify the more complex encoding,
>>>>>> I was going for 64 registers, and that didn't work out too well
>>>>>> (missing bits).
>>>>>>
>>>>>> Having learned about M-Core in the meantime, pure 32-register,
>>>>>> 21-bit instruction ISA might actually work better.
>>>>
>>>>
>>>>> For 32-bit instructions at least, 64 GPRs can work out OK.
>>>>
>>>>> Though, the gain of 64 over 32 seems to be fairly small for most
>>>>> "typical" code, mostly bringing a benefit if one is spending a lot
>>>>> of CPU time in functions that have large numbers of local variables
>>>>> all being used at the same time.
>>>>
>>>>
>>>>> Seemingly:
>>>>> 16/32/48 bit instructions, with 32 GPRs, seems likely optimal for
>>>>> code density;
>>>>> 32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
>>>>> performance.
>>>>
>>>>> Where, 16 GPRs isn't really enough (lots of register spills), and
>>>>> 128 GPRs is wasteful (would likely need lots of monster functions
>>>>> with 250+ local variables to make effective use of this, *, which
>>>>> probably isn't going to happen).
>>>>
>>>> 16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not
>>>> part of GPRs AND you have good access to constants.
>>>>
>>>
>>> On the main ISA's I had tried to generate code for, 16 GPRs was kind
>>> of a pain as it resulted in fairly high spill rates.
>>>
>>> Though, it would probably be less bad if the compiler was able to use
>>> all of the registers at the same time without stepping on itself
>>> (such as dealing with register allocation involving scratch registers
>>> while also not conflicting with the use of function arguments, ...).
>>>
>>>
>>> My code generators had typically only used callee save registers for
>>> variables in basic blocks which ended in a function call (in my
>>> compiler design, both function calls and branches terminating the
>>> current basic-block).
>>>
>>> On SH, the main way of getting constants (larger than 8 bits) was via
>>> PC-relative memory loads, which kinda sucked.
>>>
>>>
>>> This is slightly less bad on x86-64, since one can use memory
>>> operands with most instructions, and the CPU tends to deal fairly
>>> well with code that has lots of spill-and-fill. This along with
>>> instructions having access to 32-bit immediate values.
>>>
>>>
>>>>> *: Where, it appears it is most efficient (for non-leaf functions)
>>>>> if the number of local variables is roughly twice that of the
>>>>> number of CPU registers. If more local variables than this, then
>>>>> spill/fill rate goes up significantly, and if less, then the
>>>>> registers aren't utilized as effectively.
>>>>
>>>>> Well, except in "tiny leaf" functions, where the criteria is
>>>>> instead that the number of local variables be less than the number
>>>>> of scratch registers. However, for many/most small leaf functions,
>>>>> the total number of variables isn't all that large either.
>>>>
>>>> The vast majority of leaf functions use less than 16 GPRs, given one
>>>> has
>>>> a SP not part of GPRs {including arguments and return values}. Once
>>>> one starts placing things like memmove(), memset(), sin(), cos(),
>>>> exp(), log()
>>>> in the ISA, it goes up even more.
>>>>
>>>
>>> Yeah.
>>>
>>> Things like memcpy/memmove/memset/etc, are function calls in cases
>>> when not directly transformed into register load/store sequences.
>>>
>>> Did end up with an intermediate "memcpy slide", which can handle
>>> medium size memcpy and memset style operations by branching into a
>>> slide.
>>>
>>>
>>>
>>> As noted, on a 32 GPR machine, most leaf functions can fit entirely
>>> in scratch registers. On a 64 GPR machine, this percentage is
>>> slightly higher (but, not significantly, since there are few leaf
>>> functions remaining at this point).
>>>
>>>
>>> If one had a 16 GPR machine with 6 usable scratch registers, it is a
>>> little harder though (as typically these need to cover both any
>>> variables used by the function, and any temporaries used, ...). There
>>> are a whole lot more leaf functions that exceed a limit of 6 than of 14.
>>>
>>> But, say, a 32 GPR machine could still do well here.
>>>
>>>
>>> Note that there are reasons why I don't claim 64 GPRs as a large
>>> performance advantage:
>>> On programs like Doom, the difference is small at best.
>>>
>>>
>>> It mostly affects things like GLQuake in my case, mostly because
>>> TKRA-GL has a lot of functions with a large number of local
>>> variables (some exceeding 100 local variables).
>>>
>>> Partly though this is due to code that is highly inlined and unrolled
>>> and uses lots of variables tending to perform better in my case (and
>>> tightly looping code, with lots of small functions, not so much...).
>>>
>>>
>>>>
>>>>> Where, function categories:
>>>>>    Tiny Leaf:
>>>>>      Everything fits in scratch registers, no stack frame, no calls.
>>>>>    Leaf:
>>>>>      No function calls (either explicit or implicit);
>>>>>      Will have a stack frame.
>>>>>    Non-Leaf:
>>>>>      May call functions, has a stack frame.
>>>>
>>>> You are forgetting about FP, GOT, TLS, and whatever resources are
>>>> required
>>>> to do try-throw-catch stuff as demanded by the source language.
>>>>
>>>
>>> Yeah, possibly true.
>>>
>>> In my case:
>>>    There is no frame pointer, as BGBCC doesn't use one;
>>>      All stack-frames are fixed size, VLA's and alloca use the heap;
>>>    GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
>>>    TLS, accessed via TBR.[...]
>>
>> alloca using the heap? Strange to me...
>>
>
> Well, in this case:
> The alloca calls are turned into calls which allocate the memory blob
> and add it to a linked list;
> when the function returns, everything in the linked list is freed;
> Then, it internally pulls this off via malloc and free.
>
> Also the typical default stack size in this case is 128K, so trying to
> put big allocations on the stack is more liable to result in a stack
> overflow.
>
> Bigger stack needs more memory, so is not ideal for NOMMU use. Luckily
> heap allocation is not too slow in this case.
>
>
> Though, at the same time, ideally one limits use of language features
> where the code-generation degenerates into a mess of hidden runtime
> calls. These cases are not ideal for performance...
>
>

Sometimes alloca is useful for offsetting the stack to avoid false
sharing between thread stacks. Intel wrote a little paper that addresses
this:

https://www.intel.com/content/dam/www/public/us/en/documents/training/developing-multithreaded-applications.pdf

Remember that one?
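The trick in question is to burn a thread-dependent amount of stack at the top of each thread's entry routine, so otherwise-identical frames in different threads don't land on the same cache lines (or map to the same cache sets). A minimal sketch, assuming a glibc-style alloca() and a 64-byte cache line:

```c
#include <alloca.h>
#include <stddef.h>

#define CACHE_LINE 64

/* Entry routine for worker thread 'id' (as would be passed to
 * pthread_create). The alloca must sit in the thread routine itself,
 * not a helper, since the space is released when the enclosing
 * function returns. */
static void *thread_main(void *arg)
{
    size_t id = (size_t)arg;

    /* Shift the rest of this thread's stack down by a thread-specific
     * multiple of the cache-line size. */
    volatile char *pad = alloca((id + 1) * CACHE_LINE);
    pad[0] = 0;                 /* touch it so it isn't optimized away */

    /* ... the thread's real work goes here, with its hot locals now
     * offset relative to the other threads' ... */
    return NULL;
}
```

The exact pad size per thread is a tuning knob; the paper suggests varying it per thread rather than using one fixed offset.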

Re: "Mini" tags to reduce the number of op codes

<uv6u3r$16g41$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=38263&group=comp.arch#38263

From: bohannonindustriesllc@gmail.com (BGB-Alt)
Newsgroups: comp.arch
Subject: Re: "Mini" tags to reduce the number of op codes
Date: Wed, 10 Apr 2024 15:51:07 -0500
Organization: A noiseless patient Spider
Lines: 239
Message-ID: <uv6u3r$16g41$1@dont-email.me>
References: <uuk100$inj$1@dont-email.me>
<6mqu0j1jf5uabmm6r2cb2tqn6ng90mruvd@4ax.com>
<15d1f26c4545f1dbae450b28e96e79bd@www.novabbs.org>
<lf441jt9i2lv7olvnm9t7bml2ib19eh552@4ax.com> <uuv1ir$30htt$1@dont-email.me>
<d71c59a1e0342d0d01f8ce7c0f449f9b@www.novabbs.org>
<uv02dn$3b6ik$1@dont-email.me> <uv415n$ck2j$1@dont-email.me>
<uv46rg$e4nb$1@dont-email.me>
<a81256dbd4f121a9345b151b1280162f@www.novabbs.org>
<uv4ghh$gfsv$1@dont-email.me>
<8e61b7c856aff15374ab3cc55956be9d@www.novabbs.org>
<uv5err$ql29$1@dont-email.me>
<e43623eb10619eb28a68b2bd7af93390@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 10 Apr 2024 22:51:08 +0200 (CEST)
User-Agent: Mozilla Thunderbird
Content-Language: en-US
In-Reply-To: <e43623eb10619eb28a68b2bd7af93390@www.novabbs.org>
 by: BGB-Alt - Wed, 10 Apr 2024 20:51 UTC

On 4/10/2024 12:12 PM, MitchAlsup1 wrote:
> BGB wrote:
>
>> On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
>>> BGB-Alt wrote:
>>>
>
>> Also the blob of constants needed to be within 512 bytes of the load
>> instruction, which was also kind of an evil mess for branch handling
>> (and extra bad if one needed to spill the constants in the middle of a
>> basic block and then branch over it).
>
> In My 66000 case, the constant is the word following the instruction.
> Easy to find, easy to access, no register pollution, no DCache pollution.
>

Yeah.

This was why some of the first things I did when I started extending
SH-4 were:
Adding mechanisms to build constants inline;
Adding Load/Store ops with a displacement (albeit with encodings
borrowed from SH-2A);
Adding 3R and 3RI encodings (originally Imm8 for 3RI).

Did have a mess when I later extended the ISA to 32 GPRs, as (like with
BJX2 Baseline+XGPR) only part of the ISA had access to R16..R31.

>> Usually they were spilled between basic-blocks, with the basic-block
>> needing to branch to the following basic-block in these cases.
>
>> Also 8-bit branch displacements are kinda lame, ...
>
> Why do that to yourself ??
>

I didn't design SuperH, Hitachi did...

All of this stuff was apparently sufficient for the SEGA
32X/Saturn/Dreamcast consoles... (vs the Genesis/MegaDrive using an
M68000, and the Master System using a Z80).

I guess for a while it was also popular in CD-ROM and HDD controllers.
After SEGA left the game-console market, they apparently continued using
it for a while in arcade machines, before later jumping over to x86 via
low-end PC motherboards (it being cheaper since the mid/late 2000s to
build an arcade machine from off-the-shelf PC parts).

Saw a video where a guy was messing with one of these; despite being
built with low-end PC parts (and an entry-level graphics card), the
parts were balanced well enough that it still gave fairly decent gaming
performance.

But, with BJX1, I had added Disp16 branches.

With BJX2, they were replaced with 20 bit branches. These have the merit
of being able to branch anywhere within a Doom or Quake sized binary.

>> And, if one wanted a 16-bit branch:
>>    MOV.W (PC, 4), R0  //load a 16-bit branch displacement
>>    BRA/F R0
>>    .L0:
>>    NOP    // delay slot
>>    .WORD $(Label - .L0)
>
>> Also kinda bad...
>
> Can you say Yech !!
>

Yeah.
This sort of stuff created strong incentive for ISA redesign...

Granted, it is possible had I instead started with RISC-V instead of
SuperH, it is probable BJX2 wouldn't exist.

Though, at the time, the original thinking was that SuperH having
smaller instructions meant it would have better code density than RV32I
or similar. Turns out not really, as the penalty of the 16 bit ops was
needing almost twice as many on average.

>>>> Things like memcpy/memmove/memset/etc, are function calls in cases
>>>> when not directly transformed into register load/store sequences.
>>>
>>> My 66000 does not convert them into LD-ST sequences, MM is a single
>>> inst-
>>> ruction.
>>>
>
>> I have no high-level memory move/copy/set instructions.
>> Only loads/stores...
>
> You have the power to fix it.........
>

But, at what cost...

I had generally avoided anything that would have required microcode or
shoving state machines into the pipeline or similar.

Things like Load/Store-Multiple or a memory-move instruction would fall
into this category.

>> For small copies, can encode them inline, but past a certain size this
>> becomes too bulky.
>
>> A copy loop makes more sense for bigger copies, but has a high
>> overhead for small to medium copy.
>
>
>> So, there is a size range where doing it inline would be too bulky,
>> but a loop carries an undesirable level of overhead.
>
> All the more reason to put it (a highly useful unit of work) into an
> instruction.
>

This is an area where "slides" work well, the main cost is mostly the
bulk that the slide adds to the binary (albeit, it is one-off).

Which is why it is a 512B memcpy slide vs, say, a 4kB memcpy slide...

For looping memcpy, it makes sense to copy 64 or 128 bytes per loop
iteration or so to try to limit looping overhead.

Though, leveraging the memcpy slide for the interior part of the copy
could be possible in theory as well.

For LZ memcpy, it is typically smaller, as LZ copies tend to be a lot
shorter (a big part of LZ decoder performance mostly being in
fine-tuning the logic for the match copies).

Though, this is part of why my runtime library added "_memlzcpy(dst,
src, len)" and "_memlzcpyf(dst, src, len)" functions, which consolidate
this rather than needing to redo it one-off for each LZ decoder (as I
see it, it is a similar issue to not wanting code to endlessly re-roll
stuff for functions like memcpy or malloc/free, *).

*: Though, nevermind that the standard C interface for malloc is
annoyingly minimal, and ends up requiring most non-trivial programs to
roll their own memory management.

>> Ended up doing these with "slides", which end up eating roughly
>> several kB of code space, but was more compact than using larger
>> inline copies.
>
>
>> Say (IIRC):
>>    128 bytes or less: Inline Ld/St sequence
>>    129 bytes to 512B: Slide
>>    Over 512B: Call "memcpy()" or similar.
>
> Versus::
>     1-infinity: use MM instruction.
>

Yeah, but it makes the CPU logic more expensive.

>> The slide generally has entry points in multiples of 32 bytes, and
>> operates in reverse order. So, if not a multiple of 32 bytes, the last
>> bytes need to be handled externally prior to branching into the slide.
>
> Does this remain sequentially consistent ??
>

Within a thread, it is fine.

The main wonk is that it starts copying from the high address first;
presumably, interrupts or similar won't be messing with application
memory mid-memcpy.

The looping memcpy's generally work from low to high addresses though.

>> Though, this is only used for fixed-size copies (or "memcpy()" when
>> value is constant).
>
>
>> Say:
>
>> __memcpy64_512_ua:
>>    MOV.Q        (R5, 480), R20
>>    MOV.Q        (R5, 488), R21
>>    MOV.Q        (R5, 496), R22
>>    MOV.Q        (R5, 504), R23
>>    MOV.Q        R20, (R4, 480)
>>    MOV.Q        R21, (R4, 488)
>>    MOV.Q        R22, (R4, 496)
>>    MOV.Q        R23, (R4, 504)
>
>> __memcpy64_480_ua:
>>    MOV.Q        (R5, 448), R20
>>    MOV.Q        (R5, 456), R21
>>    MOV.Q        (R5, 464), R22
>>    MOV.Q        (R5, 472), R23
>>    MOV.Q        R20, (R4, 448)
>>    MOV.Q        R21, (R4, 456)
>>    MOV.Q        R22, (R4, 464)
>>    MOV.Q        R23, (R4, 472)
>
>> ....
>
>> __memcpy64_32_ua:
>>    MOV.Q        (R5), R20
>>    MOV.Q        (R5, 8), R21
>>    MOV.Q        (R5, 16), R22
>>    MOV.Q        (R5, 24), R23
>>    MOV.Q        R20, (R4)
>>    MOV.Q        R21, (R4, 8)
>>    MOV.Q        R22, (R4, 16)
>>    MOV.Q        R23, (R4, 24)
>>    RTS
>
> Duff's device by any other name.

More or less, though I think the defining idea of Duff's device is
specifically the way it abuses the do-while and switch constructs.

This is basically just an unrolled slide.
So, where one branches into it, determines how much is copied.

For small-to-medium copies, the advantage is mostly that this avoids
looping overhead.

Re: "Mini" tags to reduce the number of op codes

<9fb548d5b81e65bf1ececd070d8085c9@www.novabbs.org>


https://news.novabbs.org/devel/article-flat.php?id=38264&group=comp.arch#38264

Date: Wed, 10 Apr 2024 21:19:20 +0000
Subject: Re: "Mini" tags to reduce the number of op codes
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <uuk100$inj$1@dont-email.me> <6mqu0j1jf5uabmm6r2cb2tqn6ng90mruvd@4ax.com> <15d1f26c4545f1dbae450b28e96e79bd@www.novabbs.org> <lf441jt9i2lv7olvnm9t7bml2ib19eh552@4ax.com> <uuv1ir$30htt$1@dont-email.me> <d71c59a1e0342d0d01f8ce7c0f449f9b@www.novabbs.org> <uv02dn$3b6ik$1@dont-email.me> <uv415n$ck2j$1@dont-email.me> <uv46rg$e4nb$1@dont-email.me> <a81256dbd4f121a9345b151b1280162f@www.novabbs.org> <uv4ghh$gfsv$1@dont-email.me> <8e61b7c856aff15374ab3cc55956be9d@www.novabbs.org> <uv5err$ql29$1@dont-email.me> <e43623eb10619eb28a68b2bd7af93390@www.novabbs.org> <uv6u3r$16g41$1@dont-email.me>
Organization: Rocksolid Light
Message-ID: <9fb548d5b81e65bf1ececd070d8085c9@www.novabbs.org>
 by: MitchAlsup1 - Wed, 10 Apr 2024 21:19 UTC

BGB-Alt wrote:

> On 4/10/2024 12:12 PM, MitchAlsup1 wrote:
>> BGB wrote:
>>
>>> On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
>>>> BGB-Alt wrote:
>>>>
>>
>>> Also the blob of constants needed to be within 512 bytes of the load
>>> instruction, which was also kind of an evil mess for branch handling
>>> (and extra bad if one needed to spill the constants in the middle of a
>>> basic block and then branch over it).
>>
>> In My 66000 case, the constant is the word following the instruction.
>> Easy to find, easy to access, no register pollution, no DCache pollution.
>>

> Yeah.

> This was why some of the first things I did when I started extending
> SH-4 were:
> Adding mechanisms to build constants inline;
> Adding Load/Store ops with a displacement (albeit with encodings
> borrowed from SH-2A);
> Adding 3R and 3RI encodings (originally Imm8 for 3RI).

My suggestion is that:: "Now that you have screwed around for a while,
Why not take that experience and do a new ISA without any of those
mistakes in it" ??

> Did have a mess when I later extended the ISA to 32 GPRs, as (like with
> BJX2 Baseline+XGPR) only part of the ISA had access to R16..R31.

>>> Usually they were spilled between basic-blocks, with the basic-block
>>> needing to branch to the following basic-block in these cases.
>>
>>> Also 8-bit branch displacements are kinda lame, ...
>>
>> Why do that to yourself ??
>>

> I didn't design SuperH, Hitachi did...

But you did not fix them en masse, and you complain about them
at least once a week. There comes a time when it takes less time
and less courage to do that big switch and clean up all that mess.

> But, with BJX1, I had added Disp16 branches.

> With BJX2, they were replaced with 20 bit branches. These have the merit
> of being able to branch anywhere within a Doom or Quake sized binary.

>>> And, if one wanted a 16-bit branch:
>>>    MOV.W (PC, 4), R0  //load a 16-bit branch displacement
>>>    BRA/F R0
>>>    .L0:
>>>    NOP    // delay slot
>>>    .WORD $(Label - .L0)
>>
>>> Also kinda bad...
>>
>> Can you say Yech !!
>>

> Yeah.
> This sort of stuff created strong incentive for ISA redesign...

Maybe consider now as the appropriate time to start.

> Granted, it is possible had I instead started with RISC-V instead of
> SuperH, it is probable BJX2 wouldn't exist.

> Though, at the time, the original thinking was that SuperH having
> smaller instructions meant it would have better code density than RV32I
> or similar. Turns out not really, as the penalty of the 16 bit ops was
> needing almost twice as many on average.

My 66000 only requires 70% of the instruction count of RISC-V,
Yours could too ................

>>>>> Things like memcpy/memmove/memset/etc, are function calls in cases
>>>>> when not directly transformed into register load/store sequences.
>>>>
>>>> My 66000 does not convert them into LD-ST sequences, MM is a single
>>>> instruction.
>>>>
>>
>>> I have no high-level memory move/copy/set instructions.
>>> Only loads/stores...
>>
>> You have the power to fix it.........
>>

> But, at what cost...

You would not have to spend hours a week defending the indefensible !!

> I had generally avoided anything that would have required microcode or
> shoving state-machines into the pipeline or similar.

Things as simple as IDIV and FDIV require sequencers.
But LDM, STM, MM require sequencers simpler than IDIV and FDIV !!

> Things like Load/Store-Multiple or

If you like polluted ICaches..............

>>> For small copies, can encode them inline, but past a certain size this
>>> becomes too bulky.
>>
>>> A copy loop makes more sense for bigger copies, but has a high
>>> overhead for small to medium copy.
>>
>>
>>> So, there is a size range where doing it inline would be too bulky,
>>> but a loop carries an undesirable level of overhead.
>>
>> All the more reason to put it (a highly useful unit of work) into an
>> instruction.
>>

> This is an area where "slides" work well, the main cost is mostly the
> bulk that the slide adds to the binary (albeit, it is one-off).

Consider that the predictor getting into the slide the first time
always mispredicts !!

> Which is why it is a 512B memcpy slide vs, say, a 4kB memcpy slide...

What if you only wanted to copy 63 bytes ?? Your DW slide fails miserably,
yet a HW sequencer only has to avoid asserting a single byte write enable
once.
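For readers unfamiliar with the technique under debate, here is a minimal C sketch of a copy "slide" (a hypothetical illustration, not BGB's actual code): an unrolled chain of copy steps with multiple entry points, where the caller fixes up the tail and then "branches into the slide" at the step matching the remaining whole blocks.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of a "memcpy slide": an unrolled chain of copy
 * steps entered at the point matching the remaining length. Real slides
 * are emitted in asm with entry points every 32 bytes; this C version
 * steps 8 bytes at a time and handles up to 32 bytes. */
static void memcpy_slide(unsigned char *dst, const unsigned char *src,
                         size_t n)
{
    size_t blocks = n / 8;                   /* whole 8-byte steps     */
    size_t tail   = n % 8;                   /* leftover handled first */
    memcpy(dst + blocks * 8, src + blocks * 8, tail);
    switch (blocks) {                        /* "branch into the slide" */
    case 4: memcpy(dst + 24, src + 24, 8);   /* fall through */
    case 3: memcpy(dst + 16, src + 16, 8);   /* fall through */
    case 2: memcpy(dst + 8,  src + 8,  8);   /* fall through */
    case 1: memcpy(dst,      src,      8);   /* fall through */
    case 0: break;
    }
}
```

Note that the 63-byte objection above maps onto the `tail` fix-up here: the software slide needs that extra pre-step, where a hardware sequencer just masks a byte enable.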

> For looping memcpy, it makes sense to copy 64 or 128 bytes per loop
> iteration or so to try to limit looping overhead.

On low end machines, you want to operate at cache port width,
On high end machines, you want to operate at cache line widths per port.
This is essentially impossible using slides.....here, the same code is
not optimal across a line of implementations.

> Though, leveraging the memcpy slide for the interior part of the copy
> could be possible in theory as well.

What do you do when the SATA drive wants to write a whole page ??

> For LZ memcpy, it is typically smaller, as LZ copies tend to be a lot
> shorter (a big part of LZ decoder performance mostly being in
> fine-tuning the logic for the match copies).

> Though, this is part of why my runtime library had added a
> "_memlzcpy(dst, src, len)" and "_memlzcpyf(dst, src, len)" functions,
> which can consolidate this rather than needing to do it one-off for each
> LZ decoder (as I see it, it is a similar issue to not wanting code to
> endlessly re-roll stuff for functions like memcpy or malloc/free, *).

> *: Though, nevermind that the standard C interface for malloc is
> annoyingly minimal, and ends up requiring most non-trivial programs to
> roll their own memory management.

>>> Ended up doing these with "slides", which end up eating roughly
>>> several kB of code space, but was more compact than using larger
>>> inline copies.
>>
>>
>>> Say (IIRC):
>>>    128 bytes or less: Inline Ld/St sequence
>>>    129 bytes to 512B: Slide
>>>    Over 512B: Call "memcpy()" or similar.
>>
>> Versus::
>>     1-infinity: use MM instruction.
>>

> Yeah, but it makes the CPU logic more expensive.

By what, 37-gates ??

>>> The slide generally has entry points in multiples of 32 bytes, and
>>> operates in reverse order. So, if not a multiple of 32 bytes, the last
>>> bytes need to be handled externally prior to branching into the slide.
>>
>> Does this remain sequentially consistent ??
>>

> Within a thread, it is fine.

What if a SATA drive is reading while you are writing !!
That is, DMA is no different than multi-threaded applications--except
DMA cannot perform locks.

> Main wonk is that it does start copying from the high address first.
> Presumably interrupts or similar won't be messing with application memory
> mid-memcpy.

The only things wanting high-low access patterns are dumping stuff to the
stack. The fact you CAN get away with it most of the time is no excuse.

> The looping memcpy's generally work from low to high addresses though.

As does all string processing.

Re: "Mini" tags to reduce the number of op codes

<uv71os$17d11$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=38265&group=comp.arch#38265

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: bohannonindustriesllc@gmail.com (BGB-Alt)
Newsgroups: comp.arch
Subject: Re: "Mini" tags to reduce the number of op codes
Date: Wed, 10 Apr 2024 16:53:32 -0500
Organization: A noiseless patient Spider
Lines: 263
Message-ID: <uv71os$17d11$1@dont-email.me>
References: <uuk100$inj$1@dont-email.me>
<6mqu0j1jf5uabmm6r2cb2tqn6ng90mruvd@4ax.com>
<15d1f26c4545f1dbae450b28e96e79bd@www.novabbs.org>
<lf441jt9i2lv7olvnm9t7bml2ib19eh552@4ax.com> <uuv1ir$30htt$1@dont-email.me>
<d71c59a1e0342d0d01f8ce7c0f449f9b@www.novabbs.org>
<uv02dn$3b6ik$1@dont-email.me> <uv415n$ck2j$1@dont-email.me>
<uv46rg$e4nb$1@dont-email.me>
<a81256dbd4f121a9345b151b1280162f@www.novabbs.org>
<uv4ghh$gfsv$1@dont-email.me> <uv56ec$ooj6$1@dont-email.me>
<uv5fqf$qs8a$1@dont-email.me> <uv6nea$14d6r$2@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 10 Apr 2024 23:53:33 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="e567a56649390981af7b87f0b83f34a3";
logging-data="1291297"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+q3sgH1bbi3B5OE684luXdE0b1VtxQgYU="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:xgJO6+iLcWCXDNf4ErOdpLruGbM=
In-Reply-To: <uv6nea$14d6r$2@dont-email.me>
Content-Language: en-US
 by: BGB-Alt - Wed, 10 Apr 2024 21:53 UTC

On 4/10/2024 1:57 PM, Chris M. Thomasson wrote:
> On 4/10/2024 12:41 AM, BGB wrote:
>> On 4/10/2024 12:01 AM, Chris M. Thomasson wrote:
>>> On 4/9/2024 3:47 PM, BGB-Alt wrote:
>>>> On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
>>>>> BGB wrote:
>>>>>
>>>>>> On 4/9/2024 1:24 PM, Thomas Koenig wrote:
>>>>>>> I wrote:
>>>>>>>
>>>>>>>> MitchAlsup1 <mitchalsup@aol.com> schrieb:
>>>>>>>>> Thomas Koenig wrote:
>>>>>>>>>
>>>>>>> Maybe one more thing: In order to justify the more complex encoding,
>>>>>>> I was going for 64 registers, and that didn't work out too well
>>>>>>> (missing bits).
>>>>>>>
>>>>>>> Having learned about M-Core in the meantime, pure 32-register,
>>>>>>> 21-bit instruction ISA might actually work better.
>>>>>
>>>>>
>>>>>> For 32-bit instructions at least, 64 GPRs can work out OK.
>>>>>
>>>>>> Though, the gain of 64 over 32 seems to be fairly small for most
>>>>>> "typical" code, mostly bringing a benefit if one is spending a lot
>>>>>> of CPU time in functions that have large numbers of local
>>>>>> variables all being used at the same time.
>>>>>
>>>>>
>>>>>> Seemingly:
>>>>>> 16/32/48 bit instructions, with 32 GPRs, seems likely optimal for
>>>>>> code density;
>>>>>> 32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
>>>>>> performance.
>>>>>
>>>>>> Where, 16 GPRs isn't really enough (lots of register spills), and
>>>>>> 128 GPRs is wasteful (would likely need lots of monster functions
>>>>>> with 250+ local variables to make effective use of this, *, which
>>>>>> probably isn't going to happen).
>>>>>
>>>>> 16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not
>>>>> part of GPRs AND you have good access to constants.
>>>>>
>>>>
>>>> On the main ISA's I had tried to generate code for, 16 GPRs was kind
>>>> of a pain as it resulted in fairly high spill rates.
>>>>
>>>> Though, it would probably be less bad if the compiler was able to
>>>> use all of the registers at the same time without stepping on itself
>>>> (such as dealing with register allocation involving scratch
>>>> registers while also not conflicting with the use of function
>>>> arguments, ...).
>>>>
>>>>
>>>> My code generators had typically only used callee save registers for
>>>> variables in basic blocks which ended in a function call (in my
>>>> compiler design, both function calls and branches terminating the
>>>> current basic-block).
>>>>
>>>> On SH, the main way of getting constants (larger than 8 bits) was
>>>> via PC-relative memory loads, which kinda sucked.
>>>>
>>>>
>>>> This is slightly less bad on x86-64, since one can use memory
>>>> operands with most instructions, and the CPU tends to deal fairly
>>>> well with code that has lots of spill-and-fill. This along with
>>>> instructions having access to 32-bit immediate values.
>>>>
>>>>
>>>>>> *: Where, it appears it is most efficient (for non-leaf functions)
>>>>>> if the number of local variables is roughly twice that of the
>>>>>> number of CPU registers. If more local variables than this, then
>>>>>> spill/fill rate goes up significantly, and if less, then the
>>>>>> registers aren't utilized as effectively.
>>>>>
>>>>>> Well, except in "tiny leaf" functions, where the criteria is
>>>>>> instead that the number of local variables be less than the number
>>>>>> of scratch registers. However, for many/most small leaf functions,
>>>>>> the total number of variables isn't all that large either.
>>>>>
>>>>> The vast majority of leaf functions use less than 16 GPRs, given
>>>>> one has
>>>>> a SP not part of GPRs {including arguments and return values}. Once
>>>>> one starts placing things like memove(), memset(), sin(), cos(),
>>>>> exp(), log()
>>>>> in the ISA, it goes up even more.
>>>>>
>>>>
>>>> Yeah.
>>>>
>>>> Things like memcpy/memmove/memset/etc, are function calls in cases
>>>> when not directly transformed into register load/store sequences.
>>>>
>>>> Did end up with an intermediate "memcpy slide", which can handle
>>>> medium size memcpy and memset style operations by branching into a
>>>> slide.
>>>>
>>>>
>>>>
>>>> As noted, on a 32 GPR machine, most leaf functions can fit entirely
>>>> in scratch registers. On a 64 GPR machine, this percentage is
>>>> slightly higher (but, not significantly, since there are few leaf
>>>> functions remaining at this point).
>>>>
>>>>
>>>> If one had a 16 GPR machine with 6 usable scratch registers, it is a
>>>> little harder though (as typically these need to cover both any
>>>> variables used by the function, and any temporaries used, ...).
>>>> There are a whole lot more leaf functions that exceed a limit of 6
>>>> than of 14.
>>>>
>>>> But, say, a 32 GPR machine could still do well here.
>>>>
>>>>
>>>> Note that there are reasons why I don't claim 64 GPRs as a large
>>>> performance advantage:
>>>> On programs like Doom, the difference is small at best.
>>>>
>>>>
>>>> It mostly affects things like GLQuake in my case, mostly because
>>>> TKRA-GL has a lot of functions with a large numbers of local
>>>> variables (some exceeding 100 local variables).
>>>>
>>>> Partly though this is due to code that is highly inlined and
>>>> unrolled and uses lots of variables tending to perform better in my
>>>> case (and tightly looping code, with lots of small functions, not so
>>>> much...).
>>>>
>>>>
>>>>>
>>>>>> Where, function categories:
>>>>>>    Tiny Leaf:
>>>>>>      Everything fits in scratch registers, no stack frame, no calls.
>>>>>>    Leaf:
>>>>>>      No function calls (either explicit or implicit);
>>>>>>      Will have a stack frame.
>>>>>>    Non-Leaf:
>>>>>>      May call functions, has a stack frame.
>>>>>
>>>>> You are forgetting about FP, GOT, TLS, and whatever resources are
>>>>> required
>>>>> to do try-throw-catch stuff as demanded by the source language.
>>>>>
>>>>
>>>> Yeah, possibly true.
>>>>
>>>> In my case:
>>>>    There is no frame pointer, as BGBCC doesn't use one;
>>>>      All stack-frames are fixed size, VLA's and alloca use the heap;
>>>>    GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
>>>>    TLS, accessed via TBR.[...]
>>>
>>> alloca using the heap? Strange to me...
>>>
>>
>> Well, in this case:
>> The alloca calls are turned into calls which allocate the memory blob
>> and add it to a linked list;
>> when the function returns, everything in the linked list is freed;
>> Then, it internally pulls this off via malloc and free.
>>
>> Also the typical default stack size in this case is 128K, so trying to
>> put big allocations on the stack is more liable to result in a stack
>> overflow.
>>
>> Bigger stack needs more memory, so is not ideal for NOMMU use. Luckily
>> heap allocation is not too slow in this case.
>>
>>
>> Though, at the same time, ideally one limits use of language features
>> where the code-generation degenerates into a mess of hidden runtime
>> calls. These cases are not ideal for performance...
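The heap-backed alloca described above can be sketched in C roughly as follows (names like `my_alloca` are hypothetical, not BGBCC's actual runtime calls): each allocation is linked into a per-function list, and an epilogue call frees the whole list when the function returns.

```c
#include <stdlib.h>

/* Hypothetical sketch of heap-backed alloca: each call links a heap
 * block into a per-function list; the function epilogue frees them all. */
struct alloca_blk {
    struct alloca_blk *next;   /* list of blocks for this function */
    /* user payload follows the header */
};

static void *my_alloca(struct alloca_blk **head, size_t n)
{
    struct alloca_blk *b = malloc(sizeof *b + n);
    if (!b)
        return NULL;
    b->next = *head;           /* push onto the function's list */
    *head = b;
    return b + 1;              /* user memory starts after the header */
}

static void my_alloca_free_all(struct alloca_blk **head)
{
    while (*head) {            /* called on function return */
        struct alloca_blk *b = *head;
        *head = b->next;
        free(b);
    }
}
```

A real implementation would also pad the header so the payload meets max_align_t alignment; that detail is omitted here for brevity.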
>>
>>
>
> Sometimes alloca is useful wrt offsetting the stack to avoid false
> sharing between stacks. Intel wrote a little paper that addresses this:
>
> https://www.intel.com/content/dam/www/public/us/en/documents/training/developing-multithreaded-applications.pdf
>
> Remember that one?


Re: "Mini" tags to reduce the number of op codes

<uv71v2$17d11$2@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=38266&group=comp.arch#38266

Newsgroups: comp.arch
From: bohannonindustriesllc@gmail.com (BGB-Alt)
Newsgroups: comp.arch
Subject: Re: "Mini" tags to reduce the number of op codes
Date: Wed, 10 Apr 2024 16:56:51 -0500
Organization: A noiseless patient Spider
Lines: 31
Message-ID: <uv71v2$17d11$2@dont-email.me>
References: <uuk100$inj$1@dont-email.me>
<lf441jt9i2lv7olvnm9t7bml2ib19eh552@4ax.com> <uuv1ir$30htt$1@dont-email.me>
<d71c59a1e0342d0d01f8ce7c0f449f9b@www.novabbs.org>
<uv02dn$3b6ik$1@dont-email.me> <uv415n$ck2j$1@dont-email.me>
<uv46rg$e4nb$1@dont-email.me>
<a81256dbd4f121a9345b151b1280162f@www.novabbs.org>
<uv4ghh$gfsv$1@dont-email.me>
<8e61b7c856aff15374ab3cc55956be9d@www.novabbs.org>
<uv5err$ql29$1@dont-email.me>
<e43623eb10619eb28a68b2bd7af93390@www.novabbs.org>
<S%zRN.162255$_a1e.120745@fx16.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 10 Apr 2024 23:56:51 +0200 (CEST)
User-Agent: Mozilla Thunderbird
Content-Language: en-US
In-Reply-To: <S%zRN.162255$_a1e.120745@fx16.iad>
 by: BGB-Alt - Wed, 10 Apr 2024 21:56 UTC

On 4/10/2024 12:29 PM, Scott Lurndal wrote:
> mitchalsup@aol.com (MitchAlsup1) writes:
>> BGB wrote:
>>
>>> On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
>>>> BGB-Alt wrote:
>>>>
>>
>>> Also the blob of constants needed to be within 512 bytes of the load
>>> instruction, which was also kind of an evil mess for branch handling
>>> (and extra bad if one needed to spill the constants in the middle of a
>>> basic block and then branch over it).
>>
>> In My 66000 case, the constant is the word following the instruction.
>> Easy to find, easy to access, no register pollution, no DCache pollution.
>
> It does occupy some icache space, however; have you boosted the icache
> size to compensate?

FWIW, in my case:
32K I$ + 32K D$ does fairly well IME;

16K I$ + 32K D$ works well for Doom, but has a noticeably higher I$ miss
rate for Quake and similar (and most other non-Doom programs); Doom, by
contrast, seems to be pretty much D$ bound.

Constants are generally encoded inline in BJX2.

....

Re: "Mini" tags to reduce the number of op codes

<8b6bcc78355b8706235b193ad2243ad0@www.novabbs.org>


https://news.novabbs.org/devel/article-flat.php?id=38267&group=comp.arch#38267

Newsgroups: comp.arch
Date: Wed, 10 Apr 2024 23:30:02 +0000
Subject: Re: "Mini" tags to reduce the number of op codes
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <uuk100$inj$1@dont-email.me> <lf441jt9i2lv7olvnm9t7bml2ib19eh552@4ax.com> <uuv1ir$30htt$1@dont-email.me> <d71c59a1e0342d0d01f8ce7c0f449f9b@www.novabbs.org> <uv02dn$3b6ik$1@dont-email.me> <uv415n$ck2j$1@dont-email.me> <uv46rg$e4nb$1@dont-email.me> <a81256dbd4f121a9345b151b1280162f@www.novabbs.org> <uv4ghh$gfsv$1@dont-email.me> <8e61b7c856aff15374ab3cc55956be9d@www.novabbs.org> <uv5err$ql29$1@dont-email.me> <e43623eb10619eb28a68b2bd7af93390@www.novabbs.org> <S%zRN.162255$_a1e.120745@fx16.iad>
Organization: Rocksolid Light
Message-ID: <8b6bcc78355b8706235b193ad2243ad0@www.novabbs.org>
 by: MitchAlsup1 - Wed, 10 Apr 2024 23:30 UTC

Scott Lurndal wrote:

> mitchalsup@aol.com (MitchAlsup1) writes:
>>BGB wrote:
>>
>>
>>In My 66000 case, the constant is the word following the instruction.
>>Easy to find, easy to access, no register pollution, no DCache pollution.

> It does occupy some icache space, however; have you boosted the icache
> size to compensate?

The space occupied in the ICache is freed up from being in the DCache
so the overall hit rate goes up !! At typical sizes, ICache miss rate
is about ¼ the miss rate of DCache.

Besides:: if you had to LD the constant from memory, you use a LD instruction
and 1 or 2 words in DCache, while consuming a GPR. So, overall, it takes
fewer cycles, fewer GPRs, and fewer instructions.

Alternatively:: if you paste constants together (LUI, AUIPC) you have no
direct route to either 64-bit constants or 64-bit address spaces.

It looks to be a win-win !!

Re: "Mini" tags to reduce the number of op codes

<uv7h9k$1ek3q$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=38268&group=comp.arch#38268

Newsgroups: comp.arch
From: paaronclayton@gmail.com (Paul A. Clayton)
Newsgroups: comp.arch
Subject: Re: "Mini" tags to reduce the number of op codes
Date: Wed, 10 Apr 2024 22:18:25 -0400
Organization: A noiseless patient Spider
Lines: 69
Message-ID: <uv7h9k$1ek3q$1@dont-email.me>
References: <uuk100$inj$1@dont-email.me>
<6mqu0j1jf5uabmm6r2cb2tqn6ng90mruvd@4ax.com>
<15d1f26c4545f1dbae450b28e96e79bd@www.novabbs.org>
<lf441jt9i2lv7olvnm9t7bml2ib19eh552@4ax.com> <uuv1ir$30htt$1@dont-email.me>
<d71c59a1e0342d0d01f8ce7c0f449f9b@www.novabbs.org>
<uv02dn$3b6ik$1@dont-email.me> <uv415n$ck2j$1@dont-email.me>
<uv46rg$e4nb$1@dont-email.me>
<a81256dbd4f121a9345b151b1280162f@www.novabbs.org>
<uv4ghh$gfsv$1@dont-email.me>
<8e61b7c856aff15374ab3cc55956be9d@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 11 Apr 2024 04:18:28 +0200 (CEST)
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.0
In-Reply-To: <8e61b7c856aff15374ab3cc55956be9d@www.novabbs.org>
 by: Paul A. Clayton - Thu, 11 Apr 2024 02:18 UTC

On 4/9/24 8:28 PM, MitchAlsup1 wrote:
> BGB-Alt wrote:
[snip]
>> Things like memcpy/memmove/memset/etc, are function calls in
>> cases when not directly transformed into register load/store
>> sequences.
>
> My 66000 does not convert them into LD-ST sequences, MM is a
> single instruction.

I wonder if it would be useful to have an immediate count form of
memory move. Copying fixed-size structures would be able to use an
immediate. Aside from not having to load an immediate for such
cases, there might be microarchitectural benefits to using a
constant. Since fixed-sized copies would likely be limited to
smaller regions (with the possible exception of 8 MiB page copies)
and the overhead of loading a constant for large sizes would be
tiny, only providing a 16-bit immediate form might be reasonable.

>> Did end up with an intermediate "memcpy slide", which can handle
>> medium size memcpy and memset style operations by branching into
>> a slide.
>
> MMs and MSs that do not cross page boundaries are ATOMIC. The
> entire system
> sees only the before or only the after state and nothing in
> between.

I still feel that this atomicity should somehow be included with
ESM just because they feel related, but the benefit seems likely
to be extremely small. How often would software want to copy
multiple regions atomically or combine region copying with
ordinary ESM atomicity?? There *might* be some use for an atomic
region copy and an updating of a separate data structure (moving a
structure and updating one or a very few pointers??). For
structures three cache lines in size where only one region
occupies four cache lines, ordinary ESM could be used.

My feeling based on "relatedness" is not a strong basis for such
an architectural design choice.

(Simple page masking would allow false conflicts when smaller
memory moves are used. If there is a separate pair of range
registers that is checked for coherence of memory moves, this
issue would only apply for multiple memory moves _and_ all eight
of the buffer entries could be used for smaller accesses.)

[snip]
>> As noted, on a 32 GPR machine, most leaf functions can fit
>> entirely in scratch registers.
>
> Which is why one can blow GPRs for SP, FP, GOT, TLS, ... without
> getting totally screwed.

I wonder how many instructions would have to have access to such a
set of "special registers" and if a larger number of extra
registers would be useful. (One of the issues — in my opinion —
with PowerPC's link register and count register was that they
could not be directly loaded from or stored to memory [or loaded
with a constant from the instruction stream]. For counted loops,
loading the count register from the instruction stream would
presumably have allowed early branch determination even for deep
pipelines and small loop counts.) SP, FP, GOT, and TLS hold
"stable values", which might facilitate some microarchitectural
optimizations compared to more frequently modified register names.

(I am intrigued by the possibility of small contexts for some
multithreaded workloads, similar to how some GPUs allow variable
context sizes.)

Re: "Mini" tags to reduce the number of op codes

<uv7kit$1fc2u$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=38269&group=comp.arch#38269

Newsgroups: comp.arch
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: "Mini" tags to reduce the number of op codes
Date: Wed, 10 Apr 2024 22:14:33 -0500
Organization: A noiseless patient Spider
Lines: 393
Message-ID: <uv7kit$1fc2u$1@dont-email.me>
References: <uuk100$inj$1@dont-email.me>
<6mqu0j1jf5uabmm6r2cb2tqn6ng90mruvd@4ax.com>
<15d1f26c4545f1dbae450b28e96e79bd@www.novabbs.org>
<lf441jt9i2lv7olvnm9t7bml2ib19eh552@4ax.com> <uuv1ir$30htt$1@dont-email.me>
<d71c59a1e0342d0d01f8ce7c0f449f9b@www.novabbs.org>
<uv02dn$3b6ik$1@dont-email.me> <uv415n$ck2j$1@dont-email.me>
<uv46rg$e4nb$1@dont-email.me>
<a81256dbd4f121a9345b151b1280162f@www.novabbs.org>
<uv4ghh$gfsv$1@dont-email.me>
<8e61b7c856aff15374ab3cc55956be9d@www.novabbs.org>
<uv5err$ql29$1@dont-email.me>
<e43623eb10619eb28a68b2bd7af93390@www.novabbs.org>
<uv6u3r$16g41$1@dont-email.me>
<9fb548d5b81e65bf1ececd070d8085c9@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 11 Apr 2024 05:14:38 +0200 (CEST)
User-Agent: Mozilla Thunderbird
In-Reply-To: <9fb548d5b81e65bf1ececd070d8085c9@www.novabbs.org>
Content-Language: en-US
 by: BGB - Thu, 11 Apr 2024 03:14 UTC

On 4/10/2024 4:19 PM, MitchAlsup1 wrote:
> BGB-Alt wrote:
>
>> On 4/10/2024 12:12 PM, MitchAlsup1 wrote:
>>> BGB wrote:
>>>
>>>> On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
>>>>> BGB-Alt wrote:
>>>>>
>>>
>>>> Also the blob of constants needed to be within 512 bytes of the load
>>>> instruction, which was also kind of an evil mess for branch handling
>>>> (and extra bad if one needed to spill the constants in the middle of
>>>> a basic block and then branch over it).
>>>
>>> In My 66000 case, the constant is the word following the instruction.
>>> Easy to find, easy to access, no register pollution, no DCache
>>> pollution.
>>>
>
>> Yeah.
>
>> This was why some of the first things I did when I started extending
>> SH-4 were:
>> Adding mechanisms to build constants inline;
>> Adding Load/Store ops with a displacement (albeit with encodings
>> borrowed from SH-2A);
>> Adding 3R and 3RI encodings (originally Imm8 for 3RI).
>
> My suggestion is that:: "Now that you have screwed around for a while,
> Why not take that experience and do a new ISA without any of those
> mistakes in it" ??
>

There was a reboot, it became BJX2.
This, of course, has developed some of its own hair...

Where, BJX1 was a modified SuperH, and BJX2 was a redesigned ISA
that was "mostly backwards compatible" at the ASM level.

Granted, possibly I could have gone further, such as no longer having
the stack pointer in R15, but alas...

Though, in some areas, SH had features that I had dropped as well, such
as auto-increment addressing and delay slots.

>> Did have a mess when I later extended the ISA to 32 GPRs, as (like
>> with BJX2 Baseline+XGPR) only part of the ISA had access to R16..R31.
>
>
>>>> Usually they were spilled between basic-blocks, with the basic-block
>>>> needing to branch to the following basic-block in these cases.
>>>
>>>> Also 8-bit branch displacements are kinda lame, ...
>>>
>>> Why do that to yourself ??
>>>
>
>> I didn't design SuperH, Hitachi did...
>
> But you did not fix them en masse, and you complain about them
> at least once a week. There comes a time when it takes less time
> and less courage to do that big switch and clean up all that mess.
>

For the most part, BJX2 is using 20-bit branches for 32-bit ops.

Exceptions being the Compare-and-Branch, and Compare-Zero-and-Branch
ops, but this is mostly because there wasn't enough encoding space to
give them larger displacements.

BREQ.Q Rn, Disp11s
BREQ.Q Rm, Rn, Disp8s

There are Disp32s variants available, just that these involve using a
Jumbo prefix.

>
>> But, with BJX1, I had added Disp16 branches.
>
>> With BJX2, they were replaced with 20 bit branches. These have the
>> merit of being able to branch anywhere within a Doom or Quake sized
>> binary.
>
>
>>>> And, if one wanted a 16-bit branch:
>>>>    MOV.W (PC, 4), R0  //load a 16-bit branch displacement
>>>>    BRA/F R0
>>>>    .L0:
>>>>    NOP    // delay slot
>>>>    .WORD $(Label - .L0)
>>>
>>>> Also kinda bad...
>>>
>>> Can you say Yech !!
>>>
>
>> Yeah.
>> This sort of stuff created strong incentive for ISA redesign...
>
> Maybe consider now as the appropriate time to start.
>

The above was for SuperH; this sort of thing is N/A for BJX2.

In this case, BJX2 can pull it off in a single instruction.

Nonetheless, even with all this crap, the SuperH was still seen as
sufficient for the Sega 32X/Saturn/Dreamcast (and the Naomi and Hikaru
arcade machine boards, ...).

Though, it seems Sega later jumped ship from SuperH to using low-end x86
PC motherboards in later arcade machines.

>> Granted, it is possible had I instead started with RISC-V instead of
>> SuperH, it is probable BJX2 wouldn't exist.
>
>
>> Though, at the time, the original thinking was that SuperH having
>> smaller instructions meant it would have better code density than
>> RV32I or similar. Turns out not really, as the penalty of the 16 bit
>> ops was needing almost twice as many on average.
>
> My 66000 only requires 70% of the instruction count of RISC-V,
> Yours could too ................
>

At this point, I suspect the main reason I am not (entirely) beating
RV64G is compiler issues...

So, the ".text" section is still around 10% bigger, with some amount of
this being spent on Jumbo prefixes, and the rest due to cases where code
generation falls short.

>>>>>> Things like memcpy/memmove/memset/etc, are function calls in cases
>>>>>> when not directly transformed into register load/store sequences.
>>>>>
>>>>> My 66000 does not convert them into LD-ST sequences, MM is a single
>>>>> instruction.
>>>>>
>>>
>>>> I have no high-level memory move/copy/set instructions.
>>>> Only loads/stores...
>>>
>>> You have the power to fix it.........
>>>
>
>> But, at what cost...
>
> You would not have to spend hours a week defending the indefensible !!
>
>> I had generally avoided anything that would have required microcode or
>> shoving state-machines into the pipeline or similar.
>
> Things as simple as IDIV and FDIV require sequencers.
> But LDM, STM, MM require sequencers simpler than IDIV and FDIV !!
>

Not so much in my case.

IDIV and FDIV:
Feed inputs into Shift-Add unit;
Stall pipeline for a predefined number of clock cycles;
Grab result out of the other end (at which point, pipeline resumes).

In this case, the FDIV was based on noting that if one lets the
Shift-Add unit run for longer, it moves from doing an integer divide to
doing a fractional divide, so I could make it perform an FDIV merely by
feeding the mantissas into it (as two big integers) and doubling the
latency. Then glue on some extra logic to figure out the exponents and
pack/unpack Binary64, and, done.

Not really the same thing at all...

Apart from tending to get stomped every time one does an integer
divide, it could possibly also be used as an RNG, as it basically churns
over whatever random bits flow into it from the pipeline.
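The fractional-divide observation above can be sanity-checked with a small C model of a radix-2 shift-subtract divider (my own sketch, not the actual hardware): run the loop for 32 cycles and you get the integer quotient; let the same loop run longer and the extra cycles emit fractional quotient bits.

```c
#include <stdint.h>

/* Toy model of a radix-2 shift-subtract divider (assumes d != 0).
 * cycles = 32     -> q is the integer quotient n / d.
 * cycles = 32 + f -> q gains f fractional bits (fixed point),
 * i.e. the same loop run longer performs a fractional divide. */
static uint64_t shift_sub_divide(uint32_t n, uint32_t d, int cycles)
{
    uint64_t num = (uint64_t)n << 32;   /* numerator bits, MSB first */
    uint64_t rem = 0, q = 0;
    for (int i = 0; i < cycles; i++) {
        rem = (rem << 1) | (num >> 63); /* shift in next numerator bit */
        num <<= 1;                      /* after 32 cycles: zeros */
        q <<= 1;
        if (rem >= d) {                 /* trial subtract */
            rem -= d;
            q |= 1;
        }
    }
    return q;
}
```

For example, shift_sub_divide(7, 2, 32) gives 3, while one extra cycle gives 7, i.e. 3.5 in Q1 fixed point — feed in mantissas and this is an FDIV core.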

>> Things like Load/Store-Multiple or
>
> If you like polluted ICaches..............
>
>>>> For small copies, can encode them inline, but past a certain size
>>>> this becomes too bulky.
>>>
>>>> A copy loop makes more sense for bigger copies, but has a high
>>>> overhead for small to medium copy.
>>>
>>>
>>>> So, there is a size range where doing it inline would be too bulky,
>>>> but a loop caries an undesirable level of overhead.
>>>
>>> All the more reason to put it (a highly useful unit of work) into an
>>> instruction.
>>>
>
>> This is an area where "slides" work well, the main cost is mostly the
>> bulk that the slide adds to the binary (albeit, it is one-off).
>
> Consider that the predictor getting into the slide the first time
> always mispredicts !!
>

Possibly.

But, note that the paths headed into the slide are things like structure
assignment and "memcpy()" where the size is constant. So, in these
cases, the compiler already knows where it is branching.

So, say:
  memcpy(dst, src, 512);
gets compiled as, effectively:
  MOV dst, R4
  MOV src, R5
  BSR __memcpy64_512_ua

>> Which is why it is a 512B memcpy slide vs, say, a 4kB memcpy slide...
>
> What if you only wanted to copy 63 bytes ?? Your DW slide fails miserably,
> yet a HW sequencer only has to avoid asserting a single byte write enable
> once.
>

Two strategies:
  The compiler pads the copy to 64 bytes (typical for struct copies,
  where structs can always be padded up to their natural alignment);
  It emits code for copying the last N bytes (modulo 32) and then
  branches into the slide (typical for memcpy).
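The branch-into-the-slide control flow can be modeled in portable C with a fall-through switch (Duff's-device style). This is only an illustrative sketch: the real slide is a much longer unrolled run of load/store pairs, entered via a computed branch rather than a switch.

```c
#include <stdint.h>

/* Sketch of a tiny "memcpy slide": one unrolled copy sequence with
 * multiple entry points. Entering at case N copies exactly N 64-bit
 * words; entering further down copies fewer. */
static void memcpy_slide(uint64_t *dst, const uint64_t *src, int nwords)
{
    switch (nwords) {              /* "branch into the slide" */
    case 8: dst[7] = src[7];       /* fall through */
    case 7: dst[6] = src[6];       /* fall through */
    case 6: dst[5] = src[5];       /* fall through */
    case 5: dst[4] = src[4];       /* fall through */
    case 4: dst[3] = src[3];       /* fall through */
    case 3: dst[2] = src[2];       /* fall through */
    case 2: dst[1] = src[1];       /* fall through */
    case 1: dst[0] = src[0];       /* fall through */
    case 0: break;
    }
}
```

The appeal is that a constant-size copy costs one branch into the slide and zero loop overhead; the cost is the one-off code-size bulk of the slide itself.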

For variable memcpy, there is an extension:
_memcpyf(void *dst, void *src, size_t len);

Which is basically the "I don't care if it copies a little extra"
version (say, where it may pad the copy up to a multiple of 16 bytes).
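The _memcpyf contract as described can be sketched as follows (the rounding granularity of 16 is the "say" value above, and the helper name my_memcpyf is mine, not BGB's actual runtime symbol):

```c
#include <stddef.h>
#include <string.h>

/* Sketch of the _memcpyf contract described above: the copy length
 * may be padded up to a multiple of 16 bytes, so the caller must
 * guarantee both buffers are accessible out to the padded length.
 * A real implementation would use this slack to run a wide copy
 * loop with no byte-granular tail handling. */
static void my_memcpyf(void *dst, const void *src, size_t len)
{
    size_t padded = (len + 15) & ~(size_t)15; /* round up to 16 */
    memcpy(dst, src, padded);                 /* stand-in for the fast loop */
}
```

So a 63-byte request becomes a single 64-byte copy, which is exactly the case where an exact-length slide or sequencer has to special-case the final partial chunk.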

>> For looping memcpy, it makes sense to copy 64 or 128 bytes per loop
>> iteration or so to try to limit looping overhead.
>
> On low end machines, you want to operate at cache port width,
> On high end machines, you want to operate at cache line widths per port.
> This is essentially impossible using slides.....here, the same code is
> not optimal across a line of implementations.
>


Re: "Mini" tags to reduce the number of op codes

<uv7l00$1fc2u$2@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=38270&group=comp.arch#38270

From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Date: Wed, 10 Apr 2024 22:21:33 -0500
In-Reply-To: <uv7h9k$1ek3q$1@dont-email.me>
 by: BGB - Thu, 11 Apr 2024 03:21 UTC

On 4/10/2024 9:18 PM, Paul A. Clayton wrote:
> On 4/9/24 8:28 PM, MitchAlsup1 wrote:
>> BGB-Alt wrote:
> [snip]
>>> Things like memcpy/memmove/memset/etc, are function calls in cases
>>> when not directly transformed into register load/store sequences.
>>
>> My 66000 does not convert them into LD-ST sequences, MM is a single
>> instruction.
>
> I wonder if it would be useful to have an immediate count form of
> memory move. Copying fixed-size structures would be able to use an
> immediate. Aside from not having to load an immediate for such
> cases, there might be microarchitectural benefits to using a
> constant. Since fixed-sized copies would likely be limited to
> smaller regions (with the possible exception of 8 MiB page copies)
> and the overhead of loading a constant for large sizes would be
> tiny, only providing a 16-bit immediate form might be reasonable.
>

As noted, in my case, the whole thing of Ld/St sequences, and memcpy
slides, mostly applies to constant cases.

If the copy size is variable, the compiler merely calls "memcpy()",
which then figures out at run time which loop to use, and one pays the
overhead of that dispatch on every call.
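That runtime figuring-out is essentially a size dispatch. A minimal sketch of what a generic memcpy() does when the length is only known at run time (my own illustration, not BGB's library code):

```c
#include <stddef.h>
#include <string.h>

/* Sketch of runtime memcpy dispatch: pick a copy loop based on the
 * size argument. The branchy setup here is exactly the per-call
 * overhead that constant-size copies avoid by being lowered to
 * load/store sequences or specialized entry points at compile time. */
static void my_memcpy(void *dstv, const void *srcv, size_t n)
{
    unsigned char *d = dstv;
    const unsigned char *s = srcv;
    if (n < 16) {                      /* small: simple byte loop */
        while (n--)
            *d++ = *s++;
        return;
    }
    while (n >= 8) {                   /* medium+: 8 bytes at a time */
        memcpy(d, s, 8);               /* stands in for a 64-bit load/store */
        d += 8; s += 8; n -= 8;
    }
    while (n--)                        /* byte-granular tail */
        *d++ = *s++;
}
```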

>>> Did end up with an intermediate "memcpy slide", which can handle
>>> medium size memcpy and memset style operations by branching into a
>>> slide.
>>
>> MMs and MSs that do not cross page boundaries are ATOMIC. The entire
>> system
>> sees only the before or only the after state and nothing in between.
>
> I still feel that this atomicity should somehow be included with
> ESM just because they feel related, but the benefit seems likely
> to be extremely small. How often would software want to copy
> multiple regions atomically or combine region copying with
> ordinary ESM atomicity?? There *might* be some use for an atomic
> region copy and an updating of a separate data structure (moving a
> structure and updating one or a very few pointers??). For
> structures three cache lines in size where only one region
> occupies four cache lines, ordinary ESM could be used.
>
> My feeling based on "relatedness" is not a strong basis for such
> an architectural design choice.
>
> (Simple page masking would allow false conflicts when smaller
> memory moves are used. If there is a separate pair of range
> registers that is checked for coherence of memory moves, this
> issue would only apply for multiple memory moves _and_ all eight
> of the buffer entries could be used for smaller accesses.)
>

All seems a bit complicated to me.

But, as noted, I went for a model of weak memory coherence and leaving
most of this stuff for software to sort out.

> [snip]
>>> As noted, on a 32 GPR machine, most leaf functions can fit entirely
>>> in scratch registers.
>>
>> Which is why one can blow GPRs for SP, FP, GOT, TLS, ... without
>> getting totally screwed.
>
> I wonder how many instructions would have to have access to such a
> set of "special registers" and if a larger number of extra
> registers would be useful. (One of the issues — in my opinion —
> with PowerPC's link register and count register was that they
> could not be directly loaded from or stored to memory [or loaded
> with a constant from the instruction stream]. For counted loops,
> loading the count register from the instruction stream would
> presumably have allowed early branch determination even for deep
> pipelines and small loop counts.) SP, FP, GOT, and TLS hold
> "stable values", which might facilitate some microarchitectural
> optimizations compared to more frequently modified register names.
>
> (I am intrigued by the possibility of small contexts for some
> multithreaded workloads, similar to how some GPUs allow variable context
> sizes.)

In my case, yeah, there are two semi-separate register spaces here:
GPRs: R0..R63
  R0, R1, and R15 are special:
    R0/DLR: hard-coded register for some instructions;
      the assembler may stomp it without warning for pseudo-instructions.
    R1/DHR: was originally intended to be similar to DLR;
      now mostly used as an auxiliary link register.
    R15/SP: Stack Pointer.
CRs: C0..C63
  Various special-purpose registers, most privileged-only;
  LR, GBR, etc., are in CR space.

Though, internally, GPRs and CRs both exist within a combined register
space in the CPU:
  00..3F: Mostly GPR space;
  40..7F: CR and SPR space.

Generally, CRs may only be accessed by certain register ports though.

By default, the only way to save/restore CRs is by shuffling them
through GPRs. There is an optional MOV.C instruction for this, but
generally it is not enabled as it isn't clear that it saves enough to be
worth the added LUT cost.

There is a subset version, where MOV.C exists, but is only really able
to be used with LR and GBR and similar. Generally, this version exists
as RISC-V Mode needs to be able to save/restore these registers (they
exist in the GPR space in RISC-V).

As I can note, if I did a new ISA, most likely the register assignment
scheme would differ, say:
R0: ZR / PC
R1: LR / TP (TBR)
R2: SP
R3: GP (GBR)
Where the interpretation of R0 and R1 would depend on context (ZR and LR
for most instructions, PC and TP when used as a Ld/St base address).

Though, some ideas had involved otherwise keeping a similar register
space layout to my existing ABI, mostly because significant ABI changes
would not be easy for my compiler as-is.

Re: "Mini" tags to reduce the number of op codes

<uv8dlo$1krvp$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=38271&group=comp.arch#38271

From: terje.mathisen@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Date: Thu, 11 Apr 2024 12:22:47 +0200
In-Reply-To: <S%zRN.162255$_a1e.120745@fx16.iad>
 by: Terje Mathisen - Thu, 11 Apr 2024 10:22 UTC

Scott Lurndal wrote:
> mitchalsup@aol.com (MitchAlsup1) writes:
>> BGB wrote:
>>
>>> On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
>>>> BGB-Alt wrote:
>>>>
>>
>>> Also the blob of constants needed to be within 512 bytes of the load
>>> instruction, which was also kind of an evil mess for branch handling
>>> (and extra bad if one needed to spill the constants in the middle of a
>>> basic block and then branch over it).
>>
>> In My 66000 case, the constant is the word following the instruction.
>> Easy to find, easy to access, no register pollution, no DCache pollution.
>
> It does occupy some icache space, however; have you boosted the icache
> size to compensate?

Except that it pretty rarely does so (increases icache pressure):

mov temp_reg, offset const_table
mov reg,qword ptr [temp_reg+const_offset]

looks to me like at least 5 bytes for the first instruction and probably
6 for the second, for a total of 11 (could be as low as 8 for a very
small offset), all on top of the 8 bytes of dcache needed to hold the
64-bit value loaded.

In My 66000 this should be a single 32-bit instruction followed by the
8-byte constant, so 12 bytes total and no lookaside dcache interference.

It is only when you do a lot of 64-bit data loads, all gathered in a
single 256-byte buffer holding up to 32 such values, and you can afford
to allocate a fixed register pointing to the middle of that range, that
you actually gain some total space: Each load can now just do a

mov reg,qword ptr [fixed_base_reg+byte_offset]

which, due to the need for a 64-bit prefix, will probably need 4
instruction bytes on top of the 8 bytes from dcache. At this point we
are touching exactly the same number of bytes (12) as My 66000, but from
two different caches, so much more likely to suffer dcache misses.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: "Mini" tags to reduce the number of op codes

<20240411141324.0000090d@yahoo.com>

https://news.novabbs.org/devel/article-flat.php?id=38272&group=comp.arch#38272

From: already5chosen@yahoo.com (Michael S)
Newsgroups: comp.arch
Date: Thu, 11 Apr 2024 14:13:24 +0300
 by: Michael S - Thu, 11 Apr 2024 11:13 UTC

On Wed, 10 Apr 2024 23:30:02 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:

> Scott Lurndal wrote:
>
> > mitchalsup@aol.com (MitchAlsup1) writes:
> >>BGB wrote:
> >>
> >>
> >>In My 66000 case, the constant is the word following the
> >>instruction. Easy to find, easy to access, no register pollution,
> >>no DCache pollution.
>
> > It does occupy some icache space, however; have you boosted the
> > icache size to compensate?
>
> The space occupied in the ICache is freed up from being in the DCache
> so the overall hit rate goes up !! At typical sizes, ICache miss rate
> is about ¼ the miss rate of DCache.
>
> Besides:: if you had to LD the constant from memory, you use a LD
> instruction and 1 or 2 words in DCache, while consuming a GPR. So,
> overall, it takes fewer cycles, fewer GPRs, and fewer instructions.
>
> Alternatively:: if you paste constants together (LUI, AUIPC) you have
> no direct route to either 64-bit constants or 64-bit address spaces.
>
> It looks to be a win-win !!

Win-win under constraints of Load-Store Arch. Otherwise, it depends.

Re: "Mini" tags to reduce the number of op codes

<7uSRN.161295$m4d.65414@fx43.iad>

https://news.novabbs.org/devel/article-flat.php?id=38273&group=comp.arch#38273

From: scott@slp53.sl.home (Scott Lurndal)
Newsgroups: comp.arch
Date: Thu, 11 Apr 2024 14:30:27 GMT
 by: Scott Lurndal - Thu, 11 Apr 2024 14:30 UTC

"Paul A. Clayton" <paaronclayton@gmail.com> writes:
>On 4/9/24 8:28 PM, MitchAlsup1 wrote:
>> BGB-Alt wrote:
>[snip]
>>> Things like memcpy/memmove/memset/etc, are function calls in
>>> cases when not directly transformed into register load/store
>>> sequences.
>>
>> My 66000 does not convert them into LD-ST sequences, MM is a
>> single instruction.
>
>I wonder if it would be useful to have an immediate count form of
>memory move. Copying fixed-size structures would be able to use an
>immediate. Aside from not having to load an immediate for such
>cases, there might be microarchitectural benefits to using a
>constant. Since fixed-sized copies would likely be limited to
>smaller regions (with the possible exception of 8 MiB page copies)
>and the overhead of loading a constant for large sizes would be
>tiny, only providing a 16-bit immediate form might be reasonable.

It seems to me that an offloaded DMA engine would be a far
better way to do memmove (over some threshold, perhaps a
cache line) without trashing the caches. Likewise memset.

>
>>> Did end up with an intermediate "memcpy slide", which can handle
>>> medium size memcpy and memset style operations by branching into
>>> a slide.
>>
>> MMs and MSs that do not cross page boundaries are ATOMIC. The
>> entire system
>> sees only the before or only the after state and nothing in
>> between.

One might wonder how that atomicity is guaranteed in an
SMP processor...

