Rocksolid Light



devel / comp.arch / Re: Intel goes to 32-bit general purpose registers

Subject -- Author
* Intel goes to 32-bit general purpose registers -- Thomas Koenig
+* Re: Intel goes to 32-bit general purpose registers -- Scott Lurndal
|`* Re: Intel goes to 32-bit general purpose registers -- MitchAlsup
| `* Re: Intel goes to 32-bit general purpose registers -- Quadibloc
|  `* Re: Intel goes to 32-bit general purpose registers -- Anton Ertl
|   `* Re: Intel goes to 32-bit general purpose registers -- Peter Lund
|    `* Re: Intel goes to 32-bit general purpose registers -- Anton Ertl
|     `* Re: Intel goes to 32-bit general purpose registers -- Elijah Stone
|      `* Re: Intel goes to 32-bit general purpose registers -- MitchAlsup
|       `- Re: Intel goes to 32-bit general purpose registers -- Thomas Koenig
+* Re: Intel goes to 32-bit general purpose registers -- Quadibloc
|`* Re: Intel goes to 32-bit general purpose registers -- Quadibloc
| +* Re: Intel goes to 32-bit general purpose registers -- John Dallman
| |+* Re: Intel goes to 32-bit general purpose registers -- Scott Lurndal
| ||`- Re: Intel goes to 32-bit general purpose registers -- John Dallman
| |+* Re: Intel goes to 32-bit general purpose registers -- Anton Ertl
| ||+* Re: Intel goes to 32-bit general purpose registers -- John Dallman
| |||+- Re: Intel goes to 32-bit general purpose registers -- BGB
| |||`- Re: Intel goes to 32-bit general purpose registers -- Anton Ertl
| ||+- Re: Intel goes to 32-bit general purpose registers -- JimBrakefield
| ||`* Re: Intel goes to 32-bit general purpose registers -- Michael S
| || `- Re: Intel goes to 32-bit general purpose registers -- Anton Ertl
| |`* Re: Intel goes to 32-bit general purpose registers -- John Dallman
| | +- Re: Intel goes to 32-bit general purpose registers -- Stephen Fuld
| | `- Re: Intel goes to 32-bit general purpose registers -- MitchAlsup
| `* Re: Intel goes to 32-bit general purpose registers -- Anton Ertl
|  `- Re: Intel goes to 32-bit general purpose registers -- John Dallman
`* Intel goes to 32 GPRs (was: Intel goes to 32-bit ...) -- Anton Ertl
 `* Re: Intel goes to 32 GPRs (was: Intel goes to 32-bit ...) -- Quadibloc
  +- Re: Intel goes to 32 GPRs (was: Intel goes to 32-bit ...) -- Anton Ertl
  `* Re: Intel goes to 32 GPRs -- Terje Mathisen
   `* Re: Intel goes to 32 GPRs -- Thomas Koenig
    `* Re: Intel goes to 32 GPRs -- Terje Mathisen
     +* Re: Intel goes to 32 GPRs -- Thomas Koenig
     |`* Re: Intel goes to 32 GPRs -- Terje Mathisen
     | +- Re: Intel goes to 32 GPRs -- MitchAlsup
     | `* Re: Intel goes to 32 GPRs -- Thomas Koenig
     |  `* Re: Intel goes to 32 GPRs -- Terje Mathisen
     |   +- Re: Intel goes to 32 GPRs -- MitchAlsup
     |   `* Re: Intel goes to 32 GPRs -- Thomas Koenig
     |    `- Re: Intel goes to 32 GPRs -- Anton Ertl
     +- Re: Intel goes to 32 GPRs -- MitchAlsup
     +* Re: Intel goes to 32 GPRs -- Anton Ertl
     |`* Re: Intel goes to 32 GPRs -- Terje Mathisen
     | +* Re: Intel goes to 32 GPRs -- Scott Lurndal
     | |`* Re: Intel goes to 32 GPRs -- MitchAlsup
     | | +- Re: Intel goes to 32 GPRs -- MitchAlsup
     | | +* Re: Intel goes to 32 GPRs -- Scott Lurndal
     | | |`* Re: Intel goes to 32 GPRs -- Terje Mathisen
     | | | +* Re: Intel goes to 32 GPRs -- BGB
     | | | |+* Re: Intel goes to 32 GPRs -- MitchAlsup
     | | | ||`- Re: Intel goes to 32 GPRs -- BGB
     | | | |`* Re: Intel goes to 32 GPRs -- Quadibloc
     | | | | `- Re: Intel goes to 32 GPRs -- BGB
     | | | `* Re: Intel goes to 32 GPRs -- Anton Ertl
     | | |  `- Re: Intel goes to 32 GPRs -- Terje Mathisen
     | | `* Re: Intel goes to 32 GPRs -- BGB
     | |  `* Re: Intel goes to 32 GPRs -- MitchAlsup
     | |   `- Re: Intel goes to 32 GPRs -- BGB
     | +* Re: Intel goes to 32 GPRs -- Anton Ertl
     | |`* Re: Intel goes to 32 GPRs -- Thomas Koenig
     | | +* Re: Intel goes to 32 GPRs -- MitchAlsup
     | | |`- Re: Intel goes to 32 GPRs -- Anton Ertl
     | | +* Re: Intel goes to 32 GPRs -- Terje Mathisen
     | | |+* Re: Intel goes to 32 GPRs -- Anton Ertl
     | | ||+- Re: Intel goes to 32 GPRs -- MitchAlsup
     | | ||`- Re: Intel goes to 32 GPRs -- JimBrakefield
     | | |`* Re: Intel goes to 32 GPRs -- MitchAlsup
     | | | `* Re: Intel goes to 32 GPRs -- BGB
     | | |  `* Re: Intel goes to 32 GPRs -- MitchAlsup
     | | |   +- Re: Intel goes to 32 GPRs -- BGB
     | | |   `* Re: Intel goes to 32 GPRs -- Terje Mathisen
     | | |    `- Re: Intel goes to 32 GPRs -- BGB
     | | `* Re: Intel goes to 32 GPRs -- Stephen Fuld
     | |  `* Re: Intel goes to 32 GPRs -- Anton Ertl
     | |   +- Re: Intel goes to 32 GPRs -- Stephen Fuld
     | |   `- Re: Intel goes to 32 GPRs -- Thomas Koenig
     | `* Re: Intel goes to 32 GPRs -- Thomas Koenig
     |  `* Re: Intel goes to 32 GPRs -- Terje Mathisen
     |   `* Re: Intel goes to 32 GPRs -- Thomas Koenig
     |    `* Re: Intel goes to 32 GPRs -- MitchAlsup
     |     `* Re: Intel goes to 32 GPRs -- Niklas Holsti
     |      `* Re: Intel goes to 32 GPRs -- MitchAlsup
     |       `* Re: Intel goes to 32 GPRs -- Niklas Holsti
     |        `* Re: Intel goes to 32 GPRs -- Stephen Fuld
     |         +- Re: Intel goes to 32 GPRs -- Niklas Holsti
     |         `- Re: Intel goes to 32 GPRs -- Ivan Godard
     `* Re: Intel goes to 32 GPRs -- Kent Dickey
      +* Re: Intel goes to 32 GPRs -- MitchAlsup
      |+* Re: Intel goes to 32 GPRs -- Quadibloc
      ||`- Re: Intel goes to 32 GPRs -- Terje Mathisen
      |`* Re: Intel goes to 32 GPRs -- Kent Dickey
      | `* Re: Intel goes to 32 GPRs -- Thomas Koenig
      |  +* Re: Intel goes to 32 GPRs -- Anton Ertl
      |  |+- Re: Intel goes to 32 GPRs -- Anton Ertl
      |  |`* Re: Intel goes to 32 GPRs -- EricP
      |  | +* Re: Intel goes to 32 GPRs -- MitchAlsup
      |  | |`* Re: Intel goes to 32 GPRs -- Thomas Koenig
      |  | | `* Re: Intel goes to 32 GPRs -- BGB
      |  | |  `* Re: Intel goes to 32 GPRs -- MitchAlsup
      |  | |   +* Re: Intel goes to 32 GPRs -- BGB
      |  | |   `* Re: Intel goes to 32 GPRs -- Terje Mathisen
      |  | `* Re: Intel goes to 32 GPRs -- Stephen Fuld
      |  `* Re: Intel goes to 32 GPRs -- Kent Dickey
      +* Callee-saved registers (was: Intel goes to 32 GPRs) -- Anton Ertl
      `- Re: Intel goes to 32 GPRs -- Mike Stump

Re: Intel goes to 32 GPRs

<86zg2dxgcz.fsf@linuxsc.com>

https://news.novabbs.org/devel/article-flat.php?id=33839&group=comp.arch#33839

 by: Tim Rentsch - Sun, 27 Aug 2023 04:24 UTC

scott@slp53.sl.home (Scott Lurndal) writes:

> EricP <ThatWouldBeTelling@thevillage.com> writes:

[...]

>> On Windows errno is what the C standard says, an integer with file
>> scope, which they moved to the TLS area. Each EXE/DLL that
>> references errno gets an errno, just like any other global scope
>> variable.
>
> POSIX extends the definition a bit.
>
> The symbol errno shall expand to a modifiable lvalue of
> type int. It is unspecified whether errno is a macro or
> an identifier declared with external linkage. If a macro
> definition is suppressed in order to access an actual object,
> or a program defines an identifier with the name errno, the
> behavior is undefined.

The C standard requires <errno.h> to define 'errno' as a macro,
and has since the original C standard in 1989. The text above
is basically taken straight out of the C standard.
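
As one concrete illustration of how that plays out in practice (glibc's
spelling shown here; other C libraries use different internal names), the
macro expands to a per-thread modifiable int lvalue:

  /* One common shape for <errno.h> (glibc shown; the helper's name varies
   * between C libraries): errno is a macro expanding to a per-thread
   * modifiable int lvalue. */
  extern int *__errno_location(void);
  #define errno (*__errno_location())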

Re: Callee-saved registers (was: Intel goes to 32 GPRs)

<ucfa4j$30keu$1@newsreader4.netcologne.de>

https://news.novabbs.org/devel/article-flat.php?id=33841&group=comp.arch#33841

 by: Thomas Koenig - Sun, 27 Aug 2023 10:56 UTC

Kent Dickey <kegs@provalid.com> schrieb:
> In article <uccedu$2urln$1@newsreader4.netcologne.de>,
> Thomas Koenig <tkoenig@netcologne.de> wrote:
>>Kent Dickey <kegs@provalid.com> schrieb:
>>
>>> 2) GCC's register allocator misses some reuse. If it feels it has
>>> plenty of registers, then it ends up doing stuff like:
>>>
>>> LDR x20,[x8,#120]
>>> ADD x20,x20,x9
>>> STR x20,[x8,#120]
>>> BL some_other_func
>>> LDR x21,[x8,#128]
>>> ADD x21,x21,x0
>>> STR x21,[x8,#128]
>>>
>>> Basically, it needs a register to hold some value, and even
>>> though x20 and x21's uses do not overlap, it doesn't just use x20
>>> for both. What's actually going on is more complex (these are
>>> preserved registers, not scratch registers), but effectively GCC
>>> can waste some preserved (callee-save) registers.
>>
>>Register allocation is a hard problem, and known to be far from
>>perfect; a bugzilla search for the keyword ra (register allocation)
>>shows 214 open bugs.
>>
>>What you showed looks like it should be number 215. Can you
>>open a PR on gcc bugzilla, or send me the code (godbolt, mail)?
>>I would then submit it.
>
> You can actually see the function get longer using the example function
> I posted earlier in this thread,

Ah, so there is no test case without that option?

Then it is likely not to be high on the developers' priority list. If you
have a test case that does not need a non-standard ABI, that is likely to
be different.

>>Do not forget that -fcall-saved is an ABI-changing option.
>>
>>Did you compile your whole program, including the functions usually
>>called via shared libraries, with the -fcall-saved-xx option? If not,
>>calling (for example) a system-provided libc function is a likely
>>source of crashes.
>
> Yes, every line of C code is recompiled with that option (except for
> libgcc.a, which based on past experience my code doesn't use on ARM64).

"From past experience" is sometimes a poor guide, checking might
be a better way.

> There are no libraries, not even libc.

Inline assembler in header files, maybe?

>I have a pretty unique use case.
> It runs for a little while, but then jumps to address 0. I'm pretty sure
> it's a GCC bug that I don't have interest in debugging due to the r29
> fiasco.

This may just expose a latent bug in your program. If you have
the possibility of running it under a normal environment, I would
suggest throwing all the debugging options you can (sanitizers,
running under valgrind, ...) at the code. Maybe some bugs will come
crawling out.

Re: Intel goes to 32 GPRs

<YZJGM.490024$U3w1.271899@fx09.iad>

https://news.novabbs.org/devel/article-flat.php?id=33845&group=comp.arch#33845

 by: EricP - Sun, 27 Aug 2023 15:28 UTC

Scott Lurndal wrote:
> EricP <ThatWouldBeTelling@thevillage.com> writes:
>
>> Looking at dlOpen and dlClose documentation I see no mention of TLS at all,
>> let alone a clear statement of how TLS interacts with running threads,
>> and that dlClose cleans up and recovers TLS allocated storage for that
>> shared library *for all the threads of a process*.
>
> I use dlopen extensively in threaded applications. dlopen(3) itself
> doesn't particularly care whether the code is threaded or not threaded;
> it will simply reference the "modifiable lvalue errno", which in a
> threaded application will be the __errno_location() function noted
> above. All other globals in the application are used to resolve
> external references in the shared object either when loaded (RTLD_NOW)
> or as referenced (RTLD_LAZY).
>
>> Because the Windows documentation cleary states that there is no
>> interaction with already running threads, one can make use of that.
>> A program which wants to use plug-ins should use LoadLibrary to map
>> the DLL, then create new threads which use the DLL functions,
>> then allow those threads to exit and terminate, then FreeLibrary to
>> unmap the DLL. Other threads should not interact with the loaded DLL.
>> (But note there are almost always other gotcha's in Windows.)
>>
>> Based on the lack of Linux information, I would be inclined to create
>> a separate new process for the thread that calls dlOpen and then
>> toss the whole process instead of dlClose.
>
> Not necessary on linux.
>
> See 'pthread_key_create' and "pthread_getspecific".

My concern was that any such destructor must be run before the destructor
code is unmapped. pthread_key_create would be subject to the same concern.
But I found the answer (below).

>> That way one can be sure it won't leak TLS memory in your application.
>
> Unix/linux shared objects have mechanisms for load time initialization
> and unload time destruction (consider it the equivalent of C++ constructors
> and destructors at the full library level). Static destructors for C++
> code in the library are called when the library is unloaded, for example.

My concern was if the destructors are in the dynamic shared object (DSO)
being unloaded, then how does dlClose ensure that all *other* threads run
those destructors and free their TLS memory before the DSO containing
that code is unmapped? And how does it do this without causing deadlocks?
Eg using signals to try to force a thread to destruct memory could deadlock.

I found the source for dlClose online, and the answer appears to be that
the thread calling dlClose basically reaches into every thread's TCB
(user mode Thread Control Block, aka user mode thread header)
where the TLS tables are located, grabs away the TLS pointer,
runs the destructors (if any), and frees the TLS memory.

The major complexity comes from trying to avoid having to guard all TLS
accesses with a mutex because the TLS tables can be dynamically resized.
So it uses a mutex to serialize dlOpen and dlClose, and then atomic
operations to coordinate TLS changes outside of those mutexes.
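
For reference, a minimal sketch of the pattern under discussion (the
plug-in function names are invented): a DSO that hands out per-thread
state via pthread_key_create(), where the destructor lives in the DSO's
own text and so must still be mapped whenever it finally runs.

  /* Hypothetical plug-in code, for illustration only. */
  #include <pthread.h>
  #include <stdlib.h>

  static pthread_key_t key;

  static void free_state(void *p)      /* destructor: lives in the DSO's text */
  {
      free(p);
  }

  void plugin_init(void)               /* called once after dlopen() */
  {
      pthread_key_create(&key, free_state);
  }

  void *plugin_state(void)             /* per-thread, lazily allocated state */
  {
      void *p = pthread_getspecific(key);
      if (!p) {
          p = calloc(1, 64);
          pthread_setspecific(key, p);
      }
      return p;
  }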

https://codebrowser.dev/glibc/glibc/elf/dl-close.c.html

This appears to be the source code for TLS.
The routine that looks up a TLS symbol is __tls_get_addr

https://codebrowser.dev/glibc/glibc/elf/dl-tls.c.html

The explanation for how TLS works comes from:

ELF Handling For Thread-Local Storage, Drepper 2013
https://c9x.me/compile/bib/tls.pdf

How To Write Shared Libraries, Drepper 2011
http://www.staroceans.org.s3.amazonaws.com/e-book/dsohowto.pdf

Re: Intel goes to 32 GPRs

<udcvo7$32enh$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33910&group=comp.arch#33910

 by: BGB - Thu, 7 Sep 2023 17:01 UTC

On 8/19/2023 4:30 PM, Thomas Koenig wrote:
> MitchAlsup <MitchAlsup@aol.com> schrieb:
>
>> In My 66000 ISA, the first instruction of a subroutine already carries
>> this information. If the instruction is ENTER (inst<31:26>==101100)
>> Then Rd is the first preserved register and Rs1 is the last register saved
>> on the stack. If the instruction is not ENTER, then the subroutine uses
>> only R0..R15 to perform its duties.
>
> Yes, but...
>
> if the callee does not use some of the registers R0..R15, and this
> is made known to the caller at (for example) link time, then the
> caller need not save them across calls.
>
> It would not work across shared library boundaries, though.

I had used this trick in ASM a few times, but wouldn't have considered
it "sane" for a compiler to do this sort of thing...

Vs, say, strictly enforcing the register use of the C ABI at all points.

Do remember at one point considering the possibility that the balance
between caller and callee save registers could be "adjustable" in the
compiler (say, to allow leaf functions to have more scratch registers,
and deeply nested functions to have more preserved registers).

Didn't end up going this direction though, instead sticking with a
roughly 50/50 split and statically-defined rules for register use.

There is already enough chaos from what things are still variable:
  32 or 64 GPR configurations;
  8 or 16 register arguments for functions (ABI variant);
    16 means that argument lists nearly always fit in registers.
    But, 16 also depends on 64 GPRs, and is optional even then.
  Whether or not there is a "spill space" on the stack.
    There are pros/cons either way with this one.
    The "pain" of spill space being more obvious with 16 arguments.

If register use were made more flexible, it is possible, say:
  R56..R63: May move over to scratch for leaf functions;
    Maybe also: R40..R47 ?...
  R32..R35, R48..R51: May be reclaimed for callee preserve.
  R36..R39, R52..R55: Remain always scratch (argument registers).

If this were per-function (rather than global), there would need to be
some way for the caller and callee to signal or agree on which ABI
variant to use (and preferably not via "__declspec" or similar, as this
would suck; but if used, would imply some similar semantics to
"__forceinline" or similar, such as the inability to take a function
pointer to such a function).

Though, one other option would be to leave it as implicit (compiler
inferred), but explicitly excluded for any functions which either have
their address taken (function pointers), or which are marked as exports
("__declspec(dllexport)" or similar). In these cases, only the baseline
ABI rules would be used.
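
A purely illustrative C fragment of that rule (nothing here is from an
actual compiler): a compiler could give 'helper' a non-baseline register
split because it is internal and only called directly, while 'cb' and
'api_entry' would have to keep the baseline ABI.

  static int helper(int x)        /* eligible: internal, only called directly */
  {
      return 3 * x + 1;
  }

  static int cb(int x)            /* not eligible: its address escapes below */
  {
      return x - 1;
  }

  __declspec(dllexport) int api_entry(int x)   /* not eligible: exported */
  {
      int (*fp)(int) = cb;        /* taking the address pins cb to the baseline ABI */
      return helper(x) + fp(x);
  }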

It would likely make more sense for the rules for R0..R31 to be left
unaffected, in any case.

One drawback would be the compiler would need to account for non-local
control flow within the call-graph for its register allocation and
register preservation decisions (likely to be the major issue with
implementing such a thing).

....

Re: Intel goes to 32 GPRs

<59a98eb4-2bdb-4dbb-86bf-09de232ba0d3n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33911&group=comp.arch#33911

 by: MitchAlsup - Thu, 7 Sep 2023 17:53 UTC

On Thursday, September 7, 2023 at 12:03:08 PM UTC-5, BGB wrote:
> On 8/19/2023 4:30 PM, Thomas Koenig wrote:
> > MitchAlsup <Mitch...@aol.com> schrieb:
> >
> >> In My 66000 ISA, the first instruction of a subroutine already carries
> >> this information. If the instruction is ENTER (inst<31:26>==101100)
> >> Then Rd is the first preserved register and Rs1 is the last register saved
> >> on the stack. If the instruction is not ENTER, then the subroutine uses
> >> only R0..R15 to perform its duties.
> >
> > Yes, but...
> >
> > if the callee does not use some of the registers R0..R15, and this
> > is made known to the caller at (for example) link time, then the
> > caller need not save them across calls.
> >
> > It would not work across shared library boundaries, though.
>
> I had used this trick in ASM a few times, but wouldn't have considered
> it "sane" for a compiler to do this sort of thing...
>
> Vs, say, strictly enforcing the register use of the C ABI at all points.
>
> Do remember at one point considering the possibility that the balance
> between caller and callee save registers could be "adjustable" in the
> compiler (say, to allow leaf functions to have more scratch registers,
> and deeply nested functions to have more preserved registers).
<
There is still this disclaimer in the software document::
<
4.22 Register Sets
This author chose 8-registers for Argument passing and Result returning.
Likewise, this author chose 16-registers as the preserved set. The ENTER
and EXIT instructions are capable of supporting smaller or larger numbers
of registers in either set. This author simply wants data on why something
different is more optimal.
>
>
> Didn't end up going this direction though, instead sticking with a
> roughly 50/50 split and statically-defined rules for register use.
>
> There is already enough chaos from what things are still variable:
> 32 or 64 GPR configurations;
<
I only have 32 registers.
<
> 8 or 16 register arguments for functions (ABI variant);
<
8 arguments and results with 8 temporaries seems close to optimal
for leaf routines.
<
> 16 means that argument lists nearly always fit in registers.
<
So does 8.....but this is the easiest thing to expand.
<
> But, 16 also depends on 64 GPRs, and is optional even then.
<
Not having 64 gets rid of this problem.
<
> Whether or not there is a "spill space" on the stack.
<
There is always spill space on the stack (allocate page on write).
<
> There are pros/cons either way with this one.
<
Granted; but where are the studies that investigate optimality
over {arguments, results, temporaries, and preserved}?
<
> The "pain" of spill space being more obvious with 16 arguments.
>
Spill space on My 66000 stacks is allocated by the spiller not
the caller. At entry there is no excess space on the stack, however
should callee be a varargs, register arguments are pushed onto
the standard stack using the same ENTER as anyone else, just
wrapping around R0 and into the argument registers. This creates
a dense vector from which vararg arguments can be easily extracted.
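
(A rough illustration of why that dense layout is convenient; this is not
My 66000's actual va_list, and all names below are invented: once the
register arguments sit contiguously next to the stack arguments,
extraction is just a pointer walk over one vector.)

  /* Illustrative only: walking one dense argument vector, 8-byte slots assumed. */
  typedef struct { unsigned long long *next; } fake_va_list;

  #define FAKE_VA_ARG(ap, T)  (*(T *)((ap).next++))

  static long long sum_n(int n, fake_va_list ap)
  {
      long long s = 0;
      for (int i = 0; i < n; i++)
          s += FAKE_VA_ARG(ap, long long);   /* register args and stack args alike */
      return s;
  }
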
>
> If register use were made more flexible, it is possible, say:
> R56..R63: May move over to scratch for leaf functions;
> Maybe also: R40..R47 ?...
> R32..R35, R48..R51: May be reclaimed for callee preserve.
> R36..R39, R52..R55: Remain always scratch (argument registers).
<
I eliminated this by only having 5-bits to name architectural registers.
{Which is another reason to avoid different addressabilities of registers
depending on instruction formats......}
>
> If this were per-function (rather than global), there would need to be
> some way for the caller and callee to signal or agree on which ABI
> variant to use (and preferably not via "__declspec" or similar, as this
> would suck; but if used, would imply some similar semantics to
> "__forceinline" or similar, such as the inability to take a function
> pointer to such a function).
<
What you want is for the compiler to effectively "inline" the function
everywhere it is called, but then "outline" the code of the function itself.
>
> Though, one other option would be to leave it as implicit (compiler
> inferred), but explicitly excluded for any functions which either have
> their address taken (function pointers), or which are marked as exports
> ("__declspec(dllexport)" or similar). In these cases, only the baseline
> ABI rules would be used.
>
>
> It would likely make more sense for the rules for R0..R31 to be left
> unaffected, in any case.
>
> One drawback would be the compiler would need to account for non-local
> control flow within the call-graph for its register allocation and
> register preservation decisions (likely to be the major issue with
> implementing such a thing).
<
Generate code after linking..........
>
> ...

Re: Callee-saved registers (was: Intel goes to 32 GPRs)

<udd3l8$331a8$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33912&group=comp.arch#33912

 by: BGB - Thu, 7 Sep 2023 18:08 UTC

On 8/25/2023 5:10 PM, MitchAlsup wrote:
> On Friday, August 25, 2023 at 4:48:09 PM UTC-5, Scott Lurndal wrote:
>> ke...@provalid.com (Kent Dickey) writes:
>>> In article <uc1hsv$2ntj0$1...@newsreader4.netcologne.de>,
>>> Thomas Koenig <tko...@netcologne.de> wrote:
>>>> Kent Dickey <ke...@provalid.com> schrieb:
>>>>> Here is the code again. I'm trying to keep it as short as possible to show
>>>>> the problem, so I don't plan to respond to nitpicks about the particulars
>>>>> of this code.
>>>>
>>>> [...]
>>>>
>>>> The issue of the frame pointer register not being used shows up in the
>>>> code on godbolt, even with recent gcc trunk.
>>>>
>>>> I have submitted https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111096 for
>>>> this.
>>>
>>> The GCC bug is about to be closed since GCC wants to reserve r29 on ARM64
>>> to always be a valid stack frame pointer, just "out of date" if a given
>>> function doesn't use it. So r29 is simply wasted on ARM64 when using GCC
>>> I have no idea what use a frame pointer is, I have no use for it.
> <
>> A frame pointer is useful for debugging, and in some cases for
>> stack unwinding (e.g. C++ exceptions).
> <
> The frame pointer can be used as a means to access locations on the stack
> which are static when the TOS contains dynamically sized data. Local data,
> dynamic descriptors, and destructor lists. It is hard to imagine doing general
> dynamic stack allocations without one. FPs are also used in block structured
> languages. FPs generally require 4-6 more instructions to use per subroutine
> than when one does not need an FP.
> <
> Never found myself in a debugging situation where having a FP (when otherwise
> unneeded) would have been helpful in debugging.

Yeah.

I can note that I don't have a frame pointer in my case as:
  Stack frames are fixed size at runtime;
  "alloca()" uses heap-backed memory;
  I use a "predefined epilog sequence/structure" mechanism:
    This is similar to the Win64 X64 ABI (and a few others).

In this case, one can unwind the stack without either a frame pointer or
debug metadata (such as DWARF), provided one can find the epilog for the
current function. This is accomplished via a lookup table which maps PC
ranges for functions to their corresponding epilogs (and also can define
entry points for exception-handling if the function contains any "catch"
blocks).

General design for this mechanism was "inherited" from the PE/COFF format.
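
A hedged sketch of what such a lookup could look like in C (the field
names are invented; the real tables follow the PE/COFF-style unwind data
mentioned above):

  #include <stdint.h>
  #include <stddef.h>

  typedef struct {
      uint32_t start;     /* function start PC (image-relative) */
      uint32_t end;       /* one past the last instruction */
      uint32_t epilog;    /* PC of the canned epilog sequence */
  } unwind_entry;

  static const unwind_entry *find_unwind(const unwind_entry *tab, size_t n, uint32_t pc)
  {
      size_t lo = 0, hi = n;
      while (lo < hi) {                   /* binary search on sorted PC ranges */
          size_t mid = (lo + hi) / 2;
          if (pc < tab[mid].start)        hi = mid;
          else if (pc >= tab[mid].end)    lo = mid + 1;
          else                            return &tab[mid];
      }
      return NULL;                        /* PC not covered: cannot unwind */
  }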

A frame pointer could be more useful if I wanted VLAs or similar to be
on the stack, but... Then I would need bigger stacks (which don't play
well if one wants to support NOMMU operation...).

Could maybe be useful to come up with a mechanism to detect stack
overflows though, as there have been a "non-zero" number of bugs
resulting from programs quietly overflowing the stack and then going on
to corrupt other memory.

I guess one option could be to reserve the last 4K or 8K or so as a "Red
Line" and then have the compiler insert checks to detect if it has
crossed it.

Say, in the prolog:
  ADD    -xxxx, SP
  MOV.Q  (SP, 0), R16
  MOV    0xXXXXXXXXXXXXXXXX, R17
  CMPQEQ R16, R17
  BREAK?T
Or, alternately:
  ADD    -xxxx, SP
  BSR    __check_stackfault

Say, to trigger a breakpoint or similar if a special magic value is seen
on the stack (which is assumed to designate that the "Red Line" has been
reached).

Runtime call could potentially do a more detailed check or print a debug
message or similar.
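
A hedged C-level sketch of what such a check might do (the magic value and
details are invented; only the __check_stackfault name comes from the
example above):

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define REDLINE_MAGIC 0x5AFE57ACDEAD7A11ull   /* value seeded into the red-line pages */

  void __check_stackfault(const uint64_t *new_sp)
  {
      if (*new_sp == REDLINE_MAGIC) {            /* frame bottom landed in the red line */
          fprintf(stderr, "stack overflow imminent, sp=%p\n", (const void *)new_sp);
          abort();                               /* or trigger a breakpoint/trap */
      }
  }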

For now, it will just quietly overflow, and then other wonky stuff may
result...

More traditional option is a page-fault, but this only works if one is
using an MMU (and has put the stack in its own part of virtual memory).

>>
>> On intel, gcc supports -fomit-frame-pointer which releases RBP
>> to be a general register. I'd be surprised if the ARM64 compiler
>> doesn't always support omit-frame-pointer.
> <
> I should note that My 66000 supports the use of a FP (when needed) without
> any cost in the instruction stream (no more instructions to use than to avoid);
> your only cost is that FP is not a free register.

I could consider a frame-pointer if needed...
But, not needed for now.

Re: Intel goes to 32 GPRs

<ude6ci$3b25m$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33913&group=comp.arch#33913

 by: BGB - Fri, 8 Sep 2023 04:01 UTC

On 9/7/2023 12:53 PM, MitchAlsup wrote:
> On Thursday, September 7, 2023 at 12:03:08 PM UTC-5, BGB wrote:
>> On 8/19/2023 4:30 PM, Thomas Koenig wrote:
>>> MitchAlsup <Mitch...@aol.com> schrieb:
>>>
>>>> In My 66000 ISA, the first instruction of a subroutine already carries
>>>> this information. If the instruction is ENTER (inst<31:26>==101100)
>>>> Then Rd is the first preserved register and Rs1 is the last register saved
>>>> on the stack. If the instruction is not ENTER, then the subroutine uses
>>>> only R0..R15 to perform its duties.
>>>
>>> Yes, but...
>>>
>>> if the callee does not use some of the registers R0..R15, and this
>>> is made known to the caller at (for example) link time, then the
>>> caller need not save them across calls.
>>>
>>> It would not work across shared library boundaries, though.
>>
>> I had used this trick in ASM a few times, but wouldn't have considered
>> it "sane" for a compiler to do this sort of thing...
>>
>> Vs, say, strictly enforcing the register use of the C ABI at all points.
>>
>> Do remember at one point considering the possibility that the balance
>> between caller and callee save registers could be "adjustable" in the
>> compiler (say, to allow leaf functions to have more scratch registers,
>> and deeply nested functions to have more preserved registers).
> <
> There is still this disclaimer in the software document::
> <
> 4.22 Register Sets
> This author chose 8-registers for Argument passing and Result returning.
> Likewise, this author chose 16-registers as the preserved set. The ENTER
> and EXIT instructions are capable of supporting smaller or larger numbers
> of registers in either set. This author simply wants data on why something
> different is more optimal.

Fair enough.

>>
>>
>> Didn't end up going this direction though, instead sticking with a
>> roughly 50/50 split and statically-defined rules for register use.
>>
>> There is already enough chaos from what things are still variable:
>> 32 or 64 GPR configurations;
> <
> I only have 32 registers.
> <
>> 8 or 16 register arguments for functions (ABI variant);
> <
> 8 arguments and results with 8 temporaries seems close to optimal
> for leaf routines.
> <
>> 16 means that argument lists nearly always fit in registers.
> <
> So does 8.....but this is the easiest thing to expand.

Having 8 is "a majority of the time", having 16 is "almost always"
(apart from some random function taking 20 or 22 arguments or
similar off in some obscure corner of the codebase...).

> <
>> But, 16 also depends on 64 GPRs, and is optional even then.
> <
> Not having 64 gets rid of this problem.
> <

64 GPRs is pros/cons.

Most "normal" code doesn't need 64 GPRs, but for code that does, it can
help...

Ironically, one ends up needing to have compiler heuristics for whether
it makes sense to try using all 64 GPRs, or to pretend they don't exist,
in the context of a given function.

For a majority of functions, it ends up more efficient to ignore the
high 32 registers (as register spills end up costing less than
saving/restoring more registers).

A very crude heuristic would be to base it on the number of local
variables and similar (including args and temporaries):
  <  1.5x * preserved-low-GPR-count: Use low GPRs only
  >= 1.5x * preserved-low-GPR-count: Use all the GPRs

Which, in this case, means the "magic cutoff" would be somewhere around
24 local variables or so.
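
Restating that cutoff as a toy function (assuming the 16 preserved low
GPRs implied by the numbers above; this is not BGBCC code):

  /* Illustrative only: decide whether a function should bother with the
   * high 32 registers, using the ~1.5x rule of thumb described above. */
  static int use_high_gprs(int num_vars)
  {
      int preserved_low = 16;                     /* assumed preserved-low-GPR count */
      return 2 * num_vars >= 3 * preserved_low;   /* num_vars >= 1.5 * preserved */
  }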

Though, in a similar way, it is possible to calculate an estimate for the
"optimal" number of registers to reserve for static and dynamic register
assignment.

Where, say (absent any paired-register types):
Optimal register footprint: ~ 1.5x variable-count;
Need to keep at least 3 (minimum) or 4/5 (preferable) free for dynamic
assignment (needs to be doubled if paired-register types are present).

My heuristic for static-vs-dynamic assignment was based on the usage count
of each variable:
  Sort by descending usage;
  If current weight > total of following weights:
    Add as a static assignment;
  Else:
    Leave remaining registers for dynamic assignment.

Compiler can fudge things to make sure the constraints hold.
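
A compact sketch of that split rule (names invented; a real allocator
tracks much more state than this):

  /* Illustrative only: weights[] is sorted in descending order of usage.
   * Returns how many of the leading variables get a static register,
   * stopping early so at least min_dynamic registers stay free. */
  static int split_static(const int *weights, int n, int regs_avail, int min_dynamic)
  {
      long long tail = 0;
      for (int i = 0; i < n; i++)
          tail += weights[i];

      int k = 0;
      while (k < n && k + min_dynamic < regs_avail) {
          tail -= weights[k];           /* tail = sum of the following weights */
          if (weights[k] <= tail)
              break;                    /* the rest outweighs this one: go dynamic */
          k++;                          /* static-assign this variable */
      }
      return k;
  }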

One needs to keep a certain minimum of dynamic registers:
Not enough, and spill/refill spikes;
If one goes below 3 (no pairs) or 6 (pairs), this is bad...
Situations may arise where reg-alloc can't allocate a register...
Then the compiler has no real option other than to explode...

This is ignoring scratch registers, which may or may not be available
for general register allocation in a given basic block (depending on
whether or not the basic block is calling a function or similar).

But, say, for a function with 16 local variables:
We optimally save ~ 11 registers;
Roughly up to 6 can be used for static assignment.

There is also a merit to track loop-nesting level and apply a roughly
1+8*x^2 scale factor based on the nesting level. Say:
  int a, b, c, d;
  int i, j, k;
  ...
  a++;                    //weight=1
  for(i=0; i<n; i++)
  {
    b++;                  //weight=9
    for(j=0; j<n; j++)
    {
      c++;                //weight=17
      for(k=0; k<n; k++)
      {
        d++;              //weight=73
      }
    }
  }

So, variables used inside of nested loops are far more likely to be
static-assigned than ones not part of a deeply nested loop.

So, in this example, one might end up with weights:
k: 219
d: 73
j: 51
i: 27
c: 17
b: 9
a: 1
With k, d, and j being static-assigned.

Or, say, a function with 40 variables:
Optimal is ~ 28 preserved;
Roughly 17 can be static-assigned, 11 left for dynamic.
Or, 16 / 12, if any 128-bit types are used...

So, a function with 40 variables may be in need of 64 GPRs to perform
well; but one with 16 or 20 will be better off reserving fewer registers
for its uses. This factor doesn't seem particularly dependent on dynamic
factors within the function; unlike register rankings which are
primarily dominated by dynamic factors, like loop nesting level.

Most functions don't have 40+ variables, but for those that do, in
contexts where performance matters (say, if this function is also an
inner loop), then 64 GPRs is a "nice to have".

Similarly, say, if one wants to perform a 4x4 matrix multiply entirely
in registers, then 64 is also nice to have, ...

Though, a lot of these heuristics were "tuned" mostly via trial-and-error...

>> Whether or not there is a "spill space" on the stack.
> <
> There is always spill space on the stack (allocate page on write).
> <

I meant in the sense of the Win64 X64 ABI and some other similar ABIs.

Say, on function entry:
  SP+0, SP+8, ...: Left unused (callee may spill function arguments here);
  SP+64:  First stack argument (with 8 register arguments).
  SP+128: First stack argument (with 16 register arguments).

Since only a minority of functions use more than 8 arguments, always
burning 128 bytes here can result in a visible increase in the average
size of a stack frame.

Where, say, the layout is:
  Spill space for arguments
  -- SP (function entry)
  Saved LR
  Saved GPRs
  (Stack Canary)
  Space for structs and arrays
  Local variables and temporaries
  Spill-space for callee arguments (non-leaf)
  -- SP (frame bottom)

Unlike some other ABIs where SP+0 (on function entry) would hold the
first non-register argument.

>> There are pros/cons either way with this one.
> <
> Granted; but where are the studies that investigates optimality
> over {arguments, results, temporaries, and preserved}
> <

My case, mostly fiddling with stuff and observing results.

Seems mostly like a roughly 50% split for register types is a "sane
balance" in the average case.

For leaf functions, it is preferable to have more scratch registers
(which don't need to be saved or restored).

For nested call stacks, it is preferable to have more callee preserved
registers, since they can hold values without needing to be spilled
whenever calling a function or similar (since using one of them has a
constant cost, whereas using a scratch register may result in it needing
to be spilled to the stack repeatedly).

Well, except for temporaries where the compiler can determine that its
value "falls off into the void" and does not need to be saved (1).

*1: One trick here being to assign a sequence number to each variable
any time it is modified, and if a situation arises where the variable
with this sequence number may flow into another basic block, it needs to
be spilled to the stack (otherwise, it does not matter, and its current
value can be discarded).


Re: Intel goes to 32 GPRs

<udfe8b$3gkfb$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33918&group=comp.arch#33918

 by: BGB - Fri, 8 Sep 2023 15:21 UTC

On 8/26/2023 11:24 PM, Tim Rentsch wrote:
> scott@slp53.sl.home (Scott Lurndal) writes:
>
>> EricP <ThatWouldBeTelling@thevillage.com> writes:
>
> [...]
>
>>> On Windows errno is what the C standard says, an integer with file
>>> scope, which they moved to the TLS area. Each EXE/DLL that
>>> references errno gets an errno, just like any other global scope
>>> variable.
>>
>> POSIX extends the definition a bit.
>>
>> The symbol errno shall expand to a modifiable lvalue of
>> type int. It is unspecified whether errno is a macro or
>> an identifier declared with external linkage. If a macro
>> definition is suppressed in order to access an actual object,
>> or a program defines an identifier with the name errno, the
>> behavior is undefined.
>
> The C standard requires <errno.h> to define 'errno' as a macro,
> and has since the original C standard in 1989. The text above
> is basically taken straight out of the C standard.

FWIW:
I had defined errno as a macro, roughly:
  int *__get_errno(void);
  #define errno (*__get_errno())

The function is in turn responsible for fetching and returning a pointer
to the corresponding TLS variable.

The idea for DLLs is that these will effectively have a C library
wrapper that uses a vtable to access the main program's C library.

So, in this case, something like:
  int *__get_errno(void)
  {
      static int *(*get_errno)(void);   /* cached pointer into the main C library */
      _STDIO_VTABLE **vt;
      if(get_errno)
          return(get_errno());
      vt=__get_stdio_vtable();
      get_errno=(*vt)->GetProcAddress(vt, "__get_errno");
      return(get_errno());
  }

Which would in turn defer to the main C library (to get a pointer to the
TLS variable).

In this case, most calls with non-local effect are directed through the
vtable, but various local-only functions (memcpy, strcmp, sqrt, etc...)
are still present within the C library.

This differs some from Windows which (by default) typically static-links
a copy of the full C library to each DLL (though, a DLL option also
exists, with some tradeoffs).

Note that in my case, TLS is accessed through a special 'TBR' control
register (via the (R1,R1) addressing special case), which is in turn
read-only from usermode.

Typically, it in turn contains several sub-structures:
TBR+0: Read-Only to usermode, Read/Write supervisor
TBR+N: Supervisor only (stuff that should be kept hidden)
TBR+N+M: Read/Write in usermode.

Typically (for probably obvious reasons), these structures need to be
padded up to the page size. Each thread in-turn needing its own copy.
The root structure also contains pointers to the other structures (with
the usermode accessible sub-structure in turn containing the TLS variables).
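
A rough struct-level sketch of that layout (purely illustrative; the field
names and contents are invented):

  /* Illustrative only: what TBR points at, per thread. */
  struct tcb_user {                 /* TBR+N+M: read/write from usermode */
      int  errno_slot;              /* e.g. what __get_errno() returns a pointer to */
      char tls_vars[];              /* remaining TLS variables */
  };

  struct tcb_super;                 /* TBR+N: supervisor-only, opaque to usermode */

  struct tcb_root {                 /* TBR+0: read-only to usermode */
      struct tcb_super *super;      /* pointer to the hidden supervisor block */
      struct tcb_user  *user;       /* pointer to the user-writable block */
      /* ... each structure padded out to the page size ... */
  };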

....

Re: Intel goes to 32 GPRs

<492dc45e-62b1-4444-b4fc-7cddc93744ben@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33921&group=comp.arch#33921

 by: MitchAlsup - Fri, 8 Sep 2023 16:36 UTC

On Thursday, September 7, 2023 at 11:02:31 PM UTC-5, BGB wrote:
> On 9/7/2023 12:53 PM, MitchAlsup wrote:
>
> >> The "pain" of spill space being more obvious with 16 arguments.
> >>
> > Spill space on My 66000 stacks is allocated by the spiller not
> > the caller. At entry there is no excess space on the stack, however
> > should callee be a varargs, register arguments are pushed onto
> > the standard stack using the same ENTER as anyone else, just
> > wrapping around R0 and into the argument registers. This creates
> > a dense vector from which vararg arguments can be easily extracted.
<
> Called function makes space for saving preserved registers.
> Caller leaves space for spilling function arguments.
>
This wastes space when not needed, and is so easy for callee to provide
when needed.
>
> Basically, the same basic scheme used by the Windows X64 ABI, and a few
> other "MS style" ABIs...
<
Right--because windows does it that makes it right...............not
>
> Say:
> Argument spill space on stack, provided by caller;
> Structures are passed/returned via pointers;
> Return by copying into a pointer provided by the caller;
> Except when structure is smaller than 16 bytes, which use registers;
<
Argument spill space is provided by callee
Structures up to 8 registers are passed in registers both ways
> ...

> > I eliminated this by only having 5-bits to name architectural registers..
> > {Which is another reason to avoid different addressabilities of registers
> > depending on instruction formats......}
<
> This is an ABI design issue, not an ISA encoding issue...
> The ISA doesn't actually care that much about how the ABI uses its
> registers (well, apart from R0/R1 being SPRs and R15 being the
> stack-pointer and similar).
>
I disagree--if registers are not being used, they take up entropy without
providing value. Therefore it is an ISA issue that should be properly used
in the ABI.
>

Re: Intel goes to 32 GPRs

<udgv5t$3uiis$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33928&group=comp.arch#33928

 by: BGB - Sat, 9 Sep 2023 05:16 UTC

On 9/8/2023 11:36 AM, MitchAlsup wrote:
> On Thursday, September 7, 2023 at 11:02:31 PM UTC-5, BGB wrote:
>> On 9/7/2023 12:53 PM, MitchAlsup wrote:
>>
>>>> The "pain" of spill space being more obvious with 16 arguments.
>>>>
>>> Spill space on My 66000 stacks is allocated by the spiller not
>>> the caller. At entry there is no excess space on the stack, however
>>> should callee be a varargs, register arguments are pushed onto
>>> the standard stack using the same ENTER as anyone else, just
>>> wrapping around R0 and into the argument registers. This creates
>>> a dense vector from which vararg arguments can be easily extracted.
> <
>> Called function makes space for saving preserved registers.
>> Caller leaves space for spilling function arguments.
>>
> This wastes space when not needed, and is so easy for callee to provide
> when needed.

The difference in stack-usage is "mostly negligible" in the 8 argument
case, since the function would otherwise still need to provide backing
memory for the function arguments.

But, yeah, 16 argument does end up burning more stack space.

Say, if we assume a function with, say, 5 arguments and 20 locals
(primitive types only):
Saves 16 registers (14 GPRs, LR and GBR);
Space for around 20 locals;
Space for 8 or 16 arguments.

So, 44 or 52 spots, 352 or 416 bytes.
Relative size delta: 18%

If no spill space:
Saves 16 registers (14 GPRs, LR and GBR);
Space for the 5 arguments (as locals);
Space for around 20 locals.

Needs 42 spots (pad to even), 336 bytes, only ~ 4.5% less than the 8
argument spill space.

Throw in some local arrays or similar, and the relative difference
becomes smaller.

Neither case is going to make that huge of a difference as to whether
the program can get along OK with a 128K stack or similar.

Looks like the most common cases for saved-register counts in BGBCC are:
  7, 15, 23, 31

Granted, this makes sense given how BGBCC's register allocator works:
   7: R8..R14                                (small functions)
  15: R8..R14, R24..R31                      (most functions)
  23: R8..R14, R24..R31, R40..R47            (bigger functions)
  31: R8..R14, R24..R31, R40..R47, R56..R63  (large functions, 34..50+ local variables)

Namely, enabling registers in groups; then allocating these registers
within the group (often, all registers within the enabled set end up
saved/restored).

It was universally force-enabling the last 2 cases that was a problem:
the extra save/restore ate any potential gains from fewer register
spills (it remained more efficient to leave it up to heuristics, namely
variable count, to decide whether to enable them).

It was considered that for TKuCC, I would likely use a slightly
different strategy here (possibly "more granular", calculating roughly
how many registers are likely to be needed, and then saving this many).

Well, and for 16-register machines, it is effective enough to "just save
all the preserved registers" (since spill/fill is likely to be a bigger
factor than the overhead of save/reload).

>>
>> Basically, the same basic scheme used by the Windows X64 ABI, and a few
>> other "MS style" ABIs...
> <
> Right--because windows does it that makes it right...............not

For the most part, MS has had a pretty solid track record as far as the
engineering aspects goes. Granted, sometimes they throw "design
elegance" out the window. But, for the most part, it tends to be fairly
effective.

>>
>> Say:
>> Argument spill space on stack, provided by caller;
>> Structures are passed/returned via pointers;
>> Return by copying into a pointer provided by the caller;
>> Except when structure is smaller than 16 bytes, which use registers;
> <
> Argument spill space is provided by callee
> Structures up to 8 registers are passed in registers both ways

My case, it is 2 registers for passing/returning in registers.

My thinking was originally that falling back to "pass by reference" here
was both simpler and likely to be more efficient than the "copy inline
via the stack" strategy.

Along with being simpler than "Copy in 1..N registers, else pass on
stack" strategy.

>> ...
>
>>> I eliminated this by only having 5-bits to name architectural registers.
>>> {Which is another reason to avoid different addressabilities of registers
>>> depending on instruction formats......}
> <
>> This is an ABI design issue, not an ISA encoding issue...
>> The ISA doesn't actually care that much about how the ABI uses its
>> registers (well, apart from R0/R1 being SPRs and R15 being the
>> stack-pointer and similar).
>>
> I disagree--if registers are not being used, they take up entropy without
> providing value. Therefor it is an ISA issue that should be properly used
> in the ABI.

Granted.

This is also part of why XG2 wasn't as big of a win as I had initially
expected.

Universally enabling use of all the extended registers was not a win;
The gains from 6-bit register fields are reduced if only a subset of the
program actually makes all that much use of them.

Granted, "It makes TKRA-GL and similar slightly faster" is a little
hit/miss...

But, this is in turn, because TKRA-GL tends to involve a lot of loops
that update a fair amount of state (and needing to spill and fill a
bunch of variables without the loop body is not particularly efficient).

Ironically, in parts of TKRA-GL, I am still seeing some code that is
having a bunch of spill/fill despite using 64 GPRs (mostly the code for
trying to pump data into the ~ 18 MMIO registers used by the rasterizer
module).

Though, not clear why exactly, this function only using ~ 35 local
variables, which should be able to fit in registers (it does use a
volatile pointer, but the spills seem unrelated to this pointer).

I guess maybe it could make sense to gather and dump some stats for
average-case variable counts and usage pressure and similar.

Adding something, looks like (my GLQuake port):
Under 16 variables: 66%
Under 32 variables: 89%
Under 48 variables: 96%
Under 64 variables: 98%

Looks like one of my bigger functions here has ~ 95 local variables
(expands to 145 variables when including temporaries). This function
also tending to be one of the "heavyweights" in terms of CPU time
(basically, one of the functions that transforms, classifies, and
dynamically tessellates geometric primitives; effectively as a loop
operating on a stack of primitives).

It looks like roughly 20% of the functions would fall above the point
where R32..R63 will be enabled in the register allocator.

Looks like the "biggest" function, ironically, ended up with ~ 3 local
variables and roughly 700 temporaries (in a random chunk of code I had
back-ported to my GLQuake port from Quake3).

This looks like an adverse edge case...

>>
>

Re: Intel goes to 32-bit general purpose registers

<77d4f9e8-f474-48b9-b62b-6dd210856ac1n@googlegroups.com>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=33930&group=comp.arch#33930

  copy link   Newsgroups: comp.arch
X-Received: by 2002:a05:6214:8e8:b0:655:afc1:e94b with SMTP id dr8-20020a05621408e800b00655afc1e94bmr111557qvb.3.1694274027567;
Sat, 09 Sep 2023 08:40:27 -0700 (PDT)
X-Received: by 2002:a65:670a:0:b0:577:4619:a0a0 with SMTP id
u10-20020a65670a000000b005774619a0a0mr400259pgf.6.1694274027256; Sat, 09 Sep
2023 08:40:27 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!1.us.feeder.erje.net!feeder.erje.net!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 9 Sep 2023 08:40:26 -0700 (PDT)
In-Reply-To: <0cf10981-cb89-4d23-b715-d09dcf84fc34n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:fa34:c000:7107:457c:617f:c887;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:fa34:c000:7107:457c:617f:c887
References: <u9o14h$183or$1@newsreader4.netcologne.de> <0cf10981-cb89-4d23-b715-d09dcf84fc34n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <77d4f9e8-f474-48b9-b62b-6dd210856ac1n@googlegroups.com>
Subject: Re: Intel goes to 32-bit general purpose registers
From: jsavard@ecn.ab.ca (Quadibloc)
Injection-Date: Sat, 09 Sep 2023 15:40:27 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 2577
 by: Quadibloc - Sat, 9 Sep 2023 15:40 UTC

On Tuesday, July 25, 2023 at 8:40:01 AM UTC-6, Quadibloc wrote:

> Oh, wow, Intel will be just as good as RISC!
>
> Of course, there is room for further improvement. They already have, on the shelf, technology to
> increase to 128 registers. If they find a way to improve performance still further when switching
> to Itanium mode, then they will really dominate the field!

I have finally tried to make some sense out of this new idea from Intel.

Is it there for any reason besides increasing vendor lock-in - so that, while AMD's
agreements with Intel will let it include this instruction set, nobody else will be able
to do so, because it is protected by new patents - while the patents on x86 and even
x86_64 are running out?

One other possibility is that due to loops being common in code, caching being
universal in large modern processors (thus making the low code density this kind
of thing implies irrelevant), and Intel having the ability to add transistors
to its chips in such a way that their more complicated instruction set than that of a
RISC chip does not have a big performance hit... this new instruction set really will
significantly reduce any performance gap between Intel's chips and RISC chips.

John Savard

Re: Intel goes to 32-bit general purpose registers

<memo.20230909173348.13508L@jgd.cix.co.uk>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=33931&group=comp.arch#33931

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: jgd@cix.co.uk (John Dallman)
Newsgroups: comp.arch
Subject: Re: Intel goes to 32-bit general purpose registers
Date: Sat, 9 Sep 2023 17:33 +0100 (BST)
Organization: A noiseless patient Spider
Lines: 41
Message-ID: <memo.20230909173348.13508L@jgd.cix.co.uk>
References: <77d4f9e8-f474-48b9-b62b-6dd210856ac1n@googlegroups.com>
Reply-To: jgd@cix.co.uk
Injection-Info: dont-email.me; posting-host="2b826d6dbcd736c61fb04e1d1deaae55";
logging-data="166111"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+4rvRnyzyQIbV7u+Wu0LQoSByRWxiNFS8="
Cancel-Lock: sha1:htHmXX09I0q2rz7Z7ZrZhTfVWRo=
 by: John Dallman - Sat, 9 Sep 2023 16:33 UTC

In article <77d4f9e8-f474-48b9-b62b-6dd210856ac1n@googlegroups.com>,
jsavard@ecn.ab.ca (Quadibloc) wrote:

> Is it there for any reason besides increasing vendor lock-in - so
> that, while AMD's agreements with Intel will let it include this
> instruction set, nobody else will be able to do so, because it is
> protected by new patents - while the patents on x86 and even
> x86_64 are running out?

I don't know the details of the x86 license agreements that various other
companies have, but it's notable that none of them have managed to make
much market impact in the last 20 years.

I think Intel are afraid to try a new architecture, given how badly they
failed with Itanium, and how much installed base they have with x86.

> One other possibility is that due to loops being common in code,
> caching being universal in large modern processors (thus making
> the low code density this kind of thing implies irrelevant),

Low code density is not irrelevant. It costs cache space, which is always
a bottleneck in practical software.

> and Intel having the ability to add transistors to its chips in
> such a way that their more complicated instruction set than that
> of a RISC chip does not have a big performance hit...

You've just given me an interesting idea about how to achieve that.

Think of the x86-64 instruction set as an intermediate code. Transform it
into a more sensibly encoded instruction set with the same registers and
operations. Build a processor that runs that, and provide code to
translate historic x86-64 code into this new format. That lets you get
past the decode bottleneck, with an install-time code transformation
which is simple enough to work reliably.

Maybe leave out some things that are never used in 64-bit mode, such as
the x87 registers and instructions, and support for self-modifying code.
Both of those are largely used by 32-bit code.

John

Re: Intel goes to 32-bit general purpose registers

<8G1LM.590250$Fgta.439991@fx10.iad>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=33932&group=comp.arch#33932

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!feeder1.feed.usenet.farm!feed.usenet.farm!peer03.ams4!peer.am4.highwinds-media.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx10.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Intel goes to 32-bit general purpose registers
Newsgroups: comp.arch
References: <77d4f9e8-f474-48b9-b62b-6dd210856ac1n@googlegroups.com> <memo.20230909173348.13508L@jgd.cix.co.uk>
Lines: 39
Message-ID: <8G1LM.590250$Fgta.439991@fx10.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Sat, 09 Sep 2023 17:10:28 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Sat, 09 Sep 2023 17:10:28 GMT
X-Received-Bytes: 2591
 by: Scott Lurndal - Sat, 9 Sep 2023 17:10 UTC

jgd@cix.co.uk (John Dallman) writes:
>In article <77d4f9e8-f474-48b9-b62b-6dd210856ac1n@googlegroups.com>,
>jsavard@ecn.ab.ca (Quadibloc) wrote:
>
>> Is it there for any reason besides increasing vendor lock-in - so
>> that, while AMD's agreements with Intel will let it include this
>> instruction set, nobody else will be able to do so, because it is
>> protected by new patents - while the patents on x86 and even
>> x86_64 are running out?
>
>I don't know the details of the x86 license agreements that various other
>companies have, but it's notable that none of them have managed to make
>much market impact in the last 20 years.
>
>I think Intel are afraid to try a new architecture, given how badly they
>failed with Itanium, and how much installed base they have with x86.
>
>> One other possibility is that due to loops being common in code,
>> caching being universal in large modern processors (thus making
>> the low code density this kind of thing implies irrelevant),
>
>Low code density is not irrelevant. It costs cache space, which is always
>a bottleneck in practical software.
>
>> and Intel having the ability to add transistors to its chips in
>> such a way that their more complicated instruction set than that
>> of a RISC chip does not have a big performance hit...
>
>You've just given me an interesting idea about how to achieve that.
>
>Think of the x86-64 instruction set as an intermediate code. Transform it
>into a more sensibly encoded instruction set with the same registers and
>operations. Build a processor that runs that, and provide code to
>translate historic x86-64 code into this new format. That lets you get
>past the decode bottleneck, with an install-time code transformation
>which is simple enough to work reliably.

You're basically describing transmeta.

Re: Intel goes to 32-bit general purpose registers

<2023Sep9.191044@mips.complang.tuwien.ac.at>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=33933&group=comp.arch#33933

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Intel goes to 32-bit general purpose registers
Date: Sat, 09 Sep 2023 17:10:44 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 48
Message-ID: <2023Sep9.191044@mips.complang.tuwien.ac.at>
References: <u9o14h$183or$1@newsreader4.netcologne.de> <0cf10981-cb89-4d23-b715-d09dcf84fc34n@googlegroups.com> <77d4f9e8-f474-48b9-b62b-6dd210856ac1n@googlegroups.com>
Injection-Info: dont-email.me; posting-host="9093e634fcc0060e2d3bec1061f87f5d";
logging-data="178965"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19zhtAzKYTX+M/zwRl5c9xp"
Cancel-Lock: sha1:zrpwSixqm3T4FFCUvgk7231QbRM=
X-newsreader: xrn 10.11
 by: Anton Ertl - Sat, 9 Sep 2023 17:10 UTC

Quadibloc <jsavard@ecn.ab.ca> writes:
>Is it there for any reason besides increasing vendor lock-in - so that, whi=
>le AMD's
>agreements with Intel will let it include this instruction set, nobody else=
> will be able
>to do so, because it is protected by new patents - while the patents on x86=
> and even
>x86_64 are running out?

I don't think that Intel is worried about competitors in the AMD64
playground other than AMD. Everybody else is pretty much irrelevant
these days.

They are probably more worried of competition from other
architectures, in particular ARM (with the server offerings) and, in
the long run, RISC-V.

>One other possibility is that due to loops being common in code, caching be=
>ing
>universal in large modern processors (thus making the low code density this=
> kind
>of thing implies irrelevant),

Supposedly the code density is the same: what code with APX loses in
individual instruction length, it gains in number of instructions.

>and Intel having the ability to add transisto=
>rs
>to its chips in such a way that their more complicated instruction set than=
> that of a
>RISC chip does not have a big performance hit... this new instruction set r=
>eally will
>significantly reduce any performance gap between Intel's chips and RISC chi=
>ps.

What performance gap? Are you living in 1994, when the 21164 reigned
supreme?

My take: supposedly APX code needs 10% fewer loads and 20% fewer
stores; LSUs cost a lot of area, and using them costs a lot more
energy than accessing a register. APX allows Intel to build wider
CPUs with the same number of LSUs. The resulting code will run a
little faster.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Intel goes to 32 GPRs

<20f4721b-f441-486f-b76d-a45ecc98c144n@googlegroups.com>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=33934&group=comp.arch#33934

  copy link   Newsgroups: comp.arch
X-Received: by 2002:a05:620a:2b45:b0:76d:9ab3:2e2c with SMTP id dp5-20020a05620a2b4500b0076d9ab32e2cmr105703qkb.2.1694281118647;
Sat, 09 Sep 2023 10:38:38 -0700 (PDT)
X-Received: by 2002:a05:6a00:1a4a:b0:68b:ea9c:b55a with SMTP id
h10-20020a056a001a4a00b0068bea9cb55amr2088831pfv.3.1694281118397; Sat, 09 Sep
2023 10:38:38 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 9 Sep 2023 10:38:37 -0700 (PDT)
In-Reply-To: <udgv5t$3uiis$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:a0aa:b645:ad20:1f85;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:a0aa:b645:ad20:1f85
References: <u9o14h$183or$1@newsreader4.netcologne.de> <u9r0jd$1ftnb$1@dont-email.me>
<ublni6$3s8o9$1@dont-email.me> <07145f85-deff-44ca-bf24-0b754ba07594n@googlegroups.com>
<ubopm8$dn07$1@dont-email.me> <ubpr3d$2iugr$1@newsreader4.netcologne.de>
<2023Aug19.123116@mips.complang.tuwien.ac.at> <AA6EM.518412$TCKc.373270@fx13.iad>
<710aedd8-3e42-46be-824a-c0c783cba2f4n@googlegroups.com> <ubrca2$2jvsf$1@newsreader4.netcologne.de>
<udcvo7$32enh$1@dont-email.me> <59a98eb4-2bdb-4dbb-86bf-09de232ba0d3n@googlegroups.com>
<ude6ci$3b25m$1@dont-email.me> <492dc45e-62b1-4444-b4fc-7cddc93744ben@googlegroups.com>
<udgv5t$3uiis$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <20f4721b-f441-486f-b76d-a45ecc98c144n@googlegroups.com>
Subject: Re: Intel goes to 32 GPRs
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Sat, 09 Sep 2023 17:38:38 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 7461
 by: MitchAlsup - Sat, 9 Sep 2023 17:38 UTC

On Saturday, September 9, 2023 at 12:17:53 AM UTC-5, BGB wrote:
> On 9/8/2023 11:36 AM, MitchAlsup wrote:
> > On Thursday, September 7, 2023 at 11:02:31 PM UTC-5, BGB wrote:
> >> On 9/7/2023 12:53 PM, MitchAlsup wrote:
> >>
> >>>> The "pain" of spill space being more obvious with 16 arguments.
> >>>>
> >>> Spill space on My 66000 stacks is allocated by the spiller not
> >>> the caller. At entry there is no excess space on the stack, however
> >>> should callee be a varargs, register arguments are pushed onto
> >>> the standard stack using the same ENTER as anyone else, just
> >>> wrapping around R0 and into the argument registers. This creates
> >>> a dense vector from which vararg arguments can be easily extracted.
> > <
> >> Called function makes space for saving preserved registers.
> >> Caller leaves space for spilling function arguments.
> >>
> > This wastes space when not needed, and is so easy for callee to provide
> > when needed.
<
> The difference in stack-usage is "mostly negligible" in the 8 argument
> case, since the function would otherwise still need to provide backing
> memory for the function arguments.
<
Early in the development of My 66000 ABI, we experimented with always
having a frame pointer. In the piece of code we looked at, this cost us 20%
more maximum stack space (on the stack). This is likely to have been the
simulator compiling the compiler--so take that for what it is worth.
>
> But, yeah, 16 argument does end up burning more stack space.
<
My point was that My 66000 ABI has the callee allocate this space, and
only when needed, otherwise the compiler can use registers as it pleases.
>
> Say, if we assume a function with, say, 5 arguments and 20 locals
> (primitive types only):
> Saves 16 registers (14 GPRs, LR and GBR);
> Space for around 20 locals;
> Space for 8 or 16 arguments.
<
Saves as many preserved registers as desired,
<optionally> allocates space for surviving arguments
allocates space for the un-register-allocated variables
<optionally> allocates space for caller arguments/results in excess of 8.
{{This, by the way, is 1 instruction.}}
>
> So, 44 or 52 spots, 352 or 416 bytes.
> Relative size delta: 18%
<
The compiler is in a position to know the lifespan of the arguments
and of the local variables; in the majority of cases, it allocates fewer
than expected containers, so the delta is greater than 20%.
<
> Looks like the most common cases for saved-register counts in BGBCC are:
> 7, 15, 23, 31
<
I have a function that looks like 1/n with 0 saved about 40% of the time (leaf
needing nothing but temporary registers) 1 is in the 30% range, decreasing
rather smoothly to 16 at < 1% level.
>
> Granted, this makes sense given how BGBCC's register allocator works:
> 7: R8..R14
> (small functions)
> 15: R8..R14, R24..R31
> (most functions)
> 23: R8..R14, R24..R31, R40..R47
> (bigger functions)
> 31: R8..R14, R24..R31, R40..R47, R56..R63
> (large functions, 34..50+ local variables)
<
So, when you save all these registers you emit copious amounts of ST
instructions, then when you reload them you emit copious amounts of
LD instructions. I, on the other hand, emit 1.....We did have to teach the
compiler to allocate registers from R30 towards R16 to be compatible
with the ENTER and EXIT instructions.
>
> Namely, enabling registers in groups; then allocating these registers
> within the group (often, all registers within the enabled set end up
> saved/restored).
>
> It was universally/force enabling the last 2 cases that was a problem:
> The extra save/restore eating any potential gains from fewer register
> spills (it remained more efficient to leave it up to heuristics, namely
> variable count, to decide whether to enable them).
<
It is reasons such as this why I stopped at 32.
>
> It was considered that for TKuCC, I would likely use a slightly
> different strategy here (possibly "more granular", calculating roughly
> how many registers are likely to be needed, and then saving this many).
>
> Well, and for 16-register machines, it is effective enough to "just save
> all the preserved registers" (since spill/fill is likely to be a bigger
> factor than the overhead of save/reload).
<
A reason I did not stop at 16 registers, too.
<
> >>
> >> Basically, the same basic scheme used by the Windows X64 ABI, and a few
> >> other "MS style" ABIs...
> > <
> > Right--because windows does it that makes it right...............not
> For the most part, MS has had a pretty solid track record as far as the
> engineering aspects goes. Granted, sometimes they throw "design
> elegance" out the window. But, for the most part, it tends to be fairly
> effective.
<
But when you design specific aspects of ISA to do all this for you, you
don't have to "follow" but design from first principles.
> >>
> >> Say:
> >> Argument spill space on stack, provided by caller;
> >> Structures are passed/returned via pointers;
> >> Return by copying into a pointer provided by the caller;
> >> Except when structure is smaller than 16 bytes, which use registers;
> > <
> > Argument spill space is provided by callee
> > Structures up to 8 registers are passed in registers both ways
> My case, it is 2 registers for passing/returning in registers.
>
And I remain suggesting this is wasting cycles here or there.
>

Re: Intel goes to 32-bit general purpose registers

<2023Sep9.192231@mips.complang.tuwien.ac.at>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=33935&group=comp.arch#33935

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Intel goes to 32-bit general purpose registers
Date: Sat, 09 Sep 2023 17:22:31 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 52
Message-ID: <2023Sep9.192231@mips.complang.tuwien.ac.at>
References: <77d4f9e8-f474-48b9-b62b-6dd210856ac1n@googlegroups.com> <memo.20230909173348.13508L@jgd.cix.co.uk>
Injection-Info: dont-email.me; posting-host="9093e634fcc0060e2d3bec1061f87f5d";
logging-data="184195"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1//eLJEnXYJbsj/27Kyj72M"
Cancel-Lock: sha1:ZqCeLXIxgok8dv5ooYr+LxnaNgg=
X-newsreader: xrn 10.11
 by: Anton Ertl - Sat, 9 Sep 2023 17:22 UTC

jgd@cix.co.uk (John Dallman) writes:
>Think of the x86-64 instruction set as an intermediate code. Transform it
>into a more sensibly encoded instruction set with the same registers and
>operations. Build a processor that runs that, and provide code to
>translate historic x86-64 code into this new format. That lets you get
>past the decode bottleneck, with an install-time code transformation
>which is simple enough to work reliably.

Intel and AMD already do this, but not at install time. They have
microcode caches that contain the new format.

One interesting aspect is that Tremont (and AFAIK Gracemont), i.e.,
the E-Cores, don't have a microcode cache, but instead two three-wide
decoders (decoding the next two predicted instruction streams);
supposedly this is more power-efficient; it certainly is more
area-efficient, which seems to be the main point of these cores.
Anyway, there does not really seem to be a decoding bottleneck. For
the P-cores they could use a third and fourth 3-wide decoder if they
wanted to do away with the microcode cache.

One other interesting aspect is that ARM also uses a microcode cache.

One guess I have is that this is due to ARM supporting 2 or 3
different instruction sets. If that is true, they could do away with
the cache now that they eliminate A32/T32 in their ARMv9 cores.

Another guess I have is that their many-register instructions require
more sophisticated "decoding" in their big cores, and they cache that
effort in the microcode cache. If that is true, they probably will
stick with the cache.

>Maybe leave out some things that are never used in 64-bit mode, such as
>the x87 registers and instructions, and support for self-modifying code.
>Both of those are largely used by 32-bit code.

Both of those are also used by 64-bit code. The 387 stuff is used
when 80-bit floats are wanted; in particular, the biggest customer of
MPE has complained about FP precision when they switched from
IA-32+387 to AMD64+SSE2, and I think they now provide AMD64+387 as an
option.

As for "self-modifying code", like IA-32, AMD64 has no instructions
for telling the CPU that the data just written are instructions, so
JIT compilers and the like just write the instructions and then
execute them, just as with IA-32. If you eliminate support for that,
all JIT compilers stop working (or worse, they might work by luck in
testing, and then fail in the field).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Intel goes to 32-bit general purpose registers

<memo.20230909190509.13508N@jgd.cix.co.uk>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=33936&group=comp.arch#33936

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: jgd@cix.co.uk (John Dallman)
Newsgroups: comp.arch
Subject: Re: Intel goes to 32-bit general purpose registers
Date: Sat, 9 Sep 2023 19:05 +0100 (BST)
Organization: A noiseless patient Spider
Lines: 16
Message-ID: <memo.20230909190509.13508N@jgd.cix.co.uk>
References: <memo.20230909173348.13508L@jgd.cix.co.uk>
Reply-To: jgd@cix.co.uk
Injection-Info: dont-email.me; posting-host="2b826d6dbcd736c61fb04e1d1deaae55";
logging-data="190750"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19870Ycq9BK296Q8rCG+FpQ+Oy7eSW40Sw="
Cancel-Lock: sha1:ffJzLNrpH7gYPUY51Ww2BTvijcw=
 by: John Dallman - Sat, 9 Sep 2023 18:05 UTC

In article <memo.20230909173348.13508L@jgd.cix.co.uk>, jgd@cix.co.uk
(John Dallman) wrote:

> Think of the x86-64 instruction set as an intermediate code.
> Transform it into a more sensibly encoded instruction set with
> the same registers and operations.

I don't think there's any possibility that Intel have not thought of this
idea. Which means one or more of:

* It's outside their management's comfortable zone of operations.
* It's much harder than it sounds.
* It won't work, which is my bet.
* They're working on it.

John

Re: Intel goes to 32-bit general purpose registers

<memo.20230909192959.13508O@jgd.cix.co.uk>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=33938&group=comp.arch#33938

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: jgd@cix.co.uk (John Dallman)
Newsgroups: comp.arch
Subject: Re: Intel goes to 32-bit general purpose registers
Date: Sat, 9 Sep 2023 19:29 +0100 (BST)
Organization: A noiseless patient Spider
Lines: 22
Message-ID: <memo.20230909192959.13508O@jgd.cix.co.uk>
References: <8G1LM.590250$Fgta.439991@fx10.iad>
Reply-To: jgd@cix.co.uk
Injection-Info: dont-email.me; posting-host="2b826d6dbcd736c61fb04e1d1deaae55";
logging-data="197674"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/0gTsJxvR/fRkCN5k7lg041HFDEC/vEcc="
Cancel-Lock: sha1:QhJetplF0QghhASqkX8pY6jdqiY=
 by: John Dallman - Sat, 9 Sep 2023 18:29 UTC

In article <8G1LM.590250$Fgta.439991@fx10.iad>, scott@slp53.sl.home
(Scott Lurndal) wrote:

> jgd@cix.co.uk (John Dallman) writes:
> > Think of the x86-64 instruction set as an intermediate code.
> > Transform it into a more sensibly encoded instruction set with
> > the same registers and operations. Build a processor that runs
> > that, and provide code to translate historic x86-64 code into
> > this new format. That lets you get past the decode bottleneck,
> > with an install-time code transformation which is simple enough
> > to work reliably.
> You're basically describing transmeta.

Yes, but significantly simpler and thus easier to implement. Transmeta
wanted to get high performance and low power usage by VLIW magic, which
they could not make work, because real code is too branchy.

This just makes the decode simpler, allowing more transistors for
productive work in a given fabrication process. It also allows for a
transition to the rationalised instruction set.

John

Re: Intel goes to 32-bit general purpose registers

<memo.20230909193000.13508P@jgd.cix.co.uk>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=33939&group=comp.arch#33939

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: jgd@cix.co.uk (John Dallman)
Newsgroups: comp.arch
Subject: Re: Intel goes to 32-bit general purpose registers
Date: Sat, 9 Sep 2023 19:30 +0100 (BST)
Organization: A noiseless patient Spider
Lines: 49
Message-ID: <memo.20230909193000.13508P@jgd.cix.co.uk>
References: <2023Sep9.192231@mips.complang.tuwien.ac.at>
Reply-To: jgd@cix.co.uk
Injection-Info: dont-email.me; posting-host="2b826d6dbcd736c61fb04e1d1deaae55";
logging-data="197674"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19tVEDB5uZ1f+ed/rIIgh3gsxtfr1r6MHI="
Cancel-Lock: sha1:n5ZDQ6YVdRF8yPFWp1iU52PyiVA=
 by: John Dallman - Sat, 9 Sep 2023 18:30 UTC

In article <2023Sep9.192231@mips.complang.tuwien.ac.at>,
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> One interesting aspect is that Tremont (and AFAIK Gracemont), i.e.,
> the E-Cores, don't have a microcode cache, but instead two
> three-wide decoders (decoding the next two predicted instruction
> streams); supposedly this is more power-efficient; it certainly is
> more area-efficient, which seems to be the main point of these cores.
> Anyway, there does not really seem to be a decoding bottleneck.

Aha, that's a sensible way round the decoding bottleneck.

> One other interesting aspect is that ARM also uses a microcode
> cache.
>
> One guess I have is that this is due to ARM supporting 2 or 3
> different instruction sets. If that is true, they could do away
> with the cache now that they eliminate A32/T32 in their ARMv9 cores.

That seems plausible.

> Another guess I have is that their many-register instructions
> require more sophisticated "decoding" in their big cores, and
> they cache that effort in the microcode cache.

From memory, the 64-bit ARM instruction set does not have the
many-register operations. It has instructions that use register pairs,
but not the bit-masks of registers to operate on that the A32 instruction
set inherited from the early days.

> Both of those are also used by 64-bit code. The 387 stuff is used
> when 80-bit floats are wanted; in particular, the biggest customer
> of MPE has complained about FP precision when they switched from
> IA-32+387 to AMD64+SSE2, and I think they now provide AMD64+387 as
> an option.

I don't know which MPE this is, but if it gets used, it gets used.

> As for "self-modifying code", like IA-32, AMD64 has no instructions
> for telling the CPU that the data just written are instructions, so
> JIT compilers and the like just write the instructions and then
> execute them, just as with IA-32. If you eliminate support for
> that, all JIT compilers stop working (or worse, they might work by
> luck in testing, and then fail in the field).

Yup, that wrecks that part of the idea.

John

Re: Intel goes to 32-bit general purpose registers

<memo.20230909193001.13508Q@jgd.cix.co.uk>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=33940&group=comp.arch#33940

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: jgd@cix.co.uk (John Dallman)
Newsgroups: comp.arch
Subject: Re: Intel goes to 32-bit general purpose registers
Date: Sat, 9 Sep 2023 19:30 +0100 (BST)
Organization: A noiseless patient Spider
Lines: 15
Message-ID: <memo.20230909193001.13508Q@jgd.cix.co.uk>
References: <2023Sep9.191044@mips.complang.tuwien.ac.at>
Reply-To: jgd@cix.co.uk
Injection-Info: dont-email.me; posting-host="2b826d6dbcd736c61fb04e1d1deaae55";
logging-data="197674"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19DdViLcVe43yq9WsD3LNQ/krVYmuB8hw0="
Cancel-Lock: sha1:eHHD+RXlHaofH/ooGufnqKNoxsg=
 by: John Dallman - Sat, 9 Sep 2023 18:30 UTC

In article <2023Sep9.191044@mips.complang.tuwien.ac.at>,
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> [Intel] are probably more worried of competition from other
> architectures, in particular ARM (with the server offerings)
> and, in the long run, RISC-V.

They seem to be trying to join the RISC-V bandwagon, a bit. I think this
is in the hope of getting to build processors with performance-per-watt
that competes with ARM.

<https://riscv.org/news/2023/01/hifive-pro-p550-horse-creek-risc-v-motherb
oard-with-16gb-ram-to-launch-this-summer/>

John

Re: Intel goes to 32-bit general purpose registers

<11c60c39-cb65-4b78-8ba0-c3fe9bf82f80n@googlegroups.com>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=33941&group=comp.arch#33941

  copy link   Newsgroups: comp.arch
X-Received: by 2002:a05:622a:1103:b0:40f:f22c:2a3b with SMTP id e3-20020a05622a110300b0040ff22c2a3bmr137783qty.3.1694284368839;
Sat, 09 Sep 2023 11:32:48 -0700 (PDT)
X-Received: by 2002:a17:90b:100a:b0:268:776:e26 with SMTP id
gm10-20020a17090b100a00b0026807760e26mr1469042pjb.5.1694284368489; Sat, 09
Sep 2023 11:32:48 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 9 Sep 2023 11:32:47 -0700 (PDT)
In-Reply-To: <2023Sep9.192231@mips.complang.tuwien.ac.at>
Injection-Info: google-groups.googlegroups.com; posting-host=136.50.14.162; posting-account=AoizIQoAAADa7kQDpB0DAj2jwddxXUgl
NNTP-Posting-Host: 136.50.14.162
References: <77d4f9e8-f474-48b9-b62b-6dd210856ac1n@googlegroups.com>
<memo.20230909173348.13508L@jgd.cix.co.uk> <2023Sep9.192231@mips.complang.tuwien.ac.at>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <11c60c39-cb65-4b78-8ba0-c3fe9bf82f80n@googlegroups.com>
Subject: Re: Intel goes to 32-bit general purpose registers
From: jim.brakefield@ieee.org (JimBrakefield)
Injection-Date: Sat, 09 Sep 2023 18:32:48 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 4971
 by: JimBrakefield - Sat, 9 Sep 2023 18:32 UTC

On Saturday, September 9, 2023 at 12:42:11 PM UTC-5, Anton Ertl wrote:
> j...@cix.co.uk (John Dallman) writes:
> >Think of the x86-64 instruction set as an intermediate code. Transform it
> >into a more sensibly encoded instruction set with the same registers and
> >operations. Build a processor that runs that, and provide code to
> >translate historic x86-64 code into this new format. That lets you get
> >past the decode bottleneck, with an install-time code transformation
> >which is simple enough to work reliably.
> Intel and AMD already do this, but not at install time. They have
> microcode caches that contain the new format.
>
> One interesting aspect is that Tremont (and AFAIK Gracemont), i.e.,
> the E-Cores, don't have a microcode cache, but instead two three-wide
> decoders (decoding the next two predicted instruction streams);
> supposedly this is more power-efficient; it certainly is more
> area-efficient, which seems to be the main point of these cores.
> Anyway, there does not really seem to be a decoding bottleneck. For
> the P-cores they could use a third and fourth 3-wide decoder if they
> wanted to do away with the microcode cache.
>
> One other interesting aspect is that ARM also uses a microcode cache.
>
> One guess I have is that this is due to ARM supporting 2 or 3
> different instruction sets. If that is true, they could do away with
> the cache now that they eliminate A32/T32 in their ARMv9 cores.
>
> Another guess I have is that their many-register instructions require
> more sophisticated "decoding" in their big cores, and they cache that
> effort in the microcode cache. If that is true, they probably will
> stick with the cache.
> >Maybe leave out some things that are never used in 64-bit mode, such as
> >the x87 registers and instructions, and support for self-modifying code.
> >Both of those are largely used by 32-bit code.
> Both of those are also used by 64-bit code. The 387 stuff is used
> when 80-bit floats are wanted; in particular, the biggest customer of
> MPE has complained about FP precision when they switched from
> IA-32+387 to AMD64+SSE2, and I think they now provide AMD64+387 as an
> option.
>
> As for "self-modifying code", like IA-32, AMD64 has no instructions
> for telling the CPU that the data just written are instructions, so
> JIT compilers and the like just write the instructions and then
> execute them, just as with IA-32. If you eliminate support for that,
> all JIT compilers stop working (or worse, they might work by luck in
> testing, and then fail in the field).
> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

FWIIW in 2018..2020 I enumerated a 32-register RISC/X86-64 hybrid ISA:
The 24-bit RISC instructions used three register fields, any immediate values suffixed to the 24-bits.
A combination of 24 and 32-bit instructions for the x86 op-codes (operations between register and memory operand,
result going to either); again immediate values suffixed.

So with a complete re-encoding of X86 ISA one can do a X86 like ISA with 32 registers in 24 and 32-bit instructions, eg, without
the prefix byte stuff (memory addresses of the form: reg B + (reg I << SHF) + SZ2-sized-offsset).

This gives one a superset of the X86, allowing X86 instructions of whatever prefix byte combinations to be encoded into prefix free
instructions that support a full 32 register register file.

Re: Intel goes to 32-bit general purpose registers

<32c8fa44-dd86-437f-b174-1e4af28b8c53n@googlegroups.com>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=33942&group=comp.arch#33942

  copy link   Newsgroups: comp.arch
X-Received: by 2002:a05:622a:453:b0:40d:b839:b5bb with SMTP id o19-20020a05622a045300b0040db839b5bbmr161453qtx.2.1694287594786;
Sat, 09 Sep 2023 12:26:34 -0700 (PDT)
X-Received: by 2002:a17:902:d48d:b0:1bb:b30e:436d with SMTP id
c13-20020a170902d48d00b001bbb30e436dmr2247540plg.4.1694287594528; Sat, 09 Sep
2023 12:26:34 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 9 Sep 2023 12:26:34 -0700 (PDT)
In-Reply-To: <2023Sep9.192231@mips.complang.tuwien.ac.at>
Injection-Info: google-groups.googlegroups.com; posting-host=2a0d:6fc2:55b0:ca00:39b0:732b:a11b:21f8;
posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 2a0d:6fc2:55b0:ca00:39b0:732b:a11b:21f8
References: <77d4f9e8-f474-48b9-b62b-6dd210856ac1n@googlegroups.com>
<memo.20230909173348.13508L@jgd.cix.co.uk> <2023Sep9.192231@mips.complang.tuwien.ac.at>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <32c8fa44-dd86-437f-b174-1e4af28b8c53n@googlegroups.com>
Subject: Re: Intel goes to 32-bit general purpose registers
From: already5chosen@yahoo.com (Michael S)
Injection-Date: Sat, 09 Sep 2023 19:26:34 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 4380
 by: Michael S - Sat, 9 Sep 2023 19:26 UTC

On Saturday, September 9, 2023 at 8:42:11 PM UTC+3, Anton Ertl wrote:
> j...@cix.co.uk (John Dallman) writes:
> >Think of the x86-64 instruction set as an intermediate code. Transform it
> >into a more sensibly encoded instruction set with the same registers and
> >operations. Build a processor that runs that, and provide code to
> >translate historic x86-64 code into this new format. That lets you get
> >past the decode bottleneck, with an install-time code transformation
> >which is simple enough to work reliably.
> Intel and AMD already do this, but not at install time. They have
> microcode caches that contain the new format.
>
> One interesting aspect is that Tremont (and AFAIK Gracemont), i.e.,
> the E-Cores, don't have a microcode cache, but instead two three-wide
> decoders (decoding the next two predicted instruction streams);
> supposedly this is more power-efficient; it certainly is more
> area-efficient, which seems to be the main point of these cores.
> Anyway, there does not really seem to be a decoding bottleneck. For
> the P-cores they could use a third and fourth 3-wide decoder if they
> wanted to do away with the microcode cache.
>
> One other interesting aspect is that ARM also uses a microcode cache.
>
> One guess I have is that this is due to ARM supporting 2 or 3
> different instruction sets. If that is true, they could do away with
> the cache now that they eliminate A32/T32 in their ARMv9 cores.
>

Out of Arm Inc. aarch64-only cores those have MOP cache:
Cortex-X2, Cortex-X3, Neoverse-V2.
And those don't:
Cortex-A715, Cortex-A720, Coertex-X4.

> Another guess I have is that their many-register instructions require
> more sophisticated "decoding" in their big cores, and they cache that
> effort in the microcode cache. If that is true, they probably will
> stick with the cache.
> >Maybe leave out some things that are never used in 64-bit mode, such as
> >the x87 registers and instructions, and support for self-modifying code.
> >Both of those are largely used by 32-bit code.
> Both of those are also used by 64-bit code. The 387 stuff is used
> when 80-bit floats are wanted; in particular, the biggest customer of
> MPE has complained about FP precision when they switched from
> IA-32+387 to AMD64+SSE2, and I think they now provide AMD64+387 as an
> option.
>
> As for "self-modifying code", like IA-32, AMD64 has no instructions
> for telling the CPU that the data just written are instructions, so
> JIT compilers and the like just write the instructions and then
> execute them, just as with IA-32. If you eliminate support for that,
> all JIT compilers stop working (or worse, they might work by luck in
> testing, and then fail in the field).
> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Re: Intel goes to 32-bit general purpose registers

<udihf4$6j1j$1@dont-email.me>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=33943&group=comp.arch#33943

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: sfuld@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Intel goes to 32-bit general purpose registers
Date: Sat, 9 Sep 2023 12:36:04 -0700
Organization: A noiseless patient Spider
Lines: 24
Message-ID: <udihf4$6j1j$1@dont-email.me>
References: <memo.20230909173348.13508L@jgd.cix.co.uk>
<memo.20230909190509.13508N@jgd.cix.co.uk>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 9 Sep 2023 19:36:04 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="9096fcde15b509542d418631fff6d377";
logging-data="216115"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18LRAXsKEkYYdxCbs5lh9wLm9/wDNkNJcw="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:kg20W6NDeG4d2JaOCHjvVJMm3+U=
In-Reply-To: <memo.20230909190509.13508N@jgd.cix.co.uk>
Content-Language: en-US
 by: Stephen Fuld - Sat, 9 Sep 2023 19:36 UTC

On 9/9/2023 11:05 AM, John Dallman wrote:
> In article <memo.20230909173348.13508L@jgd.cix.co.uk>, jgd@cix.co.uk
> (John Dallman) wrote:
>
>> Think of the x86-64 instruction set as an intermediate code.
>> Transform it into a more sensibly encoded instruction set with
>> the same registers and operations.
>
> I don't think there's any possibility that Intel have not thought of this
> idea. Which means one or more of:
>
> * It's outside their management's comfortable zone of operations.
> * It's much harder than it sounds.
> * It won't work, which is my bet.
> * They're working on it.

Or, it might work, but there are "better" (at least in their
calculations) solutions, which, as Anton has elucidated, they have chosen.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Intel goes to 32-bit general purpose registers

<75696fb0-f8d1-4340-a318-2de98811250cn@googlegroups.com>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=33944&group=comp.arch#33944

  copy link   Newsgroups: comp.arch
X-Received: by 2002:a05:6214:b87:b0:64f:3bbb:1d1c with SMTP id fe7-20020a0562140b8700b0064f3bbb1d1cmr223740qvb.2.1694289075569;
Sat, 09 Sep 2023 12:51:15 -0700 (PDT)
X-Received: by 2002:a05:6a00:421b:b0:68f:a57d:2569 with SMTP id
cd27-20020a056a00421b00b0068fa57d2569mr740094pfb.4.1694289075327; Sat, 09 Sep
2023 12:51:15 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!3.us.feeder.erje.net!feeder.erje.net!border-1.nntp.ord.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 9 Sep 2023 12:51:14 -0700 (PDT)
In-Reply-To: <memo.20230909190509.13508N@jgd.cix.co.uk>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:a0aa:b645:ad20:1f85;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:a0aa:b645:ad20:1f85
References: <memo.20230909173348.13508L@jgd.cix.co.uk> <memo.20230909190509.13508N@jgd.cix.co.uk>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <75696fb0-f8d1-4340-a318-2de98811250cn@googlegroups.com>
Subject: Re: Intel goes to 32-bit general purpose registers
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Sat, 09 Sep 2023 19:51:15 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 37
 by: MitchAlsup - Sat, 9 Sep 2023 19:51 UTC

On Saturday, September 9, 2023 at 1:05:14 PM UTC-5, John Dallman wrote:
> In article <memo.20230909...@jgd.cix.co.uk>, j...@cix.co.uk
> (John Dallman) wrote:
>
> > Think of the x86-64 instruction set as an intermediate code.
> > Transform it into a more sensibly encoded instruction set with
> > the same registers and operations.
<
This is the design point of the front end of K9--treat x86-64 as IR and
then recompile it into RISC. It turns out that to do this efficiently, one
has to perform peep-hole optimizations on the post transformed
code. This is at least BigO( n^2 ), but for n < {3 or 4} one can brute
force this in gates.
<
> I don't think there's any possibility that Intel have not thought of this
> idea. Which means one or more of:
<
Since I was doing this in 1999-2004, you can bet intel looked into it.
>
> * It's outside their management's comfortable zone of operations.
<
Possibly, but this ultimately leads to loss of market.
<
> * It's much harder than it sounds.
<
In x86-64 it actually simplifies many things.
<
> * It won't work, which is my bet.
<
We had little problem with it.
<
> * They're working on it.
<
Waiting for that special market window.
>
> John

Re: Intel goes to 32 GPRs

<udikqu$71in$1@dont-email.me>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=33945&group=comp.arch#33945

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Intel goes to 32 GPRs
Date: Sat, 9 Sep 2023 15:33:32 -0500
Organization: A noiseless patient Spider
Lines: 294
Message-ID: <udikqu$71in$1@dont-email.me>
References: <u9o14h$183or$1@newsreader4.netcologne.de>
<u9r0jd$1ftnb$1@dont-email.me> <ublni6$3s8o9$1@dont-email.me>
<07145f85-deff-44ca-bf24-0b754ba07594n@googlegroups.com>
<ubopm8$dn07$1@dont-email.me> <ubpr3d$2iugr$1@newsreader4.netcologne.de>
<2023Aug19.123116@mips.complang.tuwien.ac.at>
<AA6EM.518412$TCKc.373270@fx13.iad>
<710aedd8-3e42-46be-824a-c0c783cba2f4n@googlegroups.com>
<ubrca2$2jvsf$1@newsreader4.netcologne.de> <udcvo7$32enh$1@dont-email.me>
<59a98eb4-2bdb-4dbb-86bf-09de232ba0d3n@googlegroups.com>
<ude6ci$3b25m$1@dont-email.me>
<492dc45e-62b1-4444-b4fc-7cddc93744ben@googlegroups.com>
<udgv5t$3uiis$1@dont-email.me>
<20f4721b-f441-486f-b76d-a45ecc98c144n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 9 Sep 2023 20:33:35 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="e694753756465144a949b4d04db0cb41";
logging-data="230999"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+61C9F95yvu+/7BZgXoyMw"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.14.0
Cancel-Lock: sha1:NIQ41an7tmxzuVPF8ABQHcxSXW8=
Content-Language: en-US
In-Reply-To: <20f4721b-f441-486f-b76d-a45ecc98c144n@googlegroups.com>
 by: BGB - Sat, 9 Sep 2023 20:33 UTC

On 9/9/2023 12:38 PM, MitchAlsup wrote:
> On Saturday, September 9, 2023 at 12:17:53 AM UTC-5, BGB wrote:
>> On 9/8/2023 11:36 AM, MitchAlsup wrote:
>>> On Thursday, September 7, 2023 at 11:02:31 PM UTC-5, BGB wrote:
>>>> On 9/7/2023 12:53 PM, MitchAlsup wrote:
>>>>
>>>>>> The "pain" of spill space being more obvious with 16 arguments.
>>>>>>
>>>>> Spill space on My 66000 stacks is allocated by the spiller not
>>>>> the caller. At entry there is no excess space on the stack, however
>>>>> should callee be a varargs, register arguments are pushed onto
>>>>> the standard stack using the same ENTER as anyone else, just
>>>>> wrapping around R0 and into the argument registers. This creates
>>>>> a dense vector from which vararg arguments can be easily extracted.
>>> <
>>>> Called function makes space for saving preserved registers.
>>>> Caller leaves space for spilling function arguments.
>>>>
>>> This wastes space when not needed, and is so easy for callee to provide
>>> when needed.
> <
>> The difference in stack-usage is "mostly negligible" in the 8 argument
>> case, since the function would otherwise still need to provide backing
>> memory for the function arguments.
> <
> Early in the development of My 66000 ABI, we experimented with always
> having a frame pointer. In the piece of code we looked at, this cost us 20%
> more maximum stack space (on the stack). This is likely to have been the
> simulator compiling the compiler--so take that for what it is worth.
>>
>> But, yeah, 16 argument does end up burning more stack space.
> <
> My point was that My 66000 ABI has the callee allocate this space, and
> only when needed, otherwise the compiler can use registers as it pleases.

The space doesn't effect register usage; only stack-frame size.

Stack-frames being a little bit larger than a theoretical minimum
doesn't seem to have all that big of an effect on performance.

It also doesn't effect things enough to significantly effect the
hit/miss rate of a 9-bit load/store displacement.

Needing to reserve this space is also N/A for leaf functions (but, can
leave some small leaf functions space to save/restore registers without
needing to adjust the stack pointer).

>>
>> Say, if we assume a function with, say, 5 arguments and 20 locals
>> (primitive types only):
>> Saves 16 registers (14 GPRs, LR and GBR);
>> Space for around 20 locals;
>> Space for 8 or 16 arguments.
> <
> Saves as many preserved registers as desired,
> <optionally> allocates space for surviving arguments
> allocates space for the un-register-allocated variables
> <optionally> allocates space for caller arguments/results in excess of 8.
> {{This, by the way, is 1 instruction.}}

The number of registers to save/restored is not a mandate, rather a
"performance tuned parameter" (need to try to counter-balance
spill/refill with the cost of save/restore in the prolog and epilog).

>>
>> So, 44 or 52 spots, 352 or 416 bytes.
>> Relative size delta: 18%
> <
> The compiler is in a position to know the lifespan of the arguments
> and of the local variables; in the majority of cases, it allocates fewer
> than expected containers, so the delta is greater than 20%.
> <

At least in BGBCC, typically *all* variables are assigned backing
memory, but whether or not this backing memory is actually used, is more
subject to interpretation.

This excludes pure-static and tiny-leaf functions, which may sidestep
this part (tiny-leaf functions wont create a stack-frame in the first
place, static-assigning every variable to a scratch register).

Granted, it could be possible for the compiler to sidestep the need to
assign backing memory for variables that don't need it.

>> Looks like the most common cases for saved-register counts in BGBCC are:
>> 7, 15, 23, 31
> <
> I have a function that looks like 1/n with 0 saved about 40% of the time (leaf
> needing nothing but temporary registers) 1 is in the 30% range, decreasing
> rather smoothly to 16 at < 1% level.

As noted, 16 variables merely covers 66% of functions in my case...

A lot of the larger functions weigh in at over 100 variables.

One of my larger functions has ~ 145 variables, and a fair chunk of them
are 128-bit SIMD types as well.

Not quite an "obscure edge case" as a lot these functions also happen to
be in the "hot path" in this case.

They are also singular functions that take upwards of 1000 lines each as
well.

Though, I can note that the Quake Engine proper seemingly does not
exceed 100 variables for any of its functions (so it is mostly my own
code that is resulting in the high variable count numbers).

Checking Doom:
16 variables: 64%
32 variables: 89%
48 variables: 97%
64 variables: 98%

The largest functions are around 108 variables.
Though, looks like most of the larger functions are still my own code;
The Doom engine proper appears to be mostly free of 80-100 variable
beast functions...

>>
>> Granted, this makes sense given how BGBCC's register allocator works:
>> 7: R8..R14
>> (small functions)
>> 15: R8..R14, R24..R31
>> (most functions)
>> 23: R8..R14, R24..R31, R40..R47
>> (bigger functions)
>> 31: R8..R14, R24..R31, R40..R47, R56..R63
>> (large functions, 34..50+ local variables)
> <
> So, when you save all these registers you emit copious amounts of ST
> instructions, then when you reload them you emit copious amounts of
> LD instructions. I, on the other hand, emit 1.....We did have to teach the
> compiler to allocate registers from R30 towards R16 to be compatible
> with the ENTER and EXIT instructions.

....

This is why "prolog/epilog compression" is a thing in my case...

Like, say if one is effectively saving/reloading 33 values:
R8..R14, R24..R31, R40..R47, R56..R63, GBR, LR

Then it makes sense to consolidate this across multiple functions and
reuse it.

Granted, noting as how they mostly only occur in fixed quantities, it is
possible that the prologs/epilogs could have been handled by special
purpose runtime functions in this case:
ADD -272, SP //save area
MOV LR, R1
BSR __prolog_save_31
ADD -384, SP //rest of the call-frame
...

>>
>> Namely, enabling registers in groups; then allocating these registers
>> within the group (often, all registers within the enabled set end up
>> saved/restored).
>>
>> It was universally/force enabling the last 2 cases that was a problem:
>> The extra save/restore eating any potential gains from fewer register
>> spills (it remained more efficient to leave it up to heuristics, namely
>> variable count, to decide whether to enable them).
> <
> It is reasons such as this why I stopped at 32.

Possibly.

Top 80% of functions are, granted, more efficient if one only uses the
first 32 registers for them (even with an ISA variant where the encoding
is orthogonal).

It is the last 20% that is the factor, and the likelihood of functions
in this last 20% being in the hot path.

While "most functions" will not benefit from 64 registers, things like
80+ variable monster functions will take a bigger performance hit with
32 registers than with 64 (and the performance of such code is "hot
garbage" on something like a RasPi).

Sort of a similar issue with 16 argument registers:
Most functions will not benefit;
Functions where one is passing a bunch of 128-bit vectors or similar,
will fare better with 16 argument registers.

In theory, the compiler could fine-tune the ABI per function (say, not
bothering with reserving space for 16 arguments for functions which use
fewer than 8 arguments), but this would add complexity.

Say:
Called function only takes 6 or 8 arguments or similar, ABI falls back
to 8 argument rules;
Called function takes 12 arguments, ABI switches to 16 argument rules;
Vararg functions would likely assume all 16 registers.

>>
>> It was considered that for TKuCC, I would likely use a slightly
>> different strategy here (possibly "more granular", calculating roughly
>> how many registers are likely to be needed, and then saving this many).
>>
>> Well, and for 16-register machines, it is effective enough to "just save
>> all the preserved registers" (since spill/fill is likely to be a bigger
>> factor than the overhead of save/reload).
> <
> A reason I did not stop at 16 registers, too.
> <

Yeah, this much is uncontroversial.

>>>>
>>>> Basically, the same basic scheme used by the Windows X64 ABI, and a few
>>>> other "MS style" ABIs...
>>> <
>>> Right--because windows does it that makes it right...............not
>> For the most part, MS has had a pretty solid track record as far as the
>> engineering aspects goes. Granted, sometimes they throw "design
>> elegance" out the window. But, for the most part, it tends to be fairly
>> effective.
> <
> But when you design specific aspects of ISA to do all this for you, you
> don't have to "follow" but design from first principles.


Click here to read the complete article

devel / comp.arch / Re: Intel goes to 32-bit general purpose registers

Pages:12345678910
server_pubkey.txt

rocksolid light 0.9.81
clearnet tor