Rocksolid Light - comp.arch - Re: Intel goes to 32 GPRs

Re: Intel goes to 32 GPRs

<uc0egd$2n58b$2@newsreader4.netcologne.de>

https://news.novabbs.org/devel/article-flat.php?id=33737&group=comp.arch#33737

Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!2.eu.feeder.erje.net!feeder.erje.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd6-1c08-0-7eab-33e6-ac79-76a6.ipv6dyn.netcologne.de!not-for-mail
From: tkoenig@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Intel goes to 32 GPRs
Date: Mon, 21 Aug 2023 19:38:53 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <uc0egd$2n58b$2@newsreader4.netcologne.de>
References: <u9o14h$183or$1@newsreader4.netcologne.de>
<07145f85-deff-44ca-bf24-0b754ba07594n@googlegroups.com>
<ubopm8$dn07$1@dont-email.me> <ubpr3d$2iugr$1@newsreader4.netcologne.de>
<ubueqn$1nk70$2@dont-email.me>
Injection-Date: Mon, 21 Aug 2023 19:38:53 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd6-1c08-0-7eab-33e6-ac79-76a6.ipv6dyn.netcologne.de:2001:4dd6:1c08:0:7eab:33e6:ac79:76a6";
logging-data="2856203"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)

by: Thomas Koenig - Mon, 21 Aug 2023 19:38 UTC

Kent Dickey <kegs@provalid.com> wrote:
> In article <ubpr3d$2iugr$1@newsreader4.netcologne.de>,
> Thomas Koenig <tkoenig@netcologne.de> wrote:
>>Kent Dickey <kegs@provalid.com> wrote:
>>
>>> It's just that most of the code is full of register reloads around function
>>> calls, most of which would go away with just a few more preserved registers.
>>
>>It would be good to have an option for the linker to do away with
>>load/stores even in the absence of full LTO.
>>
>>The discussion here led me to propose this on the gcc mailing list, with
>>the feedback that it could be useful, but would be a handful of work.
>>
>>A compiler could annotate spill/restore instructions, something like
>>this:
>>
>>
>> .spill r17
>> std r17,120(sp)
>>[...]
>> call foo
>>
>>[...]
>>
>> .restore r17
>> ldd r17,120(sp)
>>
>>and foo would then have something like
>>
>>.type foo,@function
>>.useregs foo,r12,r13,r14,r15
>>foo:
>>
>>If the linker finds what the assembler puts into the object file
>>for these assembly statements, it could then infer that r17 is not
>>used in foo, and remove the corresponding std and ldd statements
>>(with all the potential issues that this has).
>
> I think it's not a great idea to push this to the compiler/linker when
> it's a trivial thing to fix in documentation.

How would a fix in the documentation change which registers are used?

> But the real problem is a solution like this is more complex: the linker
> needs to know all the functions foo() could call, and track the register
> use through all of them as well.

Correct. It would face the same kind of problem as LTO, but on a
much lower level.
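
The transitive tracking discussed above can be sketched as a small fixed-point computation over the call graph. This is only an illustration in Python, not how any existing linker works: the function names other than foo, the helper names, and the assumption that every callee is known (no indirect calls, no external code) are all hypothetical.

```python
from typing import Dict, Set


def transitive_reg_use(direct_use: Dict[str, Set[str]],
                       calls: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each function, the registers it or anything it can call may touch.

    Assumes every callee named in `calls` also appears in `direct_use`.
    """
    total = {f: set(regs) for f, regs in direct_use.items()}
    changed = True
    while changed:  # iterate to a fixed point; this also copes with recursion
        changed = False
        for f in total:
            for callee in calls.get(f, ()):
                extra = total[callee] - total[f]
                if extra:
                    total[f] |= extra
                    changed = True
    return total


def removable_spills(spilled: Set[str], callee: str,
                     total: Dict[str, Set[str]]) -> Set[str]:
    """Registers spilled around a call that the callee's whole call tree never touches."""
    return spilled - total[callee]


# foo's own register list mirrors the .useregs line in the quoted post;
# bar and baz are made-up callees.
direct_use = {
    "foo": {"r12", "r13", "r14", "r15"},
    "bar": {"r17"},
    "baz": {"r12"},
}

# If foo only calls baz, the spill of r17 around "call foo" could go away.
leaf_case = transitive_reg_use(direct_use, {"foo": {"baz"}})
print(removable_spills({"r17"}, "foo", leaf_case))

# If foo calls bar, which uses r17, the spill has to stay.
deep_case = transitive_reg_use(direct_use, {"foo": {"bar"}})
print(removable_spills({"r17"}, "foo", deep_case))
```

The second case is the point Kent Dickey raises: looking at foo alone is not enough, because a register foo never touches may still be clobbered further down the call tree.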

Re: Intel goes to 32 GPRs

<uc0lga$22c76$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33738&group=comp.arch#33738

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: sfuld@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Intel goes to 32 GPRs
Date: Mon, 21 Aug 2023 14:38:18 -0700
Organization: A noiseless patient Spider
Lines: 41
Message-ID: <uc0lga$22c76$1@dont-email.me>
References: <u9o14h$183or$1@newsreader4.netcologne.de>
<u9r0jd$1ftnb$1@dont-email.me> <ublni6$3s8o9$1@dont-email.me>
<07145f85-deff-44ca-bf24-0b754ba07594n@googlegroups.com>
<ubopm8$dn07$1@dont-email.me> <ubpr3d$2iugr$1@newsreader4.netcologne.de>
<2023Aug19.123116@mips.complang.tuwien.ac.at>
<AA6EM.518412$TCKc.373270@fx13.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 21 Aug 2023 21:38:18 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="243c41ef04085573e77bbd46c315a664";
logging-data="2175206"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18aPyHZKEqZcUgweYbnBNrJmEfafweZSuk="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:GMxU6jVbTs0yNzIvI7GOG/ON94M=
In-Reply-To: <AA6EM.518412$TCKc.373270@fx13.iad>
Content-Language: en-US

by: Stephen Fuld - Mon, 21 Aug 2023 21:38 UTC

On 8/19/2023 10:02 AM, EricP wrote:

snipped discussion on caller vs callee saved registers

> The compiler could export a summary file describing the registers
> used by each routine. This summary would be imported by the compiler
> of callers to those routines so it could allocate registers not used
> by the callee to the caller. Also the summary could be produced by
> scanning a .OBJ or .DLL.
>
> Such a mechanism does not require modifications to source and
> is also language independent.
>
> It does create a dependency between the caller and a particular
> (timestamped) instance of a callee in a .OBJ or .DLL file.

Not necessarily. If the compiler of the calling program also exported
the used register lists of the routines it called when compiling the
code, then the linker could check that the routines actually linked to
did not use more registers than the caller assumed. This would work even
if the called program was compiled at a different time than the one used
at the main routine's compile time. (I hope this is clear. The
language can get confusing.)

Of course, the question arises: what if a called routine does use more
registers than the calling program was told at its compile time? There
are several possibilities: abort the link, create trampoline routines to
save and restore the newly used registers, perhaps others. The choice
should certainly be made by the programmer.

In fact, if at link time the called program used fewer registers than
the number assumed at compile time, that should be brought to the
programmer's attention as information, so perhaps the calling program
could be recompiled and, by using more registers, be more efficient.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)
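
The link-time check described in this post could look roughly like the following. It is a sketch under stated assumptions: the Policy names, the tuple-shaped result, and the register names are invented for illustration, and no real linker exposes such an interface.

```python
from enum import Enum
from typing import Dict, List, Set, Tuple


class Policy(Enum):
    ABORT = "abort"            # refuse to link on a mismatch
    TRAMPOLINE = "trampoline"  # wrap the callee to save/restore the extra registers


def check_link(assumed: Dict[str, Set[str]],
               actual: Dict[str, Set[str]],
               policy: Policy) -> List[Tuple[str, str, List[str]]]:
    """Compare what the caller was told at compile time with what was actually linked.

    `assumed` maps callee name to the registers the caller believed it uses;
    `actual` maps callee name to the registers the linked routine really uses.
    """
    actions = []
    for callee, assumed_regs in assumed.items():
        extra = actual[callee] - assumed_regs    # callee grew: caller's values are at risk
        unused = assumed_regs - actual[callee]   # callee shrank: caller could use more
        if extra:
            if policy is Policy.ABORT:
                raise RuntimeError(f"{callee} uses unexpected registers {sorted(extra)}")
            actions.append(("trampoline", callee, sorted(extra)))
        elif unused:
            actions.append(("note", callee, sorted(unused)))
    return actions


assumed = {"sin": {"r1", "r2"}, "log": {"r1", "r2", "r3"}}
actual = {"sin": {"r1", "r2", "r4"}, "log": {"r1"}}
for action in check_link(assumed, actual, Policy.TRAMPOLINE):
    print(action)
```

The "note" entries correspond to the informational case in the last paragraph of the post: the callee uses fewer registers than assumed, so a recompile of the caller might pay off.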

browse Google groups without logging in - was Re: Callee-saved registers (was: Intel goes to 32 GPRs)

<0o68eihnfjspvi75ei42dijdhuif668gm4@4ax.com>

https://news.novabbs.org/devel/article-flat.php?id=33746&group=comp.arch#33746

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: gneuner2@comcast.net (George Neuner)
Newsgroups: comp.arch
Subject: browse Google groups without logging in - was Re: Callee-saved registers (was: Intel goes to 32 GPRs)
Date: Mon, 21 Aug 2023 22:37:16 -0400
Organization: A noiseless patient Spider
Lines: 18
Message-ID: <0o68eihnfjspvi75ei42dijdhuif668gm4@4ax.com>
References: <u9o14h$183or$1@newsreader4.netcologne.de> <2023Aug18.082303@mips.complang.tuwien.ac.at> <ubuegr$1nk70$1@dont-email.me> <c61a30d4-9e83-41ce-91a7-72c296f22b20n@googlegroups.com> <uc05oh$200om$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Injection-Info: dont-email.me; posting-host="b50f3628746486dd9352a647bf600695";
logging-data="2377016"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/bVW71hz8Ud/ScdWj6o7DInRN5kwZg0h4="
User-Agent: ForteAgent/8.00.32.1272
Cancel-Lock: sha1:rLA39ioLzZOL4LFIyfWM9XYZw3A=

by: George Neuner - Tue, 22 Aug 2023 02:37 UTC

On Mon, 21 Aug 2023 17:09:37 -0000 (UTC), kegs@provalid.com (Kent
Dickey) wrote:

>I don't know how to browse google groups without logging in ...

Using a search engine (even Google itself), search for the group's
name. That will get you to the first page.

From there you can search "Conversations" within the group if you know
the names of threads, keywords, authors, etc.

Note that you CAN'T do a global search [of "all groups and messages"]
or try to go to the Groups home page as either of these will require
you to login. If you want to switch to another group, you have to go
back through the search engine.

Hope this helps.

Re: Callee-saved registers (was: Intel goes to 32 GPRs)

<uc1hsv$2ntj0$1@newsreader4.netcologne.de>

https://news.novabbs.org/devel/article-flat.php?id=33747&group=comp.arch#33747

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd6-1c08-0-7eab-33e6-ac79-76a6.ipv6dyn.netcologne.de!not-for-mail
From: tkoenig@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Callee-saved registers (was: Intel goes to 32 GPRs)
Date: Tue, 22 Aug 2023 05:42:55 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <uc1hsv$2ntj0$1@newsreader4.netcologne.de>
References: <u9o14h$183or$1@newsreader4.netcologne.de>
<2023Aug18.082303@mips.complang.tuwien.ac.at>
<ubuegr$1nk70$1@dont-email.me>
<c61a30d4-9e83-41ce-91a7-72c296f22b20n@googlegroups.com>
<uc05oh$200om$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 22 Aug 2023 05:42:55 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd6-1c08-0-7eab-33e6-ac79-76a6.ipv6dyn.netcologne.de:2001:4dd6:1c08:0:7eab:33e6:ac79:76a6";
logging-data="2881120"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)

by: Thomas Koenig - Tue, 22 Aug 2023 05:42 UTC

Kent Dickey <kegs@provalid.com> wrote:
> In article <c61a30d4-9e83-41ce-91a7-72c296f22b20n@googlegroups.com>,
> Michael S <already5chosen@yahoo.com> wrote:
>>On Monday, August 21, 2023 at 4:26:55 AM UTC+3, Kent Dickey wrote:
>>> What I am VERY interested in is for someone to make the case for so many
>>> scratch registers. Show me some code which needs 19 scratch registers,
>>> and would run noticeably slower with 14 scratch registers.
>>>
>>
>>There is another side to the same argument: can you show me (or, more
>>relevantly, the designers of the aarch64 ABI) some code which needs 16
>> callee-saved registers, and would run noticeably slower with 11
>> callee-saved registers?
>>
>>> I think this is just an architecture issue no one gives much thought to.
>>
>>I tend to think that they tried to give it deep thought, but in the process
>>quickly came to the conclusion that it matters very little.
>
> I've already provided an example in September, 2020 in the post:
><TOGdnf-OgvSdue_CnZ2dnUU7-R3NnZ2d@giganews.com> (you can read it at
> http://al.howardknight.net/?L=EN>). I don't know how to browse google groups
> without logging in, if someone posts a link, that would be helpful.
>
> Here is the code again. I'm trying to keep it as short as possible to show
> the problem, so I don't plan to respond to nitpicks about the particulars
> of this code.

[...]

The issue of the frame pointer register not being used shows up in the
code on godbolt, even with recent gcc trunk.

I have submitted https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111096 for
this.

Re: Intel goes to 32 GPRs

<Cf3FM.499505$qnnb.208430@fx11.iad>

https://news.novabbs.org/devel/article-flat.php?id=33748&group=comp.arch#33748

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx11.iad.POSTED!not-for-mail
From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Intel goes to 32 GPRs
References: <u9o14h$183or$1@newsreader4.netcologne.de> <u9r0jd$1ftnb$1@dont-email.me> <ublni6$3s8o9$1@dont-email.me> <07145f85-deff-44ca-bf24-0b754ba07594n@googlegroups.com> <ubopm8$dn07$1@dont-email.me> <ubpr3d$2iugr$1@newsreader4.netcologne.de> <2023Aug19.123116@mips.complang.tuwien.ac.at> <AA6EM.518412$TCKc.373270@fx13.iad> <uc0lga$22c76$1@dont-email.me>
In-Reply-To: <uc0lga$22c76$1@dont-email.me>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 60
Message-ID: <Cf3FM.499505$qnnb.208430@fx11.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Tue, 22 Aug 2023 14:04:18 UTC
Date: Tue, 22 Aug 2023 10:03:26 -0400
X-Received-Bytes: 3873

by: EricP - Tue, 22 Aug 2023 14:03 UTC

Stephen Fuld wrote:
> On 8/19/2023 10:02 AM, EricP wrote:
>
> snipped discussion on caller vs callee saved registers
>
>> The compiler could export a summary file describing the registers
>> used by each routine. This summary would be imported by the compiler
>> of callers to those routines so it could allocate registers not used
>> by the callee to the caller. Also the summary could be produced by
>> scanning a .OBJ or .DLL.
>>
>> Such a mechanism does not require modifications to source and
>> is also language independent.
>>
>> It does create a dependency between the caller and a particular
>> (timestamped) instance of a callee in a .OBJ or .DLL file.
>
> Not necessarily. If the compiler of the calling program also exported
> the used register lists of the routines it called when compiling the
> code, then the linker could check that the routines actually linked to
> did not use more registers than the caller assumed. This would work even
> if the called program was compiled at a different time than the one used
> at the main routine's compile time. (I hope this is clear. The
> language can get confusing.)

Yes, it's clear.
I'm not so concerned about coordinating between separately compiled .OBJ
as they would just need to be compiled and summarized in the right order.

But a lot of leaf routines like math or string are in DLL shared libraries.
It would be a bit fragile and kinda defeats the whole purpose of shared
libraries if you have to recompile all your application code when a new
version of a DLL shows up.

So I thought it was worth noting, particularly for math library users
as those applications are the ones most likely to benefit from more
preserved registers across calls to sine, log, pow, etc.

> Of course, the question arises, what if a called routine does use more
> registers than the calling program was told at its compile time. There
> are several possibilities, abort the link, create trampoline routines to
> save and restore the newly used registers, perhaps others. The choice
> should certainly be made by the programmer.
>
> In fact, if at link time, the called program used fewer registers than
> the one assumed at compile time, that should be brought to the
> programmer's attention as information, so perhaps the calling program
> could be recompiled and, by using more registers, be more efficient.

For modules that are part of a project and separately compiled then one
just recompiles them. The problem is for DLL's: detecting that a change
has occurred and then what to do about it (customers won't have source
code to recompile).

As a practical matter, apps wanting to do this optimization should
only use link (.OBJ) libraries for math, strings, etc and not DLL's.
(And let's just forget I mentioned DLL's at all.)
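
The summary-file idea, including the dependency on one particular instance of the callee, can be sketched like this. The JSON layout, the use of a SHA-256 fingerprint in place of a timestamp, and all function names are assumptions made for the example, not an existing compiler feature.

```python
import hashlib
import json
from typing import Dict, List, Set


def export_summary(obj_bytes: bytes, used: Dict[str, Set[str]]) -> str:
    """Describe the registers each routine uses, tied to one exact object file."""
    return json.dumps({
        "fingerprint": hashlib.sha256(obj_bytes).hexdigest(),
        "routines": {name: sorted(regs) for name, regs in used.items()},
    }, sort_keys=True)


def regs_free_across_call(summary_text: str, callee: str,
                          candidates: List[str]) -> List[str]:
    """Registers the caller may keep live across a call to `callee`."""
    used = set(json.loads(summary_text)["routines"][callee])
    return [r for r in candidates if r not in used]


def summary_is_current(summary_text: str, obj_bytes: bytes) -> bool:
    """True if the summary was produced from exactly this object file."""
    recorded = json.loads(summary_text)["fingerprint"]
    return recorded == hashlib.sha256(obj_bytes).hexdigest()


old_library = b"math library, version 1"  # stand-in for the bytes of a .OBJ or .DLL
summary = export_summary(old_library, {"sin": {"r8", "r9"}})

print(regs_free_across_call(summary, "sin", ["r8", "r9", "r10", "r11"]))
print(summary_is_current(summary, old_library))
print(summary_is_current(summary, b"math library, version 2"))
```

The last check is the fragile part for shared libraries: once a different build of the library is in place, the summary the caller was compiled against no longer describes what will actually run.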

Re: Intel goes to 32 GPRs

<b92f55d2-aceb-495c-a240-faf5f578b5f0n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33751&group=comp.arch#33751

X-Received: by 2002:ad4:5504:0:b0:64a:b579:bd1a with SMTP id pz4-20020ad45504000000b0064ab579bd1amr80710qvb.7.1692718618655;
Tue, 22 Aug 2023 08:36:58 -0700 (PDT)
X-Received: by 2002:a05:6a00:1890:b0:68a:6082:2c54 with SMTP id
x16-20020a056a00189000b0068a60822c54mr2100824pfh.6.1692718618443; Tue, 22 Aug
2023 08:36:58 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 22 Aug 2023 08:36:57 -0700 (PDT)
In-Reply-To: <Cf3FM.499505$qnnb.208430@fx11.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:b1d0:3ff5:2adf:5c0c;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:b1d0:3ff5:2adf:5c0c
References: <u9o14h$183or$1@newsreader4.netcologne.de> <u9r0jd$1ftnb$1@dont-email.me>
<ublni6$3s8o9$1@dont-email.me> <07145f85-deff-44ca-bf24-0b754ba07594n@googlegroups.com>
<ubopm8$dn07$1@dont-email.me> <ubpr3d$2iugr$1@newsreader4.netcologne.de>
<2023Aug19.123116@mips.complang.tuwien.ac.at> <AA6EM.518412$TCKc.373270@fx13.iad>
<uc0lga$22c76$1@dont-email.me> <Cf3FM.499505$qnnb.208430@fx11.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <b92f55d2-aceb-495c-a240-faf5f578b5f0n@googlegroups.com>
Subject: Re: Intel goes to 32 GPRs
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Tue, 22 Aug 2023 15:36:58 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 6251

by: MitchAlsup - Tue, 22 Aug 2023 15:36 UTC

On Tuesday, August 22, 2023 at 9:04:23 AM UTC-5, EricP wrote:
> Stephen Fuld wrote:
> > On 8/19/2023 10:02 AM, EricP wrote:
> >
> > snipped discussion on caller vs callee saved registers
> >
> >> The compiler could export a summary file describing the registers
> >> used by each routine. This summary would be imported by the compiler
> >> of callers to those routines so it could allocate registers not used
> >> by the callee to the caller. Also the summary could be produced by
> >> scanning a .OBJ or .DLL.
> >>
> >> Such a mechanism does not require modifications to source and
> >> is also language independent.
> >>
> >> It does create a dependency between the caller and a particular
> >> (timestamped) instance of a callee in a .OBJ or .DLL file.
> >
> > Not necessarily. If the compiler of the calling program also exported
> > the used register lists of the routines it called when compiling the
> > code, then the linker could check that the routines actually linked to
> > did not use more registers than the caller assumed. This would work even
> > if the called program was compiled at a different time than the one used
> > at the main routine's compile time. (I hope this is clear. The
> > language can get confusing.)
> Yes, it's clear.
<
> I'm not so concerned about coordinating between separately compiled .OBJ
> as they would just need to be compiled and summarized in the right order.
>
> But a lot of leaf routines like math or string are in DLL shared libraries.
> It would be a bit fragile and kinda defeats the whole purpose of shared
> libraries if you have to recompile all your application code when a new
> version of a DLL shows up.
<
Don't allow the new version of the DLL to use more registers than the
previous in any groupings.
>
> So I thought it was worth noting, particularly for math library users
> as those applications are the ones most likely to benefit from more
> preserved registers across calls to sine, log, pow, etc.
<
All of these "function calls" end up being instructions in My 66000 ISA.
But, even those that are not functions take "way less code" than similar
RISC-V compilations of those codes--mainly because these codes are
polynomials with constant coefficients. For example, r8_erf() from
polpack needs only 43% as many instructions as it needs in RISC-V,
and when the compiler's limit on constants in registers is
removed, better still. {As a side note: this is a function where
RISC-V, with separate GPR and FPR register files, needs spill/fill
code, whereas My 66000, with an integrated GPR file, does not--entirely due
to universal constants.}
<
Straightforward polynomial evaluation (Horner, Estrin) only needs 6-ish
registers (index, pointer, coefficient, power, product and summation)::
SIN(), COS(), Ln() family, exp() family, ATAN(), and POW() all fall into
this general category. I suspect many more straightforward polynomial
evaluations do too.
<
> > Of course, the question arises, what if a called routine does use more
> > registers than the calling program was told at its compile time. There
> > are several possibilities, abort the link, create trampoline routines to
> > save and restore the newly used registers, perhaps others. The choice
> > should certainly be made by the programmer.
> >
> > In fact, if at link time, the called program used fewer registers than
> > the one assumed at compile time, that should be brought to the
> > programmer's attention as information, so perhaps the calling program
> > could be recompiled and, by using more registers, be more efficient.
<
> For modules that are part of a project and separately compiled then one
> just recompiles them. The problem is for DLL's: detecting that a change
> has occurred and then what to do about it (customers won't have source
> code to recompile).
<
DLLs have a defined interface--what you need is for that definition to
include the number of registers in each group and never change that
definition. A side effect is that the compiler will need flag-inputs to
specify those register groups.
>
> As a practical matter, apps wanting to do this optimization should
> only use link (.OBJ) libraries for math, strings, etc and not DLL's.
> (And let's just forget I mentioned DLL's at all.)
<
Do you really want 462 copies of SIN() and COS() occupying memory ??
{{Also note: My 66000 ISA can call DLLs using no more instructions
than local functions. No trampolines required.}}
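
For readers unfamiliar with the two evaluation schemes named above (Horner, Estrin), here is a minimal Python sketch. It only shows the evaluation order; a real math library would typically add range reduction and carefully chosen coefficients, none of which is modelled here, and the function names are made up.

```python
from typing import Sequence


def horner(coeffs: Sequence[float], x: float) -> float:
    """Evaluate sum(coeffs[i] * x**i) with one multiply-add per coefficient.

    The loop keeps only the accumulator, x and the current coefficient live,
    which is why the register demand stays small.
    """
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc


def estrin(coeffs: Sequence[float], x: float) -> float:
    """Evaluate the same polynomial by pairing terms, squaring x at each level.

    Pairs within one level are independent, so hardware can overlap them,
    at the cost of a few more live values than Horner.
    """
    terms = list(coeffs)
    power = x
    while len(terms) > 1:
        if len(terms) % 2:
            terms.append(0.0)  # pad so every term has a partner
        terms = [terms[i] + terms[i + 1] * power
                 for i in range(0, len(terms), 2)]
        power *= power
    return terms[0] if terms else 0.0


# 1 + 2x + 3x^2 + 4x^3 at x = 2
print(horner([1.0, 2.0, 3.0, 4.0], 2.0))
print(estrin([1.0, 2.0, 3.0, 4.0], 2.0))
```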

In article <ublni6$3s8o9$1@dont-email.me>,
Kent Dickey <kegs@provalid.com> wrote:
>It is a mistake to make these new registers caller save. There are ways
>to make them callee save.

For unlimited performance, a port can statically mark call-used/call-saved
registers in the type system; code-gen is then perfect, at
the expense of requiring source code annotations. We did this on our
last port of llvm. It was not that hard to do, and it helps when there is
uncertainty around exactly the right trade-off between call-used and call-saved.

This is mainly useful in code where performance is important and the
user is doing all the kernel work and not using pre-built libraries.

The benefits: no LTO cost, easy to see and audit in the source code,
direct control over exactly which registers are saved/used, if needed.
Downside: people using the system have to have stronger asm skills and
know what registers are. :-)

Re: Intel goes to 32 GPRs

<uc2nlt$2f7pc$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33753&group=comp.arch#33753

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: sfuld@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Intel goes to 32 GPRs
Date: Tue, 22 Aug 2023 09:27:41 -0700
Organization: A noiseless patient Spider
Lines: 85
Message-ID: <uc2nlt$2f7pc$1@dont-email.me>
References: <u9o14h$183or$1@newsreader4.netcologne.de>
<u9r0jd$1ftnb$1@dont-email.me> <ublni6$3s8o9$1@dont-email.me>
<07145f85-deff-44ca-bf24-0b754ba07594n@googlegroups.com>
<ubopm8$dn07$1@dont-email.me> <ubpr3d$2iugr$1@newsreader4.netcologne.de>
<2023Aug19.123116@mips.complang.tuwien.ac.at>
<AA6EM.518412$TCKc.373270@fx13.iad> <uc0lga$22c76$1@dont-email.me>
<Cf3FM.499505$qnnb.208430@fx11.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 22 Aug 2023 16:27:41 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="c791ee6ca3a519710bc74235e9c9d7c7";
logging-data="2596652"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/ENzndW1JSgZcbMRVmxpb3QYh0J8hNgj0="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:uEdMoFsN8/KLNNklKE3T9vCa7yA=
In-Reply-To: <Cf3FM.499505$qnnb.208430@fx11.iad>
Content-Language: en-US

by: Stephen Fuld - Tue, 22 Aug 2023 16:27 UTC

On 8/22/2023 7:03 AM, EricP wrote:
> Stephen Fuld wrote:
>> On 8/19/2023 10:02 AM, EricP wrote:
>>
>> snipped discussion on caller vs callee saved registers
>>
>>> The compiler could export a summary file describing the registers
>>> used by each routine. This summary would be imported by the compiler
>>> of callers to those routines so it could allocate registers not used
>>> by the callee to the caller. Also the summary could be produced by
>>> scanning a .OBJ or .DLL.
>>>
>>> Such a mechanism does not require modifications to source and
>>> is also language independent.
>>>
>>> It does create a dependency between the caller and a particular
>>> (timestamped) instance of a callee in a .OBJ or .DLL file.
>>
>> Not necessarily. If the compiler of the calling program also exported
>> the used register lists of the routines it called when compiling the
>> code, then the linker could check that the routines actually linked to
>> did not use more registers than the caller assumed. This would work
>> even if the called program was compiled at a different time than the
>> one used at the main routine's compile time. (I hope this is clear.
>> The language can get confusing.)
>
> Yes, it's clear.
> I'm not so concerned about coordinating between separately compiled .OBJ
> as they would just need to be compiled and summarized in the right order.
>
> But a lot of leaf routines like math or string are in DLL shared libraries.
> It would be a bit fragile and kinda defeats the whole purpose of shared
> libraries if you have to recompile all your application code when a new
> version of a DLL shows up.

Again, not necessarily. As Mitch pointed out, you just need to ensure
that the new version of the DLL uses no more registers than the older
one. So you need a compiler option to retrieve the relevant information
that was exported from the previous version of the file being compiled,
and stick to the same registers. Perhaps a little ugly for the compiler
internally, but I suspect that many/most changes to a DLL are modest
fixes that wouldn't require more registers, thus needed recompiles would
be rare.

Of course, you do need to have a check at initial DLL load time, similar
to the check made at link time for statically linked libraries, to
verify compatibility of the callee with the caller.

snip

> For modules that are part of a project and separately compiled then one
> just recompiles them. The problem is for DLL's: detecting that a change
> has occurred and then what to do about it (customers won't have source
> code to recompile).

But they have the source code for their own programs. So if the
compiler retrieved the "registers used" information for any DLLs called,
similar to what it does for non-dynamically linked libraries,
recompiling the user program with the new DLL version available will
"automatically" fix the problem.

I realize this means that compiling a program on a system whose version
of a DLL uses registers differently from the version the program
ultimately executes with could cause a problem. For this reason, the
compiler should have an option to restrict its register usage to the
ones specified in the registers used information for the previous
version of the program. This could at least minimize, and perhaps
eliminate, the problem.

> As a practical matter, apps wanting to do this optimization should
> only use link (.OBJ) libraries for math, strings, etc and not DLL's.
> (And let's just forget I mentioned DLL's at all.)

While DLLs do add some complexity, I think it can be handled without too
much additional work.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)
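
The compatibility check at DLL load time, combined with Mitch's suggestion of a fixed number of registers per group in the interface definition, might be sketched as follows. The group names and the idea of storing plain counts are assumptions for illustration only.

```python
from typing import Dict, List


def budget_violations(interface: Dict[str, int],
                      library: Dict[str, int]) -> List[str]:
    """List every register group where a library build exceeds the published budget.

    `interface` is the per-group register count fixed in the DLL's interface
    definition; `library` is what a particular build of the DLL actually uses.
    A group missing from the interface is treated as a budget of zero.
    """
    problems = []
    for group, used in sorted(library.items()):
        allowed = interface.get(group, 0)
        if used > allowed:
            problems.append(f"{group}: uses {used}, interface allows {allowed}")
    return problems


interface = {"scratch": 8, "preserved": 4}  # hypothetical published budget

# A bug-fix release that stays within the budget loads fine.
print(budget_violations(interface, {"scratch": 6, "preserved": 4}))

# A release that needs one more scratch register would be rejected at load time.
print(budget_violations(interface, {"scratch": 9, "preserved": 4}))
```

An empty list means existing callers stay valid without a recompile, which is the common case Stephen expects for modest fixes to a DLL.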

Re: Intel goes to 32 GPRs

<Td6FM.501194$qnnb.321179@fx11.iad>

https://news.novabbs.org/devel/article-flat.php?id=33754&group=comp.arch#33754

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!nntp.club.cc.cmu.edu!45.76.7.193.MISMATCH!3.us.feeder.erje.net!feeder.erje.net!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx11.iad.POSTED!not-for-mail
From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Intel goes to 32 GPRs
References: <u9o14h$183or$1@newsreader4.netcologne.de> <u9r0jd$1ftnb$1@dont-email.me> <ublni6$3s8o9$1@dont-email.me> <07145f85-deff-44ca-bf24-0b754ba07594n@googlegroups.com> <ubopm8$dn07$1@dont-email.me> <ubpr3d$2iugr$1@newsreader4.netcologne.de> <2023Aug19.123116@mips.complang.tuwien.ac.at> <AA6EM.518412$TCKc.373270@fx13.iad> <uc0lga$22c76$1@dont-email.me> <Cf3FM.499505$qnnb.208430@fx11.iad> <b92f55d2-aceb-495c-a240-faf5f578b5f0n@googlegroups.com>
In-Reply-To: <b92f55d2-aceb-495c-a240-faf5f578b5f0n@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 112
Message-ID: <Td6FM.501194$qnnb.321179@fx11.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Tue, 22 Aug 2023 17:27:15 UTC
Date: Tue, 22 Aug 2023 13:26:27 -0400
X-Received-Bytes: 6955

by: EricP - Tue, 22 Aug 2023 17:26 UTC

MitchAlsup wrote:
> On Tuesday, August 22, 2023 at 9:04:23 AM UTC-5, EricP wrote:
>> Stephen Fuld wrote:
>>> On 8/19/2023 10:02 AM, EricP wrote:
>>>
>>> snipped discussion on caller vs callee saved registers
>>>
>>>> The compiler could export a summary file describing the registers
>>>> used by each routine. This summary would be imported by the compiler
>>>> of callers to those routines so it could allocate registers not used
>>>> by the callee to the caller. Also the summary could be produced by
>>>> scanning a .OBJ or .DLL.
>>>>
>>>> Such a mechanism does not require modifications to source and
>>>> is also language independent.
>>>>
>>>> It does create a dependency between the caller and a particular
>>>> (timestamped) instance of a callee in a .OBJ or .DLL file.
>>> Not necessarily. If the compiler of the calling program also exported
>>> the used register lists of the routines it called when compiling the
>>> code, then the linker could check that the routines actually linked to
>>> did not use more registers than the caller assumed. This would work even
>>> if the called program was compiled at a different time than the one used
>>> at the main routine's compile time. (I hope this is clear. The
>>> language can get confusing.)
>> Yes, it's clear.
> <
>> I'm not so concerned about coordinating between separately compiled .OBJ
>> as they would just need to be compiled and summarized in the right order.
>>
>> But a lot of leaf routines like math or string are in DLL shared libraries.
>> It would be a bit fragile and kinda defeats the whole purpose of shared
>> libraries if you have to recompile all your application code when a new
>> version of a DLL shows up.
> <
> Don't allow the new version of the DLL to use more registers than the
> previous in any groupings.
>> So I thought it was worth noting, particularly for math library users
>> as those applications are the ones most likely to benefit from more
>> preserved registers across calls to sine, log, pow, etc.
> <
> All of these "function calls" end up being instructions in My 66000 ISA.
> But, even those that are not functions take "way less code" than similar
> RISC-V compilations of those codes--mainly because these codes are
> polynomials with constant coefficients. For example, r8_erf() from
> polpack needs only 43% as many instructions as it needs in RISC-V,
> and when the compiler's limit on constants in registers is
> removed, better still. {As a side note: this is a function where
> RISC-V, with separate GPR and FPR register files, needs spill/fill
> code, whereas My 66000, with an integrated GPR file, does not--entirely due
> to universal constants.}
> <
> Straightforward polynomial evaluation (Horner, Estrin) only needs 6-ish
> registers (index, pointer, coefficient, power, product and summation)::
> SIN(), COS(), Ln() family, exp() family, ATAN(), and POW() all fall into
> this general category. I suspect many more straightforward polynomial
> evaluations do too.

Even with your transcendental instructions, would not these still be
implemented as non-inlined subroutines? In different languages
(Fortran, C), the math functions in particular have different ways of
interacting with the run-time environment to report errors.

E.g., C/C++ math functions can set errno, which is a TLS variable.
I don't know what GCC Fortran does for its error status reporting.
VAX Fortran language math functions threw an exception for error conditions,
however VAX library math functions returned a function status code.

> <
>>> Of course, the question arises, what if a called routine does use more
>>> registers than the calling program was told at its compile time. There
>>> are several possibilities, abort the link, create trampoline routines to
>>> save and restore the newly used registers, perhaps others. The choice
>>> should certainly be made by the programmer.
>>>
>>> In fact, if at link time, the called program used fewer registers than
>>> the one assumed at compile time, that should be brought to the
>>> programmer's attention as information, so perhaps the calling program
>>> could be recompiled and, by using more registers, be more efficient.
> <
>> For modules that are part of a project and separately compiled then one
>> just recompiles them. The problem is for DLL's: detecting that a change
>> has occurred and then what to do about it (customers won't have source
>> code to recompile).
> <
> DLLs have a defined interface--what you need is for that definition to
> include the number of registers in each group and never change that
> definition. A side effect is that the compiler will need flag-inputs to
> specify those register groups.
>> As a practical matter, apps wanting to do this optimization should
>> only use link (.OBJ) libraries for math, strings, etc and not DLL's.
>> (And let's just forget I mentioned DLL's at all.)
> <
> Do you really want 462 copies of SIN() and COS() occupying memory ??
> {{Also note: My 66000 ISA can call DLLs using no more instructions
> than local functions. No trampolines required.}}

There are not multiple copies, unless one inlines routines
and we are not talking about inlined code.

If I link with an object OBJ library I get *one* copy of a routine
included in my EXE. However that code is not shared with other apps that
also link with that same routine (eg each app has its own sin routine).

If I link with a shared DLL library then I do not get a copy of that
routine in my EXE, I get a call to a routine in the DLL, of which there
is a single instance for all applications (eg one sin routine shared
between all applications).

But as I said, even if technically possible it's probably better to
take DLL's out of consideration and just use link libraries.

Re: Intel goes to 32 GPRs

<650494f0-5969-4278-b254-3505c77d1a69n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33755&group=comp.arch#33755

X-Received: by 2002:ad4:4f84:0:b0:63d:557f:b4c9 with SMTP id em4-20020ad44f84000000b0063d557fb4c9mr127917qvb.3.1692727440640;
Tue, 22 Aug 2023 11:04:00 -0700 (PDT)
X-Received: by 2002:a63:3606:0:b0:564:17ba:47cd with SMTP id
d6-20020a633606000000b0056417ba47cdmr1843393pga.10.1692727440205; Tue, 22 Aug
2023 11:04:00 -0700 (PDT)
Path: i2pn2.org!i2pn.org!news.swapon.de!2.eu.feeder.erje.net!feeder.erje.net!feeder1.feed.usenet.farm!feed.usenet.farm!peer02.ams4!peer.am4.highwinds-media.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 22 Aug 2023 11:03:59 -0700 (PDT)
In-Reply-To: <Td6FM.501194$qnnb.321179@fx11.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:b1d0:3ff5:2adf:5c0c;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:b1d0:3ff5:2adf:5c0c
References: <u9o14h$183or$1@newsreader4.netcologne.de> <u9r0jd$1ftnb$1@dont-email.me>
<ublni6$3s8o9$1@dont-email.me> <07145f85-deff-44ca-bf24-0b754ba07594n@googlegroups.com>
<ubopm8$dn07$1@dont-email.me> <ubpr3d$2iugr$1@newsreader4.netcologne.de>
<2023Aug19.123116@mips.complang.tuwien.ac.at> <AA6EM.518412$TCKc.373270@fx13.iad>
<uc0lga$22c76$1@dont-email.me> <Cf3FM.499505$qnnb.208430@fx11.iad>
<b92f55d2-aceb-495c-a240-faf5f578b5f0n@googlegroups.com> <Td6FM.501194$qnnb.321179@fx11.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <650494f0-5969-4278-b254-3505c77d1a69n@googlegroups.com>
Subject: Re: Intel goes to 32 GPRs
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Tue, 22 Aug 2023 18:04:00 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 8871

by: MitchAlsup - Tue, 22 Aug 2023 18:03 UTC

On Tuesday, August 22, 2023 at 12:27:19 PM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Tuesday, August 22, 2023 at 9:04:23 AM UTC-5, EricP wrote:
> >> Stephen Fuld wrote:
> >>> On 8/19/2023 10:02 AM, EricP wrote:
> >>>
> >>> snipped discussion on caller vs callee saved registers
> >>>
> >>>> The compiler could export a summary file describing the registers
> >>>> used by each routine. This summary would be imported by the compiler
> >>>> of callers to those routines so it could allocate registers not used
> >>>> by the callee to the caller. Also the summary could be produced by
> >>>> scanning a .OBJ or .DLL.
> >>>>
> >>>> Such a mechanism does not require modifications to source and
> >>>> is also language independent.
> >>>>
> >>>> It does create a dependency between the caller and a particular
> >>>> (timestamped) instance of a callee in a .OBJ or .DLL file.
> >>> Not necessarily. If the compiler of the calling program also exported
> >>> the used register lists of the routines it called when compiling the
> >>> code, then the linker could check that the routines actually linked to
> >>> did not use more registers than the caller assumed. This would work even
> >>> if the called program was compiled at a different time than the one used
> >>> at the main routine's compile time. (I hope this is clear. The
> >>> language can get confusing.)
> >> Yes, it's clear.
> > <
> >> I'm not so concerned about coordinating between separately compiled .OBJ
> >> as they would just need to be compiled and summarized in the right order.
> >>
> >> But a lot of leaf routines like math or string are in DLL shared libraries.
> >> It would be a bit fragile and kinda defeats the whole purpose of shared
> >> libraries if you have to recompile all your application code when a new
> >> version of a DLL shows up.
> > <
> > Don't allow the new version of the DLL to use more registers than the
> > previous in any groupings.
> >> So I thought it was worth noting, particularly for math library users
> >> as those applications are the ones most likely to benefit from more
> >> preserved registers across calls to sine, log, pow, etc.
> > <
> > All of these "function calls" end up being instructions in My 66000 ISA.
> > But, even those that are not functions take "way less code" than similar
> > RISC-V compilations of those codes--mainly because these codes are
> > polynomials with constant coefficients. For example r8_erf() from
> > polpack only needs 43% as many instructions as it needs in RISC-V
> > and when the compiler's limit on constants in registers is
> > removed, better still. {and as a side note:: this is a function where
> > RISC-V with separate GPR and FPR register files needs spill/fill
> > code, whereas My 66000 with integrated GPR does not--entirely due
> > to universal constants}.
> > <
> > Straightforward polynomial evaluation (Horner, Estrin) only needs 6-ish
> > registers (index, pointer, coefficient, power, product and summation)::
> > SIN(), COS(), Ln() family, exp() family, ATAN(), and POW() all fall into
> > this general category. I suspect many more straightforward polynomial
> > evaluations do too.
> Even with your transcendental instructions, would not these still be
> implemented as non-inlined subroutines? Because for different languages,
> Fortran, C, the math functions in particular have different ways of
> interacting with the run-time environment to report errors.
>
> Eg C/C++ math functions can set errno which is a TLS variable.
> I don't know what GCC Fortran does for its error status reporting.
> VAX Fortran language math functions threw an exception for error conditions,
> however VAX library math functions returned a function status code.
> > <
> >>> Of course, the question arises, what if a called routine does use more
> >>> registers than the calling program was told at its compile time. There
> >>> are several possibilities, abort the link, create trampoline routines to
> >>> save and restore the newly used registers, perhaps others. The choice
> >>> should certainly be made by the programmer.
> >>>
> >>> In fact, if at link time, the called program used fewer registers than
> >>> the one assumed at compile time, that should be brought to the
> >>> programmer's attention as information, so perhaps the calling program
> >>> could be recompiled and, by using more registers, be more efficient.
> > <
> >> For modules that are part of a project and separately compiled then one
> >> just recompiles them. The problem is for DLL's: detecting that a change
> >> has occurred and then what to do about it (customers won't have source
> >> code to recompile).
> > <
> > DLLs have a defined interface--what you need is for that definition to
> > include the number of registers in each group and never change that
> > definition. A side effect is that the compiler will need flag-inputs to
> > specify those register groups.
> >> As a practical matter, apps wanting to do this optimization should
> >> only use link (.OBJ) libraries for math, strings, etc and not DLL's.
> >> (And let's just forget I mentioned DLL's at all.)
> > <
> > Do you really want 462 copies of SIN() and COS() occupying memory ??
> > {{Also note: My 66000 ISA can call DLLs using no more instructions
> > than local functions. No trampolines required.}}
<
> There are not multiple copies, unless one inlines routines
> and we are not talking about inlined code.
<
Nor was I:: The system is running 462 applications simultaneously that
call various transcendental functions, and each application has its own
static copy of the transcendentals it uses. It is still 462 copies, even
when not inlined; being included in the object module is enough
to be a unique instantiation of the <shareable> function.
>
> If I link with an object OBJ library I get *one* copy of a routine
<
per application.
<
> included in my EXE. However that code is not shared with other apps that
> also link with that same routine (eg each app has its own sin routine).
>
my point exactly.
>
> If I link with a shared DLL library then I do not get a copy of that
> routine in my EXE, I get a call to a routine in the DLL, of which there
> is a single instance for all applications (eg one sin routine shared
> between all applications).
<
Thus saving memory, and in my system not taking any more instructions
to perform the call and return.
>
> But as I said, even if technically possible it's probably better to
> take DLL's out of consideration and just use link libraries.
<
Each instance suffering its own ICache misses because the code is
not shared.

Re: Intel goes to 32 GPRs

<uc310n$umu$1@gal.iecc.com>

https://news.novabbs.org/devel/article-flat.php?id=33757&group=comp.arch#33757

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.cmpublishers.com!adore2!news.iecc.com!.POSTED.news.iecc.com!not-for-mail
From: johnl@taugh.com (John Levine)
Newsgroups: comp.arch
Subject: Re: Intel goes to 32 GPRs
Date: Tue, 22 Aug 2023 19:07:03 -0000 (UTC)
Organization: Taughannock Networks
Message-ID: <uc310n$umu$1@gal.iecc.com>
References: <u9o14h$183or$1@newsreader4.netcologne.de> <2023Aug19.123116@mips.complang.tuwien.ac.at> <AA6EM.518412$TCKc.373270@fx13.iad> <uc0lga$22c76$1@dont-email.me>
Injection-Date: Tue, 22 Aug 2023 19:07:03 -0000 (UTC)
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="31454"; mail-complaints-to="abuse@iecc.com"
In-Reply-To: <u9o14h$183or$1@newsreader4.netcologne.de> <2023Aug19.123116@mips.complang.tuwien.ac.at> <AA6EM.518412$TCKc.373270@fx13.iad> <uc0lga$22c76$1@dont-email.me>
Cleverness: some
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Originator: johnl@iecc.com (John Levine)

by: John Levine - Tue, 22 Aug 2023 19:07 UTC

According to Stephen Fuld <sfuld@alumni.cmu.edu.invalid>:
>Not necessarily. If the compiler of the calling program also exported
>the used register lists of the routines it called when compiling the
>code, then the linker could check that the routines actually linked to
>did not use more registers than the caller assumed. This would work even
>if the called program was compiled at a different time than the one used
>at the main routine's compile time. (I hope this is clear. The
>language can get confusing.)

Ah, you mean signatures for calling points. It's an old trick dating
back to the 1970s used to check that the arguments passed from the
caller match the ones the callee expects. No reason you couldn't do
the same thing to check for compatible register sets.

This is why sensible libraries use version numbers with major numbers
meaning an interface change and minor numbers just a bug fix. You
relink your code to use a new major version of the library. Every Unix
and Linux system does this.
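
A register-set signature check of this kind could look roughly like the sketch below. Everything in it is hypothetical: the one-bit-per-register mask, the type and function names, and the three outcomes, which follow the cases Stephen Fuld listed (match, callee uses fewer registers than assumed, callee uses more). It is not taken from any existing linker.

```c
#include <stdint.h>

/* One bit per register: bit n set means register n may be clobbered. */
typedef uint32_t regmask_t;

typedef enum {
    REGS_MATCH,          /* callee clobbers exactly what the caller assumed */
    REGS_CALLEE_LIGHTER, /* callee clobbers fewer: safe, worth a note */
    REGS_MISMATCH        /* callee clobbers a register the caller keeps live */
} reg_check_t;

/* Hypothetical link-time check: compare the clobber set the caller was
 * compiled against with the set the linked callee actually uses. */
reg_check_t check_call_site(regmask_t assumed, regmask_t actual)
{
    if (actual & ~assumed)
        return REGS_MISMATCH;
    if (actual != assumed)
        return REGS_CALLEE_LIGHTER;
    return REGS_MATCH;
}
```

What the linker does on REGS_MISMATCH (abort, or emit a save/restore trampoline) is a policy decision left out of the sketch.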

--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

Re: Callee-saved registers

<uc32c1$2ho1k$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33758&group=comp.arch#33758

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: terje.mathisen@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Callee-saved registers
Date: Tue, 22 Aug 2023 21:30:08 +0200
Organization: A noiseless patient Spider
Lines: 65
Message-ID: <uc32c1$2ho1k$1@dont-email.me>
References: <u9o14h$183or$1@newsreader4.netcologne.de>
<2023Aug18.082303@mips.complang.tuwien.ac.at> <ubuegr$1nk70$1@dont-email.me>
<c61a30d4-9e83-41ce-91a7-72c296f22b20n@googlegroups.com>
<uc05oh$200om$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 22 Aug 2023 19:30:09 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="be716e19aec2f81716fb804b09160d90";
logging-data="2678836"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+/EYGrAYFlJBzIPXrb5nxD1WjPWI1IuDT68cCNsayy4w=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Firefox/91.0 SeaMonkey/2.53.17
Cancel-Lock: sha1:0bJ61EBwqRvhGx7QiG9BHW14KnY=
In-Reply-To: <uc05oh$200om$1@dont-email.me>

by: Terje Mathisen - Tue, 22 Aug 2023 19:30 UTC

Kent Dickey wrote:
> typedef unsigned int u32;
> typedef unsigned long long u64;
>
> u64 do_op(u64 out0, u64 in0, u64 in1, u32 opcode, int size);
>
> void
> calc_loop(u64 *optr, u64 *iptr0, u64 *iptr1, u32 opcode, int size, int len)
> {
> u64 o0, i0, i1, val, result;
> int num, shift, pos;
> int i, j;
>
> // size is 0,1,2,3 representing 8,16,32,64 bits
> num = 8 >> size; // 8,4,2,1
> shift = 8 << size; // 8,16,32,64
> for(i = 0; i < len; i++) {
> o0 = optr[i];
> i0 = iptr0[i];
> i1 = iptr1[i];
> result = 0;
> pos = 0;
> for(j = 0; j < num; j++) {
> val = do_op(o0, i0, i1, opcode, size);

How expensive is do_op()? If it takes 100+ cycles, then calc_loop
doesn't really matter, right? So I'll assume do_op is sub-10 cycles and
small enough to be inlined!

> result = result | (val << pos);
pos cannot be >= 64?
> pos += shift;
> if(shift < 64) {
The branch above could be replaced with masked shifts:
> o0 = o0 >> shift;
o0 >>= (shift & 63);
> i0 = i0 >> shift;
i0 >>= (shift & 63);
> i1 = i1 >> shift;
i1 >>= (shift & 63);
> }

On x86 the compiler will probably elide the &63 operations since that is
a hw feature of the shift ops.

> }
> optr[i] = result;
> }
> }
>
> The idea is the outer 'i' loop iterates over 64-bit entries, and the
> inner 'j' loop iterates over each element, and the element size is encoded
> as size=0->byte; size=1->16-bit; size=2->32-bit, size=3->64-bit.
> The code keeps three arrays, one "out" array (which is used as an input
> sometimes), and two "in" arrays.

Would it be possible to allocate optr[], iptr0[], iptr1[] either
interleaved or concatenated so that a single base register could point to
optr[], with the other two at fixed offsets?
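
One way to picture the interleaved layout is the sketch below. The struct `lane`, the function `calc_interleaved` and the stand-in `combine` are all hypothetical names; the real do_op() and the element loop are left out to keep the layout idea visible.

```c
#include <stddef.h>

typedef unsigned long long u64;

/* Hypothetical interleaved record: one element from each of the three
 * arrays, so a single base pointer reaches all of them. */
struct lane {
    u64 out;   /* was optr[i]  */
    u64 in0;   /* was iptr0[i] */
    u64 in1;   /* was iptr1[i] */
};

/* Stand-in for the per-element operation; the real do_op() lives in
 * another file and is not reproduced here. */
static u64 combine(u64 out, u64 in0, u64 in1)
{
    (void)out;
    return in0 + in1;
}

/* The loop needs one pointer register instead of three; the fields sit
 * at fixed offsets 0, 8 and 16 from it. */
void calc_interleaved(struct lane *v, size_t len)
{
    for (size_t i = 0; i < len; i++)
        v[i].out = combine(v[i].out, v[i].in0, v[i].in1);
}
```

The cost is that callers must build their data in this layout, which may not be practical if the three arrays come from different places.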

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Intel goes to 32 GPRs

<uc32dt$2ga95$2@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33759&group=comp.arch#33759

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: sfuld@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Intel goes to 32 GPRs
Date: Tue, 22 Aug 2023 12:31:09 -0700
Organization: A noiseless patient Spider
Lines: 40
Message-ID: <uc32dt$2ga95$2@dont-email.me>
References: <u9o14h$183or$1@newsreader4.netcologne.de>
<2023Aug19.123116@mips.complang.tuwien.ac.at>
<AA6EM.518412$TCKc.373270@fx13.iad> <uc0lga$22c76$1@dont-email.me>
<uc310n$umu$1@gal.iecc.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 22 Aug 2023 19:31:09 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="c791ee6ca3a519710bc74235e9c9d7c7";
logging-data="2631973"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18vHF4fjtBuYxHAEInNGU+x2QJst1RKDTk="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:sTJctCD5nMNHLzxiz5Em9LsucS8=
Content-Language: en-US
In-Reply-To: <uc310n$umu$1@gal.iecc.com>

by: Stephen Fuld - Tue, 22 Aug 2023 19:31 UTC

On 8/22/2023 12:07 PM, John Levine wrote:
> According to Stephen Fuld <sfuld@alumni.cmu.edu.invalid>:
>> Not necessarily. If the compiler of the calling program also exported
>> the used register lists of the routines it called when compiling the
>> code, then the linker could check that the routines actually linked to
>> did not use more registers than the caller assumed. This would work even
>> if the called program was compiled at a different time than the one used
>> at the main routine's compile time. (I hope this is clear. The
>> language can get confusing.)
>
> Ah, you mean signatures for calling points. It's an old trick dating
> back to the 1970s used to check that the arguments passed from the
> caller match the ones the callee expects. No reason you couldn't do
> the same thing to check for compatible register sets.

Good point. I hadn't thought of that connection, but yes.

>> Of course, the question arises, what if a called routine does use more
>> registers than the calling program was told at its compile time. There
>> are several possibilities, abort the link, create trampoline routines to
>> save and restore the newly used registers, perhaps others. The choice
>> should certainly be made by the programmer.
>
> This is why sensible libraries use version numbers with major numbers
> meaning an interface change and minor numbers just a bug fix. You
> relink your code to use a new major version of the library. Every Unix
> and Linux system does this.

But it isn't clear that changing the registers used internally is an
"interface change". It certainly isn't a change in the source code of
the interface. Or, to put it another way, a "bug fix" may change the
register usage without changing what is typically called the "interface".

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

According to Stephen Fuld <sfuld@alumni.cmu.edu.invalid>:
>>> Of course, the question arises, what if a called routine does use more
>>> registers than the calling program was told at its compile time. There
>>> are several possibilities, abort the link, create trampoline routines to
>>> save and restore the newly used registers, perhaps others. The choice
>>> should certainly be made by the programmer.
>>
>> This is why sensible libraries use version numbers with major numbers
>> meaning an interface change and minor numbers just a bug fix. You
>> relink your code to use a new major version of the library. Every Unix
>> and Linux system does this.
>
>But it isn't clear that changing the registers used internally is an
>"interface change". It certainly isn't a change in the source code of
>the interface. Or, to put it another way, a "bug fix" may change the
>register usage without changing what is typically called the "interface".

If the register sets have to match between caller and callee, that seems
to me to be an interface by definition.

There'd be some tool-building work to check that the register sets
don't change when rebuilding a library if we make them part of the
interface. But it doesn't seem that hard: just compare old and new
libraries and fail if the new one is more restrictive. Then of course
there's the issue of what to do about it, probably a new compiler flag or
pragma saying only use these registers, which would be ugly.
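
The old-versus-new comparison might be sketched as below. The `reg_summary` record and `count_regressions` are hypothetical, and "more restrictive" is read here as "the new routine clobbers a register the old one left alone", which is an interpretation rather than something spelled out above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-symbol summary: which registers a routine clobbers. */
struct reg_summary {
    const char *symbol;
    uint32_t clobbers;   /* bit n set: register n may be overwritten */
};

/* Count symbols in the new library that clobber a register the old
 * version left alone. Symbols absent from the old library are treated
 * as new entry points and are not counted. */
size_t count_regressions(const struct reg_summary *old_lib, size_t n_old,
                         const struct reg_summary *new_lib, size_t n_new)
{
    size_t bad = 0;
    for (size_t i = 0; i < n_new; i++) {
        for (size_t j = 0; j < n_old; j++) {
            if (strcmp(new_lib[i].symbol, old_lib[j].symbol) != 0)
                continue;
            if (new_lib[i].clobbers & ~old_lib[j].clobbers)
                bad++;
            break;
        }
    }
    return bad;
}
```

A library build would fail when the count is nonzero.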

R's,
John
--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

In article <uc32c1$2ho1k$1@dont-email.me>,
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
>Kent Dickey wrote:

You removed my comment saying I wasn't going to entertain nitpicks, but
you have one reasonable request.

>> typedef unsigned int u32;
>> typedef unsigned long long u64;
>>
>> u64 do_op(u64 out0, u64 in0, u64 in1, u32 opcode, int size);
>>
>> void
>> calc_loop(u64 *optr, u64 *iptr0, u64 *iptr1, u32 opcode, int size, int len)
>> {
>> u64 o0, i0, i1, val, result;
>> int num, shift, pos;
>> int i, j;
>>
>> // size is 0,1,2,3 representing 8,16,32,64 bits
>> num = 8 >> size; // 8,4,2,1
>> shift = 8 << size; // 8,16,32,64
>> for(i = 0; i < len; i++) {
>> o0 = optr[i];
>> i0 = iptr0[i];
>> i1 = iptr1[i];
>> result = 0;
>> pos = 0;
>> for(j = 0; j < num; j++) {
>> val = do_op(o0, i0, i1, opcode, size);
>
>How expensive is do_op()? If it takes 100+ cycles, then calc_loop
>doesn't really matter, right? So I'll assume do_op is sub-10 cycles and
>small enough to be inlined!

do_op() is a giant switch with 50+ entries. It's an ALU simulation:

switch(opcode) {
case 0:
return i0 + i1;
case 1:
return i0 - i1;
....
case 49:
return i0 * i1;
}

So it's just a few instructions dynamically (the opcode isn't changing
in the loop, so it should get predicted after a few iterations to go to
the opcode of interest, which is generally one instruction). But it's
hundreds of instructions in total size. In a different file.

>> result = result | (val << pos);
>pos cannot be >= 64?

pos cannot be >= 64 when this shift is executed. Pos can be set to 64
in the next statement, but then this inner loop will end right after.

>> pos += shift;
>> if(shift < 64) {
>The branch above could be replaced with masked shifts:
>> o0 = o0 >> shift;
> o0 >>= (shift & 63);
>> i0 = i0 >> shift;
> i0 >>= (shift & 63);
>> i1 = i1 >> shift;
> i1 >>= (shift & 63);
>> }

I mentioned doing & 63 in my 2020 post; this is a flaw in the C standard, which
is beyond the scope of this post.

>
>On x86 the compiler will probably elide the &63 operations since that is
>a hw feature of the shift ops.
>
>> }
>> optr[i] = result;
>> }
>> }
>>
>> The idea is the outer 'i' loop iterates over 64-bit entries, and the
>> inner 'j' loop iterates over each element, and the element size is encoded
>> as size=0->byte; size=1->16-bit; size=2->32-bit, size=3->64-bit.
>> The code keeps three arrays, one "out" array (which is used as an input
>> sometimes), and two "in" arrays.
>
>Would it be possible to allocate optr[], iptr0[], iptr1[] either
>interleaved or concatenated so that a single base register could point to
>optr[], with the other two at fixed offsets?

I was trying to show a short routine that needs about 16 registers to avoid
spills, since someone immediately doubted that such a thing existed.

The code is written in a careful way to allow the compiler to optimize it
to be nearly optimum, but in this case, there just aren't enough registers,
so something I expected to be short and fast isn't that short.

You are approaching this as a microbenchmark, where this routine is 80%
of the runtime, and making it faster is worth the effort. Instead, I
have 1000 functions like this, each 0.04% of the runtime, and I don't
have time or resources to micro optimize each of them. So I care about
overall performance, but I cannot hand optimize each function. I need
to rely on the compiler to do a good job.
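
For reference, the masked-shift variant discussed above can be written out as a self-contained sketch. In C, shifting a 64-bit value by 64 is undefined, which is why the original loop guards the shifts with `if(shift < 64)`. Masking the count with 63 turns that case into a shift by 0, which is harmless because the inner loop runs only once when size is 3. The `do_op()` here is a two-opcode stand-in invented for this sketch, not the real 50-entry switch, and `calc_loop_masked` is a hypothetical name.

```c
typedef unsigned int u32;
typedef unsigned long long u64;

/* Stand-in for do_op(): opcode 0 adds, anything else subtracts, and
 * the result is truncated to the element width. */
static u64 do_op(u64 out0, u64 in0, u64 in1, u32 opcode, int size)
{
    int bits = 8 << size;
    u64 mask = (bits == 64) ? ~0ULL : ((1ULL << bits) - 1);
    u64 r = (opcode == 0) ? in0 + in1 : in0 - in1;
    (void)out0;
    return r & mask;
}

/* calc_loop with the branch replaced by masked shift counts. */
void calc_loop_masked(u64 *optr, const u64 *iptr0, const u64 *iptr1,
                      u32 opcode, int size, int len)
{
    int num = 8 >> size;     /* elements per 64-bit word: 8,4,2,1 */
    int shift = 8 << size;   /* element width in bits: 8,16,32,64 */

    for (int i = 0; i < len; i++) {
        u64 o0 = optr[i], i0 = iptr0[i], i1 = iptr1[i];
        u64 result = 0;
        int pos = 0;
        for (int j = 0; j < num; j++) {
            u64 val = do_op(o0, i0, i1, opcode, size);
            result |= val << pos;   /* pos is at most 56 here */
            pos += shift;
            o0 >>= (shift & 63);    /* shift == 64 becomes a shift by 0 */
            i0 >>= (shift & 63);
            i1 >>= (shift & 63);
        }
        optr[i] = result;
    }
}
```

This does nothing about the register pressure that is the actual subject here; it only removes one branch from the inner loop.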

>Terje
>
>--
>- <Terje.Mathisen at tmsw.no>
>"almost all programming can be viewed as an exercise in caching"

Kent

Re: Intel goes to 32 GPRs

<uc45qi$2pk2p$1@newsreader4.netcologne.de>

https://news.novabbs.org/devel/article-flat.php?id=33770&group=comp.arch#33770

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd4-f2a7-0-3c5b-97e0-1685-9f7e.ipv6dyn.netcologne.de!not-for-mail
From: tkoenig@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Intel goes to 32 GPRs
Date: Wed, 23 Aug 2023 05:35:14 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <uc45qi$2pk2p$1@newsreader4.netcologne.de>
References: <u9o14h$183or$1@newsreader4.netcologne.de>
<u9r0jd$1ftnb$1@dont-email.me> <ublni6$3s8o9$1@dont-email.me>
<07145f85-deff-44ca-bf24-0b754ba07594n@googlegroups.com>
<ubopm8$dn07$1@dont-email.me> <ubpr3d$2iugr$1@newsreader4.netcologne.de>
<2023Aug19.123116@mips.complang.tuwien.ac.at>
<AA6EM.518412$TCKc.373270@fx13.iad> <uc0lga$22c76$1@dont-email.me>
<Cf3FM.499505$qnnb.208430@fx11.iad>
<b92f55d2-aceb-495c-a240-faf5f578b5f0n@googlegroups.com>
<Td6FM.501194$qnnb.321179@fx11.iad>
Injection-Date: Wed, 23 Aug 2023 05:35:14 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd4-f2a7-0-3c5b-97e0-1685-9f7e.ipv6dyn.netcologne.de:2001:4dd4:f2a7:0:3c5b:97e0:1685:9f7e";
logging-data="2936921"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)

by: Thomas Koenig - Wed, 23 Aug 2023 05:35 UTC

EricP <ThatWouldBeTelling@thevillage.com> schrieb:
> MitchAlsup wrote:

>> All of these "function calls" end up being instructions in My 66000 ISA.
>> But, even those that are not functions take "way less code" than similar
>> RISC-V compilations of those codes--mainly because these codes are
>> polynomials with constant coefficients. For example r8_erf() from
>> polpack only needs 43% as many instructions as it needs in RISC-V
>> and when the compiler's limit on constants in registers is
>> removed, better still. {and as a side note:: this is a function where
>> RISC-V with separate GPR and FPR register files needs spill/fill
>> code, whereas My 66000 with integrated GPR does not--entirely due
>> to universal constants}.
>> <
>> Straightforward polynomial evaluation (Horner, Estrin) only needs 6-ish
>> registers (index, pointer, coefficient, power, product and summation)::
>> SIN(), COS(), Ln() family, exp() family, ATAN(), and POW() all fall into
>> this general category. I suspect many more straightforward polynomial
>> evaluations do too.
>
> Even with your transcendental instructions, would not these still be
> implemented as non-inlined subroutines? Because for different languages,
> Fortran, C, the math functions in particular have different ways of
> interacting with the run-time environment to report errors.

C actually has no requirement of setting errno on out-of-range
calls to mathematical functions. You can use -fno-math-errno to
get this behavior from gcc (and, I believe, clang). Apple chose
not to include setting errno in MacOS, and they made the right
decision there - thread safety and vectorization make setting errno
a performance limiter.

Anybody making a new implementation is equally free to not set
errno on math functions, and this makes good sense on My 66000.

> Eg C/C++ math functions can set errno which is a TLS variable.
> I don't know what GCC Fortran does for its error status reporting.

Since Fortran has no errno, gfortran in effect just uses
-fno-math-errno. Range errors are usually treated by returning NaN.
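
Code that has to work with either kind of library can ask the implementation through the standard `math_errhandling` macro instead of assuming errno is set. The sketch below is one possible shape for that. `call_checked` and `stub_domain_limited` are hypothetical names, and the stub merely imitates a routine that signals a domain error by returning NaN; it is not a real math function.

```c
#include <errno.h>
#include <math.h>

/* Call a one-argument math function and report whether it signalled an
 * error, without assuming that the library sets errno. */
double call_checked(double (*f)(double), double x, int *failed)
{
    errno = 0;
    double r = f(x);
    /* Trust errno only if the implementation says it uses it. */
    int via_errno = (math_errhandling & MATH_ERRNO) && errno != 0;
    /* Otherwise fall back to spotting a NaN that was not in the input. */
    int via_nan = isnan(r) && !isnan(x);
    *failed = via_errno || via_nan;
    return r;
}

/* Stand-in for a library routine that reports a domain error only by
 * returning NaN. Negative inputs are "out of domain" here. */
double stub_domain_limited(double x)
{
    return (x < 0.0) ? NAN : x;
}
```

Passing `log` or `sqrt` from <math.h> in place of the stub exercises the real library, linked against the math library where the platform requires it.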

Re: Callee-saved registers (was: Intel goes to 32 GPRs)

<2023Aug23.071732@mips.complang.tuwien.ac.at>

https://news.novabbs.org/devel/article-flat.php?id=33771&group=comp.arch#33771

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Callee-saved registers (was: Intel goes to 32 GPRs)
Date: Wed, 23 Aug 2023 05:17:32 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 109
Message-ID: <2023Aug23.071732@mips.complang.tuwien.ac.at>
References: <u9o14h$183or$1@newsreader4.netcologne.de> <u9r0jd$1ftnb$1@dont-email.me> <ublni6$3s8o9$1@dont-email.me> <2023Aug18.082303@mips.complang.tuwien.ac.at> <ubuegr$1nk70$1@dont-email.me> <c61a30d4-9e83-41ce-91a7-72c296f22b20n@googlegroups.com>
Injection-Info: dont-email.me; posting-host="811b53b66829c2fe281d38fe9bdfa7b6";
logging-data="2965943"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/cmMaSKrTjPxig72qmGw0B"
Cancel-Lock: sha1:e6FYWxLqCnfwZRsynRmKt1B6RZY=
X-newsreader: xrn 10.11

by: Anton Ertl - Wed, 23 Aug 2023 05:17 UTC

Michael S <already5chosen@yahoo.com> writes:
>There is other side of the same argument: can you show me (or more
>relevantly to designers of aarch64 ABI) some code which needs 16
> callee-saved registers, and would run noticeably slower with 11
> callee-saved registers ?

I think that a number of virtual-machine (VM) interpreters (a common
programming language implementation technique) would benefit. An
example is Gforth, and I am one of its implementors.

In Gforth we have the following virtual-machine registers:

ip VM instruction pointer
CFA Current word (VM instruction or other)
sp data stack pointer
fp FP stack pointer
rp return stack pointer
lp locals pointer
op object pointer (this/self)
up user pointer (base of thread-local variables)
TOS top of data stack
FTOS top of FP stack (in FP register)

Apart from TOS, these VM registers all live across many function
calls, and therefore gcc either puts them in callee-saved registers,
or, if it runs out of those, puts them in memory.

TOS is a special case: It is usually dead during most of the VM
instruction execution. The exception are VM instructions that have no
data-stack effect (e.g., FP instructions like FSIN). And there are
apparently few enough of those in the engine function that gcc manages
to put TOS in a caller-saved register (r8 in the engine that I am
currently looking at) and save and restore TOS around the few
functions where it lives.

So this means that we already have 8 VM registers for which gcc needs
callee-saved registers.

We have had some ideas that would have introduced additional VM
registers to enhance performance (if enough callee-saved registers
were available), but given the dearth of callee-saved registers on
most platforms, we usually have not implemented these ideas.

There is, however, one case where we implemented such an idea: We use
additional registers for keeping data-stack items in registers,
providing significant speedups. You can read about it in
[ertl&gregg05].
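
To make the idea of a cached top-of-stack concrete, here is a toy switch-dispatched stack VM in C. It is not Gforth's engine: the opcode set, the name `vm_run` and the fixed 32-entry stack without bounds checks are all assumptions made for this sketch. The locals `ip`, `sp` and `tos` play the role of the VM registers of the same names in the list above.

```c
enum { OP_LIT, OP_ADD, OP_MUL, OP_HALT };

/* Run a program of opcodes and inline literals; return the final
 * top-of-stack value. No stack bounds checking: sketch only. */
long vm_run(const long *code)
{
    long stack[32];
    const long *ip = code;   /* VM instruction pointer */
    long *sp = stack;        /* data stack pointer (items below TOS) */
    long tos = 0;            /* top of stack, cached in a local */

    for (;;) {
        switch (*ip++) {
        case OP_LIT:
            *sp++ = tos;     /* spill the old TOS to memory */
            tos = *ip++;     /* the literal becomes the new TOS */
            break;
        case OP_ADD:
            tos = *--sp + tos;
            break;
        case OP_MUL:
            tos = *--sp * tos;
            break;
        case OP_HALT:
        default:
            return tos;
        }
    }
}
```

With `tos` cached, OP_ADD does one load and no store; a version keeping every item in memory would need two loads and a store. In a real engine the instruction bodies also call out to other functions, which is where the pressure on preserved registers comes from.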

If more callee-saved registers were widely available, we would have
implemented additional ideas, which might have provided additional
speedups. E.g., we could have kept the top of the return stack in a
register, which also serves as the counter of counted loops,
increasing the performance of counted loops.

And Gforth uses a simple virtual machine. I expect that virtual
machines for more sophisticated languages, like Prolog, can make good
use of more VM registers and thus more callee-saved registers.
Unfortunately, it is over 30 years since I implemented a virtual
machine interpreter for Prolog, so I'll skip the details.

CPython (the main Python implementation) also uses a virtual machine
interpreter, and the CPython implementors have various projects to
make this interpreter faster; I expect that more callee-saved
registers would benefit them quite a bit.

@InProceedings{ertl&gregg05,
author = {M. Anton Ertl and David Gregg},
title = {Stack Caching in {Forth}},
crossref = {euroforth05},
pages = {6--15},
url = {http://www.complang.tuwien.ac.at/papers/ertl%26gregg05.ps.gz},
pdfurl = {http://www.complang.tuwien.ac.at/anton/euroforth2005/papers/ertl%26gregg05.pdf},
OPTnote = {not refereed},
abstract = {Stack caching speeds Forth up by keeping stack items
in registers, reducing the number of memory accesses
for stack items. This paper describes our work on
extending Gforth's stack caching implementation to
support more than one register in the canonical
state, and presents timing results for the resulting
Forth system. For single-representation stack
caches, keeping just one stack item in registers is
usually best, and provides speedups up to a factor
of 2.84 over the straight-forward stack
representation. For stack caches with multiple stack
representations, using the one-register
representation as canonical representation is
usually optimal, resulting in an overall speedup of
up to a factor of 3.80 (and up to a factor of 1.53
over single-representation stack caching).}
}
@Proceedings{euroforth05,
title = {21st EuroForth Conference},
booktitle = {21st EuroForth Conference},
year = {2005},
key = {EuroForth'05},
editor = {M. Anton Ertl},
url = {http://www.complang.tuwien.ac.at/anton/euroforth2005/papers/proceedings.pdf}
}

>I tend to think that they tried to give a deep thought, but in the process
>quickly came to conclusion that it matters very little.

Based on what?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Callee-saved registers (was: Intel goes to 32 GPRs)

<b3c8c458-b46c-4a2d-bec5-01d9397c2ad4n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33772&group=comp.arch#33772

X-Received: by 2002:ad4:55eb:0:b0:641:8885:5010 with SMTP id bu11-20020ad455eb000000b0064188855010mr101496qvb.9.1692777696995;
Wed, 23 Aug 2023 01:01:36 -0700 (PDT)
X-Received: by 2002:a17:902:c449:b0:1bb:b74c:88fa with SMTP id
m9-20020a170902c44900b001bbb74c88famr4284951plm.6.1692777696626; Wed, 23 Aug
2023 01:01:36 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 23 Aug 2023 01:01:36 -0700 (PDT)
In-Reply-To: <2023Aug23.071732@mips.complang.tuwien.ac.at>
Injection-Info: google-groups.googlegroups.com; posting-host=199.203.251.52; posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 199.203.251.52
References: <u9o14h$183or$1@newsreader4.netcologne.de> <u9r0jd$1ftnb$1@dont-email.me>
<ublni6$3s8o9$1@dont-email.me> <2023Aug18.082303@mips.complang.tuwien.ac.at>
<ubuegr$1nk70$1@dont-email.me> <c61a30d4-9e83-41ce-91a7-72c296f22b20n@googlegroups.com>
<2023Aug23.071732@mips.complang.tuwien.ac.at>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <b3c8c458-b46c-4a2d-bec5-01d9397c2ad4n@googlegroups.com>
Subject: Re: Callee-saved registers (was: Intel goes to 32 GPRs)
From: already5chosen@yahoo.com (Michael S)
Injection-Date: Wed, 23 Aug 2023 08:01:36 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 6964

by: Michael S - Wed, 23 Aug 2023 08:01 UTC

On Wednesday, August 23, 2023 at 9:12:54 AM UTC+3, Anton Ertl wrote:
> Michael S <already...@yahoo.com> writes:
> >There is other side of the same argument: can you show me (or more
> >relevantly to designers of aarch64 ABI) some code which needs 16
> > callee-saved registers, and would run noticeably slower with 11
> > callee-saved registers ?
> I think that a number of virtual-machine (VM) interpreters (a common
> programming language implementation technique) would benefit. An
> example is Gforth, and I am one of its implementors.
>
> In Gforth we have the following virtual-machine registers:
>
> ip VM instruction pointer
> CFA Current word (VM instruction or other)
> sp data stack pointer
> fp FP stack pointer
> rp return stack pointer
> lp locals pointer
> op object pointer (this/self)
> up user pointer (base of thread-local variables)
> TOS top of data stack
> FTOS top of FP stack (in FP register)
>
> Apart from TOS, these VM registers all live across many function
> calls, and therefore gcc either puts them in callee-saved registers,
> or, if it runs out of those, puts them in memory.
>
> TOS is a special case: It is usually dead during most of the VM
> instruction execution. The exceptions are VM instructions that have no
> data-stack effect (e.g., FP instructions like FSIN). And there are
> apparently few enough of those in the engine function that gcc manages
> to put TOS in a caller-saved register (r8 in the engine that I am
> currently looking at) and save and restore TOS around the few
> functions where it lives.
>
> So this means that we already have 8 VM registers for which gcc needs
> callee-saved registers.
>
> We have had some ideas that would have introduced additional VM
> registers to enhance performance (if enough callee-saved registers
> were available), but given the dearth of callee-saved registers on
> most platforms, we usually have not implemented these ideas.
>
> There is, however, one case where we implemented such an idea: We use
> additional registers for keeping data-stack items in registers,
> providing significant speedups. You can read about it in
> [ertl&gregg05].
>
> If more callee-saved registers were widely available, we would have
> implemented additional ideas, which might have provided additional
> speedups.
> E.g., we could have kept the top of the return stack in a
> register, which also serves as the counter of counted loops,
> increasing the performance of counted loops.
>

Did you try to experiment on POWER in order to see whether the
speedup is noticeable?

> And Gforth uses a simple virtual machine. I expect that virtual
> machines for more sophisticated languages, like Prolog, can make good
> use of more VM registers and thus more callee-saved registers.
> Unfortunately, it is over 30 years since I implemented a virtual
> machine interpreter for Prolog, so I'll skip the details.
>
> CPython (the main Python implementation) also uses a virtual machine
> interpreter, and the CPython implementors have various projects to
> make this interpreter faster; I expect that more callee-saved
> registers would benefit them quite a bit.
>
> @InProceedings{ertl&gregg05,
> author = {M. Anton Ertl and David Gregg},
> title = {Stack Caching in {Forth}},
> crossref = {euroforth05},
> pages = {6--15},
> url = {http://www.complang.tuwien.ac.at/papers/ertl%26gregg05.ps.gz},
> pdfurl = {http://www.complang.tuwien.ac.at/anton/euroforth2005/papers/ertl%26gregg05.pdf},
> OPTnote = {not refereed},
> abstract = {Stack caching speeds Forth up by keeping stack items
> in registers, reducing the number of memory accesses
> for stack items. This paper describes our work on
> extending Gforth's stack caching implementation to
> support more than one register in the canonical
> state, and presents timing results for the resulting
> Forth system. For single-representation stack
> caches, keeping just one stack item in registers is
> usually best, and provides speedups up to a factor
> of 2.84 over the straight-forward stack
> representation. For stack caches with multiple stack
> representations, using the one-register
> representation as canonical representation is
> usually optimal, resulting in an overall speedup of
> up to a factor of 3.80 (and up to a factor of 1.53
> over single-representation stack caching).}
> }
> @Proceedings{euroforth05,
> title = {21st EuroForth Conference},
> booktitle = {21st EuroForth Conference},
> year = {2005},
> key = {EuroForth'05},
> editor = {M. Anton Ertl},
> url = {http://www.complang.tuwien.ac.at/anton/euroforth2005/papers/proceedings.pdf}
> }
>
> >I tend to think that they tried to give a deep thought, but in the process
> >quickly came to conclusion that it matters very little.
> Based on what?
>

Based on the corpus of benchmarks they have.
For 'other' industry players I would guess that it's primarily SPECint.
In the specific case of Arm Inc., I don't know.

> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Re: Callee-saved registers (was: Intel goes to 32 GPRs)

<2023Aug23.102035@mips.complang.tuwien.ac.at>

https://news.novabbs.org/devel/article-flat.php?id=33773&group=comp.arch#33773

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Callee-saved registers (was: Intel goes to 32 GPRs)
Date: Wed, 23 Aug 2023 08:20:35 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 60
Message-ID: <2023Aug23.102035@mips.complang.tuwien.ac.at>
References: <u9o14h$183or$1@newsreader4.netcologne.de> <u9r0jd$1ftnb$1@dont-email.me> <ublni6$3s8o9$1@dont-email.me> <2023Aug18.082303@mips.complang.tuwien.ac.at> <ubuegr$1nk70$1@dont-email.me> <c61a30d4-9e83-41ce-91a7-72c296f22b20n@googlegroups.com> <2023Aug23.071732@mips.complang.tuwien.ac.at> <b3c8c458-b46c-4a2d-bec5-01d9397c2ad4n@googlegroups.com>
Injection-Info: dont-email.me; posting-host="811b53b66829c2fe281d38fe9bdfa7b6";
logging-data="3014653"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+4TqB/Btoy6BDhFjK3bF0N"
Cancel-Lock: sha1:fjwJb5IHuNAFIrxcHf9UlpWG9xA=
X-newsreader: xrn 10.11

by: Anton Ertl - Wed, 23 Aug 2023 08:20 UTC

Michael S <already5chosen@yahoo.com> writes:
>On Wednesday, August 23, 2023 at 9:12:54 AM UTC+3, Anton Ertl wrote:
>> If more callee-saved registers were widely available, we would have
>> implemented additional ideas, which might have provided additional
>> speedups.
>> E.g., we could have kept the top of the return stack in a
>> register, which also serves as the counter of counted loops,
>> increasing the performance of counted loops.
>>
>
>Did you try to experiment on POWER in order to see whether the speed
>up is noticeable?

No. From a product POV, Power/PowerPC is too insignificant as a
platform to invest that much effort (plus the effort to make it an
optional feature) even if we see a good speedup there. And I don't
think that this work would be considered sufficiently original and
relevant to be publishable as a research paper, so I did not do it
as research, either.

>Based on the corpus of benchmarks they have.
>For 'other' industry players I would guess that it's primarily SpecInt.
>In specific case of Arm Inc. - I don't know.

With that methodology, I would expect the number of callee-saved
registers to be very similar across architectures. But they diverge
wildly.

OTOH, when most architectures were introduced, Hennessy&Patterson had
not written CA:AQA, so maybe the architects used handwaving for these
decisions (although, at least for HPPA, a quantitative approach was
reported for deciding which instructions to include. HPPA has
gr3..gr18 as callee-saved registers, i.e., 16 callee-saved registers).
Only AMD64, ARM A64, and RISC-V are significantly younger than CA:AQA.
For MIPS O32, derived from Stanford MIPS, one might also expect a
quantitative approach, but the benchmarks they used for Stanford MIPS
were not significant applications, but instead small integer
benchmarks, so one should not be surprised that they did not see a
benefit from more than 8 callee-saved registers.

For AMD64 the low number of callee-saved registers may be due to the
lack of registers overall (although even with stack pointer, frame
pointer, and six argument/return registers, 8 registers could be
callee-saved, or 9 with -fomit-frame-pointer).

For ARM A64 and RISC-V, the number may indeed be due to catering for
benchmarks, and indeed primarily for SPEC CPU and the like.

In that case, decades of register-starved architectures (IA-32,
AMD64) and architectures with very few callee-saved registers (MIPS,
Alpha) have led to applications that are tuned to use few callee-saved
registers. And then such applications are used as benchmarks that
lead architects to have few callee-saved registers. It's a vicious
circle.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Callee-saved registers (was: Intel goes to 32 GPRs)

<dfa8e23e-eb63-4875-ab1a-7a6869e82e21n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33775&group=comp.arch#33775

X-Received: by 2002:a05:6214:190d:b0:63c:e59f:299e with SMTP id er13-20020a056214190d00b0063ce59f299emr127023qvb.3.1692792843654;
Wed, 23 Aug 2023 05:14:03 -0700 (PDT)
X-Received: by 2002:a17:90b:186:b0:26b:229d:6c8 with SMTP id
t6-20020a17090b018600b0026b229d06c8mr2841063pjs.3.1692792843193; Wed, 23 Aug
2023 05:14:03 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 23 Aug 2023 05:14:02 -0700 (PDT)
In-Reply-To: <2023Aug23.102035@mips.complang.tuwien.ac.at>
Injection-Info: google-groups.googlegroups.com; posting-host=199.203.251.52; posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 199.203.251.52
References: <u9o14h$183or$1@newsreader4.netcologne.de> <u9r0jd$1ftnb$1@dont-email.me>
<ublni6$3s8o9$1@dont-email.me> <2023Aug18.082303@mips.complang.tuwien.ac.at>
<ubuegr$1nk70$1@dont-email.me> <c61a30d4-9e83-41ce-91a7-72c296f22b20n@googlegroups.com>
<2023Aug23.071732@mips.complang.tuwien.ac.at> <b3c8c458-b46c-4a2d-bec5-01d9397c2ad4n@googlegroups.com>
<2023Aug23.102035@mips.complang.tuwien.ac.at>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <dfa8e23e-eb63-4875-ab1a-7a6869e82e21n@googlegroups.com>
Subject: Re: Callee-saved registers (was: Intel goes to 32 GPRs)
From: already5chosen@yahoo.com (Michael S)
Injection-Date: Wed, 23 Aug 2023 12:14:03 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 5083

by: Michael S - Wed, 23 Aug 2023 12:14 UTC

On Wednesday, August 23, 2023 at 11:59:30 AM UTC+3, Anton Ertl wrote:
> Michael S <already...@yahoo.com> writes:
> >On Wednesday, August 23, 2023 at 9:12:54 AM UTC+3, Anton Ertl wrote:
> >> If more callee-saved registers were widely available, we would have
> >> implemented additional ideas, which might have provided additional
> >> speedups.
> >> E.g., we could have kept the top of the return stack in a
> >> register, which also serves as the counter of counted loops,
> >> increasing the performance of counted loops.
> >>
> >
> >Did you try to experiment on POWER in order to see whether the speed
> >up is noticeable?
>
> No. From a product POV, Power/PowerPC is too insignificant as a
> platform to invest that much effort (plus the effort to make it an
> optional feature) even if we see a good speedup there. And I don't
> think that this work would be considered sufficiently original and
> relevant to be publishable as a research paper, so I also did not do
> it as research, either.
> >Based on the corpus of benchmarks they have.
> >For 'other' industry players I would guess that it's primarily SpecInt.
> >In specific case of Arm Inc. - I don't know.
> With that methodology, I would expect the number of callee-saved
> registers to be very similar across architectures.

Not if, as I suspect, the curve in the range [10:20] is almost flat.

> But they diverge wildly.

If the curve is flat, then the decision is arbitrary, i.e., influenced by something else.

>
> OTOH, when most architectures were introduced, Hennessy&Patterson had
> not written CA:AQA, so maybe the architects used handwaving for these
> decisions (although, at least for HPPA, a quantitative approach was
> reported for deciding which instructions to include. HPPA has
> gr3..gr18 as callee-saved registers, i.e., 16 callee-saved registers).
> Only AMD64, ARM A64, and RISC-V are significantly younger than CA:AQA.
> For MIPS O32, derived from Stanford MIPS, one might also expect a
> quantitative approach, but the benchmarks they used for Stanford MIPS
> were not significant applications, but instead small integer
> benchmarks, so one should not be surprised that they did not see a
> benefit from more than 8 callee-saved registers.
>
> For AMD64 the low number of callee-saved registers may be due to the
> lack of registers overall (although even with stack pointer, frame
> pointer, and six argument/return registers, 8 registers could be
> callee-saved, or 9 with -fomit-frame-pointer).
>
> For ARM A64 and RISC-V, the number may indeed be due to catering for
> benchmarks, and indeed primarily for SPEC CPU and the like.
>
> In this case: So decades of register-starved architectures (IA-32,
> AMD64) and architectures with very few callee-saved registers (MIPS,
> Alpha) have led to applications that are tuned to use few callee-saved
> registers. And then such applications are used as benchmarks that
> lead architects to have few callee-saved registers. It's a vicious
> circle.

I hope that the coding style in the SPEC CPU suite is mostly influenced
by factors unrelated to micro-optimizations of such a low-level sort.

> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Re: Intel goes to 32 GPRs

<NcoFM.574645$SuUf.392949@fx14.iad>

https://news.novabbs.org/devel/article-flat.php?id=33776&group=comp.arch#33776

Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx14.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Intel goes to 32 GPRs
Newsgroups: comp.arch
Distribution: world
References: <u9o14h$183or$1@newsreader4.netcologne.de> <ublni6$3s8o9$1@dont-email.me> <07145f85-deff-44ca-bf24-0b754ba07594n@googlegroups.com> <ubopm8$dn07$1@dont-email.me> <ubpr3d$2iugr$1@newsreader4.netcologne.de> <2023Aug19.123116@mips.complang.tuwien.ac.at> <AA6EM.518412$TCKc.373270@fx13.iad> <uc0lga$22c76$1@dont-email.me> <Cf3FM.499505$qnnb.208430@fx11.iad> <b92f55d2-aceb-495c-a240-faf5f578b5f0n@googlegroups.com> <Td6FM.501194$qnnb.321179@fx11.iad> <uc45qi$2pk2p$1@newsreader4.netcologne.de>
Lines: 47
Message-ID: <NcoFM.574645$SuUf.392949@fx14.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Wed, 23 Aug 2023 13:54:53 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Wed, 23 Aug 2023 13:54:53 GMT
X-Received-Bytes: 3455

by: Scott Lurndal - Wed, 23 Aug 2023 13:54 UTC

Thomas Koenig <tkoenig@netcologne.de> writes:
>EricP <ThatWouldBeTelling@thevillage.com> schrieb:
>> MitchAlsup wrote:
>
>>> All of these "function calls" end up being instructions in My 66000 ISA.
>>> But, even those that are not functions take "way less code" than similar
>>> RISC-V compilations of those codes--mainly because these codes are
>>> polynomials with constant coefficients. For example r8_erf() from
>>> polpack only needs 43% as many instructions as it needs in RISC-V
>>> and when the compiler limit on constants in registers limitations are
>>> removed, better still. {and as a side note:: this is a function where
>>> RISC-V with separate GPR and FPR registers files needs spill/fill
>>> code, whereas My 66000 with integrated GPR does not--entirely due
>>> to universal constants}.
>>> <
>>> Straightforward polynomial evaluation (Horner, Estrin) only needs 6-ish
>>> registers (index, pointer, coefficient, power, product and summation)::
>>> SIN(), COS(), Ln() family, exp() family, ATAN(), and POW() all fall into
>>> this general category. I suspect many more straightforward polynomial
>>> evaluations do too.
>>
>> Even with your transcendental instructions, would not these still be
>> implemented as non-inlined subroutines? Because for different languages,
>> Fortran, C, the math functions in particular have different ways of
>> interacting with the run-time environment to report errors.
>
>C actually has no requirement of setting errno on out-of-range
>calls to mathematical functions. You can use -fno-math-errno to
>get this behavior from gcc (and, I believe, clang). Apple chose
>not to include setting errno in MacOS, and they made the right
>decision there - thread safety and vectorization make setting errno
>a performance limiter.

C doesn't. POSIX has an optional requirement to set errno.

"For all the functions in the <math.h> header, an application
wishing to check for error situations should set errno to 0 and
call feclearexcept(FE_ALL_EXCEPT) before calling the function.
On return, if errno is non-zero or
fetestexcept( FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW | FE_UNDERFLOW)
is non-zero, an error has occurred."

Thread safety requires a thread-local version of errno, which is
pretty efficient in modern implementations (e.g. an address relative to
an otherwise unused segment register on Intel).

Re: Callee-saved registers (was: Intel goes to 32 GPRs)

<2023Aug23.172341@mips.complang.tuwien.ac.at>

https://news.novabbs.org/devel/article-flat.php?id=33777&group=comp.arch#33777

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Callee-saved registers (was: Intel goes to 32 GPRs)
Date: Wed, 23 Aug 2023 15:23:41 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 39
Message-ID: <2023Aug23.172341@mips.complang.tuwien.ac.at>
References: <u9o14h$183or$1@newsreader4.netcologne.de> <u9r0jd$1ftnb$1@dont-email.me> <ublni6$3s8o9$1@dont-email.me> <2023Aug18.082303@mips.complang.tuwien.ac.at> <ubuegr$1nk70$1@dont-email.me> <c61a30d4-9e83-41ce-91a7-72c296f22b20n@googlegroups.com> <2023Aug23.071732@mips.complang.tuwien.ac.at> <b3c8c458-b46c-4a2d-bec5-01d9397c2ad4n@googlegroups.com> <2023Aug23.102035@mips.complang.tuwien.ac.at> <dfa8e23e-eb63-4875-ab1a-7a6869e82e21n@googlegroups.com>
Injection-Info: dont-email.me; posting-host="811b53b66829c2fe281d38fe9bdfa7b6";
logging-data="3142171"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+USjZmXwnVLU1KDVLIk0JL"
Cancel-Lock: sha1:qQ0vwPwG2UKo2YvHOB/ua1niewc=
X-newsreader: xrn 10.11

by: Anton Ertl - Wed, 23 Aug 2023 15:23 UTC

Michael S <already5chosen@yahoo.com> writes:
>On Wednesday, August 23, 2023 at 11:59:30 AM UTC+3, Anton Ertl wrote:
>> In this case: So decades of register-starved architectures (IA-32,
>> AMD64) and architectures with very few callee-saved registers (MIPS,
>> Alpha) have led to applications that are tuned to use few callee-saved
>> registers. And then such applications are used as benchmarks that
>> lead architects to have few callee-saved registers. It's a vicious
>> circle.
>
>I hope that coding style in SpecCpu suite is mostly influenced by factors
>not related to micro-optimizations of such low-level sort.

If any performance work was done at all on these applications that
actually checked whether the changes were effective (i.e., if any
performance work was competently done), natural selection would have
steered the programmers away from programs that need many variables
that live across calls.

And after a while the programmers might learn something; probably not
the real cause, but some twisted caricature of it (e.g., "use only
seven locals, because ..." (twisted and completely nonfactual
explanation elided)), but anyway, it would steer them away from
writing programs in a way that would benefit from more callee-saved
registers. If more callee-saved registers were available on the
platforms they use, they would learn something different.

E.g., when I first explored the idea of stack caching (more than just
using TOS) in Gforth on MIPS and Alpha, I quickly found out that it
would not provide a benefit, and did not go there. However, it was
hard to swallow that these two architectures with their 31 registers
are as starved of usable registers as IA-32 with 8, so eventually I
explored the idea again on PowerPC, and there things looked much
better, so I continued.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Callee-saved registers (was: Intel goes to 32 GPRs)

<uc5dfv$vkk$1@gal.iecc.com>

https://news.novabbs.org/devel/article-flat.php?id=33778&group=comp.arch#33778

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!not-for-mail
From: johnl@taugh.com (John Levine)
Newsgroups: comp.arch
Subject: Re: Callee-saved registers (was: Intel goes to 32 GPRs)
Date: Wed, 23 Aug 2023 16:52:15 -0000 (UTC)
Organization: Taughannock Networks
Message-ID: <uc5dfv$vkk$1@gal.iecc.com>
References: <u9o14h$183or$1@newsreader4.netcologne.de> <2023Aug23.071732@mips.complang.tuwien.ac.at> <b3c8c458-b46c-4a2d-bec5-01d9397c2ad4n@googlegroups.com> <2023Aug23.102035@mips.complang.tuwien.ac.at>
Injection-Date: Wed, 23 Aug 2023 16:52:15 -0000 (UTC)
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="32404"; mail-complaints-to="abuse@iecc.com"
In-Reply-To: <u9o14h$183or$1@newsreader4.netcologne.de> <2023Aug23.071732@mips.complang.tuwien.ac.at> <b3c8c458-b46c-4a2d-bec5-01d9397c2ad4n@googlegroups.com> <2023Aug23.102035@mips.complang.tuwien.ac.at>
Cleverness: some
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Originator: johnl@iecc.com (John Levine)

by: John Levine - Wed, 23 Aug 2023 16:52 UTC

According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
>OTOH, when most architectures were introduced, Hennessy&Patterson had
>not written CA:AQA, so maybe the architects used handwaving for these
>decisions ...

I believe the first architecture that was designed using simulations
of workloads was S/360. After they considered a lot of options
including a stack, it ended up with 16 registers. In retrospect, they
made a few mistakes (no relative branches and the botched floating
point), but having a lot of registers, and making them all usable as
both accumulators and index or base registers was a good move.

--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

Re: Intel goes to 32 GPRs

<71a1224c-ea0f-4dee-af2f-adbd3135fd4dn@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33779&group=comp.arch#33779

X-Received: by 2002:a05:620a:41c:b0:76d:c9c7:dd6d with SMTP id 28-20020a05620a041c00b0076dc9c7dd6dmr55752qkp.6.1692813918063;
Wed, 23 Aug 2023 11:05:18 -0700 (PDT)
X-Received: by 2002:a17:902:cec2:b0:1bb:b39d:8cb0 with SMTP id
d2-20020a170902cec200b001bbb39d8cb0mr6128511plg.1.1692813917855; Wed, 23 Aug
2023 11:05:17 -0700 (PDT)
Path: i2pn2.org!i2pn.org!newsfeed.endofthelinebbs.com!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 23 Aug 2023 11:05:17 -0700 (PDT)
In-Reply-To: <uc45qi$2pk2p$1@newsreader4.netcologne.de>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:7592:2330:8e35:f0f9;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:7592:2330:8e35:f0f9
References: <u9o14h$183or$1@newsreader4.netcologne.de> <u9r0jd$1ftnb$1@dont-email.me>
<ublni6$3s8o9$1@dont-email.me> <07145f85-deff-44ca-bf24-0b754ba07594n@googlegroups.com>
<ubopm8$dn07$1@dont-email.me> <ubpr3d$2iugr$1@newsreader4.netcologne.de>
<2023Aug19.123116@mips.complang.tuwien.ac.at> <AA6EM.518412$TCKc.373270@fx13.iad>
<uc0lga$22c76$1@dont-email.me> <Cf3FM.499505$qnnb.208430@fx11.iad>
<b92f55d2-aceb-495c-a240-faf5f578b5f0n@googlegroups.com> <Td6FM.501194$qnnb.321179@fx11.iad>
<uc45qi$2pk2p$1@newsreader4.netcologne.de>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <71a1224c-ea0f-4dee-af2f-adbd3135fd4dn@googlegroups.com>
Subject: Re: Intel goes to 32 GPRs
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Wed, 23 Aug 2023 18:05:18 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 4262

by: MitchAlsup - Wed, 23 Aug 2023 18:05 UTC

On Wednesday, August 23, 2023 at 12:35:18 AM UTC-5, Thomas Koenig wrote:
> EricP <ThatWould...@thevillage.com> schrieb:
> > MitchAlsup wrote:
>
> >> All of these "function calls" end up being instructions in My 66000 ISA.
> >> But, even those that are not functions take "way less code" than similar
> >> RISC-V compilations of those codes--mainly because these codes are
> >> polynomials with constant coefficients. For example r8_erf() from
> >> polpack only needs 43% as many instructions as it needs in RISC-V
> >> and when the compiler limit on constants in registers limitations are
> >> removed, better still. {and as a side note:: this is a function where
> >> RISC-V with separate GPR and FPR registers files needs spill/fill
> >> code, whereas My 66000 with integrated GPR does not--entirely due
> >> to universal constants}.
> >> <
> >> Straightforward polynomial evaluation (Horner, Estrin) only needs 6-ish
> >> registers (index, pointer, coefficient, power, product and summation)::
> >> SIN(), COS(), Ln() family, exp() family, ATAN(), and POW() all fall into
> >> this general category. I suspect many more straightforward polynomial
> >> evaluations do too.
> >
> > Even with your transcendental instructions, would not these still be
> > implemented as non-inlined subroutines? Because for different languages,
> > Fortran, C, the math functions in particular have different ways of
> > interacting with the run-time environment to report errors.
<
> C actually has no requirement of setting errno on out-of-range
> calls to mathematical functions. You can use -fno-math-errno to
> get this behavior from gcc (and, I believe, clang). Apple chose
> not to include setting errno in MacOS, and they made the right
> decision there - thread safety and vectorization make setting errno
> a performance limiter.
<
I would suggest that any function which accepts every argument value
{-Infinity ... +Infinity, qNaN, and sNaN} and delivers IEEE 754-2019
compliant results needs no errno.
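
A minimal sketch in C of what this looks like from the caller's side, assuming the math function hands back IEEE 754 special values for out-of-domain or out-of-range arguments. The names result_class and classify_result are hypothetical, made up for illustration; only isnan() and isinf() from <math.h> are standard.

```c
#include <math.h>

/* Hypothetical classification of a math result. All the information
   comes from the returned value itself; errno is never consulted. */
typedef enum { RESULT_FINITE, RESULT_INFINITE, RESULT_NAN } result_class;

static result_class classify_result(double y)
{
    if (isnan(y))
        return RESULT_NAN;       /* e.g. an invalid operation */
    if (isinf(y))
        return RESULT_INFINITE;  /* e.g. overflow or a pole */
    return RESULT_FINITE;
}
```

A caller would write something like classify_result(log(x)) and branch on the outcome, without reading or clearing errno.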
>
> Anybody making a new implementation is equally free to not set
> errno on math functions, and this makes good sense on My 666000.
<
Too many 6's
<
> > Eg C/C++ math functions can set errno which is a TLS variable.
> > I don't know what GCC Fortran does for its error status reporting.
<
> Since Fortran has no errno, gfortran in effect just uses
> -fno-math-errno. Range errors are usually treated by returning NaN.
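
For reference, the two polynomial evaluation schemes named in the quoted text (Horner, Estrin) can be sketched in C as below. This is only an illustration under stated assumptions: the function names horner and estrin4 are hypothetical, coefficients are stored lowest degree first, and the Estrin version is written out for a degree-3 polynomial only. It is not taken from polpack or from any My 66000 code.

```c
#include <stddef.h>

/* Horner: c[0] + x*(c[1] + x*(c[2] + ...)).
   One multiply-add per coefficient, evaluated strictly in sequence. */
static double horner(const double *c, size_t n, double x)
{
    double sum = 0.0;
    for (size_t i = n; i-- > 0; )
        sum = sum * x + c[i];
    return sum;
}

/* Estrin, degree 3: (c0 + c1*x) + x^2 * (c2 + c3*x).
   The two inner terms do not depend on each other, so hardware
   can evaluate them in parallel. */
static double estrin4(const double c[4], double x)
{
    double x2 = x * x;
    double lo = c[0] + c[1] * x;
    double hi = c[2] + c[3] * x;
    return lo + x2 * hi;
}
```

Both need only a handful of live values (pointer, index, coefficient, power of x, product, running sum), which is the point made above about register pressure.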

Re: Intel goes to 32 GPRs

<ef5ee740-6d36-43c8-b11d-912a1890f559n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33780&group=comp.arch#33780

X-Received: by 2002:a05:620a:1aa4:b0:762:495d:8f89 with SMTP id bl36-20020a05620a1aa400b00762495d8f89mr176471qkb.2.1692814313851;
Wed, 23 Aug 2023 11:11:53 -0700 (PDT)
X-Received: by 2002:a63:a319:0:b0:565:e467:ef5e with SMTP id
s25-20020a63a319000000b00565e467ef5emr2324796pge.5.1692814313456; Wed, 23 Aug
2023 11:11:53 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 23 Aug 2023 11:11:52 -0700 (PDT)
In-Reply-To: <NcoFM.574645$SuUf.392949@fx14.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:7592:2330:8e35:f0f9;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:7592:2330:8e35:f0f9
References: <u9o14h$183or$1@newsreader4.netcologne.de> <ublni6$3s8o9$1@dont-email.me>
<07145f85-deff-44ca-bf24-0b754ba07594n@googlegroups.com> <ubopm8$dn07$1@dont-email.me>
<ubpr3d$2iugr$1@newsreader4.netcologne.de> <2023Aug19.123116@mips.complang.tuwien.ac.at>
<AA6EM.518412$TCKc.373270@fx13.iad> <uc0lga$22c76$1@dont-email.me>
<Cf3FM.499505$qnnb.208430@fx11.iad> <b92f55d2-aceb-495c-a240-faf5f578b5f0n@googlegroups.com>
<Td6FM.501194$qnnb.321179@fx11.iad> <uc45qi$2pk2p$1@newsreader4.netcologne.de>
<NcoFM.574645$SuUf.392949@fx14.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <ef5ee740-6d36-43c8-b11d-912a1890f559n@googlegroups.com>
Subject: Re: Intel goes to 32 GPRs
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Wed, 23 Aug 2023 18:11:53 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 4723

by: MitchAlsup - Wed, 23 Aug 2023 18:11 UTC

On Wednesday, August 23, 2023 at 8:54:57 AM UTC-5, Scott Lurndal wrote:
> Thomas Koenig <tko...@netcologne.de> writes:
> >EricP <ThatWould...@thevillage.com> schrieb:
> >> MitchAlsup wrote:
> >
> >>> All of these "function calls" end up being instructions in My 66000 ISA.
> >>> But, even those that are not functions take "way less code" than similar
> >>> RISC-V compilations of those codes--mainly because these codes are
> >>> polynomials with constant coefficients. For example r8_erf() from
> >>> polpack only needs 43% as many instructions as it needs in RISC-V
> >>> and, when the compiler's limit on constants in registers is
> >>> removed, better still. {and as a side note:: this is a function where
> >>> RISC-V with separate GPR and FPR register files needs spill/fill
> >>> code, whereas My 66000 with integrated GPR does not--entirely due
> >>> to universal constants}.
> >>> <
> >>> Straightforward polynomial evaluation (Horner, Estrin) only needs 6-ish
> >>> registers (index, pointer, coefficient, power, product and summation)::
> >>> SIN(), COS(), Ln() family, exp() family, ATAN(), and POW() all fall into
> >>> this general category. I suspect many more straightforward polynomial
> >>> evaluations do too.
> >>
> >> Even with your transcendental instructions, would not these still be
> >> implemented as non-inlined subroutines? Because for different languages,
> >> Fortran, C, the math functions in particular have different ways of
> >> interacting with the run-time environment to report errors.
> >
> >C actually has no requirement of setting errno on out-of-range
> >calls to mathematical functions. You can use -fno-math-errno to
> >get this behavior from gcc (and, I believe, clang). Apple chose
> >not to include setting errno in MacOS, and they made the right
> >decision there - thread safety and vectorization make setting errno
> >a performance limiter.
> C doesn't. POSIX has an optional requirement to set errno.
>
> "For all the functions in the <math.h> header, an application
> wishing to check for error situations should set errno to 0 and
> call feclearexcept(FE_ALL_EXCEPT) before calling the function.
<
So errno gets set only if an instruction takes an exception (note:
takes, not merely raises). I am perfectly happy to have the math
exception handlers set errno.
<
> On return, if errno is non-zero or
> fetestexcept( FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW | FE_UNDERFLOW)
> is non-zero, an error has occurred."
<
This remains compliant: an instruction does not set errno itself,
but when it raises an exception which is disabled, these flags are set.
>
> Thread safety requires a thread-local version of errno, which is
> pretty efficient in modern implementations (e.g. an address relative to
> an otherwise unused segment register on Intel).
<
TLS eats R16 in My 66000 ABI.
