
Re: Intel goes to 32 GPRs

<G2sFM.197201$ens9.139215@fx45.iad>

https://news.novabbs.org/devel/article-flat.php?id=33781&group=comp.arch#33781

Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx45.iad.POSTED!not-for-mail
From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Intel goes to 32 GPRs
References: <u9o14h$183or$1@newsreader4.netcologne.de> <u9r0jd$1ftnb$1@dont-email.me> <ublni6$3s8o9$1@dont-email.me> <07145f85-deff-44ca-bf24-0b754ba07594n@googlegroups.com> <ubopm8$dn07$1@dont-email.me> <ubpr3d$2iugr$1@newsreader4.netcologne.de> <2023Aug19.123116@mips.complang.tuwien.ac.at> <AA6EM.518412$TCKc.373270@fx13.iad> <uc0lga$22c76$1@dont-email.me> <Cf3FM.499505$qnnb.208430@fx11.iad> <b92f55d2-aceb-495c-a240-faf5f578b5f0n@googlegroups.com> <Td6FM.501194$qnnb.321179@fx11.iad> <uc45qi$2pk2p$1@newsreader4.netcologne.de>
In-Reply-To: <uc45qi$2pk2p$1@newsreader4.netcologne.de>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 59
Message-ID: <G2sFM.197201$ens9.139215@fx45.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Wed, 23 Aug 2023 18:17:10 UTC
Date: Wed, 23 Aug 2023 14:16:37 -0400
X-Received-Bytes: 4139

by: EricP - Wed, 23 Aug 2023 18:16 UTC

Thomas Koenig wrote:
> EricP <ThatWouldBeTelling@thevillage.com> schrieb:
>> MitchAlsup wrote:
>
>>> All of these "function calls" end up being instructions in My 66000 ISA.
>>> But, even those that are not functions take "way less code" than similar
>>> RISC-V compilations of those codes--mainly because these codes are
>>> polynomials with constant coefficients. For example r8_erf() from
>>> polpack only needs 43% as many instructions as it needs in RISC-V
>>> and when the compiler limitations on constants in registers are
>>> removed, better still. {and as a side note:: this is a function where
>>> RISC-V with separate GPR and FPR register files needs spill/fill
>>> code, whereas My 66000 with integrated GPR does not--entirely due
>>> to universal constants}.
>>> <
>>> Straightforward polynomial evaluation (Horner, Estrin) only needs 6-ish
>>> registers (index, pointer, coefficient, power, product and summation)::
>>> SIN(), COS(), Ln() family, exp() family, ATAN(), and POW() all fall into
>>> this general category. I suspect many more straightforward polynomial
>>> evaluations do too.
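
For illustration only (this is not code from the thread), Horner's rule can be sketched in C as below. The name `horner` and the coefficient layout (lowest order first) are assumptions. The loop keeps only an index, a pointer, the current coefficient, `x`, and the running sum live, which is in line with the "6-ish registers" estimate; a separate power term would only appear in an Estrin-style evaluation.

```c
#include <stddef.h>

/* Evaluate c[0] + c[1]*x + ... + c[n-1]*x^(n-1) using Horner's rule.
 * Coefficients are stored lowest order first (an assumption of this sketch). */
double horner(const double *c, size_t n, double x)
{
    double sum = 0.0;

    /* Walk from the highest-order coefficient down to c[0]. */
    for (size_t i = n; i-- > 0; )
        sum = sum * x + c[i];

    return sum;
}
```

Each iteration is one multiply and one add, so an ISA with fused multiply-add and constant operands can fold the coefficient load into the arithmetic instruction.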
>> Even with your transcendental instructions, would not these still be
>> implemented as non-inlined subroutines? Because for different languages,
>> Fortran, C, the math functions in particular have different ways of
>> interacting with the run-time environment to report errors.
>
> C actually has no requirement of setting errno on out-of-range
> calls to mathematical functions. You can use -fno-math-errno to
> get this behavior from gcc (and, I believe, clang). Apple chose
> not to include setting errno in MacOS, and they made the right
> decision there - thread safety and vectorization make setting errno
> a performance limiter.
>
> Anybody making a new implementation is equally free to not set
> errno on math functions, and this makes good sense on My 66000.
>
>> Eg C/C++ math functions can set errno which is a TLS variable.
>> I don't know what GCC Fortran does for its error status reporting.
>
> Since Fortran has no errno, gfortran in effect just uses
> -fno-math-errno. Range errors are usually treated by returning NaN.

Seems to be a bit of a dog's breakfast. There is errno, but also there
is the matherr() callback function that can provide more details
and may set errno. On Windows the matherr function is apparently
invoked by the structured exception handler, whereas *nix systems
appear to just call that function directly. Or one can call
fetestexcept() to test for a previous FP exception.

And since errno is a thread_local variable then each DLL gets its
own copy so it matters where the library function was called and
whether it was linked with a static OBJ library or DLL as to
which errno gets set.

Anyway, my point was that the error handling code for each of these
functions would likely be large enough that one would not want it inlined,
as it would have to map all the errors into some language-specific reporting.

Re: Callee-saved registers (was: Intel goes to 32 GPRs)

<24e5dad3-9672-4bd1-bc2e-b51b0f4ea5a4n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33782&group=comp.arch#33782

X-Received: by 2002:a05:620a:8c85:b0:76d:9bb7:c867 with SMTP id ra5-20020a05620a8c8500b0076d9bb7c867mr116000qkn.0.1692815159271;
Wed, 23 Aug 2023 11:25:59 -0700 (PDT)
X-Received: by 2002:a17:90b:104:b0:262:de4e:3967 with SMTP id
p4-20020a17090b010400b00262de4e3967mr3457938pjz.0.1692815158380; Wed, 23 Aug
2023 11:25:58 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 23 Aug 2023 11:25:57 -0700 (PDT)
In-Reply-To: <2023Aug23.172341@mips.complang.tuwien.ac.at>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:7592:2330:8e35:f0f9;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:7592:2330:8e35:f0f9
References: <u9o14h$183or$1@newsreader4.netcologne.de> <u9r0jd$1ftnb$1@dont-email.me>
<ublni6$3s8o9$1@dont-email.me> <2023Aug18.082303@mips.complang.tuwien.ac.at>
<ubuegr$1nk70$1@dont-email.me> <c61a30d4-9e83-41ce-91a7-72c296f22b20n@googlegroups.com>
<2023Aug23.071732@mips.complang.tuwien.ac.at> <b3c8c458-b46c-4a2d-bec5-01d9397c2ad4n@googlegroups.com>
<2023Aug23.102035@mips.complang.tuwien.ac.at> <dfa8e23e-eb63-4875-ab1a-7a6869e82e21n@googlegroups.com>
<2023Aug23.172341@mips.complang.tuwien.ac.at>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <24e5dad3-9672-4bd1-bc2e-b51b0f4ea5a4n@googlegroups.com>
Subject: Re: Callee-saved registers (was: Intel goes to 32 GPRs)
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Wed, 23 Aug 2023 18:25:59 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

by: MitchAlsup - Wed, 23 Aug 2023 18:25 UTC

On Wednesday, August 23, 2023 at 10:45:48 AM UTC-5, Anton Ertl wrote:
> Michael S <already...@yahoo.com> writes:
> >On Wednesday, August 23, 2023 at 11:59:30 AM UTC+3, Anton Ertl wrote:
> >> In this case: So decades of register-starved architectures (IA-32,
> >> AMD64) and architectures with very few callee-saved registers (MIPS,
> >> Alpha) have led to applications that are tuned to use few callee-saved
> >> registers. And then such applications are used as benchmarks that
> >> lead architects to have few callee-saved registers. It's a vicious
> >> circle.
> >
> >I hope that coding style in SpecCpu suite is mostly influenced by factors
> >not related to micro-optimizations of such low-level sort.
<
> If any performance work was done at all on these applications that
> actually checked whether the changes were effective (i.e., if any
> performance work was competently done), natural selection would have
> steered the programmers away from programs that need many variables
> that live across calls.
<
In my last big simulator, we had a SYS->CHIP[j]->CPU[k] structure and *CPU
would be passed around to the various cores. When executing instructions
the register files and function unit states would be passed around in said
structure by stripping off irrelevant upper layer details. Each called function
receives an address of that part of the data structure it had the capability
to manipulate and each returned an error code. Those functions had a few
arguments {0,1,2, or 3}. But, overall, and on average the number of preserved
registers was not much more than SP and FP.
<
We had similar structure hanging off CHIP[j]->Interconnect[n] so that the
interconnect and device hierarchy could be accurately simulated. SYS
provided the CHIP[j]->CHIP[m] routing.
<
The purpose of the simulator was not speed but multi-CPU cycle accuracy
with as much speed as we could get. Passing stripped down data structures
worked well for this design.
<
This 2M+ line (C) simulator needed only a handful of preserved registers,
and ran perfectly well on 32-bit x86 with its <then> register file.
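
A rough sketch of the shape described above, under stated assumptions: the sizes, field names, and the `sim_move` operation are invented for illustration, and the sub-structures are embedded arrays here rather than the pointer chain `SYS->CHIP[j]->CPU[k]` used in the real simulator.

```c
/* Hypothetical sizes; the real simulator's dimensions are not given. */
enum { NUM_CHIPS = 2, CPUS_PER_CHIP = 4, NUM_REGS = 32 };

struct cpu  { unsigned long reg[NUM_REGS]; unsigned long pc; };
struct chip { struct cpu cpu[CPUS_PER_CHIP]; };
struct sys  { struct chip chip[NUM_CHIPS]; };

enum sim_status { SIM_OK = 0, SIM_BAD_REG = 1 };

/* Each function receives only the part of the hierarchy it is allowed
 * to manipulate, takes a few arguments, and returns an error code. */
enum sim_status sim_move(struct cpu *cpu, unsigned rd, unsigned rs)
{
    if (rd >= NUM_REGS || rs >= NUM_REGS)
        return SIM_BAD_REG;

    cpu->reg[rd] = cpu->reg[rs];
    cpu->pc += 4;
    return SIM_OK;
}
```

A caller would strip the upper layers with something like `sim_move(&sys.chip[j].cpu[k], 5, 3)`. Because nearly all state is reached through that one pointer, few values need to stay in registers across calls.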
>
> And after a while the programmers might learn something; probably not
> the real cause, but some twisted caricature of it (e.g., "use only
> seven locals, because ..." (twisted and completely nonfactual
> explanation elided)), but anyway, it would steer them away from
> writing programs in a way that would benefit from more callee-saved
> registers. If more callee-saved registers were available on the
> platforms they use, they would learn something different.
>
> E.g., when I first explored the idea of stack caching (more than just
> using TOS) in Gforth on MIPS and Alpha, I quickly found out that it
> would not provide a benefit, and did not go there. However, it was
> hard to swallow that these two architectures with their 31 registers
> are as starved of usable registers as IA-32 with 8, so eventually I
> explored the idea again on PowerPC, and there things looked much
> better, so I continued.
> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Re: Intel goes to 32 GPRs

<9msFM.617377$mPI2.129202@fx15.iad>

https://news.novabbs.org/devel/article-flat.php?id=33783&group=comp.arch#33783

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!feeder1.feed.usenet.farm!feed.usenet.farm!peer01.ams4!peer.am4.highwinds-media.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx15.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Intel goes to 32 GPRs
Newsgroups: comp.arch
References: <u9o14h$183or$1@newsreader4.netcologne.de> <ubopm8$dn07$1@dont-email.me> <ubpr3d$2iugr$1@newsreader4.netcologne.de> <2023Aug19.123116@mips.complang.tuwien.ac.at> <AA6EM.518412$TCKc.373270@fx13.iad> <uc0lga$22c76$1@dont-email.me> <Cf3FM.499505$qnnb.208430@fx11.iad> <b92f55d2-aceb-495c-a240-faf5f578b5f0n@googlegroups.com> <Td6FM.501194$qnnb.321179@fx11.iad> <uc45qi$2pk2p$1@newsreader4.netcologne.de> <NcoFM.574645$SuUf.392949@fx14.iad> <ef5ee740-6d36-43c8-b11d-912a1890f559n@googlegroups.com>
Lines: 81
Message-ID: <9msFM.617377$mPI2.129202@fx15.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Wed, 23 Aug 2023 18:37:57 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Wed, 23 Aug 2023 18:37:57 GMT
X-Received-Bytes: 4759

by: Scott Lurndal - Wed, 23 Aug 2023 18:37 UTC

MitchAlsup <MitchAlsup@aol.com> writes:
>On Wednesday, August 23, 2023 at 8:54:57 AM UTC-5, Scott Lurndal wrote:
>> Thomas Koenig <tko...@netcologne.de> writes:
>> >EricP <ThatWould...@thevillage.com> schrieb:
>> >> MitchAlsup wrote:
>> >
>> >>> All of these "function calls" end up being instructions in My 66000 ISA.
>> >>> But, even those that are not functions take "way less code" than similar
>> >>> RISC-V compilations of those codes--mainly because these codes are
>> >>> polynomials with constant coefficients. For example r8_erf() from
>> >>> polpack only needs 43% as many instructions as it needs in RISC-V
>> >>> and when the compiler limitations on constants in registers are
>> >>> removed, better still. {and as a side note:: this is a function where
>> >>> RISC-V with separate GPR and FPR register files needs spill/fill
>> >>> code, whereas My 66000 with integrated GPR does not--entirely due
>> >>> to universal constants}.
>> >>> <
>> >>> Straightforward polynomial evaluation (Horner, Estrin) only needs 6-ish
>> >>> registers (index, pointer, coefficient, power, product and summation)::
>> >>> SIN(), COS(), Ln() family, exp() family, ATAN(), and POW() all fall into
>> >>> this general category. I suspect many more straightforward polynomial
>> >>> evaluations do too.
>> >>
>> >> Even with your transcendental instructions, would not these still be
>> >> implemented as non-inlined subroutines? Because for different languages,
>> >> Fortran, C, the math functions in particular have different ways of
>> >> interacting with the run-time environment to report errors.
>> >
>> >C actually has no requirement of setting errno on out-of-range
>> >calls to mathematical functions. You can use -fno-math-errno to
>> >get this behavior from gcc (and, I believe, clang). Apple chose
>> >not to include setting errno in MacOS, and they made the right
>> >decision there - thread safety and vectorization make setting errno
>> >a performance limiter.
>> C doesn't. POSIX has an optional requirement to set errno.
>>
>> "For all the functions in the <math.h> header, an application
>> wishing to check for error situations should set errno to 0 and
>> call feclearexcept(FE_ALL_EXCEPT) before calling the function.
><
>So, only if an instruction takes an exception (note not raises; takes)
>then errno gets set. I am perfectly happy to have the math exception
>handlers set errno.

One of the common errno values is EDOM "Mathematics argument out of domain of function",
which is also a C99 errno value. Another is ERANGE. EDOM covers the
input datum, ERANGE the output of the function.

Do you detect those cases in your hardware implementation?

><
>> On return, if errno is non-zero or
>> fetestexcept( FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW | FE_UNDERFLOW)
>> is non-zero, an error has occurred."
><
>This remains compliant since an instruction does not set errno
>but when it raises an exception which is disabled, these flags are set.
>>
>> Thread safety requires a thread-local version of errno, which is
>> pretty efficient in modern implementations (e.g. an address relative to
>> an otherwise unused segment register on Intel).
><
>TLS eats R16 in My 66000 ABI.

The psABI for most RISC processors reserves a register for TLS. If I recall
correctly (and my 88open BCS is in a box somewhere), one of the linker
reserved registers (r26,r27,r28,r29) was used for TLS.

Re: Intel goes to 32 GPRs

<EvsFM.617381$mPI2.1857@fx15.iad>

https://news.novabbs.org/devel/article-flat.php?id=33784&group=comp.arch#33784

Path: i2pn2.org!i2pn.org!news.neodome.net!feeder1.feed.usenet.farm!feed.usenet.farm!peer03.ams4!peer.am4.highwinds-media.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx15.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Intel goes to 32 GPRs
Newsgroups: comp.arch
References: <u9o14h$183or$1@newsreader4.netcologne.de> <07145f85-deff-44ca-bf24-0b754ba07594n@googlegroups.com> <ubopm8$dn07$1@dont-email.me> <ubpr3d$2iugr$1@newsreader4.netcologne.de> <2023Aug19.123116@mips.complang.tuwien.ac.at> <AA6EM.518412$TCKc.373270@fx13.iad> <uc0lga$22c76$1@dont-email.me> <Cf3FM.499505$qnnb.208430@fx11.iad> <b92f55d2-aceb-495c-a240-faf5f578b5f0n@googlegroups.com> <Td6FM.501194$qnnb.321179@fx11.iad> <uc45qi$2pk2p$1@newsreader4.netcologne.de> <G2sFM.197201$ens9.139215@fx45.iad>
Lines: 19
Message-ID: <EvsFM.617381$mPI2.1857@fx15.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Wed, 23 Aug 2023 18:48:04 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Wed, 23 Aug 2023 18:48:04 GMT
X-Received-Bytes: 1933

by: Scott Lurndal - Wed, 23 Aug 2023 18:48 UTC

EricP <ThatWouldBeTelling@thevillage.com> writes:
>Thomas Koenig wrote:

>
>And since errno is a thread_local variable then each DLL gets its
>own copy so it matters where the library function was called and
>whether it was linked with a static OBJ library or DLL as to
>which errno gets set.

Is Windows really that screwed up? The thread local variables should
be managed by the OS and just referenced by the libraries
and the executable. Particularly process global variables like
errno where the 'thread local data' nature of it varies based
on whether the application is built as a single threaded app
or a multithreaded app but has nothing to do with libraries.

Yes, it's really that screwed up.

https://learn.microsoft.com/en-us/windows/win32/dlls/using-thread-local-storage-in-a-dynamic-link-library

John Levine <johnl@taugh.com> schrieb:

> There'd be some tool-building work to check that the register sets
> don't change when rebuilding a library if we make them part of the
> interface. But it doesn't seem that hard, just compare old and new
> libraries and fail if the new one is more restrictive. Then of course
> there's the issue of what to do, probably a new compiler flag or
> pragma saying only use these registers, which would be ugly.

gcc already has -fcall-used-REG and -fcall-saved-REG, where it is
possible to specify roles for registers (but not for the frame
pointer on aarch64).

This is ABI-changing, but (if all libraries are compiled in)
could be used to experiment which split between callee-saved and
caller-saved registers produces best results.

Re: Intel goes to 32 GPRs

<2724aa4c-f806-4435-b0df-4c3934e0a395n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33786&group=comp.arch#33786

X-Received: by 2002:a05:6214:3216:b0:635:e500:8dc7 with SMTP id qj22-20020a056214321600b00635e5008dc7mr190656qvb.4.1692821663781;
Wed, 23 Aug 2023 13:14:23 -0700 (PDT)
X-Received: by 2002:a17:902:e54b:b0:1b8:a134:6fcb with SMTP id
n11-20020a170902e54b00b001b8a1346fcbmr6437382plf.7.1692821663572; Wed, 23 Aug
2023 13:14:23 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 23 Aug 2023 13:14:23 -0700 (PDT)
In-Reply-To: <9msFM.617377$mPI2.129202@fx15.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:4ccc:30d:9483:976c;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:4ccc:30d:9483:976c
References: <u9o14h$183or$1@newsreader4.netcologne.de> <ubopm8$dn07$1@dont-email.me>
<ubpr3d$2iugr$1@newsreader4.netcologne.de> <2023Aug19.123116@mips.complang.tuwien.ac.at>
<AA6EM.518412$TCKc.373270@fx13.iad> <uc0lga$22c76$1@dont-email.me>
<Cf3FM.499505$qnnb.208430@fx11.iad> <b92f55d2-aceb-495c-a240-faf5f578b5f0n@googlegroups.com>
<Td6FM.501194$qnnb.321179@fx11.iad> <uc45qi$2pk2p$1@newsreader4.netcologne.de>
<NcoFM.574645$SuUf.392949@fx14.iad> <ef5ee740-6d36-43c8-b11d-912a1890f559n@googlegroups.com>
<9msFM.617377$mPI2.129202@fx15.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <2724aa4c-f806-4435-b0df-4c3934e0a395n@googlegroups.com>
Subject: Re: Intel goes to 32 GPRs
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Wed, 23 Aug 2023 20:14:23 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 6508

by: MitchAlsup - Wed, 23 Aug 2023 20:14 UTC

On Wednesday, August 23, 2023 at 1:38:02 PM UTC-5, Scott Lurndal wrote:
> MitchAlsup <Mitch...@aol.com> writes:
> >On Wednesday, August 23, 2023 at 8:54:57 AM UTC-5, Scott Lurndal wrote:
> >> Thomas Koenig <tko...@netcologne.de> writes:
> >> >EricP <ThatWould...@thevillage.com> schrieb:
> >> >> MitchAlsup wrote:
> >> >
> >> >>> All of these "function calls" end up being instructions in My 66000 ISA.
> >> >>> But, even those that are not functions take "way less code" than similar
> >> >>> RISC-V compilations of those codes--mainly because these codes are
> >> >>> polynomials with constant coefficients. For example r8_erf() from
> >> >>> polpack only needs 43% as many instructions as it needs in RISC-V
> >> >>> and when the compiler limitations on constants in registers are
> >> >>> removed, better still. {and as a side note:: this is a function where
> >> >>> RISC-V with separate GPR and FPR register files needs spill/fill
> >> >>> code, whereas My 66000 with integrated GPR does not--entirely due
> >> >>> to universal constants}.
> >> >>> <
> >> >>> Straightforward polynomial evaluation (Horner, Estrin) only needs 6-ish
> >> >>> registers (index, pointer, coefficient, power, product and summation)::
> >> >>> SIN(), COS(), Ln() family, exp() family, ATAN(), and POW() all fall into
> >> >>> this general category. I suspect many more straightforward polynomial
> >> >>> evaluations do too.
> >> >>
> >> >> Even with your transcendental instructions, would not these still be
> >> >> implemented as non-inlined subroutines? Because for different languages,
> >> >> Fortran, C, the math functions in particular have different ways of
> >> >> interacting with the run-time environment to report errors.
> >> >
> >> >C actually has no requirement of setting errno on out-of-range
> >> >calls to mathematical functions. You can use -fno-math-errno to
> >> >get this behavior from gcc (and, I believe, clang). Apple chose
> >> >not to include setting errno in MacOS, and they made the right
> >> >decision there - thread safety and vectorization make setting errno
> >> >a performance limiter.
> >> C doesn't. POSIX has an optional requirement to set errno.
> >>
> >> "For all the functions in the <math.h> header, an application
> >> wishing to check for error situations should set errno to 0 and
> >> call feclearexcept(FE_ALL_EXCEPT) before calling the function.
> ><
> >So, only if an instruction takes an exception (note not raises; takes)
> >then errno gets set. I am perfectly happy to have the math exception
> >handlers set errno.
<
> One of the common errno values is EDOM "Mathematics argument out of domain of function",
> which is also a C99 errno value. Another is ERANGE. EDOM covers the
> input datum, ERANGE the output of the function.
<
SIN, COS, EXP family, Ln family take all numeric bit-patterns and deliver proper
IEEE 754-2019 results.
<
TAN takes all numeric bit-patterns and delivers proper IEEE 754-2019 results.
ATAN delivers quiet NaNs for out of domain arguments.
ATAN2 and POW properly consider the 10 specified special cases.
>
> Do you detect those cases in your hardware implementation?
>
If you want errno set, you call the function rather than use the instruction.
{and at considerable execution costs; you want it == you pay for it}
>
>
> ><
> >> On return, if errno is non-zero or
> >> fetestexcept( FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW | FE_UNDERFLOW)
> >> is non-zero, an error has occurred."
> ><
> >This remains compliant since an instruction does not set errno
> >but when it raises an exception which is disabled, these flags are set.
> >>
> >> Thread safety requires a thread-local version of errno, which is
> >> pretty efficient in modern implementations (e.g. an address relative to
> >> an otherwise unused segment register on Intel).
> ><
> >TLS eats R16 in My 66000 ABI.
<
> The psABI for most RISC processors reserves a register for TLS. If I recall
> correctly (and my 88open BCS is in a box somewhere), one of the linker
> reserved registers (r26,r27,r28,r29) was used for TLS.
<
My 66000 ABI has no registers reserved from the application.

Re: Intel goes to 32 GPRs

<6371dced-3fe0-4223-99d2-8cb2587e708bn@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33787&group=comp.arch#33787

X-Received: by 2002:a05:622a:1a15:b0:410:a9dd:bcf9 with SMTP id f21-20020a05622a1a1500b00410a9ddbcf9mr107173qtb.4.1692821705885;
Wed, 23 Aug 2023 13:15:05 -0700 (PDT)
X-Received: by 2002:a05:6a00:234b:b0:68a:613e:a360 with SMTP id
j11-20020a056a00234b00b0068a613ea360mr3322015pfj.0.1692821705482; Wed, 23 Aug
2023 13:15:05 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 23 Aug 2023 13:15:04 -0700 (PDT)
In-Reply-To: <EvsFM.617381$mPI2.1857@fx15.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:4ccc:30d:9483:976c;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:4ccc:30d:9483:976c
References: <u9o14h$183or$1@newsreader4.netcologne.de> <07145f85-deff-44ca-bf24-0b754ba07594n@googlegroups.com>
<ubopm8$dn07$1@dont-email.me> <ubpr3d$2iugr$1@newsreader4.netcologne.de>
<2023Aug19.123116@mips.complang.tuwien.ac.at> <AA6EM.518412$TCKc.373270@fx13.iad>
<uc0lga$22c76$1@dont-email.me> <Cf3FM.499505$qnnb.208430@fx11.iad>
<b92f55d2-aceb-495c-a240-faf5f578b5f0n@googlegroups.com> <Td6FM.501194$qnnb.321179@fx11.iad>
<uc45qi$2pk2p$1@newsreader4.netcologne.de> <G2sFM.197201$ens9.139215@fx45.iad>
<EvsFM.617381$mPI2.1857@fx15.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <6371dced-3fe0-4223-99d2-8cb2587e708bn@googlegroups.com>
Subject: Re: Intel goes to 32 GPRs
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Wed, 23 Aug 2023 20:15:05 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

by: MitchAlsup - Wed, 23 Aug 2023 20:15 UTC

On Wednesday, August 23, 2023 at 1:48:09 PM UTC-5, Scott Lurndal wrote:
> EricP <ThatWould...@thevillage.com> writes:
> >Thomas Koenig wrote:
>
> >
> >And since errno is a thread_local variable then each DLL gets its
> >own copy so it matters where the library function was called and
> >whether it was linked with a static OBJ library or DLL as to
> >which errno gets set.
> Is Windows really that screwed up? The thread local variables should
> be managed by the OS and just referenced by the libraries
> and the executable. Particularly process global variables like
> errno where the 'thread local data' nature of it varies based
> on whether the application is built as a single threaded app
> or a multithreaded app but has nothing to do with libraries.
>
> Yes, it's really that screwed up.
>
> https://learn.microsoft.com/en-us/windows/win32/dlls/using-thread-local-storage-in-a-dynamic-link-library
<
It is worse than that, it is part of C++, too.

Re: Intel goes to 32 GPRs

<uc5pp3$2qk1u$1@newsreader4.netcologne.de>

https://news.novabbs.org/devel/article-flat.php?id=33788&group=comp.arch#33788

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd4-f2a7-0-3c5b-97e0-1685-9f7e.ipv6dyn.netcologne.de!not-for-mail
From: tkoenig@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Intel goes to 32 GPRs
Date: Wed, 23 Aug 2023 20:21:55 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <uc5pp3$2qk1u$1@newsreader4.netcologne.de>
References: <u9o14h$183or$1@newsreader4.netcologne.de>
<ublni6$3s8o9$1@dont-email.me>
<07145f85-deff-44ca-bf24-0b754ba07594n@googlegroups.com>
<ubopm8$dn07$1@dont-email.me> <ubpr3d$2iugr$1@newsreader4.netcologne.de>
<2023Aug19.123116@mips.complang.tuwien.ac.at>
<AA6EM.518412$TCKc.373270@fx13.iad> <uc0lga$22c76$1@dont-email.me>
<Cf3FM.499505$qnnb.208430@fx11.iad>
<b92f55d2-aceb-495c-a240-faf5f578b5f0n@googlegroups.com>
<Td6FM.501194$qnnb.321179@fx11.iad>
<uc45qi$2pk2p$1@newsreader4.netcologne.de>
<NcoFM.574645$SuUf.392949@fx14.iad>
<ef5ee740-6d36-43c8-b11d-912a1890f559n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 23 Aug 2023 20:21:55 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd4-f2a7-0-3c5b-97e0-1685-9f7e.ipv6dyn.netcologne.de:2001:4dd4:f2a7:0:3c5b:97e0:1685:9f7e";
logging-data="2969662"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)

by: Thomas Koenig - Wed, 23 Aug 2023 20:21 UTC

MitchAlsup <MitchAlsup@aol.com> schrieb:
> On Wednesday, August 23, 2023 at 8:54:57 AM UTC-5, Scott Lurndal wrote:

>> "For all the functions in the <math.h> header, an application
>> wishing to check for error situations should set errno to 0 and
>> call feclearexcept(FE_ALL_EXCEPT) before calling the function.
><
> So, only if an instruction takes an exception (note not raises; takes)
> then errno gets set. I am perfectly happy to have the math exception
> handlers set errno.

It does not have to be set...

><
>> On return, if errno is non-zero or
>> fetestexcept( FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW | FE_UNDERFLOW)
>> is non-zero, an error has occurred."
><
> This remains compliant since an instruction does not set errno
> but when it raises an exception which is disabled, these flags are set.

An implementation that does neither remains compliant - if both
conditions are false (i.e. errno is zero and fetestexcept returns
zero), then the user program does not know whether an error has
occurred or not.

For portable code, this is singularly useless - checking for NaN
is likely to be much more portable these days.

And it would be interesting to see if the vectorized libraries which
implement sin(), cos() and friends do actually set errno or not.

Apple doesn't.

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
> On 8/22/2023 12:07 PM, John Levine wrote:

>> This is why sensible libraries use version numbers with major numbers
>> meaning an interface change and minor numbers just a bug fix. You
>> relink your code to use a new major version of the library. Every Unix
>> and Linux system does this.
>
> But it isn't clear that changing the registers used internally is an
> "interface change".

It is an ABI (application binary interface) change - the caller
has to do something different, depending on what the callee is doing.

> It certainly isn't a change in the source code of
> the interface. Or, to put it another way, a "bug fix" may change the
> register usage without changing what is typically called the "interface".

It is something that could well be exposed to the user. Say you want
to roll your own with Intel's APX. You could then specify something
like

__attribute__((caller_saved(r16-r20))) foo(int *p);

in a header file, and both the caller and callee would be informed about
this.

Just don't forget to include the header file.

Re: Intel goes to 32 GPRs

<40acf32b-7470-40f0-a432-7da29140b8b1n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33790&group=comp.arch#33790

X-Received: by 2002:a05:622a:1812:b0:40f:dc70:30de with SMTP id t18-20020a05622a181200b0040fdc7030demr173889qtc.5.1692829053153;
Wed, 23 Aug 2023 15:17:33 -0700 (PDT)
X-Received: by 2002:a17:902:fb05:b0:1b8:a593:7568 with SMTP id
le5-20020a170902fb0500b001b8a5937568mr4926541plb.8.1692829052861; Wed, 23 Aug
2023 15:17:32 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 23 Aug 2023 15:17:32 -0700 (PDT)
In-Reply-To: <uc5lr4$2qhfa$1@newsreader4.netcologne.de>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:4ccc:30d:9483:976c;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:4ccc:30d:9483:976c
References: <u9o14h$183or$1@newsreader4.netcologne.de> <uc0lga$22c76$1@dont-email.me>
<uc310n$umu$1@gal.iecc.com> <uc32dt$2ga95$2@dont-email.me>
<uc343j$161u$1@gal.iecc.com> <uc5lr4$2qhfa$1@newsreader4.netcologne.de>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <40acf32b-7470-40f0-a432-7da29140b8b1n@googlegroups.com>
Subject: Re: Intel goes to 32 GPRs
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Wed, 23 Aug 2023 22:17:33 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 2851

by: MitchAlsup - Wed, 23 Aug 2023 22:17 UTC

On Wednesday, August 23, 2023 at 2:14:47 PM UTC-5, Thomas Koenig wrote:
> John Levine <jo...@taugh.com> schrieb:
> > There's be some tool building work to be check that the register sets
> > don't change when rebuilding a library if we make them part of the
> > interface. But it doesn't seem that hard, just compare old and new
> > libraries and fail if the new one is more restrictive. Then of course
> > there's the issue of what to do, probably a new compiler flag or
> > pragma saying only use these registers, which I would be ugly.
<
> gcc already has -fcall-used-REG and -fcall-saved-REG, where it is
> possible to specify roles for registers (but not for the frame
> pointer on aarch64).
>
> This is ABI-changing, but (if all libraries are compiled in)
> could be used to experiment which split between callee-saved and
> caller-saved registers produces best results.
<
It seems to me that {caller, callee} save is insufficient, considering
that one may need TLS in a register that is not modified, saved, or restored
except under "very special" circumstances, where "very special" basically
means the compiler is not doing any of that.
<
One of the reasons I put TLS in R16 was that the compiler could be taught
not to use R16 for anything other than TLS. This removes
one preserved register, but it still gives a similar range [R16,R17]..[R30,R31]
as the preserved set, depending on whether TLS and FP are in use.

Re: Intel goes to 32 GPRs

<taLFM.157982$8_8a.131964@fx48.iad>

https://news.novabbs.org/devel/article-flat.php?id=33799&group=comp.arch#33799

Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx48.iad.POSTED!not-for-mail
From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Intel goes to 32 GPRs
References: <u9o14h$183or$1@newsreader4.netcologne.de> <07145f85-deff-44ca-bf24-0b754ba07594n@googlegroups.com> <ubopm8$dn07$1@dont-email.me> <ubpr3d$2iugr$1@newsreader4.netcologne.de> <2023Aug19.123116@mips.complang.tuwien.ac.at> <AA6EM.518412$TCKc.373270@fx13.iad> <uc0lga$22c76$1@dont-email.me> <Cf3FM.499505$qnnb.208430@fx11.iad> <b92f55d2-aceb-495c-a240-faf5f578b5f0n@googlegroups.com> <Td6FM.501194$qnnb.321179@fx11.iad> <uc45qi$2pk2p$1@newsreader4.netcologne.de> <G2sFM.197201$ens9.139215@fx45.iad> <EvsFM.617381$mPI2.1857@fx15.iad>
In-Reply-To: <EvsFM.617381$mPI2.1857@fx15.iad>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 57
Message-ID: <taLFM.157982$8_8a.131964@fx48.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Thu, 24 Aug 2023 16:02:33 UTC
Date: Thu, 24 Aug 2023 12:00:36 -0400
X-Received-Bytes: 3759

by: EricP - Thu, 24 Aug 2023 16:00 UTC

Scott Lurndal wrote:
> EricP <ThatWouldBeTelling@thevillage.com> writes:
>> Thomas Koenig wrote:
>
>> And since errno is a thread_local variable then each DLL gets its
>> own copy so it matters where the library function was called and
>> whether it was linked with a static OBJ library or DLL as to
>> which errno gets set.
>
> Is windows really that screwed up? The thread local variables should
> be managed by the OS and just referenced by the libraries
> and the executable. Particularly process global variables like
> errno where the 'thread local data' nature of it varies based
> on whether the application is built as a single threaded app
> or a multithreaded app but has nothing to do with libraries.

It's not Microsoft-specific - it could happen with shared libraries on any OS.
It's a consequence of mixing global variables and dynamic link libraries
(what *nix calls shared libraries): each linkage unit gets its own
copy of any global variables referenced by routines in the linkage unit.

TLS doesn't change this because each DLL defines its own data sets for
its own TLS variables.

The same thing can happen for the C RTL locale global variables
or the default memory heap.

The way to avoid it is to not pass arguments and return values to
subroutines as global variable values, TLS or not.
Write properly re-entrant subroutines that access only their arguments
and return a function value and this doesn't happen.
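
A minimal sketch of that style in C. The routine name checked_div and the
status enum are hypothetical, made up for illustration and not taken from
any real library. The point is only that the result and the error both
travel through the call itself, so no errno-like global is involved:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical status codes for this sketch; not from any real library. */
typedef enum { CALC_OK = 0, CALC_DIV_BY_ZERO, CALC_OVERFLOW } calc_status;

/*
 * Re-entrant division: the result goes through 'out' and the error
 * comes back as the return value. No global or thread-local state is
 * read or written, so it does not matter which EXE or DLL the caller
 * and the callee were linked into.
 */
calc_status checked_div(int num, int den, int *out)
{
    assert(out != NULL);
    if (den == 0)
        return CALC_DIV_BY_ZERO;
    if (num == INT_MIN && den == -1)
        return CALC_OVERFLOW;   /* INT_MIN / -1 is not representable */
    *out = num / den;
    return CALC_OK;
}
```

A caller checks the returned status and only then reads *out; on failure
*out is left untouched.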

> Yes, it's really that screwed up.
>
> https://learn.microsoft.com/en-us/windows/win32/dlls/using-thread-local-storage-in-a-dynamic-link-library

That is a warning to apps that load DLLs themselves using LoadLibrary
*after* the app is up and running, as opposed to the usual way which is
to let the OS loader do it all before program start.

All this says is that TLS declarations are allocated at thread start,
as specified by each DLL that was loaded at the time the thread starts.
If an app dynamically loads a DLL itself *after* a thread is already
running then any TLS variables defined by that DLL are not allocated for
already running threads. Caveat emptor - you have to patch this up yourself.

Note that since WinNT was released in 1993 there are very few reasons
that an app would want to dynamically manage loading and unloading its
own *executable* code on the fly. This is really a holdover from Win3.1.

The documentation for Linux mmap says that it ignores the MAP_EXECUTABLE
option so it does not appear to have equivalent functionality to
LoadLibrary and thereby avoids the whole issue.

https://man7.org/linux/man-pages/man2/mmap.2.html

Re: Intel goes to 32 GPRs

<oKLFM.142637$ftCb.43121@fx34.iad>

https://news.novabbs.org/devel/article-flat.php?id=33800&group=comp.arch#33800

Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx34.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Intel goes to 32 GPRs
Newsgroups: comp.arch
References: <u9o14h$183or$1@newsreader4.netcologne.de> <ubpr3d$2iugr$1@newsreader4.netcologne.de> <2023Aug19.123116@mips.complang.tuwien.ac.at> <AA6EM.518412$TCKc.373270@fx13.iad> <uc0lga$22c76$1@dont-email.me> <Cf3FM.499505$qnnb.208430@fx11.iad> <b92f55d2-aceb-495c-a240-faf5f578b5f0n@googlegroups.com> <Td6FM.501194$qnnb.321179@fx11.iad> <uc45qi$2pk2p$1@newsreader4.netcologne.de> <G2sFM.197201$ens9.139215@fx45.iad> <EvsFM.617381$mPI2.1857@fx15.iad> <taLFM.157982$8_8a.131964@fx48.iad>
Lines: 43
Message-ID: <oKLFM.142637$ftCb.43121@fx34.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Thu, 24 Aug 2023 16:40:52 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Thu, 24 Aug 2023 16:40:52 GMT
X-Received-Bytes: 2871

by: Scott Lurndal - Thu, 24 Aug 2023 16:40 UTC

EricP <ThatWouldBeTelling@thevillage.com> writes:
>Scott Lurndal wrote:
>> EricP <ThatWouldBeTelling@thevillage.com> writes:
>>> Thomas Koenig wrote:
>>
>>> And since errno is a thread_local variable then each DLL gets its
>>> own copy so it matters where the library function was called and
>>> whether it was linked with a static OBJ library or DLL as to
>>> which errno gets set.
>>
>> Is windows really that screwed up? The thread local variables should
>> be managed by the OS and just referenced by the libraries
>> and the executable. Particularly process global variables like
>> errno where the 'thread local data' nature of it varies based
>> on whether the application is built as a single threaded app
>> or a multithreaded app but has nothing to do with libraries.
>
>Its not Microsoft specific - it could happen on any OS shared library.
>Its a consequence of mixing global variables and dynamic link libraries,
>what *nix calls shared libraries, that each linkage unit gets its own
>copy of any global variables referenced by routines in the linkage unit.

In Unix, errno is a thread-local variable with global scope. All
shared objects will use the same 'errno' within a given thread, regardless
of whether it is accessed from the originally loaded executable or
a subsequently loaded (either at exec() time by the rtld, or later
using dlopen(3)) shared object.

This is true for any 'thread local' variable declared in the global
application scope.

Yes, thread local variables _declared in a shared object_ will be
specific to that shared object.

>The documentation for Linux mmap says that it ignores the MAP_EXECUTABLE
>option so it does not appear to have equivalent functionality to
>LoadLibrary and thereby avoids the whole issue.
>
>https://man7.org/linux/man-pages/man2/mmap.2.html

See 'dlopen(3)'.

Re: Callee-saved registers

<uc84ut$3icje$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33803&group=comp.arch#33803

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: sfuld@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Callee-saved registers
Date: Thu, 24 Aug 2023 10:45:01 -0700
Organization: A noiseless patient Spider
Lines: 33
Message-ID: <uc84ut$3icje$1@dont-email.me>
References: <u9o14h$183or$1@newsreader4.netcologne.de>
<2023Aug23.071732@mips.complang.tuwien.ac.at>
<b3c8c458-b46c-4a2d-bec5-01d9397c2ad4n@googlegroups.com>
<2023Aug23.102035@mips.complang.tuwien.ac.at> <uc5dfv$vkk$1@gal.iecc.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 24 Aug 2023 17:45:01 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="7f71ad1ac6f8a8e5c02f5670f708bbcd";
logging-data="3748462"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19nnoPjrOkwfc50qoudW4t1nK5E3wWewxw="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:ZHFHGznpjWzC5UwJmGGt2SHOXPM=
Content-Language: en-US
In-Reply-To: <uc5dfv$vkk$1@gal.iecc.com>

by: Stephen Fuld - Thu, 24 Aug 2023 17:45 UTC

On 8/23/2023 9:52 AM, John Levine wrote:
> According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
>> OTOH, when most architectures were introduced, Hennessy&Patterson had
>> not written CA:AQA, so maybe the architects used handwawing for these
>> decisions ...
>
> I believe the first architecture that was designed using simulations
> of workloads was S/360. After they considered a lot of options
> including a stack, it ended up with 16 registers. In retrospect, they
> made a few mistakes (no relative branches and the botched floating
> point), but having a lot of registers, and making them all usable as
> both accumulators and index or base registers was a good move.

Agreed. Although if they considered each program in isolation, that
would account for what I believe was a big mistake in the architecture.
Specifically, the lack of a non-user settable base address register,
instead relying on the USING,BALR mechanism led to the situation where,
once a program was loaded, it could never be relocated.

So OS/360 didn't have "swap", but what IIRC was called rollout/rollback,
where a program could be written to disk, then reloaded, but it had to
be reloaded to the same physical address as it was written out from.
Besides potentially causing memory fragmentation, it was responsible for
the utter catastrophe that was TSO/360.

Of course, this became irrelevant by the time of virtual memory in the
360/67 and later S/370s and beyond.
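
As a rough analogy only (this is a C toy, not S/360 code), the relocation
problem can be sketched as the difference between an absolute address
captured at load time and a displacement from a base the program does not
own. The names image, image_load, image_relocate and the sizes are all
hypothetical, chosen for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define IMAGE_SIZE   64
#define FIELD_OFFSET 16     /* hypothetical location of a data item */

/* A toy "program image": raw bytes plus two ways of addressing them. */
typedef struct {
    unsigned char bytes[IMAGE_SIZE];
    unsigned char *abs_ptr; /* absolute address captured at load time,
                               like a base register the program set up */
    size_t offset;          /* displacement from a base that the system,
                               not the program, supplies */
} image;

void image_load(image *img, unsigned char value)
{
    memset(img->bytes, 0, IMAGE_SIZE);
    img->bytes[FIELD_OFFSET] = value;
    img->abs_ptr = &img->bytes[FIELD_OFFSET];
    img->offset = FIELD_OFFSET;
}

/* Move the image to a new place, as a relocating swap would. */
void image_relocate(image *dst, const image *src)
{
    memcpy(dst, src, sizeof *dst);
}

/* Nonzero if the captured absolute address still points into 'img'. */
int abs_ptr_valid(const image *img)
{
    return img->abs_ptr == &img->bytes[FIELD_OFFSET];
}

/* The offset form stays valid wherever the image lives. */
unsigned char read_via_offset(const image *img)
{
    return img->bytes[img->offset];
}
```

After image_relocate the copied abs_ptr still refers to the old location,
while read_via_offset keeps working, which is the property a
system-controlled base register would have provided.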

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Callee-saved registers

<211ffe18-e188-4906-aea2-8381328959a7n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33805&group=comp.arch#33805

X-Received: by 2002:ae9:e30a:0:b0:767:f284:a452 with SMTP id v10-20020ae9e30a000000b00767f284a452mr181600qkf.2.1692900028253;
Thu, 24 Aug 2023 11:00:28 -0700 (PDT)
X-Received: by 2002:a17:902:c40c:b0:1c0:ac09:4032 with SMTP id
k12-20020a170902c40c00b001c0ac094032mr2127219plk.9.1692900027718; Thu, 24 Aug
2023 11:00:27 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Thu, 24 Aug 2023 11:00:27 -0700 (PDT)
In-Reply-To: <uc84ut$3icje$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:858d:3bb5:7746:21e2;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:858d:3bb5:7746:21e2
References: <u9o14h$183or$1@newsreader4.netcologne.de> <2023Aug23.071732@mips.complang.tuwien.ac.at>
<b3c8c458-b46c-4a2d-bec5-01d9397c2ad4n@googlegroups.com> <2023Aug23.102035@mips.complang.tuwien.ac.at>
<uc5dfv$vkk$1@gal.iecc.com> <uc84ut$3icje$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <211ffe18-e188-4906-aea2-8381328959a7n@googlegroups.com>
Subject: Re: Callee-saved registers
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Thu, 24 Aug 2023 18:00:28 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 3733

by: MitchAlsup - Thu, 24 Aug 2023 18:00 UTC

On Thursday, August 24, 2023 at 12:45:05 PM UTC-5, Stephen Fuld wrote:
> On 8/23/2023 9:52 AM, John Levine wrote:
> > According to Anton Ertl <an...@mips.complang.tuwien.ac.at>:
> >> OTOH, when most architectures were introduced, Hennessy&Patterson had
> >> not written CA:AQA, so maybe the architects used handwawing for these
> >> decisions ...
> >
> > I believe the first architecture that was designed using simulations
> > of workloads was S/360. After they considered a lot of options
> > including a stack, it ended up with 16 registers. In retrospect, they
> > made a few mistakes (no relative branches and the botched floating
> > point), but having a lot of registers, and making them all usable as
> > both accumulators and index or base registers was a good move.
<
I am willing to give them a pass on 15 GPRs, register storage was
expensive back then, and 5-bit register specifiers would have
seriously crimped their ability to get 16-bit instructions in.
<
> Agreed. Although if they considered each program in isolation, that
> would account for what I believe was a big mistake in the architecture.
> Specifically, the lack of a non-user settable base address register,
> instead relying on the USING,BALR mechanism led to the situation where,
> once a program was loaded, it could never be relocated.
<
Having branches use the exact same memory addressing as memory
references was (WAS) a stunningly BAD decision. Wasting GPRs to
address branch labels was brain dead... code is fundamentally
different from data.
>
> So OS/360 didn't have "swap", but what IIRC was called rollout/rollback,
> where a program could be written to disk, then reloaded, but it had to
> be reloaded to the same physical address as it was written out from.
<
Virtualizing memory solved this problem (360/67 leading the way),
so the program thought it was at the same address, but it could
be put back anywhere.
<
> Besides potentially causing memory fragmentation, it was responsible for
> the utter catastrophe that was TSO/360.
>
> If course, this became irrelevant by the time of virtual memory in the
> 360/67 and later S/370s and beyond.
> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)

Re: virutal 360, Callee-saved registers

<uc8m97$16u7$1@gal.iecc.com>

https://news.novabbs.org/devel/article-flat.php?id=33807&group=comp.arch#33807

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!not-for-mail
From: johnl@taugh.com (John Levine)
Newsgroups: comp.arch
Subject: Re: virutal 360, Callee-saved registers
Date: Thu, 24 Aug 2023 22:40:39 -0000 (UTC)
Organization: Taughannock Networks
Message-ID: <uc8m97$16u7$1@gal.iecc.com>
References: <u9o14h$183or$1@newsreader4.netcologne.de> <uc5dfv$vkk$1@gal.iecc.com> <uc84ut$3icje$1@dont-email.me> <211ffe18-e188-4906-aea2-8381328959a7n@googlegroups.com>
Injection-Date: Thu, 24 Aug 2023 22:40:39 -0000 (UTC)
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="39879"; mail-complaints-to="abuse@iecc.com"
In-Reply-To: <u9o14h$183or$1@newsreader4.netcologne.de> <uc5dfv$vkk$1@gal.iecc.com> <uc84ut$3icje$1@dont-email.me> <211ffe18-e188-4906-aea2-8381328959a7n@googlegroups.com>
Cleverness: some
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Originator: johnl@iecc.com (John Levine)

by: John Levine - Thu, 24 Aug 2023 22:40 UTC

According to MitchAlsup <MitchAlsup@aol.com>:
>Having branches use the exact same memory addressing as memory
>references was (WAS) stunningly BAD decision. Wasting GPRs to
>address branch labels was brain dead.....code is fundamentally
>different than data. ..

>> So OS/360 didn't have "swap", but what IIRC was called rollout/rollback,
>> where a program could be written to disk, then reloaded, but it had to
>> be reloaded to the same physical address as it was written out from.
><
>Virtualizing memory solved this problem (360/67 leading the way)
>So the program thought it was at the same address, but it could
>be put back anywhere.

Yes, but. TSS/360 had shared libraries, and could map shared libraries
to different places in different processes. The kludgery to make that
work was quite impressive.

They finally added relative branches to S/390 in 1990 but by then the
ship had sailed and IBM mainframes had settled into their high cost
high reliability niche.

--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

Re: virutal 360, Callee-saved registers

<463370f5-27ee-494b-8228-c70a12f1743cn@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33808&group=comp.arch#33808

X-Received: by 2002:a05:620a:3884:b0:76d:a231:7c92 with SMTP id qp4-20020a05620a388400b0076da2317c92mr234619qkn.9.1692921682011;
Thu, 24 Aug 2023 17:01:22 -0700 (PDT)
X-Received: by 2002:a05:6638:1ee2:b0:42b:60ec:2f61 with SMTP id
gw2-20020a0566381ee200b0042b60ec2f61mr314510jab.2.1692921675540; Thu, 24 Aug
2023 17:01:15 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Thu, 24 Aug 2023 17:01:15 -0700 (PDT)
In-Reply-To: <uc8m97$16u7$1@gal.iecc.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:858d:3bb5:7746:21e2;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:858d:3bb5:7746:21e2
References: <u9o14h$183or$1@newsreader4.netcologne.de> <uc5dfv$vkk$1@gal.iecc.com>
<uc84ut$3icje$1@dont-email.me> <211ffe18-e188-4906-aea2-8381328959a7n@googlegroups.com>
<uc8m97$16u7$1@gal.iecc.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <463370f5-27ee-494b-8228-c70a12f1743cn@googlegroups.com>
Subject: Re: virutal 360, Callee-saved registers
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Fri, 25 Aug 2023 00:01:22 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 3211

by: MitchAlsup - Fri, 25 Aug 2023 00:01 UTC

On Thursday, August 24, 2023 at 5:40:43 PM UTC-5, John Levine wrote:
> According to MitchAlsup <Mitch...@aol.com>:
> >Having branches use the exact same memory addressing as memory
> >references was (WAS) stunningly BAD decision. Wasting GPRs to
> >address branch labels was brain dead.....code is fundamentally
> >different than data. ..
> >> So OS/360 didn't have "swap", but what IIRC was called rollout/rollback,
> >> where a program could be written to disk, then reloaded, but it had to
> >> be reloaded to the same physical address as it was written out from.
> ><
> >Virtualizing memory solved this problem (360/67 leading the way)
> >So the program thought it was at the same address, but it could
> >be put back anywhere.
<
> Yes, but. TSS/360 had shared libraries, and could map shared libraries
> to different places in different processes. The kludgery to make that
> work was quite impressive.
<
They pretty much had to have shared libraries since overall memory was
so paltry. I was not aware of the underlying kludgery.
<
In 1971 when I got to CMU, the /67 had an average up-time of 30-odd
minutes. One of the operators dropped out of school and over 2 years
basically rewrote much of the I/O and swap system, after which average
up-time was on the order of 5 days {and this was all in BAL}. Perhaps
the kludgery was partly to blame; I only saw it from the user side and
got only a few hints about what was going on inside.
>
> They finally added relative branches so S/390 in 1990 but by then the
> ship had sailed and IBM mainframes had settled into their high cost
> high reliability niche.
> --
> Regards,
> John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
> Please consider the environment before reading this e-mail. https://jl.ly

Re: Intel goes to 32 GPRs

<QH3GM.95667$VzFf.17010@fx03.iad>

https://news.novabbs.org/devel/article-flat.php?id=33811&group=comp.arch#33811

Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx03.iad.POSTED!not-for-mail
From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Intel goes to 32 GPRs
References: <u9o14h$183or$1@newsreader4.netcologne.de> <ubpr3d$2iugr$1@newsreader4.netcologne.de> <2023Aug19.123116@mips.complang.tuwien.ac.at> <AA6EM.518412$TCKc.373270@fx13.iad> <uc0lga$22c76$1@dont-email.me> <Cf3FM.499505$qnnb.208430@fx11.iad> <b92f55d2-aceb-495c-a240-faf5f578b5f0n@googlegroups.com> <Td6FM.501194$qnnb.321179@fx11.iad> <uc45qi$2pk2p$1@newsreader4.netcologne.de> <G2sFM.197201$ens9.139215@fx45.iad> <EvsFM.617381$mPI2.1857@fx15.iad> <taLFM.157982$8_8a.131964@fx48.iad> <oKLFM.142637$ftCb.43121@fx34.iad>
In-Reply-To: <oKLFM.142637$ftCb.43121@fx34.iad>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 76
Message-ID: <QH3GM.95667$VzFf.17010@fx03.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Fri, 25 Aug 2023 15:23:28 UTC
Date: Fri, 25 Aug 2023 11:22:58 -0400
X-Received-Bytes: 4716

by: EricP - Fri, 25 Aug 2023 15:22 UTC

Scott Lurndal wrote:
> EricP <ThatWouldBeTelling@thevillage.com> writes:
>> Scott Lurndal wrote:
>>> EricP <ThatWouldBeTelling@thevillage.com> writes:
>>>> Thomas Koenig wrote:
>>>> And since errno is a thread_local variable then each DLL gets its
>>>> own copy so it matters where the library function was called and
>>>> whether it was linked with a static OBJ library or DLL as to
>>>> which errno gets set.
>>> Is windows really that screwed up? The thread local variables should
>>> be managed by the OS and just referenced by the libraries
>>> and the executable. Particularly process global variables like
>>> errno where the 'thread local data' nature of it varies based
>>> on whether the application is built as a single threaded app
>>> or a multithreaded app but has nothing to do with libraries.
>> Its not Microsoft specific - it could happen on any OS shared library.
>> Its a consequence of mixing global variables and dynamic link libraries,
>> what *nix calls shared libraries, that each linkage unit gets its own
>> copy of any global variables referenced by routines in the linkage unit.
>
> In unix, errno is a thread local variable with global scope. All
> shared objects will use same 'errno' within a given thread, regardless
> of whether it is accessed from the originally loaded executable or
> a subsequently loaded (either at exec() time by the rtld, or later
> using dlopen(3)) shared object.

Ah, yes, I see Linux has a concept of global shared symbols.
dlopen has a flag RTLD_GLOBAL: "The symbols defined by this
shared object will be made available for symbol resolution of
subsequently loaded shared objects."

Sounds handy. Windows has no such concept.
On Windows it is as if all DLLs are loaded with RTLD_LOCAL.

> This is true for any 'thread local' variable declared in the global
> application scope.
>
> Yes, thread local variables _declared in a shared object_ will be
> specific to that shared object.

On Windows errno is what the C standard says, an integer with file scope,
which they moved to the TLS area. Each EXE/DLL that references errno gets
an errno, just like any other global scope variable.

>> The documentation for Linux mmap says that it ignores the MAP_EXECUTABLE
>> option so it does not appear to have equivalent functionality to
>> LoadLibrary and thereby avoids the whole issue.
>>
>> https://man7.org/linux/man-pages/man2/mmap.2.html
>
> See 'dlopen(3)'.

Thanks. Seems this is intended for the same thing as Windows LoadLibrary:
applications that need to dynamically load "plug-ins" (codecs, etc).

Looking at the dlopen and dlclose documentation I see no mention of TLS at all,
let alone a clear statement of how TLS interacts with running threads,
or that dlclose cleans up and recovers TLS allocated storage for that
shared library *for all the threads of a process*.

Because the Windows documentation clearly states that there is no
interaction with already running threads, one can make use of that.
A program which wants to use plug-ins should use LoadLibrary to map
the DLL, then create new threads which use the DLL functions,
then allow those threads to exit and terminate, then FreeLibrary to
unmap the DLL. Other threads should not interact with the loaded DLL.
(But note there are almost always other gotchas in Windows.)

Based on the lack of Linux information, I would be inclined to create
a separate new process for the thread that calls dlopen and then
toss the whole process instead of calling dlclose.
That way one can be sure it won't leak TLS memory in your application.

Re: Intel goes to 32 GPRs

<7L4GM.909644$GMN3.642632@fx16.iad>

https://news.novabbs.org/devel/article-flat.php?id=33812&group=comp.arch#33812

Path: i2pn2.org!i2pn.org!news.1d4.us!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx16.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Intel goes to 32 GPRs
Newsgroups: comp.arch
References: <u9o14h$183or$1@newsreader4.netcologne.de> <AA6EM.518412$TCKc.373270@fx13.iad> <uc0lga$22c76$1@dont-email.me> <Cf3FM.499505$qnnb.208430@fx11.iad> <b92f55d2-aceb-495c-a240-faf5f578b5f0n@googlegroups.com> <Td6FM.501194$qnnb.321179@fx11.iad> <uc45qi$2pk2p$1@newsreader4.netcologne.de> <G2sFM.197201$ens9.139215@fx45.iad> <EvsFM.617381$mPI2.1857@fx15.iad> <taLFM.157982$8_8a.131964@fx48.iad> <oKLFM.142637$ftCb.43121@fx34.iad> <QH3GM.95667$VzFf.17010@fx03.iad>
Lines: 138
Message-ID: <7L4GM.909644$GMN3.642632@fx16.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Fri, 25 Aug 2023 16:35:15 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Fri, 25 Aug 2023 16:35:15 GMT
X-Received-Bytes: 7238

by: Scott Lurndal - Fri, 25 Aug 2023 16:35 UTC

EricP <ThatWouldBeTelling@thevillage.com> writes:
>Scott Lurndal wrote:
>> EricP <ThatWouldBeTelling@thevillage.com> writes:
>>> Scott Lurndal wrote:
>>>> EricP <ThatWouldBeTelling@thevillage.com> writes:
>>>>> Thomas Koenig wrote:
>>>>> And since errno is a thread_local variable then each DLL gets its
>>>>> own copy so it matters where the library function was called and
>>>>> whether it was linked with a static OBJ library or DLL as to
>>>>> which errno gets set.
>>>> Is windows really that screwed up? The thread local variables should
>>>> be managed by the OS and just referenced by the libraries
>>>> and the executable. Particularly process global variables like
>>>> errno where the 'thread local data' nature of it varies based
>>>> on whether the application is built as a single threaded app
>>>> or a multithreaded app but has nothing to do with libraries.
>>> Its not Microsoft specific - it could happen on any OS shared library.
>>> Its a consequence of mixing global variables and dynamic link libraries,
>>> what *nix calls shared libraries, that each linkage unit gets its own
>>> copy of any global variables referenced by routines in the linkage unit.
>>
>> In unix, errno is a thread local variable with global scope. All
>> shared objects will use same 'errno' within a given thread, regardless
>> of whether it is accessed from the originally loaded executable or
>> a subsequently loaded (either at exec() time by the rtld, or later
>> using dlopen(3)) shared object.
>
>Ah, yes I see Linux has a concept of global shared symbols.
>dlOpen has a flag RTLD_GLOBAL "The symbols defined by this
>shared object will be made available for symbol resolution of
>subsequently loaded shared objects."
>
>Sounds handy. Windows has no such concept.
>Windows is as if all DLL's are loaded with RTLD_LOCAL.
>
>> This is true for any 'thread local' variable declared in the global
>> application scope.
>>
>> Yes, thread local variables _declared in a shared object_ will be
>> specific to that shared object.
>
>On Windows errno is what the C standard says, an integer with file scope,
>which they moved to the TLS area. Each EXE/DLL that references errno gets
>an errno, just like any other global scope variable.

POSIX extends the definition a bit.

The symbol errno shall expand to a modifiable lvalue of
type int. It is unspecified whether errno is a macro or
an identifier declared with external linkage. If a macro
definition is suppressed in order to access an actual object,
or a program defines an identifier with the name errno, the
behavior is undefined.

Most Linux systems define errno (when pthreads is supported) as follows:

# define errno (*__errno_location ())

Where __errno_location() accesses thread local storage.
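
One consequence of the "modifiable lvalue" wording can be sketched in
portable C: the address of errno can be taken, and stores through that
pointer are visible through the macro. The helper names current_errno_cell
and roundtrip_errno are hypothetical, chosen for illustration:

```c
#include <assert.h>
#include <errno.h>

/*
 * Because errno is a modifiable lvalue, its address can be taken.
 * With the glibc-style definition quoted above, the pointer returned
 * here is whatever __errno_location() yields for the calling thread.
 */
int *current_errno_cell(void)
{
    return &errno;
}

/* Store through the pointer and read the value back via the macro. */
int roundtrip_errno(int value)
{
    int *cell = current_errno_cell();
    int saved = *cell;      /* preserve the caller's errno */
    int seen;

    *cell = value;
    seen = errno;           /* same object, read through the macro */
    errno = saved;          /* restore before returning */
    return seen;
}
```

The sketch is single-threaded; it shows only that the macro and the
pointer name the same object, not how separate threads get separate cells.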

>
>>> The documentation for Linux mmap says that it ignores the MAP_EXECUTABLE
>>> option so it does not appear to have equivalent functionality to
>>> LoadLibrary and thereby avoids the whole issue.
>>>
>>> https://man7.org/linux/man-pages/man2/mmap.2.html
>>
>> See 'dlopen(3)'.
>
>Thanks. Seems this is intended for the same thing as Windows LoadLibrary:
>applications that need to dynamically load "plug-ins" (codecs, etc).

Indeed. My current project (an SoC simulator) uses dlopen to load
models of devices (e.g. uart, sata controller, network adapter)
as required by the SoC configuration file. Generically, the
model shared objects export an extern "C" definition of a function
called "get_device" that, when called by the application, returns
a pointer to a common base class (c_device in this case) from which
all device models derive.

/**
* Obtain a pointer to an instance of the 16550 UART device
* simulation. Invoked via dlsym("get_device") when shared object
* containing the device model is loaded.
*
* @param name The name associated with this device instance.
* @param icp A pointer to the interrupt controller to signal interrupts to
* @param lp A pointer to the logger instance to use for diagnostic output
* @param pp A pointer to the SoC object
* @returns A pointer to a c_device instance representing this device.
*/
c_device *
get_device(const char *name, c_interrupt_controller *icp, c_logger const* lp, c_soc *pp)
{ return new c_uart(name, lp, icp, pp);
}
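
For illustration, the loading side might look roughly like the sketch below. The helper name load_device_factory and the opaque class declarations are assumptions made for the example; only dlopen, dlsym, dlerror and dlclose are real calls, and the function-pointer type is copied from the get_device signature above.

```cpp
#include <dlfcn.h>
#include <cstdio>

// Opaque stand-ins for the classes named in the post; their real
// definitions are not shown in the thread.
class c_device;
class c_interrupt_controller;
class c_logger;
class c_soc;

// Signature of the factory each model exports (from get_device above).
using get_device_fn = c_device *(*)(const char *, c_interrupt_controller *,
                                    c_logger const *, c_soc *);

// Hypothetical helper: map the shared object at `path` and look up its
// "get_device" entry point. Returns nullptr on failure.
get_device_fn load_device_factory(const char *path, void **handle_out) {
    *handle_out = nullptr;
    // RTLD_NOW resolves the object's undefined symbols at load time.
    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == nullptr) {
        std::fprintf(stderr, "dlopen: %s\n", dlerror());
        return nullptr;
    }
    dlerror();  // clear any stale error before calling dlsym
    void *sym = dlsym(handle, "get_device");
    if (sym == nullptr) {
        const char *err = dlerror();
        std::fprintf(stderr, "dlsym: %s\n", err ? err : "symbol is null");
        dlclose(handle);
        return nullptr;
    }
    *handle_out = handle;  // caller passes this to dlclose() when done
    return reinterpret_cast<get_device_fn>(sym);
}
```

The lookup by the plain string "get_device" only works because the factory has C linkage; a C++-mangled name would not match.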

>
>Looking at dlopen and dlclose documentation I see no mention of TLS at all,
>let alone a clear statement of how TLS interacts with running threads,
>and that dlclose cleans up and recovers TLS allocated storage for that
>shared library *for all the threads of a process*.

I use dlopen extensively in threaded applications. dlopen(3) itself
doesn't particularly care whether the code is threaded or not threaded;
it will simply reference the "modifiable lvalue errno", which in a
threaded application will be the __errno_location() function noted
above. All other globals in the application are used to resolve
external references in the shared object either when loaded (RTLD_NOW)
or as referenced (RTLD_LAZY).

>
>Because the Windows documentation clearly states that there is no
>interaction with already running threads, one can make use of that.
>A program which wants to use plug-ins should use LoadLibrary to map
>the DLL, then create new threads which use the DLL functions,
>then allow those threads to exit and terminate, then FreeLibrary to
>unmap the DLL. Other threads should not interact with the loaded DLL.
>(But note there are almost always other gotcha's in Windows.)
>
>Based on the lack of Linux information, I would be inclined to create
>a separate new process for the thread that calls dlopen and then
>toss the whole process instead of dlclose.

Not necessary on linux.

See 'pthread_key_create' and "pthread_getspecific".
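
A minimal sketch of how those two calls fit together, assuming the concern is per-thread storage that must be released when a thread exits. The names g_key, free_slot, worker and run_workers are invented for the example. It shows the key destructor running at thread exit and says nothing about what dlclose itself does.

```cpp
#include <pthread.h>
#include <atomic>

static pthread_key_t g_key;               // one key, one value slot per thread
static std::atomic<int> g_destroyed{0};   // counts destructor invocations

// Called by the threads library at thread exit, once for each thread
// whose slot for g_key holds a non-null value.
static void free_slot(void *p) {
    delete static_cast<int *>(p);
    ++g_destroyed;
}

static void *worker(void *) {
    pthread_setspecific(g_key, new int(42));  // this thread's private value
    int *mine = static_cast<int *>(pthread_getspecific(g_key));
    *mine += 1;                               // visible to this thread only
    return nullptr;
}

// Runs `n` threads one after another and returns how many per-thread
// values the destructor released (or -1 on a setup failure).
int run_workers(int n) {
    g_destroyed = 0;
    if (pthread_key_create(&g_key, free_slot) != 0) return -1;
    for (int i = 0; i < n; ++i) {
        pthread_t t;
        if (pthread_create(&t, nullptr, worker, nullptr) != 0) return -1;
        pthread_join(t, nullptr);
    }
    pthread_key_delete(g_key);  // does not itself run destructors
    return g_destroyed.load();
}
```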

>That way one can be sure it won't leak TLS memory in your application.

Unix/linux shared objects have mechanisms for load time initialization
and unload time destruction (consider it the equivalent of C++ constructors
and destructors at the full library level). Static destructors for C++
code in the library are called when the library is unloaded, for example.
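
As a rough sketch of those hooks, using the GCC/Clang constructor and destructor function attributes (a compiler extension, not standard C++) next to an ordinary C++ static object. All names here are illustrative. In a shared object the hooks would run around dlopen and dlclose; in a plain executable they run before main and at exit.

```cpp
#include <cstdio>

static int g_init_count = 0;  // bumped by each load-time hook

// GCC/Clang extension: runs when the object containing it is loaded.
__attribute__((constructor)) static void on_load() {
    ++g_init_count;
}

// Runs when the object is unloaded or the process exits.
__attribute__((destructor)) static void on_unload() {
    std::fputs("on_unload: library-level teardown\n", stderr);
}

// A C++ static object gets the same treatment through its own
// constructor and destructor.
struct registry {
    registry()  { ++g_init_count; }
    ~registry() { std::fputs("registry destroyed\n", stderr); }
};
static registry g_registry;

// Number of load-time hooks that have run so far.
int init_count() { return g_init_count; }
```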

In article <uc1hsv$2ntj0$1@newsreader4.netcologne.de>,
Thomas Koenig <tkoenig@netcologne.de> wrote:
>Kent Dickey <kegs@provalid.com> schrieb:
>> Here is the code again. I'm trying to keep it as short as possible to show
>> the problem, so I don't plan to respond to nitpicks about the particulars
>> of this code.
>
>[...]
>
>The issue of the frame pointer register not being used shows up in the
>code on godbolt, even with recent gcc trunk.
>
>I have submitted https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111096 for
>this.

The GCC bug is about to be closed since GCC wants to reserve r29 on ARM64
to always be a valid stack frame pointer, just "out of date" if a given
function doesn't use it. So r29 is simply wasted on ARM64 when using GCC.
I have no idea what use a frame pointer is; I have no use for it.

Someone also mentioned GCC supported changing registers to callee-saves
using "-fcall-saved-18" which will switch r18 from being a scratch
register to being preserved. (Note this doesn't work for r29, GCC
treats it as "fixed" which overrides this setting). I can make this
work since my program has no external dependencies (other than libgcc.a,
which on ARM64, isn't used).

I had to make some changes to some support routines (I have my own
lightweight setjmp()/longjmp()) to handle this, and then compiled and
ran with -fcall-saved-18. It reduced the code size 3136 bytes (out of
900KB), and made the overall execution time 0.5% faster.

So then I tried also adding -fcall-saved-17 as well. And the executable
got larger than the default size, and runtime went back to about the
same, maybe slightly slower.

What's happening is several things:

1) Adding one more preserved register (callee saved) is almost free.
The default number of preserved registers is 10 (r19-r28), and
the link register also needs to be saved (r30). GCC uses STP at
procedure entry and LDP at procedure exit to do these
spills/fills as register pairs. With 11 total registers to
spill/fill (r19-r28, and r30), this takes 5 STP (or LDP) and one
"STR" and "LDR" instruction. Using "-fcall-saved-18" creates 11
preserved registers, plus r30 needs to be saved, for a total of
12 registers to be saved. So that left-over STR/LDR becomes
STP/LDP and costs no instructions. But adding -fcall-saved-17
means one more register is needed, so an extra STR/LDR is needed
in each function which needs all the preserved registers.

2) GCC's register allocator misses some reuse. If it feels it has
plenty of registers, then it ends up doing stuff like:

LDR x20,[x8,#120]
ADD x20,x20,x9
STR x20,[x8,#120]
BL some_other_func
LDR x21,[x8,#128]
ADD x21,x21,x0
STR x21,[x8,#128]

Basically, it needs a register to hold some value, and even
though the uses of x20 and x21 do not overlap, it doesn't just use x20
for both. What's actually going on is more complex (these are
preserved registers, not scratch registers), but effectively GCC
can waste some preserved (callee save) registers.

3) Combining #1 and #2, "-fcall-saved-18 -fcall-saved-17" causes GCC to
want to use an extra preserved register in many functions, even
though it doesn't use it for anything useful. Since using r17
now leads to 13 registers to be saved, this costs an extra LDR
and STR instruction in each function. This leads to the code
expansion. Having few preserved registers actually helps
performance in the face of this GCC issue--it seems to much more
aggressively reuse registers when it would have to add in
explicit spill/fills to the stack. I suspect GCC is not
properly accounting the cost of grabbing "one more" preserved
register.

4) I suspect -fcall-saved-xx is not well tested. I noticed NOP instructions
appearing inside functions occasionally. This just wastes
space/time. There were a lot more NOPs with -fcall-saved-18
-fcall-saved-17, probably one in every other function I looked
at. This seems to explain the rest of the code expansion.
There's no reason for NOPs on ARM64. There are always aligning
NOPs between functions (for my code, I suspect removing these
NOPs would help icache misses enough to make it a win for me,
but I don't know the option to get rid of them.)

5) Also adding -fcall-saved-16 causes my code to crash. So I couldn't go
further. I suspect it's GCC's bug, but I didn't debug it.
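
The pairing arithmetic from point 1 can be written out as a small helper. This is only a model of the counting described above (registers saved two at a time, with a single store for any leftover), not of what GCC actually emits; spill_cost and cost_for are illustrative names.

```cpp
// Instruction count for saving (or restoring) a set of registers when
// they are handled in pairs, with one single store/load for a leftover.
struct spill_cost {
    int pairs;    // STP (or LDP) instructions
    int singles;  // STR (or LDR) instructions
    constexpr int total() const { return pairs + singles; }
};

constexpr spill_cost cost_for(int saved_regs) {
    return spill_cost{saved_regs / 2, saved_regs % 2};
}

// 11 registers (r19-r28 plus r30): 5 pairs + 1 single = 6 instructions.
static_assert(cost_for(11).total() == 6, "default set");
// 12 registers (adding r18): 6 pairs, still 6 instructions.
static_assert(cost_for(12).total() == 6, "with -fcall-saved-18");
// 13 registers (adding r17 as well): 6 pairs + 1 single = 7 instructions.
static_assert(cost_for(13).total() == 7, "with -fcall-saved-17 too");
```

So going from 11 to 12 saved registers is free in this model, while the 13th costs one extra instruction at entry and one at exit.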

Kent

kegs@provalid.com (Kent Dickey) writes:
>In article <uc1hsv$2ntj0$1@newsreader4.netcologne.de>,
>Thomas Koenig <tkoenig@netcologne.de> wrote:
>>Kent Dickey <kegs@provalid.com> schrieb:
>>> Here is the code again. I'm trying to keep it as short as possible to show
>>> the problem, so I don't plan to respond to nitpicks about the particulars
>>> of this code.
>>
>>[...]
>>
>>The issue of the frame pointer register not being used shows up in the
>>code on godbolt, even with recent gcc trunk.
>>
>>I have submitted https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111096 for
>>this.
>
>The GCC bug is about to be closed since GCC wants to reserve r29 on ARM64
>to always be a valid stack frame pointer, just "out of date" if a given
>function doesn't use it. So r29 is simply wasted on ARM64 when using GCC.
>I have no idea what use a frame pointer is; I have no use for it.

A frame pointer is useful for debugging, and in some cases for
stack unwinding (e.g. C++ exceptions).

On intel, gcc supports -fomit-frame-pointer which releases RBP
to be a general register. I'd be surprised if the ARM64 compiler
doesn't always support omit-frame-pointer.

Re: Callee-saved registers (was: Intel goes to 32 GPRs)

<e8cce067-d38f-46c6-8cd1-831db706eabfn@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33819&group=comp.arch#33819

X-Received: by 2002:a05:6214:162c:b0:641:89b5:e1e8 with SMTP id e12-20020a056214162c00b0064189b5e1e8mr528481qvw.13.1693001451619;
Fri, 25 Aug 2023 15:10:51 -0700 (PDT)
X-Received: by 2002:a05:6830:120b:b0:6ba:8e4a:8e62 with SMTP id
r11-20020a056830120b00b006ba8e4a8e62mr405319otp.7.1693001451321; Fri, 25 Aug
2023 15:10:51 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!newsfeed.hasname.com!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 25 Aug 2023 15:10:51 -0700 (PDT)
In-Reply-To: <pk9GM.619557$SuUf.333498@fx14.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:d57d:7c1:1d34:77be;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:d57d:7c1:1d34:77be
References: <u9o14h$183or$1@newsreader4.netcologne.de> <c61a30d4-9e83-41ce-91a7-72c296f22b20n@googlegroups.com>
<uc05oh$200om$1@dont-email.me> <uc1hsv$2ntj0$1@newsreader4.netcologne.de>
<ucb30g$6hlj$1@dont-email.me> <pk9GM.619557$SuUf.333498@fx14.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <e8cce067-d38f-46c6-8cd1-831db706eabfn@googlegroups.com>
Subject: Re: Callee-saved registers (was: Intel goes to 32 GPRs)
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Fri, 25 Aug 2023 22:10:51 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 3589

by: MitchAlsup - Fri, 25 Aug 2023 22:10 UTC

On Friday, August 25, 2023 at 4:48:09 PM UTC-5, Scott Lurndal wrote:
> ke...@provalid.com (Kent Dickey) writes:
> >In article <uc1hsv$2ntj0$1...@newsreader4.netcologne.de>,
> >Thomas Koenig <tko...@netcologne.de> wrote:
> >>Kent Dickey <ke...@provalid.com> schrieb:
> >>> Here is the code again. I'm trying to keep it as short as possible to show
> >>> the problem, so I don't plan to respond to nitpicks about the particulars
> >>> of this code.
> >>
> >>[...]
> >>
> >>The issue of the frame pointer register not being used shows up in the
> >>code on godbolt, even with recent gcc trunk.
> >>
> >>I have submitted https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111096 for
> >>this.
> >
> >The GCC bug is about to be closed since GCC wants to reserve r29 on ARM64
> >to always be a valid stack frame pointer, just "out of date" if a given
> >function doesn't use it. So r29 is simply wasted on ARM64 when using GCC.
> >I have no idea what use a frame pointer is; I have no use for it.
<
> A frame pointer is useful for debugging, and in some cases for
> stack unwinding (e.g. C++ exceptions).
<
The frame pointer can be used as a means to access locations on the stack
which are static when the TOS contains dynamically sized data. Local data,
dynamic descriptors, and destructor lists. It is hard to imagine doing general
dynamic stack allocations without one. FPs are also used in block structured
languages. FPs generally require 4-6 more instructions to use per subroutine
than when one does not need an FP.
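
A small sketch of the dynamic-allocation case, using alloca (a non-standard but widely available call declared in <alloca.h> on Linux). The function name is made up for the example. The point is that once part of the frame has a run-time size, the fixed locals can no longer be found at a constant offset from the stack pointer, so compilers typically keep a frame pointer in such a function.

```cpp
#include <alloca.h>
#include <cstddef>

// Sums 0..n-1 through a scratch buffer carved out of the stack at run time.
// `total` is a fixed-size local; `scratch` is the dynamically sized part.
long sum_with_dynamic_frame(std::size_t n) {
    long total = 0;
    int *scratch = static_cast<int *>(alloca(n * sizeof(int)));
    for (std::size_t i = 0; i < n; ++i) scratch[i] = static_cast<int>(i);
    for (std::size_t i = 0; i < n; ++i) total += scratch[i];
    return total;  // the alloca'd space is released on return
}
```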
<
Never found myself in a debugging situation where having a FP (when otherwise
unneeded) would have been a help in debugging.
>
> On intel, gcc supports -fomit-frame-pointer which releases RBP
> to be a general register. I'd be surprised if the ARM64 compiler
> doesn't always support omit-frame-pointer.
<
I should note that My 66000 supports the use of a FP (when needed) without
any cost in the instruction stream (no more instructions to use than to avoid);
your only cost is that FP is not a free register.

scott@slp53.sl.home (Scott Lurndal) writes:
>On intel, gcc supports -fomit-frame-pointer which releases RBP
>to be a general register. I'd be surprised if the ARM64 compiler
>doesn't always support omit-frame-pointer.

It does, but as Kent Dickey reports, gcc then does not use x29 for
other purposes.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Callee-saved registers (was: Intel goes to 32 GPRs)

<uccedu$2urln$1@newsreader4.netcologne.de>

https://news.novabbs.org/devel/article-flat.php?id=33823&group=comp.arch#33823

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd4-f2a7-0-6f56-ce2f-62c0-c855.ipv6dyn.netcologne.de!not-for-mail
From: tkoenig@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Callee-saved registers (was: Intel goes to 32 GPRs)
Date: Sat, 26 Aug 2023 08:51:10 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <uccedu$2urln$1@newsreader4.netcologne.de>
References: <u9o14h$183or$1@newsreader4.netcologne.de>
<c61a30d4-9e83-41ce-91a7-72c296f22b20n@googlegroups.com>
<uc05oh$200om$1@dont-email.me> <uc1hsv$2ntj0$1@newsreader4.netcologne.de>
<ucb30g$6hlj$1@dont-email.me>
Injection-Date: Sat, 26 Aug 2023 08:51:10 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd4-f2a7-0-6f56-ce2f-62c0-c855.ipv6dyn.netcologne.de:2001:4dd4:f2a7:0:6f56:ce2f:62c0:c855";
logging-data="3108535"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)

by: Thomas Koenig - Sat, 26 Aug 2023 08:51 UTC

Kent Dickey <kegs@provalid.com> schrieb:

> 2) GCC's register allocator misses some reuse. If it feels it has
> plenty of registers, then it ends up doing stuff like:
>
> LDR x20,[x8,#120]
> ADD x20,x20,x9
> STR x20,[x8,#120]
> BL some_other_func
> LDR x21,[x8,#128]
> ADD x21,x21,x0
> STR x21,[x8,#128]
>
> Basically, it needs a register to hold some value, and even
> though the uses of x20 and x21 do not overlap, it doesn't just use x20
> for both. What's actually going on is more complex (these are
> preserved registers, not scratch registers), but effectively GCC
> can waste some preserved (callee save) registers.

Register allocation is a hard problem, and known to be far from
perfect; a bugzilla search for the keyword ra (register allocation)
shows 214 open bugs.

What you showed looks like it should be number 215. Can you
open a PR on gcc bugzilla, or send me the code (godbolt, mail)?
I would then submit it.

> 4) I suspect -fcall-saved-xx is not well tested.

That is probably the case.

> I noticed NOP instructions
> appearing inside functions occasionally. This just wastes
> space/time. There were a lot more NOPs with -fcall-saved-18
> -fcall-saved-17, probably one in every other function I looked
> at. This seems to explain the rest of the code expansion.
> There's no reason for NOPs on ARM64. There are always aligning
> NOPs between functions (for my code, I suspect removing these
> NOPs would help icache misses enough to make it a win for me,
> but I don't know the option to get rid of them.)

See https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html and
search for -falign-functions.

> 5) Also adding -fcall-saved-16 causes my code to crash. So I couldn't go
> further. I suspect it's GCC's bug, but I didn't debug it.

Do not forget that -fcall-saved is an ABI-changing option.

Did you compile your whole program, including the parts usually called
via shared libraries, with the -fcall-saved-xx option? If not, calling
(for example) a system-provided libc function is a likely
source for crashes.

In article <uccedu$2urln$1@newsreader4.netcologne.de>,
Thomas Koenig <tkoenig@netcologne.de> wrote:
>Kent Dickey <kegs@provalid.com> schrieb:
>
>> 2) GCC's register allocator misses some reuse. If it feels it has
>> plenty of registers, then it ends up doing stuff like:
>>
>> LDR x20,[x8,#120]
>> ADD x20,x20,x9
>> STR x20,[x8,#120]
>> BL some_other_func
>> LDR x21,[x8,#128]
>> ADD x21,x21,x0
>> STR x21,[x8,#128]
>>
>> Basically, it needs a register to hold some value, and even
>> though the uses of x20 and x21 do not overlap, it doesn't just use x20
>> for both. What's actually going on is more complex (these are
>> preserved registers, not scratch registers), but effectively GCC
>> can waste some preserved (callee save) registers.
>
>Register allocation is a hard problem, and known to be far from
>perfect; a bugzilla search for the keyword ra (register allocation)
>shows 214 open bugs.
>
>What you showed looks like it should be number 215. Can you
>open a PR on gcc bugzilla, or send me the code (godbolt, mail)?
>I would then submit it.

You can actually see the function get longer using the example function
I posted earlier in this thread, and adding -fcall-saved-xx on godbolt.org.
Feel free to knock yourself out. It's annoying to compare functions since
GCC mixes up the register numbers, making it tedious to figure out where
it goes "wrong".

>> 4) I suspect -fcall-saved-xx is not well tested.
>
>That is probably the case.
>
>> I noticed NOP instructions
>> appearing inside functions occasionally. This just wastes
>> space/time. There were a lot more NOPs with -fcall-saved-18
>> -fcall-saved-17, probably one in every other function I looked
>> at. This seems to explain the rest of the code expansion.
>> There's no reason for NOPs on ARM64. There are always aligning
>> NOPs between functions (for my code, I suspect removing these
>> NOPs would help icache misses enough to make it a win for me,
>> but I don't know the option to get rid of them.)
>
>See https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html and
>search for -falign-functions.
>
>> 5) Also adding -fcall-saved-16 causes my code to crash. So I couldn't go
>> further. I suspect it's GCC's bug, but I didn't debug it.
>
>Do not forget that -fcall-saved is an ABI-changing option.
>
>Did you compile your whole program, including the parts usually called
>via shared libraries, with the -fcall-saved-xx option? If not, calling
>(for example) a system-provided libc function is a likely
>source for crashes.

Yes, every line of C code is recompiled with that option (except for
libgcc.a, which based on past experience my code doesn't use on ARM64).
There are no libraries, not even libc. I have a pretty unique use case.
It runs for a little while, but then jumps to address 0. I'm pretty sure
it's a GCC bug that I don't have interest in debugging due to the r29
fiasco.

Kent

I *____knew* I had some reason for not logging you off... If I could just remember what it was.

devel / comp.arch / Re: lots of inline, Intel goes to 32 GPRs

devel / comp.arch / Re: lots of inline, Intel goes to 32 GPRs

I ____knew I had some reason for not logging you off... If I could just remember what it was.