Rocksolid Light
devel / comp.arch / Re: Intel goes to 32 GPRs

Subject -- Author
* Intel goes to 32-bit general purpose registers -- Thomas Koenig
+* Re: Intel goes to 32-bit general purpose registers -- Scott Lurndal
|`* Re: Intel goes to 32-bit general purpose registers -- MitchAlsup
| `* Re: Intel goes to 32-bit general purpose registers -- Quadibloc
|  `* Re: Intel goes to 32-bit general purpose registers -- Anton Ertl
|   `* Re: Intel goes to 32-bit general purpose registers -- Peter Lund
|    `* Re: Intel goes to 32-bit general purpose registers -- Anton Ertl
|     `* Re: Intel goes to 32-bit general purpose registers -- Elijah Stone
|      `* Re: Intel goes to 32-bit general purpose registers -- MitchAlsup
|       `- Re: Intel goes to 32-bit general purpose registers -- Thomas Koenig
+* Re: Intel goes to 32-bit general purpose registers -- Quadibloc
|`* Re: Intel goes to 32-bit general purpose registers -- Quadibloc
| +* Re: Intel goes to 32-bit general purpose registers -- John Dallman
| |+* Re: Intel goes to 32-bit general purpose registers -- Scott Lurndal
| ||`- Re: Intel goes to 32-bit general purpose registers -- John Dallman
| |+* Re: Intel goes to 32-bit general purpose registers -- Anton Ertl
| ||+* Re: Intel goes to 32-bit general purpose registers -- John Dallman
| |||+- Re: Intel goes to 32-bit general purpose registers -- BGB
| |||`- Re: Intel goes to 32-bit general purpose registers -- Anton Ertl
| ||+- Re: Intel goes to 32-bit general purpose registers -- JimBrakefield
| ||`* Re: Intel goes to 32-bit general purpose registers -- Michael S
| || `- Re: Intel goes to 32-bit general purpose registers -- Anton Ertl
| |`* Re: Intel goes to 32-bit general purpose registers -- John Dallman
| | +- Re: Intel goes to 32-bit general purpose registers -- Stephen Fuld
| | `- Re: Intel goes to 32-bit general purpose registers -- MitchAlsup
| `* Re: Intel goes to 32-bit general purpose registers -- Anton Ertl
|  `- Re: Intel goes to 32-bit general purpose registers -- John Dallman
`* Intel goes to 32 GPRs (was: Intel goes to 32-bit ...) -- Anton Ertl
 `* Re: Intel goes to 32 GPRs (was: Intel goes to 32-bit ...) -- Quadibloc
  +- Re: Intel goes to 32 GPRs (was: Intel goes to 32-bit ...) -- Anton Ertl
  `* Re: Intel goes to 32 GPRs -- Terje Mathisen
   `* Re: Intel goes to 32 GPRs -- Thomas Koenig
    `* Re: Intel goes to 32 GPRs -- Terje Mathisen
     +* Re: Intel goes to 32 GPRs -- Thomas Koenig
     |`* Re: Intel goes to 32 GPRs -- Terje Mathisen
     | +- Re: Intel goes to 32 GPRs -- MitchAlsup
     | `* Re: Intel goes to 32 GPRs -- Thomas Koenig
     |  `* Re: Intel goes to 32 GPRs -- Terje Mathisen
     |   +- Re: Intel goes to 32 GPRs -- MitchAlsup
     |   `* Re: Intel goes to 32 GPRs -- Thomas Koenig
     |    `- Re: Intel goes to 32 GPRs -- Anton Ertl
     +- Re: Intel goes to 32 GPRs -- MitchAlsup
     +* Re: Intel goes to 32 GPRs -- Anton Ertl
     |`* Re: Intel goes to 32 GPRs -- Terje Mathisen
     | +* Re: Intel goes to 32 GPRs -- Scott Lurndal
     | |`* Re: Intel goes to 32 GPRs -- MitchAlsup
     | | +- Re: Intel goes to 32 GPRs -- MitchAlsup
     | | +* Re: Intel goes to 32 GPRs -- Scott Lurndal
     | | |`* Re: Intel goes to 32 GPRs -- Terje Mathisen
     | | | +* Re: Intel goes to 32 GPRs -- BGB
     | | | |+* Re: Intel goes to 32 GPRs -- MitchAlsup
     | | | ||`- Re: Intel goes to 32 GPRs -- BGB
     | | | |`* Re: Intel goes to 32 GPRs -- Quadibloc
     | | | | `- Re: Intel goes to 32 GPRs -- BGB
     | | | `* Re: Intel goes to 32 GPRs -- Anton Ertl
     | | |  `- Re: Intel goes to 32 GPRs -- Terje Mathisen
     | | `* Re: Intel goes to 32 GPRs -- BGB
     | |  `* Re: Intel goes to 32 GPRs -- MitchAlsup
     | |   `- Re: Intel goes to 32 GPRs -- BGB
     | +* Re: Intel goes to 32 GPRs -- Anton Ertl
     | |`* Re: Intel goes to 32 GPRs -- Thomas Koenig
     | | +* Re: Intel goes to 32 GPRs -- MitchAlsup
     | | |`- Re: Intel goes to 32 GPRs -- Anton Ertl
     | | +* Re: Intel goes to 32 GPRs -- Terje Mathisen
     | | |+* Re: Intel goes to 32 GPRs -- Anton Ertl
     | | ||+- Re: Intel goes to 32 GPRs -- MitchAlsup
     | | ||`- Re: Intel goes to 32 GPRs -- JimBrakefield
     | | |`* Re: Intel goes to 32 GPRs -- MitchAlsup
     | | | `* Re: Intel goes to 32 GPRs -- BGB
     | | |  `* Re: Intel goes to 32 GPRs -- MitchAlsup
     | | |   +- Re: Intel goes to 32 GPRs -- BGB
     | | |   `* Re: Intel goes to 32 GPRs -- Terje Mathisen
     | | |    `- Re: Intel goes to 32 GPRs -- BGB
     | | `* Re: Intel goes to 32 GPRs -- Stephen Fuld
     | |  `* Re: Intel goes to 32 GPRs -- Anton Ertl
     | |   +- Re: Intel goes to 32 GPRs -- Stephen Fuld
     | |   `- Re: Intel goes to 32 GPRs -- Thomas Koenig
     | `* Re: Intel goes to 32 GPRs -- Thomas Koenig
     |  `* Re: Intel goes to 32 GPRs -- Terje Mathisen
     |   `* Re: Intel goes to 32 GPRs -- Thomas Koenig
     |    `* Re: Intel goes to 32 GPRs -- MitchAlsup
     |     `* Re: Intel goes to 32 GPRs -- Niklas Holsti
     |      `* Re: Intel goes to 32 GPRs -- MitchAlsup
     |       `* Re: Intel goes to 32 GPRs -- Niklas Holsti
     |        `* Re: Intel goes to 32 GPRs -- Stephen Fuld
     |         +- Re: Intel goes to 32 GPRs -- Niklas Holsti
     |         `- Re: Intel goes to 32 GPRs -- Ivan Godard
     `* Re: Intel goes to 32 GPRs -- Kent Dickey
      +* Re: Intel goes to 32 GPRs -- MitchAlsup
      |+* Re: Intel goes to 32 GPRs -- Quadibloc
      ||`- Re: Intel goes to 32 GPRs -- Terje Mathisen
      |`* Re: Intel goes to 32 GPRs -- Kent Dickey
      | `* Re: Intel goes to 32 GPRs -- Thomas Koenig
      |  +* Re: Intel goes to 32 GPRs -- Anton Ertl
      |  |+- Re: Intel goes to 32 GPRs -- Anton Ertl
      |  |`* Re: Intel goes to 32 GPRs -- EricP
      |  | +* Re: Intel goes to 32 GPRs -- MitchAlsup
      |  | |`* Re: Intel goes to 32 GPRs -- Thomas Koenig
      |  | | `* Re: Intel goes to 32 GPRs -- BGB
      |  | |  `* Re: Intel goes to 32 GPRs -- MitchAlsup
      |  | |   +* Re: Intel goes to 32 GPRs -- BGB
      |  | |   `* Re: Intel goes to 32 GPRs -- Terje Mathisen
      |  | `* Re: Intel goes to 32 GPRs -- Stephen Fuld
      |  `* Re: Intel goes to 32 GPRs -- Kent Dickey
      +* Callee-saved registers (was: Intel goes to 32 GPRs) -- Anton Ertl
      `- Re: Intel goes to 32 GPRs -- Mike Stump

Re: Intel goes to 32 GPRs

<2cf9fc86-dd3b-4d89-99a7-42605f699ae9n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33532&group=comp.arch#33532

Newsgroups: comp.arch
 by: MitchAlsup - Mon, 31 Jul 2023 18:59 UTC

On Monday, July 31, 2023 at 1:45:32 PM UTC-5, BGB wrote:
> On 7/31/2023 12:00 PM, MitchAlsup wrote:
> > On Monday, July 31, 2023 at 2:09:27 AM UTC-5, Terje Mathisen wrote:
> >> Thomas Koenig wrote:
> >>> Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> >>>
> >>>> Every performance improvement project for an interpreter
> >>>> (e.g. <https://lwn.net/Articles/930705/>) shows that the performance
> >>>> of the interpreter matters. The fact that people perform performance
> >>>> measurements shows this, too.
> >>>
> >>> Widely-used interpreted languages such as Python or Matlab are
> >>> known to be as slow as molasses when they do not call highly-
> >>> efficient compiled code. Any improvement (such as the link above)
> >>> is welcome there.
> >>>
> >>> A concept followed by languages like the one used by Julia, which
> >>> uses JIT compilation much more aggressively, seems to be a better
> >>> approach. I have barely glanced at it yet, but it certainly seems
> >>> worth looking into for scientific work.
> >>>
> >> This is what I'm really arguing for: If any real interpreter is doomed
> >> to always be much, much slower than inline binary code, then we need
> >> better ways to (at runtime?) convert the former into the latter, i.e.
> >> JIT optimization of the parts that need it.
> > <
> > {{
> > A real interpreter is going to average DIV 1000
> > Jitting instructions into a trace is going to average DIV 100.
> > A JIT trace is going to average DIV 10
> > }} per interpreted instruction.
> > But here JIT means converting simulated instructions into native
> > instructions with a touch of peepholeing.
> This seems a little pessimistic.
>
> My own past stats were:
> Naive bytecode: ~ 300..1000, depending on the design of the bytecode.
> Dynamic types are slower than static types;
> Stack vs 3R is another factor of 2.
<
Simple RISC instruction interpretation is on the order of 100 instructions
per instruction interpretation. Memory references where you have to update
caches and TLBs are the ones making the average high. Branches are
also expensive if you are keeping ICache and ITLB up to date.
>
<snip>
> Well, and also it would be nice if the operator precedence hierarchy
> were "less stupid".
>
> Like, if it were up to me, say:
> * / %
> + -
> & | ^
> << >>
> == != ...
> && ||
> = += -= ...
<
One does not have to deal with operator precedence when interpreting
non-native ISA (all that crap has been taken care of already.)
>
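
To make the cost concrete, here is a compilable toy sketch of one
emulator step (editor's illustration with an invented two-opcode ISA,
not code from this thread). The comments mark where a full emulator
spends the host instructions that push the average toward the ~100
figure quoted above.

#include <stdint.h>
#include <string.h>

enum { OP_ADDI = 0, OP_LOAD = 1 };

typedef struct {
    uint64_t pc;
    uint64_t regs[32];
    uint8_t *mem;               /* flat guest memory for the toy case */
} Cpu;

void step(Cpu *c)
{
    /* Fetch: a full emulator consults an ITLB and ICache model here. */
    uint32_t insn;
    memcpy(&insn, c->mem + c->pc, 4);

    /* Decode: shifts and masks, repeated on every execution
       (no trace cache in this sketch). */
    uint32_t op  = insn & 0xff;
    uint32_t rd  = (insn >> 8)  & 31;
    uint32_t rs  = (insn >> 13) & 31;
    int32_t  imm = (int32_t)insn >> 18;

    /* Dispatch: a switch or indirect branch, hard on the predictor. */
    switch (op) {
    case OP_ADDI:               /* the cheap case: a few host insns */
        c->regs[rd] = c->regs[rs] + imm;
        break;
    case OP_LOAD:               /* a full emulator does a DTLB walk and
                                   a cache-model update here; this is
                                   what drags the average up */
        memcpy(&c->regs[rd], c->mem + c->regs[rs] + imm, 8);
        break;
    }
    c->pc += 4;
}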

Re: Intel goes to 32 GPRs

<ua93a7$3c7cp$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33533&group=comp.arch#33533

Newsgroups: comp.arch
 by: BGB - Mon, 31 Jul 2023 19:48 UTC

On 7/31/2023 1:59 PM, MitchAlsup wrote:
> On Monday, July 31, 2023 at 1:45:32 PM UTC-5, BGB wrote:
>> On 7/31/2023 12:00 PM, MitchAlsup wrote:
>>> On Monday, July 31, 2023 at 2:09:27 AM UTC-5, Terje Mathisen wrote:
>>>> Thomas Koenig wrote:
>>>>> Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>>>>>
>>>>>> Every performance improvement project for an interpreter
>>>>>> (e.g. <https://lwn.net/Articles/930705/>) shows that the performance
>>>>>> of the interpreter matters. The fact that people perform performance
>>>>>> measurements shows this, too.
>>>>>
>>>>> Widely-used interpreted languages such as Python or Matlab are
>>>>> known to be as slow as molasses when they do not call highly-
>>>>> efficient compiled code. Any improvement (such as the link above)
>>>>> is welcome there.
>>>>>
>>>>> A concept followed by languages like the one used by Julia, which
>>>>> uses JIT compilation much more aggressively, seems to be a better
>>>>> approach. I have barely glanced at it yet, but it certainly seems
>>>>> worth looking into for scientific work.
>>>>>
>>>> This is what I'm really arguing for: If any real interpreter is doomed
>>>> to always be much, much slower than inline binary code, then we need
>>>> better ways to (at runtime?) convert the former into the latter, i.e.
>>>> JIT optimization of the parts that need it.
>>> <
>>> {{
>>> A real interpreter is going to average DIV 1000
>>> Jitting instructions into a trace is going to average DIV 100.
>>> A JIT trace is going to average DIV 10
>>> }} per interpreted instruction.
>>> But here JIT means converting simulated instructions into native
>>> instructions with a touch of peepholeing.
>> This seems a little pessimistic.
>>
>> My own past stats were:
>> Naive bytecode: ~ 300..1000, depending on the design of the bytecode.
>> Dynamic types are slower than static types;
>> Stack vs 3R is another factor of 2.
> <
> Simple RISC instruction interpretation is on the order of 100 instructions
> per instruction interpretation. Memory references where you have to update
> caches and TLBs are the ones making the average high. Branches are
> also expensive if you are keeping ICache and ITLB up to date.

For emulating the SH-2, I had a mechanism to mark which pages held
previously decoded instruction traces, and would trigger a trace-flush
if any of these pages were written to (with some special handling to
allow the current trace to "finish gracefully", the I$/trace flush being
handled like a sort of "non-throwing exception").

Early on, I had tried a mechanism to selectively flush traces based on
whether they touched the modified page; but then found it was faster to
simply flush everything and start over from a clean slate whenever this
happened (the JIT was similar; if the JIT cache got full, flush
everything and start over).

For BJX1 and BJX2, the rules were slightly different, and one needs to
trigger an I$ flush manually if any executable code in memory is
modified (otherwise, behavior is undefined; and may result in executing
stale code).

Note that trace dispatch is typically via a sort of set-associative hash
table, with the traces also caching pointers to other traces for direct
branches (to limit the cases where dispatch needs to fall back to the
general-case hash-table lookup).

In most of my other interpreters, it is more convenient to assume some
variation of a Harvard architecture.
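
A minimal sketch of that page-marking scheme (editor's illustration;
the names and sizes are invented, not BGB's actual code):

#include <stdint.h>
#include <string.h>

#define NPAGES (1u << 20)            /* 4GB guest space in 4K pages */

typedef struct {
    uint8_t page_has_trace[NPAGES];  /* set when a trace is decoded */
    /* ... trace cache, dispatch hash table, etc. ... */
} TraceState;

static void on_trace_decoded(TraceState *ts, uint64_t guest_pc)
{
    ts->page_has_trace[(guest_pc >> 12) & (NPAGES - 1)] = 1;
}

/* Called on every guest store; flushing everything on a hit was
   reportedly faster than tracking which traces touch which page. */
static void on_guest_store(TraceState *ts, uint64_t guest_addr)
{
    if (ts->page_has_trace[(guest_addr >> 12) & (NPAGES - 1)]) {
        memset(ts->page_has_trace, 0, sizeof ts->page_has_trace);
        /* ... flush the trace cache, letting the current trace
           finish gracefully (a non-throwing exception) ... */
    }
}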

>>
> <snip>
>> Well, and also it would be nice if the operator precedence hierarchy
>> were "less stupid".
>>
>> Like, if it were up to me, say:
>> * / %
>> + -
>> & | ^
>> << >>
>> == != ...
>> && ||
>> = += -= ...
> <
> One does not have to deal with operator precedence when interpreting
> non-native ISA (all that crap has been taken care of already.)

Yeah, at this point I was talking about the language level.
Granted, bytecode or below does not need to care about operator precedence.

Some of my VMs had been designed to use a stack bytecode with the
assumption that it will be translated on load to a 3AC trace-form
internally (as opposed to running the stack ops directly).

Though, this does imply putting some restrictions on how the stack may
be used by the bytecode; but given things like .NET and similar had
similar restrictions, I suspect they were operating under a similar
assumption.
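
A minimal sketch of such a load-time stack-to-3AC translation
(editor's illustration, invented opcode names). It leans on exactly
the restriction mentioned above: the stack depth and contents must be
statically determined at each bytecode position.

#include <stdint.h>

typedef struct { int op, dst, srca, srcb; } Tac;  /* dst = srca op srcb */

enum { B_LOADLOCAL, B_ADD };   /* stack bytecode ops */
enum { T_MOV, T_ADD };         /* 3AC ops */

int translate(const uint8_t *bc, int n, Tac *out)
{
    int stack[64], sp = 0, ntmp = 0, nout = 0;
    for (int i = 0; i < n; ) {
        switch (bc[i++]) {
        case B_LOADLOCAL: {    /* "push local #k" becomes a temp move */
            int t = ntmp++;
            out[nout++] = (Tac){ T_MOV, t, bc[i++], 0 };
            stack[sp++] = t;
            break;
        }
        case B_ADD: {          /* "pop two, push sum" becomes one 3AC op */
            int b = stack[--sp], a = stack[--sp], t = ntmp++;
            out[nout++] = (Tac){ T_ADD, t, a, b };
            stack[sp++] = t;
            break;
        }
        }
    }
    return nout;
}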

One can similarly have code that is dynamically or 'auto' typed at the
level of the stack bytecode, with the types already resolved by the
time of the 3AC code; but this does create a bit of a messy situation
over who is responsible for the language-level type-promotion handling.

In BGBCC, it ended up being handled mostly in the stack-IR to 3AC stage,
which is "not ideal" (in .NET, the front-end dealt with it, and had much
more simplistic / explicit handling of type conversion and promotion in
the backend).

If I were "doing it all again", I might be more inclined to more closely
follow .NET's model in this area...

I had started work on a new (possible) C compiler, but this is likely to
go directly from ASTs to 3AC without an intermediate stack-IR stage.
Well, unless one argues that I should still include such an IR stage.

....

Re: Intel goes to 32 GPRs

<6f230587-b426-4d46-89ea-e4b4b57763d1n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33534&group=comp.arch#33534

Newsgroups: comp.arch
 by: JimBrakefield - Mon, 31 Jul 2023 22:40 UTC

On Monday, July 31, 2023 at 11:34:46 AM UTC-5, Anton Ertl wrote:
> Terje Mathisen <terje.m...@tmsw.no> writes:
> >Thomas Koenig wrote:
> >> A concept followed by languages like the one used by Julia, which
> >> uses JIT compilation much more aggressively, seems to be a better
> >> approach. I have barely glanced at it yet, but it certainly seems
> >> worth looking into for scientific work.
> Julia is interesting in many respects, but it uses a slow compiler
> (IIRC LLVM) plus caching in a JIT-like setting. I.e., when the
> compiler sees new code, you see delays far beyond what you see with
> interpreters or more conventional JITs. This may be a good choice for
> Julia, but I have my doubts that it would have gained popularity for a
> language with a different application area.
> >This is what I'm really arguing for: If any real interpreter is doomed
> >to always be much, much slower than inline binary code, then we need
> >better ways to (at runtime?) convert the former into the latter, i.e.
> >JIT optimization of the parts that need it.
> You may think so, but the reality often is: When a language designer
> wants to design an interactive language, the cheap and straightforward
> way is to write an interpreter.
>
> Will you finance the teams that are necessary to implement JIT
> compilers for every newfangled language?
>
> Will you finance the teams necessary to implement JIT compilers for
> those languages that have found a following? That's of course a much
> smaller number of languages, but then, each of these teams will have
> to deal with entrenched usage and may not be able to replace the
> interpreter. Note that Pypy was started 16 years ago, and yet
> CPython (the slow interpreter) is by far the most popular Python
> implementation.
>
> You can put your hopes in Truffle, and maybe Truffle will change the
> way aspiring language designers work, but if so, it will be a big
> marketing success.
> >Changing the hardware to make indirect function calls and returns far
> >faster would be nice, but they will always cost a lot more than zero.
> Indirect function calls and returns are fast on modern hardware in the
> common case (except if you work around Spectre with retpolines, then
> they are very slow).
> >I believe conditional branches is the one CPU feature where a naive
> >interpreter (calling a function for each opcode) can be relatively fast:
> >Calculate the new IP with a conditional/predicated move and return, with
> >no branch predictor issues.
> I don't know many interpreters that call a function for each opcode
> (actually only one: PFE); many use a big switch in one function; this
> is better for performance, because more than one virtual-machine
> register can be passed in real registers. Some are more advanced and
> use techniques like threaded code. In every one of these cases, you
> have an indirect call or indirect branch for every VM instruction, and
> you therefore need indirect branch prediction dearly; compiling an
> interpreter, as a Spectre v2 mitigation, to use a retpoline for the indirect
> jumps/calls causes a huge slowdown.
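
For concreteness, here is a minimal sketch of the two dispatch styles
Anton contrasts: a big switch in one function, and threaded code via
gcc's labels-as-values extension (editor's illustration with an
invented three-opcode toy ISA):

#include <stdint.h>

enum { OP_INC, OP_DEC, OP_HALT };

/* Style 1: one big switch in one function.  The VM registers (ip,
   acc) can stay in host registers across all cases, but every
   iteration funnels through one hard-to-predict dispatch site. */
long run_switch(const uint8_t *prog)
{
    long acc = 0;
    for (const uint8_t *ip = prog;;) {
        switch (*ip++) {
        case OP_INC:  acc++; break;
        case OP_DEC:  acc--; break;
        case OP_HALT: return acc;
        }
    }
}

/* Style 2: direct-threaded dispatch.  Each handler ends with its own
   indirect jump, giving the predictor one site per opcode. */
long run_threaded(const uint8_t *prog)
{
    static void *tab[] = { &&inc, &&dec, &&halt };
    long acc = 0;
    const uint8_t *ip = prog;
    goto *tab[*ip++];
inc:  acc++; goto *tab[*ip++];
dec:  acc--; goto *tab[*ip++];
halt: return acc;
}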
>
> Concerning a VM-level conditional branch, if you use a conditional
> move to set the instruction pointer (IP), yes, you have eliminated
> that branch prediction, but a few instructions later there will be a
> fetch from that IP and then an indirect branch (or indirect call)
> based on the fetched value, and that has to be predicted, and it will
> tend to predict worse than the conditional branch predictor (which is
> tuned for conditional branches). I am not an expert in indirect
> branch predictors in current CPUs, but if the conditional-branch
> history is also used for predicting indirect branches, it may be
> detrimental to eliminate the conditional branch from the
> implementation of the VM-level conditional branch.
>
> There is one technique called selective inlining (aka dynamic
> superinstructions) that eliminates the indirect branches for
> straight-line VM code. In that case a VM-level conditional branch
> only needs an indirect branch in the branch-taken case, but then you
> need a native-code-level conditional branch to skip the indirect
> branch.
>
> E.g., for the Forth code
>
> : foo dup if 1+ then ;
>
> the resulting VM code plus the machine code is (gforth-fast on
> RISC-V):
>
> $3FAA160638 dup 1->2
> 0x0000003fa9e04ede: mv s0,s7
> $3FAA160640 ?branch 2->1
> $3FAA160648 <foo+$20>
> 0x0000003fa9e04ee0: addi s10,s10,24
> 0x0000003fa9e04ee2: ld a5,-8(s10)
> 0x0000003fa9e04ee6: bnez s0,0x3fa9e04eee
> 0x0000003fa9e04ee8: ld a4,0(a5)
> 0x0000003fa9e04eea: mv s10,a5
> 0x0000003fa9e04eec: jr a4
> $3FAA160650 1+ 1->1
> 0x0000003fa9e04eee: addi s7,s7,1
> 0x0000003fa9e04ef0: addi s10,s10,8
> $3FAA160658 ;s 1->1
> 0x0000003fa9e04ef2: ld a6,0(s2)
> 0x0000003fa9e04ef6: addi s2,s2,8
> 0x0000003fa9e04ef8: mv s10,a6
> 0x0000003fa9e04efa: ld a4,0(s10)
> 0x0000003fa9e04efe: jr a4
>
> The two "jr" instructions are the indirect branches. The "bnez"
> branches around the branch-taken case and its indirect branch to the
> code for 1+.
> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

|> Julia is interesting in many respects

I am wondering about using Julia's parse-tree, meta, and macro
capabilities to generate assembler source code for various
architectures. E.g., using compilation that supports constants,
variables, and simple loops, one could with minimal effort and some
template matching generate assembler or binary source.

Re: Intel goes to 32 GPRs

<uab3o8$3m4sn$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33536&group=comp.arch#33536

Newsgroups: comp.arch
 by: Terje Mathisen - Tue, 1 Aug 2023 14:10 UTC

Thomas Koenig wrote:
> Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
>
> [...]
>
>> in the case of having non-saved registers
>> that you'd like to use across function calls, I would probably consider
>> writing wrappers for those function calls: The wrapper would do the
>> save, call the actual function, then restore and return.
>
> The discussion on gcc has turned up an interesting option that I had
> overlooked:
>
> '-fipa-ra'
> Use caller save registers for allocation if those registers are not
> used by any called function. In that case it is not necessary to
> save and restore them around calls. This is only possible if
> called functions are part of same compilation unit as current
> function and they are compiled before it.
>
> Enabled at levels '-O2', '-O3', '-Os', however the option is
> disabled if generated code will be instrumented for profiling
> ('-p', or '-pg') or if callee's register usage cannot be known
> exactly (this happens on targets that do not expose prologues and
> epilogues in RTL).
>
> This should also be enabled with LTO.

So this is basically the compiler people doing automatically the same
kind of register optimizations that I do manually in my asm. Nice!

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Intel goes to 32 GPRs

<uab3s2$3m4sn$2@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33537&group=comp.arch#33537

Newsgroups: comp.arch
 by: Terje Mathisen - Tue, 1 Aug 2023 14:12 UTC

MitchAlsup wrote:
> On Monday, July 31, 2023 at 1:45:32 PM UTC-5, BGB wrote:
>> On 7/31/2023 12:00 PM, MitchAlsup wrote:
>>> On Monday, July 31, 2023 at 2:09:27 AM UTC-5, Terje Mathisen wrote:
>>>> Thomas Koenig wrote:
>>>>> Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>>>>>
>>>>>> Every performance improvement project for an interpreter
>>>>>> (e.g. <https://lwn.net/Articles/930705/>) shows that the performance
>>>>>> of the interpreter matters. The fact that people perform performance
>>>>>> measurements shows this, too.
>>>>>
>>>>> Widely-used interpreted languages such as Python or Matlab are
>>>>> known to be as slow as molasses when they do not call highly-
>>>>> efficient compiled code. Any improvement (such as the link above)
>>>>> is welcome there.
>>>>>
>>>>> A concept followed by languages like the one used by Julia, which
>>>>> uses JIT compilation much more aggressively, seems to be a better
>>>>> approach. I have barely glanced at it yet, but it certainly seems
>>>>> worth looking into for scientific work.
>>>>>
>>>> This is what I'm really arguing for: If any real interpreter is doomed
>>>> to always be much, much slower than inline binary code, then we need
>>>> better ways to (at runtime?) convert the former into the latter, i.e.
>>>> JIT optimization of the parts that need it.
>>> <
>>> {{
>>> A real interpreter is going to average DIV 1000
>>> Jitting instructions into a trace is going to average DIV 100.
>>> A JIT trace is going to average DIV 10
>>> }} per interpreted instruction.
>>> But here JIT means converting simulated instructions into native
>>> instructions with a touch of peepholeing.
>> This seems a little pessimistic.
>>
>> My own past stats were:
>> Naive bytecode: ~ 300..1000, depending on the design of the bytecode.
>> Dynamic types are slower than static types;
>> Stack vs 3R is another factor of 2.
> <
> Simple RISC instruction interpretation is on the order of 100 instructions
> per instruction interpretation. Memory references where you have to update
> caches and TLBs are the ones making the average high. Branches are
> also expensive if you are keeping ICache and ITLB up to date.

It sounds like your numbers are for a CPU emulator, not just an
interpreter of the running code?

Terje

>>
> <snip>
>> Well, and also it would be nice if the operator precedence hierarchy
>> were "less stupid".
>>
>> Like, if it were up to me, say:
>> * / %
>> + -
>> & | ^
>> << >>
>> == != ...
>> && ||
>> = += -= ...
> <
> One does not have to deal with operator precedence when interpreting
> non-native ISA (all that crap has been taken care of already.)
>>
>

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Intel goes to 32 GPRs

<uaba1j$3mt5k$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33538&group=comp.arch#33538

Newsgroups: comp.arch
 by: BGB - Tue, 1 Aug 2023 15:56 UTC

On 8/1/2023 9:12 AM, Terje Mathisen wrote:
> MitchAlsup wrote:
>> On Monday, July 31, 2023 at 1:45:32 PM UTC-5, BGB wrote:
>>> On 7/31/2023 12:00 PM, MitchAlsup wrote:
>>>> On Monday, July 31, 2023 at 2:09:27 AM UTC-5, Terje Mathisen wrote:
>>>>> Thomas Koenig wrote:
>>>>>> Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>>>>>>
>>>>>>> Every performance improvement project for an interpreter
>>>>>>> (e.g. <https://lwn.net/Articles/930705/>) shows that the performance
>>>>>>> of the interpreter matters. The fact that people perform performance
>>>>>>> measurements shows this, too.
>>>>>>
>>>>>> Widely-used interpreted languages such as Python or Matlab are
>>>>>> known to be as slow as molasses when they do not call highly-
>>>>>> efficient compiled code. Any improvement (such as the link above)
>>>>>> is welcome there.
>>>>>>
>>>>>> A concept followed by languages like the one used by Julia, which
>>>>>> uses JIT compilation much more aggressively, seems to be a better
>>>>>> approach. I have barely glanced at it yet, but it certainly seems
>>>>>> worth looking into for scientific work.
>>>>>>
>>>>> This is what I'm really arguing for: If any real interpreter is doomed
>>>>> to always be much, much slower than inline binary code, then we need
>>>>> better ways to (at runtime?) convert the former into the latter, i.e.
>>>>> JIT optimization of the parts that need it.
>>>> <
>>>> {{
>>>> A real interpreter is going to average DIV 1000
>>>> Jitting instructions into a trace is going to average DIV 100.
>>>> A JIT trace is going to average DIV 10
>>>> }} per interpreted instruction.
>>>> But here JIT means converting simulated instructions into native
>>>> instructions with a touch of peepholeing.
>>> This seems a little pessimistic.
>>>
>>> My own past stats were:
>>> Naive bytecode: ~ 300..1000, depending on the design of the bytecode.
>>> Dynamic types are slower than static types;
>>> Stack vs 3R is another factor of 2.
>> <
>> Simple RISC instruction interpretation is on the order of 100
>> instructions
>> per instruction interpretation. Memory references where you have to
>> update
>> caches and TLBs are the ones making the average high. Branches are
>> also expensive if you are keeping ICache and ITLB up to date.
>
> It sounds like your numbers are for a CPU emulator, not just an
> interpreter of the running code?
>

Yeah. Or, in this case, seemingly an emulator which decodes instructions
inline rather than decoding them in advance and keeping them in a
trace-cache or similar...

My fastest "plain interpreters" got to around 10x slower than native,
but as can be noted, these were using a three-address representation and
decoding into a trace-cache; and ran directly in the host address space
rather than in an emulated memory map.

Emulating the memory map (and modeling the cache) is one of the major
costs of an emulator.

If going for speed rather than trying to keep track of clock cycles, one
useful speedup is to keep a sort of small TLB of "already translated"
addresses (which point to the "backing memory" for the page).

So, say, something like:
u64 MemGetQWord(VM_CTX *ctx, vm_addr addr)
{
    char *p;
    int ix, ix2;

    /* Index a 64-entry direct-mapped "micro TLB" by page number;
       ix2 guards against the 8-byte read crossing a page boundary. */
    ix = (addr >> 12) & 63;
    ix2 = ((addr + 7) >> 12) & 63;
    if ((ctx->utlb_addr_rd[ix] == (addr & VM_ADDR_PAGE_MASK)) && (ix == ix2))
    {
        /* Hit: read straight from the page's backing memory. */
        p = ctx->utlb_ptr_rd[ix];
        return (*(u64 *)(p + (addr & 4095)));
    }
    /* Miss: ... handle memory access as normal (translate the
       address, then refill this micro-TLB entry) ... */
}

This can in turn sidestep much of the usual cost of the address
translation and memory-span-lookup steps (and can also deal with
emulating paged virtual memory and similar).

Though, some emulators had apparently gained some speed by leveraging
things like "nested page tables" and similar to run the emulated target
inside of a memory map managed by the underlying CPU (also presumably
running a JIT as well).

> Terje
>
>>>
>> <snip>
>>> Well, and also it would be nice if the operator precedence hierarchy
>>> were "less stupid".
>>>
>>> Like, if it were up to me, say:
>>> * / %
>>> + -
>>> & | ^
>>> << >>
>>> == != ...
>>> && ||
>>> = += -= ...
>> <
>> One does not have to deal with operator precedence when interpreting
>> non-native ISA (all that crap has been taken care of already.)
>>>
>>
>
>

Re: Intel goes to 32 GPRs

<uabnh5$1l37t$1@newsreader4.netcologne.de>

https://news.novabbs.org/devel/article-flat.php?id=33539&group=comp.arch#33539

Newsgroups: comp.arch
 by: Thomas Koenig - Tue, 1 Aug 2023 19:47 UTC

Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
> Thomas Koenig wrote:
>> Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
>>
>> [...]
>>
>>> in the case of having non-saved registers
>>> that you'd like to use across function calls, I would probably consider
>>> writing wrappers for those function calls: The wrapper would do the
>>> save, call the actual function, then restore and return.
>>
>> The discussion on gcc has turned up an interesting option that I had
>> overlooked:
>>
>> '-fipa-ra'
>> Use caller save registers for allocation if those registers are not
>> used by any called function. In that case it is not necessary to
>> save and restore them around calls. This is only possible if
>> called functions are part of same compilation unit as current
>> function and they are compiled before it.
>>
>> Enabled at levels '-O2', '-O3', '-Os', however the option is
>> disabled if generated code will be instrumented for profiling
>> ('-p', or '-pg') or if callee's register usage cannot be known
>> exactly (this happens on targets that do not expose prologues and
>> epilogues in RTL).
>>
>> This should also be enabled with LTO.
>
> So this is basically the compiler people doing automatically the same
> kind of register optimizations that I do manually in my asm. Nice!

Yes, but only if the called function is visible.

If you link code from separate object files, or use shared libraries,
this is currently not possible.

At least for the first case, it would be interesting to record the
registers that are actually used in the object file, and use that
for later optimization.

Re: Intel goes to 32 GPRs

<92bc7f04-33bb-4a4f-902b-7d95ac54a38fn@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33540&group=comp.arch#33540

Newsgroups: comp.arch
 by: MitchAlsup - Tue, 1 Aug 2023 19:54 UTC

On Tuesday, August 1, 2023 at 2:47:52 PM UTC-5, Thomas Koenig wrote:
> Terje Mathisen <terje.m...@tmsw.no> schrieb:
> > Thomas Koenig wrote:
> >> Terje Mathisen <terje.m...@tmsw.no> schrieb:
> >>
> >> [...]
> >>
> >>> in the case of having non-saved registers
> >>> that you'd like to use across function calls, I would probably consider
> >>> writing wrappers for those function calls: The wrapper would do the
> >>> save, call the actual function, then restore and return.
> >>
> >> The discussion on gcc has turned up an interesting option that I had
> >> overlooked:
> >>
> >> '-fipa-ra'
> >> Use caller save registers for allocation if those registers are not
> >> used by any called function. In that case it is not necessary to
> >> save and restore them around calls. This is only possible if
> >> called functions are part of same compilation unit as current
> >> function and they are compiled before it.
> >>
> >> Enabled at levels '-O2', '-O3', '-Os', however the option is
> >> disabled if generated code will be instrumented for profiling
> >> ('-p', or '-pg') or if callee's register usage cannot be known
> >> exactly (this happens on targets that do not expose prologues and
> >> epilogues in RTL).
> >>
> >> This should also be enabled with LTO.
> >
> > So this is basically the compiler people doing automatically the same
> > kind of register optimizations that I do manually in my asm. Nice!
> Yes, but only if the called function is visible.
>
> If you link code from separate object files, or use shared libraries,
> this is currently not possible.
<
I thought Ivan (i.e., Mill) was doing this.
>
> At least for the first case, it would be interesting to record the
> registers that are actually used in the object file, and use that
> for later optimization.

Re: Intel goes to 32 GPRs

<kit8h8Fl6gnU1@mid.individual.net>

https://news.novabbs.org/devel/article-flat.php?id=33541&group=comp.arch#33541

Newsgroups: comp.arch
 by: Niklas Holsti - Tue, 1 Aug 2023 20:33 UTC

On 2023-08-01 22:54, MitchAlsup wrote:
> On Tuesday, August 1, 2023 at 2:47:52 PM UTC-5, Thomas Koenig wrote:
>> Terje Mathisen <terje.m...@tmsw.no> schrieb:
>>> Thomas Koenig wrote:
>>>> Terje Mathisen <terje.m...@tmsw.no> schrieb:
>>>>
>>>> [...]
>>>>
>>>>> in the case of having non-saved registers
>>>>> that you'd like to use across function calls, I would probably consider
>>>>> writing wrappers for those function calls: The wrapper would do the
>>>>> save, call the actual function, then restore and return.
>>>>
>>>> The discussion on gcc has turned up an interesting option that I had
>>>> overlooked:
>>>>
>>>> '-fipa-ra'
>>>> Use caller save registers for allocation if those registers are not
>>>> used by any called function. In that case it is not necessary to
>>>> save and restore them around calls. This is only possible if
>>>> called functions are part of same compilation unit as current
>>>> function and they are compiled before it.
>>>>
>>>> Enabled at levels '-O2', '-O3', '-Os', however the option is
>>>> disabled if generated code will be instrumented for profiling
>>>> ('-p', or '-pg') or if callee's register usage cannot be known
>>>> exactly (this happens on targets that do not expose prologues and
>>>> epilogues in RTL).
>>>>
>>>> This should also be enabled with LTO.
>>>
>>> So this is basically the compiler people doing automatically the same
>>> kind of register optimizations that I do manually in my asm. Nice!
>> Yes, but only if the called function is visible.
>>
>> If you link code from separate object files, or use shared libraries,
>> this is currently not possible.
> <
> I thought Ivan (i.e., Mill) was doing this.

I don't think that is quite the case. The Mill has no general working
registers; they are replaced by the "belt", a kind of finite stack onto
which all operations push their results. Any element of the belt can be
used as an input operand for other operations. The belt is never popped.
When a new result is pushed onto a full belt, the oldest result in the
belt falls off the other end and is lost.

When a call occurs, whether within the same object file or across object
files, the callee gets a new instance of the belt for its use, empty
except for the passed parameters at the front of the belt. The caller's
belt is saved and is invisible to the callee. When the call returns, the
caller sees its belt as it was before the call, except that the values
returned by the callee have been pushed onto it. The callee's belt
vanishes.

So there is no question of callee-saved vs caller-saved registers, or of
which registers are used by a function. The saving/restoring of the belt
is a HW function and is expected to run in parallel with the normal
computations.
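
As a toy model of the belt semantics described above (editor's
illustrative C; not how Mill hardware actually implements any of
this):

#include <string.h>

#define BELT_LEN 8             /* belt length: an assumed, toy value */

typedef struct {
    long slot[BELT_LEN];       /* slot[0] is the front (newest) */
} Belt;

/* Every operation pushes its result; the oldest value falls off. */
static void belt_push(Belt *b, long v)
{
    memmove(&b->slot[1], &b->slot[0], (BELT_LEN - 1) * sizeof(long));
    b->slot[0] = v;
}

/* A call gives the callee a fresh belt holding only the arguments;
   the caller's belt is saved and invisible to the callee. */
static Belt belt_call(const long *args, int nargs)
{
    Belt callee = { {0} };
    for (int i = nargs - 1; i >= 0; i--)
        belt_push(&callee, args[i]);
    return callee;
}

/* On return, the callee's results are pushed onto the caller's
   belt; the callee's belt simply vanishes. */
static void belt_return(Belt *caller, const long *results, int n)
{
    for (int i = n - 1; i >= 0; i--)
        belt_push(caller, results[i]);
}

The model makes the key point explicit: a call never destroys caller
state, so there is nothing like callee-saved vs caller-saved registers
to manage in software.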

Re: Intel goes to 32 GPRs

<44fb72dd-dcf6-4e4e-a4a7-844dcc4028ban@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33542&group=comp.arch#33542

Newsgroups: comp.arch
 by: MitchAlsup - Tue, 1 Aug 2023 20:57 UTC

On Tuesday, August 1, 2023 at 3:33:48 PM UTC-5, Niklas Holsti wrote:
> On 2023-08-01 22:54, MitchAlsup wrote:
> > On Tuesday, August 1, 2023 at 2:47:52 PM UTC-5, Thomas Koenig wrote:
> >> Terje Mathisen <terje.m...@tmsw.no> schrieb:
> >>> Thomas Koenig wrote:
> >>>> Terje Mathisen <terje.m...@tmsw.no> schrieb:
> >>>>
> >>>> [...]
> >>>>
> >>>>> in the case of having non-saved registers
> >>>>> that you'd like to use across function calls, I would probably consider
> >>>>> writing wrappers for those function calls: The wrapper would do the
> >>>>> save, call the actual function, then restore and return.
> >>>>
> >>>> The discussion on gcc has turned up an interesting option that I had
> >>>> overlooked:
> >>>>
> >>>> '-fipa-ra'
> >>>> Use caller save registers for allocation if those registers are not
> >>>> used by any called function. In that case it is not necessary to
> >>>> save and restore them around calls. This is only possible if
> >>>> called functions are part of same compilation unit as current
> >>>> function and they are compiled before it.
> >>>>
> >>>> Enabled at levels '-O2', '-O3', '-Os', however the option is
> >>>> disabled if generated code will be instrumented for profiling
> >>>> ('-p', or '-pg') or if callee's register usage cannot be known
> >>>> exactly (this happens on targets that do not expose prologues and
> >>>> epilogues in RTL).
> >>>>
> >>>> This should also be enabled with LTO.
> >>>
> >>> So this is basically the compiler people doing automatically the same
> >>> kind of register optimizations that I do manually in my asm. Nice!
> >> Yes, but only if the called function is visible.
> >>
> >> If you link code from separate object files, or use shared libraries,
> >> this is currently not possible.
> > <
> > I thought Ivan (i.e., Mill) was doing this.
> I don't think that is quite the case. The Mill has no general working
> registers; they are replaced by the "belt", a kind of finite stack onto
> which all operations push their results. Any element of the belt can be
> used as an input operand for other operations. The belt is never popped.
> When a new result is pushed onto a full belt, the oldest result in the
> belt falls off the other end and is lost.
<
{I know well that Mill does not have a general register model and HW based ABI.}
<
I thought MILL was doing code generation and final optimizations at
link time (specialization) even when the function was not visible.
>
> When a call occurs, whether within the same object file or across object
> files, the callee gets a new instance of the belt for its use, empty
> except for the passed parameters at the front of the belt. The caller's
> belt is saved and is invisible to the callee. When the call returns, the
> caller sees its belt as it was before the call, except that the values
> returned by the callee have been pushed onto it. The callee's belt
> vanishes.
<
It is the LTO I was talking about, not the belt/register distinction.
>
> So there is no question of callee-saved vs caller-saved registers, or of
> which registers are used by a function. The saving/restoring of the belt
> is a HW function and is expected to run in parallel with the normal
> computations.

Re: Intel goes to 32 GPRs

<uacid2$3ure2$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33543&group=comp.arch#33543

Newsgroups: comp.arch
 by: Stephen Fuld - Wed, 2 Aug 2023 03:26 UTC

On 7/30/2023 2:49 PM, Thomas Koenig wrote:
> Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>
>> Every performance improvement project for an interpreter
>> (e.g. <https://lwn.net/Articles/930705/>) shows that the performance
>> of the interpreter matters. The fact that people perform performance
>> measurements shows this, too.
>
> Widely-used interpreted languages such as Python or Matlab are
> known to be as slow as molasses when they do not call highly-
> efficient compiled code. Any improvement (such as the link above)
> is welcome there.
>
> A concept followed by languages like the one used by Julia, which
> uses JIT compilation much more aggressively, seems to be a better
> approach. I have barely glanced at it yet, but it certainly seems
> worth looking into for scientific work.

I too have been intrigued by Julia. It seems to have a lot going for
it. I have been following it a little, intermittently. There have been
complaints from early on about performance, especially the start up
time. The Julia team has done a lot to improve this. See

https://julialang.org/blog/2023/04/julia-1.9-highlights/

I think they still have work to do, but it is an area of active development.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Intel goes to 32 GPRs

<2023Aug2.090705@mips.complang.tuwien.ac.at>

https://news.novabbs.org/devel/article-flat.php?id=33544&group=comp.arch#33544

 by: Anton Ertl - Wed, 2 Aug 2023 07:07 UTC

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>I too have been intrigued by Julia. It seems to have a lot going for
>it. I have been following it a little, intermittently. There have been
>complaints from early on about performance, especially the start up
>time. The Julia team has done a lot to improve this. See
>
>https://julialang.org/blog/2023/04/julia-1.9-highlights/

And yet the data there show a time-to-load (TTL) of up to 13.5s for
the benchmarks. It's not explained what TTL is but I think that this
is the startup overhead on repeated startup (while the additional
overhead on first execution is TTFX). I had expected TTL to be much
lower.

This kind of startup overhead is unacceptable for many other
interactive languages. And for batch-compiled languages the startup
overhead is also usually small (at least as far as the language can
influence it). I expect that Julia is not used for executing short
scripts, but instead for long interactive sessions with few loadings
or reloadings of programs, and with relatively long computations.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Intel goes to 32 GPRs

<kiukrtFsfgcU1@mid.individual.net>

https://news.novabbs.org/devel/article-flat.php?id=33545&group=comp.arch#33545

 by: Niklas Holsti - Wed, 2 Aug 2023 09:10 UTC

On 2023-08-01 23:57, MitchAlsup wrote:
> On Tuesday, August 1, 2023 at 3:33:48 PM UTC-5, Niklas Holsti wrote:
>> On 2023-08-01 22:54, MitchAlsup wrote:
>>> On Tuesday, August 1, 2023 at 2:47:52 PM UTC-5, Thomas Koenig wrote:
>>>> Terje Mathisen <terje.m...@tmsw.no> schrieb:
>>>>> Thomas Koenig wrote:
>>>>>> Terje Mathisen <terje.m...@tmsw.no> schrieb:
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>>> in the case of having non-saved registers
>>>>>>> that you'd like to use across function calls, I would probably consider
>>>>>>> writing wrappers for those function calls: The wrapper would do the
>>>>>>> save, call the actual function, then restore and return.
>>>>>>
>>>>>> The discussion on gcc has turned up an interesting option that I had
>>>>>> overlooked:
>>>>>>
>>>>>> '-fipa-ra'
>>>>>> Use caller save registers for allocation if those registers are not
>>>>>> used by any called function. In that case it is not necessary to
>>>>>> save and restore them around calls. This is only possible if
>>>>>> called functions are part of same compilation unit as current
>>>>>> function and they are compiled before it.
>>>>>>
>>>>>> Enabled at levels '-O2', '-O3', '-Os', however the option is
>>>>>> disabled if generated code will be instrumented for profiling
>>>>>> ('-p', or '-pg') or if callee's register usage cannot be known
>>>>>> exactly (this happens on targets that do not expose prologues and
>>>>>> epilogues in RTL).
>>>>>>
>>>>>> This should also be enabled with LTO.
>>>>>
>>>>> So this is basically the compiler people doing automatically the same
>>>>> kind of register optimizations that I do manually in my asm. Nice!
>>>> Yes, but only if the called function is visible.
>>>>
>>>> If you link code from separate object files, or use shared libraries,
>>>> this is currently not possible.
>>> <
>>> I thought Ivan (i.e., Mill) was doing this.
>> I don't think that is quite the case. The Mill has no general working
>> registers; they are replaced by the "belt", a kind of finite stack onto
>> which all operations push their results. Any element of the belt can be
>> used as an input operand for other operations. The belt is never popped.
>> When a new result is pushed onto a full belt, the oldest result in the
>> belt falls off the other end and is lost.
> <
> {I know well that Mill does not have a general register model and HW based ABI.}

{Apologies .. I should have known that, but I misunderstood your point,
as you saw.}

> <
> I thought MILL was doing code generation and final optimizations at
> link time (specialization) even when the function was not visible.

The specialization stage is basically a translation from the "generic"
Mill machine code to the machine code for a particular Mill model. I
believe that the "optimization" at this stage is mainly the packing of
operations into bundles for multi-issue, which depends crucially on the
length of the belt and the set of Function Units available in the target
model.

AIUI, "specialization time" is not necessarily the same as "link time",
for the Mill. I think it will be possible to link generic-code modules
into a generic-code program, which is later (say, when the program is
installed on a machine) specialized as a whole with no further (static)
linking necessary.

>> When a call occurs, whether within the same object file or across object
>> files, the callee gets a new instance of the belt for its use, empty
>> except for the passed parameters at the front of the belt. The caller's
>> belt is saved and is invisible to the callee. When the call returns, the
>> caller sees its belt as it was before the call, except that the values
>> returned by the callee have been pushed onto it. The callee's belt
>> vanishes.
> <
> It is the LTO I was talking about, not the belt/register distinction.

Even so, the way the calls use the belt means that there is no question
of optimizing the saving and passing of registers across calls, even in
LTO. The most LTO can do is to inline the callee even for cross-module
calls.
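
For concreteness, a minimal C sketch of the -fipa-ra situation quoted
above (the option is documented as enabled at -O2 and up). The effect
is only visible in the generated assembly and the register choices are
target-dependent; the function names here are made up.

/* 'scale' is defined first, in the same translation unit, and touches
   few registers, so with -fipa-ra the compiler may keep the caller's
   live values in caller-saved registers across the call instead of
   spilling them. Try: gcc -O2 -S and compare with -fno-ipa-ra.        */
static int scale(int x)
{
    return x * 3;
}

int combine(int a, int b)
{
    int t = scale(a);   /* 'b' can stay in a caller-saved register
                           across this call when -fipa-ra applies      */
    return t + b;
}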

Re: Intel goes to 32 GPRs

<uado9d$3mtd$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33547&group=comp.arch#33547

 by: Stephen Fuld - Wed, 2 Aug 2023 14:12 UTC

On 8/2/2023 12:07 AM, Anton Ertl wrote:
> Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>> I too have been intrigued by Julia. It seems to have a lot going for
>> it. I have been following it a little, intermittently. There have been
>> complaints from early on about performance, especially the start up
>> time. The Julia team has done a lot to improve this. See
>>
>> https://julialang.org/blog/2023/04/julia-1.9-highlights/
>
> And yet the data there show a time-to-load (TTL) of up to 13.5s for
> the benchmarks. It's not explained what TTL is but I think that this
> is the startup overhead on repeated startup (while the additional
> overhead on first execution is TTFX). I had expected TTL to be much
> lower.
>
> This kind of startup overhead is unacceptable for many other
> interactive languages. And for batch-compiled languages the startup
> overhead is also usually small (at least as far as the language can
> influence it). I expect that Julia is not used for executing short
> scripts, but instead for long interactive sessions with few loadings
> or reloadings of programs, and with relatively long computations.

As you can see from the sentence you didn't copy from my post, I agree
with you, and furthermore, I think the developers do too. Also, given
its primary use as a "scientific oriented" language, I expect that
your expectations of its usage patterns are correct, but I am no expert
in the field.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Intel goes to 32 GPRs

<uadp5n$3s17$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33548&group=comp.arch#33548

 by: Stephen Fuld - Wed, 2 Aug 2023 14:28 UTC

On 8/2/2023 2:10 AM, Niklas Holsti wrote:
> On 2023-08-01 23:57, MitchAlsup wrote:

snip

>> I thought MILL was doing code generation and final optimizations at
>> link time (specialization) even when the function was not visible.
>
>
> The specialization stage is basically a translation from the "generic"
> Mill machine code to the machine code for a particular Mill model. I
> believe that the "optimization" at this stage is mainly the packing of
> operations into bundles for multi-issue, which depends crucially on the
> length of the belt and the set of Function Units available in the target
> model.
>
> AIUI, "specialization time" is not necessarily the same as "link time",
> for the Mill.

Right.

> I think it will be possible to link generic-code modules
> into a generic-code program, which is later (say, when the program is
> installed on a machine) specialized as a whole with no further (static)
> linking necessary.

That implies that link time occurs prior to specialization time. I
don't know, but ISTM that would increase specialization time, as you
would have to "repspecialize" the "library modules" each time you made a
change to the source and had to recompile and thus relink the program.
Doing the specialization before the link would eliminate that.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Intel goes to 32 GPRs

<kivmnhF2eqqU1@mid.individual.net>

https://news.novabbs.org/devel/article-flat.php?id=33551&group=comp.arch#33551

 by: Niklas Holsti - Wed, 2 Aug 2023 18:48 UTC

On 2023-08-02 17:28, Stephen Fuld wrote:
> On 8/2/2023 2:10 AM, Niklas Holsti wrote:
>> On 2023-08-01 23:57, MitchAlsup wrote:
>
> snip
>
>>> I thought MILL was doing code generation and final optimizations at
>>> link time (specialization) even when the function was not visible.
>>
>>
>> The specialization stage is basically a translation from the "generic"
>> Mill machine code to the machine code for a particular Mill model. I
>> believe that the "optimization" at this stage is mainly the packing of
>> operations into bundles for multi-issue, which depends crucially on
>> the length of the belt and the set of Function Units available in the
>> target model.
>>
>> AIUI, "specialization time" is not necessarily the same as "link
>> time", for the Mill.
>
> Right.
>
>
>> I think it will be possible to link generic-code modules into a
>> generic-code program, which is later (say, when the program is
>> installed on a machine) specialized as a whole with no further
>> (static) linking necessary.
>
> That implies that link time occurs prior to specialization time.

I did not mean that this would always be the order, only that it would
be a possible order.

> I don't know, but ISTM that would increase specialization time, as
> you would have to "respecialize" the "library modules" each time you
> made a change to the source and had to recompile and thus relink the
> program. Doing the specialization before the link would eliminate
> that.

Yes, and I think it will also be possible to do it in that order, perhaps
during program development when there is no need to generate a portable
generic-code version of the program.

Re: Intel goes to 32 GPRs

<uai38i$15j70$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33557&group=comp.arch#33557

 by: Ivan Godard - Fri, 4 Aug 2023 05:44 UTC

On 8/2/2023 7:28 AM, Stephen Fuld wrote:
> On 8/2/2023 2:10 AM, Niklas Holsti wrote:
>> On 2023-08-01 23:57, MitchAlsup wrote:
>
> snip
>
>>> I thought MILL was doing code generation and final optimizations at
>>> link time (specialization) even when the function was not visible.
>>
>>
>> The specialization stage is basically a translation from the "generic"
>> Mill machine code to the machine code for a particular Mill model. I
>> believe that the "optimization" at this stage is mainly the packing of
>> operations into bundles for multi-issue, which depends crucially on
>> the length of the belt and the set of Function Units available in the
>> target model.
>>
>> AIUI, "specialization time" is not necessarily the same as "link
>> time", for the Mill.
>
> Right.
>
>
>> I think it will be possible to link generic-code modules into a
>> generic-code program, which is later (say, when the program is
>> installed on a machine) specialized as a whole with no further
>> (static) linking necessary.
>
> That implies that link time occurs prior to specialization time.  I
> don't know, but ISTM that would increase specialization time, as you
> would have to "respecialize" the "library modules" each time you made a
> change to the source and had to recompile and thus relink the program.
> Doing the specialization before the link would eliminate that.

Linking before specialization has the same advantages and costs as any
other LTO: you have to regenerate the machine code from the optimized
program as linked. You can link after specialization too, or mix with
some modules prelinked and some postlinked.

Re: Intel goes to 32 GPRs

<ual08k$1r24u$1@newsreader4.netcologne.de>

https://news.novabbs.org/devel/article-flat.php?id=33560&group=comp.arch#33560

 by: Thomas Koenig - Sat, 5 Aug 2023 08:12 UTC

Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
> Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>>I too have been intrigued by Julia. It seems to have a lot going for
>>it. I have been following it a little, intermittently. There have been
>>complaints from early on about performance, especially the start up
>>time. The Julia team has done a lot to improve this. See
>>
>>https://julialang.org/blog/2023/04/julia-1.9-highlights/
>
> And yet the data there show a time-to-load (TTL) of up to 13.5s for
> the benchmarks. It's not explained what TTL is but I think that this
> is the startup overhead on repeated startup (while the additional
> overhead on first execution is TTFX). I had expected TTL to be much
> lower.

The Julia core is slow to start up, but not that slow:

$ echo "" | time ./julia
0.24user 0.27system 0:00.15elapsed 338%CPU (0avgtext+0avgdata 230452maxresident)k
0inputs+0outputs (0major+29689minor)pagefaults 0swaps

(where this is actually from "spinning rust", as Terje is wont to say).

The long loading times are for packages, where they have now
improved the precompilation.

> This kind of startup overhead is unacceptable for many other
> interactive languages.

People don't use Julia for extracting the first column of a text file;
awk or cut or perl or Python do that job.

Where it shines is its performance for mathematical tasks.
Take solving ordinary differential equations. In Python, the
expression for the right-hand side has to be interpreted, which
makes it slooooow. Julia compiles it JIT and gets performance
fairly close to that of a compiled language.

And the advantage of using an interpreter, rather than a library in
a compiled language, is ease of use - no need to worry about
overlong argument lists.

In a way, this is funny. Pre-Fortran 90 scientific packages
suffer from overlong argument lists, mostly for array bounds and
work space. Fortran 90, influenced by Matlab, introduced ways to
do away with that, but when it was gaining traction, many people
had already moved to C, which suffers from the same shortcomings
as FORTRAN <= 77, plus some more.

I've worked with Sundials, and I have certainly wished for a version
with fewer possibilities for bugs...
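
To illustrate the kind of argument list meant here, this is roughly
what calling a classic F77-style routine from C looks like (LAPACK's
DGESV; the trailing-underscore binding shown is a common but
platform-dependent convention):

/* Solve A*x = b for a 3x3 system: every dimension, leading dimension
   and pivot workspace is a separate, caller-supplied argument.        */
extern void dgesv_(int *n, int *nrhs, double *a, int *lda,
                   int *ipiv, double *b, int *ldb, int *info);

void solve3(double a[9], double b[3])   /* a in column-major order */
{
    int n = 3, nrhs = 1, lda = 3, ldb = 3, info;
    int ipiv[3];                        /* pivot workspace */
    dgesv_(&n, &nrhs, a, &lda, ipiv, b, &ldb, &info);
    /* info != 0 signals failure; checking it is more boilerplate,
       and getting any of the bounds wrong is a classic bug source. */
}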

>And for batch-compiled languages the startup
> overhead is also usually small (at least as far as the language can
> influence it). I expect that Julia is not used for executing short
> scripts, but instead for long interactive sessions with few loadings
> or reloadings of programs, and with relatively long computations.

Or in batch.

Re: Intel goes to 32-bit general purpose registers

<341fc74-15ae-5da9-2d2d-673465f29b@elronnd.net>

https://news.novabbs.org/devel/article-flat.php?id=33620&group=comp.arch#33620

 by: Elijah Stone - Sat, 12 Aug 2023 22:26 UTC

On Sun, 30 Jul 2023, Anton Ertl wrote:

> Apart from saving the F0 prefix, what alternative uses could that bit have
> been used for?

If the bit is set, then you get another prefix byte. That would allow
three-address instructions to be encoded more compactly than with the
current scheme. But
probably not so nice for the decoders.

Re: Intel goes to 32-bit general purpose registers

<ae3ea555-85a8-4775-88ea-33046ec2ecdbn@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33621&group=comp.arch#33621

 by: MitchAlsup - Sat, 12 Aug 2023 23:58 UTC

On Saturday, August 12, 2023 at 5:26:36 PM UTC-5, Elijah Stone wrote:
> On Sun, 30 Jul 2023, Anton Ertl wrote:
>
> > Apart from saving the F0 prefix, what alternative uses could that bit have
> > been used for?
>
> If the bit is set, then you get another prefix byte. That would allow
> three-address instructions to be encoded more compactly than with the
> current scheme. But
> probably not so nice for the decoders.
<
With 15,000 instructions, decoders are already stupidly complicated.
What is another prefix going to add to it ?? A:: not much.

Re: Intel goes to 32-bit general purpose registers

<ubasm5$2984t$2@newsreader4.netcologne.de>

https://news.novabbs.org/devel/article-flat.php?id=33627&group=comp.arch#33627

 by: Thomas Koenig - Sun, 13 Aug 2023 15:25 UTC

MitchAlsup <MitchAlsup@aol.com> schrieb:
> On Saturday, August 12, 2023 at 5:26:36 PM UTC-5, Elijah Stone wrote:
>> On Sun, 30 Jul 2023, Anton Ertl wrote:
>>
>> > Apart from saving the F0 prefix, what alternative uses could that bit have
>> > been used for?
>>
>> If the bit is set, then you get another prefix byte. That would allow
>> three-address instructions to be encoded more compactly than with the
>> current scheme. But
>> probably not so nice for the decoders.
><
> With 15,000 instructions, decoders are already stupidly complicated.
> What is another prefix going to add to it ?? A:: not much.

I recently read that Intel decoders have formed a union and gone on
strike, demanding better working conditions.

Re: Intel goes to 32 GPRs

<ublni6$3s8o9$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33667&group=comp.arch#33667

 by: Kent Dickey - Thu, 17 Aug 2023 18:05 UTC

In article <u9r0jd$1ftnb$1@dont-email.me>,
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
>Thomas Koenig wrote:
>> One problem I see is that all the new registers are caller-saved,
>> for compatibility with existing ABIs. This is needed due to stack
>> unwinding and setjmp/longjmp, but restricts their benefit due to
>> having to spill them across function calls. It might be possible
>> to set __attribute__((nothrow)) on functions where this cannot
>> happen, and change some caller-saved to callee-saved registers
>> in that case, but that could be an interesting discussion.
>
>I'm not worried at all about this point: The only places where I really
>want lots of registers are in big/complicated leaf functions!
>
>If a function both needs lots of registers _and_ has to call any
>non-inlined functions, then it really isn't that time critical.

Defensively written code which calls error handlers offline means most
leaf functions are not really leaves. Also, small leaf functions are pulled
into the calling function through inlining, and the resulting function
is no longer a leaf.

Code other than small toys often has critical loops that are not in
leaf functions. If you spend your time looking at programs of less than
500 lines, then you will miss what larger programs are doing. Designing
for benchmarks is a bad idea.

A large program I work on is helped by the GCC ability to "borrow"
scratch registers if it can peek into the functions being called and see
that the registers won't be used, and so use scratch registers knowing
the called routine won't overwrite them. This doesn't always work since
the called functions are often in other files, defeating the GCC
optimization. I don't do LTO since, although performance is important,
build time and debuggability are also important, and you have to draw the
line somewhere. I've dumped out a disassembly when this program is
compiled for Aarch64, and I see no use of x14-x18 (the highest numbered
scratch registers) EXCEPT when GCC has realized it can borrow them from
a leaf routine which isn't using them. Meanwhile, loops are forced to
spill values to the stack since they don't fit in registers. This is
not just a performance issue, it's a code bloat issue.

The mere existence of the GCC feature to do this scratch register
borrowing shows it is useful.

And again, if you need 16 more registers in your leaf function, then you
definitely have time to spill/fill them, since whatever you are doing is
taking a lot more time than the spill/fill. And the argument is simple:
it's trivial for large leaf routines to spill/fill registers at almost
no performance cost, but there really is no great solution for a
performance critical non-leaf routine to get more registers. Therefore,
ABIs should provide more preserved registers and fewer scratch
registers.

It is a mistake to make these new registers caller save. There are ways
to make them callee save. However, I will admit that, due to the way
x86_64 CPUs optimize loads, the cost of spilling/filling registers onto
the stack in a non-leaf routine is not as bad as on a RISC. That is how
x86_64 gets away with not having enough registers now.

Kent

Re: Intel goes to 32 GPRs

<07145f85-deff-44ca-bf24-0b754ba07594n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33670&group=comp.arch#33670

 by: MitchAlsup - Thu, 17 Aug 2023 21:20 UTC

On Thursday, August 17, 2023 at 1:06:02 PM UTC-5, Kent Dickey wrote:
> In article <u9r0jd$1ftnb$1...@dont-email.me>,
> Terje Mathisen <terje.m...@tmsw.no> wrote:
> >Thomas Koenig wrote:
> >> One problem I see is that all the new registers are caller-saved,
> >> for compatibility with existing ABIs. This is needed due to stack
> >> unwinding and setjmp/longjmp, but restricts their benefit due to
> >> having to spill them across function calls. It might be possible
> >> to set __attribute__((nothrow)) on functions where this cannot
> >> happen, and change some caller-saved to callee-saved registers
> >> in that case, but that could be an interesting discussion.
> >
> >I'm not worried at all about this point: The only places where I really
> >want lots of registers are in big/complicated leaf functions!
> >
> >If a function both needs lots of registers _and_ has to call any
> >non-inlined functions, then it really isn't that time critical.
>
> Defensively written code which calls error handlers offline means most
> leaf functions are not really leaves.
<
This seems to be poor programming style--because the error check
can be returned to the caller, converting::

leaf_subroutine()
{
    ...
    if( some condition ) exception( 37 );
}

into::

if( except = leaf_now_function() ) exception( except );

leaf_now_function()
{
    ...
    if( some condition ) return 37;
}

> Also, small leaf functions are pulled
> into the calling function through inlining, and the resulting function
> is no longer a leaf.
<
Don't you mean that when a non-leaf subroutine inlines all of its
called subroutines it becomes a leaf ??
<

Re: Intel goes to 32 GPRs

<0c44dd40-72fe-4f91-a4da-3e10ac650c7dn@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33671&group=comp.arch#33671

 by: Quadibloc - Fri, 18 Aug 2023 00:18 UTC

On Thursday, August 17, 2023 at 3:20:38 PM UTC-6, MitchAlsup wrote:
> On Thursday, August 17, 2023 at 1:06:02 PM UTC-5, Kent Dickey wrote:

> > Also, small leaf functions are pulled
> > into the calling function through inlining, and the resulting function
> > is no longer a leaf.
> <
> Don't you mean that when a non-leaf subroutine inlines all of its
> called subroutines it becomes a leaf ??

That is _also_ true, but that doesn't make what he wrote wrong.

He was saying that if a non-leaf subroutine inlines _one_ of its
called subroutines, which _was_ a leaf, then, unless that was its
last called subroutine, the caller remains a non-leaf routine, and
so the code moved inline is no longer part of a leaf.

That, too, is entirely true.

John Savard

Re: Intel goes to 32 GPRs

<ubn2ab$5ed1$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33673&group=comp.arch#33673

 by: Terje Mathisen - Fri, 18 Aug 2023 06:15 UTC

Quadibloc wrote:
> On Thursday, August 17, 2023 at 3:20:38 PM UTC-6, MitchAlsup wrote:
>> On Thursday, August 17, 2023 at 1:06:02 PM UTC-5, Kent Dickey wrote:
>
>>> Also, small leaf functions are pulled
>>> into the calling function through inlining, and the resulting function
>>> is no longer a leaf.
>> <
>> Don't you mean that when a non-leaf subroutine inlines all of its
>> called subroutines it becomes a leaf ??
>
> That is _also_ true, but that doesn't make what he wrote wrong.
>
> He was saying that if a non-leaf subroutine inlines _one_ of its
> called subroutines, which _was_ a leaf, then, unless that was its
> last called subroutine, the caller remains a non-leaf routine, and
> so the code moved inline is no longer part of a leaf.
>
> That, too, is entirely true.

To me as an asm programmer, this indicates the need to partition (if
possible) non-leaf functions so that major parts can use all registers,
with the required push/pop blocks around them. It is only when you have
to call something inside the innermost loops that you really get into
trouble.

Probably the most common pattern for calling functions inside inner
loops is, as Mitch mentioned, error handling/exceptions, and
these are by definition not speed critical, so they can easily afford
extra caller save/restore blocks around them, right?
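
A rough C sketch of this partitioning idea (GCC/Clang attribute syntax
assumed, names illustrative): the rare error path is pushed into a
separate cold function, so the hot loop itself stays call-free and the
save/restore cost is paid only when an error actually occurs.

__attribute__((noinline, cold))
static void report_error(long i)
{
    /* offline error handler: by definition not speed critical */
    (void)i;
}

long sum_checked(const long *v, long n)
{
    long s = 0;
    for (long i = 0; i < n; i++) {
        if (__builtin_expect(v[i] < 0, 0)) {  /* rare error condition */
            report_error(i);                  /* only call: cold path */
            continue;
        }
        s += v[i];           /* hot path keeps its values in registers */
    }
    return s;
}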

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

