Rocksolid Light

devel / comp.arch / Re: Intel goes to 32 GPRs

Subject (Author)
* Intel goes to 32-bit general purpose registers (Thomas Koenig)
+* Re: Intel goes to 32-bit general purpose registers (Scott Lurndal)
|`* Re: Intel goes to 32-bit general purpose registers (MitchAlsup)
| `* Re: Intel goes to 32-bit general purpose registers (Quadibloc)
|  `* Re: Intel goes to 32-bit general purpose registers (Anton Ertl)
|   `* Re: Intel goes to 32-bit general purpose registers (Peter Lund)
|    `* Re: Intel goes to 32-bit general purpose registers (Anton Ertl)
|     `* Re: Intel goes to 32-bit general purpose registers (Elijah Stone)
|      `* Re: Intel goes to 32-bit general purpose registers (MitchAlsup)
|       `- Re: Intel goes to 32-bit general purpose registers (Thomas Koenig)
+* Re: Intel goes to 32-bit general purpose registers (Quadibloc)
|`* Re: Intel goes to 32-bit general purpose registers (Quadibloc)
| +* Re: Intel goes to 32-bit general purpose registers (John Dallman)
| |+* Re: Intel goes to 32-bit general purpose registers (Scott Lurndal)
| ||`- Re: Intel goes to 32-bit general purpose registers (John Dallman)
| |+* Re: Intel goes to 32-bit general purpose registers (Anton Ertl)
| ||+* Re: Intel goes to 32-bit general purpose registers (John Dallman)
| |||+- Re: Intel goes to 32-bit general purpose registers (BGB)
| |||`- Re: Intel goes to 32-bit general purpose registers (Anton Ertl)
| ||+- Re: Intel goes to 32-bit general purpose registers (JimBrakefield)
| ||`* Re: Intel goes to 32-bit general purpose registers (Michael S)
| || `- Re: Intel goes to 32-bit general purpose registers (Anton Ertl)
| |`* Re: Intel goes to 32-bit general purpose registers (John Dallman)
| | +- Re: Intel goes to 32-bit general purpose registers (Stephen Fuld)
| | `- Re: Intel goes to 32-bit general purpose registers (MitchAlsup)
| `* Re: Intel goes to 32-bit general purpose registers (Anton Ertl)
|  `- Re: Intel goes to 32-bit general purpose registers (John Dallman)
`* Intel goes to 32 GPRs (was: Intel goes to 32-bit ...) (Anton Ertl)
 `* Re: Intel goes to 32 GPRs (was: Intel goes to 32-bit ...) (Quadibloc)
  +- Re: Intel goes to 32 GPRs (was: Intel goes to 32-bit ...) (Anton Ertl)
  `* Re: Intel goes to 32 GPRs (Terje Mathisen)
   `* Re: Intel goes to 32 GPRs (Thomas Koenig)
    `* Re: Intel goes to 32 GPRs (Terje Mathisen)
     +* Re: Intel goes to 32 GPRs (Thomas Koenig)
     |`* Re: Intel goes to 32 GPRs (Terje Mathisen)
     | +- Re: Intel goes to 32 GPRs (MitchAlsup)
     | `* Re: Intel goes to 32 GPRs (Thomas Koenig)
     |  `* Re: Intel goes to 32 GPRs (Terje Mathisen)
     |   +- Re: Intel goes to 32 GPRs (MitchAlsup)
     |   `* Re: Intel goes to 32 GPRs (Thomas Koenig)
     |    `- Re: Intel goes to 32 GPRs (Anton Ertl)
     +- Re: Intel goes to 32 GPRs (MitchAlsup)
     +* Re: Intel goes to 32 GPRs (Anton Ertl)
     |`* Re: Intel goes to 32 GPRs (Terje Mathisen)
     | +* Re: Intel goes to 32 GPRs (Scott Lurndal)
     | |`* Re: Intel goes to 32 GPRs (MitchAlsup)
     | | +- Re: Intel goes to 32 GPRs (MitchAlsup)
     | | +* Re: Intel goes to 32 GPRs (Scott Lurndal)
     | | |`* Re: Intel goes to 32 GPRs (Terje Mathisen)
     | | | +* Re: Intel goes to 32 GPRs (BGB)
     | | | |+* Re: Intel goes to 32 GPRs (MitchAlsup)
     | | | ||`- Re: Intel goes to 32 GPRs (BGB)
     | | | |`* Re: Intel goes to 32 GPRs (Quadibloc)
     | | | | `- Re: Intel goes to 32 GPRs (BGB)
     | | | `* Re: Intel goes to 32 GPRs (Anton Ertl)
     | | |  `- Re: Intel goes to 32 GPRs (Terje Mathisen)
     | | `* Re: Intel goes to 32 GPRs (BGB)
     | |  `* Re: Intel goes to 32 GPRs (MitchAlsup)
     | |   `- Re: Intel goes to 32 GPRs (BGB)
     | +* Re: Intel goes to 32 GPRs (Anton Ertl)
     | |`* Re: Intel goes to 32 GPRs (Thomas Koenig)
     | | +* Re: Intel goes to 32 GPRs (MitchAlsup)
     | | |`- Re: Intel goes to 32 GPRs (Anton Ertl)
     | | +* Re: Intel goes to 32 GPRs (Terje Mathisen)
     | | |+* Re: Intel goes to 32 GPRs (Anton Ertl)
     | | ||+- Re: Intel goes to 32 GPRs (MitchAlsup)
     | | ||`- Re: Intel goes to 32 GPRs (JimBrakefield)
     | | |`* Re: Intel goes to 32 GPRs (MitchAlsup)
     | | | `* Re: Intel goes to 32 GPRs (BGB)
     | | |  `* Re: Intel goes to 32 GPRs (MitchAlsup)
     | | |   +- Re: Intel goes to 32 GPRs (BGB)
     | | |   `* Re: Intel goes to 32 GPRs (Terje Mathisen)
     | | |    `- Re: Intel goes to 32 GPRs (BGB)
     | | `* Re: Intel goes to 32 GPRs (Stephen Fuld)
     | |  `* Re: Intel goes to 32 GPRs (Anton Ertl)
     | |   +- Re: Intel goes to 32 GPRs (Stephen Fuld)
     | |   `- Re: Intel goes to 32 GPRs (Thomas Koenig)
     | `* Re: Intel goes to 32 GPRs (Thomas Koenig)
     |  `* Re: Intel goes to 32 GPRs (Terje Mathisen)
     |   `* Re: Intel goes to 32 GPRs (Thomas Koenig)
     |    `* Re: Intel goes to 32 GPRs (MitchAlsup)
     |     `* Re: Intel goes to 32 GPRs (Niklas Holsti)
     |      `* Re: Intel goes to 32 GPRs (MitchAlsup)
     |       `* Re: Intel goes to 32 GPRs (Niklas Holsti)
     |        `* Re: Intel goes to 32 GPRs (Stephen Fuld)
     |         +- Re: Intel goes to 32 GPRs (Niklas Holsti)
     |         `- Re: Intel goes to 32 GPRs (Ivan Godard)
     `* Re: Intel goes to 32 GPRs (Kent Dickey)
      +* Re: Intel goes to 32 GPRs (MitchAlsup)
      |+* Re: Intel goes to 32 GPRs (Quadibloc)
      ||`- Re: Intel goes to 32 GPRs (Terje Mathisen)
      |`* Re: Intel goes to 32 GPRs (Kent Dickey)
      | `* Re: Intel goes to 32 GPRs (Thomas Koenig)
      |  +* Re: Intel goes to 32 GPRs (Anton Ertl)
      |  |+- Re: Intel goes to 32 GPRs (Anton Ertl)
      |  |`* Re: Intel goes to 32 GPRs (EricP)
      |  | +* Re: Intel goes to 32 GPRs (MitchAlsup)
      |  | |`* Re: Intel goes to 32 GPRs (Thomas Koenig)
      |  | | `* Re: Intel goes to 32 GPRs (BGB)
      |  | |  `* Re: Intel goes to 32 GPRs (MitchAlsup)
      |  | |   +* Re: Intel goes to 32 GPRs (BGB)
      |  | |   `* Re: Intel goes to 32 GPRs (Terje Mathisen)
      |  | `* Re: Intel goes to 32 GPRs (Stephen Fuld)
      |  `* Re: Intel goes to 32 GPRs (Kent Dickey)
      +* Callee-saved registers (was: Intel goes to 32 GPRs) (Anton Ertl)
      `- Re: Intel goes to 32 GPRs (Mike Stump)

Re: Intel goes to 32 GPRs

<ua0b7f$2881a$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33505&group=comp.arch#33505

 by: Terje Mathisen - Fri, 28 Jul 2023 12:10 UTC

Thomas Koenig wrote:
> Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
>> Thomas Koenig wrote:
>>> Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
>>>> Thomas Koenig wrote:
>>>
>>>>> One problem I see is that all the new registers are caller-saved,
>>>>> for compatibility with existing ABIs. This is needed due to stack
>>>>> unwinding and setjmp/longjmp, but restricts their benefit due to
>>>>> having to spill them across function calls. It might be possible
>>>>> to set __attribute__((nothrow)) on functions where this cannot
>>>>> happen, and change some caller-saved to callee-saved registers
>>>>> in that case, but that could be an interesting discussion.
>>>>>
>>>>
>>>> I'm not worried at all about this point: The only places where I really
>>>> want lots of registers are in big/complicated leaf functions!
>>>>
>>>> If a function both needs lots of registers _and_ has to call any
>>>> non-inlined functions, then it really isn't that time critical.
>>>
>>> Fortran can use lots of registers for its array descriptors,
>>> and also can use lots of library calls for mathematical functions
>>> (because most CPUs don't have Mitch's single instructions for them).
>>> Fortran library functions are typically __attribute__((nothrow)),
>>> so in that field being able to use more registers across calls
>>> would be a good thing, generally.
>>>
>> If said Fortran code is really performance critical, like in an FFT,
>> then all the sin/cos function calls will be done up front and cached.
>
> If you're doing lots of chemical reaction calculation, it is
> not possible to pre-compute the Arrhenius equation coefficients
> (and their Jacobians).
>
> There's more to life than FFT :-)
>
I believe you.

I still think the crux of the matter is how complicated those special
functions are. I.e., in your examples, would it be possible to either
inline the Arrhenius calculation or wrap it in a register save/restore pair?
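
For concreteness, a minimal C sketch of the kind of special function in
question (the modified-Arrhenius form is standard; the function name is
illustrative, not from the thread). Since T changes per grid cell and per
time step, k(T) cannot be tabulated up front the way FFT twiddles can:

    #include <math.h>

    /* Modified Arrhenius rate: k(T) = A * T^n * exp(-Ea / (R*T)).
       A, n, Ea are per-reaction constants; T varies per grid cell
       and time step, which is what defeats precomputation. */
    static inline double arrhenius_rate(double A, double n, double Ea,
                                        double T)
    {
        const double R = 8.314462618;   /* gas constant, J/(mol*K) */
        return A * pow(T, n) * exp(-Ea / (R * T));
    }

A function this small is also a natural inlining candidate, which would
sidestep the caller-saved question entirely.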

BTW, I have now read up on these new instructions and registers, and as
I suspected they are already supported by XSAVE, with no additional save
space needed (because they are reusing an area that is no longer needed).
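
As a hedged sketch of how code might probe for the new registers before
relying on them; the leaf/subleaf and bit position below are my reading
of the APX enumeration and should be treated as assumptions to verify
against Intel's current documentation:

    #include <cpuid.h>

    /* Assumed enumeration: CPUID.(EAX=7,ECX=1):EDX bit 21 = APX_F,
       i.e. the 16 extended GPRs r16..r31. Verify against the SDM. */
    static int cpu_has_apx(void)
    {
        unsigned a, b, c, d;
        if (!__get_cpuid_count(7, 1, &a, &b, &c, &d))
            return 0;
        return (d >> 21) & 1;
    }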

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Intel goes to 32 GPRs

<37887b62-b6f6-4abf-b242-d77785135b72n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33506&group=comp.arch#33506

 by: MitchAlsup - Fri, 28 Jul 2023 17:33 UTC

On Friday, July 28, 2023 at 7:10:27 AM UTC-5, Terje Mathisen wrote:
> Thomas Koenig wrote:
> > Terje Mathisen <terje.m...@tmsw.no> schrieb:
> >> Thomas Koenig wrote:
> >>> Terje Mathisen <terje.m...@tmsw.no> schrieb:
> >>>> Thomas Koenig wrote:
> >>>
> >>>>> One problem I see is that all the new registers are caller-saved,
> >>>>> for compatibility with existing ABIs. This is needed due to stack
> >>>>> unwinding and setjmp/longjmp, but restricts their benefit due to
> >>>>> having to spill them across function calls. It might be possible
> >>>>> to set __attribute__((nothrow)) on functions where this cannot
> >>>>> happen, and change some caller-saved to callee-saved registers
> >>>>> in that case, but that could be an interesting discussion.
> >>>>>
> >>>>
> >>>> I'm not worried at all about this point: The only places where I really
> >>>> want lots of registers are in big/complicated leaf functions!
> >>>>
> >>>> If a function both needs lots of registers _and_ has to call any
> >>>> non-inlined functions, then it really isn't that time critical.
> >>>
> >>> Fortran can use lots of registers for its array descriptors,
> >>> and also can use lots of library calls for mathematical functions
> >>> (because most CPUs don't have Mitch's single instructions for them).
> >>> Fortran library functions are typically __attribute__((nothrow)),
> >>> so in that field being able to use more registers across calls
> >>> would be a good thing, generally.
> >>>
> >> If said Fortran code is really performance critical, like in an FFT,
> >> then all the sin/cos function calls will be done up front and cached.
> >
> > If you're doing lots of chemical reaction calculation, it is
> > not possible to pre-compute the Arrhenius equation coefficients
> > (and their Jacobians).
> >
> > There's more to life than FFT :-)
> >
> I believe you.
>
> I still think the crux of the matter is how complicated those special
> functions are. I.e., in your examples, would it be possible to either
> inline the Arrhenius calculation or wrap it in a register save/restore pair?
>
> BTW, I have now read up on these new instructions and registers, and as
> I suspected they are already supported by XSAVE, with no additional save
> space needed (because they are reusing an area that is no longer needed).
<
Forgiveness for some of the sins of the past ?!?
<
> Terje
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

Re: Intel goes to 32 GPRs

<ua0vrl$2a3nb$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33507&group=comp.arch#33507

 by: BGB - Fri, 28 Jul 2023 18:01 UTC

On 7/27/2023 5:01 AM, Terje Mathisen wrote:
> Scott Lurndal wrote:
>> MitchAlsup <MitchAlsup@aol.com> writes:
>>> On Wednesday, July 26, 2023 at 4:06:48 PM UTC-5, Scott Lurndal wrote:
>>>> Terje Mathisen <terje.m...@tmsw.no> writes:
>>>>> Anton Ertl wrote:
>>>>>> Terje Mathisen <terje.m...@tmsw.no> writes:
>>>>>>> If a function both needs lots of registers _and_ has to call any
>>>>>>> non-inlined functions, then it really isn't that time critical.
>>>>>>
>>>>>> Every interpreter calls non-inlined functions, and they often need a
>>>>>> lot of registers, or can make good use of them.
>>>>>>
>>>>>> Now you may consider that to be not time-critical, but if you are
>>>>>> Intel and want to sell an interpreter user an Intel system, and it is
>>>>>> slow compared to the ARM64 or RISC-V systems, you will still lose the
>>>>>> sale.
>>>>>
>>>>> I'm willing to be shown otherwise, but until then I'd consider any
>>>>> code that runs under an actual interpreter to be non-performance-critical.
>>> <
>>>> I would likely argue that an interpreter is inherently
>>>> performance-critical in order to make the interpreted code useful.
>>>>
>>>> Take a machine simulator as an example of performance-critical
>>>> interpreter (interpreting machine instructions rather than bytecode
>>>> or some intermediate tree representation). Booting linux on the
>>>> simulator, for example, is definitely performance critical from the
>>>> standpoint of the user waiting for a login prompt[*].
>>> <
>>> As someone who has actually done this (1999)
>>
>> We had a simulator for the Burroughs mainframe in the 70's
>> that we used to test instruction set changes (it was in Burroughs
>> Algol and ran on a B7900).
>>
>> I currently work on an SoC simulator which simulates (functionally)
>> the entire SoC:  ARM64 cores (dozens), peripheral controllers (e.g. SATA,
>> SPI, EMMC, I2C/I3C, networking), microcontrollers, accelerator blocks,
>> PCI, etc.
>>
>> Performance is key.
>>
>>
>> , what we did was to
>>> take portions (~= basic blocks) of code and compile them into traces
>>> complete with CPU/cache/TLB statistics updates, and then run 95%±
>>> of the interpreter as native code.
>>> <
>>> Is this still simulation ? obviously
>>> but it is more like JIT than interpretation.
>>
>> Most modern simulators (e.g. qemu) use jit-like mechanisms,
>> as did AMDs SimNow! back in the 2000s.
>
> Exactly my point: Those JIT-style traces can obviously use all the regs,
> as can any other leaf code.
>
> It is only when you want to use the new regs for the core interpreter
> code that you run into trouble.
>

Yeah. I guess, once supported, it is a question of how long until
Windows/etc supports them, AMD supports them, support is widespread in
CPUs that people have, ...

I can note that this takes a while, and I wouldn't be entirely sure
x86 doesn't start to lose market share before this happens (say, if more
people start moving to ARM-based systems or similar, *).

*: Though RISC-V is interesting, I don't really think it is "equipped"
at this stage to compete against either x86 or ARM in the PC space, but
does seem like a reasonable choice for the embedded space.

As for JITs, yeah, they can use whatever instructions or registers are
available to them...

Also ironically, my past VMs used JITs, but for the BJX2 emulator I
didn't really end up maintaining the JIT, as the normal interpreter was
still fast enough to keep up with emulating the CPU core in real-time.

Similarly, the code for emulating things like cache misses and
branch-predictor costs and similar ends up costing more than "actually
emulating the code" does, but this is needed for "mostly cycle-accurate"
emulation.

And, for my uses, it does matter if the performance characteristics of
the emulator are roughly consistent with those of the Verilog
implementation (so, eg, when decoding traces, it also keeps track of
things like pipeline interlocks and similar to model how many
clock-cycles each instruction and trace would take, with any penalties
then being added for cache and branch-predictor misses).
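
A rough C sketch of what per-instruction cycle accounting in that style
can look like (names and penalty values are hypothetical, not taken from
the BJX2 emulator):

    #include <stdint.h>

    enum { INTERLOCK_STALL = 1, L1_MISS_PENALTY = 10, BP_MISS_PENALTY = 6 };

    typedef struct Op { int dst, src1, src2, base_cycles, is_branch; } Op;

    /* Base issue cost, plus an interlock stall if this op reads the
       previous op's result, plus modeled cache/predictor penalties. */
    static uint64_t op_cost(const Op *prev, const Op *op,
                            int l1_hit, int bp_hit)
    {
        uint64_t c = op->base_cycles;
        if (prev && (op->src1 == prev->dst || op->src2 == prev->dst))
            c += INTERLOCK_STALL;
        if (!l1_hit)
            c += L1_MISS_PENALTY;
        if (op->is_branch && !bp_hit)
            c += BP_MISS_PENALTY;
        return c;
    }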

There is a "--wallspeed" option that basically means "disable modeling
and run the interpreter as fast as it will go, using external wall-clock
time as a reference rather than cycle-count for the internal time".

Currently, this mode is fast enough that Quake and GLQuake run at
double-digit speeds. The Doom engine games also run (initially) pretty
fast, but are not giving a smooth experience as something is glitchy
with the timing.

With just the interpreter running at full speed, this is ~ 135 MHz / 163
MIPs.

Probing some more, for whatever reason disabling cache modeling
("--nomemcost") seems to result in decidedly "non-smooth" behavior in
the Doom-engine games (but Quake and ROTT and similar are unaffected).
This seems curious (may need to look into it).

Could put work on trying to get the JIT back into working condition if
there was more of a need for it, but as-is, it isn't really needed.

I can note that most of my past JITs also used a mostly similar register
allocation strategy to what is currently used by BGBCC (mostly; though a
lot of my JITs used a more round-robin register-allocation strategy, *).

....

*: Say, one has a rover for the last register assigned, and each time it
needs a register, it advances one position and (if needed) evicts
whatever was held in that register. Whereas BGBCC uses heuristics to
decide which register to evict, rather than simply advancing a rover,
there being pros/cons either way.

In both cases, registers are typically loaded and evicted "as needed".
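
To make the rover concrete, a minimal C sketch of the round-robin variant
(the layout is hypothetical, not BGBCC's actual code); marking registers
live first keeps the rover from evicting something the current
instruction still needs:

    #define NROVER 4    /* e.g. four host regs used as rotating slots */

    typedef struct {
        int vreg[NROVER];    /* VM register held in each slot, -1 if none */
        int dirty[NROVER];   /* slot needs write-back before eviction */
        int live[NROVER];    /* slot referenced by current instruction */
        int rover;           /* last slot assigned */
    } RegAlloc;

    /* Assumes fewer than NROVER slots are live in any one instruction. */
    static int alloc_slot(RegAlloc *ra, void (*spill)(int slot))
    {
        do {
            ra->rover = (ra->rover + 1) % NROVER;  /* advance one position */
        } while (ra->live[ra->rover]);             /* skip live slots */
        if (ra->vreg[ra->rover] >= 0 && ra->dirty[ra->rover])
            spill(ra->rover);                      /* evict previous holder */
        return ra->rover;
    }

Heuristic eviction (as in BGBCC) would replace the rover advance with a
victim-scoring loop over the slots.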

> All this said, I strongly suspect that the variable sys save/restore
> opcodes will be capable of handling everything, as long as the OS
> initializes it properly. I.e. promise that the save areas will be large
> enough.
>
> It would still probably be better performance wise to make those
> save/restore functions partially lazy, so that code which never touches
> the extended regs doesn't need the save/restore. OTOH, we have seen this
> tried multiple times over the years, and we typically end up with the
> much cleaner "grab everything" approach because it is easier to make
> bug-free.
>

Agreed.

For things like task switching, "save and restore everything" makes the
most sense.

Otherwise, one might end up with a situation like on Win9x, where
apparently programs using SSE registers could not coexist because the OS
did not save or restore these registers.

Similar applies in my ISA if trying to run a program built for 64 GPRs
on a kernel built to assume 32 GPRs. Though, in this case, this is more
likely to result in crashes rather than (merely) corrupted 3D graphics
or similar.

Looking around, it seems that there are no existing ISAs called
"Sigma", and "SIGMA-ISA" or "SIGMA1-ISA" doesn't entirely suck.

A lot of the other Greek letters seemingly have already been used for
ISAs, mostly 32-bit RISC variants. Could always use Hebrew letters as
names, but this seems like a bit more of a cultural issue (apart from a
few obscure "math things", about the only things named with Hebrew
letters are yeshivas and similar...).

Mostly considering alternatives to the existing BJX2 name (which turns
out to have been unfortunate, ...).

Still TBD if this whole project is moot, but alas...

> Terje
>
>

Re: Intel goes to 32 GPRs

<ua118p$2a7vm$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33508&group=comp.arch#33508

 by: BGB - Fri, 28 Jul 2023 18:26 UTC

On 7/26/2023 4:30 PM, MitchAlsup wrote:
> On Wednesday, July 26, 2023 at 4:06:48 PM UTC-5, Scott Lurndal wrote:
>> Terje Mathisen <terje.m...@tmsw.no> writes:
>>> Anton Ertl wrote:
>>>> Terje Mathisen <terje.m...@tmsw.no> writes:
> >>>>> If a function both needs lots of registers _and_ has to call any
>>>>> non-inlined functions, then it really isn't that time critical.
>>>>
>>>> Every interpreter calls non-inlined functions, and they often need a
>>>> lot of registers, or can make good use of them.
>>>>
>>>> Now you may consider that to be not time-critical, but if you are
>>>> Intel and want to sell an interpreter user an Intel system, and it is
>>>> slow compared to the ARM64 or RISC-V systems, you will still lose the
>>>> sale.
>>>
> >>> I'm willing to be shown otherwise, but until then I'd consider any code
>>> that runs under an actual interpreter to be non-performance-critical.
> <
>> I would likely argue that an interpreter is inherently performance-critical
>> in order to make the interpreted code useful.
>>
>> Take a machine simulator as an example of performance-critical
>> interpreter (interpreting machine instructions rather than bytecode
>> or some intermediate tree representation). Booting linux on the
>> simulator, for example, is definitely performance critical from the
>> standpoint of the user waiting for a login prompt[*].
> <
> As someone who has actually done this (1999), what we did was to
> take portions (~= basic blocks) of code and compile them into traces
> complete with CPU/cache/TLB statistics updates, and then run 95%±
> of the interpreter as native code.
> <
> Is this still simulation ? obviously
> but it is more like JIT than interpretation.
> and eliminated most of the overhead of interpretation.
> <
> > Obviously this is easier if target and computer are of the same architecture
> but one could do Mc 88100 ISA on a SPARC V8 without much hassle {or
> x86-64 on a SPARC V9,...}.

I had done similar with past VMs.

I haven't really ended up doing it much, or maintaining the JIT in a
functional state, for my BJX2 emulator mostly as my PC can (mostly)
manage to keep up with emulating this stuff faster than the core would
run on an actual FPGA (and it was more useful mostly to invest things
towards making it cycle accurate than fast).

So, for the most part, it ends up running in an interpreter, which is
additionally weighed down by trying to model the costs of cache misses
and the branch predictor and similar.

....

Both the interpreter and JIT would decode "traces" of instructions,
usually terminated by a branch or on reaching a set limit (such as 32
or 64 instructions).

In the interpreter, each instruction is translated into a struct
(holding the instruction fields) along with a function pointer (called
for the logic of the instruction), with an unrolled sequence of function
calls to these function pointers.

For the JIT, the instructions would be translated into a machine-code
sequence, which would then replace the normal trace-dispatch function
pointer. Operations which were handled directly would be translated to
native instructions, with others "spilling" any conflicting registers
and then calling the function pointer (as in the interpreter).

As-is, a fair chunk of the total native CPU cycles goes into
updating the models for the L1 and L2 caches and similar (on every
memory access). But there isn't really a more efficient way to model
things like L1 and L2 cache misses.
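
A condensed C sketch of that decoded-trace shape (field and type names
are illustrative, not the actual BJX2 emulator structures):

    #include <stdint.h>

    struct VmCtx;                       /* emulated CPU state */
    typedef struct DecOp DecOp;
    struct DecOp {
        uint8_t rd, rs, rt;             /* decoded instruction fields */
        int32_t imm;
        void (*run)(struct VmCtx *, DecOp *);   /* instruction logic */
    };

    typedef struct {
        int   n;                        /* ends at a branch or size cap */
        DecOp ops[64];
    } Trace;

    /* Dispatch one trace: a walk over the pre-decoded ops, each a
       single indirect call; the JIT path replaces this whole dispatch
       function with emitted machine code. */
    static void run_trace(struct VmCtx *ctx, Trace *t)
    {
        for (int i = 0; i < t->n; i++)
            t->ops[i].run(ctx, &t->ops[i]);
    }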

>>
>> [*] Yet another thing systemd makes worse.

Re: Intel goes to 32 GPRs

<8fa79ac0-1c8e-417b-9f6d-0e395df125bdn@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33509&group=comp.arch#33509

 by: MitchAlsup - Fri, 28 Jul 2023 19:57 UTC

On Friday, July 28, 2023 at 1:02:33 PM UTC-5, BGB wrote:
> On 7/27/2023 5:01 AM, Terje Mathisen wrote:
> > Scott Lurndal wrote:
> >> MitchAlsup <Mitch...@aol.com> writes:
> >>> On Wednesday, July 26, 2023 at 4:06:48 PM UTC-5, Scott Lurndal wrote:
> >>>> Terje Mathisen <terje.m...@tmsw.no> writes:
> >>>>> Anton Ertl wrote:
> >>>>>> Terje Mathisen <terje.m...@tmsw.no> writes:
> >>>>>>> If a function both needs lots of registers _and_ has to call any
> >>>>>>> non-inlined functions, then it really isn't that time critical.
> >>>>>>
> >>>>>> Every interpreter calls non-inlined functions, and they often need a
> >>>>>> lot of registers, or can make good use of them.
> >>>>>>
> >>>>>> Now you may consider that to be not time-critical, but if you are
> >>>>>> Intel and want to sell an interpreter user an Intel system, and it is
> >>>>>> slow compared to the ARM64 or RISC-V systems, you will still lose
> >>>>>> the sale.
> >>>>>
> >>>>> I'm willing to be shown otherwise, but until then I'd consider any
> >>>>> code that runs under an actual interpreter to be non-performance-critical.
> >>> <
> >>>> I would likely argue that an interpreter is inherently
> >>>> performance-critical in order to make the interpreted code useful.
> >>>>
> >>>> Take a machine simulator as an example of performance-critical
> >>>> interpreter (interpreting machine instructions rather than bytecode
> >>>> or some intermediate tree representation). Booting linux on the
> >>>> simulator, for example, is definitely performance critical from the
> >>>> standpoint of the user waiting for a login prompt[*].
> >>> <
> >>> As someone who has actually done this (1999)
> >>
> >> We had a simulator for the Burroughs mainframe in the 70's
> >> that we used to test instruction set changes (it was in Burroughs
> >> Algol and ran on a B7900).
> >>
> >> I currently work on an SoC simulator which simulates (functionally)
> >> the entire SoC: ARM64 cores (dozens), peripheral controllers (e.g. SATA,
> >> SPI, EMMC, I2C/I3C, networking), microcontrollers, accelerator blocks,
> >> PCI, etc.
> >>
> >> Performance is key.
> >>
> >>
> >> , what we did was to
> >>> take portions (~= basic blocks) of code and compile them into traces
> >>> complete with CPU/cache/TLB statistics updates, and then run 95%±
> >>> of the interpreter as native code.
> >>> <
> >>> Is this still simulation ? obviously
> >>> but it is more like JIT than interpretation.
> >>
> >> Most modern simulators (e.g. qemu) use jit-like mechanisms,
> >> as did AMDs SimNow! back in the 2000s.
> >
> > Exactly my point: Those JIT-style traces can obviously use all the regs,
> > as can any other leaf code.
> >
> > It is only when you want to use the new regs for the core interpreter
> > code that you run into trouble.
> >
> Yeah. I guess, once supported, it is a question of how long until
> Windows/etc supports them, AMD supports them, support is widespread in
> CPUs that people have, ...
>
>
> I can note that this takes a while, and I wouldn't be entirely sure
> x86 doesn't start to lose market share before this happens (say, if more
> people start moving to ARM-based systems or similar, *).
>
> *: Though RISC-V is interesting, I don't really think it is "equipped"
> at this stage to compete against either x86 or ARM in the PC space, but
> does seem like a reasonable choice for the embedded space.
>
>
> As for JITs, yeah, they can use whatever instructions or registers are
> available to them...
>
>
> Also ironically, my past VMs used JITs, but for the BJX2 emulator I
> didn't really end up maintaining the JIT, as the normal interpreter was
> still fast enough to keep up with emulating the CPU core in real-time.
>
>
> Similarly, the code for emulating things like cache misses and
> branch-predictor costs and similar ends up costing more than "actually
> emulating the code" does, but this is needed for "mostly cycle-accurate"
> emulation.
>
> And, for my uses, it does matter if the performance characteristics of
> the emulator are roughly consistent with those of the Verilog
> implementation (so, eg, when decoding traces, it also keeps track of
> things like pipeline interlocks and similar to model how many
> clock-cycles each instruction and trace would take, with any penalties
> then being added for cache and branch-predictor misses).
>
>
> There is a "--wallspeed" option that basically means "disable modeling
> and run the interpreter as fast as it will go, using external wall-clock
> time as a reference rather than cycle-count for the internal time".
>
> Currently, this mode is fast enough that Quake and GLQuake run at
> double-digit speeds. The Doom engine games also run (initially) pretty
> fast, but are not giving a smooth experience as something is glitchy
> with the timing.
>
> With just the interpreter running at full speed, this is ~ 135 MHz / 163
> MIPs.
>
> Probing some more, for whatever reason disabling cache modeling
> ("--nomemcost") seems to result in decidedly "non-smooth" behavior in
> the Doom-engine games (but Quake and ROTT and similar are unaffected).
> This seems curious (may need to look into it).
>
>
> Could put work on trying to get the JIT back into working condition if
> there was more of a need for it, but as-is, it isn't really needed.
>
> I can note that most of my past JITs also used a mostly similar register
> allocation strategy to what is currently used by BGBCC (mostly; though a
> lot of my JITs used a more round-robin register-allocation strategy, *).
>
> ...
>
>
> *: Say, one has a rover for the last register assigned, and each time it
> needs a register, it advances one position and (if needed) evicts
> whatever was held in that register. Whereas BGBCC uses heuristics to
> decide which register to evict, rather than simply advancing a rover,
> there being pros/cons either way.
>
> In both cases, registers are typically loaded and evicted "as needed".
<
My 66000 performs register swaps {process <-> process} in HW as if
the register file were just 4 cache lines. Within a thread, it has ENTER
and EXIT instructions that perform prologue and epilogue sequences.
<
> > All this said, I strongly suspect that the variable sys save/restore
> > opcodes will be capable of handling everything, as long as the OS
> > initializes it properly. I.e. promise that the save areas will be large
> > enough.
> >
> > It would still probably be better performance wise to make those
> > save/restore functions partially lazy, so that code which never touches
> > the extended regs doesn't need the save/restore. OTOH, we have seen this
> > tried multiple times over the years, and we typically end up with the
> > much cleaner "grab everything" approach because it is easier to make
> > bug-free.
> >
> Agreed.
<
It is a delicate balance when performed in HW. In HW one can fetch all
4 cache lines that make up a register file in as few as 1 cycle (more
typically 1) and start the inbound flow BEFORE one interrupts the
currently running thread, and save the currently running register file
as the new registers arrive.
<
For a procedure call, do-something, return, call-somebody-else sequence,
if the EXIT ends up overlapping with the ENTER, the HW can access the
register range of each and short circuit both EXIT and ENTER. For
example::
<
EXIT R19,R0,#stack-deallocate
< to
CALL Somebody-else
< to
ENTER R24,R0,#stack-allocate
<
The EXIT can restore registers R19..R23 and quit, and the ENTER can
avoid storing R24..R31, but store R0 (new return address).
<
This seems both lazy and aggressive at the same time !!
>
> For things like task switching, "save and restore everything" makes the
> most sense.
<
The OS people will argue that there are some places where storing
"just a few registers" allows them to service the timer and return with
lower overhead than save-everything. How many fall into this category
is unknown to me.
>
> Otherwise, one might end up with a situation like on Win9x, where
> apparently programs using SSE registers could not coexist because the OS
> did not save or restore these registers.
>
>
> Similar applies in my ISA if trying to run a program built for 64 GPRs
> on a kernel built to assume 32 GPRs. Though, in this case, this is more
> likely to result in crashes rather than (merely) corrupted 3D graphics
> or similar.
>
>
>
> Looking around, it seems that there are no existing ISAs called
> "Sigma", and "SIGMA-ISA" or "SIGMA1-ISA" doesn't entirely suck.
<
Xerox Data Systems had a line of computers called SIGMA {5, 7, and 9}
>
> A lot of the other Greek letters seemingly have already been used for
> ISAs, mostly 32-bit RISC variants. Could always use Hebrew letters as
> names, but this seems like a bit more of a cultural issue (apart from a
> few obscure "math things", about the only things named with Hebrew
> letters are yeshivas and similar...).
<
It still surprises me that a Japanese programmer cannot write プリントフ
for printf.


Re: Intel goes to 32 GPRs

<abfde2a5-ff18-4129-8856-9db86dd962f0n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33510&group=comp.arch#33510

 by: MitchAlsup - Fri, 28 Jul 2023 20:07 UTC

On Friday, July 28, 2023 at 1:26:37 PM UTC-5, BGB wrote:
> On 7/26/2023 4:30 PM, MitchAlsup wrote:
> > On Wednesday, July 26, 2023 at 4:06:48 PM UTC-5, Scott Lurndal wrote:
> >> Terje Mathisen <terje.m...@tmsw.no> writes:
> >>> Anton Ertl wrote:
> >>>> Terje Mathisen <terje.m...@tmsw.no> writes:
> >>>>> If a function both needs lots of registers _and_ has to call any
> >>>>> non-inlined functions, then it really isn't that time critical.
> >>>>
> >>>> Every interpreter calls non-inlined functions, and they often need a
> >>>> lot of registers, or can make good use of them.
> >>>>
> >>>> Now you may consider that to be not time-critical, but if you are
> >>>> Intel and want to sell an interpreter user an Intel system, and it is
> >>>> slow compared to the ARM64 or RISC-V systems, you will still lose the
> >>>> sale.
> >>>
> >>> I'm willing to be shown otherwise, but until then I'd consider any code
> >>> that runs under an actual interpreter to be non-performance-critical.
> > <
> >> I would likely argue that an interpreter is inherently performance-critical
> >> in order to make the interpreted code useful.
> >>
> >> Take a machine simulator as an example of performance-critical
> >> interpreter (interpreting machine instructions rather than bytecode
> >> or some intermediate tree representation). Booting linux on the
> >> simulator, for example, is definitely performance critical from the
> >> standpoint of the user waiting for a login prompt[*].
> > <
> > As someone who has actually done this (1999), what we did was to
> > take portions (~= basic blocks) of code and compile them into traces
> > complete with CPU/cache/TLB statistics updates, and then run 95%±
> > of the interpreter as native code.
> > <
> > Is this still simulation ? obviously
> > but it is more like JIT than interpretation.
> > and eliminated most of the overhead of interpretation.
> > <
> > Obviously this is easier if target and computer are of the same architecture
> > but one could do Mc 88100 ISA on a SPARC V8 without much hassle {or
> > x86-64 on a SPARC V9,...}.
> I had done similar with past VMs.
>
> I haven't really ended up doing it much, or maintaining the JIT in a
> functional state, for my BJX2 emulator mostly as my PC can (mostly)
> manage to keep up with emulating this stuff faster than the core would
> run on an actual FPGA (and it was more useful mostly to invest things
> towards making it cycle accurate than fast).
<
We had a table of "native instructions" which would emulate the <possibly>
foreign ISA on the native computer. Each instruction was emulated by a series
of native instructions--augmented with instructions (or calls) to emulate the
{pipeline, cache, tag, TLB, miss buffer}.
<
As each instruction was decoded, the list of native instructions was concatenated
onto the current list in the current basic block, and at the end a simple peephole
optimizer was run over the list, and then the list was deposited in a "trace" cache.
I use ""s because we did not call it that.
<
There was a branch linkage table, and an instruction address tag on the trace.
As long as we were executing with the trace cache, it was all native and no
interpretation was being performed. We got simulation overhead down to about
DIV 7-to-DIV-10
<
We also special cased the IDLE state in the OS to simply advance RTC and
timers to the point of the next interrupt, so in the idle state real time passed
faster than (high resolution) wall clock time !!.
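<
A loose C sketch of that kind of trace-cache dispatch (the hash, table
size, and names are guesses, not the 1999 code): look the guest PC up in
the trace cache, translate and deposit on a miss, then run native:

    #include <stdint.h>

    typedef struct Trace {
        uint64_t guest_pc;              /* instruction-address tag */
        void   (*native)(void);         /* concatenated, peepholed code */
        struct Trace *linked;           /* branch linkage to successor */
    } Trace;

    #define TC_SLOTS 4096
    static Trace *tcache[TC_SLOTS];

    Trace *translate_block(uint64_t pc);  /* decode + peephole + deposit */

    static Trace *lookup_or_translate(uint64_t pc)
    {
        Trace **slot = &tcache[(pc >> 2) % TC_SLOTS];
        if (*slot && (*slot)->guest_pc == pc)
            return *slot;               /* all-native fast path */
        return *slot = translate_block(pc);
    }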
>
> So, for the most part, it ends up running in an interpreter, which is
> additionally weighed down by trying to model the costs of cache misses
> and the branch predictor and similar.
<
So, for the most part, mine ran with the interpreter idle.
>
> ...
>
>
> Both the interpreter and JIT would decode "traces" of instructions,
> usually terminated by a branch or on reaching a set limit (such as 32
> or 64 instructions).
<
We allocated several MB to the trace cache and allowed 2K-4K traces
of any size without arbitrary limitations.
>
> In the interpreter, each instruction is translated into a struct
> (holding the instruction fields) along with a function pointer (called
> for the logic of the instruction), with an unrolled sequence of function
> calls to these function pointers.
>
> For the JIT, the instructions would be translated into a machine-code
> sequence, which would then replace the normal trace-dispatch function
> pointer. Operations which were handled directly would be translated to
> native instructions, with others "spilling" any conflicting registers
> and then calling the function pointer (as in the interpreter).
>
>
> As-is, a fair chunk of the total native CPU cycles goes into
> updating the models for the L1 and L2 caches and similar (on every
> memory access). But there isn't really a more efficient way to model
> things like L1 and L2 cache misses.
<
We found a way to simulate all reasonable cache sizes (256 bytes-1MB)
and all numbers of sets simultaneously--you should too.
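<
The post doesn't say how, but the classic way to get every cache size
from one pass is Mattson's LRU stack-distance algorithm: record each
access's reuse depth once, and the hit count for any LRU cache of S lines
is the histogram summed up to S. A minimal fully-associative C sketch
(illustrative only; per-set simulation keeps one stack per set index):

    #include <stdint.h>
    #include <string.h>

    #define MAXDEPTH 16384            /* 16K 64-byte lines = 1MB */
    static uint64_t lru[MAXDEPTH];    /* line addresses, MRU first */
    static uint64_t hist[MAXDEPTH + 1];   /* hist[MAXDEPTH] = cold misses */
    static int      nlines;

    /* One access: find the line's LRU depth, bump that histogram
       bucket, move the line to the front. hits(S) = sum(hist[0..S-1]). */
    void touch(uint64_t addr)
    {
        uint64_t line = addr >> 6;
        int d;
        for (d = 0; d < nlines; d++)
            if (lru[d] == line)
                break;
        hist[d < nlines ? d : MAXDEPTH]++;
        if (d == nlines && nlines < MAXDEPTH)
            nlines++;                  /* first touch: grow the stack */
        else if (d == nlines)
            d = MAXDEPTH - 1;          /* stack full: drop the deepest */
        memmove(&lru[1], &lru[0], d * sizeof lru[0]);
        lru[0] = line;
    }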
> >>
> >> [*] Yet another thing systemd makes worse.

Re: Intel goes to 32 GPRs

<ua1ila$2c1na$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33511&group=comp.arch#33511

 by: BGB - Fri, 28 Jul 2023 23:22 UTC

On 7/28/2023 2:57 PM, MitchAlsup wrote:
> On Friday, July 28, 2023 at 1:02:33 PM UTC-5, BGB wrote:
>> On 7/27/2023 5:01 AM, Terje Mathisen wrote:
>>> Scott Lurndal wrote:
>>>> MitchAlsup <Mitch...@aol.com> writes:
>>>>> On Wednesday, July 26, 2023 at 4:06:48 PM UTC-5, Scott Lurndal wrote:
>>>>>> Terje Mathisen <terje.m...@tmsw.no> writes:
>>>>>>> Anton Ertl wrote:
>>>>>>>> Terje Mathisen <terje.m...@tmsw.no> writes:
>>>>>>>>> If a function both needs lots of registers _and_ has to call any
>>>>>>>>> non-inlined functions, then it really isn't that time critical.
>>>>>>>>
>>>>>>>> Every interpreter calls non-inlined functions, and they often need a
>>>>>>>> lot of registers, or can make good use of them.
>>>>>>>>
>>>>>>>> Now you may consider that to be not time-critical, but if you are
>>>>>>>> Intel and want to sell an interpreter user an Intel system, and it
>>>>>>>> is slow compared to the ARM64 or RISC-V systems, you will still
>>>>>>>> lose the sale.
>>>>>>>
>>>>>>> I'm willing to be shown otherwise, but until then I'd consider any
>>>>>>> code that runs under an actual interpreter to be non-performance-critical.
>>>>> <
>>>>>> I would likely argue that an interpreter is inherently
>>>>>> performance-critical in order to make the interpreted code useful.
>>>>>>
>>>>>> Take a machine simulator as an example of performance-critical
>>>>>> interpreter (interpreting machine instructions rather than bytecode
>>>>>> or some intermediate tree representation). Booting linux on the
>>>>>> simulator, for example, is definitely performance critical from the
>>>>>> standpoint of the user waiting for a login prompt[*].
>>>>> <
>>>>> As someone who has actually done this (1999)
>>>>
>>>> We had a simulator for the Burroughs mainframe in the 70's
>>>> that we used to test instruction set changes (it was in Burroughs
>>>> Algol and ran on a B7900).
>>>>
>>>> I currently work on an SoC simulator which simulates (functionally)
>>>> the entire SoC: ARM64 cores (dozens), peripheral controllers (e.g. SATA,
>>>> SPI, EMMC, I2C/I3C, networking), microcontrollers, accelerator blocks,
>>>> PCI, etc.
>>>>
>>>> Performance is key.
>>>>
>>>>
>>>> , what we did was to
>>>>> take portions (~= basic blocks) of code and compile them into traces
>>>>> complete with CPU/cache/TLB statistics updates, and then run 95%±
>>>>> of the interpreter as native code.
>>>>> <
>>>>> Is this still simulation ? obviously
>>>>> but it is more like JIT than interpretation.
>>>>
>>>> Most modern simulators (e.g. qemu) use jit-like mechanisms,
>>>> as did AMDs SimNow! back in the 2000s.
>>>
>>> Exactly my point: Those JIT-style traces can obviously use all the regs,
>>> as can any other leaf code.
>>>
>>> It is only when you want to use the new regs for the core interpreter
>>> code that you run into trouble.
>>>
>> Yeah. I guess, once supported, it is a question of how long until
>> Windows/etc supports them, AMD supports them, support is widespread in
>> CPUs that people have, ...
>>
>>
>> I can note that this takes a while, and I wouldn't be entirely sure
>> x86 doesn't start to lose market share before this happens (say, if more
>> people start moving to ARM-based systems or similar, *).
>>
>> *: Though RISC-V is interesting, I don't really think it is "equipped"
>> at this stage to compete against either x86 or ARM in the PC space, but
>> does seem like a reasonable choice for the embedded space.
>>
>>
>> As for JITs, yeah, they can use whatever instructions or registers are
>> available to them...
>>
>>
>> Also ironically, my past VMs used JITs, but for the BJX2 emulator I
>> didn't really end up maintaining the JIT, as the normal interpreter was
>> still fast enough to keep up with emulating the CPU core in real-time.
>>
>>
>> Similarly, the code for emulating things like cache misses and
>> branch-predictor costs and similar ends up costing more than "actually
>> emulating the code" does, but this is needed for "mostly cycle-accurate"
>> emulation.
>>
>> And, for my uses, it does matter if the performance characteristics of
>> the emulator are roughly consistent with those of the Verilog
>> implementation (so, eg, when decoding traces, it also keeps track of
>> things like pipeline interlocks and similar to model how many
>> clock-cycles each instruction and trace would take, with any penalties
>> then being added for cache and branch-predictor misses).
>>
>>
>> There is a "--wallspeed" option that basically means "disable modeling
>> and run the interpreter as fast as it will go, using external wall-clock
>> time as a reference rather than cycle-count for the internal time".
>>
>> Currently, this mode is fast enough that Quake and GLQuake run at
>> double-digit speeds. The Doom engine games also run (initially) pretty
>> fast, but are not giving a smooth experience as something is glitchy
>> with the timing.
>>
>> With just the interpreter running at full speed, this is ~ 135 MHz / 163
>> MIPs.
>>
>> Probing some more, for whatever reason disabling cache modeling
>> ("--nomemcost") seems to result in decidedly "non-smooth" behavior in
>> the Doom-engine games (but Quake and ROTT and similar are unaffected).
>> This seems curious (may need to look into it).
>>
>>
>> Could put work on trying to get the JIT back into working condition if
>> there was more of a need for it, but as-is, it isn't really needed.
>>
>> I can note that most of my past JITs also used a mostly similar register
>> allocation strategy to what is currently used by BGBCC (mostly; though a
>> lot of my JITs used a more round-robin register-allocation strategy, *).
>>
>> ...
>>
>>
>> *: Say, one has a rover for the last register assigned, and each time it
>> needs a register, it advances one position and (if needed) evicts
>> whatever was held in that register. Whereas BGBCC uses heuristics to
>> decide which register to evict, rather than simply advancing a rover,
>> there being pros/cons either way.
>>
>> In both cases, registers are typically loaded and evicted "as needed".
> <
> My 66000 performs register swaps {process <-> process} in HW as if
> the register file were just 4 cache lines. Within a thread, it has ENTER
> and EXIT instructions that perform prologue and epilogue sequences.
> <

I meant, for individual register allocation.

Say, in a JIT:
RBX, RSI, RDI, R12..R15: May be used to hold VM registers (callee save).
RBX/RSI/RDI: Fixed assignment (such as VM context).
R12..R15: Used for temporary assignment, round robin.
RAX/RCX/RDX: Unassigned scratch registers.
R8..R11: Allocated scratch registers.

So, for each instruction being JITted, if it finds the VM registers it
needs already mapped, it emits the operation directly. Else, it marks
whichever values match as "live" and then uses whichever register is next
according to the rover as the spill/reload register, skipping any that
are "live" in the current instruction.

One may want to mark the registers "live" before finding a register to
evict, since otherwise one may end up evicting a register just to end up
reloading it again.

Or, if calling into an interpreter function, spill the registers back to
their location in the VM context.

For prologs/epilogs in x86 or x64, it is usually "simpler" to not bother
figuring out which registers to save/restore, and just use all of the
callee-save registers (and then add on however much space is needed for
local storage, ...).
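
For example, the "save everything" x64 prolog such a JIT can emit without
any liveness analysis might look like this (a sketch using the
Windows-x64-style callee-save set named above; the byte sequences are
standard x64 encodings, the frame size is a placeholder):

    #include <stdint.h>
    #include <string.h>

    /* Emit: push rbx/rsi/rdi/r12..r15; sub rsp, frame. One fixed
       prolog for every JITted function, trading a few extra stores
       for never tracking which callee-save regs are actually used. */
    static uint8_t *emit_prolog(uint8_t *p, int32_t frame)
    {
        static const uint8_t pushes[] = {
            0x53, 0x56, 0x57,           /* push rbx; push rsi; push rdi */
            0x41, 0x54, 0x41, 0x55,     /* push r12; push r13 */
            0x41, 0x56, 0x41, 0x57      /* push r14; push r15 */
        };
        memcpy(p, pushes, sizeof pushes);
        p += sizeof pushes;
        *p++ = 0x48; *p++ = 0x81; *p++ = 0xEC;   /* sub rsp, imm32 */
        memcpy(p, &frame, 4);                    /* little-endian imm */
        return p + 4;
    }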


Re: Intel goes to 32 GPRs

<ua1k5j$2cbnq$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33512&group=comp.arch#33512

 by: BGB - Fri, 28 Jul 2023 23:48 UTC

On 7/28/2023 3:07 PM, MitchAlsup wrote:
> On Friday, July 28, 2023 at 1:26:37 PM UTC-5, BGB wrote:
>> On 7/26/2023 4:30 PM, MitchAlsup wrote:
>>> On Wednesday, July 26, 2023 at 4:06:48 PM UTC-5, Scott Lurndal wrote:
>>>> Terje Mathisen <terje.m...@tmsw.no> writes:
>>>>> Anton Ertl wrote:
>>>>>> Terje Mathisen <terje.m...@tmsw.no> writes:
>>>>>>> If a function both needs lots of registers _and_ have to call any
>>>>>>> non-inlined functions, then it really isn't that time critical.
>>>>>>
>>>>>> Every interpreter calls non-inlined functions, and they often need a
>>>>>> lot of registers, or can make good use of them.
>>>>>>
>>>>>> Now you may consider that to be not time-critical, but if you are
>>>>>> Intel and want to sell an interpreter user an Intel system, and it is
>>>>>> slow compared to the ARM64 or RISC-V systems, you will still lose the
>>>>>> sale.
>>>>>
>>>>> I'm willing to be shown otherwise, but util then I'd consider any code
>>>>> that runs under an actual interpreter to be non-performance-critical.
>>> <
>>>> I would likely argue that an interpreter is inherently performance-critical
>>>> in order to make the interpreted code useful.
>>>>
>>>> Take a machine simulator as an example of performance-critical
>>>> interpreter (interpreting machine instructions rather than bytecode
>>>> or some intermediate tree representation). Booting linux on the
>>>> simulator, for example, is definitely performance critical from the
>>>> standpoint of the user waiting for a login prompt[*].
>>> <
>>> As someone who has actually done this (1999), what we did was to
>>> take portions (~= basic blocks) of code and compile them into traces
>>> complete with CPU/cache/TLB statistics updates, and then run 95%±
>>> of the interpreter as native code.
>>> <
>>> Is this still simulation ? obviously
>>> but it is more like JIT than interpretation.
>>> and eliminated most of the overhead of interpretation.
>>> <
>>> Obviously this is easier of target and computer are of the same architecture
>>> but one could do Mc 88100 ISA on a SPARC V8 without much hassle {or
>>> x86-64 on a SPARC V9,...}.
>> I had done similar with past VMs.
>>
>> I haven't really ended up doing it much, or maintaining the JIT in a
>> functional state, for my BJX2 emulator mostly as my PC can (mostly)
>> manage to keep up with emulating this stuff faster than the core would
>> run on an actual FPGA (and it was more useful mostly to invest things
>> towards making it cycle accurate than fast).
> <
> We had a table of "native instructions" which would emulate the <possibly>
> foreign ISA on the native compute. Each instruction was emulated by a series
> of native instructions--augmented with instructions (or calls) to emulate the
> {pipeline, cache, tag, TLB, miss buffer}.
> <
> As each instruction was decoded, the list of native instructions was concatenated
> onto the current list in the current basic block, and at the end a simple peephole
> optimizer was run over the list, and then the list was deposited in a "trace" cache.
> I use ""s because we did not call it that.
> <
> There was a branch linkage table, and an instruction address tag on the trace.
> As long as we were executing with the trace cache, it was all native and no
> interpretation was being performed. We got simulation overhead down to about
> DIV 7-to-DIV-10
> <
> We also special cased the IDLE state in the OS to simply advance RTC and
> timers to the point of the next interrupt, so in the idle state real time passed
> faster than (high resolution) wall clock time !!.

My JITs typically fed instructions into an "assembler", which appended
onto the end of a "JIT cache" (often producing intermediate ASM blobs
via "sprintf()" or similar). When the JIT cache got full, the emulator
would flush the entire trace cache and start over with a clean slate
(similar to an I$ flush).

Pretty much all of this part was "plain old C".
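
Roughly this shape, much simplified (a sketch of the idea, not the
actual emulator code; invalidate_all_traces() is a stand-in for
whatever forgets the old trace pointers):

#include <stddef.h>

static unsigned char jit_cache[1 << 22];   /* a few MB of code space */
static size_t jit_pos;

void invalidate_all_traces(void);

/* Reserve room for one trace; flush everything when the cache fills. */
void *jit_reserve(size_t worst_case)
{
    if (jit_pos + worst_case > sizeof(jit_cache)) {
        jit_pos = 0;               /* full: start over with a clean slate */
        invalidate_all_traces();
    }
    return jit_cache + jit_pos;
}

/* The "assembler" just appends bytes onto the end of the cache. */
void jit_emit(unsigned char b)
{
    jit_cache[jit_pos++] = b;
}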

>>
>> So, for the most part, it ends up running in an interpreter, which is
>> additionally weighed down by trying to model the costs of cache mixes
>> and the branch predictor and similar.
> <
> So, for the most part, mine ran with the interpreter idle.

The interpreter is the only thing going on when no JIT is used...

>>
>> ...
>>
>>
>> Both the interpreter or JIT would decode "traces" of instructions
>> usually terminated by a branch or once reaching a set limit (such as 32
>> or 64 instructions).
> <
> We allocated several MB to the trace cache and allowed 2K-4K traces
> of any size without arbitrary limitations.

Imposing a limit seemed better for memory use and fragmentation if one
expects that periodically the entire trace cache will be flushed.

If all the traces are fixed-size, memory management becomes simple.
Though, there is memory waste when most traces are significantly
shorter than the limit (but a performance loss if the limit is sized
to the "average" trace length).

One could have multiple sizes, choosing which type of trace to use based
on length, but this adds complexity.
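
With a single fixed size, the trace pool can be a plain array with a
bump index, and a flush is a single reset (a sketch; sizes and struct
contents are illustrative):

#include <stdint.h>

#define TRACE_MAX_OPS 64       /* the per-trace instruction limit */
#define TRACE_POOL  4096

typedef struct {               /* decoded fields plus handler pointer */
    uint32_t fields;
    void   (*fn)(void *ctx, uint32_t fields);
} vm_op;

typedef struct {
    uint32_t n_ops;
    vm_op    ops[TRACE_MAX_OPS];
} trace_t;

static trace_t trace_pool[TRACE_POOL];
static int     trace_count;

trace_t *trace_alloc(void)
{
    if (trace_count >= TRACE_POOL)
        trace_count = 0;       /* pool full: flush the whole trace cache */
                               /* (lookup tables must be reset as well)  */
    return &trace_pool[trace_count++];
}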

>>
>> In the interpreter, each instruction is translated into a struct
>> (holding the instruction fields) along with a function pointer (called
>> for the logic of the instruction), with an unrolled sequence of function
>> calls to these function pointers.
>>
>> For the JIT, the instructions would be translated into a machine-code
>> sequence, which would then replace the normal trace-dispatch function
>> pointer. Operations which were handled directly would be translated to
>> native instructions, with others "spilling" any conflicting registers
>> and then calling the function pointer (as in the interpreter).
>>
>>
>> As-is, a fair chunk of the total native CPU cycles mostly go into
>> updating the models for the L1 and L2 caches and similar (on every
>> memory access). But, not really a more efficient way to model things
>> like L1 and L2 cache misses.
> <
> We found a way to simulate all reasonable cache sizes (256 bytes-1MB)
> and all numbers of sets simultaneously--you should too.

Possibly.

In my case, it is a naive strategy:
Call a function which takes the address of a load/store, uses it to
index a lookup table (representing the cache), updates counters as
needed, and updates the table as needed to reflect the missed address, ...
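
Roughly like this (a direct-mapped sketch; the real model's sizes,
associativity, and penalty accounting differ):

#include <stdint.h>

/* e.g. 32K L1 with 64-byte lines => 512 sets (parameters illustrative) */
#define L1_LINE_BITS 6
#define L1_SETS      512

static uint64_t l1_tag[L1_SETS];
static uint64_t l1_hit, l1_miss;

/* Called on every emulated load/store. */
void model_access(uint64_t addr)
{
    uint64_t line = addr >> L1_LINE_BITS;
    uint32_t set  = (uint32_t)(line & (L1_SETS - 1));

    if (l1_tag[set] == line) {
        l1_hit++;                 /* hit: just count it */
    } else {
        l1_miss++;                /* miss: charge the penalty here, */
        l1_tag[set] = line;       /* update table to the missed address */
        /* a fuller model recurses into an L2 table the same way */
    }
}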

Given there are lots of memory accesses, these functions are called a
lot (along with the function to translate addresses via the TLB, ...).

It doesn't model or detect possible cache-coherency issues; this would
add a bit more cost to the modeling.

Not really sure how to reduce cost significantly (while still keeping
the model relatively accurate).

As for "wherever causes weird performance issues in Doom and friends
when cache-modeling is disabled", not figured this part out yet (neither
profiling the emulator nor the profile stats within the emulated ISA
make it particularly obvious what is going wrong). Seems to be an
obvious/significant stall once per second or so...

>>>>
>>>> [*] Yet another thing systemd makes worse.

Re: Intel goes to 32 GPRs

<6e0afafb-4579-4681-bf80-dcbc514a96den@googlegroups.com>

 by: Quadibloc - Sat, 29 Jul 2023 03:52 UTC

On Friday, July 28, 2023 at 12:02:33 PM UTC-6, BGB wrote:

> Looking around, is seems that there are no existing ISA's called
> "Sigma", and "SIGMA-ISA" or "SIGMA1-ISA" doesn't entirely suck.
>
> A lot of the other Greek letters seemingly have already been used for
> ISA's, mostly 32-bit RISC variants.

It's true that the most famous and well-known architecture that
went by the name Sigma hardly qualifies as "existing" any longer.

This was a CISC architecture intended to provide functionality
closely similar to that of the IBM System/360, but with all instructions
32 bits in length. This line of computers was made by Scientific
Data Systems... and continued to be made by them after becoming
Xerox Data Systems shortly thereafter.

The widespread fame of that ISA is no doubt the reason why no
one else has used that letter, to avoid confusion.

John Savard

Re: Intel goes to 32 GPRs

<ua28d7$2i51d$1@dont-email.me>

 by: BGB - Sat, 29 Jul 2023 05:34 UTC

On 7/28/2023 10:52 PM, Quadibloc wrote:
> On Friday, July 28, 2023 at 12:02:33 PM UTC-6, BGB wrote:
>
>> Looking around, is seems that there are no existing ISA's called
>> "Sigma", and "SIGMA-ISA" or "SIGMA1-ISA" doesn't entirely suck.
>>
>> A lot of the other Greek letters seemingly have already been used for
>> ISA's, mostly 32-bit RISC variants.
>
> It's true that the most famous and well-known architecture that
> went by the name Sigma hardly qualifies as "existing" any longer.
>

Ironically, my initial search for this name came up empty.

With more searching, "Xerox Sigma" brings up a bunch of material about
it. I had not heard anything about this system previously (and my
earlier search attempts, without the "Xerox", didn't seem to bring up
much of anything about it).

> This was a CISC architecture intended to provide functionality
> closely similar to that of the IBM System/360, but with all instructions
> 32 bits in length. This line of computers was made by Scientific
> Data Systems... and continued to be made by them after becoming
> Xerox Data Systems shortly thereafter.
>
> The widespread fame of that ISA is no doubt the reason why no
> one else has used that letter, to avoid confusion.
>

Possibly so. I had tried a bunch of other letters, most of which turned
up projects already using that letter as a name...

It seems the space of Greek and Latin letters for project names is
basically already used up.

> John Savard

Re: Intel goes to 32 GPRs

<ua600g$1ha93$1@newsreader4.netcologne.de>

 by: Thomas Koenig - Sun, 30 Jul 2023 15:35 UTC

Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
> Thomas Koenig wrote:
>> Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
>>> Thomas Koenig wrote:
>>>> Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
>>>>> Thomas Koenig wrote:
>>>>
>>>>>> One problem I see is that all the new registers are caller-saved,
>>>>>> for compatibility with existing ABIs. This is needed due to stack
>>>>>> unwinding and setjmp/longjmp, but restricts their benefit due to
>>>>>> having to spill them across function calls. It might be possible
>>>>>> to set __attribute__((nothrow)) on functions where this cannot
>>>>>> happen, and change some caller-saved to callee-saved registers
>>>>>> in that case, but that could be an interesting discussion.
>>>>>>
>>>>>
>>>>> I'm not worried at all about this point: The only places where I really
>>>>> want lots of registers are in big/complicated leaf functions!
>>>>>
>>>>> If a function both needs lots of registers _and_ have to call any
>>>>> non-inlined functions, then it really isn't that time critical.
>>>>
>>>> Fortran can use lots of registers for its array descriptors,
>>>> and also can use lots of library calls for mathematical functions
>>>> (because most CPUs don't have Mitch's single instructions for them).
>>>> Fortran library functions are typically __attribute__((nothrow)),
>>>> so in that field being able to use more registers across calls
>>>> would be a good thing, generally.
>>>>
>>> If said Fortran code is really performance critical, like in an FFT,
>>> then all the sin/cos function calls will be done up front and cached.
>>
>> If you're doing lots of chemical reaction calculation, it is
>> not possible to pre-compute the Arrhenius equation coefficients
>> (and their Jacobians).
>>
>> There's more to life than FFT :-)
>>
> I believe you.
>
> I still think the crux of the matter is in how complicated those special
> functions are. I.e. in your examples would it be possible to either
> inline or wrap the Arrhenius calculation in a save/restore registers pair?

Possibly; this would need special treatment by the compiler.

Prompted by this thread, I asked on the gcc mailing list about
__attribute__((nothrow)), and it seems that this is not sufficient;
see https://gcc.gnu.org/pipermail/gcc/2023-July/242147.html and
replies. It seems the psABI for amd64 pretty much nailed things
down so that new registers would be caller-saved :-(

One of the respondents wrote that he had submitted a patch for a
function attribute which adds the information that some registers are
not, in fact, clobbered by a function. This is a concept that might be
extended by recording which registers are actually used, and using
that, on the caller side, to determine which registers need not be
saved.
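
For a taste of what exists already: GCC's x86 back end has a coarse,
all-or-nothing form of this, where a function promises to clobber no
registers at all and the compiler saves whatever it touches internally
(example mine; the patch above would presumably refine this to a
per-register list):

/* Callers may assume this call clobbers no registers. */
__attribute__((no_caller_saved_registers))
void bump_counter(long *p)
{
    (*p)++;
}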

Re: Intel goes to 32 GPRs

<2023Jul30.180937@mips.complang.tuwien.ac.at>

 by: Anton Ertl - Sun, 30 Jul 2023 16:09 UTC

Thomas Koenig <tkoenig@netcologne.de> writes:
>One of the respondents wrote that he was had submitted a patch about
>a function attribute which adds information that some registers
>are not, in fact clobbered, by a function. This is a concept that
>might be extended by recording information about which registers
>are actually used, and using that, on the caller side, to determine
>which registers need not be saved.

At first blush this is not useful for general consumption. However,
it could point to a way to a gradual transition to a new calling
convention: This attribute should be a directive for compiling this
function as well as informing callers of the function that they don't
need to save these registers. Then you could compile a library for
the new calling convention and put that information in the .h file.
Calls from the library to the elsewhere (e.g., callbacks) would still
have to use the standard calling convention.

Another way to get the transition is to put two versions of each
function in each library: One for the old calling convention, one for
the new one; and use some linker magic to let old-convention binaries
link with the old-convention functions and new-convention binaries
with the new-convention functions (as is done with
functionality-reduced implementations of some functions (e.g.,
memcpy()) in glibc now).

Maybe the situation is more hopeful than I thought at first, but we
will see.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Intel goes to 32-bit general purpose registers

<2023Jul30.182952@mips.complang.tuwien.ac.at>

 by: Anton Ertl - Sun, 30 Jul 2023 16:29 UTC

Peter Lund <peterfirefly@gmail.com> writes:
>On Wednesday, July 26, 2023 at 8:03:00 AM UTC+2, Anton Ertl wrote:
>> So with the REX2 prefix you replace the 0F prefix of the
>> map1 instructions. That's the advantage of having the REX2 prefix
>> over having two REX prefixes as AMD considered according to Mitch
>> Alsup. The disadvantage is that it occupies the D5 byte of AMD64
>
>That is exactly what the 2x REX scheme could have done. REX2 has a
>payload of 8 bits. 2x REX has a payload of 4+4=8 bits. You need W +
>2x3 extra bits for the register numbers. Then you have one bit left
>over, which REX2 uses for the map selection. 2x REX could have done
>exactly the same -- what would you need two W bits for?

Yes, good point.

>PS: It's nice to get tidbits like that, even with the inevitable decade+
>delay. I wonder what AMD had considered using the extra W bit for...

Yes. Apart from saving the 0F prefix, what alternative uses could
that bit have been put to?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Intel goes to 32 GPRs

<2023Jul30.184253@mips.complang.tuwien.ac.at>

 by: Anton Ertl - Sun, 30 Jul 2023 16:42 UTC

Terje Mathisen <terje.mathisen@tmsw.no> writes:
>It is only when you want to use the new regs for the core interpreter
>code that you run into trouble.

This suggests the idea of turning the interpreter inside out: have a
part that's a leaf function and that processes all virtual-machine
(VM) instructions until it reaches one that requires a call. Then the
leaf part returns and the calling part performs calling VM
instructions until it reaches one that does not call; then it calls
the leaf part again.

The transitions between the parts will be expensive, but for code
where interpreter performance matters (which has a very low proportion
of calls) it is probably a win (and the rest spends much of its time
in the called functions anyway).
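
In outline (a C sketch; the types and helpers here are invented for
illustration):

typedef struct vm_state vm_state;
typedef struct vm_inst  vm_inst;

int      inst_needs_call(vm_inst *ip);
vm_inst *execute_simple (vm_state *s, vm_inst *ip);  /* non-calling ops */
vm_inst *execute_calling(vm_state *s, vm_inst *ip);  /* ops that call out */

/* Leaf part: makes no outgoing calls, so the compiler is free to keep
   VM state in the new caller-saved registers. */
static vm_inst *run_leaf(vm_state *s, vm_inst *ip)
{
    while (!inst_needs_call(ip))
        ip = execute_simple(s, ip);
    return ip;                       /* hand back a calling instruction */
}

void run(vm_state *s, vm_inst *ip)
{
    for (;;) {
        ip = run_leaf(s, ip);        /* fast path */
        ip = execute_calling(s, ip); /* slow path, calls allowed */
    }
}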

I doubt that there will be many interpreter writers who will impose
this burden on their code just to support an architecture with a bad
calling convention. We certainly did not do it for MIPS and Alpha,
and we won't do it for AMD64+Intel APX.

>All this said, I strongly suspect that the variable sys save/restore
>opcodes wil be capable of handling everything, as long as the OS
>initializes it properly. I.e. promise that the save areas will be large
>enough.

That's for context switches, not for calls and returns.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Intel goes to 32 GPRs

<2023Jul30.185326@mips.complang.tuwien.ac.at>

 by: Anton Ertl - Sun, 30 Jul 2023 16:53 UTC

Terje Mathisen <terje.mathisen@tmsw.no> writes:
>Anton Ertl wrote:
>> Terje Mathisen <terje.mathisen@tmsw.no> writes:
>>> If a function both needs lots of registers _and_ have to call any
>>> non-inlined functions, then it really isn't that time critical.
>>
>> Every interpreter calls non-inlined functions, and they often need a
>> lot of registers, or can make good use of them.
>>
>> Now you may consider that to be not time-critical, but if you are
>> Intel and want to sell an interpreter user an Intel system, and it is
>> slow compared to the ARM64 or RISC-V systems, you will still lose the
>> sale.
>
>I'm willing to be shown otherwise, but util then I'd consider any code
>that runs under an actual interpreter to be non-performance-critical.

Every application written in an interpreted language and later
rewritten in some higher-performance language is a case where the
interpreter's performance used to be performance-critical. And those
cases where such rewrite projects failed are cases where it still is
performance-critical.

Every performance improvement project for an interpreter
(e.g. <https://lwn.net/Articles/930705/>) shows that the performance
of the interpreter matters. The fact that people perform performance
measurements shows this, too.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Intel goes to 32 GPRs

<ua6lu2$1hq28$1@newsreader4.netcologne.de>

 by: Thomas Koenig - Sun, 30 Jul 2023 21:49 UTC

Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

> Every performance improvement project for an interpreter
> (e.g. <https://lwn.net/Articles/930705/>) shows that the performance
> of the interpreter matters. The fact that people perform performance
> measurements shows this, too.

Widely-used interpreted languages such as Python or Matlab are
known to be as slow as molasses when they do not call highly-
efficient compiled code. Any improvement (such as the link above)
is welcome there.

A concept followed by languages like the one used by Julia, which
uses JIT compilation much more aggressively, seems to be a better
approach. I have barely glanced at it yet, but it certainly seems
worth looking into for scientific work.

Re: Intel goes to 32 GPRs

<f4e988e1-bad1-45d1-ad7b-102731e8ca0bn@googlegroups.com>

 by: MitchAlsup - Mon, 31 Jul 2023 01:56 UTC

On Sunday, July 30, 2023 at 4:49:58 PM UTC-5, Thomas Koenig wrote:
> Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> > Every performance improvement project for an interpreter
> > (e.g. <https://lwn.net/Articles/930705/>) shows that the performance
> > of the interpreter matters. The fact that people perform performance
> > measurements shows this, too.
<
> Widely-used interpreted languages such as Python or Matlab are
> known to be as slow as molasses when they do not call highly-
> efficient compiled code. Any improvement (such as the link above)
> is welcome there.
<
FOCAL on the PDP-8 seemed fast enough for the day.............
>
> A concept followed by languages like the one used by Julia, which
> uses JIT compilation much more aggressively, seems to be a better
> approach. I have barely glanced at it yet, but it certainly seems
> worth looking into for scientific work.

Re: Intel goes to 32 GPRs

<ua7l9s$3797l$1@dont-email.me>

 by: Terje Mathisen - Mon, 31 Jul 2023 06:45 UTC

Anton Ertl wrote:
> Terje Mathisen <terje.mathisen@tmsw.no> writes:
>> It is only when you want to use the new regs for the core interpreter
>> code that you run into trouble.
>
> This suggests the idea of turning the interpreter inside out: have a
> part that's a leaf function and that processes all virtual-machine
> (VM) instructions until it reaches one that requires a call. Then the
> leaf part returns and the calling part performs calling VM
> instructions until it reaches one that does not call; then it calls
> the leaf part again.
>
> The transitions between the parts will be expensive, but for code
> where interpreter performance matters (which has a very low proportion
> of calls) it is probably a win (and the rest spends much of its time
> in the called functions anyway).
>
> I doubt that there will be many interpreter writers who will impose
> this burden on their code just to support an architecture with a bad
> calling convention. We certainly did not do it for MIPS and Alpha,
> and we won't do it for AMD64+Intel APX.
>
>> All this said, I strongly suspect that the variable sys save/restore
>> opcodes wil be capable of handling everything, as long as the OS
>> initializes it properly. I.e. promise that the save areas will be large
>> enough.
>
> That's for context switches, not for calls and returns.

Oops, Mea Culpa!

You are correct of course. They would have needed a new version of
PUSHA/POPA and then gotten compiler writers to use them, even though
most registers do not need to be saved.

Yeah, this is my asm background coming back to hurt me: My asm functions
have typically used custom save/restore lists for every function
depending upon what it needed, with no common ABI.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Intel goes to 32 GPRs

<ua7mn3$37cuf$1@dont-email.me>

 by: Terje Mathisen - Mon, 31 Jul 2023 07:09 UTC

Thomas Koenig wrote:
> Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>
>> Every performance improvement project for an interpreter
>> (e.g. <https://lwn.net/Articles/930705/>) shows that the performance
>> of the interpreter matters. The fact that people perform performance
>> measurements shows this, too.
>
> Widely-used interpreted languages such as Python or Matlab are
> known to be as slow as molasses when they do not call highly-
> efficient compiled code. Any improvement (such as the link above)
> is welcome there.
>
> A concept followed by languages like the one used by Julia, which
> uses JIT compilation much more aggressively, seems to be a better
> approach. I have barely glanced at it yet, but it certainly seems
> worth looking into for scientific work.
>
This is what I'm really arguing for: If any real interpreter is doomed
to always be much, much slower than inline binary code, then we need
better ways to (at runtime?) convert the former into the latter, i.e.
JIT optimization of the parts that need it.

Changing the hardware to make indirect function calls and returns far
faster would be nice, but they will always cost a lot more than zero.

I believe conditional branches are the one CPU feature where a naive
interpreter (calling a function for each opcode) can be relatively fast:
calculate the new IP with a conditional/predicated move and return, with
no branch predictor issues.
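
I.e. something like this C sketch (assuming the compiler turns the ?:
into a conditional move, which it typically will for code this simple):

#include <stdint.h>

/* VM conditional branch: select the new IP without a conditional jump. */
static inline const uint32_t *vm_cond_branch(const uint32_t *ip,
                                             long flag,
                                             const uint32_t *taken)
{
    return flag ? taken : ip + 1;    /* a cmov, not a predicted branch */
}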

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Intel goes to 32 GPRs

<2023Jul31.091131@mips.complang.tuwien.ac.at>

 by: Anton Ertl - Mon, 31 Jul 2023 07:11 UTC

MitchAlsup <MitchAlsup@aol.com> writes:
>FOCAL on the PDP-8 seemed fast enough for the day.............

Yes, that's typical. Programmers use an interpreted language
implementation that appears fast enough for the job at hand. In some
cases the program is later used in a context where the language
implementation is no longer fast enough, i.e., the language
implementation becomes performance-critical.

APX will hardly help the performance of these language implementations
as long as there are as few callee-saved registers as without APX.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Intel goes to 32 GPRs

<2023Jul31.172330@mips.complang.tuwien.ac.at>

 by: Anton Ertl - Mon, 31 Jul 2023 15:23 UTC

Terje Mathisen <terje.mathisen@tmsw.no> writes:
>Thomas Koenig wrote:
>> A concept followed by languages like the one used by Julia, which
>> uses JIT compilation much more aggressively, seems to be a better
>> approach. I have barely glanced at it yet, but it certainly seems
>> worth looking into for scientific work.

Julia is interesting in many respects, but it uses a slow compiler
(IIRC LLVM) plus caching in a JIT-like setting. I.e., when the
compiler sees new code, you see delays far beyond what you see with
interpreters or more conventional JITs. This may be a good choice for
Julia, but I have my doubts that it would have gained popularity for a
language with a different application area.

>This is what I'm really arguing for: If any real interpreter is doomed
>to always be much, much slower than inline binary code, then we need
>better ways to (at runtime?) convert the former into the latter, i.e.
>JIT optimization of the parts that need it.

You may think so, but the reality often is: When a language designer
wants to design an interactive language, the cheap and straightforward
way is to write an interpreter.

Will you finance the teams that are necessary to implement JIT
compilers for every newfangled language?

Will you finance the teams necessary to implement JIT compilers for
those languages that have found a following? That's of course a much
smaller number of languages, but then, each of these teams will have
to deal with entrenched usage and may not be able to replace the
interpreter. Note that PyPy was started 16 years ago, and yet
CPython (the slow interpreter) is by far the most popular Python
implementation.

You can put your hopes in Truffle, and maybe Truffle will change the
way aspiring language designers work, but if so, it will be a big
marketing success.

>Changing the hardware to make indirect function calls and returns far
>faster would be nice, but they will always cost a lot more than zero.

Indirect function calls and returns are fast on modern hardware in the
common case (except if you work around Spectre with retpolines, then
they are very slow).

>I believe conditional branches is the one CPU feature where a naive
>interpreter (calling a function for each opcode) can be relatively fast:
>Calculate the new IP with a conditional/predicated move and return, with
>no branch predictor issues.

I don't know many interpreters that call a function for each opcode
(actually only one: PFE); many use a big switch in one function; this
is better for performance, because more than one virtual-machine
register can be passed in real registers. Some are more advanced and
use techniques like threaded code. In every one of these cases, you
have an indirect call or indirect branch for every VM instruction, and
you therefore need indirect branch prediction dearly; compiling an
interpreter with Spectre v2 mitigations to use a retpoline for the
indirect jumps/calls causes a huge slowdown.
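
For concreteness, the two common shapes, sketched with a made-up
two-opcode VM (the threaded version uses GCC's computed-goto
extension; top of stack is sp[0]):

enum { OP_ADD, OP_EXIT };

/* Big switch in one function: ip and sp can stay in host registers. */
void run_switch(const int *ip, long *sp)
{
    for (;;) {
        switch (*ip++) {
        case OP_ADD:  sp[1] += sp[0]; sp++; break;
        case OP_EXIT: return;
        }
    }
}

/* Threaded code: each VM instruction is the address of its handler,
   so every instruction ends in its own indirect branch. */
void run_threaded(long *sp)
{
    void *prog[2];
    void **ip = prog;
    prog[0] = &&op_add;              /* "compile": add, then exit */
    prog[1] = &&op_exit;
    goto **ip++;
op_add:  sp[1] += sp[0]; sp++; goto **ip++;
op_exit: return;
}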

Concerning a VM-level conditional branch, if you use a conditional
move to set the instruction pointer (IP), yes, you have eliminated
that branch prediction, but a few instructions later there will be a
fetch from that IP and then an indirect branch (or indirect call)
based on the fetched value, and that has to be predicted, and it will
tend to predict worse than the conditional branch predictor (which is
tuned for conditional branches). I am not an expert in indirect
branch predictors in current CPUs, but if the conditional-branch
history is also used for predicting indirect branches, it may be
detrimental to eliminate the conditional branch from the
implementation of the VM-level conditional branch.

There is one technique called selective inlining (aka dynamic
superinstructions) that eliminates the indirect branches for
straight-line VM code. In that case a VM-level conditional branch
only needs an indirect branch in the branch-taken case, but then you
need a native-code-level conditional branch to skip the indirect
branch.

E.g., for the Forth code

: foo dup if 1+ then ;

the resulting VM code plus the machine code is (gforth-fast on
RISC-V):

$3FAA160638 dup 1->2
0x0000003fa9e04ede: mv s0,s7
$3FAA160640 ?branch 2->1
$3FAA160648 <foo+$20>
0x0000003fa9e04ee0: addi s10,s10,24
0x0000003fa9e04ee2: ld a5,-8(s10)
0x0000003fa9e04ee6: bnez s0,0x3fa9e04eee
0x0000003fa9e04ee8: ld a4,0(a5)
0x0000003fa9e04eea: mv s10,a5
0x0000003fa9e04eec: jr a4
$3FAA160650 1+ 1->1
0x0000003fa9e04eee: addi s7,s7,1
0x0000003fa9e04ef0: addi s10,s10,8
$3FAA160658 ;s 1->1
0x0000003fa9e04ef2: ld a6,0(s2)
0x0000003fa9e04ef6: addi s2,s2,8
0x0000003fa9e04ef8: mv s10,a6
0x0000003fa9e04efa: ld a4,0(s10)
0x0000003fa9e04efe: jr a4

The two "jr" instructions are the indirect branches. The "bnez"
branches around the branch-taken case and its indirect branch to the
code for 1+.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Intel goes to 32 GPRs

<2a7f42cd-f7f1-46dd-963b-047966cf1e28n@googlegroups.com>

 by: MitchAlsup - Mon, 31 Jul 2023 17:08 UTC

On Monday, July 31, 2023 at 2:09:27 AM UTC-5, Terje Mathisen wrote:
> Thomas Koenig wrote:
> > Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> >
> >> Every performance improvement project for an interpreter
> >> (e.g. <https://lwn.net/Articles/930705/>) shows that the performance
> >> of the interpreter matters. The fact that people perform performance
> >> measurements shows this, too.
> >
> > Widely-used interpreted languages such as Python or Matlab are
> > known to be as slow as molasses when they do not call highly-
> > efficient compiled code. Any improvement (such as the link above)
> > is welcome there.
> >
> > A concept followed by languages like the one used by Julia, which
> > uses JIT compilation much more aggressively, seems to be a better
> > approach. I have barely glanced at it yet, but it certainly seems
> > worth looking into for scientific work.
> >
> This is what I'm really arguing for: If any real interpreter is doomed
> to always be much, much slower than inline binary code, then we need
> better ways to (at runtime?) convert the former into the latter, i.e.
> JIT optimization of the parts that need it.
<
{{
A real interpreter is going to average DIV 1000
Jitting instructions into a trace is going to average DIV 100.
A JIT trace is going to average DIV 10
}} per interpreted instruction.
But here JIT means converting simulated instructions into native
instructions with a touch of peephole optimization.
>
> Changing the hardware to make indirect function calls and returns far
> faster would be nice, but they will always cost a lot more than zero.
<
They only cost the fetch delay of getting the new IP {as opposed to the
delay of AGENing a new IP = IP + relative-displacement<<2}.
<
I see no reason that return takes any longer than it currently does.
>
> I believe conditional branches is the one CPU feature where a naive
> interpreter (calling a function for each opcode) can be relatively fast:
> Calculate the new IP with a conditional/predicated move and return, with
> no branch predictor issues.
> Terje
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

Re: Intel goes to 32 GPRs

<c0222fa1-4bce-4b3d-9a1d-e498b5b3f90an@googlegroups.com>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=33529&group=comp.arch#33529

  copy link   Newsgroups: comp.arch
X-Received: by 2002:a05:620a:15e:b0:76c:9781:6133 with SMTP id e30-20020a05620a015e00b0076c97816133mr26670qkn.12.1690823327579;
Mon, 31 Jul 2023 10:08:47 -0700 (PDT)
X-Received: by 2002:a9d:6451:0:b0:6b9:667a:7211 with SMTP id
m17-20020a9d6451000000b006b9667a7211mr12146106otl.4.1690823327275; Mon, 31
Jul 2023 10:08:47 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 31 Jul 2023 10:08:46 -0700 (PDT)
In-Reply-To: <2023Jul31.172330@mips.complang.tuwien.ac.at>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:1dd2:2482:63f7:65eb;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:1dd2:2482:63f7:65eb
References: <u9o14h$183or$1@newsreader4.netcologne.de> <2023Jul25.175501@mips.complang.tuwien.ac.at>
<c12bf4b2-8844-46c1-bd5a-e4b060baec24n@googlegroups.com> <u9qo9l$1f22o$1@dont-email.me>
<u9qrho$19u0h$1@newsreader4.netcologne.de> <u9r0jd$1ftnb$1@dont-email.me>
<2023Jul26.221142@mips.complang.tuwien.ac.at> <u9s14g$1jed5$1@dont-email.me>
<2023Jul30.185326@mips.complang.tuwien.ac.at> <ua6lu2$1hq28$1@newsreader4.netcologne.de>
<ua7mn3$37cuf$1@dont-email.me> <2023Jul31.172330@mips.complang.tuwien.ac.at>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <c0222fa1-4bce-4b3d-9a1d-e498b5b3f90an@googlegroups.com>
Subject: Re: Intel goes to 32 GPRs
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Mon, 31 Jul 2023 17:08:47 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 7699
 by: MitchAlsup - Mon, 31 Jul 2023 17:08 UTC

On Monday, July 31, 2023 at 11:34:46 AM UTC-5, Anton Ertl wrote:
> Terje Mathisen <terje.m...@tmsw.no> writes:
> >Thomas Koenig wrote:
> >> A concept followed by languages like the one used by Julia, which
> >> uses JIT compilation much more aggressively, seems to be a better
> >> approach. I have barely glanced at it yet, but it certainly seems
> >> worth looking into for scientific work.
> Julia is interesting in many respects, but it uses a slow compiler
> (IIRC LLVM) plus caching in a JIT-like setting. I.e., when the
> compiler sees new code, you see delays far beyond what you see with
> interpreters or more conventional JITs. This may be a good choice for
> Julia, but I have my doubts that it would have gained popularity for a
> language with a different application area.
> >This is what I'm really arguing for: If any real interpreter is doomed
> >to always be much, much slower than inline binary code, then we need
> >better ways to (at runtime?) convert the former into the latter, i.e.
> >JIT optimization of the parts that need it.
> You may think so, but the reality often is: When a language designer
> wants to design an interactive language, the cheap and straightforward
> way is to write an interpreter.
>
> Will you finance the teams that are necessary to implement JIT
> compilers for every newfangled language?
>
> Will you finance the teams necessary to implement JIT compilers for
> those languages that have found a following? That's of course a much
> smaller number of languages, but then, each of these teams will have
> to deal with entrenched usage and may not be able to replace the
> interpreter. Note that PyPy was started 16 years ago, and yet
> CPython (the slow interpreter) is by far the most popular Python
> implementation.
>
> You can put your hopes in Truffle, and maybe Truffle will change the
> way aspiring language designers work, but if so, it will be a big
> marketing success.
<
> >Changing the hardware to make indirect function calls and returns far
> >faster would be nice, but they will always cost a lot more than zero.
> Indirect function calls and returns are fast on modern hardware in the
> common case (except if you work around Spectre with retpolines, then
> they are very slow).
<
My 66000 fixed this--fast external function linkage and immunity to Spectre.
<
> >I believe conditional branches are the one CPU feature where a naive
> >interpreter (calling a function for each opcode) can be relatively fast:
> >Calculate the new IP with a conditional/predicated move and return, with
> >no branch predictor issues.
<
> I don't know many interpreters that call a function for each opcode
> (actually only one: PFE); many use a big switch in one function; this
> is better for performance, because more than one virtual-machine
> register can be kept in real registers. Some are more advanced and
> use techniques like threaded code. In every one of these cases, you
> have an indirect call or indirect branch for every VM instruction, and
> you therefore dearly need indirect branch prediction; compiling an
> interpreter with Spectre v2 mitigations (a retpoline for the indirect
> jumps/calls) causes a huge slowdown.
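
As a minimal sketch of the two dispatch styles described above (a
hypothetical two-opcode VM, invented here for illustration; computed
goto is a GCC/Clang extension, not standard C): the switch version
funnels every dispatch through one shared indirect branch, while the
threaded version ends each VM instruction with its own indirect
branch, i.e. its own prediction site:

    /* Sketch only: hypothetical bytecode {ADD imm8, HALT}. */
    #include <stdint.h>

    long run_switch(const uint8_t *ip, long acc)
    {
        for (;;) {
            switch (*ip++) {              /* one shared indirect branch */
            case 0: acc += *ip++; break;  /* ADD imm8 */
            case 1: return acc;           /* HALT */
            }
        }
    }

    long run_threaded(const uint8_t *ip, long acc)
    {
        static void *tab[] = { &&add, &&halt };
        goto *tab[*ip++];                   /* one indirect branch ... */
    add:  acc += *ip++; goto *tab[*ip++];   /* ... per VM instruction  */
    halt: return acc;
    }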
<
What you describe is both slow and prone to attack....
>
> Concerning a VM-level conditional branch, if you use a conditional
> move to set the instruction pointer (IP), yes, you have eliminated
> that branch prediction, but a few instructions later there will be a
> fetch from that IP and then an indirect branch (or indirect call)
> based on the fetched value, and that has to be predicted, and it will
> tend to predict worse than the conditional branch predictor (which is
> tuned for conditional branches). I am not an expert in indirect
> branch predictors in current CPUs, but if the conditional-branch
> history is also used for predicting indirect branches, it may be
> detrimental to eliminate the conditional branch from the
> implementation of the VM-level conditional branch.
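
For concreteness, the branchless VM-level branch under discussion, as
a sketch in C with an invented bytecode layout (a branch opcode
followed by a one-byte relative target); a compiler can be expected,
though not guaranteed, to turn the ternary into a conditional move,
leaving only the dispatch branch on the fetched opcode to predict:

    /* Sketch: VM conditional branch with no native conditional branch. */
    #include <stdint.h>

    const uint8_t *vm_cond_branch(const uint8_t *ip, long flag)
    {
        const uint8_t *taken    = ip + (int8_t)ip[1];  /* invented encoding */
        const uint8_t *fallthru = ip + 2;
        return flag ? taken : fallthru;  /* ideally a cmov/csel */
        /* the caller still dispatches indirectly on the fetched opcode */
    }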
<
If/when this is a concern, SW throws a 1MB predictor at the
problem and uses a modern predictor algorithm for high
predictability. This should have better prediction than modern
CPU predictors due merely to its size.
>
> There is one technique called selective inlining (aka dynamic
> superinstructions) that eliminates the indirect branches for
> straight-line VM code. In that case a VM-level conditional branch
> only needs an indirect branch in the branch-taken case, but then you
> need a native-code-level conditional branch to skip the indirect
> branch.
<
Complete inlining is called a trace cache.
>
> E.g., for the Forth code
>
> : foo dup if 1+ then ;
>
> the resulting VM code plus the machine code is (gforth-fast on
> RISC-V):
>
> $3FAA160638 dup 1->2
> 0x0000003fa9e04ede: mv s0,s7
> $3FAA160640 ?branch 2->1
> $3FAA160648 <foo+$20>
> 0x0000003fa9e04ee0: addi s10,s10,24
> 0x0000003fa9e04ee2: ld a5,-8(s10)
> 0x0000003fa9e04ee6: bnez s0,0x3fa9e04eee
> 0x0000003fa9e04ee8: ld a4,0(a5)
> 0x0000003fa9e04eea: mv s10,a5
> 0x0000003fa9e04eec: jr a4
> $3FAA160650 1+ 1->1
> 0x0000003fa9e04eee: addi s7,s7,1
> 0x0000003fa9e04ef0: addi s10,s10,8
> $3FAA160658 ;s 1->1
> 0x0000003fa9e04ef2: ld a6,0(s2)
> 0x0000003fa9e04ef6: addi s2,s2,8
> 0x0000003fa9e04ef8: mv s10,a6
> 0x0000003fa9e04efa: ld a4,0(s10)
> 0x0000003fa9e04efe: jr a4
>
> The two "jr" instructions are the indirect branches. The "bnez"
> branches around the branch-taken case and its indirect branch to the
> code for 1+.
> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Re: Intel goes to 32 GPRs

<ua8q9v$1j50n$1@newsreader4.netcologne.de>


https://news.novabbs.org/devel/article-flat.php?id=33530&group=comp.arch#33530

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!2.eu.feeder.erje.net!feeder.erje.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-e8ee-0-e52f-b509-aff3-9d0a.ipv6dyn.netcologne.de!not-for-mail
From: tkoenig@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Intel goes to 32 GPRs
Date: Mon, 31 Jul 2023 17:16:47 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <ua8q9v$1j50n$1@newsreader4.netcologne.de>
References: <u9o14h$183or$1@newsreader4.netcologne.de>
<2023Jul25.175501@mips.complang.tuwien.ac.at>
<c12bf4b2-8844-46c1-bd5a-e4b060baec24n@googlegroups.com>
<u9qo9l$1f22o$1@dont-email.me> <u9qrho$19u0h$1@newsreader4.netcologne.de>
<u9r0jd$1ftnb$1@dont-email.me>
<2023Jul26.221142@mips.complang.tuwien.ac.at>
<u9s14g$1jed5$1@dont-email.me>
Injection-Date: Mon, 31 Jul 2023 17:16:47 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-e8ee-0-e52f-b509-aff3-9d0a.ipv6dyn.netcologne.de:2001:4dd7:e8ee:0:e52f:b509:aff3:9d0a";
logging-data="1676311"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Mon, 31 Jul 2023 17:16 UTC

Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

[...]

> in the case of having non-saved registers
> that you'd like to use across function calls, I would probably consider
> writing wrappers for those function calls: The wrapper would do the
> save, call the actual function, then restore and return.

The discussion on gcc has turned up an interesting option that I had
overlooked:

'-fipa-ra'
Use caller save registers for allocation if those registers are not
used by any called function. In that case it is not necessary to
save and restore them around calls. This is only possible if the
called functions are part of the same compilation unit as the
current function and they are compiled before it.

Enabled at levels '-O2', '-O3', '-Os', however the option is
disabled if generated code will be instrumented for profiling
('-p', or '-pg') or if callee's register usage cannot be known
exactly (this happens on targets that do not expose prologues and
epilogues in RTL).

This should also be enabled with LTO.
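
A made-up example of the situation '-fipa-ra' targets: leaf() is
defined earlier in the same translation unit, so the compiler knows
exactly which registers it clobbers and may keep x and y live in
caller-saved registers across the call instead of spilling them
(whether it actually does depends on the target):

    /* foo.c -- e.g. gcc -O2 -fipa-ra -S foo.c; illustration only. */
    static int leaf(int a)    /* compiled first; touches few registers */
    {
        return a + 1;
    }

    int caller(int x, int y)
    {
        int t = leaf(x);      /* x and y may survive the call unspilled */
        return t + x * y;
    }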

Re: Intel goes to 32 GPRs

<ua8vg8$3bnba$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=33531&group=comp.arch#33531

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Intel goes to 32 GPRs
Date: Mon, 31 Jul 2023 13:44:54 -0500
Organization: A noiseless patient Spider
Lines: 162
Message-ID: <ua8vg8$3bnba$1@dont-email.me>
References: <u9o14h$183or$1@newsreader4.netcologne.de>
<2023Jul25.175501@mips.complang.tuwien.ac.at>
<c12bf4b2-8844-46c1-bd5a-e4b060baec24n@googlegroups.com>
<u9qo9l$1f22o$1@dont-email.me> <u9qrho$19u0h$1@newsreader4.netcologne.de>
<u9r0jd$1ftnb$1@dont-email.me> <2023Jul26.221142@mips.complang.tuwien.ac.at>
<u9s14g$1jed5$1@dont-email.me> <2023Jul30.185326@mips.complang.tuwien.ac.at>
<ua6lu2$1hq28$1@newsreader4.netcologne.de> <ua7mn3$37cuf$1@dont-email.me>
<2a7f42cd-f7f1-46dd-963b-047966cf1e28n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 31 Jul 2023 18:45:28 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="2874dc1142fa3c0c8ad9617baecbf15a";
logging-data="3530090"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18W/4WZvxe0tbhwGKGeKVeL"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.13.0
Cancel-Lock: sha1:aYqW2TOB4I7QZQwqCy8gtltRQFc=
Content-Language: en-US
In-Reply-To: <2a7f42cd-f7f1-46dd-963b-047966cf1e28n@googlegroups.com>
 by: BGB - Mon, 31 Jul 2023 18:44 UTC

On 7/31/2023 12:00 PM, MitchAlsup wrote:
> On Monday, July 31, 2023 at 2:09:27 AM UTC-5, Terje Mathisen wrote:
>> Thomas Koenig wrote:
>>> Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>>>
>>>> Every performance improvement project for an interpreter
>>>> (e.g. <https://lwn.net/Articles/930705/>) shows that the performance
>>>> of the interpreter matters. The fact that people perform performance
>>>> measurements shows this, too.
>>>
>>> Widely-used interpreted languages such as Python or Matlab are
>>> known to be as slow as molasses when they do not call highly-
>>> efficient compiled code. Any improvement (such as the link above)
>>> is welcome there.
>>>
>>> A concept followed by languages like the one used by Julia, which
>>> uses JIT compilation much more aggressively, seems to be a better
>>> approach. I have barely glanced at it yet, but it certainly seems
>>> worth looking into for scientific work.
>>>
>> This is what I'm really arguing for: If any real interpreter is doomed
>> to always be much, much slower than inline binary code, then we need
>> better ways to (at runtime?) convert the former into the latter, i.e.
>> JIT optimization of the parts that need it.
> <
> {{
> A real interpreter is going to average DIV 1000
> Jitting instructions into a trace is going to average DIV 100.
> A JIT trace is going to average DIV 10
> }} per interpreted instruction.
> But here JIT means converting simulated instructions into native
> instructions with a touch of peepholeing.

This seems a little pessimistic.

My own past stats were:
Naive bytecode: ~300..1000x, depending on the design of the bytecode.
Dynamic types are slower than static types;
stack vs. 3R is another factor of 2.

So, for example, designs like the Quake VM's would be moderately fast as
far as plain interpreters go.

Statically typed register IR, with traces converted into structs and
function calls: typically closer to around 30x, IME.

JITs were sometimes within around 3x to 5x slowdown.

Dynamic typing adds a fair bit of a performance penalty though, and
(usually) the only way to reduce this overhead is by using "type
inference" to try to turn a dynamic language into a
semi-statically-typed form.
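
To make that overhead concrete, a sketch (not BGB's actual
representation) of a dynamically typed value as a tag plus payload:
every operation pays a tag check and possible coercion, where the
statically typed version is a single add:

    /* Hypothetical 'variant' as a tagged union; sketch only. */
    typedef struct {
        enum { T_INT, T_FLT } tag;
        union { long i; double f; } v;
    } variant;

    static double as_flt(variant x)
    {
        return x.tag == T_INT ? (double)x.v.i : x.v.f;
    }

    variant var_add(variant a, variant b)  /* dynamic: tag checks per op */
    {
        variant r;
        if (a.tag == T_INT && b.tag == T_INT) {
            r.tag = T_INT; r.v.i = a.v.i + b.v.i;
        } else {
            r.tag = T_FLT; r.v.f = as_flt(a) + as_flt(b);
        }
        return r;
    }

    long static_add(long a, long b) { return a + b; }  /* static: one add */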

Though, at this point, one almost may as well switch over to static
typing and leave dynamic typing as a special feature, say:
variant: behaves as a dynamic type (may be optionally inferred in some
cases);
auto: inferred type (but telling compiler that it is expected to infer
to a static type);
other cases: explicit static types.

This is the direction my BGBScript languages eventually ended up going.

Though, ironically, at this stage there isn't that much of a difference
between a language like this and what one needs for a C compiler. At
least, apart from C having a generally more complicated type-system when
one tries to deal with all of the edge cases.

One could have a simpler type-system if some cases were dropped:
Multidimensional arrays;
Arrays of structs;
Arrays of function pointers.

I would also have preferred it if type promotion were either 3-way (like
in some of my own languages), or a consistent "widen first, ask
questions later". The existing behaviors are a bit picky and it is
difficult to get all the edge cases right.

Well, and also it would be nice if the operator precedence hierarchy
were "less stupid".

Like, if it were up to me, say:
* / %
+ -
& | ^
<< >>
== != ...
&& ||
= += -= ...

Then again, it "almost doesn't matter", since in nearly every case where
the wonky precedence could come into effect, expressions tend to be
wrapped in a pile of parentheses anyway in order to evaluate the
expressions in a way "that actually makes sense".
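
A standard-C instance of the wonkiness, with made-up helper names:
'==' binds tighter than '&', so the unparenthesized mask-and-compare
does not mean what it looks like (under the reordered table above, it
would):

    /* C precedence gotcha: '==' binds tighter than '&'. */
    int wrong(unsigned x) { return x & 3 == 1; }   /* x & (3 == 1) -> x & 0 */
    int right(unsigned x) { return (x & 3) == 1; } /* mask, then compare */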

But, not much of a "selling point" in an "almost but not quite C" language.

>>
>> Changing the hardware to make indirect function calls and returns far
>> faster would be nice, but they will always cost a lot more than zero.
> <
> Only cost the fetch delay of getting the new IP {As opposed to the delay
> of AGENing a new IP = IP + relative-displacement<<2.}
> <
> I see no reason that return takes any longer than it currently does.

In my case, indirect call and return can be made moderately fast.
But, it mostly requires loading the target into R1 and scheduling the
load well before the branch itself.

Say:
MOV.Q (...), R1
op
op
op
op
JMP R1

Where, the rule is that the load needs to pass through the WB stage
before the branch instruction reaches ID1; in this case, the branch
predictor can "see" the current value of R1 and can predict a direct
branch to this address (otherwise, it ends up being a "slow branch").

In this case, it is mostly a matter of finding "something useful to do" in
the meantime. For an indirect call, this would likely be things like
loading arguments into the corresponding registers or similar.

For epilog sequences, the current practice is to load the saved return
address into R1 before reloading the other registers.

In some cases, as-is, indirect calls could theoretically also be sped up
slightly by making strategic use of NOP instructions if there aren't
enough instructions between the load and the jump.

Could also theoretically fake this behavior in the CPU core via some
additional interlock checks or similar, but this is debatable. It would
likely also require detecting the "JMP R1" in the 'IF' stage to allow R1
to pass through WB before we can advance into 'ID1'; this would be "not
ideal" in various ways (better, almost, to "eat the cost" in these cases).

Well, either that, or add additional branch-handling to the ID2 stage,
which is also "not free".

>>
>> I believe conditional branches are the one CPU feature where a naive
>> interpreter (calling a function for each opcode) can be relatively fast:
>> Calculate the new IP with a conditional/predicated move and return, with
>> no branch predictor issues.
>> Terje
>>
>> --
>> - <Terje.Mathisen at tmsw.no>
>> "almost all programming can be viewed as an exercise in caching"

