Re: Intel goes to 32-bit general purpose registers

From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Date: Sat, 9 Sep 2023 22:23 UTC
Message-ID: <udir9j$7qgs$1@dont-email.me>
https://news.novabbs.org/devel/article-flat.php?id=33946&group=comp.arch#33946

On 9/9/2023 1:30 PM, John Dallman wrote:
> In article <2023Sep9.192231@mips.complang.tuwien.ac.at>,
> anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>
>> One interesting aspect is that Tremont (and AFAIK Gracemont), i.e.,
>> the E-Cores, don't have a microcode cache, but instead two
>> three-wide decoders (decoding the next two predicted instruction
>> streams); supposedly this is more power-efficient; it certainly is
>> more area-efficient, which seems to be the main point of these cores.
>> Anyway, there does not really seem to be a decoding bottleneck.
>
> Aha, that's a sensible way round the decoding bottleneck.
>
>> One other interesting aspect is that ARM also uses a microcode
>> cache.
>>
>> One guess I have is that this is due to ARM supporting 2 or 3
>> different instruction sets. If that is true, they could do away
>> with the cache now that they eliminate A32/T32 in their ARMv9 cores.
>
> That seems plausible.
>
>> Another guess I have is that their many-register instructions
>> require more sophisticated "decoding" in their big cores, and
>> they cache that effort in the microcode cache.
>
> From memory, the 64-bit ARM instruction set does not have the
> many-register operations. It has instructions that use register pairs,
> but not the bit-masks of registers to operate on that the A32 instruction
> set inherited from the early days.
>

They are kind of a pain, and unless one has a fairly wide pipeline
internally, they are unlikely to offer much benefit over load/store pair.

>> Both of those are also used by 64-bit code. The 387 stuff is used
>> when 80-bit floats are wanted; in particular, the biggest customer
>> of MPE has complained about FP precision when they switched from
>> IA-32+387 to AMD64+SSE2, and I think they now provide AMD64+387 as
>> an option.
>
> I don't know which MPE this is, but if it gets used, it gets used.
>
>> As for "self-modifying code", like IA-32, AMD64 has no instructions
>> for telling the CPU that the data just written are instructions, so
>> JIT compilers and the like just write the instructions and then
>> execute them, just as with IA-32. If you eliminate support for
>> that, all JIT compilers stop working (or worse, they might work by
>> luck in testing, and then fail in the field).
>
> Yup, that wrecks that part of the idea.
>

You will effectively need to implement a sort of write barrier to trap
any writes into pages which contain previously executed code, and
trigger a sort of "JIT cache flush" in the emulation layer as needed.
Annoying, but workable.

This assumes that the emulator has access to the underlying MMU and that
the MMU supports appropriate page-access controls (the idea is still
kinda wrecked if one can't create emulator-managed write barriers or
similar).
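
A minimal sketch of such a write barrier, assuming a POSIX-style host where
the emulator controls the page protections; jit_flush_page() is a
hypothetical hook into the translation cache, not an existing API:

  #include <signal.h>
  #include <stdint.h>
  #include <sys/mman.h>

  #define GUEST_PAGE_SIZE 4096

  /* Hypothetical: discard any translations derived from this guest page. */
  extern void jit_flush_page(uintptr_t page_base);

  /* After translating code from a page, make it read-only so that any
     later write into it faults and we can flush the stale translations. */
  static void jit_protect_page(void *page_base)
  {
      mprotect(page_base, GUEST_PAGE_SIZE, PROT_READ);
  }

  /* SIGSEGV handler acting as the write barrier: drop stale translations,
     then make the page writable again so the faulting store can proceed. */
  static void write_barrier(int sig, siginfo_t *si, void *ctx)
  {
      (void)sig; (void)ctx;
      uintptr_t page = (uintptr_t)si->si_addr & ~(uintptr_t)(GUEST_PAGE_SIZE - 1);
      jit_flush_page(page);
      mprotect((void *)page, GUEST_PAGE_SIZE, PROT_READ | PROT_WRITE);
  }

  static void install_write_barrier(void)
  {
      struct sigaction sa = {0};
      sa.sa_sigaction = write_barrier;
      sa.sa_flags = SA_SIGINFO;
      sigaction(SIGSEGV, &sa, NULL);
  }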

>
> John

Re: Intel goes to 32 GPRs

From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.arch
Date: Sat, 9 Sep 2023 22:41 UTC
Message-ID: <009c84b9-7566-43e0-a1f2-64aa458d95f1n@googlegroups.com>
https://news.novabbs.org/devel/article-flat.php?id=33947&group=comp.arch#33947

On Saturday, September 9, 2023 at 3:33:38 PM UTC-5, BGB wrote:
> On 9/9/2023 12:38 PM, MitchAlsup wrote:
>
> > Early in the development of My 66000 ABI, we experimented with always
> > having a frame pointer. In the piece of code we looked at, this cost us 20%
> > more maximum stack space (on the stack). This is likely to have been the
> > simulator compiling the compiler--so take that for what it is worth.
> >>
> >> But, yeah, 16 argument does end up burning more stack space.
> > <
> > My point was that My 66000 ABI has the callee allocate this space, and
> > only when needed, otherwise the compiler can use registers as it pleases.
<
> The space doesn't effect register usage; only stack-frame size.
<
A few more words on the stack hurt big time when they overflow the current
page and a new page has to be allocated for the stack.
>
> Stack-frames being a little bit larger than a theoretical minimum
> doesn't seem to have all that big of an effect on performance.
<
While I agree it is not big, it is something you CAN get rid of.
>
> It also doesn't effect things enough to significantly effect the
> hit/miss rate of a 9-bit load/store displacement.
>
>
> Needing to reserve this space is also N/A for leaf functions (but, can
> leave some small leaf functions space to save/restore registers without
> needing to adjust the stack pointer).
> >>
> >> Say, if we assume a function with, say, 5 arguments and 20 locals
> >> (primitive types only):
> >> Saves 16 registers (14 GPRs, LR and GBR);
> >> Space for around 20 locals;
> >> Space for 8 or 16 arguments.
> > <
> > Saves as many preserved registers as desired,
> > <optionally> allocates space for surviving arguments
> > allocates space for the un-register-allocated variables
> > <optionally> allocates space for caller arguments/results in excess of 8.
> > {{This, by the way, is 1 instruction.}}
<
> The number of registers to save/restored is not a mandate, rather a
> "performance tuned parameter" (need to try to counter-balance
> spill/refill with the cost of save/restore in the prolog and epilog).
<
While I agree that it is a performance tuning parameter, saving 1
register or saving 24 registers is still 1 instruction in my ABI, and you
have the option of allocating new stack space, and setting up FP as
desired (or not) still in 1 instruction.
> >>
> >> So, 44 or 52 spots, 352 or 416 bytes.
> >> Relative size delta: 18%
> > <
> > The compiler is in a position to know the lifespan of the arguments
> > and of the local variables; in the majority of cases, it allocates fewer
> > than expected containers, so the delta is greater than 20%.
> > <
> At least in BGBCC, typically *all* variables are assigned backing
> memory, but whether or not this backing memory is actually used, is more
> subject to interpretation.
<
Local variables are initially allocated stack space. Those that do not have their
address taken are <typically> allocated into registers and removed from the
stack frame.
<snip>
> > So, when you save all these registers you emit copious amounts of ST
> > instructions, then when you reload them you emit copious amounts of
> > LD instructions. I, on the other hand, emit 1.....We did have to teach the
> > compiler to allocate registers from R30 towards R16 to be compatible
> > with the ENTER and EXIT instructions.
> ...
>
> This is why "prolog/epilog compression" is a thing in my case...
<
I compressed them into 1 instruction, each, all possible variants, without
adding branches to the workload.
>
> Like, say if one is effectively saving/reloading 33 values:
> R8..R14, R24..R31, R40..R47, R56..R63, GBR, LR
>
> Then it makes sense to consolidate this across multiple functions and
> reuse it.
>
You are still using an instruction to get to the consolidation routine
and another to get back. I have neither; the entire save, setup, allocate,
and SP/FP adjustment is 1 instruction. Yours is not less than 5.
>
> Granted, noting as how they mostly only occur in fixed quantities, it is
> possible that the prologs/epilogs could have been handled by special
> purpose runtime functions in this case:
> ADD -272, SP //save area
> MOV LR, R1
> BSR __prolog_save_31
> ADD -384, SP //rest of the call-frame
<
This sequence is 1 instruction, without flow control and with a single
manipulation of SP.
<snip>
> Say:
> Called function only takes 6 or 8 arguments or similar, ABI falls back
> to 8 argument rules;
> Called function takes 12 arguments, ABI switches to 16 argument rules;
> Vararg functions would likely assume all 16 registers.
<
My 1 single instruction has a start register and a stop register and a 16-bit
constant that allocates/deallocates stack space. Thus you can save between
1 and all 32 registers, optionally updating FP as an FP or using it as a GPR.
1 instruction, compiler can save as many or as few as it likes. 1 instruction
(with no flow control, no excess register MOV code,.....) 1 instruction.

> > <
> > But when you design specific aspects of ISA to do all this for you, you
> > don't have to "follow" but design from first principles.
> It could have been possible to have had a hacky edge case like:
> ADD -128, SP
> (Spill all the arguments)
> (Start rest of prolog as normal).
> Then add an additional step in the epilog.
>
> I had decided against this as, when these cases came up, this would
> likely have had a worse than simply "always" having the caller provide a
> designated spill space for the arguments.
<
Oh, also note: When EXIT is executed, and it is reading restore data from
the stack, and the front end runs into an ENTER instruction, HW knows
that the registers it is popping off the stack will be pushed back and can
short circuit the excess work, saving cycles and power.
<
PLUS, the Return Address is read first; so, while the rest of the restoration
is happening, fetch can access and decode instructions at the return point
before all the restoration has been completed, saving more cycles. And
when we run into a subsequent call, HW can short circuit the restoration
and the subsequent save.
>
> But, as noted, yeah, this works a little better for the 8-arg ABI than
> the 16-arg ABI, since in the 8-arg case the difference mostly gets lost
> in the noise; but making every stack frame, on average, around 64 bytes
> bigger, is a little more obvious.
>
>
> I don't really feel ABIs like the Win X64 ABI "got it wrong" in this area..
> >>>>
> >>>> Say:
> >>>> Argument spill space on stack, provided by caller;
> >>>> Structures are passed/returned via pointers;
> >>>> Return by copying into a pointer provided by the caller;
> >>>> Except when structure is smaller than 16 bytes, which use registers;
> >>> <
> >>> Argument spill space is provided by callee
> >>> Structures up to 8 registers are passed in registers both ways
> >> My case, it is 2 registers for passing/returning in registers.
> >>
> > And I remain suggesting this is wasting cycles here or there.
> Could have been worse:
> Copy passed structs into function argument space;
> Copy returned struct into function argument space.
>
> This would have effectively doubled the amount of memory copying needed
> to pass or return structs by value.
>
>
>
> Passing structs around by reference mostly works in the sense that one
> is typically loading/storing struct members from memory, and
> pass-by-reference allows for "lazy copying" (if you don't assign to a
> struct member, you don't need to copy it into the local stack-frame).
>
> ...
>
> Though, does mean that if one calls a function that returns a struct,
> but one ignores the return value, it is still necessary to provide
> storage for the temporary location for the struct to be returned into...
>
> And, calling such a function without a visible prototype is likely to
> corrupt memory or crash the program or similar...

Re: Intel goes to 32-bit general purpose registers

From: Anton Ertl <anton@mips.complang.tuwien.ac.at>
Newsgroups: comp.arch
Date: Sun, 10 Sep 2023 15:38 UTC
Message-ID: <2023Sep10.173816@mips.complang.tuwien.ac.at>
https://news.novabbs.org/devel/article-flat.php?id=33953&group=comp.arch#33953

Michael S <already5chosen@yahoo.com> writes:
>On Saturday, September 9, 2023 at 8:42:11 PM UTC+3, Anton Ertl wrote:
>> One other interesting aspect is that ARM also uses a microcode cache.
>>
>> One guess I have is that this is due to ARM supporting 2 or 3
>> different instruction sets. If that is true, they could do away with
>> the cache now that they eliminate A32/T32 in their ARMv9 cores.
>>
>
>Out of Arm Inc. aarch64-only cores those have MOP cache:
>Cortex-X2, Cortex-X3, Neoverse-V2.
>And those don't:
>Cortex-A715, Cortex-A720, Cortex-X4.

So the latest big and middle cores both have dropped the MOP cache
(Neoverse will follow in time, IIRC it's a derivative of the others),
which supports the guess above.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Intel goes to 32-bit general purpose registers

From: Anton Ertl <anton@mips.complang.tuwien.ac.at>
Newsgroups: comp.arch
Date: Sun, 10 Sep 2023 15:43 UTC
Message-ID: <2023Sep10.174303@mips.complang.tuwien.ac.at>
https://news.novabbs.org/devel/article-flat.php?id=33954&group=comp.arch#33954

jgd@cix.co.uk (John Dallman) writes:
>In article <2023Sep9.192231@mips.complang.tuwien.ac.at>,
>anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>> Another guess I have is that their many-register instructions
>> require more sophisticated "decoding" in their big cores, and
>> they cache that effort in the microcode cache.
>
>From memory, the 64-bit ARM instruction set does not have the
>many-register operations.

It does not have load-multiple and store-multiple, but it has
instructions that read 4 registers (store pair with some reg+reg
addressing modes), and instructions that write 3 registers (load pair
with an autoincrement addressing mode), and IIRC they have crypto
instructions that read and write even more registers.

Ok, in OoO, the reads turn into scheduler resources, and the writes
also into scheduler resources and some retirement resources, and
apparently OoO designers are able to deal with a large number of
functional units these days, so these scheduler resources do not seem
to be a bottleneck.

And given that they eliminate the MOP cache in their recent
32-bit-less designs, the guess above obviously was not the reason for
the MOP cache.

>> Both of those are also used by 64-bit code. The 387 stuff is used
>> when 80-bit floats are wanted; in particular, the biggest customer
>> of MPE has complained about FP precision when they switched from
>> IA-32+387 to AMD64+SSE2, and I think they now provide AMD64+387 as
>> an option.
>
>I don't know which MPE this is, but if it gets used, it gets used.

<https://www.mpeforth.com/>

When you use long double in gcc on AMD64, you also get 387 code.
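
A minimal example: with gcc on AMD64, long double is the 80-bit x87 format,
so the function below compiles to x87 code (fldt/fmulp/fstpt style), while
the same function written with plain double would use SSE2:

  /* long double forces x87 code on AMD64; double would stay in SSE2. */
  long double scale(long double x)
  {
      return x * 1.0000001L;
  }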

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Intel goes to 32 GPRs

From: EricP <ThatWouldBeTelling@thevillage.com>
Newsgroups: comp.arch
Date: Sun, 10 Sep 2023 16:01 UTC
Message-ID: <GLlLM.749780$xMqa.41407@fx12.iad>
https://news.novabbs.org/devel/article-flat.php?id=33955&group=comp.arch#33955

BGB wrote:
> On 9/9/2023 12:38 PM, MitchAlsup wrote:
>> On Saturday, September 9, 2023 at 12:17:53 AM UTC-5, BGB wrote:
>>> On 9/8/2023 11:36 AM, MitchAlsup wrote:
>>>> On Thursday, September 7, 2023 at 11:02:31 PM UTC-5, BGB wrote:
>>>>> On 9/7/2023 12:53 PM, MitchAlsup wrote:
>>>>>
>>>>>>> The "pain" of spill space being more obvious with 16 arguments.
>>>>>>>
>>>>>> Spill space on My 66000 stacks is allocated by the spiller not
>>>>>> the caller. At entry there is no excess space on the stack, however
>>>>>> should callee be a varargs, register arguments are pushed onto
>>>>>> the standard stack using the same ENTER as anyone else, just
>>>>>> wrapping around R0 and into the argument registers. This creates
>>>>>> a dense vector from which vararg arguments can be easily extracted.
>>>> <
>>>>> Called function makes space for saving preserved registers.
>>>>> Caller leaves space for spilling function arguments.
>>>>>
>>>> This wastes space when not needed, and is so easy for callee to provide
>>>> when needed.
>> <
>>> The difference in stack-usage is "mostly negligible" in the 8 argument
>>> case, since the function would otherwise still need to provide backing
>>> memory for the function arguments.
>> <
>> Early in the development of My 66000 ABI, we experimented with always
>> having a frame pointer. In the piece of code we looked at, this cost
>> us 20%
>> more maximum stack space (on the stack). This is likely to have been the
>> simulator compiling the compiler--so take that for what it is worth.
>>>
>>> But, yeah, 16 argument does end up burning more stack space.
>> <
>> My point was that My 66000 ABI has the callee allocate this space, and
>> only when needed, otherwise the compiler can use registers as it pleases.
>
> The space doesn't effect register usage; only stack-frame size.
>
> Stack-frames being a little bit larger than a theoretical minimum
> doesn't seem to have all that big of an effect on performance.
>
> It also doesn't effect things enough to significantly effect the
> hit/miss rate of a 9-bit load/store displacement.
>
>
> Needing to reserve this space is also N/A for leaf functions (but, can
> leave some small leaf functions space to save/restore registers without
> needing to adjust the stack pointer).
>
>
>>>
>>> Say, if we assume a function with, say, 5 arguments and 20 locals
>>> (primitive types only):
>>> Saves 16 registers (14 GPRs, LR and GBR);
>>> Space for around 20 locals;
>>> Space for 8 or 16 arguments.
>> <
>> Saves as many preserved registers as desired,
>> <optionally> allocates space for surviving arguments
>> allocates space for the un-register-allocated variables
>> <optionally> allocates space for caller arguments/results in excess of 8.
>> {{This, by the way, is 1 instruction.}}
>
> The number of registers to save/restored is not a mandate, rather a
> "performance tuned parameter" (need to try to counter-balance
> spill/refill with the cost of save/restore in the prolog and epilog).
>
>
>>>
>>> So, 44 or 52 spots, 352 or 416 bytes.
>>> Relative size delta: 18%
>> <
>> The compiler is in a position to know the lifespan of the arguments
>> and of the local variables; in the majority of cases, it allocates fewer
>> than expected containers, so the delta is greater than 20%.
>> <
>
> At least in BGBCC, typically *all* variables are assigned backing
> memory, but whether or not this backing memory is actually used, is more
> subject to interpretation.
>
> This excludes pure-static and tiny-leaf functions, which may sidestep
> this part (tiny-leaf functions wont create a stack-frame in the first
> place, static-assigning every variable to a scratch register).
>
> Granted, it could be possible for the compiler to sidestep the need to
> assign backing memory for variables that don't need it.

One reason that, in the x86/x64 ABIs, the caller reserves stack space to
spill the register-passed args is that the CALL instruction pushes the
return RIP onto the stack, and we don't want it in the middle of a
varargs vector.

If an ISA uses BAL (Branch And Link) style calls, then the ABI can defer
allocating the arg-register spill space to the callee, without forcing a
new call frame to be created, and still have a contiguous arg vector if
the callee uses varargs.
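
Roughly, at entry to a Windows-x64-style callee the stack looks like this,
with the 32-byte spill ("home") area for the four register args reserved by
the caller just above the return address pushed by CALL:

  higher addresses
    arg 5, arg 6, ...           passed on the stack by the caller
    home slot for R9  (arg 4)
    home slot for R8  (arg 3)
    home slot for RDX (arg 2)
    home slot for RCX (arg 1)   32 bytes reserved by the caller
    return RIP                  pushed by CALL; RSP points here at entry
  lower addresses               callee's own frame grows below

Spilling RCX/RDX/R8/R9 into the home slots therefore yields one contiguous
argument vector; if the callee pushed those registers itself, they would
land below the return RIP and the vector would be split by it.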

Re: Intel goes to 32 GPRs

From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.arch
Date: Sun, 10 Sep 2023 16:46 UTC
Message-ID: <15f25a20-3c86-401b-ae61-eb5b4995a4b9n@googlegroups.com>
https://news.novabbs.org/devel/article-flat.php?id=33956&group=comp.arch#33956

On Sunday, September 10, 2023 at 11:01:46 AM UTC-5, EricP wrote:
> BGB wrote:
> > On 9/9/2023 12:38 PM, MitchAlsup wrote:
> >> On Saturday, September 9, 2023 at 12:17:53 AM UTC-5, BGB wrote:
> >>> On 9/8/2023 11:36 AM, MitchAlsup wrote:
> >>>> On Thursday, September 7, 2023 at 11:02:31 PM UTC-5, BGB wrote:
> >>>>> On 9/7/2023 12:53 PM, MitchAlsup wrote:
> >>>>>
> >>>>>>> The "pain" of spill space being more obvious with 16 arguments.
> >>>>>>>
> >>>>>> Spill space on My 66000 stacks is allocated by the spiller not
> >>>>>> the caller. At entry there is no excess space on the stack, however
> >>>>>> should callee be a varargs, register arguments are pushed onto
> >>>>>> the standard stack using the same ENTER as anyone else, just
> >>>>>> wrapping around R0 and into the argument registers. This creates
> >>>>>> a dense vector from which vararg arguments can be easily extracted..
> >>>> <
> >>>>> Called function makes space for saving preserved registers.
> >>>>> Caller leaves space for spilling function arguments.
> >>>>>
> >>>> This wastes space when not needed, and is so easy for callee to provide
> >>>> when needed.
> >> <
> >>> The difference in stack-usage is "mostly negligible" in the 8 argument
> >>> case, since the function would otherwise still need to provide backing
> >>> memory for the function arguments.
> >> <
> >> Early in the development of My 66000 ABI, we experimented with always
> >> having a frame pointer. In the piece of code we looked at, this cost
> >> us 20%
> >> more maximum stack space (on the stack). This is likely to have been the
> >> simulator compiling the compiler--so take that for what it is worth.
> >>>
> >>> But, yeah, 16 argument does end up burning more stack space.
> >> <
> >> My point was that My 66000 ABI has the callee allocate this space, and
> >> only when needed, otherwise the compiler can use registers as it pleases.
> >
> > The space doesn't effect register usage; only stack-frame size.
> >
> > Stack-frames being a little bit larger than a theoretical minimum
> > doesn't seem to have all that big of an effect on performance.
> >
> > It also doesn't effect things enough to significantly effect the
> > hit/miss rate of a 9-bit load/store displacement.
> >
> >
> > Needing to reserve this space is also N/A for leaf functions (but, can
> > leave some small leaf functions space to save/restore registers without
> > needing to adjust the stack pointer).
> >
> >
> >>>
> >>> Say, if we assume a function with, say, 5 arguments and 20 locals
> >>> (primitive types only):
> >>> Saves 16 registers (14 GPRs, LR and GBR);
> >>> Space for around 20 locals;
> >>> Space for 8 or 16 arguments.
> >> <
> >> Saves as many preserved registers as desired,
> >> <optionally> allocates space for surviving arguments
> >> allocates space for the un-register-allocated variables
> >> <optionally> allocates space for caller arguments/results in excess of 8.
> >> {{This, by the way, is 1 instruction.}}
> >
> > The number of registers to save/restored is not a mandate, rather a
> > "performance tuned parameter" (need to try to counter-balance
> > spill/refill with the cost of save/restore in the prolog and epilog).
> >
> >
> >>>
> >>> So, 44 or 52 spots, 352 or 416 bytes.
> >>> Relative size delta: 18%
> >> <
> >> The compiler is in a position to know the lifespan of the arguments
> >> and of the local variables; in the majority of cases, it allocates fewer
> >> than expected containers, so the delta is greater than 20%.
> >> <
> >
> > At least in BGBCC, typically *all* variables are assigned backing
> > memory, but whether or not this backing memory is actually used, is more
> > subject to interpretation.
> >
> > This excludes pure-static and tiny-leaf functions, which may sidestep
> > this part (tiny-leaf functions wont create a stack-frame in the first
> > place, static-assigning every variable to a scratch register).
> >
> > Granted, it could be possible for the compiler to sidestep the need to
> > assign backing memory for variables that don't need it.
<
> One reason in x86/x64 ABI's the caller reserves stack space to spill
> the register passed args is because the CALL instruction pushes the
> return RIP address onto the stack, and we don't want it in the middle
> of a varargs vector.
<
This explains x86-64 but not MIPS or RISC-V
>
> If an ISA uses BAL Branch And Link style calls then the ABI can defer
> allocating the arg register spill space to the callee, without forcing
> a new call frame be created, and still have a contiguous arg vector
> if callee uses varargs.

Re: Intel goes to 32 GPRs

From: EricP <ThatWouldBeTelling@thevillage.com>
Newsgroups: comp.arch
Date: Sun, 10 Sep 2023 18:52 UTC
Message-ID: <4goLM.749798$xMqa.716013@fx12.iad>
https://news.novabbs.org/devel/article-flat.php?id=33957&group=comp.arch#33957

MitchAlsup wrote:
> On Sunday, September 10, 2023 at 11:01:46 AM UTC-5, EricP wrote:
>> BGB wrote:
>>> At least in BGBCC, typically *all* variables are assigned backing
>>> memory, but whether or not this backing memory is actually used, is more
>>> subject to interpretation.
>>>
>>> This excludes pure-static and tiny-leaf functions, which may sidestep
>>> this part (tiny-leaf functions wont create a stack-frame in the first
>>> place, static-assigning every variable to a scratch register).
>>>
>>> Granted, it could be possible for the compiler to sidestep the need to
>>> assign backing memory for variables that don't need it.
> <
>> One reason in x86/x64 ABI's the caller reserves stack space to spill
>> the register passed args is because the CALL instruction pushes the
>> return RIP address onto the stack, and we don't want it in the middle
>> of a varargs vector.
> <
> This explains x86-64 but not MIPS or RISC-V

They want the same ABI for prototype-less and prototyped functions,
they want the first N args always passed in registers,
and the callee doesn't know how many varargs were actually passed,
so it doesn't know how many registers contain actual arg values.

(Also, they may want the varargs list to be a contiguous allocation,
which the va_start/va_arg mechanism doesn't necessarily require, but
they are probably paranoid about code that assumes contiguity.)

The only way to satisfy this is for the caller to always allocate stack
space for at least N register args, and for the callee, when it declares
a va_list, to blindly dump the N arg registers to the N stack slots.
Also, if the ISA has a separate float register set, float args must be
passed in the integer registers as well, so that they spill correctly
when va_list is used.

If the caller passed > N varargs then they will be located
on the stack above the N slots that match the N arg registers.
If the caller passes < N varargs then it wastes a few extra stores
to spill unused arg registers, but nothing gets stomped on.
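
At the C level this is just the usual va_list pattern; the subtlety above is
entirely in the ABI, which has to make the register-passed and stack-passed
arguments recoverable as one ordered sequence. A minimal example:

  #include <stdarg.h>

  /* Sums 'count' ints passed as varargs.  The callee cannot tell how many
     of them arrived in registers versus on the stack; va_arg only works
     because the ABI lets them be walked in order. */
  int sum_ints(int count, ...)
  {
      va_list ap;
      int total = 0;

      va_start(ap, count);
      for (int i = 0; i < count; i++)
          total += va_arg(ap, int);
      va_end(ap);
      return total;
  }

  /* e.g. sum_ints(10, 1,2,3,4,5,6,7,8,9,10): with 8 argument registers,
     'count' plus the first 7 values arrive in registers and the last 3 on
     the stack, yet the loop above must still see all ten in order. */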

Re: Intel goes to 32 GPRs

From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Date: Sun, 10 Sep 2023 19:03 UTC
Message-ID: <udl3u4$m8af$1@dont-email.me>
https://news.novabbs.org/devel/article-flat.php?id=33958&group=comp.arch#33958

On 9/9/2023 5:41 PM, MitchAlsup wrote:
> On Saturday, September 9, 2023 at 3:33:38 PM UTC-5, BGB wrote:
>> On 9/9/2023 12:38 PM, MitchAlsup wrote:
>>
>>> Early in the development of My 66000 ABI, we experimented with always
>>> having a frame pointer. In the piece of code we looked at, this cost us 20%
>>> more maximum stack space (on the stack). This is likely to have been the
>>> simulator compiling the compiler--so take that for what it is worth.
>>>>
>>>> But, yeah, 16 argument does end up burning more stack space.
>>> <
>>> My point was that My 66000 ABI has the callee allocate this space, and
>>> only when needed, otherwise the compiler can use registers as it pleases.
> <
>> The space doesn't effect register usage; only stack-frame size.
> <
> A few more words on the stack hurts big time when it overflows the current
> page and a new page has to be allocated to the stack.

These are likely one-off costs.

The bigger concern for me is cases where it expands the stack enough to
require a bigger stack (say, 128K->256K), or would allow for a smaller
stack (128K->64K).

The deltas in this case are generally too small for either to apply.

The much bigger concern for this sort of thing is functions with
pass-by-value structs and local arrays. These are going to eat the stack
far faster than register spill space ever could (and one could probably
get along OK with a 16K or 32K stack if all the arrays and structs were
banished to the heap).

I did exclude the caller-provided spill-space for configurations built
to assume a 32-bit pointer size, but this is more because I am often
trying to run the code in these configurations on a 4K stack (in these
cases, the stack is assumed to have a logical 32-bit element size rather
than 64-bits; but would still need to provide 64B for the spill-space on
account of the use of 64-bit registers).

If one assumes 32-bit stack elements, and 32-bit pointers, this does
more visibly reduce the size of the stack frame. Granted, this case is
more like a 32-bit machine that just so happens to have 64-bit registers
and ALU ops.

Where, I had profiles designated by letters:
A: 64-bit pointers (48-bit virtual address)
B..F: 32-bit pointers (most assume physical-only addressing)
G: 128-bit pointers (96-bit virtual address)
H/I: 64-bit pointers (48-bit virtual address)

Profile A assumes 32 GPRs, Baseline encoding, ...
This uses an 8 register argument ABI
64 bytes of caller-provided spill space.

Profiles B..F assumes 32 GPRs, Baseline encoding, ...
This uses an 8 register argument ABI
No spill space.
These profiles assume RISC-like operation.
Most differences relate to presence/absence of an FPU and similar.
The 'D' profile is basically like RV64I, but with 32-bit pointers.

Profiles G and H assume 64 GPRs:
XGPR and XG2 encoding
Uses a 16 register argument ABI
128 bytes of caller-provided spill space.

I: Basically similar to A, but omits some features, and assumes a 2-wide
pipeline rather than 3-wide.

At present, the low physical memory map is roughly:
00000000..00007FFF: Boot ROM
00008000..0000BFFF: Boot ROM (Optional)
0000C000..0000DFFF: Boot SRAM
0000E000..0000FFFF: Boot SRAM (Optional)
00010000..00FFFFFF: Fixed-pattern pages and similar.
Pages full of Zeroes, NOPs, BREAK instructions, ...
These are mostly for sake of the virtual memory implementation (1).
01000000..1FFFFFFF: RAM
20000000..3FFFFFFF: RAM (Repeat)
Includes an additional 16MB shadowed by the other mapping.
40000000..EFFFFFFF: Reserved
F0000000..FFFFFFFF: MMIO Space (2)

1: Say, for untouched memory, don't necessarily want to waste actual RAM
on pages full of zeroes and similar...

2: 0000_Fzzzzzzz is only accessible (via memory ops) when MMU is
disabled or in 32-bit address mode.
Otherwise, it is mapped to FFFF_Fzzzzzzz in 48-bit mode.
Only exists in "Quadrant 0" in 96-bit modes:
0000_00000000_FFFF_Fzzzzzzz

Well, and also now there is a 64-bit address sub-mode of 96-bit mode,
where the 64-bit pointer is interpreted as:
yyyy_yyyy_zzzz_zzzz_zzzzzzzz
z: bits from pointer
y: bits from PCH/GBH
There are no tag bits in this mode, so it can mimic something more
directly comparable to the x86-64 virtual address space (but, with the
quirk at present that addresses will wrap across the 48-bit mark).
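
As a rough illustration of how that sub-mode composes an effective address
(the struct and function here are made up; only the bit split is from the
description above):

  #include <stdint.h>

  /* Illustrative only: a 96-bit virtual address held as two words. */
  typedef struct {
      uint64_t lo;   /* the "z" bits: all 64 bits of the pointer           */
      uint32_t hi;   /* the "y" bits: supplied by PCH (code) or GBH (data) */
  } vaddr96;

  /* In the 64-bit sub-mode there are no tag bits: the whole pointer forms
     the low 64 bits and the upper 32 bits come from PCH/GBH. */
  static vaddr96 compose_addr64(uint64_t ptr, uint32_t pch_or_gbh_hi)
  {
      vaddr96 va = { ptr, pch_or_gbh_hi };
      return va;
  }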

At present, there are not any plans to have user programs operate
directly with 96-bit addresses, as using 128-bit pointers is kinda a
"lead balloon".

>>
>> Stack-frames being a little bit larger than a theoretical minimum
>> doesn't seem to have all that big of an effect on performance.
> <
> While I agree it is not big, it is something you CAN get rid of.

I could get rid of it, in theory...

Except that its visible effects on performance appear to be mostly
negligible.

But, has other merits:
Simplifies faking x86 style argument-list behavior;
Simplifies the implementation of COM style wrappers and RPC (1);
Allows simpler/cheaper handling of varargs / va_list;
...

1: The wrapper thunks need not care about the contents of the argument
list, and we also don't need the thunks to deal with the return path in
this case. Where, say, the wrapper spills the argument list to memory,
and then redirects the call through a syscall, which then lands in
another task set up to handle the object method call (reloading the
arguments from memory and calling the associated method). Though, code
(on both ends) does need to deal with the mechanics of shared memory
objects (so, say, the "bare" COM style interface would be wrapped in a
C-like interface, which would also deal with any data marshaling).
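
A very rough sketch of the wrapper-thunk idea from (1), assuming the
caller-provided spill area lets the thunk treat the argument list as an
opaque block; every name here (rpc_call_block, __syscall_object_call,
com_thunk) is invented for illustration:

  #include <stdint.h>
  #include <string.h>

  /* Hypothetical message handed to the task that services the object call. */
  typedef struct {
      uint32_t obj_id;      /* which object                   */
      uint32_t method_id;   /* which method on that object    */
      uint64_t args[16];    /* the spilled argument registers */
  } rpc_call_block;

  /* Hypothetical syscall that forwards the block to the serving task. */
  extern uint64_t __syscall_object_call(rpc_call_block *blk);

  /* The thunk never interprets the arguments; it just forwards the spill
     area (already filled in by the caller-provided slots) and lets the
     other side reload the values and invoke the real method. */
  uint64_t com_thunk(uint32_t obj_id, uint32_t method_id,
                     const uint64_t *spilled_args, int nargs)
  {
      rpc_call_block blk = { obj_id, method_id, {0} };
      memcpy(blk.args, spilled_args, (size_t)nargs * sizeof(uint64_t));
      return __syscall_object_call(&blk);
  }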

I wouldn't have done it this way if it was simply wasting stack-space
for no reason.

If it really mattered, could also go back to compiling programs to use
32-bit pointers.

Granted, there are other tradeoffs, for example:
'this' uses a dedicated register (rather than being passed as the first
argument), which makes lambdas and similar easier, but does mean that
class methods and COM-object methods are not equivalent (where COM
objects always pass 'this' as an argument).

However, as I saw it, making lambdas able to be equivalent to C function
pointers, was a bigger gain than being able to make class-objects and
COM objects equivalent. But, I guess nothing prevents a compiler from
compiling classes as-if they were COM objects.

But, for the most part, this doesn't really affect C either way.

Well, and in cases where COM style objects were exposed as C++ classes,
this was typically by providing a wrapper class which wraps calls via
the COM interface, rather than by assuming the ability to cast-convert
between COM objects and C++ classes or similar.

>>
>> It also doesn't effect things enough to significantly effect the
>> hit/miss rate of a 9-bit load/store displacement.
>>
>>
>> Needing to reserve this space is also N/A for leaf functions (but, can
>> leave some small leaf functions space to save/restore registers without
>> needing to adjust the stack pointer).
>>>>
>>>> Say, if we assume a function with, say, 5 arguments and 20 locals
>>>> (primitive types only):
>>>> Saves 16 registers (14 GPRs, LR and GBR);
>>>> Space for around 20 locals;
>>>> Space for 8 or 16 arguments.
>>> <
>>> Saves as many preserved registers as desired,
>>> <optionally> allocates space for surviving arguments
>>> allocates space for the un-register-allocated variables
>>> <optionally> allocates space for caller arguments/results in excess of 8.
>>> {{This, by the way, is 1 instruction.}}
> <
>> The number of registers to save/restored is not a mandate, rather a
>> "performance tuned parameter" (need to try to counter-balance
>> spill/refill with the cost of save/restore in the prolog and epilog).
> <
> While I agree that it is a performance tuning parameter, saving 1
> register or saving 24 registers is still 1 instruction in my ABI, and you
> have the option of allocating new stack space, and setting up FP as
> desired (or not) still in 1 instruction.

Fair enough.

I don't do "1 instruction" because there would need to be a pretty hairy
piece of machinery behind that instruction.

Well, at least assuming it doesn't effectively just get implemented as a
special case of a "Branch Subroutine" instruction or similar...

Or, say, a special instruction that performs the equivalent of, say:
MOV LR, R1 | ADD -xxx, SP | BSR label
....

>>>>
>>>> So, 44 or 52 spots, 352 or 416 bytes.
>>>> Relative size delta: 18%
>>> <
>>> The compiler is in a position to know the lifespan of the arguments
>>> and of the local variables; in the majority of cases, it allocates fewer
>>> than expected containers, so the delta is greater than 20%.
>>> <
>> At least in BGBCC, typically *all* variables are assigned backing
>> memory, but whether or not this backing memory is actually used, is more
>> subject to interpretation.
> <
> Local variables are initially allocated stack space. Those that do not have their
> address taken are <typically> allocated into registers and removed from the
> stack frame.


Re: Intel goes to 32 GPRs

From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.arch
Date: Sun, 10 Sep 2023 19:31 UTC
Message-ID: <4fb5c730-70dc-41fa-ac4c-5ee7cd1bc140n@googlegroups.com>
https://news.novabbs.org/devel/article-flat.php?id=33959&group=comp.arch#33959

On Sunday, September 10, 2023 at 1:52:52 PM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Sunday, September 10, 2023 at 11:01:46 AM UTC-5, EricP wrote:
> >> BGB wrote:
> >>> At least in BGBCC, typically *all* variables are assigned backing
> >>> memory, but whether or not this backing memory is actually used, is more
> >>> subject to interpretation.
> >>>
> >>> This excludes pure-static and tiny-leaf functions, which may sidestep
> >>> this part (tiny-leaf functions wont create a stack-frame in the first
> >>> place, static-assigning every variable to a scratch register).
> >>>
> >>> Granted, it could be possible for the compiler to sidestep the need to
> >>> assign backing memory for variables that don't need it.
> > <
> >> One reason in x86/x64 ABI's the caller reserves stack space to spill
> >> the register passed args is because the CALL instruction pushes the
> >> return RIP address onto the stack, and we don't want it in the middle
> >> of a varargs vector.
> > <
> > This explains x86-64 but not MIPS or RISC-V
<
> They want the same ABI for prototype-less and prototyped functions,
> and they want the first N args always passed in registers,
> and the callee doesn't know how many varargs were actually passed
> so doesn't know how many registers contain actual arg values.
<
These are good properties.
>
> (Also they may want the varargs list to be a continuous allocation,
> which the va_start/va_arg mechanism doesn't necessarily require but
> they are probably paranoid about code that assumes contiguous.)
<
Another good property.
>
> The only way to satisfy this is that the caller always allocates stack
> space for at least N register args, and when callee declares va_list
> that callee blindly dumps the N arg registers to the N stack slots.
> Also if ISA has a separate float register set, float args must also be
> passed in the integer registers so they spill correctly if va_list occurs..
<
And yet, I have pulled this off with the My 66000 ABI, even when some of the
arguments are FP, others are Int, and still others are pointers, and it is
the callee that performs the allocation. So, your assertion that this is
the only way is without merit--keyword: only.
>
> If the caller passed > N varargs then they will be located
> on the stack above the N slots that match the N arg registers.
> If caller passes < N varargs then it waste a few extra stores
> to spill unused arg registers but nothing gets stomped on.
<
My 66000 ABI; Varargs merely causes all 8 argument registers to be
pushed on the stack above return address and preserved registers
and local variables, and outbound calling argument/return area.
<
Caller places arguments[8..k] onto the top of the stack, so that on
arrival at the callee SP -> argument[8].
The callee then places the register arguments[0..7] on the stack immediately
below argument[8], proceeds to put the return address on the stack at effective
position argument[-1], and then puts the preserved registers on the stack at
effective positions argument[-2..m]; so the stack looks like:
>
argument[k]
.... // these arguments are passed in memory
argument[8] // on arrival at callee SP points here
argument[7]
.... // these arguments are passed in registers
argument[0]
return address
R29
.... // these registers are put on the safe stack when enabled
R16
local variables
SP -> maximal call/return stack extent
<
This works a lot better when there is no FP register file. And all of this from
1 instruction. AND this meets all of your desired properties.
<
So::
caller does not have to know if callee is varargs or not
caller does not have to have callee prototype in scope
caller does not have to allocate excess stack space
caller does not discriminate between INT and FP registers
callee does not have to have caller prototype in scope
callee does not have to use space allocated by caller
callee does not discriminate between INT and FP registers
<
I submit, they could have done similarly and got rid of the excess stack
allocation.
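
For illustration, in generic C (nothing ISA-specific): the callee of a
varargs function cannot tell how many arguments actually arrived, which is
exactly why the register-passed arguments have to end up contiguous with the
memory-passed ones before va_arg can walk them:

#include <stdarg.h>
#include <stdio.h>

/* The callee spills its argument registers so that va_arg can walk one
   contiguous vector: first the spilled register args, then the arguments
   the caller already placed in memory. */
int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;

    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);
    va_end(ap);
    return total;
}

int main(void)
{
    /* 10 varargs: some arrive in registers, the rest in memory. */
    printf("%d\n", sum_ints(10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10));
    return 0;
}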

Re: Intel goes to 32 GPRs

<udl5l6$mi4j$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=33960&group=comp.arch#33960

From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Intel goes to 32 GPRs
Date: Sun, 10 Sep 2023 14:32:51 -0500
 by: BGB - Sun, 10 Sep 2023 19:32 UTC

On 9/10/2023 1:52 PM, EricP wrote:
> MitchAlsup wrote:
>> On Sunday, September 10, 2023 at 11:01:46 AM UTC-5, EricP wrote:
>>> BGB wrote:
>>>> At least in BGBCC, typically *all* variables are assigned backing
>>>> memory, but whether or not this backing memory is actually used, is
>>>> more subject to interpretation.
>>>> This excludes pure-static and tiny-leaf functions, which may
>>>> sidestep this part (tiny-leaf functions wont create a stack-frame in
>>>> the first place, static-assigning every variable to a scratch
>>>> register).
>>>> Granted, it could be possible for the compiler to sidestep the need
>>>> to assign backing memory for variables that don't need it.
>> <
>>> One reason in x86/x64 ABI's the caller reserves stack space to spill
>>> the register passed args is because the CALL instruction pushes the
>>> return RIP address onto the stack, and we don't want it in the middle
>>> of a varargs vector.
>> <
>> This explains x86-64 but not MIPS or RISC-V
>
> They want the same ABI for prototype-less and prototyped functions,
> and they want the first N args always passed in registers,
> and the callee doesn't know how many varargs were actually passed
> so doesn't know how many registers contain actual arg values.
>

Yes.

Though, in my ABI, it doesn't quite work out this way for functions that
return structs by value, but at the same time, one can't use a function
that returns a struct by value without a prototype, so... we can mostly
just sorta gloss over this case.

> (Also they may want the varargs list to be a continuous allocation,
> which the va_start/va_arg mechanism doesn't necessarily require but
> they are probably paranoid about code that assumes contiguous.)
>
> The only way to satisfy this is that the caller always allocates stack
> space for at least N register args, and when callee declares va_list
> that callee blindly dumps the N arg registers to the N stack slots.
> Also if ISA has a separate float register set, float args must also be
> passed in the integer registers so they spill correctly if va_list occurs.
>

Pretty much.

Luckily, I only have a single register set, so having an "OK, we need to
dump all the arguments to the stack now" special case is straightforward
(and does not require effectively adding additional stack-offset
adjustments to the mix).

> If the caller passed > N varargs then they will be located
> on the stack above the N slots that match the N arg registers.
> If caller passes < N varargs then it waste a few extra stores
> to spill unused arg registers but nothing gets stomped on.
>

Yes.

The scheme works nicely in general, so this is part of why I did it this
way.

The one annoyance is mostly in the relative tradeoff between 8 and 16
arguments, because only a small percentage of functions (or function
calls) exceed the 8 argument limit; this does mean that "most of the time",
the 16 argument ABI spends more space than needed, and spills more
registers than needed in the case of RPC calls or vararg handling or
similar.

However... The 16 argument ABI mostly "pays for itself" in cases where
one is using functions that accept multiple 128-bit SIMD vectors, or
other 128-bit values.

Say, if one does a function like:
__vec4f FooHLerp(
__vec4f a, __vec4f b, __vec4f c, __vec4f d,
float xfrac, float yfrac);

Despite being only 6 arguments, it will use 10 register slots.
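
Roughly, with slot numbering purely illustrative (assuming each 128-bit
value occupies an aligned pair of 64-bit argument slots):

__vec4f FooHLerp(         // argument-register slots used:
    __vec4f a,            //  slots 0-1  (128-bit pair)
    __vec4f b,            //  slots 2-3
    __vec4f c,            //  slots 4-5
    __vec4f d,            //  slots 6-7
    float xfrac,          //  slot  8
    float yfrac);         //  slot  9   => 10 of the 16 slots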

So, another case that makes things like my OpenGL implementation faster,
at the expense of making "printf()" and friends marginally slower and
increasing the average-case stack-frame size by around 10% or so.

For code built to use 128-bit pointers, this ABI variant is a more solid
win, but the use of 128-bit pointers in general... is not.

Re: Intel goes to 32 GPRs

<2ebedb7a-8e51-4bb5-89e9-6052338c629bn@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=33961&group=comp.arch#33961

Newsgroups: comp.arch
Date: Sun, 10 Sep 2023 12:46:59 -0700 (PDT)
Subject: Re: Intel goes to 32 GPRs
From: MitchAlsup@aol.com (MitchAlsup)
 by: MitchAlsup - Sun, 10 Sep 2023 19:46 UTC

On Sunday, September 10, 2023 at 2:03:37 PM UTC-5, BGB wrote:
> On 9/9/2023 5:41 PM, MitchAlsup wrote:
>
> > While I agree that it is a performance tuning parameter, saving 1
> > register or saving 24 registers is still 1 instruction in my ABI, and you
> > have the option of allocating new stack space, and setting up FP as
> > desired (or not) still in 1 instruction.
> Fair enough.
>
> I don't do "1 instruction" because there would need to be a pretty hairy
> piece of machinery behind that instruction.
<
A memory unit sequencer that runs::
<
for( i = Rstart; i != Rstop; i = (i+1) & 31 ){ *--SP = R[i]; }
<
Yep, that is too big and hairy to even consider.
>
> Well, at least assuming it doesn't effectively just get implemented as a
> special case of a "Branch Subroutine" instruction or similar...
<
While the save or restore is off doing its memory thing, all other non-memory
function units are capable of performing instructions.
>

> Branches were the most workable option in my case... (No additional
> complicated mechanism needed to deal with using a branch).
>
> Saves a reasonable amount in ".text" mostly from avoiding a bunch of
> bulky messes of load/store sequences.
<
Do you share these sequences with DLLs ?? How does the DLL know you are
not malicious SW ?? How do you know DLL is not malicious SW ??
>
>
> But, does it matter all that much?...
>
Code footprint, if nothing else; and security, if you want to go there.
>
> Pretty much the only mainstream ISA I am aware of that messed with
> something similar (namely, 32-bit ARM), later ended up abandoning it
> when going to 64 bits.
>
> Most others ended up not bothering at all (x86-64, ARMv8), or using a
> similar mechanism to what I am using (RISC-V).
>
x86 grew from the foundation of not having enough registers to use them
in passing arguments--so it gets an excuse.
>
> I am personally not inclined to deal with needing to add a mechanism to
> turn one instruction into a whole instruction sequence in the pipeline,
<
It is a single FU not the whole pipeline.
<
> to maybe save a fraction of a percent of the total clock-cycle budget.
<
It saves ~45% of the cycle count and saves ~52% of the power of prologue
and epilogue sequences.
<
> One needs at least 2 stack adjustments to deal with the separate sizes
> of the register-save area and all the space for local variables.
<
There is a computable number of registers pushed on the stack, then there
is a 16-bit immediate used to further adjust SP.
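<
Roughly, in pseudo-C (register numbering and operand layout illustrative
only; return-address and FP handling omitted):

#include <stdint.h>

uint64_t R[32];
uint64_t *SP;

/* One ENTER-style instruction: push the register range Rstart..Rstop
   (wrapping mod 32), then drop SP by a 16-bit immediate for locals
   and the outgoing argument area. */
void ENTER(unsigned Rstart, unsigned Rstop, unsigned imm16)
{
    for (unsigned i = Rstart; i != Rstop; i = (i + 1) & 31)
        *--SP = R[i];                       /* sequenced by the memory unit */
    SP = (uint64_t *)((char *)SP - imm16);  /* allocate local stack space   */
}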
>
> The cost of the extra instructions is likely dwarfed by the big blobs of
> load and store instructions at the destination.
>
It is clear--you just don't get it.
>
> And, in the "small N" cases, it can be done inline (often with only a
> single stack adjustment being needed). Seems like, good enough...
<
They are ALL INLINE and they only take one 32-bit instruction in mine.
<
> >> Say:
> >> Called function only takes 6 or 8 arguments or similar, ABI falls back
> >> to 8 argument rules;
> >> Called function takes 12 arguments, ABI switches to 16 argument rules;
> >> Vararg functions would likely assume all 16 registers.
> > <
> > My 1 single instruction has a start register and a stop register and a 16-bit
> > constant that allocates/deallocates stack space. Thus you can save between
> > 1 and all 32 registers, optionally updating FP as an FP or using it as a GPR.
> > 1 instruction, compiler can save as many or as few as it likes. 1 instruction
> > (with no flow control, no excess register MOV code,.....) 1 instruction..
> >
> That likely unleashes a great evil upon the design of the instruction
> pipeline...
>
Nope, it adds a tiny amount of logic to the memory pipeline, and almost
nothing to the instruction pipeline.
>

> > PLUS, Return Address is read first; so, while the rest of restoration is
> > happening, fetch can access and decode instructions at the return point
> > before all the restoration has been completed; saving more cycles. And
> > when we run into a subsequent call, HW can short circuit the restoration
> > and subse
<
> Reloading the return address before the other instructions is already
> done in the epilog sequences in my case. This allows the
> branch-predictor to deal with the return path.
<
But you cannot jump-back to the return address until all your LDs have
been issued; whereas I can.
>

Re: Intel goes to 32 GPRs

<udldjq$nlt3$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=33962&group=comp.arch#33962

From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Intel goes to 32 GPRs
Date: Sun, 10 Sep 2023 16:48:39 -0500
 by: BGB - Sun, 10 Sep 2023 21:48 UTC

On 9/10/2023 2:46 PM, MitchAlsup wrote:
> On Sunday, September 10, 2023 at 2:03:37 PM UTC-5, BGB wrote:
>> On 9/9/2023 5:41 PM, MitchAlsup wrote:
>>
>>> While I agree that it is a performance tuning parameter, saving 1
>>> register or saving 24 registers is still 1 instruction in my ABI, and you
>>> have the option of allocating new stack space, and setting up FP as
>>> desired (or not) still in 1 instruction.
>> Fair enough.
>>
>> I don't do "1 instruction" because there would need to be a pretty hairy
>> piece of machinery behind that instruction.
> <
> A memory unit sequencer that runs::
> <
> for( i = Rstart; i != Rstop; i = (i+1)&*31 ){ *--SP = R[i] };
> <
> Yep, that is too big and harry to even consider.
>>
>> Well, at least assuming it doesn't effectively just get implemented as a
>> special case of a "Branch Subroutine" instruction or similar...
> <
> While the save or restore is off doing its memory thing, all other non-memory
> function units are capable of performing instructions.

Everything in my pipeline is lockstep, so asynchronous handling of work
isn't really a thing.

I suspect probably most other in-order pipelines are similar in this regard.

>>
>
>
>> Branches were the most workable option in my case... (No additional
>> complicated mechanism needed to deal with using a branch).
>>
>> Saves a reasonable amount in ".text" mostly from avoiding a bunch of
>> bulky messes of load/store sequences.
> <
> Do you share these sequences with DLLs ?? How does the DLL know you are
> not malicious SW ?? How do you know DLL is not malicious SW ??

Compiler checks:
  Do I need to call this thing?
  Was an instance emitted within the last 1MB?
  If no:
    Emit the prolog/epilog blobs.
    Emit calls/branches to blob.
  Else:
    Do it inline.

If they were handled with normal runtime calls (rather than
auto-generated blobs), they would be static linked along with the
relevant parts of the C runtime and similar.

Meanwhile, if ones' DLLs are malware, they have worse problems than the
use of prolog/epilog blobs.

>>
>>
>> But, does it matter all that much?...
>>
> Code footprint is nothing else, and security if you want to go there.
>>
>> Pretty much the only mainstream ISA I am aware of that messed with
>> something similar (namely, 32-bit ARM), later ended up abandoning it
>> when going to 64 bits.
>>
>> Most others ended up not bothering at all (x86-64, ARMv8), or using a
>> similar mechanism to what I am using (RISC-V).
>>
> x86 grew from the foundation of not having enough registers to use them
> in passing arguments--so it gets an excuse.
>>
>> I am personally not inclined to deal with needing to add a mechanism to
>> turn one instruction into a whole instruction sequence in the pipeline,
> <
> It is a single FU not the whole pipeline.
> <
>> to maybe save a fraction of a percent of the total clock-cycle budget.
> <
> It saves ~45% of the cycle count and saves ~52% of the power of prologue
> and epilogue sequences.
> <

Not sure how:
One would still be bottlenecked by how quickly they can move data
between the register file and memory.

If one has a 128-bit memory port, and three 64-bit register write ports,
only so much can be done here...

Memory access would also still be limited by how quickly it can
load/store stuff without running into cache misses, ... Or, other cases,
like, what happens if a TLB-miss fault happens partway through? ...

>> One needs at least 2 stack adjustments to deal with the separate sizes
>> of the register-save area and all the space for local variables.
> <
> There is a computable number of registers pushed on the stack, then there
> is a 16-bit immediate used to further adjust SP.

OK.

>>
>> The cost of the extra instructions is likely dwarfed by the big blobs of
>> load and store instructions at the destination.
>>
> It is clear--you just don't get it.
>>
>> And, in the "small N" cases, it can be done inline (often with only a
>> single stack adjustment being needed). Seems like, good enough...
> <
> They are ALL INLINE and they only take one 32-bit instruction in mine.
> <
>>>> Say:
>>>> Called function only takes 6 or 8 arguments or similar, ABI falls back
>>>> to 8 argument rules;
>>>> Called function takes 12 arguments, ABI switches to 16 argument rules;
>>>> Vararg functions would likely assume all 16 registers.
>>> <
>>> My 1 single instruction has a start register and a stop register and a 16-bit
>>> constant that allocates/deallocates stack space. Thus you can save between
>>> 1 and all 32 registers, optionally updating FP as an FP or using it as a GPR.
>>> 1 instruction, compiler can save as many or as few as it likes. 1 instruction
>>> (with no flow control, no excess register MOV code,.....) 1 instruction.
>>>
>> That likely unleashes a great evil upon the design of the instruction
>> pipeline...
>>
> Nope, it adds a tiny amount of logic to the memory pipeline, and almost
> nothing to the instruction pipeline.

I am not sure how.

The routing of things into and out of the register file would presumably
be via the pipeline, with the register file and L1 cache as two
independent units.

Only obvious way to do it in my case would be to have a mechanism in the
ID1 or ID2 stage to effectively synthesize a sequence of Load or Store
operations (while otherwise keeping the Fetch and Decode stages stalled).

I wouldn't really want to go there.

I think some other people's cores have a similar mechanism, typically
for fetching and decoding instructions from an internal "Microcode ROM";
I don't have a microcode mechanism either.

Well, or possibly slightly less hacky if the microcode mechanism were
handled in the IF stage, running in parallel with the normal I-Cache,
and handled as a sort of pseudo-subroutine-call to an "arcane magic"
address.

Could put it in a more normal ROM area, but this would effectively make
the feature almost entirely moot.

Like, at one point, I was going to put the page-table handling in ROM
and then pretend like there was a hardware page walker, but ended up not
bothering.

>>
>
>>> PLUS, Return Address is read first; so, while the rest of restoration is
>>> happening, fetch can access and decode instructions at the return point
>>> before all the restoration has been completed; saving more cycles. And
>>> when we run into a subsequent call, HW can short circuit the restoration
>>> and subse
> <
>> Reloading the return address before the other instructions is already
>> done in the epilog sequences in my case. This allows the
>> branch-predictor to deal with the return path.
> <
> But you cannot jump-back to the return address until all your LDs have
> been issued; whereas I can.

It can start fetching instructions at the point of return before all of
the loads have completed, so, probably good enough...

The sequence just needs to be long enough that the load for the return
address hits the WB stage before the "JMP R1" passes into the ID1 stage.

So, say:
MOV.Q (SP, 56), R1
MOV.X (SP, 0), R8
MOV.X (SP, 16), R10
MOV.X (SP, 32), R12
JMP R1

Fails here:
PF IF ID1 ID2 EX1 EX2 EX3 WB MOV.Q (SP, 56), R1
PF IF ID1 ID2 EX1 EX2 EX3 WB MOV.X (SP, 0), R8
PF IF ID1 ID2 EX1 EX2 EX3 WB MOV.X (SP, 16), R10
PF IF ID1 ID2 EX1 EX2 EX3 WB MOV.X (SP, 32), R12
PF IF ID1 ID2 EX1 EX2 EX3 WB JMP R1
^^!

So, Loading R1 has not finished in time, and the slow-case branch would
need to be used (triggering a pipeline flush, etc).

Would need to add roughly 2 more instructions to avoid the slow branch...

But:
MOV.Q (SP, 56), R1
MOV.X (SP, 0), R8
MOV.Q (SP, 16), R10
MOV.Q (SP, 24), R11
MOV.Q (SP, 32), R12
MOV.Q (SP, 40), R13
JMP R1

Should be roughly optimal. Branch predictor sees the newly loaded R1, so
all is well.

Short of adding special-purpose hot-paths, not likely any good way to
get much better latency than this.

As-is, this will be detected by the compiler and dealt with via NOPs, as
burning a few cycles on NOPs is better than spending a larger number of
cycles on a pipeline flush (costing ~ 9 cycles or so, *).

*: EX1 does address calculation, flags that a branch has occurred;
EX2 initiates the branch proper
(address sent to PF, flush initiated);
EX3 (address reaches IF), pipeline flush takes effect.


Re: Intel goes to 32 GPRs

<07147898-cb66-4c5c-bb2b-de38c96b1969n@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=33963&group=comp.arch#33963

Newsgroups: comp.arch
Date: Sun, 10 Sep 2023 17:27:07 -0700 (PDT)
Subject: Re: Intel goes to 32 GPRs
From: MitchAlsup@aol.com (MitchAlsup)
 by: MitchAlsup - Mon, 11 Sep 2023 00:27 UTC

On Sunday, September 10, 2023 at 4:48:46 PM UTC-5, BGB wrote:
> On 9/10/2023 2:46 PM, MitchAlsup wrote:
> > On Sunday, September 10, 2023 at 2:03:37 PM UTC-5, BGB wrote:
> >> On 9/9/2023 5:41 PM, MitchAlsup wrote:
> >>
> >>> While I agree that it is a performance tuning parameter, saving 1
> >>> register or saving 24 registers is still 1 instruction in my ABI, and you
> >>> have the option of allocating new stack space, and setting up FP as
> >>> desired (or not) still in 1 instruction.
> >> Fair enough.
> >>
> >> I don't do "1 instruction" because there would need to be a pretty hairy
> >> piece of machinery behind that instruction.
> > <
> > A memory unit sequencer that runs::
> > <
> > for( i = Rstart; i != Rstop; i = (i+1)&*31 ){ *--SP = R[i] };
> > <
> > Yep, that is too big and harry to even consider.
> >>
> >> Well, at least assuming it doesn't effectively just get implemented as a
> >> special case of a "Branch Subroutine" instruction or similar...
> > <
> > While the save or restore is off doing its memory thing, all other non-memory
> > function units are capable of performing instructions.
<
> Everything in my pipeline is lockstep, so asynchronous handling of work
> isn't really a thing.
>
> I suspect probably most other in-order pipelines are similar in this regard.
<
I suspect so, also, but even my 88100 did not have that property (1983).
<
> > It saves ~45% of the cycle count and saves ~52% of the power of prologue
> > and epilogue sequences.
> > <
> Not sure how:
> One would still be bottlenecked by how quickly they can move data
> between the register file and memory.
<
a) there is a distribution function of the number of registers saved per function.
b) you only access all of the sets when you cross a line boundary
c) you only access the TLB when you cross a page boundary
d) accessing 3 cache tags and 1 TLB entry is more expensive than accessing
.....128-bits of data (by about 30%)
e) after accessing 3 cache tags and the TLB on the first cycle, one gets
.....the rest of the line without a tag or TLB access
f) throw that data in an Excel spreadsheet along with power info for the
.....data versus tag versus TLB and it spits out the above data.
<
If you do these one memory ref at a time, it is exceedingly hard to economize
like that.
>
> If one has a 128-bit memory port, and three 64-bit register write ports,
> only so much can be done here...
<
The 3R1W RF is converted into a 4R when ENTERing and 4W when EXITing.
>
> Memory access would also still be limited by how quickly it can
> load/store stuff without running into cache misses, ... Or, other cases,
> like, what happens if a TLB-miss fault happens partway through? ...
<
The stack rarely takes misses on ENTER and EXIT.
<
<snip>
> > Nope, it adds a tiny amount of logic to the memory pipeline, and almost
> > nothing to the instruction pipeline.
> I am not sure how.
>
LoL
>
> The routing of things into and out of the register file would presumably
> be via the pipeline, with the register file and L1 cache as two
> independent units.
>
reconfigure the RF porting.
>
> Only obvious way to do it in my case would be to have a mechanism in the
> ID1 or ID2 stage to effectively synthesize a sequence of Load or Store
> operations (while otherwise keeping the Fetch and Decode stages stalled).
<
As I said before, it is all done in the MU.
>
> I wouldn't really want to go there.
>
>
> I think some other people's cores have a similar mechanism, typically
> for fetching and decoding instructions from an internal "Microcode ROM";
> I don't have a microcode mechanism either.
<
Hardwired sequencers are not Microcode. It is basically no different than
Goldschmidt division in a FMAC unit.
>
> Well, or possibly slightly less hacky if the microcode mechanism were
> handled in the IF stage, running in parallel with the normal I-Chache,
> and handled as a sort of pseudo-subroutine-call to an "arcane magic"
> address.
>
I showed you how it's done; you don't listen.

> > <
> > But you cannot jump-back to the return address until all your LDs have
> > been issued; whereas I can.
> It can start fetching instructions at the point of return before all of
> the loads have completed, so, probably good enough...
>
> The sequence just needs to be long enough that the load for the return
> address hits the WB stage before the "JMP R1" passes into the ID1 stage.
>
> So, say:
> MOV.Q (SP, 56), R1
> MOV.X (SP, 0), R8
> MOV.X (SP, 16), R10
> MOV.X (SP, 32), R12
> JMP R1
>
> Fails here:
> PF IF ID1 ID2 EX1 EX2 EX3 WB MOV.Q (SP, 56), R1
> PF IF ID1 ID2 EX1 EX2 EX3 WB MOV.X (SP, 0), R8
> PF IF ID1 ID2 EX1 EX2 EX3 WB MOV.X (SP, 16), R10
> PF IF ID1 ID2 EX1 EX2 EX3 WB MOV.X (SP, 32), R12
> PF IF ID1 ID2 EX1 EX2 EX3 WB JMP R1
> ^^!
>
> So, Loading R1 has not finished in time, and the slow-case branch would
> need to be used (triggering a pipeline flush, etc).
<
I should not, but I will point out that when LDing IP one does not need the
byte-alignment network one does need for normal LDs. This saves cycles.
>
>
> Would need to add roughly 2 more instructions to avoid the slow branch...
>
> But:
> MOV.Q (SP, 56), R1
> MOV.X (SP, 0), R8
> MOV.Q (SP, 16), R10
> MOV.Q (SP, 24), R11
> MOV.Q (SP, 32), R12
> MOV.Q (SP, 40), R13
> JMP R1
>
> Should be roughly optimal. Branch predictor sees the newly loaded R1, so
> all is well.
<
But it cannot redirect control flow until it sees the JMP; whereas HW knows
at the point of LD R0 that control will flow to the return address.
>

Re: Intel goes to 32 GPRs

<udm914$uhr9$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=33964&group=comp.arch#33964

From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Intel goes to 32 GPRs
Date: Mon, 11 Sep 2023 00:36:33 -0500
 by: BGB - Mon, 11 Sep 2023 05:36 UTC

On 9/10/2023 7:27 PM, MitchAlsup wrote:
> On Sunday, September 10, 2023 at 4:48:46 PM UTC-5, BGB wrote:
>> On 9/10/2023 2:46 PM, MitchAlsup wrote:
>>> On Sunday, September 10, 2023 at 2:03:37 PM UTC-5, BGB wrote:
>>>> On 9/9/2023 5:41 PM, MitchAlsup wrote:
>>>>
>>>>> While I agree that it is a performance tuning parameter,
>>>>> saving 1 register or saving 24 registers is still 1
>>>>> instruction in my ABI, and you have the option of allocating
>>>>> new stack space, and setting up FP as desired (or not) still
>>>>> in 1 instruction.
>>>> Fair enough.
>>>>
>>>> I don't do "1 instruction" because there would need to be a
>>>> pretty hairy piece of machinery behind that instruction.
>>> < A memory unit sequencer that runs:: < for( i = Rstart; i !=
>>> Rstop; i = (i+1)&*31 ){ *--SP = R[i] }; < Yep, that is too big
>>> and harry to even consider.
>>>>
>>>> Well, at least assuming it doesn't effectively just get
>>>> implemented as a special case of a "Branch Subroutine"
>>>> instruction or similar...
>>> < While the save or restore is off doing its memory thing, all
>>> other non-memory function units are capable of performing
>>> instructions.
> <
>> Everything in my pipeline is lockstep, so asynchronous handling of
>> work isn't really a thing.
>>
>> I suspect probably most other in-order pipelines are similar in
>> this regard.
> < I suspect so, also, but even my 88100 did not have that property
> (1983). <
>>> It saves ~45% of the cycle count and saves ~52% of the power of
>>> prologue and epilogue sequences. <
>> Not sure how: One would still be bottlenecked by how quickly they
>> can move data between the register file and memory.
> < a) there is a distribution function of the number of registers
> saved per function. b) you only access all of the sets when you cross
> a line boundary c) you only access the TLB when you cross a page
> boundary d) accessing 3 cache tags an 1 TLB entry is more expensive
> than accessing ....128-bits of data (by about 30%) e) after accessing
> 3 cache tags and the TLB on the first cycle, one gets ....the rest of
> the line without a tag or TLB access f) throw that data in an eXcel
> spreadsheet along with power info for the ....data versus tag versus
> TLB and it spits out the above data. < If you do these one memory ref
> at a time, it is exceedingly hard to economize like that.

I don't really follow.

I suspect my memory cache system works somewhat differently here.

Typically, TLB only gets involved in L1 misses:
There is a small 1-way TLB, which may be used if it hits for a given
request (when generating a request on the bus), else it sends out a
request with a virtual address;
The MMU basically sits on the ringbus, translating any requests which
pass through it, and generating a TLB miss fault if it encounters a
request that can't be handled as-is (it then modifies the offending
request into something that signals to the L1 cache that the request
resulted in a TLB miss).

Within the L1's, accesses are keyed using the virtual address, and the
L1 cache retains information about both the virtual and physical address
for a given cache line (so when evicting a dirty line, it can send it
out to the corresponding physical address).
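
Very roughly, in C-like terms (field widths and line size illustrative,
not the actual Verilog):

#include <stdint.h>

/* Direct-mapped, virtually tagged line: lookups compare only the virtual
   tag, so a hit needs no TLB access; the remembered physical address is
   what a dirty line is written back to on eviction. */
typedef struct {
    uint64_t vtag;        /* virtual-address tag compared on lookup     */
    uint64_t paddr;       /* physical address remembered for writeback  */
    uint8_t  data[32];    /* line payload (width assumed)               */
    uint8_t  valid, dirty;
} l1_line;

int l1_hit(l1_line *lines, uint64_t nlines, uint64_t vaddr)
{
    l1_line *ln = &lines[(vaddr >> 5) % nlines];          /* index by VA */
    return ln->valid && ln->vtag == (vaddr >> 5) / nlines;
}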

>>
>> If one has a 128-bit memory port, and three 64-bit register write
>> ports, only so much can be done here...
> < The 3R1W RF is converted into a 4R when ENTERing and 4W when
> EXITing.

?...

I think this would need tristate logic, which is (generally) only
available for external IO pins AFAICT (and not supported by Verilator at
all).

>>
>> Memory access would also still be limited by how quickly it can
>> load/store stuff without running into cache misses, ... Or, other
>> cases, like, what happens if a TLB-miss fault happens partway
>> through? ...
> < The stack rarely takes misses on ENTER and EXIT.

My cache modeling seems to be showing a roughly 70% miss-rate, but this
is for blobs dealing with 31 GPRs.

Granted, it is likely that the saved registers are getting knocked out
of the L1 cache sometime between function entry and exit for the
functions which are using this blob (mostly the ones with 80+ variables).

One of the main users in this case seems to be the function for
projecting and tessellating primitives, which are then handed off to the
functions which (ultimately) submit them to the rasterizer module (in
place of the original edge-walking and span-drawing loops).

There is a lot of stuff in here that made sense for the software
rasterizer, but I am ending up needing to side-step if using the
hardware module.

> < <snip>
>>> Nope, it adds a tiny amount of logic to the memory pipeline, and
>>> almost nothing to the instruction pipeline.
>> I am not sure how.
>>
> LoL
>>
>> The routing of things into and out of the register file would
>> presumably be via the pipeline, with the register file and L1 cache
>> as two independent units.
>>
> reconfigure the RF porting.
>>
>> Only obvious way to do it in my case would be to have a mechanism
>> in the ID1 or ID2 stage to effectively synthesize a sequence of
>> Load or Store operations (while otherwise keeping the Fetch and
>> Decode stages stalled).
> < As I said before, it is all done in the MU.

My L1 caches have no way to initiate any logic on their own. They merely
handle requests that pass through them (or generate stall signals if
they need more time, such as to resolve a cache-miss).

There is a special case, though, where they will behave as if the cache
has hit, if they see that a TLB miss has happened (and there is a
need to unstall the pipeline to enter the TLB-miss ISR).

So, in this case, the main pipeline keeps track of where the result
should go once it comes back from the L1 (possibly feeding it through a
converter before it is handed off to the WB stage; where the result is
presented to the register file so that it can be written back to a
register).

It is basically the main pipeline that drives all of the logic forward
(and pulling outputs from the various function-units at the appropriate
stages; all the units needing to operate in lock-step with the pipeline).

Generally, bad stuff happens if a unit does not operate in lockstep with
the CPU.

Though, it was worse early on, as originally memory accesses were not
pipelined and had used the "OPM/OK" signaling scheme (which is now
"mostly banished" and only really used in places like the MMIO bus).

Where, say:
  Present inputs, and an OPM representing the request;
  Wait for OK signal to go OK;
  Switch OPM to IDLE;
  Wait for OK to go to READY.

Early on, both the L1 caches and FPU had done this.

Then these units were integrated with the pipeline, and this signaling
was relegated to the bus. Then this bus was replaced with the ringbus,
and it was mostly relegated to the DDR PHY and MMIO bus, with the DDR
PHY then later switching to a modified signaling scheme (based instead
on the use of sequence-numbers).

>>
>> I wouldn't really want to go there.
>>
>>
>> I think some other people's cores have a similar mechanism,
>> typically for fetching and decoding instructions from an internal
>> "Microcode ROM"; I don't have a microcode mechanism either.
> < Hardwired sequencers are not Microcode. It is basically no
> different than Goldschmidt division in a FMAC unit.

I didn't do Goldschmidt either.

The FDIV operator was handled with a Shift-ADD unit. So, say, the unit
basically spins at one position every clock cycle. One can inject a
request into the unit, run a counter, and then grab the result at a
fixed number of clock-cycles in the future (with the inputs set up to
determine whether it performs a multiply or divide).

It was originally written for integer multiply and divide, but then I
realized I could set up the inputs in a way to make it do FDIV and
similar as well.

Partly this was based on the observation that "binary long division"
could be easily reconfigured into Shift-Add form (basically the same
algorithm just with the logic transposed). Using it for FDIV was also
basically the same logic, just throw in exponents and adjust the cycle
delay.

Well, along with some "quick and dirty" approximations (if one doesn't
need much accuracy, they can roughly approximate a floating-point divide
with an integer subtraction). With some adjustments, a similar algo can
be used for fixed-point division (in software, using a CLZ instruction
and some bit-shift operations).
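
The basic step, in plain C, is just the textbook shift-and-subtract loop;
the hardware unit performs one iteration of this per clock (generic
illustration, not the actual Verilog):

#include <stdint.h>

/* One-bit-per-step long division: shift the next dividend bit into the
   partial remainder, trial-subtract the divisor, set the quotient bit. */
uint64_t long_div(uint64_t n, uint64_t d, uint64_t *rem)
{
    uint64_t q = 0, r = 0;
    for (int i = 63; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1);
        if (r >= d) {
            r -= d;
            q |= 1ULL << i;
        }
    }
    *rem = r;
    return q;
}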


Re: Intel goes to 32 GPRs

<udn7su$1305m$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=33966&group=comp.arch#33966

From: sfuld@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Intel goes to 32 GPRs
Date: Mon, 11 Sep 2023 07:23:26 -0700
 by: Stephen Fuld - Mon, 11 Sep 2023 14:23 UTC

On 9/10/2023 10:36 PM, BGB wrote:

snip

> I suspect my memory cache system works somewhat differently here.
>
> Typically, TLB only gets involved in L1 misses:

Your terminology may be confusing. When you use the term "L1" (level
one) cache, that implies the existence of a multi-level cache, i.e. at
least an L2, and perhaps an L3. If, as implied by the rest of your
post, you have only a single level of cache, you would just use the term
"cache", without the level number (You may still use I$ or D$ if, as you
seem to, have separate caches for instructions and data). You say your
cache uses virtual addresses. In that case, your description of when a
TLB lookup is required fits. But some designs, particularly those with
multiple levels use physical addressing (primarily to prevent aliasing
problems) for at least the lower (i.e. L2 or L3) levels, in which case,
a TLB lookup is required before the cache lookup.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Intel goes to 32 GPRs

<2023Sep11.172747@mips.complang.tuwien.ac.at>


https://news.novabbs.org/devel/article-flat.php?id=33967&group=comp.arch#33967

From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Intel goes to 32 GPRs
Date: Mon, 11 Sep 2023 15:27:47 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
 by: Anton Ertl - Mon, 11 Sep 2023 15:27 UTC

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>You say your
>cache uses virtual addresses. In that case, your description of when a
>TLB lookup is required fits. But some designs, particularly those with
>multiple levels use physical addressing (primarily to prevent aliasing
>problems) for at least the lower (i.e. L2 or L3) levels, in which case,
>a TLB lookup is required before the cache lookup.

Most designs use virtually-indexed, physically-tagged L1 caches; this
allows the cache access and the TLB access to be performed in parallel. It
does mean that every cache access also requires a TLB access.
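
Roughly (generic sketch, 4 KiB pages with 64-byte lines and 64 sets
assumed, so the index bits fall inside the page offset; tlb_translate()
is just a stand-in for the TLB):

#include <stdint.h>

#define LINE_BITS  6      /* 64-byte lines              */
#define INDEX_BITS 6      /* 64 sets -> 4 KiB per way   */

extern uint64_t tlb_translate(uint64_t vaddr);   /* assumed helper */

/* VIPT lookup: the set index comes from the virtual address (inside the
   page offset), the tag compare uses the translated physical address. */
int vipt_hit(const uint64_t ptag[64], uint64_t vaddr)
{
    uint64_t set   = (vaddr >> LINE_BITS) & ((1u << INDEX_BITS) - 1);
    uint64_t paddr = tlb_translate(vaddr);       /* proceeds in parallel */
    return ptag[set] == (paddr >> (LINE_BITS + INDEX_BITS));
}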

The advantage of physically tagged caches is that two processes can
map the same page without requiring OS and application shenanigans.
Makes me wonder how Linux on HPPA or Power implements mmap with
MAP_SHARED.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Intel goes to 32 GPRs

<udnd67$1305m$2@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=33968&group=comp.arch#33968

From: sfuld@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Intel goes to 32 GPRs
Date: Mon, 11 Sep 2023 08:53:43 -0700
 by: Stephen Fuld - Mon, 11 Sep 2023 15:53 UTC

On 9/11/2023 8:27 AM, Anton Ertl wrote:
> Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>> You say your
>> cache uses virtual addresses. In that case, your description of when a
>> TLB lookup is required fits. But some designs, particularly those with
>> multiple levels use physical addressing (primarily to prevent aliasing
>> problems) for at least the lower (i.e. L2 or L3) levels, in which case,
>> a TLB lookup is required before the cache lookup.
>
> Most designs use virtually-indexed physically tagged L1 caches; this
> allows to perform the cache access and TLB access in parallel. It
> does mean that every cache access also requires a TLB access.

Thanks for the clarifications/corrections.

> The advantage of physically tagged caches is that two processes can
> map the same page without requiring OS and application shenanigans.

Yes. I think this is another way of saying what I said about aliases.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Intel goes to 32 GPRs

<udnftb$14f6e$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=33969&group=comp.arch#33969

From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Intel goes to 32 GPRs
Date: Mon, 11 Sep 2023 11:40:07 -0500
 by: BGB - Mon, 11 Sep 2023 16:40 UTC

On 9/11/2023 9:23 AM, Stephen Fuld wrote:
> On 9/10/2023 10:36 PM, BGB wrote:
>
> snip
>
>> I suspect my memory cache system works somewhat differently here.
>>
>> Typically, TLB only gets involved in L1 misses:
>
> Your terminology may be confusing.  When you use the term "L1" (level
> one) cache, that implies the existence of a multi-level cache, i.e. at
> least an L2, and perhaps an L3.  If, as implied by the rest of your
> post, you have only a single level of cache, you would just use the term
> "cache", without the level number (You may still use I$ or D$ if, as you
> seem to, have separate caches for instructions and data).  You say your
> cache uses virtual addresses.  In that case, your description of when a
> TLB lookup is required fits.  But some designs, particularly those with
> multiple levels use physical addressing (primarily to prevent aliasing
> problems) for at least the lower (i.e. L2 or L3) levels, in which case,
> a TLB lookup is required before the cache lookup.
>

There are multiple cache levels.
  L1 Instruction and Data caches:
    Usually configured as 16K or 32K at present.
      16K is easier for FPGA timing.
    Direct Mapped.
    Tagged using virtual address.

  L2 Cache:
    Currently configured as 256K on the XC7A100T, 512K on the XC7A200T.
    Also currently configured as direct-mapped.
    Tagged using physical address.
    L2 talks directly / exclusively with the DDR RAM module.

The caches generally use a write-back strategy.
There is not currently any mechanism to implicitly synchronize caches
(this is an eventual TODO item).

The current bus is roughly organized as a ring, with various signals (a
rough struct rendering of one ring slot follows the list):
  Data(In/Out), 128-bits
  Address(In/Out), 48 or 96 bits
    96-bits mostly only between L1 caches and TLB.
    48-bit everywhere else.
    (47:44): Categorizes the type of address:
      0..7: Userland virtual address
      8..B: Supervisor virtual address
      C: Cached physical address
      D: Non-Cached physical address
      E: Reserved
      F: MMIO or similar
  OPM(In/Out): 16-bits, Command-Code and some flags.
    (15: 8): Command-specific flags
    ( 7: 4): Major request/response/command code.
    ( 3: 0): Command-specifier or data-size bits.
  SEQ(In/Out): 16-bits, NodeID and a Sequence Number.
    Loosely:
      (15:10): Core or Unit for this Request.
      ( 9: 8): Which cache has made the request.
      ( 7: 6): The way within the cache.
      ( 5: 0): Request Sequence Number
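
As a rough C rendering of one ring slot (field names and packing are mine;
the actual HDL differs):

#include <stdint.h>

typedef struct {
    uint64_t data[2];   /* 128-bit data payload                              */
    uint64_t addr_lo;   /* low 48 address bits, (47:44) = address type       */
    uint64_t addr_hi;   /* upper bits, used on the 96-bit L1 <-> TLB path    */
    uint16_t opm;       /* [15:8] flags, [7:4] command, [3:0] size/specifier */
    uint16_t seq;       /* [15:10] unit, [9:8] cache, [7:6] way, [5:0] seq#  */
} ring_msg;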

Typically, each location forwards its inputs to the output for each
clock cycle, or may modify the request if it is directed to the unit in
question.

Generally the ring also serves as an implicit buffer for in-flight
requests, which may circle the ring until they can be handled.

To reduce ring latency, some requests and responses are handled as
special-case and may jump the ring to locations closer to their
destination (such as keeping RAM requests near the L2, or direct
responses closer to the core in question).

It is possible to operate without this special-case forwarding, but
memory access latency is worse (and bandwidth suffers).

Within the CPU cores, the ring is currently partially pinched-off, with
the units organized as:
In -> L1-D$ -> L1-I$ -> TLB -> Out
_ -> ............ -> Out
In -> ..................... -> Out (Anything N/A for this core)

Where '...' designates the optional shortcut paths.
Requests to RAM may skip over the TLB if they are already using a
physical address.

While using a star topology and FIFO buffers could potentially allow for
lower latency, it would add a lot of complexity and likely have a steep
LUT cost. Also seemed a lot simpler than something like AXI.

It is also quite significantly faster than my original OPM/OK signaling
scheme (mostly for sake of allowing multiple in-flight requests to be
active at the same time).

The DDR-RAM interface now uses a modified OPM-SEQ scheme:
  OPM_SEQ (12-bits):
    (11: 6): Request sequence number
    ( 5: 4): IDLE/LOAD/STORE/SWAP
    ( 3: 0): Request sub-type (If IDLE, it is a command-code).
  OK_SEQ (8 bits):
    ( 7: 2): Response sequence number
    ( 1: 0): READY/OK/HOLD/FAULT
  LoadAddress (32 bits): Address we are loading from
  StoreAddress (32 bits): Address we are storing to
  LoadData (512 bits): Data loaded from RAM
  StoreData (512 bits): Data stored to RAM

Here, it is no longer necessary for the bus to transition back to an
IDLE/READY state between requests; instead, the request/response
sequence numbers are used to see when a new request has arrived or
when the response to a request has come back.
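Roughly, in C (modelling the two sides; the actual logic lives in the
RTL, and the helper names here are made up):

  #include <stdint.h>
  #include <stdbool.h>

  /* RAM-controller side: a new request exists whenever the request
     sequence number in OPM_SEQ differs from the last one retired. */
  static uint8_t last_req_seq;

  bool ram_request_pending(uint16_t opm_seq)
  {
      uint8_t req_seq = (opm_seq >> 6) & 0x3F;
      return req_seq != last_req_seq;
  }

  void ram_request_retired(uint16_t opm_seq)
  {
      last_req_seq = (opm_seq >> 6) & 0x3F;
  }

  /* L2 side: the request is done once the response sequence number in
     OK_SEQ has caught up with the number that was issued. */
  bool ram_request_done(uint8_t ok_seq, uint8_t issued_seq)
  {
      return ((ok_seq >> 2) & 0x3F) == issued_seq;
  }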

This design is used mostly because it is more tolerant of crossing
between clock domains (the DDR RAM controller operating internally on a
faster clock than the rest of the core).

Typically, the L2 cache will use SWAP requests, which combine both a
Load and Store into a single request (at two different addresses).

Generally, this is done (along with the 512-bit cache lines) to try to
get closer to the theoretical RAM bandwidth.

Though, usable bandwidth when going through the L2 cache seems to be a
fair bit lower than what is measured directly off the RAM module's
interface (~ 100MB/s full-duplex). So, the combination of the L2
cache's state-machine logic and ringbus interface does seem to reduce
the achievable bandwidth below the theoretical figure.

Full-duplex to the L2 cache over the ringbus only achieves ~60 MB/sec
at present (and only ~24 MB/s to RAM), which, as noted, is lower than
the RAM bandwidth measured off the module's ports.

Note that this is generally running the RAM chips at 50MHz with the DLL
disabled, generally able to make use of burst transfers (with the
512-bit lines), and able to put the load and store directly end-to-end
if they are in the same row (otherwise, one needs to close the row and
open a new row). Typically, the Load is initiated first, followed by
the Store.

While theoretically, this RAM could be faster (by running it at the
clock-speeds it was designed to run at, via SERDES), some testing
implies this will not gain much as-is (even if the RAM module could
respond to requests immediately, the L2 will still respond at roughly
the same speeds as before).

Note, that "bypassing" the L2 is not really an option in this case, and
even with its limitations, it is faster than the alternatives.

But, not really a good way to make it "obviously faster".

Though, one possibility would be if the L2 cache removed requests from
the ring and added them to a FIFO, dumping responses back out onto the
ring once they have been serviced (and falling back to the
request-ignoring strategy if the FIFO gets full).
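A minimal C sketch of that idea (the FIFO depth and names are made up):

  #include <stdint.h>
  #include <stdbool.h>

  #define L2_FIFO_DEPTH 8

  typedef struct { uint16_t opm, seq; uint64_t addr; } Req;

  static Req      fifo[L2_FIFO_DEPTH];
  static unsigned head, tail, count;

  /* Pull a request off the ring if there is room; if the FIFO is
     full, fall back to ignoring it so it keeps circling the ring. */
  bool l2_accept_request(Req r)
  {
      if (count == L2_FIFO_DEPTH)
          return false;                 /* leave it on the ring */
      fifo[tail] = r;
      tail = (tail + 1) % L2_FIFO_DEPTH;
      count++;
      return true;                      /* removed from the ring */
  }

  /* Service requests in order; responses get dumped back onto the
     ring elsewhere once they complete. */
  bool l2_next_request(Req *out)
  {
      if (count == 0)
          return false;
      *out = fifo[head];
      head = (head + 1) % L2_FIFO_DEPTH;
      count--;
      return true;
  }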

This would also be immune to a few adverse edge-cases which can come up
on the ring, such as something I am calling "Pogo Bouncing", where
multiple in-flight requests on the ring repeatedly knock each other's
data out of the cache before either request can be handled.

As-is, the L2 cache has logic to detect and mitigate these scenarios
(otherwise the situation effectively results in a deadlock). Mostly, if
a "Pogo" is detected, it will selectively ignore the other requests to
that same cache index until the first request has been handled.
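Something along these lines, as a simplified single-entry sketch in C
(the actual detection logic is presumably more involved; the names are
made up):

  #include <stdint.h>
  #include <stdbool.h>

  /* Track one L2 index currently claimed by an in-flight miss; while
     claimed, other requests to the same index are ignored (left to
     circle the ring) rather than evicting the line under it. */
  typedef struct {
      bool     busy;
      uint32_t index;   /* L2 cache index being serviced */
  } PogoGuard;

  bool l2_may_accept(PogoGuard *g, uint32_t req_index)
  {
      if (g->busy && g->index == req_index)
          return false;        /* would evict the other request's line */
      if (!g->busy) {
          g->busy  = true;     /* claim this index for the new request */
          g->index = req_index;
      }
      return true;
  }

  void l2_index_released(PogoGuard *g) { g->busy = false; }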

But, this is likely another "TODO" item.

For video-display, there is a VRAM Module, which at present is basically
a small cache that directs requests through the L2 cache (over the
ringbus). Generally, as the raster sweep goes around, it throws out a
stream of "prefetch" requests for anything that misses in the internal
cache (with some logic to try to limit how much it "spams the bus").

The fetch address runs ahead of the currently displayed address, mostly
so that the response will hopefully have gotten back before it is time
to draw the pixels in question (so the cache is big enough for roughly
several scanlines, with its contents endlessly cycling through during
the raster sweep).
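As a rough C model of the prefetch-ahead behavior (the constants here
are illustrative, not the actual values):

  #include <stdint.h>

  #define LINE_BYTES  64u     /* 512-bit cache line */
  #define FETCH_AHEAD 4096u   /* how far the fetch address runs ahead */

  /* Each step of the raster sweep, the prefetch address stays a fixed
     distance ahead of the display address, so the response has
     (hopefully) arrived by the time those pixels are scanned out. */
  uint32_t vram_prefetch_addr(uint32_t display_addr)
  {
      return (display_addr + FETCH_AHEAD) & ~(LINE_BYTES - 1u);
  }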

When a prefetch reaches the L2, regardless of whether it hits or
misses, the L2 will send a response (containing whatever data happened
to be at that location in the cache). Generally, whatever the VRAM
module gets back is what is displayed on screen.

It mostly works sufficiently well for 320x200 and color-cell modes.
Experiments with 640x480 or 800x600 bitmapped modes generally result in
an ugly-looking and broken mess, though (this design does not deal
particularly well with L2 misses, which become progressively more
common as the framebuffer size increases).

Color-cell works better, but doesn't look as good, and requires
re-encoding the screen contents on every update. Still, running RGB555
through a color-cell encoder gives arguably better-looking results than
a 16-color or 4-color mode would have done (say, the old
Black/White/Cyan/Magenta palette, which looks awful).


Re: Intel goes to 32 GPRs

 by: Scott Lurndal - Mon, 11 Sep 2023 16:49 UTC

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>>You say your
>>cache uses virtual addresses. In that case, your description of when a
>>TLB lookup is required fits. But some designs, particularly those with
>>multiple levels use physical addressing (primarily to prevent aliasing
>>problems) for at least the lower (i.e. L2 or L3) levels, in which case,
>>a TLB lookup is required before the cache lookup.
>
>Most designs use virtually-indexed physically tagged L1 caches; this
>allows to perform the cache access and TLB access in parallel. It
>does mean that every cache access also requires a TLB access.
>
>The advantage of physically tagged caches is that two processes can
>map the same page without requiring OS and application shenanigans.
>Makes me wonder how Linux on HPPA or Power implements mmap with
>MAP_SHARED.

Or on MIPS. IRIX had a bunch of hacks internally to support MAP_SHARED, IIRC.

Re: Intel goes to 32 GPRs

 by: BGB - Mon, 11 Sep 2023 16:49 UTC

On 9/11/2023 10:27 AM, Anton Ertl wrote:
> Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>> You say your
>> cache uses virtual addresses. In that case, your description of when a
>> TLB lookup is required fits. But some designs, particularly those with
>> multiple levels use physical addressing (primarily to prevent aliasing
>> problems) for at least the lower (i.e. L2 or L3) levels, in which case,
>> a TLB lookup is required before the cache lookup.
>
> Most designs use virtually-indexed physically tagged L1 caches; this
> allows to perform the cache access and TLB access in parallel. It
> does mean that every cache access also requires a TLB access.
>

As noted, the TLB only gets involved on L1 misses in this case (which
allows the main TLB to exist as an entity independent of the L1
caches).

Pros/cons exist either way.

> The advantage of physically tagged caches is that two processes can
> map the same page without requiring OS and application shenanigans.
> Makes me wonder how Linux on HPPA or Power implements mmap with
> MAP_SHARED.
>

This is partly avoided in my case with 16K caches and a 16K page size.
It does become an issue with 32K caches, though, or with 4K pages.

Conflicts can be reduced by avoiding starting a virtual-memory mapping
on an odd-numbered page if it might be shared (say, so any shared
mappings will most likely land in the same place in the L1 cache).

Though, this does require the L1 caches to treat the index as a modulo
of the address (no hashing allowed in this case).

This would also be a worse situation with associative caching, but is
less of an issue with direct-mapped L1 caches.
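Roughly, in C, the constraint being relied on (the cache geometry here
is an assumption matching the description above):

  #include <stdint.h>

  #define L1_SIZE    (32u * 1024)   /* direct-mapped, 32K case */
  #define LINE_BYTES 64u
  #define PAGE_SIZE  (16u * 1024)

  /* Direct-mapped index is a plain modulo of the address (no hashing). */
  static inline uint32_t l1_index(uint64_t addr)
  {
      return (uint32_t)((addr / LINE_BYTES) % (L1_SIZE / LINE_BYTES));
  }

  /* With a 32K cache and 16K pages, one index bit lies above the page
     offset.  If a shared mapping is aligned so that the virtual and
     physical addresses agree modulo the cache size, every alias of a
     given line computes the same l1_index(), so the virtually-tagged
     L1 never ends up holding it at two different indices. */
  static inline int shared_mapping_ok(uint64_t vaddr, uint64_t paddr)
  {
      return (vaddr % L1_SIZE) == (paddr % L1_SIZE);
  }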

> - anton

Re: Intel goes to 32 GPRs

 by: Stephen Fuld - Mon, 11 Sep 2023 16:54 UTC

On 9/11/2023 9:40 AM, BGB wrote:
> On 9/11/2023 9:23 AM, Stephen Fuld wrote:
>> On 9/10/2023 10:36 PM, BGB wrote:
>>
>> snip
>>
>>> I suspect my memory cache system works somewhat differently here.
>>>
>>> Typically, TLB only gets involved in L1 misses:
>>
>> Your terminology may be confusing.  When you use the term "L1" (level
>> one) cache, that implies the existence of a multi-level cache, i.e. at
>> least an L2, and perhaps an L3.  If, as implied by the rest of your
>> post, you have only a single level of cache, you would just use the
>> term "cache", without the level number (You may still use I$ or D$ if,
>> as you seem to, have separate caches for instructions and data).  You
>> say your cache uses virtual addresses.  In that case, your description
>> of when a TLB lookup is required fits.  But some designs, particularly
>> those with multiple levels use physical addressing (primarily to
>> prevent aliasing problems) for at least the lower (i.e. L2 or L3)
>> levels, in which case, a TLB lookup is required before the cache lookup.
>>
>
> There are multiple cache levels.
>   L1 Instruction and Data caches:
>     Usually configured as 16K or 32K at present.
>       16K is easier for FPGA timing.
>     Direct Mapped.
>     Tagged using virtual address.
>
>   L2 Cache:
>     Currently configured as 256K on the XC7A100T, 512K on the XC7A200T.
>     Also currently configured as direct-mapped.
>     Tagged using physical address.
>     L2 Talks directly / exclusively with the DDR RAM module.

OK, my mistake. If there is an L1 miss that hits in the L2, is the ring
bus involved? Perhaps that was the source of my confusion.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Intel goes to 32 GPRs

 by: BGB - Mon, 11 Sep 2023 17:49 UTC

On 9/11/2023 11:54 AM, Stephen Fuld wrote:
> On 9/11/2023 9:40 AM, BGB wrote:
>> On 9/11/2023 9:23 AM, Stephen Fuld wrote:
>>> On 9/10/2023 10:36 PM, BGB wrote:
>>>
>>> snip
>>>
>>>> I suspect my memory cache system works somewhat differently here.
>>>>
>>>> Typically, TLB only gets involved in L1 misses:
>>>
>>> Your terminology may be confusing.  When you use the term "L1" (level
>>> one) cache, that implies the existence of a multi-level cache, i.e.
>>> at least an L2, and perhaps an L3.  If, as implied by the rest of
>>> your post, you have only a single level of cache, you would just use
>>> the term "cache", without the level number (You may still use I$ or
>>> D$ if, as you seem to, have separate caches for instructions and
>>> data).  You say your cache uses virtual addresses.  In that case,
>>> your description of when a TLB lookup is required fits.  But some
>>> designs, particularly those with multiple levels use physical
>>> addressing (primarily to prevent aliasing problems) for at least the
>>> lower (i.e. L2 or L3) levels, in which case, a TLB lookup is required
>>> before the cache lookup.
>>>
>>
>> There are multiple cache levels.
>>    L1 Instruction and Data caches:
>>      Usually configured as 16K or 32K at present.
>>        16K is easier for FPGA timing.
>>      Direct Mapped.
>>      Tagged using virtual address.
>>
>>    L2 Cache:
>>      Currently configured as 256K on the XC7A100T, 512K on the XC7A200T.
>>      Also currently configured as direct-mapped.
>>      Tagged using physical address.
>>      L2 Talks directly / exclusively with the DDR RAM module.
>
>
> OK, my mistake.  If there is an Li miss that hits in the L2, is the ring
> bus involved?  Perhaps that was the source of my confusion.
>

Yes.

The ringbus is what holds all of the memory-related modules together:
It is how the L1's send requests to the TLB;
It is how the requests and responses make their way between the L1 and
L2 caches;
....

Each CPU core basically only has a limited number of IO connections:
  Its Ringbus inputs/outputs;
  Clock, Reset inputs;
  Some "Debug Status LED" outputs.

Beyond just memory requests, things like IRQs and Hardware-RNG stuff are
also handled via the ringbus.

Implicitly, the rings are divided into two subsets:
  L1 Ring: Holds L1 caches and TLB, within the CPU core;
  L2 Ring: Everything external to the CPU core.
    L2 Cache, ROM, VRAM module, MMIO Bus interface, ... all sit here.

Because messages flow along the bus at basically 1 position per
clock-cycle, the number of objects on the bus does affect its overall
latency and performance, which is the main tradeoff with this design.

Direct point-to-point signaling would have been cost-prohibitive.

Its predecessor (the OPM/OK bus) was horridly slow (hard pressed to get
much over a few MB/sec; made it hard to get playable frame-rates even in
Doom).

A "more advanced" bus would likely use a star topology, where everything
plugs into a "switch" which potentially queues requests as needed and
then forwards them out of the appropriate ports based on where they need
to go. However, this would be complicated/expensive.

Note that within each core, all of the function units are held together
via the "main pipeline", which is effectively the central organizing
structure for everything. So, the L1 caches, Register file,
Decoder/ALUs/FPU/etc, are all glued onto this pipeline (and everything
proceeds along with the pipeline in a sort of strictly lock-step
fashion, with a few of the units able to raise "stall" signals if they
need more clock-cycles to complete their work).

The MMU/TLB is partially independent, as it exists mostly on the
ring-bus. Its main connection (to the CPU core proper) is mostly in its
role/ability to signal the pipeline whenever an exception/interrupt has
been raised.

Effectively, "TLB Load" commands and similar are forwarded to the TLB
via the ringbus (and are directed through the L1 Cache, as a sort of
"funky memory request", along with requests for things like
cache-flushes and similar, ...).

In contrast, most things that happen on the ringbus are inherently
asynchronous, and responses to requests will often not arrive in the
same order they were sent (so, the L1 caches need to be able to deal
with things like out-of-order responses, and may only add a request to
the bus if there is not already something there, etc).
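A small C sketch of matching out-of-order responses by sequence number
(a minimal model with made-up names, not the actual miss-handling
logic):

  #include <stdint.h>
  #include <stdbool.h>

  #define MAX_MISSES 4

  typedef struct {
      bool     valid;
      uint8_t  seq;     /* sequence number the request went out with */
      uint32_t index;   /* which L1 line is waiting on it */
  } MissEntry;

  static MissEntry misses[MAX_MISSES];

  /* Responses can come back in any order; find the outstanding miss
     whose sequence number matches and retire it. */
  int l1_match_response(uint8_t resp_seq)
  {
      for (int i = 0; i < MAX_MISSES; i++) {
          if (misses[i].valid && misses[i].seq == resp_seq) {
              misses[i].valid = false;
              return (int)misses[i].index;  /* line to fill */
          }
      }
      return -1;  /* unexpected or stale response */
  }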

....

Re: Intel goes to 32 GPRs

 by: Stephen Fuld - Mon, 11 Sep 2023 18:09 UTC

On 9/11/2023 10:49 AM, BGB wrote:
> On 9/11/2023 11:54 AM, Stephen Fuld wrote:
>> On 9/11/2023 9:40 AM, BGB wrote:
>>> On 9/11/2023 9:23 AM, Stephen Fuld wrote:
>>>> On 9/10/2023 10:36 PM, BGB wrote:
>>>>
>>>> snip
>>>>
>>>>> I suspect my memory cache system works somewhat differently here.
>>>>>
>>>>> Typically, TLB only gets involved in L1 misses:
>>>>
>>>> Your terminology may be confusing.  When you use the term "L1"
>>>> (level one) cache, that implies the existence of a multi-level
>>>> cache, i.e. at least an L2, and perhaps an L3.  If, as implied by
>>>> the rest of your post, you have only a single level of cache, you
>>>> would just use the term "cache", without the level number (You may
>>>> still use I$ or D$ if, as you seem to, have separate caches for
>>>> instructions and data).  You say your cache uses virtual addresses.
>>>> In that case, your description of when a TLB lookup is required
>>>> fits.  But some designs, particularly those with multiple levels use
>>>> physical addressing (primarily to prevent aliasing problems) for at
>>>> least the lower (i.e. L2 or L3) levels, in which case, a TLB lookup
>>>> is required before the cache lookup.
>>>>
>>>
>>> There are multiple cache levels.
>>>    L1 Instruction and Data caches:
>>>      Usually configured as 16K or 32K at present.
>>>        16K is easier for FPGA timing.
>>>      Direct Mapped.
>>>      Tagged using virtual address.
>>>
>>>    L2 Cache:
>>>      Currently configured as 256K on the XC7A100T, 512K on the XC7A200T.
>>>      Also currently configured as direct-mapped.
>>>      Tagged using physical address.
>>>      L2 Talks directly / exclusively with the DDR RAM module.
>>
>>
>> OK, my mistake.  If there is an Li miss that hits in the L2, is the
>> ring bus involved?  Perhaps that was the source of my confusion.
>>
>
> Yes.

OK, that explains it.

> The ringbus is what holds all of the memory-related modules together:
> It is how the L1's send requests to the TLB;
> It is how the requests and responses make their way between the L1 and
> L2 caches;
> ...
>
> Each CPU core basically only has a limited number of IO connections:
>   Its Ringbus inputs/outputs;
>   Clock, Reset inputs;
>   Some "Debug Status LED" outputs.
>
> Beyond just memory requests, things like IRQs and Hardware-RNG stuff are
> also handled via the ringbus.

Makes sense.

> Implicitly, the rings are divided into two subsets:
>   L1 Ring: Holds L1 caches and TLB, within the CPU core;
>   L2 Ring: Everything external to the CPU core.
>     L2 Cache, ROM, VRAM module, MMIO Bus interface, ... all sit here.

Now I am confused again, by the word "subsets". Is there a single ring
with everything attached, or two separate rings, the L1 and the L2 rings
that each presumably has a slot on the ring used to pass data between
the two rings. There are advantages and disadvantages with each choice.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Intel goes to 32 GPRs

 by: BGB - Mon, 11 Sep 2023 18:42 UTC

On 9/11/2023 1:09 PM, Stephen Fuld wrote:
> On 9/11/2023 10:49 AM, BGB wrote:
>> On 9/11/2023 11:54 AM, Stephen Fuld wrote:
>>> On 9/11/2023 9:40 AM, BGB wrote:
>>>> On 9/11/2023 9:23 AM, Stephen Fuld wrote:
>>>>> On 9/10/2023 10:36 PM, BGB wrote:
>>>>>
>>>>> snip
>>>>>
>>>>>> I suspect my memory cache system works somewhat differently here.
>>>>>>
>>>>>> Typically, TLB only gets involved in L1 misses:
>>>>>
>>>>> Your terminology may be confusing.  When you use the term "L1"
>>>>> (level one) cache, that implies the existence of a multi-level
>>>>> cache, i.e. at least an L2, and perhaps an L3.  If, as implied by
>>>>> the rest of your post, you have only a single level of cache, you
>>>>> would just use the term "cache", without the level number (You may
>>>>> still use I$ or D$ if, as you seem to, have separate caches for
>>>>> instructions and data).  You say your cache uses virtual addresses.
>>>>> In that case, your description of when a TLB lookup is required
>>>>> fits.  But some designs, particularly those with multiple levels
>>>>> use physical addressing (primarily to prevent aliasing problems)
>>>>> for at least the lower (i.e. L2 or L3) levels, in which case, a TLB
>>>>> lookup is required before the cache lookup.
>>>>>
>>>>
>>>> There are multiple cache levels.
>>>>    L1 Instruction and Data caches:
>>>>      Usually configured as 16K or 32K at present.
>>>>        16K is easier for FPGA timing.
>>>>      Direct Mapped.
>>>>      Tagged using virtual address.
>>>>
>>>>    L2 Cache:
>>>>      Currently configured as 256K on the XC7A100T, 512K on the
>>>> XC7A200T.
>>>>      Also currently configured as direct-mapped.
>>>>      Tagged using physical address.
>>>>      L2 Talks directly / exclusively with the DDR RAM module.
>>>
>>>
>>> OK, my mistake.  If there is an Li miss that hits in the L2, is the
>>> ring bus involved?  Perhaps that was the source of my confusion.
>>>
>>
>> Yes.
>
> OK, that explains it.
>
>
>> The ringbus is what holds all of the memory-related modules together:
>> It is how the L1's send requests to the TLB;
>> It is how the requests and responses make their way between the L1 and
>> L2 caches;
>> ...
>>
>> Each CPU core basically only has a limited number of IO connections:
>>    Its Ringbus inputs/outputs;
>>    Clock, Reset inputs;
>>    Some "Debug Status LED" outputs.
>>
>> Beyond just memory requests, things like IRQs and Hardware-RNG stuff
>> are also handled via the ringbus.
>
> Makes sense.
>
>
>> Implicitly, the rings are divided into two subsets:
>>    L1 Ring: Holds L1 caches and TLB, within the CPU core;
>>    L2 Ring: Everything external to the CPU core.
>>      L2 Cache, ROM, VRAM module, MMIO Bus interface, ... all sit here.
>
> Now I am confused again, by the word "subsets".  Is there a single ring
> with everything attached, or two separate rings, the L1 and the L2 rings
> that each presumably has a slot on the ring used to pass data between
> the two rings.  There are advantages and disadvantages with each choice.
>

I tried both.

I ended up using a single ring, but with a special-case "shortcut path"
around the CPU core (that may be taken if conditions are appropriate).

In this case, it is a single ring in a larger-scale sense, but there
are some differences: for the section covering the "L1 ring", the
address widens internally to 96 bits, with incoming messages having the
address zero-extended and outgoing messages truncated (no request with
a 96-bit virtual address may leave the L1 ring; it either gets
translated to a 48-bit physical address, or is redirected to a
predetermined "TLB Missed" configuration, mostly marking the request
page as non-cached and pointing it at a designated "ROM page full of
zeroes"; as a result, no valid information is lost once this truncation
occurs).

But the analogy in this case would be more like a horse-shoe or the
Omega symbol, where the inner part of the loop holds the L1 caches, but
at the bottom, there is a shortcut where a request can skip over the
core entirely (if the request has no reason to pass through the L1 ring).

Unrelated requests may still end up flowing through the L1 ring,
though, if there was already a message on the output side (messages
can't jump forward along the ring in cases where there is already
something arriving at the destination via the "normal" path).

The reason I didn't go with two separate rings and a "bridge point"
between them was mostly that it created a few issues: one effectively
ends up needing 2 extra cycles of latency to get a message from one
side to the other (versus 0 cycles with no skip-path, or 1 cycle with a
skip-path; on average, the skip-paths help more than they hurt).

The two-ring strategy also suffers significant degradation when one (or
both) rings begin to experience congestion (if either ring gets more
than about 50% full, performance tanks as messages can't cross over
effectively; even at 25-30% the degradation can start to be
noticeable).

In the case of a single ring with skip-paths, the "worst case" is merely
that the messages need to take the long way around the ring (so the bus
is much less adversely effected by message congestion, merely increasing
the average latency).

This causes a more gradual slowdown as traffic increases (though the
only real reason it is much of an issue in this case is that the VRAM
module tends to spam the bus with an endless stream of prefetch
requests during its raster sweep).

....

Re: Intel goes to 32 GPRs

 by: Stephen Fuld - Mon, 11 Sep 2023 18:56 UTC

On 9/11/2023 11:42 AM, BGB wrote:
> On 9/11/2023 1:09 PM, Stephen Fuld wrote:
>> On 9/11/2023 10:49 AM, BGB wrote:
>>> On 9/11/2023 11:54 AM, Stephen Fuld wrote:
>>>> On 9/11/2023 9:40 AM, BGB wrote:
>>>>> On 9/11/2023 9:23 AM, Stephen Fuld wrote:
>>>>>> On 9/10/2023 10:36 PM, BGB wrote:
>>>>>>
>>>>>> snip
>>>>>>
>>>>>>> I suspect my memory cache system works somewhat differently here.
>>>>>>>
>>>>>>> Typically, TLB only gets involved in L1 misses:
>>>>>>
>>>>>> Your terminology may be confusing.  When you use the term "L1"
>>>>>> (level one) cache, that implies the existence of a multi-level
>>>>>> cache, i.e. at least an L2, and perhaps an L3.  If, as implied by
>>>>>> the rest of your post, you have only a single level of cache, you
>>>>>> would just use the term "cache", without the level number (You may
>>>>>> still use I$ or D$ if, as you seem to, have separate caches for
>>>>>> instructions and data).  You say your cache uses virtual
>>>>>> addresses. In that case, your description of when a TLB lookup is
>>>>>> required fits.  But some designs, particularly those with multiple
>>>>>> levels use physical addressing (primarily to prevent aliasing
>>>>>> problems) for at least the lower (i.e. L2 or L3) levels, in which
>>>>>> case, a TLB lookup is required before the cache lookup.
>>>>>>
>>>>>
>>>>> There are multiple cache levels.
>>>>>    L1 Instruction and Data caches:
>>>>>      Usually configured as 16K or 32K at present.
>>>>>        16K is easier for FPGA timing.
>>>>>      Direct Mapped.
>>>>>      Tagged using virtual address.
>>>>>
>>>>>    L2 Cache:
>>>>>      Currently configured as 256K on the XC7A100T, 512K on the
>>>>> XC7A200T.
>>>>>      Also currently configured as direct-mapped.
>>>>>      Tagged using physical address.
>>>>>      L2 Talks directly / exclusively with the DDR RAM module.
>>>>
>>>>
>>>> OK, my mistake.  If there is an Li miss that hits in the L2, is the
>>>> ring bus involved?  Perhaps that was the source of my confusion.
>>>>
>>>
>>> Yes.
>>
>> OK, that explains it.
>>
>>
>>> The ringbus is what holds all of the memory-related modules together:
>>> It is how the L1's send requests to the TLB;
>>> It is how the requests and responses make their way between the L1
>>> and L2 caches;
>>> ...
>>>
>>> Each CPU core basically only has a limited number of IO connections:
>>>    Its Ringbus inputs/outputs;
>>>    Clock, Reset inputs;
>>>    Some "Debug Status LED" outputs.
>>>
>>> Beyond just memory requests, things like IRQs and Hardware-RNG stuff
>>> are also handled via the ringbus.
>>
>> Makes sense.
>>
>>
>>> Implicitly, the rings are divided into two subsets:
>>>    L1 Ring: Holds L1 caches and TLB, within the CPU core;
>>>    L2 Ring: Everything external to the CPU core.
>>>      L2 Cache, ROM, VRAM module, MMIO Bus interface, ... all sit here.
>>
>> Now I am confused again, by the word "subsets".  Is there a single
>> ring with everything attached, or two separate rings, the L1 and the
>> L2 rings that each presumably has a slot on the ring used to pass data
>> between the two rings.  There are advantages and disadvantages with
>> each choice.
>>
>
> I tried both.
>
> I ended up using a single ring, but with a special-case "shortcut path"
> around the CPU core (that may be taken if conditions are appropriate).

Got it. Thanks.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

