devel / comp.arch / Re: Concertina II Progress

Re: Concertina II Progress

<ujrnq0$2lqen$2@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35207&group=comp.arch#35207
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Concertina II Progress
Date: Fri, 24 Nov 2023 20:57:03 -0600
Organization: A noiseless patient Spider
Lines: 39
Message-ID: <ujrnq0$2lqen$2@dont-email.me>
References: <uigus7$1pteb$1@dont-email.me> <ujg54v$c6r4$1@dont-email.me>
<ujgrel$h32p$1@dont-email.me>
<57b4666649236a3e79cd04773a76f7ee@news.novabbs.com>
<ujjt01$16pav$1@dont-email.me>
<41d9a7b20ac6da242578c6a53758f625@news.novabbs.com>
<jwvedghqxha.fsf-monnier+comp.arch@gnu.org>
<6e757154a4c6aa8e070975c534c0fda8@news.novabbs.com>
<OSO7N.18767$_z86.2860@fx46.iad>
<e95ce095c8dccb161be2b4fad0697aea@news.novabbs.com>
<jwvwmu7ouzf.fsf-monnier+comp.arch@gnu.org> <0Pa8N.2233$PJoc.1323@fx04.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 25 Nov 2023 02:57:04 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="7453bc91ed922c1bfb3262e310d4156c";
logging-data="2812375"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/hA+36IMFEVPjFrIKpgt2l"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:mL21DBCO38E4i0Kbypw3Rv+Pzjs=
Content-Language: en-US
In-Reply-To: <0Pa8N.2233$PJoc.1323@fx04.iad>
 by: BGB - Sat, 25 Nov 2023 02:57 UTC

On 11/24/2023 6:01 PM, Scott Lurndal wrote:
> Stefan Monnier <monnier@iro.umontreal.ca> writes:
>>> My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}
>>
>> Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?
>
> Would require priority decoders to differentiate rather
> than simple gates, probably.
>
> Although I wonder at the missing firmware privilege level, a la SMM or EL3.
>
> ARM added support for nested hypervisors without adding a
> new exception level. Although interesting, there isn't much
> evidence of it being used in production. Yet anyway.

It seems to me, one should be able to get away with 3 modes:
Machine / ISR;
Supervisor;
User.

With pretty much anything that isn't "bare metal" being put in User Mode
(potentially using emulation traps as needed).

Something like a Soft-TLB or Inverted-Page-Table does not need any
special hardware to support nested translation (whereas hardware
page-walking would require dedicated support).

Not entirely sure how multi-level virtualization works with page-tables,
but works "somehow".

Then again, it is possible that doing everything in software could lead
to people working in inner levels being jealous of those working in the
outer levels for being closer to the hardware (and thus presumably
having lower performance overheads).

....

Re: Concertina II Progress

<6Gp8N.22586$yAie.1862@fx44.iad>

https://news.novabbs.org/devel/article-flat.php?id=35213&group=comp.arch#35213
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx44.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Concertina II Progress
Newsgroups: comp.arch
References: <uigus7$1pteb$1@dont-email.me> <ujgrel$h32p$1@dont-email.me> <57b4666649236a3e79cd04773a76f7ee@news.novabbs.com> <ujjt01$16pav$1@dont-email.me> <41d9a7b20ac6da242578c6a53758f625@news.novabbs.com> <jwvedghqxha.fsf-monnier+comp.arch@gnu.org> <6e757154a4c6aa8e070975c534c0fda8@news.novabbs.com> <OSO7N.18767$_z86.2860@fx46.iad> <e95ce095c8dccb161be2b4fad0697aea@news.novabbs.com> <jwvwmu7ouzf.fsf-monnier+comp.arch@gnu.org> <0Pa8N.2233$PJoc.1323@fx04.iad> <ujrnq0$2lqen$2@dont-email.me>
Lines: 45
Message-ID: <6Gp8N.22586$yAie.1862@fx44.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Sat, 25 Nov 2023 16:55:30 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Sat, 25 Nov 2023 16:55:30 GMT
X-Received-Bytes: 2730
 by: Scott Lurndal - Sat, 25 Nov 2023 16:55 UTC

BGB <cr88192@gmail.com> writes:
>On 11/24/2023 6:01 PM, Scott Lurndal wrote:
>> Stefan Monnier <monnier@iro.umontreal.ca> writes:
>>>> My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}
>>>
>>> Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?
>>
>> Would require priority decoders to differentiate rather
>> than simple gates, probably.
>>
>> Although I wonder at the missing firmware privilege level, a la SMM or EL3.
>>
>> ARM added support for nested hypervisors without adding a
>> new exception level. Although interesting, there isn't much
>> evidence of it being used in production. Yet anyway.
>
>
>It seems to me, one should be able to get away with 3 modes:
> Machine / ISR;
> Supervisor;
> User.
>
>With pretty much anything that isn't "bare metal" being put in User Mode
>(potentially using emulation traps as needed).
>
>Something like a Soft-TLB or Inverted-Page-Table does not need any
>special hardware to support nested translation (whereas hardware
>page-walking would require dedicated support).

It's been tried. And performance sucked big-time. The reason
that AMD added back support for the DS limit register in AMD64
was to support xen (and vmware) before Pacifica (the AMD project
that became Secure Virtual Machine (SVM) known now as AMD-V).

Both Intel and AMD use a block of memory to record guest state
and have instructions to enter and leave VM mode (e.g. VMLAUNCH/VMRESUME
on Intel, VMRUN on AMD);
ARM stores guest state in system registers - less overhead
when switching from guest to host or guest to guest.

>
>Not entirely sure how multi-level virtualization works with page-tables,
>but works "somehow".

But not well, nor performant.

Re: Concertina II Progress

<ujtei9$2t545$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35219&group=comp.arch#35219
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Concertina II Progress
Date: Sat, 25 Nov 2023 12:31:36 -0600
Organization: A noiseless patient Spider
Lines: 84
Message-ID: <ujtei9$2t545$1@dont-email.me>
References: <uigus7$1pteb$1@dont-email.me> <ujgrel$h32p$1@dont-email.me>
<57b4666649236a3e79cd04773a76f7ee@news.novabbs.com>
<ujjt01$16pav$1@dont-email.me>
<41d9a7b20ac6da242578c6a53758f625@news.novabbs.com>
<jwvedghqxha.fsf-monnier+comp.arch@gnu.org>
<6e757154a4c6aa8e070975c534c0fda8@news.novabbs.com>
<OSO7N.18767$_z86.2860@fx46.iad>
<e95ce095c8dccb161be2b4fad0697aea@news.novabbs.com>
<jwvwmu7ouzf.fsf-monnier+comp.arch@gnu.org> <0Pa8N.2233$PJoc.1323@fx04.iad>
<ujrnq0$2lqen$2@dont-email.me> <6Gp8N.22586$yAie.1862@fx44.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 25 Nov 2023 18:31:38 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="7453bc91ed922c1bfb3262e310d4156c";
logging-data="3052677"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19v2WBr7DYBZ1DJ8BTJNDwx"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:/MHBzKYhkQltdXRfQdfGnLjORvk=
Content-Language: en-US
In-Reply-To: <6Gp8N.22586$yAie.1862@fx44.iad>
 by: BGB - Sat, 25 Nov 2023 18:31 UTC

On 11/25/2023 10:55 AM, Scott Lurndal wrote:
> BGB <cr88192@gmail.com> writes:
>> On 11/24/2023 6:01 PM, Scott Lurndal wrote:
>>> Stefan Monnier <monnier@iro.umontreal.ca> writes:
>>>>> My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}
>>>>
>>>> Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?
>>>
>>> Would require priority decoders to differentiate rather
>>> than simple gates, probably.
>>>
>>> Although I wonder at the missing firmware privilege level, a la SMM or EL3.
>>>
>>> ARM added support for nested hypervisors without adding a
>>> new exception level. Although interesting, there isn't much
>>> evidence of it being used in production. Yet anyway.
>>
>>
>> It seems to me, one should be able to get away with 3 modes:
>> Machine / ISR;
>> Supervisor;
>> User.
>>
>> With pretty much anything that isn't "bare metal" being put in User Mode
>> (potentially using emulation traps as needed).
>>
>> Something like a Soft-TLB or Inverted-Page-Table does not need any
>> special hardware to support nested translation (whereas hardware
>> page-walking would require dedicated support).
>
> It's been tried. And performance sucked big-time. The reason
> that AMD added back support for the DS limit register in AMD64
> was to support xen (and vmware) before Pacifica (the AMD project
> that became Secure Virtual Machine (SVM) known now as AMD-V).
>

OK.

I wouldn't expect nested inverted-page-table translation to be *that*
much slower than normal inverted page tables. Though, would add a bit of
multi-level translation wonk in the top-level miss handler (and likely
still better than multi-level soft-TLB, where a miss in the outer TLB
level means needing to propagate the interrupt inwards and then
emulating it the whole way up).

Granted, there is still the annoyance that the OS's tend to deal with
page-tables, and one needs to translate to inverted page tables, which
typically have a finite associativity (such as 4 or 8 way).
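
To make the two paragraphs above concrete, here is a minimal sketch of a top-level software miss handler that composes a guest mapping with a host mapping and caches the result in a small set-associative table. Everything in it is an assumption made for illustration: the names (SetAssocTable, translate), the dict-based page tables, the LRU replacement and the 4-way default are hypothetical, and none of it comes from BGB's actual implementation.

```python
from collections import OrderedDict

PAGE_SHIFT = 12  # assumed 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class SetAssocTable:
    """Hypothetical N-way set-associative translation cache (LRU per set)."""

    def __init__(self, num_sets=64, ways=4):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def lookup(self, vpn):
        entries = self.sets[vpn % self.num_sets]
        if vpn in entries:
            entries.move_to_end(vpn)  # mark as most recently used
            return entries[vpn]
        return None

    def insert(self, vpn, pfn):
        entries = self.sets[vpn % self.num_sets]
        if vpn in entries:
            entries.move_to_end(vpn)
        elif len(entries) >= self.ways:
            entries.popitem(last=False)  # finite associativity: evict the LRU way
        entries[vpn] = pfn


def translate(table, guest_pt, host_pt, va):
    """Translate a guest virtual address to a host physical address.

    guest_pt maps guest virtual page -> guest physical page,
    host_pt maps guest physical page -> host physical page.
    A KeyError stands in for a page fault at the corresponding level.
    """
    vpn = va >> PAGE_SHIFT
    pfn = table.lookup(vpn)
    if pfn is None:
        # Miss: do both translation stages in the one top-level handler.
        guest_pfn = guest_pt[vpn]
        pfn = host_pt[guest_pfn]
        table.insert(vpn, pfn)
    return (pfn << PAGE_SHIFT) | (va & PAGE_MASK)
```

The point of the sketch is only that the composition happens in one handler, so a hit costs the same as in the non-nested case; it says nothing about how a guest OS that expects to manage its own page tables would be kept unaware of this.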

Would mean that multi-level interrupt handling would still be needed
whenever the page isn't in the guest's TLB or VIPT (short of breaking
abstraction and faking the use of hardware page walking for the guest OS's).

Granted, full soft TLB isn't ideal for performance either (in general),
my workaround was mostly making the TLB big enough that the average-case
miss rate is kept fairly low (well, and for now, putting the whole OS in
one big address space).

But, multiple address spaces is sort of the whole point of VMs, so...

Seems like one might need a mechanism to remap the VM from real CR's to
a partially emulated set of CR's (VCR's ?...).

> Both intel and AMD use a block of memory to record guest state
> and have instructions to enter and leave VM mode (e.g. vmenter);
> ARM stores guest state in system registers - less overhead
> when switching from guest to host or guest to guest.
>

OK.

>>
>> Not entirely sure how multi-level virtualization works with page-tables,
>> but works "somehow".
>
> But not well, nor performant.
>

As far as I know, the whole "nested page tables" was the core of how
virtualization worked on x86...

Re: Concertina II Progress

<SVr8N.40009$zh16.19369@fx48.iad>

https://news.novabbs.org/devel/article-flat.php?id=35220&group=comp.arch#35220
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx48.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Concertina II Progress
Newsgroups: comp.arch
References: <uigus7$1pteb$1@dont-email.me> <ujjt01$16pav$1@dont-email.me> <41d9a7b20ac6da242578c6a53758f625@news.novabbs.com> <jwvedghqxha.fsf-monnier+comp.arch@gnu.org> <6e757154a4c6aa8e070975c534c0fda8@news.novabbs.com> <OSO7N.18767$_z86.2860@fx46.iad> <e95ce095c8dccb161be2b4fad0697aea@news.novabbs.com> <jwvwmu7ouzf.fsf-monnier+comp.arch@gnu.org> <0Pa8N.2233$PJoc.1323@fx04.iad> <ujrnq0$2lqen$2@dont-email.me> <6Gp8N.22586$yAie.1862@fx44.iad> <ujtei9$2t545$1@dont-email.me>
Lines: 131
Message-ID: <SVr8N.40009$zh16.19369@fx48.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Sat, 25 Nov 2023 19:28:50 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Sat, 25 Nov 2023 19:28:50 GMT
X-Received-Bytes: 6913
 by: Scott Lurndal - Sat, 25 Nov 2023 19:28 UTC

BGB <cr88192@gmail.com> writes:
>On 11/25/2023 10:55 AM, Scott Lurndal wrote:
>> BGB <cr88192@gmail.com> writes:
>>> On 11/24/2023 6:01 PM, Scott Lurndal wrote:
>>>> Stefan Monnier <monnier@iro.umontreal.ca> writes:
>>>>>> My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}
>>>>>
>>>>> Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?
>>>>
>>>> Would require priority decoders to differentiate rather
>>>> than simple gates, probably.
>>>>
>>>> Although I wonder at the missing firmware privilege level, a la SMM or EL3.
>>>>
>>>> ARM added support for nested hypervisors without adding a
>>>> new exception level. Although interesting, there isn't much
>>>> evidence of it being used in production. Yet anyway.
>>>
>>>
>>> It seems to me, one should be able to get away with 3 modes:
>>> Machine / ISR;
>>> Supervisor;
>>> User.
>>>
>>> With pretty much anything that isn't "bare metal" being put in User Mode
>>> (potentially using emulation traps as needed).
>>>
>>> Something like a Soft-TLB or Inverted-Page-Table does not need any
>>> special hardware to support nested translation (whereas hardware
>>> page-walking would require dedicated support).
>>
>> It's been tried. And performance sucked big-time. The reason
>> that AMD added back support for the DS limit register in AMD64
>> was to support xen (and vmware) before Pacifica (the AMD project
>> that became Secure Virtual Machine (SVM) known now as AMD-V).
>>
>
>OK.
>
>I wouldn't expect nested inverted-page-table translation to be *that*
>much slower than normal inverted page tables. Though, would add a bit of
>multi-level translation wonk in the top-level miss handler (and likely
>still better than multi-level soft-TLB, where a miss in the outer TLB
>level means needing to propagate the interrupt inwards and then
>emulating it the whole way up).

Let's look at a hardware example of a nested page table walk,
using the AMD nested page table feature as a guide. The AMD
version uses the same PTE format as the non-virtualized page
tables (which reduces the amount of kernel code required to
manage the page tables) unlike Intel's EPT.

Assuming 4k-byte pages in both the primary and nested page tables,
a page table walk must make 22 memory accesses to satisfy a
VA to PA translation, versus only four in a non-virtualized
table walk. This can be reduced to 11 if you have the luxury
of using 1GB mappings in the nested page table.
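
As a rough cross-check of the counts above, here is a minimal sketch of the usual textbook model of a two-dimensional walk: every guest table pointer is a guest-physical address that itself needs a nested walk before the guest entry can be read, and the final guest-physical address needs one more nested walk. The function names and the model are assumptions for illustration, not AMD's specification. With four levels on both sides this model gives 24 accesses, and 14 when the nested table terminates after two levels (1GB mappings), which is in the same range as the 22 and 11 quoted above but not identical; the exact figure depends on what is counted and on what the walker caches.

```python
def native_walk_accesses(levels: int) -> int:
    """Memory accesses for a non-virtualized walk: one per table level."""
    return levels


def nested_walk_accesses(guest_levels: int, nested_levels: int) -> int:
    """Memory accesses for a two-dimensional (guest x nested) walk.

    Assumed model: reading each guest-level entry costs a full nested walk
    (to locate the guest table in host memory) plus the read itself, and the
    final guest-physical address costs one more nested walk.
    """
    per_guest_level = nested_levels + 1
    final_translation = nested_levels
    return guest_levels * per_guest_level + final_translation


if __name__ == "__main__":
    print("native, 4 levels:", native_walk_accesses(4))
    print("nested, 4 x 4 levels:", nested_walk_accesses(4, 4))
    print("nested, 4 x 2 levels (1GB nested pages):", nested_walk_accesses(4, 2))
```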

Performing all those accesses in a kernel fault handler would
consume a great deal more time than a hardware table walker will (particularly
if the hardware table walkers can cache the intermediate results
of the higher-level blocks in the walk in the walk hardware).

The downsides of IPT pretty much preclude their use in most
modern operating systems where shared memory between processes
is common (explicitly -or- implicitly (such as VDSO on Linux));
some of the goals listed as benefits for IPT (e.g. easier whole
process swapping) are made irrelevant by modern operating
systems that don't do that. There's a rather incoherent
description of IPT at geeksforgeeks - I'd not recommend it
as a useful resource.

>Would mean that multi-level interrupt handling would still be needed
>whenever the page isn't in the guest's TLB or VIPT (short of breaking
>abstraction and faking the use of hardware page walking for the guest OS's).

If you're taking an interrupt to resolve guest TLB misses,
performance is clearly not a high priority.

>
>Seems like one might need a mechanism to remap the VM from real CR's to
>a partially emulated set of CR's (VCR's ?...).

ARM does this by adding a layer above the OS ring that can trap
accesses to certain control registers used by the OS to the
hypervisor for resolution. But for the most part, the guest just
uses the same control registers as if it were running bare metal with
no trapping - they're just loaded by the hypervisor before the guest
is dispatched and saved by the hypervisor when scheduling a new
guest. That's an advantage of the exception level scheme, where
each level has its own set of control registers.

However, a shortcoming of the initial implementation was that if the
hypervisor was type II, the hypervisor needed to have a special
privileged guest to run standard user-mode code[*]. So they
added (in ARMv8.1) the Virtualization Host Extensions (VHE), which allow
the hypervisor exception level (EL2) to directly dispatch
user-mode code to EL0 (with the normal traps from user mode
to the OS directed to the hypervisor instead of a guest OS). This
lets the hypervisor (e.g. KVM) act both as a hypervisor and
a guest OS without the context switches required to support a
privileged guest.

[*] And also to provide VFIO support for non-SRIOV hardware devices.

>> But not well, nor performant.
>>
>
>As far as I know, the whole "nested page tables" was the core of how
>virtualization worked on x86...

Before AMD added NPT (Nested Page Tables), the hypervisor needed to
be able to recognize and trap any accesses from the guest OS to
its own page tables and update the real page tables accordingly.
To do that, they had several options:
1) Paravirtualization (i.e. all guest page table ops call the
hypervisor rather than changing the page tables directly);
Xen did this.
2) Write-protecting the page tables and trapping any writes in
the hypervisor. Difficult to do since the page tables in
common OSes are not allocated contiguously and they are updated
using normal loads and stores (the HV does know where they are,
however, as it can trap writes to CR3 and from there can
write-protect the entire table in the real page tables).
3) Binary patch the guest operating system. This was the approach used
by VMware before AMD introduced NPT.
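
Option 2 above can be sketched as a toy model. This is a simulation under stated assumptions, not how any real hypervisor is written: the class and method names (ShadowPager, guest_write_pte) are hypothetical, page tables are flat dicts rather than multi-level trees, and the "trap" is just a method call that counts itself.

```python
PAGE_SHIFT = 12  # assumed 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class ShadowPager:
    """Toy model of shadow paging driven by write-protected guest tables."""

    def __init__(self, host_map):
        self.host_map = host_map  # guest physical page -> host physical page
        self.guest_pt = {}        # guest virtual page -> guest physical page
        self.shadow_pt = {}       # guest virtual page -> host physical page
        self.traps = 0

    def guest_write_pte(self, gvpn, gpfn):
        # The guest's page-table pages are write-protected in the real
        # tables, so every store to them faults into the hypervisor.
        self.traps += 1
        self._on_write_trap(gvpn, gpfn)

    def _on_write_trap(self, gvpn, gpfn):
        # Emulate the guest's store, then bring the shadow entry in line.
        if gpfn is None:
            self.guest_pt.pop(gvpn, None)
            self.shadow_pt.pop(gvpn, None)
            return
        self.guest_pt[gvpn] = gpfn
        if gpfn in self.host_map:
            self.shadow_pt[gvpn] = self.host_map[gpfn]
        else:
            # No host backing yet: leave it unmapped so a later access faults.
            self.shadow_pt.pop(gvpn, None)

    def translate(self, gva):
        # The hardware only ever walks the shadow table.
        hpfn = self.shadow_pt[gva >> PAGE_SHIFT]
        return (hpfn << PAGE_SHIFT) | (gva & PAGE_MASK)
```

The cost that nested paging removes is visible in the trap counter: every guest page-table update, however trivial, becomes a round trip through the hypervisor.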

>

Re: Concertina II Progress

<4b4da7d2d8a3ab26e98b838cf5deb9a9@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=35221&group=comp.arch#35221
Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Concertina II Progress
Date: Sat, 25 Nov 2023 19:27:04 +0000
Organization: novaBBS
Message-ID: <4b4da7d2d8a3ab26e98b838cf5deb9a9@news.novabbs.com>
References: <uigus7$1pteb$1@dont-email.me> <ujgrel$h32p$1@dont-email.me> <57b4666649236a3e79cd04773a76f7ee@news.novabbs.com> <ujjt01$16pav$1@dont-email.me> <41d9a7b20ac6da242578c6a53758f625@news.novabbs.com> <jwvedghqxha.fsf-monnier+comp.arch@gnu.org> <6e757154a4c6aa8e070975c534c0fda8@news.novabbs.com> <OSO7N.18767$_z86.2860@fx46.iad> <e95ce095c8dccb161be2b4fad0697aea@news.novabbs.com> <jwvwmu7ouzf.fsf-monnier+comp.arch@gnu.org> <0Pa8N.2233$PJoc.1323@fx04.iad> <ujrnq0$2lqen$2@dont-email.me> <6Gp8N.22586$yAie.1862@fx44.iad> <ujtei9$2t545$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="2097450"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
X-Rslight-Site: $2y$10$y57xrkj.KuDkqfgrTAO7ROvjS6cj8.xf89zazkQJq7ZN6Qnj2l1x2
 by: MitchAlsup - Sat, 25 Nov 2023 19:27 UTC

BGB wrote:

> On 11/25/2023 10:55 AM, Scott Lurndal wrote:
>> BGB <cr88192@gmail.com> writes:
>>> On 11/24/2023 6:01 PM, Scott Lurndal wrote:
>>>> Stefan Monnier <monnier@iro.umontreal.ca> writes:
>>>>>> My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}
>>>>>
>>>>> Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?
>>>>
>>>> Would require priority decoders to differentiate rather
>>>> than simple gates, probably.
>>>>
>>>> Although I wonder at the missing firmware privilege level, a la SMM or EL3.
>>>>
>>>> ARM added support for nested hypervisors without adding a
>>>> new exception level. Although interesting, there isn't much
>>>> evidence of it being used in production. Yet anyway.
>>>
>>>
>>> It seems to me, one should be able to get away with 3 modes:
>>> Machine / ISR;
>>> Supervisor;
>>> User.
>>>
>>> With pretty much anything that isn't "bare metal" being put in User Mode
>>> (potentially using emulation traps as needed).
>>>
>>> Something like a Soft-TLB or Inverted-Page-Table does not need any
>>> special hardware to support nested translation (whereas hardware
>>> page-walking would require dedicated support).
>>
>> It's been tried. And performance sucked big-time. The reason
>> that AMD added back support for the DS limit register in AMD64
>> was to support xen (and vmware) before Pacifica (the AMD project
>> that became Secure Virtual Machine (SVM) known now as AMD-V).
>>

> OK.

> I wouldn't expect nested inverted-page-table translation to be *that*
> much slower than normal inverted page tables. Though, would add a bit of
> multi-level translation wonk in the top-level miss handler (and likely
> still better than multi-level soft-TLB, where a miss in the outer TLB
> level means needing to propagate the interrupt inwards and then
> emulating it the whole way up).

Think of it like this:: Privilege inversion::

If HV is performing table walks on behalf of Guest OS, HV is having to
rummage through Guest OS tables and then rummage through HV's own tables.
Here having HV rummage through Guest OS tables is more than a hassle:
nothing in HV should directly touch anything in Guest OS unless Guest
OS grants access (and not implicitly, as it is here).

What you REALLY want is for Guest OS to manage its own tables and HV to
manage its own tables. That way no single piece of SW has to operate
with the privileges of both Guest OS and HV; it can be one or the
other.

The above holds for any kind of tables, nested, inverted, nested inverted,
...

> Granted, there is still the annoyance that the OS's tend to deal with
> page-tables, and one needs to translate to inverted page tables, which
> typically have a finite associativity (such as 4 or 8 way).

> Would mean that multi-level interrupt handling would still be needed
> whenever the page isn't in the guest's TLB or VIPT (short of breaking
> abstraction and faking the use of hardware page walking for the guest OS's).

> Granted, full soft TLB isn't ideal for performance either (in general),
> my workaround was mostly making the TLB big enough that the average-case
> miss rate is kept fairly low (well, and for now, putting the whole OS in
> one big address space).

> But, multiple address spaces is sort of the whole point of VMs, so...

> Seems like one might need a mechanism to remap the VM from real CR's to
> a partially emulated set of CR's (VCR's ?...).

>> Both intel and AMD use a block of memory to record guest state
>> and have instructions to enter and leave VM mode (e.g. vmenter);
>> ARM stores guest state in system registers - less overhead
>> when switching from guest to host or guest to guest.
>>

> OK.

>>>
>>> Not entirely sure how multi-level virtualization works with page-tables,
>>> but works "somehow".
>>
>> But not well, nor performant.
>>

> As far as I know, the whole "nested page tables" was the core of how
> virtualization worked on x86...

Re: Concertina II Progress

<ujtpjb$2unm2$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35231&group=comp.arch#35231
Path: i2pn2.org!i2pn.org!news.furie.org.uk!usenet.goja.nl.eu.org!weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Concertina II Progress
Date: Sat, 25 Nov 2023 15:39:53 -0600
Organization: A noiseless patient Spider
Lines: 212
Message-ID: <ujtpjb$2unm2$1@dont-email.me>
References: <uigus7$1pteb$1@dont-email.me> <ujjt01$16pav$1@dont-email.me>
<41d9a7b20ac6da242578c6a53758f625@news.novabbs.com>
<jwvedghqxha.fsf-monnier+comp.arch@gnu.org>
<6e757154a4c6aa8e070975c534c0fda8@news.novabbs.com>
<OSO7N.18767$_z86.2860@fx46.iad>
<e95ce095c8dccb161be2b4fad0697aea@news.novabbs.com>
<jwvwmu7ouzf.fsf-monnier+comp.arch@gnu.org> <0Pa8N.2233$PJoc.1323@fx04.iad>
<ujrnq0$2lqen$2@dont-email.me> <6Gp8N.22586$yAie.1862@fx44.iad>
<ujtei9$2t545$1@dont-email.me> <SVr8N.40009$zh16.19369@fx48.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 25 Nov 2023 21:39:55 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="7453bc91ed922c1bfb3262e310d4156c";
logging-data="3104450"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+IHYMaTKsHGC37iGCxkWsP"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:W0JgLFMvhNAUrv681CW806FzLfk=
Content-Language: en-US
In-Reply-To: <SVr8N.40009$zh16.19369@fx48.iad>
 by: BGB - Sat, 25 Nov 2023 21:39 UTC

On 11/25/2023 1:28 PM, Scott Lurndal wrote:
> BGB <cr88192@gmail.com> writes:
>> On 11/25/2023 10:55 AM, Scott Lurndal wrote:
>>> BGB <cr88192@gmail.com> writes:
>>>> On 11/24/2023 6:01 PM, Scott Lurndal wrote:
>>>>> Stefan Monnier <monnier@iro.umontreal.ca> writes:
>>>>>>> My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}
>>>>>>
>>>>>> Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?
>>>>>
>>>>> Would require priority decoders to differentiate rather
>>>>> than simple gates, probably.
>>>>>
>>>>> Although I wonder at the missing firmware privilege level, a la SMM or EL3.
>>>>>
>>>>> ARM added support for nested hypervisors without adding a
>>>>> new exception level. Although interesting, there isn't much
>>>>> evidence of it being used in production. Yet anyway.
>>>>
>>>>
>>>> It seems to me, one should be able to get away with 3 modes:
>>>> Machine / ISR;
>>>> Supervisor;
>>>> User.
>>>>
>>>> With pretty much anything that isn't "bare metal" being put in User Mode
>>>> (potentially using emulation traps as needed).
>>>>
>>>> Something like a Soft-TLB or Inverted-Page-Table does not need any
>>>> special hardware to support nested translation (whereas hardware
>>>> page-walking would require dedicated support).
>>>
>>> It's been tried. And performance sucked big-time. The reason
>>> that AMD added back support for the DS limit register in AMD64
>>> was to support xen (and vmware) before Pacifica (the AMD project
>>> that became Secure Virtual Machine (SVM) known now as AMD-V).
>>>
>>
>> OK.
>>
>> I wouldn't expect nested inverted-page-table translation to be *that*
>> much slower than normal inverted page tables. Though, would add a bit of
>> multi-level translation wonk in the top-level miss handler (and likely
>> still better than multi-level soft-TLB, where a miss in the outer TLB
>> level means needing to propagate the interrupt inwards and then
>> emulating it the whole way up).
>
> Let's look at a hardware example of a nested page table walk,
> using the AMD nested page table feature as a guide. The AMD
> version uses the same PTE format as the non-virtualized page
> tables (which reduces the amount of kernel code required to
> manage the page tables) unlike Intel's EPT.
>
> Assuming 4k-byte pages in both the primary and nested page tables,
> a page table walk must make 22 memory accesses to satisfy a
> VA to PA translation, versus only four in a non-virtualized
> table walk. This can be reduced to 11 if you have the luxury
> of using 1GB mappings in the nested page table.
>
> Performing all those accesses in a kernel fault handler would
> consume a great deal more time than a hardware table walker will (particularly
> if the hardware table walkers can cache the intermediate results
> of the higher-level blocks in the walk in the walk hardware).
>

OK.

> The downsides of IPT pretty much preclude their use in most
> modern operating systems where shared memory between processes
> is common (explicitly -or- implicitly (such as VDSO on linux));
> some of the goals listed as benefits for IPT (e.g. easier whole
> process swapping) are made irrelevant by modern operating
> systems that don't do that. There's a rather incoherent
> description of IPT at geeksforgeeks - I'd not recommend it
> as a useful resource.
>

I was thinking of an IPT where one basically keeps entries from all of the
currently running processes in a shared IPT, mostly treating it like a big
memory-backed form of the TLB.

Though, sharing is a concern:
If you hash entries based on ASID, then there are fewer collisions, but
no sharing;
Sharing requires addressing to effectively be plain modulo within the
areas that can be shared.

Initially, I had assumed non-hashed modulo indexing, but this does mean
a potentially higher collision rate if different ASIDs have different
pages in the same overlapping address ranges.

Something like 8-way associativity would be "better" here at reducing
this issue, but more expensive to deal with in hardware than 4-way.
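
The trade-off between the two indexing schemes can be sketched as follows. This is only an illustration: the table size, the 16K page size, and the XOR "hash" are assumptions made for the example, not details of any real design.

```python
NUM_SETS = 1024   # hypothetical number of sets in the memory-backed table
PAGE_SHIFT = 14   # assumes 16K pages

def index_modulo(vaddr):
    """Set index from the virtual page number alone (no ASID involved)."""
    return (vaddr >> PAGE_SHIFT) % NUM_SETS

def index_hashed(vaddr, asid):
    """Set index with the ASID folded in (a simple XOR, purely illustrative)."""
    return ((vaddr >> PAGE_SHIFT) ^ asid) % NUM_SETS

shared_va = 0x00400000
# Modulo: every ASID probes the same set, so one entry could serve them all,
# but unrelated pages at the same address also compete for that set's ways.
print(index_modulo(shared_va))
# Hashed: each ASID probes a different set, so a shared page would need
# one entry per ASID.
print(index_hashed(shared_va, 1), index_hashed(shared_va, 2))
```

With modulo indexing the collision pressure lands on the associativity of each set, which is why the number of ways matters in the paragraph above.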

>
>> Would mean that multi-level interrupt handling would still be needed
>> whenever the page isn't in the guest's TLB or VIPT (short of breaking
>> abstraction and faking the use of hardware page walking for the guest OS's).
>
> If you're taking an interrupt, to resolve guest TLB misses,
> performance is clearly not high priority.
>

If one can stay under, say, 100-500 TLB misses per second (on a 50MHz
CPU), then the cost of the TLB miss handling is on par with other things
like handling the timer interrupt, etc...
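
As a rough back-of-the-envelope check, the overhead at these miss rates can be estimated like this. The 500 cycles per miss is an assumed figure for illustration only; the real cost depends on how many registers the handler saves and restores.

```python
def miss_overhead_fraction(misses_per_sec, cycles_per_miss, clock_hz=50_000_000):
    """Fraction of all CPU cycles spent inside the TLB miss handler."""
    return misses_per_sec * cycles_per_miss / clock_hz

# Assumed handler cost: 500 cycles per miss (hypothetical).
for rate in (100, 500):
    print(rate, f"{miss_overhead_fraction(rate, 500):.2%}")
```

Under that assumption, the handler would consume well under 1% of a 50MHz core at either rate.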

But, what one does need, is a way to perform context switches without
also triggering a huge wave of TLB misses in the process.

Big TLB + strategic sharing and ASIDs can help here at least (whereas, a
full TLB flush on context-switch would suck pretty bad).

But, doing traditional "every process gets its own address space" takes
a hit here (no good option other than to limit the task-switching
frequency, but this may become obvious to the user if the task switching
is too slow).

So, for something like a 50MHz core, this might mean, say, allowing a
process to run for up to 250ms before the preemptive task-switch
mechanism kicks in. But, 250ms is slow enough to become obvious to a
user (or, at least, much more so than, say, 100ms).
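
To see why a longer time slice helps, here is a minimal sketch of the worst case where a task switch leaves the TLB cold and every entry has to be refilled once. The 1024-entry TLB and the 500 cycles per miss are assumptions for the sake of the arithmetic.

```python
def refill_fraction(slice_ms, tlb_entries=1024, cycles_per_miss=500,
                    clock_hz=50_000_000):
    """Worst-case share of one time slice spent refilling a cold TLB."""
    slice_cycles = clock_hz * slice_ms / 1000
    return tlb_entries * cycles_per_miss / slice_cycles

for slice_ms in (250, 100):
    print(slice_ms, f"{refill_fraction(slice_ms):.1%}")
```

The refill cost is fixed per switch, so it is amortized over however many cycles the slice contains; halving the slice roughly doubles its relative weight.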

Though, probably still better than a purely cooperative scheduler, where
a process failing to call "thrd_yield()" effectively locks up the whole
rest of the system (in my GUI experiments, this effect results in, say,
Doom effectively locking up the whole GUI until it starts running the game
proper, where it then starts calling "thrd_yield()").
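
The lock-up effect can be shown with a toy cooperative scheduler. This is a sketch using Python generators as stand-ins for threads; the task names and the round-robin loop are hypothetical and only model the yield behaviour, not a real kernel.

```python
def polite(name, steps, log):
    """A task that yields after every unit of work."""
    for _ in range(steps):
        log.append(name)
        yield                  # stands in for a thrd_yield() call

def greedy(name, busy_steps, log):
    """A task that does a long stretch of work before its first yield."""
    for _ in range(busy_steps):
        log.append(name)       # nothing else can run during this loop
    yield

def run_round_robin(tasks):
    """Cooperative round-robin: a task keeps the CPU until it yields."""
    queue = list(tasks)
    while queue:
        task = queue.pop(0)
        try:
            next(task)
            queue.append(task)
        except StopIteration:
            pass

log = []
run_round_robin([polite("A", 2, log), greedy("G", 3, log), polite("B", 2, log)])
print(log)
```

Task B cannot run until G reaches its first yield, however long that takes; a preemptive timer bounds that wait instead.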

Though, might make sense to consider also being able to forcibly yield
threads on system calls and/or in some other C library calls.

Though, in these cases, will likely still need to start adding mutex
locks in some areas.

>
>>
>> Seems like one might need a mechanism to remap the VM from real CR's to
>> a partially emulated set of CR's (VCR's ?...).
>
> ARM does this by adding a layer above the OS ring that can trap
> accesses to certain control registers used by the OS to the
> hypervisor for resolution. But for the most part, the guest just
> uses the same control registers as if it were running bare metal with
> no trapping - they're just loaded by the hypervisor before the guest
> is dispatched and saved by the hypervisor when scheduling a new
> guest. That's an advantage of the exception level scheme, where
> each level has its own set of control registers.
>
> However, a shortcoming of the initial implementation was that if the
> hypervisor was type II, the hypervisor needed to have a special
> privileged guest to run standard user-mode code[*]. So they
> added (in V8.1) the Virtualization Host Extensions (VHE) which allowed
> the hypervisor exception level (EL2) to directly dispatch
> user-mode code to EL0 (with the normal traps from usermode
> to the OS directed to the hypervisor instead of a guest OS). This
> let the hypervisor (e.g. KVM) act both as a hypervisor and
> a guest OS without the context switches required to support a
> privileged guest.
>
> [*] And also to provide VFIO support for non-SRIOV hardware devices.
>

OK.

>
>>> But not well, nor performant.
>>>
>>
>> As far as I know, the whole "nested page tables" was the core of how
>> virtualization worked on x86...
>
> Before AMD added NPT (Nested Page Tables), the hypervisor needed to
> be able to recognize and trap any accesses from the guest OS to
> its own page tables and update the real page tables accordingly.
> To do that, they had several options:
> 1) Paravirtualization (i.e. all guest page table ops call the
> hypervisor rather than changing the page tables directly);
> Xen did this.
> 2) Write-protecting the page tables and trapping any writes in
> the hypervisor. Difficult to do since the page tables in
> common OS are not allocated contiguously and they are updated
> using normal loads and stores (the HV does know them, however,
> as it can trap writes to CR3 and from there can write-protect
> the entire table in the real page tables).
> 3) Binary patch the guest operating system. This was the approach used
> by VMware before AMD introduced NPT.
>


Re: Concertina II Progress

<Fhu8N.110789$BbXa.14700@fx16.iad>

https://news.novabbs.org/devel/article-flat.php?id=35233&group=comp.arch#35233

Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx16.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Concertina II Progress
Newsgroups: comp.arch
References: <uigus7$1pteb$1@dont-email.me> <jwvedghqxha.fsf-monnier+comp.arch@gnu.org> <6e757154a4c6aa8e070975c534c0fda8@news.novabbs.com> <OSO7N.18767$_z86.2860@fx46.iad> <e95ce095c8dccb161be2b4fad0697aea@news.novabbs.com> <jwvwmu7ouzf.fsf-monnier+comp.arch@gnu.org> <0Pa8N.2233$PJoc.1323@fx04.iad> <ujrnq0$2lqen$2@dont-email.me> <6Gp8N.22586$yAie.1862@fx44.iad> <ujtei9$2t545$1@dont-email.me> <SVr8N.40009$zh16.19369@fx48.iad> <ujtpjb$2unm2$1@dont-email.me>
Lines: 50
Message-ID: <Fhu8N.110789$BbXa.14700@fx16.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Sat, 25 Nov 2023 22:10:45 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Sat, 25 Nov 2023 22:10:45 GMT
X-Received-Bytes: 2904
 by: Scott Lurndal - Sat, 25 Nov 2023 22:10 UTC

BGB <cr88192@gmail.com> writes:
>On 11/25/2023 1:28 PM, Scott Lurndal wrote:

>>
>> If you're taking an interrupt, to resolve guest TLB misses,
>> performance is clearly not high priority.
>>
>
>If one can stay under, say, 100-500 TLB misses per second (on a 50MHz
>CPU), then the cost of the TLB miss handling is on par with other things
>like handling the timer interrupt, etc...

Any cycle used by the miss handler is a cycle that could
have been used for useful work. Timer interrupt handling
is often very short (increment a memory location, a comparison
and a return if no timer has expired). And we're long
past the days of using regular timer interrupts for scheduling
(see tickless kernels, for example).

>
>But, what one does need, is a way to perform context switches without
>also triggering a huge wave of TLB misses in the process.

Why?

Note that depending on the number of entries in your TLB
and the scheduler behavior, it's unlikely that any prior
TLB entries will be useful to a newly scheduled thread
(in a different address space).

Having multiple banks of TLBs that you can switch between
might be able to provide you with the capability to
reduce the TLB miss rate on scheduling a new thread of
execution - but CAMs aren't cheap.

For the most part, industry has settled on a large number
of tagged TLB entries as a good compromise. Some architectures have
a global bit in the entry that can be set via the page
table that indicates that ASID and/or VMID qualifications
aren't necessary for a hit.
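
A tagged lookup with a global bit can be sketched like this. The entry layout and function name are hypothetical, a real TLB does this match in parallel hardware, and VMID qualification is left out for brevity.

```python
from dataclasses import dataclass

@dataclass
class TlbEntry:
    vpn: int          # virtual page number
    asid: int         # address-space tag the entry was loaded under
    is_global: bool   # set from the page table; skips the ASID check
    ppn: int          # physical page number

def lookup(entries, vpn, current_asid):
    """Return the physical page number on a hit, or None on a miss."""
    for e in entries:
        if e.vpn == vpn and (e.is_global or e.asid == current_asid):
            return e.ppn
    return None

tlb = [
    TlbEntry(vpn=0x800, asid=0, is_global=True, ppn=0x10),   # kernel page
    TlbEntry(vpn=0x004, asid=1, is_global=False, ppn=0x20),  # process 1 page
]
print(lookup(tlb, 0x800, current_asid=2))  # global entry hits for any ASID
print(lookup(tlb, 0x004, current_asid=2))  # tagged entry misses for ASID 2
```

Entries from several address spaces coexist, so a context switch only changes `current_asid` rather than flushing anything.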

>
>Big TLB + strategic sharing and ASIDs can help here at least (whereas, a
>full TLB flush on context-switch would suck pretty bad).

That's unnecessarily harsh. Consider that on Intel/AMD/ARM the kernel half
of the virtual address space is shared by all processes - there's no reason
that those entries need to be flushed on context-switch.

Re: Concertina II Progress

<1ca3f30a0d0dc903ec9cdf498098e638@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=35244&group=comp.arch#35244

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Concertina II Progress
Date: Sun, 26 Nov 2023 01:50:39 +0000
Organization: novaBBS
Message-ID: <1ca3f30a0d0dc903ec9cdf498098e638@news.novabbs.com>
References: <uigus7$1pteb$1@dont-email.me> <ujjt01$16pav$1@dont-email.me> <41d9a7b20ac6da242578c6a53758f625@news.novabbs.com> <jwvedghqxha.fsf-monnier+comp.arch@gnu.org> <6e757154a4c6aa8e070975c534c0fda8@news.novabbs.com> <OSO7N.18767$_z86.2860@fx46.iad> <e95ce095c8dccb161be2b4fad0697aea@news.novabbs.com> <jwvwmu7ouzf.fsf-monnier+comp.arch@gnu.org> <0Pa8N.2233$PJoc.1323@fx04.iad> <ujrnq0$2lqen$2@dont-email.me> <6Gp8N.22586$yAie.1862@fx44.iad> <ujtei9$2t545$1@dont-email.me> <SVr8N.40009$zh16.19369@fx48.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="2124721"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Rslight-Site: $2y$10$x0ahwT573gfzPvDM0PZrMuSY6DCbkb7ST5POM8rJxIj5OsDqmfTlG
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
 by: MitchAlsup - Sun, 26 Nov 2023 01:50 UTC

Scott Lurndal wrote:

> BGB <cr88192@gmail.com> writes:
>>On 11/25/2023 10:55 AM, Scott Lurndal wrote:
>>> BGB <cr88192@gmail.com> writes:
>>>> On 11/24/2023 6:01 PM, Scott Lurndal wrote:
>>>>> Stefan Monnier <monnier@iro.umontreal.ca> writes:
>>>>>>> My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}
>>>>>>
>>>>>> Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?
>>>>>
>>>>> Would require priority decoders to differentiate rather
>>>>> than simple gates, probably.
>>>>>
>>>>> Although I wonder at the missing firmware privilege level, a la SMM or EL3.
>>>>>
>>>>> ARM added support for nested hypervisors without adding a
>>>>> new exception level. Although interesting, there isn't much
>>>>> evidence of it being used in production. Yet anyway.
>>>>
>>>>
>>>> It seems to me, one should be able to get away with 3 modes:
>>>> Machine / ISR;
>>>> Supervisor;
>>>> User.
>>>>
>>>> With pretty much anything that isn't "bare metal" being put in User Mode
>>>> (potentially using emulation traps as needed).
>>>>
>>>> Something like a Soft-TLB or Inverted-Page-Table does not need any
>>>> special hardware to support nested translation (whereas hardware
>>>> page-walking would require dedicated support).
>>>
>>> It's been tried. And performance sucked big-time. The reason
>>> that AMD added back support for the DS limit register in AMD64
>>> was to support xen (and vmware) before Pacifica (the AMD project
>>> that became Secure Virtual Machine (SVM) known now as AMD-V).
>>>
>>
>>OK.
>>
>>I wouldn't expect nested inverted-page-table translation to be *that*
>>much slower than normal inverted page tables. Though, would add a bit of
>>multi-level translation wonk in the top-level miss handler (and likely
>>still better than multi-level soft-TLB, where a miss in the outer TLB
>>level means needing to propagate the interrupt inwards and then
>>emulating it the whole way up).

> Let's look at a hardware example of a nested page table walk,
> using the AMD nested page table feature as a guide. The AMD
> version uses the same PTE format as the non-virtualized page
> tables (which reduces the amount of kernel code required to
> manage the page tables) unlike Intel's EPT.

> Assuming 4k-byte pages in both the primary and nested page tables,
> a page table walk must make 22 memory accesses to satisfy a
> VA to PA translation, versus only four in a non-virtualized
> table walk. This can be reduced to 11 if you have the luxury
> of using 1GB mappings in the nested page table.

20 of those 22 accesses are subject to caching of various flavors.
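
For a feel of how the reference count grows in a two-dimensional walk, here is a minimal sketch of one common accounting. The function name and counting model are assumptions for illustration: it ignores all caching, and it yields 24 and 14 for the two cases rather than exactly the 22 and 11 quoted above, since the total depends on which references are counted.

```python
def nested_walk_refs(guest_levels, nested_levels):
    """Memory references for one translation in a nested (2D) page walk.

    Every guest table pointer is a guest-physical address, so reading each
    of the guest_levels entries costs a nested walk (nested_levels
    references) plus the read itself. The final guest-physical address
    then needs one more nested walk.
    """
    return guest_levels * (nested_levels + 1) + nested_levels

print(nested_walk_refs(4, 0))  # no nesting: a plain four-level walk
print(nested_walk_refs(4, 4))  # 4K pages in both dimensions
print(nested_walk_refs(4, 2))  # 1GB mappings in the nested table
```

The multiplicative term is what the walk caches attack: most of the nested walks for upper-level guest tables repeat from one translation to the next.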

> Performing all those accesses in a kernel fault handler would
> consume a great deal more time than a hardware table walker will (particularly
> if the hardware table walkers can cache the intermediate results
> of the higher-level blocks in the walk in the walk hardware).

> The downsides of IPT pretty much preclude their use in most
> modern operating systems where shared memory between processes
> is common (explicitly -or- implicitly (such as VDSO on linux));
> some of the goals listed as benefits for IPT (e.g. easier whole
> process swapping) are made irrelevant by modern operating
> systems that don't do that. There's a rather incoherent
> description of IPT at geeksforgeeks - I'd not recommend it
> as a useful resource.

If you want to run any form of *nix you must design the center of
control at/in the CPU[s].....for better or worse.

>>Would mean that multi-level interrupt handling would still be needed
>>whenever the page isn't in the guest's TLB or VIPT (short of breaking
>>abstraction and faking the use of hardware page walking for the guest OS's).

> If you're taking an interrupt, to resolve guest TLB misses,
> performance is clearly not high priority.

>>
>>Seems like one might need a mechanism to remap the VM from real CR's to
>>a partially emulated set of CR's (VCR's ?...).

> ARM does this by adding a layer above the OS ring that can trap
> accesses to certain control registers used by the OS to the
> hypervisor for resolution. But for the most part, the guest just
> uses the same control registers as if it were running bare metal with
> no trapping - they're just loaded by the hypervisor before the guest
> is dispatched and saved by the hypervisor when scheduling a new
> guest. That's an advantage of the exception level scheme, where
> each level has its own set of control registers.

My 66000 memory maps control registers {CPU, LLC, NorthBridge,
device, ...} into MMI/O space. A CPU, with access permission,
can read or write another CPU's control registers--used sparingly
to get out of trouble. Mainly this is used to allow a CPU to
read or write device control registers.

> However, a shortcoming of the initial implementation was that if the
> hypervisor was type II, the hypervisor needed to have a special
> privileged guest to run standard user-mode code[*]. So they
> added (in V8.1) the Virtualization Host Extensions (VHE) which allowed
> the hypervisor exception level (EL2) to directly dispatch
> user-mode code to EL0 (with the normal traps from usermode
> to the OS directed to the hypervisor instead of a guest OS). This
> let the hypervisor (e.g. KVM) act both as a hypervisor and
> a guest OS without the context switches required to support a
> privileged guest.

> [*] And also to provide VFIO support for non-SRIOV hardware devices.

>>> But not well, nor performant.
>>>
>>
>>As far as I know, the whole "nested page tables" was the core of how
>>virtualization worked on x86...

> Before AMD added NPT (Nested Page Tables), the hypervisor needed to
> be able to recognize and trap any accesses from the guest OS to
> its own page tables and update the real page tables accordingly.
> To do that, they had several options:
> 1) Paravirtualization (i.e. all guest page table ops call the
> hypervisor rather than changing the page tables directly);
> Xen did this.
> 2) Write-protecting the page tables and trapping any writes in
> the hypervisor. Difficult to do since the page tables in
> common OS are not allocated contiguously and they are updated
> using normal loads and stores (the HV does know them, however,
> as it can trap writes to CR3 and from there can write-protect
> the entire table in the real page tables).
> 3) Binary patch the guest operating system. This was the approach used
> by VMware before AMD introduced NPT.

Nested Page Tables are the best solution (Fewest SW instructions of
overhead and total cycles of latency) we currently know of.

Re: Concertina II Progress

<uZJ8N.7151$Ycdc.529@fx09.iad>

https://news.novabbs.org/devel/article-flat.php?id=35247&group=comp.arch#35247

Path: i2pn2.org!rocksolid2!news.neodome.net!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx09.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Concertina II Progress
Newsgroups: comp.arch
References: <uigus7$1pteb$1@dont-email.me> <jwvedghqxha.fsf-monnier+comp.arch@gnu.org> <6e757154a4c6aa8e070975c534c0fda8@news.novabbs.com> <OSO7N.18767$_z86.2860@fx46.iad> <e95ce095c8dccb161be2b4fad0697aea@news.novabbs.com> <jwvwmu7ouzf.fsf-monnier+comp.arch@gnu.org> <0Pa8N.2233$PJoc.1323@fx04.iad> <ujrnq0$2lqen$2@dont-email.me> <6Gp8N.22586$yAie.1862@fx44.iad> <ujtei9$2t545$1@dont-email.me> <SVr8N.40009$zh16.19369@fx48.iad> <1ca3f30a0d0dc903ec9cdf498098e638@news.novabbs.com>
Lines: 36
Message-ID: <uZJ8N.7151$Ycdc.529@fx09.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Sun, 26 Nov 2023 16:01:30 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Sun, 26 Nov 2023 16:01:30 GMT
X-Received-Bytes: 2507
 by: Scott Lurndal - Sun, 26 Nov 2023 16:01 UTC

mitchalsup@aol.com (MitchAlsup) writes:
>Scott Lurndal wrote:
>

>>>Seems like one might need a mechanism to remap the VM from real CR's to
>>>a partially emulated set of CR's (VCR's ?...).
>
>> ARM does this by adding a layer above the OS ring that can trap
>> accesses to certain control registers used by the OS to the
>> hypervisor for resolution. But for the most part, the guest just
>> uses the same control registers as if it were running bare metal with
>> no trapping - they're just loaded by the hypervisor before the guest
>> is dispatched and saved by the hypervisor when scheduling a new
>> guest. That's an advantage of the exception level scheme, where
>> each level has its own set of control registers.
>
>My 66000 memory maps control registers {CPU, LLC, NorthBridge,
>device, ...} into MMI/O space. A CPU, with access permission,
>can read or write another CPU's control registers--used sparingly
>to get out of trouble. Mainly this is used to allow a CPU to
>read or write device control registers.

ARM supports access to CPU system registers via MMIO;
primarily for debug purposes. System Registers may be accessed
either via MMIO accesses from a running core, subject to
permission controls, or via JTAG interface(s).

The preferred way to access a core's own system registers is
via the MSR/MRS instructions.

<snip>

>Nested Page Tables are the best solution (Fewest SW instructions of
>overhead and total cycles of latency) we currently know of.

Indeed.

Re: Concertina II Progress

<496ce101508499530944f801b52ca8b6@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=35254&group=comp.arch#35254

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Concertina II Progress
Date: Sun, 26 Nov 2023 19:28:13 +0000
Organization: novaBBS
Message-ID: <496ce101508499530944f801b52ca8b6@news.novabbs.com>
References: <uigus7$1pteb$1@dont-email.me> <jwvedghqxha.fsf-monnier+comp.arch@gnu.org> <6e757154a4c6aa8e070975c534c0fda8@news.novabbs.com> <OSO7N.18767$_z86.2860@fx46.iad> <e95ce095c8dccb161be2b4fad0697aea@news.novabbs.com> <jwvwmu7ouzf.fsf-monnier+comp.arch@gnu.org> <0Pa8N.2233$PJoc.1323@fx04.iad> <ujrnq0$2lqen$2@dont-email.me> <6Gp8N.22586$yAie.1862@fx44.iad> <ujtei9$2t545$1@dont-email.me> <SVr8N.40009$zh16.19369@fx48.iad> <1ca3f30a0d0dc903ec9cdf498098e638@news.novabbs.com> <uZJ8N.7151$Ycdc.529@fx09.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="2201940"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
X-Rslight-Site: $2y$10$3e9f8Yii2Z9DELrWNZk/oeYZ0BHdb5SP77UNl0./ZjW.HG.ft.KSu
 by: MitchAlsup - Sun, 26 Nov 2023 19:28 UTC

Scott Lurndal wrote:

> mitchalsup@aol.com (MitchAlsup) writes:
>>Scott Lurndal wrote:
>>

>>>>Seems like one might need a mechanism to remap the VM from real CR's to
>>>>a partially emulated set of CR's (VCR's ?...).
>>
>>> ARM does this by adding a layer above the OS ring that can trap
>>> accesses to certain control registers used by the OS to the
>>> hypervisor for resolution. But for the most part, the guest just
>>> uses the same control registers as if it were running bare metal with
>>> no trapping - they're just loaded by the hypervisor before the guest
>>> is dispatched and saved by the hypervisor when scheduling a new
>>> guest. That's an advantage of the exception level scheme, where
>>> each level has its own set of control registers.
>>
>>My 66000 memory maps control registers {CPU, LLC, NorthBridge,
>>device, ...} into MMI/O space. A CPU, with access permission,
>>can read or write another CPU's control registers--used sparingly
>>to get out of trouble. Mainly this is used to allow a CPU to
>>read or write device control registers.

> ARM supports access to CPU system registers via MMIO;
> primarily for debug purposes. System Registers may be accessed
> either via MMIO accesses from a running core, subject to
> permission controls, or via JTAG interface(s).

Nice to know someone already blazed the trail.

> The preferred way to access a core's own system registers is
> via the MSR/MRS instructions.

My 66000 has an HR (Header Register) instruction to access one
register at a time, but an MM (memory to memory move) instruction
can be used to swap the entire core-stack {HV-level context switch.}
An MM to MMI/O space is guaranteed to be ATOMIC across the entire
transfer.

But it is not just system registers, but all storage within a
CPU/core, the L2 control status registers, the HostBridge
control and status registers,...EVEN the register Registers
are available--remotely.

> <snip>

>>Nested Page Tables are the best solution (Fewest SW instructions of
>>overhead and total cycles of latency) we currently know of.

> Indeed.

Re: Concertina II Progress

<uk0ckk$3dmb4$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35257&group=comp.arch#35257

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Concertina II Progress
Date: Sun, 26 Nov 2023 15:17:06 -0600
Organization: A noiseless patient Spider
Lines: 129
Message-ID: <uk0ckk$3dmb4$1@dont-email.me>
References: <uigus7$1pteb$1@dont-email.me>
<jwvedghqxha.fsf-monnier+comp.arch@gnu.org>
<6e757154a4c6aa8e070975c534c0fda8@news.novabbs.com>
<OSO7N.18767$_z86.2860@fx46.iad>
<e95ce095c8dccb161be2b4fad0697aea@news.novabbs.com>
<jwvwmu7ouzf.fsf-monnier+comp.arch@gnu.org> <0Pa8N.2233$PJoc.1323@fx04.iad>
<ujrnq0$2lqen$2@dont-email.me> <6Gp8N.22586$yAie.1862@fx44.iad>
<ujtei9$2t545$1@dont-email.me> <SVr8N.40009$zh16.19369@fx48.iad>
<ujtpjb$2unm2$1@dont-email.me> <Fhu8N.110789$BbXa.14700@fx16.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 26 Nov 2023 21:17:08 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="dcf0aa9e6f429425e09f6aa229e16f53";
logging-data="3594596"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18wTtZ84gpV3c9bFsWapxyT"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:YyGQL85zIwOvw6il4xbn8a9pm08=
Content-Language: en-US
In-Reply-To: <Fhu8N.110789$BbXa.14700@fx16.iad>
 by: BGB - Sun, 26 Nov 2023 21:17 UTC

On 11/25/2023 4:10 PM, Scott Lurndal wrote:
> BGB <cr88192@gmail.com> writes:
>> On 11/25/2023 1:28 PM, Scott Lurndal wrote:
>
>>>
>>> If you're taking an interrupt, to resolve guest TLB misses,
>>> performance is clearly not high priority.
>>>
>>
>> If one can stay under, say, 100-500 TLB misses per second (on a 50MHz
>> CPU), then the cost of the TLB miss handling is on par with other things
>> like handling the timer interrupt, etc...
>
> Any cycle used by the miss handler is a cycle that could
> have been used for useful work. Timer interrupt handling
> is often very short (increment a memory location, a comparison
> and a return if no timer has expired). And we're long
> past the days of using regular timer interrupts for scheduling
> (see tickless kernels, for example).
>

It takes roughly as much time to service a timer interrupt as to service
a TLB miss...

Much of the work in the time spent in the latter is saving/restoring the
relevant registers, with the actual page table walk and 'LDTLB'
instruction typically a fairly minor part in comparison...

At least, excluding something like using B-Tree based page tables...

It could be made faster, but would likely require doing the TLB miss
handler in ASM and only saving/restoring the minimum number of registers
(well, at least until we detect that there will be a page-fault, which
would still require falling back to a "more comprehensive" handler).

Any L1 miss penalties from the page-walk itself would likely also apply
to a hardware page-walker.

>
>>
>> But, what one does need, is a way to perform context switches without
>> also triggering a huge wave of TLB misses in the process.
>
> Why?
>
> Note that depending on the number of entries in your TLB
> and the scheduler behavior, it's unlikely that any prior
> TLB entries will be useful to a newly scheduled thread
> (in a different address space).
>

I am mostly using a 256x 4-way TLB (so, 1024 TLBE's).
With a 16K page size, this is basically enough to keep roughly something
the size of the working set of Doom entirely in the TLB.

In my past experiments, 16K seemed to be the local optimum for the
programs tested:
4K and 8K resulted in higher miss rates;
32K and 64K resulted in more "internal fragmentation" without much
reduction in miss rate.
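
For reference, the amount of memory such a TLB can map at once (its "reach") is simple arithmetic. This sketch assumes every entry maps a distinct page and that no set overflows, which a real workload will not quite manage.

```python
def tlb_reach_bytes(sets, ways, page_size):
    """Upper bound on memory mapped by a fully populated TLB."""
    return sets * ways * page_size

MIB = 1024 * 1024
for page_kib in (4, 16, 64):
    reach = tlb_reach_bytes(256, 4, page_kib * 1024)
    print(f"{page_kib}K pages: {reach // MIB} MiB")
```

So the 256x 4-way configuration with 16K pages covers at most 16 MiB, four times what the same TLB covers with 4K pages.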

> Having multiple banks of TLBs that you can switch between
> might be able to provide you with the capability to
> reduce the TLB miss rate on scheduling a new thread of
> execution - but CAMs aren't cheap.
>

This is why my TLB is 4-way set-associative.

An 8-way TLB would be a lot more expensive, and a fully-associative TLB
(of nearly any non-trivial size) would be effectively implausible.

> For the most part, industry has settled on a large number
> of tagged TLB entries as a good compromise. Some architectures have
> a global bit in the entry that can be set via the page
> table that indicates that ASID and/or VMID qualifications
> aren't necessary for a hit.
>

Yeah.

I guess a factor here is mostly defining rules to both allow for and
control the scope of global pages.

In my case:
The TTB register defines an ASID in the high order bits;
The TLBE also has an ASID;
The ASID is split into two parts (6 and 10 bits).
In the ASID, 0 designates global pages
But they are broken into "groups"
So typically a global page is only shared within a given group.

I am thinking the 6.10 split may have given too many bits to the group,
and 4.12 or 2.14 might have been better.

As-is, say, ASID 03DE would be able to see global pages in 0000, but 045F
would not (but would see global pages in ASID 0400).

So, say, in the current scheme:
ASID's 0000, 0400, 0800, 0C00, ... would exist as mirrors of the
global address space.

Where, say, during a TLB Miss, if a page is marked global, it can be
put into one of these ASIDs rather than the main ASID of the current
process (if not in an ASID range which disallows global pages).

The size of the group will have an effect on miss rate in cases where
there are a lot of active PIDs though.
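
The group matching described above can be sketched as follows. This is a simplified model with hypothetical helper names: it assumes a 16-bit ASID with the 6.10 split and leaves out the ASID ranges that disallow global pages.

```python
GROUP_BITS = 6                       # high bits: group
LOCAL_BITS = 10                      # low bits: ASID within the group
LOCAL_MASK = (1 << LOCAL_BITS) - 1   # 0x03FF

def group_base(asid):
    """ASID of the group's global pages: same group, low part zero."""
    return asid & 0xFFFF & ~LOCAL_MASK

def entry_visible(entry_asid, current_asid):
    """A TLBE matches if it is the process's own or its group's global one."""
    return entry_asid == current_asid or entry_asid == group_base(current_asid)

print(hex(group_base(0x03DE)), entry_visible(0x0000, 0x03DE))
print(hex(group_base(0x045F)), entry_visible(0x0000, 0x045F))
print(entry_visible(0x0400, 0x045F))
```

Changing the split to 4.12 or 2.14 would only mean changing `LOCAL_BITS`; fewer groups means each global page needs fewer mirrored entries.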

>>
>> Big TLB + strategic sharing and ASIDs can help here at least (whereas, a
>> full TLB flush on context-switch would suck pretty bad).
>
> That's unnecessarily harsh. Consider that on Intel/AMD/ARM the kernel half
> of the virtual address space is shared by all processes - there's no reason
> that those entries need to be flushed on context-switch.
>

AFAIK, on x86, automatic (near) full TLB flush on CR3 modification was
the defined behavior?... Well, at least ignoring the support for global
pages.

Re: Concertina II Progress

<hNP8N.30033$ayBd.23182@fx07.iad>

https://news.novabbs.org/devel/article-flat.php?id=35264&group=comp.arch#35264
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx07.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Concertina II Progress
Newsgroups: comp.arch
References: <uigus7$1pteb$1@dont-email.me> <OSO7N.18767$_z86.2860@fx46.iad> <e95ce095c8dccb161be2b4fad0697aea@news.novabbs.com> <jwvwmu7ouzf.fsf-monnier+comp.arch@gnu.org> <0Pa8N.2233$PJoc.1323@fx04.iad> <ujrnq0$2lqen$2@dont-email.me> <6Gp8N.22586$yAie.1862@fx44.iad> <ujtei9$2t545$1@dont-email.me> <SVr8N.40009$zh16.19369@fx48.iad> <1ca3f30a0d0dc903ec9cdf498098e638@news.novabbs.com> <uZJ8N.7151$Ycdc.529@fx09.iad> <496ce101508499530944f801b52ca8b6@news.novabbs.com>
Lines: 58
Message-ID: <hNP8N.30033$ayBd.23182@fx07.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Sun, 26 Nov 2023 22:38:05 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Sun, 26 Nov 2023 22:38:05 GMT
X-Received-Bytes: 3403
 by: Scott Lurndal - Sun, 26 Nov 2023 22:38 UTC

mitchalsup@aol.com (MitchAlsup) writes:
>Scott Lurndal wrote:
>
>> mitchalsup@aol.com (MitchAlsup) writes:
>>>Scott Lurndal wrote:
>>>
>
>>>>>Seems like one might need a mechanism to remap the VM from real CR's to
>>>>>a partially emulated set of CR's (VCR's ?...).
>>>
>>>> ARM does this by adding a layer above the OS ring that can trap
>>>> accesses to certain control registers used by the OS to the
>>>> hypervisor for resolution. But for the most part, the guest just
>>>> uses the same control registers as if it were running bare metal with
>>>> no trapping - they're just loaded by the hypervisor before the guest
>>>> is dispatched and saved by the hypervisor when scheduling a new
>>>> guest. That's an advantage of the exception level scheme, where
>>>> each level has its own set of control registers.
>>>
>>>My 66000 memory maps control registers {CPU, LLC, NorthBridge,
>>>device, ...} into MMI/O space. A CPU, with access permission,
>>>can read or write another CPU's control registers--used sparingly
>>>to get out of trouble. Mainly this is used to allow a CPU to
>>>read or write device control registers.
>
>> ARM supports access to CPU system registers via MMIO;
>> primarily for debug purposes. System Registers may be accessed
>> either via MMIO accesses from a running core, subject to
>> permission controls, or via JTAG interface(s).
>
>Nice to know someone already blazed the trail.

Note that a handful of system registers, when accessed
using the MRS/MSR instructions, are self-synchronizing
with respect to other state. This, architecturally,
does _not_ hold when accessed via MMIO.

>
>> The preferred way to access a core's own system registers is
>> via the MSR/MRS instructions.
>
>My 66000 has a HR (Header Register) instruction to access one
>register at a time, but a MM (memory to memory move) instruction
>can be used to swap the entire core-stack {HV-level context switch.}
>MM to a MMI/O space is guaranteed to be ATOMIC across the entire
>transfer.
>
>But it is not just system registers, but all storage within a
>CPU/core, the L2 control status registers, the HostBridge
>control and status registers,...EVEN the register Registers
>are available--remotely.
>
>> <snip>
>
>>>Nested Page Tables are the best solution (Fewest SW instructions of
>>>overhead and total cycles of latency) we currently know of.
>
>> Indeed.

Re: Concertina II Progress

<BQP8N.30034$ayBd.7475@fx07.iad>

https://news.novabbs.org/devel/article-flat.php?id=35265&group=comp.arch#35265
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx07.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Concertina II Progress
Newsgroups: comp.arch
References: <uigus7$1pteb$1@dont-email.me> <OSO7N.18767$_z86.2860@fx46.iad> <e95ce095c8dccb161be2b4fad0697aea@news.novabbs.com> <jwvwmu7ouzf.fsf-monnier+comp.arch@gnu.org> <0Pa8N.2233$PJoc.1323@fx04.iad> <ujrnq0$2lqen$2@dont-email.me> <6Gp8N.22586$yAie.1862@fx44.iad> <ujtei9$2t545$1@dont-email.me> <SVr8N.40009$zh16.19369@fx48.iad> <1ca3f30a0d0dc903ec9cdf498098e638@news.novabbs.com> <uZJ8N.7151$Ycdc.529@fx09.iad> <496ce101508499530944f801b52ca8b6@news.novabbs.com>
Lines: 28
Message-ID: <BQP8N.30034$ayBd.7475@fx07.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Sun, 26 Nov 2023 22:41:37 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Sun, 26 Nov 2023 22:41:37 GMT
X-Received-Bytes: 1983
 by: Scott Lurndal - Sun, 26 Nov 2023 22:41 UTC

mitchalsup@aol.com (MitchAlsup) writes:
>Scott Lurndal wrote:
>
>> mitchalsup@aol.com (MitchAlsup) writes:
>>>Scott Lurndal wrote:
>>>
>

>
>> The preferred way to access a core's own system registers is
>> via the MSR/MRS instructions.
>
>My 66000 has a HR (Header Register) instruction to access one
>register at a time, but a MM (memory to memory move) instruction
>can be used to swap the entire core-stack {HV-level context switch.}
>MM to a MMI/O space is guaranteed to be ATOMIC across the entire
>transfer.
>
>But it is not just system registers, but all storage within a
>CPU/core, the L2 control status registers, the HostBridge
>control and status registers,...EVEN the register Registers
>are available--remotely.

Yes, we do that (useful on chips that can also be a PCIe endpoint).

Even AMD does that with the memory controllers, SMI, I2C/I3C
etc. appearing as PCI endpoints.

Re: Concertina II Progress

<zVP8N.30036$ayBd.10342@fx07.iad>

https://news.novabbs.org/devel/article-flat.php?id=35266&group=comp.arch#35266
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx07.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Concertina II Progress
Newsgroups: comp.arch
References: <uigus7$1pteb$1@dont-email.me> <6e757154a4c6aa8e070975c534c0fda8@news.novabbs.com> <OSO7N.18767$_z86.2860@fx46.iad> <e95ce095c8dccb161be2b4fad0697aea@news.novabbs.com> <jwvwmu7ouzf.fsf-monnier+comp.arch@gnu.org> <0Pa8N.2233$PJoc.1323@fx04.iad> <ujrnq0$2lqen$2@dont-email.me> <6Gp8N.22586$yAie.1862@fx44.iad> <ujtei9$2t545$1@dont-email.me> <SVr8N.40009$zh16.19369@fx48.iad> <ujtpjb$2unm2$1@dont-email.me> <Fhu8N.110789$BbXa.14700@fx16.iad> <uk0ckk$3dmb4$1@dont-email.me>
Lines: 48
Message-ID: <zVP8N.30036$ayBd.10342@fx07.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Sun, 26 Nov 2023 22:46:55 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Sun, 26 Nov 2023 22:46:55 GMT
X-Received-Bytes: 2835
 by: Scott Lurndal - Sun, 26 Nov 2023 22:46 UTC

BGB <cr88192@gmail.com> writes:
>On 11/25/2023 4:10 PM, Scott Lurndal wrote:
>> BGB <cr88192@gmail.com> writes:
>>> On 11/25/2023 1:28 PM, Scott Lurndal wrote:
>>
>>>>
>>>> If you're taking an interrupt, to resolve guest TLB misses,
>>>> performance is clearly not high priority.
>>>>
>>>
>>> If one can stay under, say, 100-500 TLB misses per second (on a 50MHz
>>> CPU), then the cost of the TLB miss handling is on par with other things
>>> like handling the timer interrupt, etc...
>>
>> Any cycle used by the miss handler is a cycle that could
>> have been used for useful work. Timer interrupt handling
>> is often very short (increment a memory location, a comparison
>> and a return if no timer has expired). And we're long
>> past the days of using regular timer interrupts for scheduling
>> (see tickless kernels, for example).
>>
>
>It takes roughly as much time to service a timer interrupt as to service
>a TLB miss...

You'll need to provide more than an assertion for that.

>
>Much of the work in the time spent in the latter is saving/restoring the
>relevant registers, with the actual page table walk and 'LDTLB'
>instruction typically a fairly minor part in comparison...

Then you've a poorly written handler. Note that a hardware table
walker doesn't need to save any registers.

<snip>

>> That's unnecessarily harsh. Consider that on Intel/AMD/ARM the kernel half
>> of the virtual address space is shared by all processes - there's no reason
>> that those entries need to be flushed on context-switch.
>>
>
>AFAIK, on x86, automatic (near) full TLB flush on CR3 modification was
>the defined behavior?... Well, at least ignoring the support for global
>pages.

Does x86 even tag the TLB entries with an ASID? I've been in ARMv8 land for the
last decade.

Re: Concertina II Progress

<b381a322e56ff2ac9c52eaa34374b3f3@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=35267&group=comp.arch#35267
Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Concertina II Progress
Date: Sun, 26 Nov 2023 23:10:36 +0000
Organization: novaBBS
Message-ID: <b381a322e56ff2ac9c52eaa34374b3f3@news.novabbs.com>
References: <uigus7$1pteb$1@dont-email.me> <OSO7N.18767$_z86.2860@fx46.iad> <e95ce095c8dccb161be2b4fad0697aea@news.novabbs.com> <jwvwmu7ouzf.fsf-monnier+comp.arch@gnu.org> <0Pa8N.2233$PJoc.1323@fx04.iad> <ujrnq0$2lqen$2@dont-email.me> <6Gp8N.22586$yAie.1862@fx44.iad> <ujtei9$2t545$1@dont-email.me> <SVr8N.40009$zh16.19369@fx48.iad> <1ca3f30a0d0dc903ec9cdf498098e638@news.novabbs.com> <uZJ8N.7151$Ycdc.529@fx09.iad> <496ce101508499530944f801b52ca8b6@news.novabbs.com> <hNP8N.30033$ayBd.23182@fx07.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="2218274"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
X-Rslight-Site: $2y$10$rhxoCPa1/YWnOG8aNij/Fe3Z5HXVL0gGGqur5AbdptWaUww.jfM9u
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
 by: MitchAlsup - Sun, 26 Nov 2023 23:10 UTC

Scott Lurndal wrote:

> mitchalsup@aol.com (MitchAlsup) writes:
>>Scott Lurndal wrote:
>>
>>>>
>>>>My 66000 memory maps control registers {CPU, LLC, NorthBridge,
>>>>device, ...} into MMI/O space. A CPU, with access permission,
>>>>can read or write another CPU's control registers--used sparingly
>>>>to get out of trouble. Mainly this is used to allow a CPU to
>>>>read or write device control registers.
>>
>>> ARM supports access to CPU system registers via MMIO;
>>> primarily for debug purposes. System Registers may be accessed
>>> either via MMIO accesses from a running core, subject to
>>> permission controls, or via JTAG interface(s).
>>
>>Nice to know someone already blazed the trail.

> Note that a handful of system registers, when accessed
> using the MRS/MSR instructions, are self-synchronizing
> with respect to other state. This, architecturally,
> does _not_ hold when accessed via MMIO.

My 66000 architecture specification indicates that when a CPU control
register is written, the CPU performs as if it saved all current state,
allowed the write to transpire, and then reloaded all the state.

The "as if" qualifier allows an implementation to take fewer cycles
when it recognizes certain situations.

But, this is one of those things that falls out "for free"* when the
HW knows how to perform context switches as if thread-state were in
memory. {{(*) nothing is ever free, but if you have HW context
switches there are a lot of other things that can be made "as if"}}

Re: Concertina II Progress

<daf8174469b78e2ee283ccb57892790f@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=35268&group=comp.arch#35268
Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Concertina II Progress
Date: Sun, 26 Nov 2023 23:13:07 +0000
Organization: novaBBS
Message-ID: <daf8174469b78e2ee283ccb57892790f@news.novabbs.com>
References: <uigus7$1pteb$1@dont-email.me> <OSO7N.18767$_z86.2860@fx46.iad> <e95ce095c8dccb161be2b4fad0697aea@news.novabbs.com> <jwvwmu7ouzf.fsf-monnier+comp.arch@gnu.org> <0Pa8N.2233$PJoc.1323@fx04.iad> <ujrnq0$2lqen$2@dont-email.me> <6Gp8N.22586$yAie.1862@fx44.iad> <ujtei9$2t545$1@dont-email.me> <SVr8N.40009$zh16.19369@fx48.iad> <1ca3f30a0d0dc903ec9cdf498098e638@news.novabbs.com> <uZJ8N.7151$Ycdc.529@fx09.iad> <496ce101508499530944f801b52ca8b6@news.novabbs.com> <BQP8N.30034$ayBd.7475@fx07.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="2218763"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
X-Rslight-Site: $2y$10$A2xlu03BwcN4.j9qaTBl/OSwe1TKqHKS/SCtLqyvXS/FZgCjBMEIS
 by: MitchAlsup - Sun, 26 Nov 2023 23:13 UTC

Scott Lurndal wrote:

> mitchalsup@aol.com (MitchAlsup) writes:
>>Scott Lurndal wrote:
>>
>>
>>But it is not just system registers, but all storage within a
>>CPU/core, the L2 control status registers, the HostBridge
>>control and status registers,...EVEN the register Registers
>>are available--remotely.

> Yes, we do that (useful on chips that can also be a PCIe endpoint).

> Even AMD does that with the memory controllers, SMI, I2C/I3C
> etc. appearing as PCI endpoints.

I look at it like this: you are going to need the ability to reach
into the innermost areas of the chip and look at what is going on.
The easiest means to get there, today, is via PCIe--JTAG is not that
useful when there are 1T bits you might want to look at.

Re: Concertina II Progress

<993818c45413361b05a2656bf30cc304@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=35270&group=comp.arch#35270
Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Concertina II Progress
Date: Sun, 26 Nov 2023 23:29:35 +0000
Organization: novaBBS
Message-ID: <993818c45413361b05a2656bf30cc304@news.novabbs.com>
References: <uigus7$1pteb$1@dont-email.me> <6e757154a4c6aa8e070975c534c0fda8@news.novabbs.com> <OSO7N.18767$_z86.2860@fx46.iad> <e95ce095c8dccb161be2b4fad0697aea@news.novabbs.com> <jwvwmu7ouzf.fsf-monnier+comp.arch@gnu.org> <0Pa8N.2233$PJoc.1323@fx04.iad> <ujrnq0$2lqen$2@dont-email.me> <6Gp8N.22586$yAie.1862@fx44.iad> <ujtei9$2t545$1@dont-email.me> <SVr8N.40009$zh16.19369@fx48.iad> <ujtpjb$2unm2$1@dont-email.me> <Fhu8N.110789$BbXa.14700@fx16.iad> <uk0ckk$3dmb4$1@dont-email.me> <zVP8N.30036$ayBd.10342@fx07.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="2219697"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
X-Rslight-Site: $2y$10$z/W8555Kf3ywJieOdQydYeT4qBuQ0uAHzdcfJQZ2sQofjefopEsdG
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
 by: MitchAlsup - Sun, 26 Nov 2023 23:29 UTC

Scott Lurndal wrote:

> BGB <cr88192@gmail.com> writes:
>>On 11/25/2023 4:10 PM, Scott Lurndal wrote:
>>> BGB <cr88192@gmail.com> writes:
>>>> On 11/25/2023 1:28 PM, Scott Lurndal wrote:
>>>
>>>>>
>>>>> If you're taking an interrupt, to resolve guest TLB misses,
>>>>> performance is clearly not high priority.
>>>>>
>>>>
>>>> If one can stay under, say, 100-500 TLB misses per second (on a 50MHz
>>>> CPU), then the cost of the TLB miss handling is on par with other things
>>>> like handling the timer interrupt, etc...
>>>
>>> Any cycle used by the miss handler is a cycle that could
>>> have been used for useful work. Timer interrupt handling
>>> is often very short (increment a memory location, a comparison
>>> and a return if no timer has expired). And we're long
>>> past the days of using regular timer interrupts for scheduling
>>> (see tickless kernels, for example).
>>>
>>
>>It takes roughly as much time to service a timer interrupt as to service
>>a TLB miss...

> You'll need to provide more than an assertion for that.

Servicing a TLB miss with an L2 TLB takes about 6 cycles on my 1-wide machine.
Walking the page tables may take as few as 1 access or as many as 24 accesses
to the L2 cache (adding in whatever cache miss latency transpires). With
reasonable Table Walk Caching, we may average 30 cycles {Hardware table walk}.
So, at one end we have 6 cycles and at the other we have 24 serially dependent
L2 misses, but averaging around 30 cycles.
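
One possible reading of the "as many as 24" figure, offered as an assumption since the post does not spell it out, is a nested (two-dimensional) walk with 4 guest levels and 4 host levels. The usual worst-case count for such a walk can be sketched as below; the function name nested_walk_refs is hypothetical.

```c
#include <assert.h>

/* Worst-case memory references for a nested (two-dimensional) page walk.
   Each guest table pointer is a guest-physical address, so reading one
   guest entry costs host_levels references to translate plus 1 to read.
   The final guest-physical address then needs host_levels more:
     g*(h+1) + h  ==  (g+1)*(h+1) - 1                                     */
static unsigned nested_walk_refs(unsigned guest_levels, unsigned host_levels)
{
    assert(guest_levels > 0 && host_levels > 0);
    return (guest_levels + 1u) * (host_levels + 1u) - 1u;
}
```

For example, nested_walk_refs(4, 4) evaluates to 24. Table-walk caching is what keeps the typical case far below this bound.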

Service a timer interrupt:: 10 cycles waiting for thread-state to arrive,
a cache miss waiting for instructions for the ISR dispatcher, 3 instructions to
transfer control to the ISR handler, another cache miss waiting for instructions.
At this point the handler needs to tell the timer it has been serviced, and
optionally to send it a count of the next time it should go off. Schedule
a DPC/softIRQ, unwind the handler/dispatcher stack, and return from the
dispatcher only to end up at the DPC/softIRQ.

I can't see this taking less than 100 cycles.......and vastly more if SW is
burdened with doing the save and restore after finding registers to use
while shuffling data to some stack.

>>
>>Much of the work in the time spent in the latter is saving/restoring the
>>relevant registers, with the actual page table walk and 'LDTLB'
>>instruction typically a fairly minor part in comparison...

> Then you've a poorly written handler. Note that a hardware table
> walker doesn't need to save any registers.

Neither does the My 66000 ISR dispatcher. By the time control arrives,
old thread state has been returned (at least conceptually) to memory
and the CPU has its new thread state loaded {including IP, Root Pointer,
ISRasid, ISR SP, ISR FP if desired, and pointers to things the ISR may
want quick access to when it receives control}--all reentrantly.

So, I contend it is not the writing of the ISR handler, it is the architecture
which causes the ISR handler to have such a big prologue and epilogue.

> <snip>

>>> That's unnecessarily harsh. Consider that on Intel/AMD/ARM the kernel half
>>> of the virtual address space is shared by all processes - there's no reason
>>> that those entries need to be flushed on context-switch.
>>>
>>
>>AFAIK, on x86, automatic (near) full TLB flush on CR3 modification was
>>the defined behavior?... Well, at least ignoring the support for global
>>pages.

Well-done ASIDs prevent the need for TLB flushing except when kicking a thread
out of the ASID bucket-list.

Re: Concertina II Progress

<uk0mla$3f3ju$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35272&group=comp.arch#35272
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: robfi680@gmail.com (Robert Finch)
Newsgroups: comp.arch
Subject: Re: Concertina II Progress
Date: Sun, 26 Nov 2023 19:08:09 -0500
Organization: A noiseless patient Spider
Lines: 139
Message-ID: <uk0mla$3f3ju$1@dont-email.me>
References: <uigus7$1pteb$1@dont-email.me>
<jwvedghqxha.fsf-monnier+comp.arch@gnu.org>
<6e757154a4c6aa8e070975c534c0fda8@news.novabbs.com>
<OSO7N.18767$_z86.2860@fx46.iad>
<e95ce095c8dccb161be2b4fad0697aea@news.novabbs.com>
<jwvwmu7ouzf.fsf-monnier+comp.arch@gnu.org> <0Pa8N.2233$PJoc.1323@fx04.iad>
<ujrnq0$2lqen$2@dont-email.me> <6Gp8N.22586$yAie.1862@fx44.iad>
<ujtei9$2t545$1@dont-email.me> <SVr8N.40009$zh16.19369@fx48.iad>
<ujtpjb$2unm2$1@dont-email.me> <Fhu8N.110789$BbXa.14700@fx16.iad>
<uk0ckk$3dmb4$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 27 Nov 2023 00:08:10 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="ceec176b45ee2c6fcec01f9530ed8ad4";
logging-data="3640958"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19sYXsAXAnNPSMnupuTelmdymcYImOW2SU="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:r7Dfa2iSgkmZN0vHD8IvtIdIT1U=
In-Reply-To: <uk0ckk$3dmb4$1@dont-email.me>
Content-Language: en-US
 by: Robert Finch - Mon, 27 Nov 2023 00:08 UTC

On 2023-11-26 4:17 p.m., BGB wrote:
> On 11/25/2023 4:10 PM, Scott Lurndal wrote:
>> BGB <cr88192@gmail.com> writes:
>>> On 11/25/2023 1:28 PM, Scott Lurndal wrote:
>>
>>>>
>>>> If you're taking an interrupt, to resolve guest TLB misses,
>>>> performance is clearly not high priority.
>>>>
>>>
>>> If one can stay under, say, 100-500 TLB misses per second (on a 50MHz
>>> CPU), then the cost of the TLB miss handling is on par with other things
>>> like handling the timer interrupt, etc...
>>
>> Any cycle used by the miss handler is a cycle that could
>> have been used for useful work.   Timer interrupt handling
>> is often very short (increment a memory location, a comparison
>> and a return if no timer has expired).   And we're long
>> past the days of using regular timer interrupts for scheduling
>> (see tickless kernels, for example).
>>
>
> It takes roughly as much time to service a timer interrupt as to service
> a TLB miss...
>
> Much of the work in the time spent in the latter is saving/restoring the
> relevant registers, with the actual page table walk and 'LDTLB'
> instruction typically a fairly minor part in comparison...
>
> At least, excluding something like using B-Tree based page tables...
>
> It could be made faster, but would likely require doing the TLB miss
> handler in ASM and only saving/restoring the minimum number of registers
> (well, at least until we detect that there will be a page-fault, which
> would still require falling back to a "more comprehensive" handler).
>
>
> Any L1 miss penalties from the page-walk itself would likely also apply
> to a hardware page-walker.
>
A hardware table walker strikes me as not being a large component.
Although as yet untested, the Q+ table walker is only about 1,200 LUTs or
1% of the FPGA. Given the small size I think it is worth it to have the
table walker in hardware. It is hard to beat hardware timing-wise when
it does not need to save / restore registers.

>>
>>>
>>> But, what one does need, is a way to perform context switches without
>>> also triggering a huge wave of TLB misses in the process.
>>
>> Why?
>>
>> Note that depending on the number of entries in your TLB
>> and the scheduler behavior, it's unlikely that any prior
>> TLB entries will be useful to a newly scheduled thread
>> (in a different address space).
>>
>
> I am mostly using a 256x 4-way TLB (so, 1024 TLBE's).
> With a 16K page size, this is basically enough to keep roughly something
> the size of the working set of Doom entirely in the TLB.
>
>
> In my past experiments, 16K seemed to be the local optimum for the
> programs tested:
> 4K and 8K resulted in higher miss rates;
> 32K and 64K resulted in more "internal fragmentation" without much
> reduction in miss rate.
>
>
>> Having multiple banks of TLBs that you can switch between
>> might be able to provide you with the capability to
>> reduce the TLB miss rate on scheduling a new thread of
>> execution - but CAMs aren't cheap.
>>
>
> This is why my TLB is 4-way set-associative.
>
> An 8-way TLB would be a lot more expensive, and a fully-associative TLB
> (of nearly any non-trivial size) would be effectively implausible.
>
>
>> For the most part, industry has settled on a large number
>> of tagged TLB entries as a good compromise.   Some architectures have
>> a global bit in the entry that can be set via the page
>> table that indicates that ASID and/or VMID qualifications
>> aren't necessary for a hit.
>>
>
> Yeah.
>
> I guess a factor here is mostly defining rules to both allow for and
> control the scope of global pages.
>
> In my case:
>   The TTB register defines an ASID in the high order bits;
>   The TLBE also has an ASID;
>   The ASID is split into two parts (6 and 10 bits).
>     In the ASID, 0 designates global pages
>     But they are broken into "groups"
>     So typically a global page is only shared within a given group.
>
> I am thinking the 6.10 split may have given too many bits to the group,
> and 4.12 or 2.14 might have been better.
>
> As-is, say, ASID 03DE would be able to see global pages in 0000, but 045F
> would not (but would see global pages in ASID 0400).
>
> So, say, in the current scheme:
>   ASID's 0000, 0400, 0800, 0C00, ... would exist as mirrors of the
> global address space.
>
>
> Where, say, if during a TLB Miss, if a page is marked global, it can be
> put into one of these ASIDs rather than the main ASID of the current
> process (if not in an ASID range which disallows global pages).
>
> The size of the group will have an effect on miss rate in cases where
> there are a lot of active PIDs though.
>
>
>>>
>>> Big TLB + strategic sharing and ASIDs can help here at least (whereas, a
>>> full TLB flush on context-switch would suck pretty bad).
>>
>> That's unnecessarily harsh.   Consider that on Intel/AMD/ARM the
>> kernel half
>> of the virtual address space is shared by all processes - there's no
>> reason
>> that those entries need to be flushed on context-switch.
>>
>
> AFAIK, on x86, automatic (near) full TLB flush on CR3 modification was
> the defined behavior?... Well, at least ignoring the support for global
> pages.
>
>

Re: Concertina II Progress

<uk10cb$3k4os$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35277&group=comp.arch#35277
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Concertina II Progress
Date: Sun, 26 Nov 2023 20:54:01 -0600
Organization: A noiseless patient Spider
Lines: 151
Message-ID: <uk10cb$3k4os$1@dont-email.me>
References: <uigus7$1pteb$1@dont-email.me>
<6e757154a4c6aa8e070975c534c0fda8@news.novabbs.com>
<OSO7N.18767$_z86.2860@fx46.iad>
<e95ce095c8dccb161be2b4fad0697aea@news.novabbs.com>
<jwvwmu7ouzf.fsf-monnier+comp.arch@gnu.org> <0Pa8N.2233$PJoc.1323@fx04.iad>
<ujrnq0$2lqen$2@dont-email.me> <6Gp8N.22586$yAie.1862@fx44.iad>
<ujtei9$2t545$1@dont-email.me> <SVr8N.40009$zh16.19369@fx48.iad>
<ujtpjb$2unm2$1@dont-email.me> <Fhu8N.110789$BbXa.14700@fx16.iad>
<uk0ckk$3dmb4$1@dont-email.me> <zVP8N.30036$ayBd.10342@fx07.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 27 Nov 2023 02:54:04 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="2fcebed1d03588c400a4bb58f77147d2";
logging-data="3805980"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/jSGyfCsAPtvGJ19ND5gEn"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:iFcEEefy4LDGEBs1mpOsXlk3Yys=
Content-Language: en-US
In-Reply-To: <zVP8N.30036$ayBd.10342@fx07.iad>
 by: BGB - Mon, 27 Nov 2023 02:54 UTC

On 11/26/2023 4:46 PM, Scott Lurndal wrote:
> BGB <cr88192@gmail.com> writes:
>> On 11/25/2023 4:10 PM, Scott Lurndal wrote:
>>> BGB <cr88192@gmail.com> writes:
>>>> On 11/25/2023 1:28 PM, Scott Lurndal wrote:
>>>
>>>>>
>>>>> If you're taking an interrupt, to resolve guest TLB misses,
>>>>> performance is clearly not high priority.
>>>>>
>>>>
>>>> If one can stay under, say, 100-500 TLB misses per second (on a 50MHz
>>>> CPU), then the cost of the TLB miss handling is on par with other things
>>>> like handling the timer interrupt, etc...
>>>
>>> Any cycle used by the miss handler is a cycle that could
>>> have been used for useful work. Timer interrupt handling
>>> is often very short (increment a memory location, a comparison
>>> and a return if no timer has expired). And we're long
>>> past the days of using regular timer interrupts for scheduling
>>> (see tickless kernels, for example).
>>>
>>
>> It takes roughly as much time to service a timer interrupt as to service
>> a TLB miss...
>
> You'll need to provide more than an assertion for that.
>

If the interrupt's save/restore prolog/epilog by itself burns ~ 500+
cycles, then the time needed to do a few memory loads, some bit
twiddling, and an LDTLB, mostly disappears in the noise...

Granted, it costs more cycles to walk the page-table, compose, and load
the TLBE, than it does to increment a counter variable, but...

Nearly all the "expensive parts" will happen similarly in both cases.

I could get along OK using a B-Tree as a page-table: despite the
considerable cost difference between a simple 3-level page table walk
and a B-Tree walk, this "merely doubled" the average cost of the TLB
Miss handler...
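
For reference, the "few memory loads, some bit twiddling" of a simple 3-level walk might look something like the sketch below. Only the 16K page size comes from this thread; the 8-byte PTE (hence 11 index bits per level), the valid bit in bit 0, and the small simulated table_pool are assumptions made for illustration, and the final LDTLB step is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT  14u               /* 16K pages, as discussed in the thread */
#define LEVEL_BITS  11u               /* assumption: 16K table / 8-byte PTE    */
#define ENTRIES     (1u << LEVEL_BITS)
#define PTE_VALID   1ull              /* assumption: bit 0 marks a valid entry */
#define POOL_TABLES 3u

/* Simulated physical memory: a tiny pool of page-table pages (hypothetical).
   A non-leaf entry holds (pool index << PAGE_SHIFT) | PTE_VALID.            */
static uint64_t table_pool[POOL_TABLES][ENTRIES];

/* Walk three levels. Returns true and stores the leaf PTE on success;
   returns false where a real handler would dispatch a page fault.      */
static bool walk3(unsigned root, uint64_t va, uint64_t *pte_out)
{
    unsigned table = root;
    assert(pte_out != 0);
    for (int level = 2; level >= 0; level--) {
        unsigned shift = PAGE_SHIFT + (unsigned)level * LEVEL_BITS;
        uint64_t entry;
        assert(table < POOL_TABLES);
        entry = table_pool[table][(va >> shift) & (ENTRIES - 1u)];
        if (!(entry & PTE_VALID))
            return false;             /* missing mapping: page fault path */
        if (level == 0) {
            *pte_out = entry;         /* leaf PTE, ready to become a TLBE */
            return true;
        }
        table = (unsigned)(entry >> PAGE_SHIFT);  /* next level's table */
    }
    return false;                     /* not reached */
}
```

The loop body is three dependent loads plus shifts and masks, which is why it tends to disappear next to a few hundred cycles of register save/restore.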

Both cases could be faster, but it would likely require writing the ISR
handlers in ASM (and not saving/restoring all of the registers).

And the potential savings are smaller:
The TLB miss handler may also need to deal with ACL Miss and needs to be
able to dispatch a Page Fault event;
The IRQ Miss handler, meanwhile, may need to deal with other types of
hardware events beyond just timer interrupts (though, at present, the
timer is the only thing that generates an interrupt, pretty much
everything else at present is polling IO).

>>
>> Much of the work in the time spent in the latter is saving/restoring the
>> relevant registers, with the actual page table walk and 'LDTLB'
>> instruction typically a fairly minor part in comparison...
>
> Then you've a poorly written handler. Note that a hardware table
> walker doesn't need to save any registers.
>

Most of this logic is auto-generated by my C compiler.

__interrupt void __isr_interrupt(void)
{ }

By itself is going to save/restore all of the registers and burn roughly
500 cycles in the process...

Though, I had considered possibly adding a "__interrupt_min" keyword,
which would try to minimize the number of registers saved/restored, but
would not allow the ISR to implement a context switch...

But, the latter restriction would make it "almost useless", as the main
two interrupts where it might be useful (the IRQ and TLB Miss handlers),
would also be naturally excluded as both may need to implement context
switches.

Did end up adding:
__interrupt_tbrsave void __isr_syscall(void)
{ }

Where "__interrupt_tbrsave" does at least optimize things in the case
where we *know* we are going to do a context switch.

In this case, it allows eliminating a few calls:
  isrsave=__arch_isrsave;
  memcpy(
    taskern->ctx_regsave,
    isrsave,
    __ARCH_SIZEOF_REGSAVE__);
  memcpy(
    isrsave,
    taskern2->ctx_regsave,
    __ARCH_SIZEOF_REGSAVE__);

Which generally ended up burning another ~ 500 clock cycles.

Note that at 50MHz, one would end up needing to invoke an ISR around
1000 times per second to hit 1%.

Though, with syscalls, it was a little worse. But the new interrupt type
has helped some.

Now, syscalls are just behind the timer interrupt (which is at around 1%
of the CPU, getting triggered at around 1000 times per second).

The TLB Miss ISR is < 0.1% of the time, mostly by averaging under 100
TLB misses per second.

Though, some of this does mean that, despite the BJX2 core running at
around 3x the clock-speed of an MSP430, I can't run the clock with a
32kHz timer interrupt without effectively eating the CPU.

So, this is one area where it seems like the MSP430 has an advantage...

> <snip>
>
>>> That's unnecessarily harsh. Consider that on Intel/AMD/ARM the kernel half
>>> of the virtual address space is shared by all processes - there's no reason
>>> that those entries need to be flushed on context-switch.
>>>
>>
>> AFAIK, on x86, automatic (near) full TLB flush on CR3 modification was
>> the defined behavior?... Well, at least ignoring the support for global
>> pages.
>
> Does x86 even tag the TLB entries with an ASID? I've been in ARMv8 land for the
> last decade.

It seems to have added "something" to support global pages, but doesn't
appear to use an ASID.

Re: Concertina II Progress

<uk112v$3k8d0$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35278&group=comp.arch#35278

Path: i2pn2.org!i2pn.org!news.chmurka.net!weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Concertina II Progress
Date: Sun, 26 Nov 2023 21:06:05 -0600
Organization: A noiseless patient Spider
Lines: 163
Message-ID: <uk112v$3k8d0$1@dont-email.me>
References: <uigus7$1pteb$1@dont-email.me>
<jwvedghqxha.fsf-monnier+comp.arch@gnu.org>
<6e757154a4c6aa8e070975c534c0fda8@news.novabbs.com>
<OSO7N.18767$_z86.2860@fx46.iad>
<e95ce095c8dccb161be2b4fad0697aea@news.novabbs.com>
<jwvwmu7ouzf.fsf-monnier+comp.arch@gnu.org> <0Pa8N.2233$PJoc.1323@fx04.iad>
<ujrnq0$2lqen$2@dont-email.me> <6Gp8N.22586$yAie.1862@fx44.iad>
<ujtei9$2t545$1@dont-email.me> <SVr8N.40009$zh16.19369@fx48.iad>
<ujtpjb$2unm2$1@dont-email.me> <Fhu8N.110789$BbXa.14700@fx16.iad>
<uk0ckk$3dmb4$1@dont-email.me> <uk0mla$3f3ju$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 27 Nov 2023 03:06:07 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="2fcebed1d03588c400a4bb58f77147d2";
logging-data="3809696"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+aFdpZjb28SM8YrM2uDVUO"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:bB2R93pm8TNbIBk/Td2XDCe2SPs=
In-Reply-To: <uk0mla$3f3ju$1@dont-email.me>
Content-Language: en-US
 by: BGB - Mon, 27 Nov 2023 03:06 UTC

On 11/26/2023 6:08 PM, Robert Finch wrote:
> On 2023-11-26 4:17 p.m., BGB wrote:
>> On 11/25/2023 4:10 PM, Scott Lurndal wrote:
>>> BGB <cr88192@gmail.com> writes:
>>>> On 11/25/2023 1:28 PM, Scott Lurndal wrote:
>>>
>>>>>
>>>>> If you're taking an interrupt, to resolve guest TLB misses,
>>>>> performance is clearly not high priority.
>>>>>
>>>>
>>>> If one can stay under, say, 100-500 TLB misses per second (on a 50MHz
>>>> CPU), then the cost of the TLB miss handling is on par with other
>>>> things
>>>> like handling the timer interrupt, etc...
>>>
>>> Any cycle used by the miss handler is a cycle that could
>>> have been used for useful work.   Timer interrupt handling
>>> is often very short (increment a memory location, a comparison
>>> and a return if no timer has expired).   And we're long
>>> past the days of using regular timer interrupts for scheduling
>>> (see tickless kernels, for example).
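
A minimal sketch of the kind of short tick handler described here, assuming a single pending deadline. The names tick_count, next_deadline, timer_arm and timer_tick are hypothetical and not taken from any particular kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static volatile uint64_t tick_count;          /* bumped once per interrupt */
static uint64_t next_deadline = UINT64_MAX;   /* earliest pending timer, if any */

/* Arm the (single, assumed) timer to expire a number of ticks from now. */
static void timer_arm(uint64_t ticks_from_now)
{
    assert(ticks_from_now > 0);
    next_deadline = tick_count + ticks_from_now;
}

/* Fast path of the tick handler: one increment and one comparison.
 * Returns true only when a timer has expired and slower work is needed. */
static bool timer_tick(void)
{
    uint64_t now = tick_count + 1;
    tick_count = now;
    return now >= next_deadline;
}
```

The point of the sketch is only that the handler body can be a handful of instructions; what it costs in practice depends on how much state the interrupt entry and exit code saves and restores around it.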
>>>
>>
>> It takes roughly as much time to service a timer interrupt as to
>> service a TLB miss...
>>
>> Much of the work in the time spent in the latter is saving/restoring
>> the relevant registers, with the actual page table walk and 'LDTLB'
>> instruction typically a fairly minor part in comparison...
>>
>> At least, excluding something like using B-Tree based page tables...
>>
>> It could be made faster, but would likely require doing the TLB miss
>> handler in ASM and only saving/restoring the minimum number of
>> registers (well, at least until we detect that there will be a
>> page-fault, which would still require falling back to a "more
>> comprehensive" handler).
>>
>>
>> Any L1 miss penalties from the page-walk itself would likely also
>> apply to a hardware page-walker.
>>
> A hardware table walker strikes me as not being a large component.
> Although untested yet, the Q+ table walker is only about 1,200 LUTs or
> 1% of the FPGA. Given the small size I think it is worth it to have the
> table walker in hardware. It is hard to beat hardware timing wise when
> it does not need to save / restore registers.
>

Possible, though, until TLB Miss exceeds ~ 1% or so, it isn't really a
huge priority either.

In my current cases, it is generally less than 0.1% of the CPU time, so
not yet a huge priority.

Vs, say:
~ 1% for the 1kHz timer interrupt
~ 0.6% for syscall (down from around 1.2%).

The optimization I had used for syscalls is mostly N/A for the timer
interrupt though.

Had considered inverted page tables as possible as well, but, making
this faster isn't (yet) a terribly high priority.

>>>
>>>>
>>>> But, what one does need, is a way to perform context switches without
>>>> also triggering a huge wave of TLB misses in the process.
>>>
>>> Why?
>>>
>>> Note that depending on the number of entries in your TLB
>>> and the scheduler behavior, it's unlikely that any prior
>>> TLB entries will be useful to a newly scheduled thread
>>> (in a different address space).
>>>
>>
>> I am mostly using a 256x 4-way TLB (so, 1024 TLBE's).
>> With a 16K page size, this is basically enough to keep roughly
>> something the size of the working set of Doom entirely in the TLB.
>>
>>
>> In my past experiments, 16K seemed to be the local optimum for the
>> programs tested:
>> 4K and 8K resulted in higher miss rates;
>> 32K and 64K resulted in more "internal fragmentation" without much
>> reduction in miss rate.
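
For a sense of scale, the TLB geometry given above (256 sets, 4 ways, 16K pages) can be written out as a short C sketch. The set-index function is an assumption: it simply uses the low bits of the virtual page number, whereas a real design may hash the address differently.

```c
#include <assert.h>
#include <stdint.h>

enum {
    TLB_SETS   = 256,  /* from the post: 256 sets */
    TLB_WAYS   = 4,    /* from the post: 4-way set-associative */
    PAGE_SHIFT = 14    /* 16K pages */
};

/* Memory covered when every TLB entry maps a distinct page, in bytes. */
static uint64_t tlb_reach_bytes(void)
{
    return (uint64_t)TLB_SETS * TLB_WAYS * ((uint64_t)1 << PAGE_SHIFT);
}

/* ASSUMPTION: the set is chosen by the low bits of the virtual page number. */
static unsigned tlb_set_index(uint64_t vaddr)
{
    assert((TLB_SETS & (TLB_SETS - 1)) == 0);  /* masking needs a power of two */
    return (unsigned)((vaddr >> PAGE_SHIFT) & (TLB_SETS - 1));
}
```

Under these numbers a full TLB covers 1024 pages of 16K each, or 16 MB.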
>>
>>
>>> Having multiple banks of TLBs that you can switch between
>>> might be able to provide you with the capability to
>>> reduce the TLB miss rate on scheduling a new thread of
>>> execution - but CAMs aren't cheap.
>>>
>>
>> This is why my TLB is 4-way set-associative.
>>
>> An 8-way TLB would be a lot more expensive, and a fully-associative
>> TLB (of nearly any non-trivial size) would be effectively implausible.
>>
>>
>>> For the most part, industry has settled on a large number
>>> of tagged TLB entries as a good compromise.   Some architectures have
>>> a global bit in the entry that can be set via the page
>>> table that indicates that ASID and/or VMID qualifications
>>> aren't necessary for a hit.
>>>
>>
>> Yeah.
>>
>> I guess a factor here is mostly defining rules to both allow for and
>> control the scope of global pages.
>>
>> In my case:
>>    The TTB register defines an ASID in the high order bits;
>>    The TLBE also has an ASID;
>>    The ASID is split into two parts (6 and 10 bits).
>>      In the ASID, 0 designates global pages
>>      But they are broken into "groups"
>>      So typically a global page is only shared within a given group.
>>
>> I am thinking the 6.10 split may have given too many bits to the
>> group, and 4.12 or 2.14 might have been better.
>>
>> As-is, say, ASID 03DE would be able to see global pages in 0000, but 045F
>> would not (but would see global pages in ASID 0400).
>>
>> So, say, in the current scheme:
>>    ASID's 0000, 0400, 0800, 0C00, ... would exist as mirrors of the
>> global address space.
>>
>>
>> Where, say, if during a TLB Miss, if a page is marked global, it can
>> be put into one of these ASIDs rather than the main ASID of the
>> current process (if not in an ASID range which disallows global pages).
>>
>> The size of the group will have an effect on miss rate in cases where
>> there are a lot of active PIDs though.
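
The group rule above can be sketched in C roughly as follows, assuming a 16-bit ASID with the 6-bit group in the high bits and the 10-bit per-process part in the low bits. The function names are hypothetical, and the sketch ignores the ASID ranges that disallow global pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ASID_LOCAL_BITS 10u                          /* the "10" of the 6.10 split */
#define ASID_LOCAL_MASK ((1u << ASID_LOCAL_BITS) - 1u)

/* ASID under which global pages of this ASID's group are stored
 * (low 10 bits cleared, e.g. 045F -> 0400). */
static uint16_t asid_global_of(uint16_t asid)
{
    uint16_t g = (uint16_t)(asid & ~ASID_LOCAL_MASK);
    assert((g & ASID_LOCAL_MASK) == 0);
    return g;
}

/* A TLB entry is usable if it belongs to the current ASID itself,
 * or to the global ASID of the current ASID's group. */
static bool tlbe_asid_matches(uint16_t tlbe_asid, uint16_t cur_asid)
{
    return tlbe_asid == cur_asid || tlbe_asid == asid_global_of(cur_asid);
}
```

With this rule, ASID 03DE matches entries tagged 0000, while 045F does not match 0000 but does match 0400, which agrees with the examples given above.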
>>
>>
>>>>
>>>> Big TLB + strategic sharing and ASIDs can help here at least
>>>> (whereas, a
>>>> full TLB flush on context-switch would suck pretty bad).
>>>
>>> That's unnecessarily harsh.   Consider that on Intel/AMD/ARM the
>>> kernel half
>>> of the virtual address space is shared by all processes - there's no
>>> reason
>>> that those entries need to be flushed on context-switch.
>>>
>>
>> AFAIK, on x86, automatic (near) full TLB flush on CR3 modification was
>> the defined behavior?... Well, at least ignoring the support for
>> global pages.
>>
>>
>

Re: Concertina II Progress

<Bj29N.28688$rx%7.1104@fx47.iad>

https://news.novabbs.org/devel/article-flat.php?id=35288&group=comp.arch#35288

Path: i2pn2.org!i2pn.org!newsfeed.endofthelinebbs.com!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx47.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Concertina II Progress
Newsgroups: comp.arch
References: <uigus7$1pteb$1@dont-email.me> <OSO7N.18767$_z86.2860@fx46.iad> <e95ce095c8dccb161be2b4fad0697aea@news.novabbs.com> <jwvwmu7ouzf.fsf-monnier+comp.arch@gnu.org> <0Pa8N.2233$PJoc.1323@fx04.iad> <ujrnq0$2lqen$2@dont-email.me> <6Gp8N.22586$yAie.1862@fx44.iad> <ujtei9$2t545$1@dont-email.me> <SVr8N.40009$zh16.19369@fx48.iad> <ujtpjb$2unm2$1@dont-email.me> <Fhu8N.110789$BbXa.14700@fx16.iad> <uk0ckk$3dmb4$1@dont-email.me> <zVP8N.30036$ayBd.10342@fx07.iad> <uk10cb$3k4os$1@dont-email.me>
Lines: 50
Message-ID: <Bj29N.28688$rx%7.1104@fx47.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Mon, 27 Nov 2023 15:10:25 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Mon, 27 Nov 2023 15:10:25 GMT
X-Received-Bytes: 2702
 by: Scott Lurndal - Mon, 27 Nov 2023 15:10 UTC

BGB <cr88192@gmail.com> writes:
>On 11/26/2023 4:46 PM, Scott Lurndal wrote:
>> BGB <cr88192@gmail.com> writes:
>>> On 11/25/2023 4:10 PM, Scott Lurndal wrote:
>>>> BGB <cr88192@gmail.com> writes:
>>>>> On 11/25/2023 1:28 PM, Scott Lurndal wrote:
>>>>
>>>>>>
>>>>>> If you're taking an interrupt, to resolve guest TLB misses,
>>>>>> performance is clearly not high priority.
>>>>>>
>>>>>
>>>>> If one can stay under, say, 100-500 TLB misses per second (on a 50MHz
>>>>> CPU), then the cost of the TLB miss handling is on par with other things
>>>>> like handling the timer interrupt, etc...
>>>>
>>>> Any cycle used by the miss handler is a cycle that could
>>>> have been used for useful work. Timer interrupt handling
>>>> is often very short (increment a memory location, a comparison
>>>> and a return if no timer has expired). And we're long
>>>> past the days of using regular timer interrupts for scheduling
>>>> (see tickless kernels, for example).
>>>>
>>>
>>> It takes roughly as much time to service a timer interrupt as to service
>>> a TLB miss...
>>
>> You'll need to provide more than an assertion for that.
>>
>
>If

Ah, speculation. Got it.

>the interrupt's save/restore prolog/epilog by itself burns ~ 500+
>cycles, then the time needed to do a few memory loads, some bit
>twiddling, and an LDTLB, mostly disappears in the noise...

Again, if.

>> Does x86 even tag the TLB entries with an ASID? I've been in ARMv8 land for the
>> last decade.
>
>It seems to have added "something" to support global pages, but doesn't
>appear to use an ASID.

They've had global pages since they introduced paging on the i386, IIRC.

Re: Concertina II Progress

<uk7rik$tu34$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35315&group=comp.arch#35315

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: quadibloc@servername.invalid (Quadibloc)
Newsgroups: comp.arch
Subject: Re: Concertina II Progress
Date: Wed, 29 Nov 2023 17:15:00 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 37
Message-ID: <uk7rik$tu34$1@dont-email.me>
References: <uigus7$1pteb$1@dont-email.me>
<uij9lt$3054t$1@newsreader4.netcologne.de> <uijjcd$2d9sp$1@dont-email.me>
<uijk93$2dc2i$2@dont-email.me> <uijr5g$2ep8o$1@dont-email.me>
<uire3v$7li2$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 29 Nov 2023 17:15:00 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="a2fc770091e794763b388bf76e4920c0";
logging-data="981092"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18KQ0PuIlkb3C1IA1Q65BE3V5wSvGP8b18="
User-Agent: Pan/0.146 (Hic habitat felicitas; d7a48b4
gitlab.gnome.org/GNOME/pan.git)
Cancel-Lock: sha1:I4y9vYQAzzuKF9cvWAim32b3gJs=
 by: Quadibloc - Wed, 29 Nov 2023 17:15 UTC

On Sun, 12 Nov 2023 20:55:27 +0000, Quadibloc wrote:

> I had tried, with all sorts of ingenious compromises of register spaces
> and the like, to fit all the capabilities I wanted into the opcode space
> of a single version of the instruction set, eliminating the need for
> blocks which contained instructions belonging to alternate versions of
> the instruction set.
>
> But if the 16-bit instructions I'm making room for are useless to
> compilers, that's questionable.
>
> At first, when I mulled over this, I came up with multiple ideas to
> address it, each one crazier than the last.
>
> Seeing, therefore, that this was a difficult nut to crack, and not
> wanting to go down in another wrong direction... instead, I found a way
> to go that seemed to me to be reasonably sensible.
>
> Go back to uncompromised 32-bit instructions, even though that means
> there are no 16-bit instructions.
>
> Then, bring back short instructions - effectively 17 bits long - so as
> to have room for full register specifications. This means an alternative
> block format where 16, 32, 48, 64... bit instructions are all possible.
>
> *But* because of the room 17-bit short instructions take up in the
> header, the 32-bit instructions are the same regular format as in the
> other case. Not some kind of 33-bit or 35-bit instruction with a new set
> of instruction formats.

I have now modified the 17-bit shift instructions in the diagram, so that
they can also apply to all 32 integer registers, and I have corrected the
opcodes on the page

http://www.quadibloc.com/arch/cw0101.htm

John Savard

Re: Concertina II Progress

<ukact0$1e539$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35316&group=comp.arch#35316

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: robfi680@gmail.com (Robert Finch)
Newsgroups: comp.arch
Subject: Re: Concertina II Progress
Date: Thu, 30 Nov 2023 11:22:55 -0500
Organization: A noiseless patient Spider
Lines: 44
Message-ID: <ukact0$1e539$1@dont-email.me>
References: <uigus7$1pteb$1@dont-email.me>
<uij9lt$3054t$1@newsreader4.netcologne.de> <uijjcd$2d9sp$1@dont-email.me>
<uijk93$2dc2i$2@dont-email.me> <uijr5g$2ep8o$1@dont-email.me>
<uire3v$7li2$1@dont-email.me> <uk7rik$tu34$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 30 Nov 2023 16:22:56 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="108932fdbdfcb95c13135e42463893dc";
logging-data="1512553"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+oBuFPNLCzdg1YIF7L9TmO8ujsZ5QlWkc="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:nAvGLvZ3w3+73fm8Z1zSm8q10RE=
Content-Language: en-US
In-Reply-To: <uk7rik$tu34$1@dont-email.me>
 by: Robert Finch - Thu, 30 Nov 2023 16:22 UTC

On 2023-11-29 12:15 p.m., Quadibloc wrote:
> On Sun, 12 Nov 2023 20:55:27 +0000, Quadibloc wrote:
>
>> I had tried, with all sorts of ingenious compromises of register spaces
>> and the like, to fit all the capabilities I wanted into the opcode space
>> of a single version of the instruction set, eliminating the need for
>> blocks which contained instructions belonging to alternate versions of
>> the instruction set.
>>
>> But if the 16-bit instructions I'm making room for are useless to
>> compilers, that's questionable.
>>
>> At first, when I mulled over this, I came up with multiple ideas to
>> address it, each one crazier than the last.
>>
>> Seeing, therefore, that this was a difficult nut to crack, and not
>> wanting to go down in another wrong direction... instead, I found a way
>> to go that seemed to me to be reasonably sensible.
>>
>> Go back to uncompromised 32-bit instructions, even though that means
>> there are no 16-bit instructions.
>>
>> Then, bring back short instructions - effectively 17 bits long - so as
>> to have room for full register specifications. This means an alternative
>> block format where 16, 32, 48, 64... bit instructions are all possible.
>>
>> *But* because of the room 17-bit short instructions take up in the
>> header, the 32-bit instructions are the same regular format as in the
>> other case. Not some kind of 33-bit or 35-bit instruction with a new set
>> of instruction formats.
>
> I have now modified the 17-bit shift instructions in the diagram, so that
> they can also apply to all 32 integer registers, and I have corrected the
> opcodes on the page
>
> http://www.quadibloc.com/arch/cw0101.htm
>
> John Savard
Having a look at the Concertina II ISA. I like the idea of
pseudo-immediates. All the immediates could be moved to one end of the
block and then skipped over during instruction fetch. Thinking about
incorporating this idea for Q+. Not sure about the appeal of blocks. PIC
would need to be relocated in terms of blocks. I prefer byte-addressed
relocation.

Re: Concertina II Progress

<ukc34t$1po20$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35342&group=comp.arch#35342

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: quadibloc@servername.invalid (Quadibloc)
Newsgroups: comp.arch
Subject: Re: Concertina II Progress
Date: Fri, 1 Dec 2023 07:48:45 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 42
Message-ID: <ukc34t$1po20$1@dont-email.me>
References: <uigus7$1pteb$1@dont-email.me>
<uij9lt$3054t$1@newsreader4.netcologne.de> <uijjcd$2d9sp$1@dont-email.me>
<uijk93$2dc2i$2@dont-email.me> <uijr5g$2ep8o$1@dont-email.me>
<uire3v$7li2$1@dont-email.me> <uk7rik$tu34$1@dont-email.me>
<ukact0$1e539$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 1 Dec 2023 07:48:45 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="8d1dc893abdbae95d1078f143ae209ee";
logging-data="1892416"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/zDoJix9823GVkrjrk+3epsq8TSzBSbd4="
User-Agent: Pan/0.146 (Hic habitat felicitas; d7a48b4
gitlab.gnome.org/GNOME/pan.git)
Cancel-Lock: sha1:db1QrXB877khQH2prcJkSAHpQuY=
 by: Quadibloc - Fri, 1 Dec 2023 07:48 UTC

On Thu, 30 Nov 2023 11:22:55 -0500, Robert Finch wrote:

> Having a look at the Concertina II ISA. I like the idea of
> pseudo-immediates. All the immediates could be moved to one end of the
> block and then skipped over during instruction fetch.

That is the general idea, with one minor correction.

The benefit of pseudo-immediates, like that of ordinary immediates,
is that they're already available, because they were brought into the
CPU by instruction fetch.

They get skipped over by the _next_ step, instruction decode.

Why a block structure? The goal is to have a situation where
instruction decode is largely done in parallel for the whole
block.

The first step is - is there a header? If not, decode all eight
32-bit instructions in the block in parallel.

If so, process the header, and that will directly and immediately
reveal where every instruction in the block begins, so again the
next step has all the instructions being decoded in parallel.

The header allows the length that immediates would add to instructions
to be in the pseudo-immediates instead, avoiding another potential
complication to instruction decoding.

In addition, having headers means that the instruction set can be
expanded or made flexible without it being possible to change the
mode of the CPU to cause it to read existing instruction code the
wrong way. Any modifications to how instructions are to be interpreted
are right there in the block header, so malware that can't alter
code can't work around that by changing how it is to be read.

Among the features the headers allow to be added are VLIW features,
such as instruction predication and explicitly indicating which
instructions can execute in parallel. This allows high-performance
but lightweight (non-OoO) implementations if desired.

John Savard

Re: Concertina II Progress

<1949acd069b7c93db910f3c0357a0298@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=35344&group=comp.arch#35344

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Concertina II Progress
Date: Fri, 1 Dec 2023 18:37:17 +0000
Organization: novaBBS
Message-ID: <1949acd069b7c93db910f3c0357a0298@news.novabbs.com>
References: <uigus7$1pteb$1@dont-email.me> <uij9lt$3054t$1@newsreader4.netcologne.de> <uijjcd$2d9sp$1@dont-email.me> <uijk93$2dc2i$2@dont-email.me> <uijr5g$2ep8o$1@dont-email.me> <uire3v$7li2$1@dont-email.me> <uk7rik$tu34$1@dont-email.me> <ukact0$1e539$1@dont-email.me> <ukc34t$1po20$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="2734798"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
X-Rslight-Site: $2y$10$arosqxkoYCmaZVfTTG2dR.Ud8ZRK6ElDdYD9Vk.rgtlCd1THB0ySS
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
 by: MitchAlsup - Fri, 1 Dec 2023 18:37 UTC

Quadibloc wrote:

> On Thu, 30 Nov 2023 11:22:55 -0500, Robert Finch wrote:

>> Having a look at the Concertina II ISA. I like the idea of
>> pseudo-immediates. All the immediates could be moved to one end of the
>> block and then skipped over during instruction fetch.

> That is the general idea, with one minor correction.

> The benefit of pseudo-immediates, like that of ordinary immediates,
> is that they're already available, because they were brought into the
> CPU by instruction fetch.

> They get skipped over by the _next_ step, instruction decode.

> Why a block structure? The goal is to have a situation where
> instruction decode is largely done in parallel for the whole
> block.

What if you had the advantages of the block header without the
cost of the block header ??

> The first step is - is there a header? If not, decode all eight
> 32-bit instructions in the block in parallel.

Why not decode assuming there is a block header and also decode as
if there were not a block header? Then you can multiplex (choose)
later which one prevails. This puts the choice at least 4 gates
of delay into the decode cycle.

> If so, process the header, and that will directly and immediately
> reveal where every instruction in the block begins, so again the
> next step has all the instructions being decoded in parallel.

You then have to route the instructions to the decoders. Are your
decoders expensive enough in a wide implementation that this matters?
The alternative is to have a no-header decoder running in parallel
with a header decoder and choose which to use.
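
A software analogy of that "decode both ways, select late" idea, as a sketch only. Everything about the header encoding here is invented for illustration (the top-nibble test in has_header and the 7-bit start mask in decode_with_header are not taken from the Concertina II description); only the eight 32-bit slots per block come from the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WORDS_PER_BLOCK 8            /* eight 32-bit slots per block */

typedef struct {
    int count;                       /* number of instructions found */
    int start[WORDS_PER_BLOCK];      /* word offset where each one begins */
} decode_result;

/* HYPOTHETICAL header test: top 4 bits of word 0 are all ones. */
static bool has_header(const uint32_t *blk)
{
    return (blk[0] >> 28) == 0xFu;
}

/* Path A: no header, so every word is one 32-bit instruction. */
static decode_result decode_no_header(const uint32_t *blk)
{
    decode_result r = { 0, { 0 } };
    (void)blk;
    for (int i = 0; i < WORDS_PER_BLOCK; i++)
        r.start[r.count++] = i;
    return r;
}

/* Path B: word 0 is a header. ASSUMED encoding: bit (i-1) of its low
 * 7 bits is set when word i begins an instruction, for i = 1..7. */
static decode_result decode_with_header(const uint32_t *blk)
{
    decode_result r = { 0, { 0 } };
    uint32_t mask = blk[0] & 0x7Fu;
    for (int i = 1; i < WORDS_PER_BLOCK; i++)
        if (mask & (1u << (i - 1)))
            r.start[r.count++] = i;
    return r;
}

/* Both paths are computed unconditionally; the header test only selects
 * between the two finished results, like a multiplexer after the decoders. */
static decode_result decode_block(const uint32_t *blk)
{
    assert(blk != 0);
    decode_result a = decode_no_header(blk);
    decode_result b = decode_with_header(blk);
    return has_header(blk) ? b : a;
}
```

In C the two paths run one after the other, so this only shows the data flow. In hardware they would be separate decoders working in the same cycle, with the select adding a few gate delays at the end.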

> The header allows the length that immediates would add to instructions
> to be in the pseudo-immediates instead, avoiding another potential
> complication to instruction decoding.

> In addition, having headers means that the instruction set can be
> expanded or made flexible without it being possible to change the
> mode of the CPU to cause it to read existing instruction code the
> wrong way. Any modifications to how instructions are to be interpreted
> are right there in the block header, so malware that can't alter
> code can't work around that by changing how it is to be read.

You MAY be able to alter the headers later in the architecture's life,
but ultimately you sacrifice forward compatibility.

> Among the features the headers allow to be added are VLIW features,

Why would you want this ??

> such as instruction predication and explicitly indicating which
> instructions can execute in parallel.

HW does not seem to have much trouble doing this already.

> This allows high-performance
> but lightweight (non-OoO) implementations if desired.

Have any GBnOoO machines been successful ?

> John Savard

