
Subject  Author
* Interrupts on the Concertina II  Quadibloc
+* Re: Interrupts on the Concertina II  Scott Lurndal
|`- Re: Interrupts on the Concertina II  David Brown
+* Re: Interrupts on the Concertina II  EricP
|+* Re: Interrupts on the Concertina II  MitchAlsup1
||+- Re: Interrupts on the Concertina II  Scott Lurndal
||`- Re: Interrupts on the Concertina II  EricP
|+* Re: Interrupts on the Concertina II  Quadibloc
||`* Re: Interrupts on the Concertina II  EricP
|| `* Re: Interrupts on the Concertina II  MitchAlsup1
||  `- Re: Interrupts on the Concertina II  EricP
|`* Re: Interrupts on the Concertina II  Quadibloc
| +* Re: Interrupts on the Concertina II  BGB
| |`* Re: Interrupts on the Concertina II  MitchAlsup1
| | +- Re: Interrupts on the Concertina II  Scott Lurndal
| | `* Re: Interrupts on the Concertina II  BGB
| |  `* Re: Interrupts on the Concertina II  MitchAlsup1
| |   +- Re: Interrupts on the Concertina II  BGB
| |   `* Re: Interrupts on the Concertina II  Scott Lurndal
| |    +* Re: Interrupts on the Concertina II  MitchAlsup1
| |    |`* Re: Interrupts on the Concertina II  Scott Lurndal
| |    | `* Re: Interrupts on the Concertina II  MitchAlsup1
| |    |  `- Re: Interrupts on the Concertina II  Scott Lurndal
| |    `* Re: Interrupts on the Concertina II  BGB
| |     +* Re: Interrupts on the Concertina II  MitchAlsup1
| |     |+* Re: Interrupts on the Concertina II  BGB-Alt
| |     ||`- Re: Interrupts on the Concertina II  Chris M. Thomasson
| |     |`* Re: Interrupts on the Concertina II  EricP
| |     | `* Re: Interrupts on the Concertina II  MitchAlsup1
| |     |  `* Re: Interrupts on the Concertina II  EricP
| |     |   `* Re: Interrupts on the Concertina II  MitchAlsup1
| |     |    `* Re: Page tables and TLBs [was Interrupts on the Concertina II]  EricP
| |     |     `* Re: Page tables and TLBs [was Interrupts on the Concertina II]  MitchAlsup1
| |     |      `* Re: Page tables and TLBs [was Interrupts on the Concertina II]  EricP
| |     |       `* Re: Page tables and TLBs [was Interrupts on the Concertina II]  MitchAlsup1
| |     |        `* Re: Page tables and TLBs [was Interrupts on the Concertina II]  EricP
| |     |         `- Re: Page tables and TLBs [was Interrupts on the Concertina II]  MitchAlsup1
| |     `- Re: Interrupts on the Concertina II  Scott Lurndal
| `- Re: Interrupts on the Concertina II  MitchAlsup1
`* Re: Interrupts on the Concertina II  MitchAlsup1
 `* Re: Interrupts on the Concertina II  Chris M. Thomasson
  `* Re: Interrupts on the Concertina II  MitchAlsup1
   +* Re: Interrupts on the Concertina II  Chris M. Thomasson
   |+- Re: Interrupts on the Concertina II  Chris M. Thomasson
   |`- Re: Interrupts on the Concertina II  MitchAlsup1
   `* Re: Interrupts on the Concertina II  EricP
    +- Re: Interrupts on the Concertina II  Chris M. Thomasson
    `* Re: Interrupts on the Concertina II  MitchAlsup1
     +- Re: Interrupts on the Concertina II  Chris M. Thomasson
     +* Re: Interrupts on the Concertina II  Chris M. Thomasson
     |`* Re: Interrupts on the Concertina II  MitchAlsup1
     | `* Re: Interrupts on the Concertina II  Chris M. Thomasson
     |  `* Re: Interrupts on the Concertina II  Chris M. Thomasson
     |   `- Re: Interrupts on the Concertina II  Chris M. Thomasson
     `* Re: Interrupts on the Concertina II  EricP
      +- Re: Interrupts on the Concertina II  MitchAlsup1
      `- Re: Interrupts on the Concertina II  Chris M. Thomasson

Re: Interrupts on the Concertina II

https://news.novabbs.org/devel/article-flat.php?id=37117&group=comp.arch#37117

From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Date: Thu, 25 Jan 2024 17:01:24 +0000
Subject: Re: Interrupts on the Concertina II
Message-ID: <86c221cf69de29d05f45cdcf8716dbcf@www.novabbs.org>

Scott Lurndal wrote:

> mitchalsup@aol.com (MitchAlsup1) writes:
>>Scott Lurndal wrote:
>>

>>> Or restricted to a specific set of cores (i.e. those currently
>>> owned by the target guest).
>>
>>Even that gets tricky when you (or the OS) virtualizes cores.

> Oh, indeed. It's helpful to have good hardware support. The
> ARM GIC, for example, helps eliminate hypervisor interaction
> during normal guest interrupt handling (aside from scheduling the
> guest on a host core).

>>
>>In my case, the interrupt controller merely sets bits in the interrupt
>>table, the watching cores watch for changes to its pending interrupt
>>register (64-bits). Said messages come up from PCIe as MSI-X messages,

> The interrupt space for MSI-X messages is 32-bits. Implementations
> may support fewer than 2**32 interrupts - ours support 2**24 distinct
> interrupt vectors.

My 66000 supports 2^16 tables of 2^15 distinct interrupts (non vectored)
per table.

>>and are directed to the interrupt controller over in the Memory Controller
>>(L3).
>>
>>> Dealing with inter-processor interrupts in a multicore guest can also
>>> be tricky;
>>
>>Core sends MSI-X message to interrupt controller and the rest happens
>>no different than a device interrupt.

> Not necessarily, particularly if the guest isn't resident on any
> core at the time the interrupt is received.

When an interrupt is registered (recognized as raised and enabled)
and the receiving GuestOS is not running on any core, the interrupt
remains pending until some core context switches to that GuestOS.

>>
>>>>I presume the core that services the interrupt (ISR) is running the same
>>>>GuestOS under the same HyperVisor that initiated the device.
>>
>>> Generally a safe assumption. Note that the guest core may not be
>>> resident on any physical core when the guest interrupt arrives.
>>
>>Which is why its table has to be present at all times--even if the threads
>>are not. When one or more threads from that GuestOS are activated, the
>>pending interrupt will be serviced.

> Yes, but the hypervisor needs to be notified by the hardware when the table
> is updated and the target guest VCPU isn't currently scheduled
> on any core so that it can decide to schedule the guest (which may,
> for instance, have been parked because it executed a WFI, PAUSE
> or MWAIT instruction).

Re: Page tables and TLBs [was Interrupts on the Concertina II]

https://news.novabbs.org/devel/article-flat.php?id=37118&group=comp.arch#37118

From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Date: Thu, 25 Jan 2024 17:12:42 +0000
Subject: Re: Page tables and TLBs [was Interrupts on the Concertina II]
Message-ID: <f46ba0ef0a4316ea2214a1a8a7894571@www.novabbs.org>

EricP wrote:

> MitchAlsup1 wrote:
>> EricP wrote:
>>
>>> MitchAlsup1 wrote:
>>>> EricP wrote:
>>>>
>>>>> One accomplishes the same effect by caching the interior PTE nodes
>>>>> for each of the HV and GuestOS tables separately on the downward walk,
>>>>> and holding the combined nested table mapping in the TLB.
>>>>> The bottom-up table walkers on each interior PTE cache should
>>>>> eliminate 98% of the PTE reads with none of the headaches.
>>>>
>>>> I call these things:: TableWalk Accelerators.
>>>>
>>>> Given CAMs at your access, one can cache the outer layers and short
>>>> circuit most of the MMU accesses--such that you don't simply read the
>>>> Accelerator RAM 25 times (two 5-level tables), you CAM down both
>>>> GuestOS and HV tables so only walk the parts not in your CAM. {And
>>>> then put them in your CAM.} A Density trick is for each CAM to have
>>>> access to a whole cache line of PTEs (8 in my case).
>>
>>> An idea I had here was to allow the OS more explicit control
>>> for the invalidates of the interior nodes caches.
>>
>> The interior nodes, stored in the CAM, retain their physical address, and
>> are snooped, so no invalidation is required. ANY write to them is seen and
>> the entry invalidates itself.

> On My66000, yes, but other cores don't have automatically coherent TLBs.
> This feature is intended for that general rabble.

> Just to play devil's advocate...

> To snoop page table updates My66000 TLB would need a large CAM with all
> the physical addresses of the PTE's source cache lines parallel to the
> virtual and ASID CAM's, and route the cache line invalidates through it.

Yes, .....

> While the virtual index CAM's are separated in different banks,
> one for each page table level, the P.A. CAM is for all entries in all banks.
> This extra P.A. CAM will have a lot of entries and therefore be slow.

That is the TWA.

> Also routing the Invalidate messages through the TLB could slow down all
> their ACK messages even though there is very low probability of a hit
> because page tables update relatively infrequently.

TLBs don't ACK; they self-invalidate. And they can be performing a translation
while self-invalidating.

> Also the L2-TLB's, called the STLB for Second-level TLB by Intel,
> are set assoc., and would have to be virtually indexed and virtually
> tagged with both VA and ASID plus table level to select address mask.
> On Skylake the STLB for 4k/2M pages is 128-rows*12-way, 1G is 4-rows*4-way.

All TLB walks are performed with RealPA.
All Snoops are performed with RealPA.

> How can My66000 look up STLB entries by invalidate physical line address?
> It would have to scan all 128 rows for each message.

It is not structured like Intel L2 TLB.

Re: Interrupts on the Concertina II

https://news.novabbs.org/devel/article-flat.php?id=37121&group=comp.arch#37121

From: scott@slp53.sl.home (Scott Lurndal)
Newsgroups: comp.arch
Date: Thu, 25 Jan 2024 17:48:00 GMT
Subject: Re: Interrupts on the Concertina II
Message-ID: <k9xsN.122719$q3F7.12879@fx45.iad>

mitchalsup@aol.com (MitchAlsup1) writes:

>> Not necessarily, particularly if the guest isn't resident on any
>> core at the time the interrupt is received.
>
>When an interrupt is registered (recognized as raised and enabled)
>and the receiving GuestOS is not running on any core, the interrupt
>remains pending until some core context switches to that GuestOS.

It is useful to notify the hypervisor of that, so that it can
schedule the guest.

>> Yes, but the hypervisor needs to be notified by the hardware when the table
>> is updated and the target guest VCPU isn't currently scheduled
>> on any core so that it can decide to schedule the guest (which may,
>> for instance, have been parked because it executed a WFI, PAUSE
>> or MWAIT instruction).

Re: Page tables and TLBs [was Interrupts on the Concertina II]

https://news.novabbs.org/devel/article-flat.php?id=37129&group=comp.arch#37129

From: ThatWouldBeTelling@thevillage.com (EricP)
Newsgroups: comp.arch
Date: Fri, 26 Jan 2024 10:47:38 -0500
Subject: Re: Page tables and TLBs [was Interrupts on the Concertina II]
Message-ID: <BvQsN.271760$Wp_8.43879@fx17.iad>

MitchAlsup1 wrote:
> EricP wrote:
>
>> MitchAlsup1 wrote:
>>> EricP wrote:
>>>
>>>> MitchAlsup1 wrote:
>>>>> EricP wrote:
>>>>>
>>>>>> One accomplishes the same effect by caching the interior PTE nodes
>>>>>> for each of the HV and GuestOS tables separately on the downward
>>>>>> walk,
>>>>>> and hold the combined nested table mapping in the TLB.
>>>>>> The bottom-up table walkers on each interior PTE cache should
>>>>>> eliminate 98% of the PTE reads with none of the headaches.
>>>>>
>>>>> I call these things:: TableWalk Accelerators.
>>>>>
>>>>> Given CAMs at your access, one can cache the outer layers and short
>>>>> circuit most of the MMU accesses--such that you don't simply read
>>>>> the Accelerator RAM 25 times (two 5-level tables), you CAM down both
>>>>> GuestOS and HV tables so only walk the parts not in your CAM. {And
>>>>> then put them in your CAM.} A Density trick is for each CAM to have
>>>>> access to a whole cache line of PTEs (8 in my case).
>>>
>>>> An idea I had here was to allow the OS more explicit control
>>>> for the invalidates of the interior nodes caches.
>>>
>>> The interior nodes, stored in the CAM, retain their physical address,
>>> and
>>> are snooped, so no invalidation is required. ANY write to them is
>>> seen and
>>> the entry invalidates itself.
>
>> On My66000, but other cores don't have automatically coherent TLB's.
>> This feature is intended for that general rabble.
>
>> Just to play devil's advocate...
>
>> To snoop page table updates My66000 TLB would need a large CAM with all
>> the physical addresses of the PTE's source cache lines parallel to the
>> virtual and ASID CAM's, and route the cache line invalidates through it.
>
> Yes, .....
>
>> While the virtual index CAM's are separated in different banks,
>> one for each page table level, the P.A. CAM is for all entries in all
>> banks.
>> This extra P.A. CAM will have a lot of entries and therefore be slow.
>
> That is the TWA.

No, not the Table Walk Accelerator. I'm thinking the PA CAM would
only need to be accessed for cache line invalidates. However it would be
very inconvenient if the TLB CAMs had faster access time for virtual
address lookups than for physical address lookups, so the access time
would be the longer of the two, that being PA.

Basically I'm suggesting the big PA CAM slows down VA translates
and therefore possibly all memory accesses.

>> Also routing the Invalidate messages through the TLB could slow down all
>> their ACK's messages even though there is very low probability of a hit
>> because page tables update relatively infrequently.
>
> TLBs don't ACK they self-invalidate. And they can be performing a
> translation
> while self-invalidating.

Hmmm... Danger Will Robinson. Most OS page table management depends on
synchronizing after the shootdowns complete on all affected cores.

The basic safe sequence is:
- acquire page table mutex
- modify PTE in memory for a VA
- issue IPI's with VA to all cores that might have a copy in TLB
- invalidate local TLB for VA
- wait for IPI ACK's from remote cores
- release mutex

If it doesn't wait for shootdown ACKs then it might be possible for a
stale PTE copy to exist and be used after the mutex is released.

>> Also the L2-TLB's, called the STLB for Second-level TLB by Intel,
>> are set assoc., and would have to be virtually indexed and virtually
>> tagged with both VA and ASID plus table level to select address mask.
>> On Skylake the STLB for 4k/2M pages is 128-rows*12-way, 1G is
>> 4-rows*4-way.
>
> All TLB walks are performed with RealPA.
> All Snoops are performed with RealPA
>
>> How can My66000 look up STLB entries by invalidate physical line address?
>> It would have to scan all 128 rows for each message.
>
> It is not structured like Intel L2 TLB.

Are you saying the My66000 STLB is physically indexed, physically tagged?
How does this work for a bottom-up table walk (aka your TWA)?

The only way I know to do a bottom-up walk is to use the portion of the
VA for the higher index level to get the PA of the page table page.
That requires lookup by a masked portion of the VA with the ASID.
The bottom-up walk is done by making the VA mask shorter for each level.
This implies a Virtually Indexed Virtually Tagged PTE cache.

The VIVT PTE cache implies that certain TLB VA or ASID invalidates require
a full STLB table scan which could be up to 128 clocks for that instruction.

Re: Page tables and TLBs [was Interrupts on the Concertina II]

https://news.novabbs.org/devel/article-flat.php?id=37132&group=comp.arch#37132

From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Date: Fri, 26 Jan 2024 21:43:21 +0000
Subject: Re: Page tables and TLBs [was Interrupts on the Concertina II]
Message-ID: <dc4032aabfd00b917ac0fb83f12d70bf@www.novabbs.org>

EricP wrote:

> MitchAlsup1 wrote:
>> EricP wrote:
>>
>>> MitchAlsup1 wrote:
>>>> EricP wrote:
>>>>
>>>>> MitchAlsup1 wrote:
>>>>>> EricP wrote:
>>>>>>
>>>>>>> One accomplishes the same effect by caching the interior PTE nodes
>>>>>>> for each of the HV and GuestOS tables separately on the downward
>>>>>>> walk,
>>>>>>> and hold the combined nested table mapping in the TLB.
>>>>>>> The bottom-up table walkers on each interior PTE cache should
>>>>>>> eliminate 98% of the PTE reads with none of the headaches.
>>>>>>
>>>>>> I call these things:: TableWalk Accelerators.
>>>>>>
>>>>>> Given CAMs at your access, one can cache the outer layers and short
>>>>>> circuit most of the MMU accesses--such that you don't simply read
>>>>>> the Accelerator RAM 25 times (two 5-level tables), you CAM down both
>>>>>> GuestOS and HV tables so only walk the parts not in your CAM. {And
>>>>>> then put them in your CAM.} A Density trick is for each CAM to have
>>>>>> access to a whole cache line of PTEs (8 in my case).
>>>>
>>>>> An idea I had here was to allow the OS more explicit control
>>>>> for the invalidates of the interior nodes caches.
>>>>
>>>> The interior nodes, stored in the CAM, retain their physical address,
>>>> and
>>>> are snooped, so no invalidation is required. ANY write to them is
>>>> seen and
>>>> the entry invalidates itself.
>>
>>> On My66000, but other cores don't have automatically coherent TLB's.
>>> This feature is intended for that general rabble.
>>
>>> Just to play devil's advocate...
>>
>>> To snoop page table updates My66000 TLB would need a large CAM with all
>>> the physical addresses of the PTE's source cache lines parallel to the
>>> virtual and ASID CAM's, and route the cache line invalidates through it.
>>
>> Yes, .....
>>
>>> While the virtual index CAM's are separated in different banks,
>>> one for each page table level, the P.A. CAM is for all entries in all
>>> banks.
>>> This extra P.A. CAM will have a lot of entries and therefore be slow.
>>
>> That is the TWA.

> No, not the Table Walk Accelerator. I'm thinking the PA CAM would
> only need to be accessed for cache line invalidates. However it would be
> very inconvenient if the TLB CAMs had faster access time for virtual
> address lookups than for physical address lookups, so the access time
> would be the longer of the two, that being PA.

      VA                             PA
      |                              |
      V                              V
+-----------+  +-----------+-+  +-----------+
|  VA CAM   |->|   PTEs    |v|<-|  PA CAM   |
+-----------+  +-----------+-+  +-----------+

> Basically I'm suggesting the big PA CAM slows down VA translates
> and therefore possibly all memory accesses.

It is a completely independent and concurrent hunk of logic that only has
access to the valid bit and can only clear the valid bit.

>>> Also routing the Invalidate messages through the TLB could slow down all
>>> their ACK's messages even though there is very low probability of a hit
>>> because page tables update relatively infrequently.
>>
>> TLBs don't ACK they self-invalidate. And they can be performing a
>> translation
>> while self-invalidating.

> Hmmm... Danger Will Robinson. Most OS page table management depends on
> synchronizing after the shootdowns complete on all affected cores.

> The basic safe sequence is:
> - acquire page table mutex
> - modify PTE in memory for a VA

Here you have obtained write permission on the line containing the PTE
being modified. By the time you have obtained write permission, all
other TLBs will have been invalidated.

> - issue IPI's with VA to all cores that might have a copy in TLB
> - invalidate local TLB for VA
> - wait for IPI ACK's from remote cores
> - release mutex

> If it doesn't wait for shootdown ACKs then it might be possible for a
> stale PTE copy to exist and be used after the mutex is released.

Race condition does not exist. By the time the core modifying the PTE
obtains write permission, all the TLBs have been cleared of that entry.

>>> Also the L2-TLB's, called the STLB for Second-level TLB by Intel,
>>> are set assoc., and would have to be virtually indexed and virtually
>>> tagged with both VA and ASID plus table level to select address mask.
>>> On Skylake the STLB for 4k/2M pages is 128-rows*12-way, 1G is
>>> 4-rows*4-way.
>>
>> All TLB walks are performed with RealPA.
>> All Snoops are performed with RealPA
>>
>>> How can My66000 look up STLB entries by invalidate physical line address?
>>> It would have to scan all 128 rows for each message.
>>
>> It is not structured like Intel L2 TLB.

> Are you saying the My66000 STLB is physically indexed, physically tagged?
> How does this work for a bottom-up table walk (aka your TWA)?

L2 TLB is a different structure (SRAM) than TWAs (CAM).
I can't talk about it:: as Ivan used to say:: NYF.

> The only way I know to do a bottom-up walk is to use the portion of the
> VA for the higher index level to get the PA of the page table page.

I <actually> did not say I did a bottom-up walk. I said I short-circuited
the table walks for those layers I have recent translation PTPs. It's more
like CAM from the Root to the last PTP, and every CAM that hits is one layer
you don't have to access.

> That requires lookup by a masked portion of the VA with the ASID.
> The bottom-up walk is done by making the VA mask shorter for each level.
> This implies a Virtually Indexed Virtually Tagged PTE cache.

> The VIVT PTE cache implies that certain TLB VA or ASID invalidates require
> a full STLB table scan which could be up to 128 clocks for that instruction.

Re: Page tables and TLBs [was Interrupts on the Concertina II]

https://news.novabbs.org/devel/article-flat.php?id=37141&group=comp.arch#37141

From: ThatWouldBeTelling@thevillage.com (EricP)
Newsgroups: comp.arch
Date: Sun, 28 Jan 2024 13:35:20 -0500
Subject: Re: Page tables and TLBs [was Interrupts on the Concertina II]
Message-ID: <I7xtN.301650$c3Ea.207523@fx10.iad>

MitchAlsup1 wrote:
> EricP wrote:
>
>> MitchAlsup1 wrote:
>>> EricP wrote:
>>>
>>>> MitchAlsup1 wrote:
>>>>> EricP wrote:
>>>>>
>>>>>> MitchAlsup1 wrote:
>>>>>>> EricP wrote:
>>>>>>>
>>>>>>>> One accomplishes the same effect by caching the interior PTE nodes
>>>>>>>> for each of the HV and GuestOS tables separately on the downward
>>>>>>>> walk,
>>>>>>>> and hold the combined nested table mapping in the TLB.
>>>>>>>> The bottom-up table walkers on each interior PTE cache should
>>>>>>>> eliminate 98% of the PTE reads with none of the headaches.
>>>>>>>
>>>>>>> I call these things:: TableWalk Accelerators.
>>>>>>>
>>>>>>> Given CAMs at your access, one can cache the outer layers and short
>>>>>>> circuit most of the MMU accesses--such that you don't simply read
>>>>>>> the Accelerator RAM 25 times (two 5-level tables), you CAM down both
>>>>>>> GuestOS and HV tables so only walk the parts not in your CAM. {And
>>>>>>> then put them in your CAM.} A Density trick is for each CAM to have
>>>>>>> access to a whole cache line of PTEs (8 in my case).
>>>>>
>>>>>> An idea I had here was to allow the OS more explicit control
>>>>>> for the invalidates of the interior nodes caches.
>>>>>
>>>>> The interior nodes, stored in the CAM, retain their physical
>>>>> address, and
>>>>> are snooped, so no invalidation is required. ANY write to them is
>>>>> seen and
>>>>> the entry invalidates itself.
>>>
>>>> On My66000, but other cores don't have automatically coherent TLB's.
>>>> This feature is intended for that general rabble.
>>>
>>>> Just to play devil's advocate...
>>>
>>>> To snoop page table updates My66000 TLB would need a large CAM with all
>>>> the physical addresses of the PTE's source cache lines parallel to the
>>>> virtual and ASID CAM's, and route the cache line invalidates through
>>>> it.
>>>
>>> Yes, .....
>>>
>>>> While the virtual index CAM's are separated in different banks,
>>>> one for each page table level, the P.A. CAM is for all entries in
>>>> all banks.
>>>> This extra P.A. CAM will have a lot of entries and therefore be slow.
>>>
>>> That is the TWA.
>
>> No, not the Table Walk Accelerator. I'm thinking the PA CAM would
>> only need to be accessed for cache line invalidates. However it would be
>> very inconvenient if the TLB CAMs had faster access time for virtual
>> address lookups than for physical address lookups, so the access time
>> would be the longer of the two, that being PA.
>
>       VA                             PA
>       |                              |
>       V                              V
> +-----------+  +-----------+-+  +-----------+
> |  VA CAM   |->|   PTEs    |v|<-|  PA CAM   |
> +-----------+  +-----------+-+  +-----------+

Of course, but for a 5 or 6 level page table you'd have a CAM bank
for each level to search in parallel. The loading on the PA path
would be the same as if you had a CAM as large as the sum of all entries.

But as you point out below, this shouldn't be an issue because
it has little to do after the lookup.

>> Basically I'm suggesting the big PA CAM slows down VA translates
>> and therefore possibly all memory accesses.
>
> It is a completely independent and concurrent hunk of logic that only has
> access to the valid bit and can only clear the valid bit.

Yes, ok. The lookup on the PA path may take longer but there is
little to do on a hit so the total path length is shorter,
so PA invalidate won't be on the critical path for the MMU.

>>>> Also routing the Invalidate messages through the TLB could slow down
>>>> all
>>>> their ACK's messages even though there is very low probability of a hit
>>>> because page tables update relatively infrequently.
>>>
>>> TLBs don't ACK they self-invalidate. And they can be performing a
>>> translation
>>> while self-invalidating.
>
>> Hmmm... Danger Will Robinson. Most OS page table management depends on
>> synchronizing after the shootdowns complete on all affected cores.
>
>> The basic safe sequence is:
>> - acquire page table mutex
>> - modify PTE in memory for a VA
>
> Here you have obtained write permission on the line containing the PTE
> being modified. By the time you have obtained write permission, all
> other TLBs will have been invalidated.

It means you can't use the outer cache levels to filter invalidates.
You'd have to pass all invalidate messages from the coherence network
directly to the TLB PA, bypassing the cache hierarchy, to ensure the
TLB entry is removed before the cache ACK's the invalidate message.

>> - issue IPI's with VA to all cores that might have a copy in TLB
>> - invalidate local TLB for VA
>> - wait for IPI ACK's from remote cores
>> - release mutex
>
>> If it doesn't wait for shootdown ACKs then it might be possible for a
>> stale PTE copy to exist and be used after the mutex is released.
>
> Race condition does not exist. By the time the core modifying the PTE
> obtains write permission, all the TLBs have been cleared of that entry.

Ok
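For the general rabble without coherent TLBs, the basic safe sequence above
can be sketched as a toy sequential C model (all names invented; the IPI and
ACK machinery is collapsed into direct calls, so the "wait for ACKs" step is
just a comment here):

```c
#include <stdbool.h>
#include <stdint.h>

#define NCORES 4

/* tlb_has[c] models "core c still caches the old PTE for this VA". */
static bool     tlb_has[NCORES];
static uint64_t pte_mem;        /* the PTE in memory */
static bool     mutex_held;     /* page table mutex  */

void shootdown(uint64_t new_pte)
{
    mutex_held = true;                  /* acquire page table mutex       */
    pte_mem = new_pte;                  /* modify PTE in memory for a VA  */
    for (int c = 1; c < NCORES; c++)    /* IPI each core that might have  */
        tlb_has[c] = false;             /*   a copy; handler invalidates  */
    tlb_has[0] = false;                 /* invalidate local TLB for VA    */
    /* real code waits here for the IPI ACKs from the remote cores */
    mutex_held = false;                 /* release mutex: no stale copies */
}
```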

>>>> Also the L2-TLB's, called the STLB for Second-level TLB by Intel,
>>>> are set assoc., and would have to be virtually indexed and virtually
>>>> tagged with both VA and ASID plus table level to select address mask.
>>>> On Skylake the STLB for 4k/2M pages is 128-rows*12-way, 1G is
>>>> 4-rows*4-way.
>>>
>>> All TLB walks are performed with RealPA.
>>> All Snoops are performed with RealPA
>>>
>>>> How can My66000 look up STLB entries by invalidate physical line
>>>> address?
>>>> It would have to scan all 128 rows for each message.
>>>
>>> It is not structured like Intel L2 TLB.
>
>> Are you saying the My66000 STLB is physically indexed, physically tagged?
>> How's this work for a bottom-up table walk (aka your TWA)?
>
> L2 TLB is a different structure (SRAM) than TWAs (CAM).
> I can't talk about it:: as Ivan used to say:: NYF.

Rats.

>> The only way I know to do a bottom-up walk is to use the portion of the
>> VA for the higher index level to get the PA of the page table page.
>
> I <actually> did not say I did a bottom-up walk. I said I short-circuited
> the table walks for those layers I have recent translation PTPs. It's more
> like CAM the Root to the last PTP and every CAM that hits is one layer
> you don't have to access.

What I call a bottom-up walk can be performed in parallel, serial,
or a bit of both, across the banks for each page table level.

I'd have a VA TLB lookup in parallel for page levels 1, 2 and 3 (4K, 2M, 1G),
and if all three miss then do sequential lookups for levels 4, 5, 6.
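A toy C model of those per-level banks (all names invented; x86-style 4K
pages with 9 VA bits per level): the lookup tag just gets 9 bits shorter at
each level up, and the lowest bank that hits is where the walk can resume,
skipping the levels above it.

```c
#include <stdbool.h>
#include <stdint.h>

#define LEVELS 6    /* level 1 = 4K leaf, 2 = 2M, 3 = 1G, ... */
#define WAYS   4

static uint64_t bank_tag[LEVELS + 1][WAYS];
static bool     bank_valid[LEVELS + 1][WAYS];

/* The VA mask is shorter for each level: one 9-bit field drops off. */
static uint64_t level_tag(uint64_t va, int level)
{
    return va >> (12 + 9 * (level - 1));
}

static bool bank_hit(int level, uint64_t tag)
{
    for (int w = 0; w < WAYS; w++)
        if (bank_valid[level][w] && bank_tag[level][w] == tag)
            return true;
    return false;
}

void bank_fill(int level, uint64_t va)
{
    bank_tag[level][0]   = level_tag(va, level);  /* dumb way-0 replacement */
    bank_valid[level][0] = true;
}

/* Search every bank (hardware would do this in parallel); the lowest
   level that hits tells us which page table page to read next.
   0 means a full top-down walk from the root is needed. */
int first_cached_level(uint64_t va)
{
    for (int level = 1; level <= LEVELS; level++)
        if (bank_hit(level, level_tag(va, level)))
            return level;
    return 0;
}
```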

>> That requires lookup by a masked portion of the VA with the ASID.
>> The bottom-up walk is done by making the VA mask shorter for each level.
>> This implies a Virtually Indexed Virtually Tagged PTE cache.
>
>> The VIVT PTE cache implies that certain TLB VA or ASID invalidates
>> require
>> a full STLB table scan which could be up to 128 clocks for that
>> instruction.

Re: Page tables and TLBs [was Interrupts on the Concertina II]

https://news.novabbs.org/devel/article-flat.php?id=37145&group=comp.arch#37145

From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Page tables and TLBs [was Interrupts on the Concertina II]
Date: Sun, 28 Jan 2024 21:10:51 +0000
Organization: Rocksolid Light
Message-ID: <3c8d15fc7456a98de2d3379d143d83aa@www.novabbs.org>

EricP wrote:

> MitchAlsup1 wrote:
>> EricP wrote:
>>
>>> MitchAlsup1 wrote:
>>>> EricP wrote:
>>>>
>>>>> MitchAlsup1 wrote:
>>>>>> EricP wrote:
>>>>>>
>>>>>>> MitchAlsup1 wrote:
>>>>>>>> EricP wrote:
>>>>>>>>
>>>>>>>>> One accomplishes the same effect by caching the interior PTE nodes
>>>>>>>>> for each of the HV and GuestOS tables separately on the downward
>>>>>>>>> walk,
>>>>>>>>> and hold the combined nested table mapping in the TLB.
>>>>>>>>> The bottom-up table walkers on each interior PTE cache should
>>>>>>>>> eliminate 98% of the PTE reads with none of the headaches.
>>>>>>>>
>>>>>>>> I call these things:: TableWalk Accelerators.
>>>>>>>>
>>>>>>>> Given CAMs at your access, one can cache the outer layers and short
>>>>>>>> circuit most of the MMU accesses--such that you don't siply read
>>>>>>>> the Accelerator RAM 25 times (two 5-level tables), you CAM down both
>>>>>>>> GuestOS and HV tables so only walk the parts not in your CAM. {And
>>>>>>>> them put them in your CAM.} A Density trick is for each CAM to have
>>>>>>>> access to a whole cache line of PTEs (8 in my case).
>>>>>>
>>>>>>> An idea I had here was to allow the OS more explicit control
>>>>>>> for the invalidates of the interior nodes caches.
>>>>>>
>>>>>> The interior nodes, stored in the CAM, retain their physical
>>>>>> address, and
>>>>>> are snooped, so no invalidation is required. ANY write to them is
>>>>>> seen and
>>>>>> the entry invalidates itself.
>>>>
>>>>> On My66000, but other cores don't have automatically coherent TLB's.
>>>>> This feature is intended for that general rabble.
>>>>
>>>>> Just to play devil's advocate...
>>>>
>>>>> To snoop page table updates My66000 TLB would need a large CAM with all
>>>>> the physical addresses of the PTE's source cache lines parallel to the
>>>>> virtual and ASID CAM's, and route the cache line invalidates through
>>>>> it.
>>>>
>>>> Yes, .....
>>>>
>>>>> While the virtual index CAM's are separated in different banks,
>>>>> one for each page table level, the P.A. CAM is for all entries in
>>>>> all banks.
>>>>> This extra P.A. CAM will have a lot of entries and therefore be slow.
>>>>
>>>> That is the TWA.
>>
>>> No, not the Table Walk Accelerator. I'm thinking the PA CAM would
>>> only need to be accessed for cache line invalidates. However it would be
>>> very inconvenient if the TLB CAMs had faster access time for virtual
>>> address lookups than for physical address lookups, so the access time
>>> would be the longer of the two, that being PA.
>>
>>       VA                                PA
>>       |                                 |
>>       V                                 V
>> +-----------+  +-----------+-+  +-----------+
>> |  VA CAM   |->|   PTEs    |v|<-|  PA CAM   |
>> +-----------+  +-----------+-+  +-----------+

> Of course, but for a 5 or 6 level page table you'd have a CAM bank
> for each level to search in parallel. The loading on the PA path
> would be the same as if you had a single CAM as large as the sum
> of all entries.

What you see above is the TLB
What your above paragraph talks about is what I call the TWA.

> But as you point out below, this shouldn't be an issue because
> it has little to do after the lookup.

Basically only wait for SNOOPs and for TLB misses.

>>> Basically I'm suggesting the big PA CAM slows down VA translates
>>> and therefore possibly all memory accesses.
>>
>> It is a completely independent and concurrent hunk of logic that only has
>> access to the valid bit and can only clear the valid bit.

> Yes, ok. The lookup on the PA path may take longer but there is
> little to do on a hit so the total path length is shorter,
> so PA invalidate won't be on the critical path for the MMU.

>>>>> Also routing the Invalidate messages through the TLB could slow down
>>>>> all
>>>>> their ACK's messages even though there is very low probability of a hit
>>>>> because page tables update relatively infrequently.
>>>>
>>>> TLBs don't ACK they self-invalidate. And they can be performing a
>>>> translation
>>>> while self-invalidating.
>>
>>> Hmmm... Danger Will Robinson. Most OS page table management depends on
>>> synchronizing after the shootdowns complete on all affected cores.
>>
>>> The basic safe sequence is:
>>> - acquire page table mutex
>>> - modify PTE in memory for a VA
>>
>> Here you have obtained write permission on the line containing the PTE
>> being modified. By the time you have obtained write permission, all
>> other TLBs will have been invalidated.

> It means you can't use the outer cache levels to filter invalidates.
> You'd have to pass all invalidate messages from the coherence network
> directly to the TLB PA, bypassing the cache hierarchy, to ensure the
> TLB entry is removed before the cache ACK's the invalidate message.

With my exclusive cache, I have to do that anyway. With wider-than-register
accesses I am already in a position where I have BW for these SNOOPs with
little overhead on either channel.

>>> - issue IPI's with VA to all cores that might have a copy in TLB
>>> - invalidate local TLB for VA
>>> - wait for IPI ACK's from remote cores
>>> - release mutex
>>
>>> If it doesn't wait for shootdown ACKs then it might be possible for a
>>> stale PTE copy to exist and be used after the mutex is released.
>>
>> Race condition does not exist. By the time the core modifying the PTE
>> obtains write permission, all the TLBs have been cleared of that entry.

> Ok

>>>>> Also the L2-TLB's, called the STLB for Second-level TLB by Intel,
>>>>> are set assoc., and would have to be virtually indexed and virtually
>>>>> tagged with both VA and ASID plus table level to select address mask.
>>>>> On Skylake the STLB for 4k/2M pages is 128-rows*12-way, 1G is
>>>>> 4-rows*4-way.
>>>>
>>>> All TLB walks are performed with RealPA.
>>>> All Snoops are performed with RealPA
>>>>
>>>>> How can My66000 look up STLB entries by invalidate physical line
>>>>> address?
>>>>> It would have to scan all 128 rows for each message.
>>>>
>>>> It is not structured like Intel L2 TLB.
>>
>>> Are you saying the My66000 STLB is physically indexed, physically tagged?
>>> How's this work for a bottom-up table walk (aka your TWA)?
>>
>> L2 TLB is a different structure (SRAM) than TWAs (CAM).
>> I can't talk about it:: as Ivan used to say:: NYF.

> Rats.
