From: paaronclayton@gmail.com (Paul A. Clayton)
Newsgroups: comp.arch
Subject: Re: MMU Musings
Date: Tue, 20 Jun 2023 13:13:03 -0400
Organization: A noiseless patient Spider
Message-ID: <u6smmv$2gaa7$3@dont-email.me>
References: <364aaeda-907d-4187-b2b3-3c6238b4ad9en@googlegroups.com>
<4inNL.1443912$iU59.593781@fx14.iad>
<2023Mar6.184902@mips.complang.tuwien.ac.at>
<d730e7ef-403d-4384-b040-f4d89261b4a9n@googlegroups.com>
<f8c5f02f-2e12-43cb-84d5-cb698c7339adn@googlegroups.com>
<asINL.59989$qpNc.22714@fx03.iad>
<2023Mar7.183226@mips.complang.tuwien.ac.at>
<83b9b080-499a-4399-819b-0ada4b64900dn@googlegroups.com>
<vf2OL.325070$PXw7.116446@fx45.iad> <tuadhs$1022s$1@dont-email.me>
<v03OL.194116$0dpc.84619@fx33.iad> <tuaf5o$10el0$1@dont-email.me>
<tuam45$11tc3$1@dont-email.me>
<c55560e4-d5a6-48e6-b30d-3291f96cfa43n@googlegroups.com>
<dd5d5917-c0c9-449a-85ca-b15f024f3fb2n@googlegroups.com>
In-Reply-To: <dd5d5917-c0c9-449a-85ca-b15f024f3fb2n@googlegroups.com>

On 3/9/23 2:28 PM, MitchAlsup wrote:
> On Wednesday, March 8, 2023 at 4:28:05 PM UTC-6, robf...@gmail.com wrote:
> <
>> I like the 64kB page size.
> <
> As always there are benefits and detriments to this size of page.
> There is a never ending tension between smaller page sizes so
> processes/threads can share stuff more efficiently;

This assumes that permissions are tied to translation units. Even
without capabilities, this need not be the case. A global address
space as proposed for the Mill would naturally defer translation
(and possibly also make cross-application sharing easier, though
perhaps at the cost of side channels; any resource sharing seems
susceptible to side channels). Even if permissions used the same
granularity as translation, the two could be stored (or cached)
separately: permissions are needed at the time of access, while
translation is only needed to route the request to the appropriate
memory controller (which seems to raise some interesting issues of
its own).
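
As a minimal sketch of that decoupling (entirely hypothetical
structures and granularities, not the Mill's actual design):
permissions could be checked near the core against range-based
entries on every access, while translation to a physical address
and a memory-controller route is looked up only when the request
leaves the core.

/* Hypothetical sketch: permission checking decoupled from translation
 * in a global address space. Structures, granularities, and the
 * linear searches are illustrative only. */
#include <stdbool.h>
#include <stdint.h>

#define XLAT_GRANULE_BITS 16   /* 64 KiB translation granules (assumed) */

typedef struct { uint64_t base, limit; uint8_t rwx; } perm_entry;
typedef struct { uint64_t global_page, phys_page; uint8_t mc_id; } xlat_entry;

/* Checked near the agent/core on every access. */
bool permission_ok(const perm_entry *pt, int n, uint64_t gaddr, uint8_t need)
{
    for (int i = 0; i < n; i++)
        if (gaddr >= pt[i].base && gaddr < pt[i].limit)
            return (pt[i].rwx & need) == need;
    return false;
}

/* Consulted only to route the request to the right memory controller. */
bool translate(const xlat_entry *tlb, int n, uint64_t gaddr,
               uint64_t *paddr, uint8_t *mc_id)
{
    uint64_t gpage = gaddr >> XLAT_GRANULE_BITS;
    for (int i = 0; i < n; i++)
        if (tlb[i].global_page == gpage) {
            *paddr = (tlb[i].phys_page << XLAT_GRANULE_BITS)
                   | (gaddr & ((1ULL << XLAT_GRANULE_BITS) - 1));
            *mc_id = tlb[i].mc_id;
            return true;
        }
    return false;
}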

With separate caching storage near the agent/core, a different
granularity may make sense. Permissions may also favor merging
into "superpages" even more than 4 KiB pages do for translations.
(AMD produced a TLB that merges a cache line of PTEs into a single
TLB entry if the translations form a superpage; this "requires"
high associativity, full in AMD's case, since non-coalesced
pages in a virtual chunk would share the same index.)
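
Roughly, the coalescing check looks something like the sketch
below (the general technique, not AMD's exact logic; the eight-PTE
line and the field names are my assumptions): if a cache line's
worth of PTEs maps an aligned, physically contiguous run with
identical permissions, one TLB entry can cover all of them.

/* Sketch of PTE coalescing: if one cache line's worth of PTEs maps a
 * contiguous, aligned run with identical permissions, fold them into
 * a single TLB entry covering the whole run. Illustrative only. */
#include <stdbool.h>
#include <stdint.h>

#define PTES_PER_LINE 8        /* 64-byte line / 8-byte PTE */

typedef struct { uint64_t pfn; bool valid; uint8_t perms; } pte_t;

/* Returns true if pte[0..7] can share one coalesced TLB entry: all
 * valid, same permissions, pfn[i] == pfn[0] + i, and pfn[0] aligned
 * to the eight-page boundary. */
bool coalescible(const pte_t pte[PTES_PER_LINE])
{
    if (!pte[0].valid || (pte[0].pfn % PTES_PER_LINE) != 0)
        return false;
    for (int i = 1; i < PTES_PER_LINE; i++)
        if (!pte[i].valid ||
            pte[i].perms != pte[0].perms ||
            pte[i].pfn   != pte[0].pfn + i)
            return false;
    return true;
}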

(At extreme page sizes or with sparse allocation, internal
fragmentation could also be a problem. Smaller processes, perhaps
for greater resilience/fault containment, would also seem to
encourage smaller permission granules. In-memory compression would
seem to help with memory consumption in this case [unused/zeroed
memory would compress well] but introduces other overheads.)

> then there is
> the larger page sizes have better TLB performance:: but you also
> have the effect, when paging, that larger pages have greater
> swap-in and swap-out latencies.

When smaller pages are effectively used as larger pages,
coalescing becomes possible. There are also tricks like supporting
holes in translation (Chang Hyun Park et al., "Perforated Page:
Supporting Fragmented Memory Allocation for Large Pages", 2020) or
mapping a large page to a shadow memory address which is
translated at the memory controller (Mark Swanson, Leigh Stoller,
and John Carter, "Increasing TLB Reach Using Superpages Backed by
Shadow Memory").

(Reducing the cost of moving pages would seem to also facilitate
defragmentation.)

Even with some sparsity and non-contiguous pages, large virtual
address spaces could still use larger pages to reduce the number
of translation layers. With 5-bit translation levels (with 8-byte
PTEs, 256-byte page table pages), two levels could be merged to
the more ordinary 10-bit translation level (8 KiB page table
page), three levels could be merged to 15-bit (256 KiB), etc. With
10-bit translation layers, there would be less opportunity, but if
a process will be using a large virtual address region (8 GiB)
somewhat densely, then using a 2 MiB translation page to skip a
layer might be useful. (Smaller translation layers also provide
more diverse node-sized pages, though this complicates the choice
of translation-level page size. On the other hand, converting a
256-byte translation layer to an 8 KiB layer would not seem to be
that expensive. Smaller layers do leave fewer spare bits in the
entries that point to them, since the smaller alignment provides
fewer implicit low-order bits.)
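
For reference, the arithmetic behind those sizes, assuming 8-byte
PTEs throughout as above (a 2 MiB table page corresponds to an
18-bit level):

/* Page-table-page size for a given index width, with 8-byte PTEs:
 * size = 2^bits * 8 bytes. */
#include <stdio.h>

int main(void)
{
    int widths[] = { 5, 10, 15, 18 };
    for (int i = 0; i < 4; i++) {
        unsigned long long bytes = (1ULL << widths[i]) * 8;
        printf("%2d-bit level -> %llu-byte table page\n", widths[i], bytes);
    }
    return 0;   /* 256 B, 8 KiB, 256 KiB, 2 MiB */
}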

(Obviously, nested page tables for virtualization are likely to be
friendlier to large pages.)

> So, if a 4KB page is read/written by I/O device in 1ms, the 64KB
> page will take 16ms. Sooner or later all this extra latency will add
> up..........For exactly the same reasons base-bounds translation
> systems changed as they went from 19-bit address spaces to 32-bit
> address spaces. One could read/write a 18-bit process in a few
> milliseconds, whereas swapping a 30-bit VA could take over 1 minute,
> 1 minute where the CPU was doing nothing !!! because 1 30-bit VaS
> application was being swapped out to make room for another 30-bit
> VaS application in a system with just 31-bit of PaS.

This also seems to get into other considerations such as the file
system interface. An mmap-oriented operating system would seem to
have tighter binding between I/O size and page size.
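
Working through the numbers in the quoted text (taking the stated
4 KB per millisecond, about 4 MB/s, as the transfer rate; the rate
itself is just the quoted example, not a claim about any real
device):

/* Swap time at the quoted rate of 4 KB per millisecond (~4 MB/s):
 * time = bytes / rate. */
#include <stdio.h>

int main(void)
{
    double rate = 4096.0 / 0.001;                 /* bytes per second */
    unsigned long long sizes[] = { 1ULL << 12,    /* 4 KB page        */
                                   1ULL << 16,    /* 64 KB page       */
                                   1ULL << 18,    /* 18-bit process   */
                                   1ULL << 30 };  /* 30-bit VA space  */
    for (int i = 0; i < 4; i++)
        printf("%10llu bytes: %8.3f s\n", sizes[i], sizes[i] / rate);
    return 0;   /* 0.001 s, 0.016 s, 0.064 s, ~262 s (over 4 minutes) */
}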

Flash drives seem to have interestingly different tradeoffs in
that the erase granule is larger than the write granule. (Shingled
disk drives also seem to have interesting tradeoffs.) I have not
looked at these closely, but such factors would seem to influence
the ideal I/O size.

(Batch scheduling with long-running processes would also seem to
be more tolerant of swapping latency. Overlaying techniques might
also reduce the overhead, as a sort of intermediate between full
process swapping and fully abstracted paging; perhaps a process
could assign values to various memory regions and the scheduler
could choose which "segments" to swap out. Such a crazy design
might also make scheduling choices not just by priority but by
throughput impact, perhaps letting a lower-priority program that
is light on memory and storage I/O run because it can start
earlier and will not interfere while more memory is swapped out in
preparation for a larger, higher-priority program.)

> This is the other side of the tension. More short I/O events versus
> fewer longer I/O events. {In any event, paging won over swapping.}
> <
>> Higher order bits are gained in the PTEs
>> allowing
> <
> All sorts of schemes to make memory management easier/better.
> <
> For example, consider a PTP containing a ASID for the space being
> mapped. So, ASID ceases to be attached at the hip to a thread/process
> but becomes an identifier as to which addresses spaces are being
> shared. Thus, everybody sharing space X, uses ASID[X], optimizing
> sharing in all the HW facilities used to perform mapping.

HP PA-RISC had "Access IDs" as part of the PTEs, which had to
match one of eight "Protection ID Registers". (Itanium had similar
"Protection Keys" but with a minimum of 16 registers and the
ability for a register to disable reads and executes in addition
to the write disable provided by PA-RISC.) In theory one could
also define "global" key values that always match and so do not
require a register entry (perhaps I am remembering this from it
actually being implemented). With no disable bits, such globals
would provide low-cost ID registers; alternatively, modifiable
disable bits could still be provided, so that the savings was
"only" in the ID match, perhaps by checking that all but three
bits are zero and using those three bits to index an 8-entry table
of disable bits.
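
A sketch of that register-match scheme (the eight registers and
Itanium-style disable bits follow the text; the ID width, the
"values below 8 are global" convention, and all names are my
assumptions, not the actual PA-RISC or Itanium encoding):

/* Sketch of an access-ID check against a small set of protection-ID
 * registers, with per-register disable bits (Itanium-style). The
 * "global key" convention below is purely an assumption. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_PID_REGS 8

enum { DIS_READ = 1, DIS_WRITE = 2, DIS_EXEC = 4 };

typedef struct { uint32_t id; uint8_t disable; bool valid; } pid_reg;

bool access_permitted(const pid_reg regs[NUM_PID_REGS],
                      uint32_t pte_access_id, uint8_t op /* one DIS_* bit */)
{
    /* Hypothetical "global" keys: values below 8 always match; their
       low three bits could instead index a shared 8-entry disable-bit
       table (not modeled here). */
    if (pte_access_id < 8)
        return true;
    for (int i = 0; i < NUM_PID_REGS; i++)
        if (regs[i].valid && regs[i].id == pte_access_id)
            return (regs[i].disable & op) == 0;
    return false;
}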

This is one way of facilitating single address space systems.
