Re: Whither the Mill?

<ullb5v$2j2v6$2@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35808&group=comp.arch#35808

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: bohannonindustriesllc@gmail.com (BGB-Alt)
Newsgroups: comp.arch
Subject: Re: Whither the Mill?
Date: Sat, 16 Dec 2023 17:17:19 -0600
Organization: A noiseless patient Spider
Lines: 9
Message-ID: <ullb5v$2j2v6$2@dont-email.me>
References: <ulclu3$3sglk$1@dont-email.me>
<gp1pni5t4vfqjsp81fogoboeoqe5hrj5pv@4ax.com> <uli82g$217dj$1@dont-email.me>
<a54ae908ce5af533e638e112833b35ea@news.novabbs.com>
<JA4fN.38899$xHn7.23180@fx14.iad>
<2695abc72966c220809e5c6690a8edf6@news.novabbs.com>
<ZP5fN.58208$83n7.3029@fx18.iad>
<ea8c8a6be398fa64936d2da4efc2ca71@news.novabbs.com>
<LQmfN.5865$zqTf.4843@fx35.iad> <ull9he$2j2v6$1@dont-email.me>
<73e64061bc14cec73e1e94cadc65cb79@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 16 Dec 2023 23:17:19 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="556d2ee551e934ab8897fd3234d80ebd";
logging-data="2722790"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+3mTRm/1taFD3cmaHo/s4zWRjHhgT1gAA="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:9HnoGqwlwLpdIFGImcEfcLasbBE=
Content-Language: en-US
In-Reply-To: <73e64061bc14cec73e1e94cadc65cb79@news.novabbs.com>

by: BGB-Alt - Sat, 16 Dec 2023 23:17 UTC

On 12/16/2023 5:01 PM, MitchAlsup wrote:
> BGB-Alt wrote:
>
> Why did you acquire an alt ?? Ego perhaps ??

This account is for when I am posting from my machine shop...
It is registered to a different email address, is a different account, ...

On Wed, 13 Dec 2023 08:25:39 -0800, Stephen Fuld wrote:

> Anyway, has all development stopped? Or is their "sweat equity" model
> still going on?

I've checked the Mill web site, and Ivan Godard last posted to the forums
there just five days ago. So I can only assume that all is well, but
perhaps he has entered a phase of work on the Mill that is keeping him
busy. Which would seem to be good news.

John Savard

mitchalsup@aol.com (MitchAlsup) writes:
>Scott Lurndal wrote:
>
>> mitchalsup@aol.com (MitchAlsup) writes:
>>>Scott Lurndal wrote:
>>>
>>>> mitchalsup@aol.com (MitchAlsup) writes:
>>>>>Scott Lurndal wrote:
>>>>>
>>>>>> mitchalsup@aol.com (MitchAlsup) writes:
>>>>>>>BGB wrote:
>>>>>
>>>>>>>> For FPGA's over $1k, almost makes more sense to ignore that they exist
>>>>>>>> (also this appears to be around the cutoff point for the free version of
>>>>>>>> Vivado as well; but one would have thought Xilinx would have already
>>>>>>>> gotten their money by someone having bought the FPGA?...).
>>>>>
>>>>>> For anyone serious, an verif engineer can cost $500-1000/day. The FPGA
>>>>>> cost is in the noise.
>>>>>
>>>>>> For a hobby? Well...
>>>>>
>>>>>
>>>>>>>> If the compiler is kept smaller, it is faster to recompile from source.
>>>>>>>
>>>>>>>In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
>>>>>>>10,000 lines of code per second for an IBM-like minicomputer (less decimal
>>>>>>>and string) and did a pretty good job of spitting out high performance
>>>>>>>code; on a machine with a 150ns cycle time.
>>>>>
>>>>>> As did our COBOL compiler (which ran in 50KB). But in both cases,
>>>>>> the languages were far simpler and much easier to generate efficient
>>>>>> code than languages like Modula, Pascal, C, et alia.
>>>>>
>>>>>>>> Though, within moderate limits, 1M lines would basically be enough to fit:
>>>>>>>> A basic kernel;
>>>>>>>> (this excludes the Linux kernel, which is well over the size limit).
>>>>>>>
>>>>>>>If there were an efficient way to run the device driver sack in user-mode
>>>>>>>without privilege and only the MMI/O pages this driver can touch mapped
>>>>>>>into his VAS. Poof none of the driver stack is in the kernel. --IF--
>>>>>
>>>>>> That's actually quite common and one of the raison d'etre of the
>>>>>> PCI Express SR-IOV feature. When you can present a virtual
>>>>>> function to the user directly (mapping the MMIO region into
>>>>>> the user mode virtual address space) the app had direct access
>>>>>> to the hardware. Interrupts are the only tricky part, and
>>>>>> the kernel virtio subsystem, which interfaces with the user
>>>>>> application via shared memory provides interrupt handling
>>>>>> to the application.
>>>>>
>>>>>> An I/OMMU provides memory protection for DMA operations initiated
>>>>>> by the virtual function ensuring it only accesses the application
>>>>>> virtual address space.
>>>>>
>>>>>Why should device be able to access user VaS outside of the buffer the
>>>>>user provided, OH so long ago ??
>>>
>>>
>>>> Because the device wants to do DMA directly into or from the users
>>>> virtual address space. Bulk transfer, not MMIO accesses.
>>>
>>>OK, I will ask the question in the contrapositive way::
>>>If the user ask device to read into a buffer, why does the device get
>>>to see everything of the user's space along with that buffer ?
>
>> It doesn't, necessarily. The IOMMU translation table is a
>> proper subset of the user's virtual address space. The
>> application tells the kernel which portions of the address
>> space are valid DMA regions for the device to access.
>
>
>Which is my point !! you only want the device to see that <small> subset
>of the requesting application--not the whole address space. Done right
>the device can still use the application virtual address, but the device
>is not allowed to access stuff not associated with the request at hand
>right now.

I thought I made that clear from the start.

>
>For example, you are a large entity and and Chinese disk drives are way
>less expensive than non-Chinese; so you buy some. Would you let those
>disk drives access anything in some requestors address space--no, you
>would only allow that device to access the user supplied buffer and
>whatever page rounding up that transpires.

So far as I know there are no Chinese disk drives that support
SR-IOV.
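
The restriction argued over above (the device sees only the regions the application registered for DMA, rounded out to whole pages, and nothing else in the address space) can be sketched as a toy model. This is only an illustration: `ToyIommu`, `DmaWindow`, `page_round` and the 4 KiB page size are hypothetical names and assumptions, not any real IOMMU interface.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size for this sketch


@dataclass(frozen=True)
class DmaWindow:
    """One region the application registered as valid for device DMA."""
    start: int       # page-aligned virtual address
    length: int      # bytes, a multiple of PAGE_SIZE
    writable: bool   # may the device write into this region?


def page_round(addr, length):
    """Round a byte buffer out to whole pages; returns (start, size)."""
    start = addr - (addr % PAGE_SIZE)
    end = -(-(addr + length) // PAGE_SIZE) * PAGE_SIZE  # ceiling to a page
    return start, end - start


class ToyIommu:
    """Holds only the windows registered for one device."""

    def __init__(self):
        self.windows = []

    def map_buffer(self, addr, length, writable):
        start, size = page_round(addr, length)
        self.windows.append(DmaWindow(start, size, writable))

    def unmap_all(self):
        self.windows.clear()

    def check(self, addr, length, is_write):
        """True only if the access lies inside one registered window."""
        for w in self.windows:
            if w.start <= addr and addr + length <= w.start + w.length:
                return w.writable or not is_write
        return False
```

In this model a DMA to an address outside every registered window is refused, even though the same address may be perfectly valid for the application itself.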

BGB-Alt <bohannonindustriesllc@gmail.com> writes:
>On 12/16/2023 1:25 PM, EricP wrote:
>> MitchAlsup wrote:
>>> Scott Lurndal wrote:
>>>
>>>> mitchalsup@aol.com (MitchAlsup) writes:
>>>>> Scott Lurndal wrote:
>>>>>
>>>>>> mitchalsup@aol.com (MitchAlsup) writes:
>>>>>>> BGB wrote:
>>>>>
>>>>>>>> For FPGA's over $1k, almost makes more sense to ignore that they
>>>>>>>> exist (also this appears to be around the cutoff point for the
>>>>>>>> free version of Vivado as well; but one would have thought Xilinx
>>>>>>>> would have already gotten their money by someone having bought
>>>>>>>> the FPGA?...).
>>>>>
>>>>>> For anyone serious, an verif engineer can cost $500-1000/day.   The
>>>>>> FPGA
>>>>>> cost is in the noise.
>>>>>
>>>>>> For a hobby? Well...
>>>>>
>>>>>
>>>>>>>> If the compiler is kept smaller, it is faster to recompile from
>>>>>>>> source.
>>>>>>>
>>>>>>> In 1979 I joined a company with a FORTRAN mostly-77- that compiled
>>>>>>> at 10,000 lines of code per second for an IBM-like minicomputer
>>>>>>> (less decimal and string) and did a pretty good job of spitting
>>>>>>> out high performance
>>>>>>> code; on a machine with a 150ns cycle time.
>>>>>
>>>>>> As did our COBOL compiler (which ran in 50KB). But in both cases,
>>>>>> the languages were far simpler and much easier to generate efficient
>>>>>> code than languages like Modula, Pascal, C, et alia.
>>>>>
>>>>>>>> Though, within moderate limits, 1M lines would basically be
>>>>>>>> enough to fit:
>>>>>>>>    A basic kernel;
>>>>>>>>      (this excludes the Linux kernel, which is well over the size
>>>>>>>> limit).
>>>>>>>
>>>>>>> If there were an efficient way to run the device driver sack in
>>>>>>> user-mode
>>>>>>> without privilege and only the MMI/O pages this driver can touch
>>>>>>> mapped
>>>>>>> into his VAS. Poof none of the driver stack is in the kernel. --IF--
>>>>>
>>>>>> That's actually quite common and one of the raison d'etre of the
>>>>>> PCI Express SR-IOV feature.    When you can present a virtual
>>>>>> function to the user directly (mapping the MMIO region into
>>>>>> the user mode virtual address space) the app had direct access
>>>>>> to the hardware.    Interrupts are the only tricky part, and
>>>>>> the kernel virtio subsystem, which interfaces with the user
>>>>>> application via shared memory provides interrupt handling
>>>>>> to the application.
>>>>>
>>>>>> An I/OMMU provides memory protection for DMA operations initiated
>>>>>> by the virtual function ensuring it only accesses the application
>>>>>> virtual address space.
>>>>>
>>>>> Why should device be able to access user VaS outside of the buffer
>>>>> the user provided, OH so long ago ??
>>>
>>>
>>>> Because the device wants to do DMA directly into or from the users
>>>> virtual address space.   Bulk transfer, not MMIO accesses.
>>>
>>> OK, I will ask the question in the contrapositive way::
>>> If the user ask device to read into a buffer, why does the device get
>>> to see everything of the user's space along with that buffer ?
>>>
>>> The way you write you are assuming the device can write into the
>>> user's code space when he ask for a read from one of his buffers !?!
>>>
>>> You _could_ give device translations to anything and everything
>>> in user space, but this seems excessive when the user only wants
>>> the device to read/write small area inside his VaS.
>>>
>>> OS code already has to manipulate PTE entries or MMU tables so
>>> the device can write read-only and execute-only pages along with
>>> removing write-permission on a page with data inbound from a device.
>>
>> The OS can't remove the page RW access for a user mode page while an
>> IO device is DMA writing the page, if that's what you meant,
>> as the DMA-in may be writing to a smaller buffer within a larger page.
>> It is perfectly normal for a thread to continue to work in buffer
>> bytes adjacent to the one currently involved in an async IO.
>>
>
>One thing I don't get here is why there would be direct DMA between
>userland and the device (at least for filesystem and similar).

https://www.dpdk.org/
https://opendataplane.org/

Are two very common use cases for usermode drivers.

>
>Like, say, for a filesystem, it is presumably:
> read syscall from user to OS;
> route this to the corresponding VFS driver;
> Requests spanning multiple blocks being broken up into parts;
> VFS driver checks the block-cache / buffer-cache;
> If found, copy from cache into user-space;
> If not found, send request to the underlying block device;
> Wait for response (and/or reschedule task for later);
> Copy result back into userland.

No, it would be for the user mode application to access
disk/ssd/nvme blocks directly and impose whatever structure on those
blocks that it wishes. No OS intervention at all, DMA directly
into userspace instead of bouncing through kernel.

The NVMe controllers use a command ring, and when virtualized,
each VF provides a command ring directly to the user mode
application - the application can insert commands (read, write,
erase, etc) into the ring, write to the doorbell register,
and wait for completion by polling or waiting for a virtio
interrupt.
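
A minimal sketch of that submit / doorbell / poll cycle, as a pure software model. Everything here is an assumption for illustration: `ToyQueuePair` and its methods are hypothetical, the "device" is a method call rather than hardware, and the real NVMe queue entry layout is not modelled.

```python
from collections import deque


class ToyQueuePair:
    """Toy submission ring plus completion queue for one virtual function."""

    def __init__(self, depth):
        self.depth = depth
        self.sq = [None] * depth  # submission ring shared with the device
        self.sq_tail = 0          # next free slot, advanced by the application
        self.sq_head = 0          # next slot to consume, advanced by the device
        self.doorbell = 0         # last tail value the device was told about
        self.cq = deque()         # completions posted by the device

    def submit(self, opcode, lba, nblocks):
        """Place a command in the ring; the device does not see it yet."""
        nxt = (self.sq_tail + 1) % self.depth
        if nxt == self.sq_head:
            raise BufferError("submission ring full")
        cid = self.sq_tail        # slot index doubles as command id here
        self.sq[cid] = (cid, opcode, lba, nblocks)
        self.sq_tail = nxt
        return cid

    def ring_doorbell(self):
        """Stand-in for the MMIO write that publishes the new tail."""
        self.doorbell = self.sq_tail

    def device_step(self):
        """Stand-in for hardware: consume commands up to the doorbell."""
        while self.sq_head != self.doorbell:
            cid, opcode, lba, nblocks = self.sq[self.sq_head]
            self.sq_head = (self.sq_head + 1) % self.depth
            self.cq.append((cid, "ok"))

    def poll(self):
        """Return one completion, or None if nothing has finished."""
        return self.cq.popleft() if self.cq else None
```

Note that commands sit in the ring unseen until the doorbell is written, which is why several can be batched behind a single doorbell write.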

Again the application is just reading blocks and interpreting
them any way it wishes (e.g. for a database application
which doesn't need a filesystem).

On 12/16/2023 4:56 PM, Thomas Koenig wrote:
> BGB <cr88192@gmail.com> wrote:
>> On 12/16/2023 12:04 PM, moi wrote:
>>> On 16/12/2023 07:22, Niklas Holsti wrote:
>>>> On 2023-12-16 0:39, Scott Lurndal wrote:
>>>>> mitchalsup@aol.com (MitchAlsup) writes:
>>>>
>>>> [snip]
>>>>
>>>>>> In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
>>>>>> 10,000 lines of code per second for an IBM-like minicomputer (less
>>>>>> decimal
>>>>>> and string) and did a pretty good job of spitting out high performance
>>>>>> code; on a machine with a 150ns cycle time.
>>>>>
>>>>> As did our COBOL compiler (which ran in 50KB).
>>>>
>>>>
>>>> Are you both sure that those numbers are really lines per *second*?
>>>> They seem improbably high, and compilation speeds in those years used
>>>> to be stated in lines per *minute*.
>>>>
>>>
>>> Almost certainly per minute.
>>> I worked on a compiler in 1975 that ran on the most powerful ICL 1900.
>>> It achieved 20K cards per minute and was considered to be very fast.
>>>
>>
>> Lines per minute seems to make sense.
>>
>>
>> Modern PC's are orders of magnitude faster, but still don't have
>> "instant" compile times by any means.
>>
>> Could be faster though, but would likely need languages other than C or
>> (especially) C++.
>
> I assume you never worked with Turbo Pascal.
>
> That was amazing. It compiled code so fast that it was never a
> bother, to wait for it, even on a 8088 IBM PC running at 4.7 MHz.
> The first version I ever used, 3.0 (?) compiled from memory to
> memory, so even slow I/O (to floppy disc, at the time) was not
> an issue.
>

Yeah, I mostly missed out on that era.

Didn't get much into computers until I was in the "late single digits"
age range, and by this point the world was mostly 386 and 486 PC's
running Windows 3.x and similar.

Seemingly, Pascal was already "mostly dead" by this point.

When I started messing with programming in elementary school:
First was QBasic, but other than this I was also messing around with
TurboC. Not long after (when the world migrated to Win95) had jumped
over to Cygwin.

During middle and high-school, mostly during the Win98 era, mostly used
Cygwin and MinGW. Though, I was weird, and mostly ended up running
WinNT4 and Win2K (and dual booting with Linux) rather than Win9X.

Then later jumped over to MSVC / Visual Studio for native windows
programs while taking college classes.

Though, part of the jump was because, at this point, Visual Studio had
become basically freeware; and Visual Studio had a much better debugger
(gdb kinda sucks...).

Still, much time has passed for me, and in a fairly short time I will
cross over into having existed for 4 decades.

> This was made possible by using a streamlined one-pass compiler. It
> didn't do much optimization, but when the alternative was BASIC, the
> generated code was still extremely fast by comparision.
>

I remember QBasic.

Didn't take long to start to see the limitations...

> There were a few drawbacks. The biggest one was that programming errors
> tended to freeze the machine. Another (not so important) was that,
> if you were one of the lucky people to have an 80x87 coprocessor, the
> generated code did not check for overflow of the coprocessor stack.

OK.

For most of my life, x87 had been built into the CPU.

According to Thomas Koenig <tkoenig@netcologne.de>:
>> Modern PC's are orders of magnitude faster, but still don't have
>> "instant" compile times by any means.
>>
>> Could be faster though, but would likely need languages other than C or
>> (especially) C++.
>
>I assume you never worked with Turbo Pascal.
>
>That was amazing. It compiled code so fast that it was never a
>bother, to wait for it, even on a 8088 IBM PC running at 4.7 MHz.

Back around 1970 the Dartmouth Time-Sharing System (DTSS) ran on a GE
635, which was about the same performance as the original PDP-10 and a
front end Datanet 30 which had about the compute power of a modern
toaster. By clever system design they made it support 100 users, and
the response time was really good. The time from when you typed RUN to
when your program was compiled and started running was too fast to
notice.

It was a real time-sharing system that supported multiple languages,
not just BASIC, and the languages were all compiled, not interpreted.
The compilers were so fast that for years they never bothered to write
a linker, since you could just compile all the source code for your
routines together. (They finally wrote a linker when they added PL/I.)

--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

On Fri 15 Dec 2023 at 13:05, BGB <cr88192@gmail.com> wrote:

> Also, it would be nice to have a basically usable OS and core software
> stack in under 1M lines.
>
> Say, by not trying to be everything to everyone, and limiting how much
> is allowed in the core OS (or is allowed within the build process for
> the core OS).
>
> Though, within moderate limits, 1M lines would basically be enough to fit:
> A basic kernel;
> (this excludes the Linux kernel, which is well over the size limit).
> A (moderate sized) C compiler;
> (but not GCC, which is also well over this size limit).
> A shell+utils comparable to BusyBox;
> Various core OS libraries and similar, etc.
>
> For this, will assume an at least nominally POSIX like environment.
>
> Programs that run on the OS would not be counted in the line-count budget.

Have you had a look at plan9 yet?

'Andreas

BGB-Alt wrote:
> On 12/16/2023 1:25 PM, EricP wrote:
>> MitchAlsup wrote:
>>> Scott Lurndal wrote:
>>>
>>>> mitchalsup@aol.com (MitchAlsup) writes:
>>>>> Scott Lurndal wrote:
>>>>>
>>>>>> mitchalsup@aol.com (MitchAlsup) writes:
>>>>>>> BGB wrote:
>>>>>
>>>>>>>> For FPGA's over $1k, almost makes more sense to ignore that they
>>>>>>>> exist (also this appears to be around the cutoff point for the
>>>>>>>> free version of Vivado as well; but one would have thought
>>>>>>>> Xilinx would have already gotten their money by someone having
>>>>>>>> bought the FPGA?...).
>>>>>
>>>>>> For anyone serious, an verif engineer can cost $500-1000/day.
>>>>>> The FPGA
>>>>>> cost is in the noise.
>>>>>
>>>>>> For a hobby? Well...
>>>>>
>>>>>
>>>>>>>> If the compiler is kept smaller, it is faster to recompile from
>>>>>>>> source.
>>>>>>>
>>>>>>> In 1979 I joined a company with a FORTRAN mostly-77- that
>>>>>>> compiled at 10,000 lines of code per second for an IBM-like
>>>>>>> minicomputer (less decimal and string) and did a pretty good job
>>>>>>> of spitting out high performance
>>>>>>> code; on a machine with a 150ns cycle time.
>>>>>
>>>>>> As did our COBOL compiler (which ran in 50KB). But in both cases,
>>>>>> the languages were far simpler and much easier to generate efficient
>>>>>> code than languages like Modula, Pascal, C, et alia.
>>>>>
>>>>>>>> Though, within moderate limits, 1M lines would basically be
>>>>>>>> enough to fit:
>>>>>>>> A basic kernel;
>>>>>>>> (this excludes the Linux kernel, which is well over the
>>>>>>>> size limit).
>>>>>>>
>>>>>>> If there were an efficient way to run the device driver sack in
>>>>>>> user-mode
>>>>>>> without privilege and only the MMI/O pages this driver can touch
>>>>>>> mapped
>>>>>>> into his VAS. Poof none of the driver stack is in the kernel.
>>>>>>> --IF--
>>>>>
>>>>>> That's actually quite common and one of the raison d'etre of the
>>>>>> PCI Express SR-IOV feature. When you can present a virtual
>>>>>> function to the user directly (mapping the MMIO region into
>>>>>> the user mode virtual address space) the app had direct access
>>>>>> to the hardware. Interrupts are the only tricky part, and
>>>>>> the kernel virtio subsystem, which interfaces with the user
>>>>>> application via shared memory provides interrupt handling
>>>>>> to the application.
>>>>>
>>>>>> An I/OMMU provides memory protection for DMA operations initiated
>>>>>> by the virtual function ensuring it only accesses the application
>>>>>> virtual address space.
>>>>>
>>>>> Why should device be able to access user VaS outside of the buffer
>>>>> the user provided, OH so long ago ??
>>>
>>>
>>>> Because the device wants to do DMA directly into or from the users
>>>> virtual address space. Bulk transfer, not MMIO accesses.
>>>
>>> OK, I will ask the question in the contrapositive way::
>>> If the user ask device to read into a buffer, why does the device get
>>> to see everything of the user's space along with that buffer ?
>>>
>>> The way you write you are assuming the device can write into the
>>> user's code space when he ask for a read from one of his buffers !?!
>>>
>>> You _could_ give device translations to anything and everything
>>> in user space, but this seems excessive when the user only wants
>>> the device to read/write small area inside his VaS.
>>>
>>> OS code already has to manipulate PTE entries or MMU tables so
>>> the device can write read-only and execute-only pages along with
>>> removing write-permission on a page with data inbound from a device.
>>
>> The OS can't remove the page RW access for a user mode page while an
>> IO device is DMA writing the page, if that's what you meant,
>> as the DMA-in may be writing to a smaller buffer within a larger page.
>> It is perfectly normal for a thread to continue to work in buffer
>> bytes adjacent to the one currently involved in an async IO.
>>
>
> One thing I don't get here is why there would be direct DMA between
> userland and the device (at least for filesystem and similar).

Zero-copy IO. That has always been available on WinNT provided hardware
supports it. General byte-buffer IO could always do zero-copy DMA,
with HW support. For files one can do IO direct to a user buffer with
certain restrictions: buffers must match the file block size and alignment.
I haven't checked, but I'm guessing that if the file block is already in the
file cache it gets copied, otherwise it DMAs directly to/from the user buffer.
Normally one wants cached file blocks but there are times when one doesn't
and wants the more optimal direct buffer IO (eg, a video player).

There is also scatter-gather IO, intended for network cards,
where the IO is a list of byte sized and aligned virtual buffers.

This all interacts with DMA and page management because the physical
page frames that contain the bytes must be pinned in memory for the
duration of the DMA IO. A single virtual buffer becomes a list of
physical fragments, so a scatter-gather list becomes a list of lists
of physical byte buffer fragments, called a Memory Descriptor List (MDL)
in Windows.
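
As a rough illustration of how one virtual buffer turns into a list of physical fragments: the sketch below walks a buffer page by page and merges physically adjacent pieces. The `page_table` dict (virtual page number to physical frame number), the 4 KiB page size and the function names are assumptions for this example; this is not the actual Windows MDL layout.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


def build_fragments(page_table, vaddr, length):
    """Split one virtual buffer into (physical_address, length) fragments."""
    frags = []
    end = vaddr + length
    while vaddr < end:
        vpn, off = divmod(vaddr, PAGE_SIZE)
        chunk = min(PAGE_SIZE - off, end - vaddr)
        paddr = page_table[vpn] * PAGE_SIZE + off
        if frags and frags[-1][0] + frags[-1][1] == paddr:
            # Physically contiguous with the previous piece: extend it.
            frags[-1] = (frags[-1][0], frags[-1][1] + chunk)
        else:
            frags.append((paddr, chunk))
        vaddr += chunk
    return frags


def build_sg_fragments(page_table, sg_list):
    """A scatter-gather list of buffers becomes a list of fragment lists."""
    return [build_fragments(page_table, v, n) for v, n in sg_list]
```

The pages named in `page_table` would have to stay pinned for as long as the device holds these physical addresses.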

And then SR-IOV adds virtual machines to the mix, where a guest OS
physical address becomes a hypervisor guest virtual address,
and not only are guest buffers in guest user space, but the guest OS
MDL's are themselves in hypervisor virtual space and require their own
hypervisor MDL's (lists of lists of lists of fragments).

>
> Like, say, for a filesystem, it is presumably:
> read syscall from user to OS;
> route this to the corresponding VFS driver;
> Requests spanning multiple blocks being broken up into parts;
> VFS driver checks the block-cache / buffer-cache;
> If found, copy from cache into user-space;
> If not found, send request to the underlying block device;
> Wait for response (and/or reschedule task for later);
> Copy result back into userland.

Yes, pretty much (there is page management, quota management).
Except if I request a direct IO it DMA's direct to/from the user buffer,
if hardware supports that.
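
The two paths can be put side by side in a toy model: a cached read that splits the request into blocks and copies out of a block cache, and a direct read that bypasses the cache but insists on block-aligned offset and length. `ToyBlockDevice`, `ToyVfs` and the 512-byte block size are hypothetical, and the "DMA" here is just a Python copy.

```python
BLOCK_SIZE = 512  # assumed block size for this sketch


class ToyBlockDevice:
    """Stand-in for the underlying block device; counts block reads."""

    def __init__(self, data):
        self.data = data
        self.reads = 0

    def read_block(self, n):
        self.reads += 1
        off = n * BLOCK_SIZE
        return self.data[off:off + BLOCK_SIZE].ljust(BLOCK_SIZE, b"\0")


class ToyVfs:
    def __init__(self, dev):
        self.dev = dev
        self.cache = {}  # block number -> bytes

    def _get_block(self, n):
        blk = self.cache.get(n)
        if blk is None:              # miss: go to the device, then cache it
            blk = self.dev.read_block(n)
            self.cache[n] = blk
        return blk

    def read(self, offset, length):
        """Cached path: break the request into blocks, copy out of cache."""
        out = bytearray()
        end = offset + length
        while offset < end:
            n, skip = divmod(offset, BLOCK_SIZE)
            take = min(BLOCK_SIZE - skip, end - offset)
            out += self._get_block(n)[skip:skip + take]
            offset += take
        return bytes(out)

    def read_direct(self, offset, length):
        """Direct path: no cache, but block-aligned requests only."""
        if offset % BLOCK_SIZE or length % BLOCK_SIZE:
            raise ValueError("direct IO needs block-aligned offset and length")
        first = offset // BLOCK_SIZE
        last = (offset + length) // BLOCK_SIZE
        return b"".join(self.dev.read_block(n) for n in range(first, last))
```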

> Though, it may make sense that if a request isn't available immediately,
> and there is some sort of DMA mechanism, the OS could block the task and
> then resume it once the data becomes available. For polling IO, doesn't
> likely make much difference as the CPU is basically stuck in a busy loop
> either way until the IO finishes.

Yes, that's DMA resource management. Basically each system has a certain
number of scatter-gather IO mappers, now implemented by the IOMMU page table.
Each IO queues a request for its mappers, and the DMA resource manager doles
out a set of IO mapping registers, which may be fewer than you requested,
in which case you break up your IO into multiple requests.
Then you program the scatter-gather map using info from the IO's MDL,
pass the mapped IO space addresses to the device, and Bob's your uncle.
When the IO completes, your driver tears down its IO map and releases
the mapping registers to the next waiting IO.
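
A small sketch of that allocate / split / release cycle, assuming one mapping register per page. `MapRegisterPool` and `run_io` are hypothetical names, and where a real driver would queue and wait for registers this toy simply raises.

```python
class MapRegisterPool:
    """Toy pool of scatter-gather mapping registers (one per page)."""

    def __init__(self, total):
        self.free = total

    def allocate(self, wanted):
        """Grant up to `wanted` registers; may grant fewer, or none."""
        granted = min(wanted, self.free)
        self.free -= granted
        return granted

    def release(self, count):
        self.free += count


def run_io(pool, pages_needed):
    """Carry out one IO, split into sub-requests if the pool is short.

    Returns the size in pages of each sub-request that was issued.
    """
    chunks = []
    remaining = pages_needed
    while remaining:
        granted = pool.allocate(remaining)
        if granted == 0:
            # A real driver would queue here until registers are released.
            raise RuntimeError("no mapping registers free")
        # Program the map, start the DMA, wait for completion (elided).
        chunks.append(granted)
        remaining -= granted
        pool.release(granted)  # tear down the map for the next waiting IO
    return chunks
```

With a pool of 8 registers, a 20-page transfer goes out as sub-requests of 8, 8 and 4 pages in this model.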

> Though, could make sense for hardware accelerating pixel-copying
> operations for a GUI.

On Windows the GUI is managed completely differently.
I'm not familiar enough with the details to comment other than to say
it is executed as privileged subroutines by the calling thread but in
supervisor mode, which allows it direct access to the calling virtual space.

EricP <ThatWouldBeTelling@thevillage.com> writes:
>BGB-Alt wrote:
>> On 12/16/2023 1:25 PM, EricP wrote:
>>> MitchAlsup wrote:
>>>> Scott Lurndal wrote:
>>>>
>>>>> mitchalsup@aol.com (MitchAlsup) writes:
>>>>>> Scott Lurndal wrote:
>>>>>>

>>
>> One thing I don't get here is why there would be direct DMA between
>> userland and the device (at least for filesystem and similar).
>
>Zero-copy IO. That has always been available on WinNT provided hardware
>supports it. General byte-buffer IO could always do zero-copy DMA,
>with HW support. For files one can do IO direct to a user buffer with
>certain restrictions, buffers must be file block size and alignment.
>I haven't checked but guessing that if the file block is already in file
>cache it gets copied, otherwise it DMA's directly to/from the user buffer.
>Normally one wants cached file blocks but there are times when one doesn't
>and wants the more optimal direct buffer IO (eg, a video player).
>
>There is also scatter-gather IO, intended for network cards,
>where the IO is a list of byte sized and aligned virtual buffers.
>
>The all interacts with DMA and page management because the physical
>page frames that contain the bytes must be pinned in memory for the
>duration of the DMA IO.

PCI express has an optional feature, PRI (Page Request Interface)
that allows the hardware to request that a page be 'pinned' just
for the duration of a DMA operation. The ARM64 server base system
architecture document requires that the host support PRI. This
works in conjunction with PCIe ATS (Address Translation Services)
which allows the endpoint device to ask the host for translations
and cache them in the endpoint so the endpoint can use physical
addresses directly. This is usually implemented by the IOMMU
on the host treating the endpoint as if it had a remote TLB cache.
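
The "remote TLB" idea can be shown with a toy model: the endpoint asks the host for a translation once, caches it, and the host can later invalidate the cached entry. `ToyHostIommu` and `ToyEndpoint` are hypothetical names; the real ATS and PRI message formats are not modelled, and the page-request step is only marked by a comment.

```python
class ToyHostIommu:
    """Host side: owns the translation table and answers requests."""

    def __init__(self, table):
        self.table = dict(table)  # I/O virtual page -> physical frame
        self.requests = 0         # how many translation requests arrived

    def translate(self, vpn):
        self.requests += 1
        return self.table.get(vpn)  # None means no translation exists


class ToyEndpoint:
    """Device side: keeps a small cache of translations it has been given."""

    def __init__(self, iommu):
        self.iommu = iommu
        self.atc = {}  # device-side translation cache

    def lookup(self, vpn):
        if vpn not in self.atc:
            pfn = self.iommu.translate(vpn)
            if pfn is None:
                # A PRI-capable device could ask the host for the page here.
                return None
            self.atc[vpn] = pfn
        return self.atc[vpn]

    def invalidate(self, vpn):
        """Host-initiated: drop a cached entry before the mapping changes."""
        self.atc.pop(vpn, None)
```

The point of the cache is that repeated DMA to the same page costs one request to the host, not one per access.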

> A single virtual buffer becomes a list of
>physical fragments, so a scatter-gather list becomes a list of lists
>of physical byte buffer fragments, called a Memory Descriptor List (MDL)
>in Windows.
>
>And then SR-IOV adds virtual machines to the mix,

Not necessarily just virtual machines - it's also used
to expose the virtual function to user mode code in
a bare metal (or virtualized) operating system.

On 12/17/2023 3:23 AM, Andreas Eder wrote:
> On Fri 15 Dec 2023 at 13:05, BGB <cr88192@gmail.com> wrote:
>
>> Also, it would be nice to have a basically usable OS and core software
>> stack in under 1M lines.
>>
>> Say, by not trying to be everything to everyone, and limiting how much
>> is allowed in the core OS (or is allowed within the build process for
>> the core OS).
>>
>> Though, within moderate limits, 1M lines would basically be enough to fit:
>> A basic kernel;
>> (this excludes the Linux kernel, which is well over the size limit).
>> A (moderate sized) C compiler;
>> (but not GCC, which is also well over this size limit).
>> A shell+utils comparable to BusyBox;
>> Various core OS libraries and similar, etc.
>>
>> For this, will assume an at least nominally POSIX like environment.
>>
>> Programs that run on the OS would not be counted in the line-count budget.
>
> Have you had a look at plan9 yet?

Fwiw, for some damn reason this makes me think about plan9 from some
posts way back on comp.programming.threads. I need to find some time to
dig them up. Here is one that mentioned it:

https://groups.google.com/g/comp.programming.threads/c/nyrEJDt8FvM/m/uZUcQcnWPLQJ

Re: Whither the Mill?

<ulo5t2$363ke$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35853&group=comp.arch#35853

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Whither the Mill?
Date: Sun, 17 Dec 2023 19:05:33 -0600
Organization: A noiseless patient Spider
Lines: 314
Message-ID: <ulo5t2$363ke$1@dont-email.me>
References: <ulclu3$3sglk$1@dont-email.me>
<gp1pni5t4vfqjsp81fogoboeoqe5hrj5pv@4ax.com> <uli82g$217dj$1@dont-email.me>
<a54ae908ce5af533e638e112833b35ea@news.novabbs.com>
<JA4fN.38899$xHn7.23180@fx14.iad>
<2695abc72966c220809e5c6690a8edf6@news.novabbs.com>
<ZP5fN.58208$83n7.3029@fx18.iad>
<ea8c8a6be398fa64936d2da4efc2ca71@news.novabbs.com>
<LQmfN.5865$zqTf.4843@fx35.iad> <ull9he$2j2v6$1@dont-email.me>
<mTGfN.47607$yEgf.35565@fx09.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 18 Dec 2023 01:05:38 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="9c1491685217355571768dff831df90e";
logging-data="3346062"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18qXHRw2r/xR6MZJt6W4iJV"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:65H6HTppQ7rE7Y37WFaqhjFldvc=
Content-Language: en-US
In-Reply-To: <mTGfN.47607$yEgf.35565@fx09.iad>

by: BGB - Mon, 18 Dec 2023 01:05 UTC

On 12/17/2023 12:12 PM, EricP wrote:
> BGB-Alt wrote:
>> On 12/16/2023 1:25 PM, EricP wrote:
>>> MitchAlsup wrote:
>>>> Scott Lurndal wrote:
>>>>
>>>>> mitchalsup@aol.com (MitchAlsup) writes:
>>>>>> Scott Lurndal wrote:
>>>>>>
>>>>>>> mitchalsup@aol.com (MitchAlsup) writes:
>>>>>>>> BGB wrote:
>>>>>>
>>>>>>>>> For FPGA's over $1k, almost makes more sense to ignore that
>>>>>>>>> they exist (also this appears to be around the cutoff point for
>>>>>>>>> the free version of Vivado as well; but one would have thought
>>>>>>>>> Xilinx would have already gotten their money by someone having
>>>>>>>>> bought the FPGA?...).
>>>>>>
>>>>>>> For anyone serious, an verif engineer can cost $500-1000/day. The
>>>>>>> FPGA
>>>>>>> cost is in the noise.
>>>>>>
>>>>>>> For a hobby? Well...
>>>>>>
>>>>>>
>>>>>>>>> If the compiler is kept smaller, it is faster to recompile from
>>>>>>>>> source.
>>>>>>>>
>>>>>>>> In 1979 I joined a company with a FORTRAN mostly-77- that
>>>>>>>> compiled at 10,000 lines of code per second for an IBM-like
>>>>>>>> minicomputer (less decimal and string) and did a pretty good job
>>>>>>>> of spitting out high performance
>>>>>>>> code; on a machine with a 150ns cycle time.
>>>>>>
>>>>>>> As did our COBOL compiler (which ran in 50KB). But in both cases,
>>>>>>> the languages were far simpler and much easier to generate efficient
>>>>>>> code than languages like Modula, Pascal, C, et alia.
>>>>>>
>>>>>>>>> Though, within moderate limits, 1M lines would basically be
>>>>>>>>> enough to fit:
>>>>>>>>>    A basic kernel;
>>>>>>>>>      (this excludes the Linux kernel, which is well over the
>>>>>>>>> size limit).
>>>>>>>>
>>>>>>>> If there were an efficient way to run the device driver sack in
>>>>>>>> user-mode
>>>>>>>> without privilege and only the MMI/O pages this driver can touch
>>>>>>>> mapped
>>>>>>>> into his VAS. Poof none of the driver stack is in the kernel.
>>>>>>>> --IF--
>>>>>>
>>>>>>> That's actually quite common and one of the raison d'etre of the
>>>>>>> PCI Express SR-IOV feature.    When you can present a virtual
>>>>>>> function to the user directly (mapping the MMIO region into
>>>>>>> the user mode virtual address space) the app had direct access
>>>>>>> to the hardware.    Interrupts are the only tricky part, and
>>>>>>> the kernel virtio subsystem, which interfaces with the user
>>>>>>> application via shared memory provides interrupt handling
>>>>>>> to the application.
>>>>>>
>>>>>>> An I/OMMU provides memory protection for DMA operations initiated
>>>>>>> by the virtual function ensuring it only accesses the application
>>>>>>> virtual address space.
>>>>>>
>>>>>> Why should device be able to access user VaS outside of the buffer
>>>>>> the user provided, OH so long ago ??
>>>>
>>>>
>>>>> Because the device wants to do DMA directly into or from the users
>>>>> virtual address space.   Bulk transfer, not MMIO accesses.
>>>>
>>>> OK, I will ask the question in the contrapositive way::
>>>> If the user asks the device to read into a buffer, why does the device get
>>>> to see everything of the user's space along with that buffer ?
>>>>
>>>> The way you write you are assuming the device can write into the
>>>> user's code space when he asks for a read from one of his buffers !?!
>>>>
>>>> You _could_ give device translations to anything and everything
>>>> in user space, but this seems excessive when the user only wants
>>>> the device to read/write small area inside his VaS.
>>>>
>>>> OS code already has to manipulate PTE entries or MMU tables so
>>>> the device can write read-only and execute-only pages along with
>>>> removing write-permission on a page with data inbound from a device.
>>>
>>> The OS can't remove the page RW access for a user mode page while an
>>> IO device is DMA writing the page, if that's what you meant,
>>> as the DMA-in may be writing to a smaller buffer within a larger page.
>>> It is perfectly normal for a thread to continue to work in buffer
>>> bytes adjacent to the one currently involved in an async IO.
>>>
>>
>> One thing I don't get here is why there would be direct DMA between
>> userland and the device (at least for filesystem and similar).
>
> Zero-copy IO. That has always been available on WinNT provided hardware
> supports it. General byte-buffer IO could always do zero-copy DMA,
> with HW support. For files one can do IO direct to a user buffer with
> certain restrictions, buffers must be file block size and alignment.
> I haven't checked but guessing that if the file block is already in file
> cache it gets copied, otherwise it DMA's directly to/from the user buffer.
> Normally one wants cached file blocks but there are times when one doesn't
> and wants the more optimal direct buffer IO (eg, a video player).
>

OK.

Nothing like this in my case, only buffered IO.

Currently, the buffering is managed by the filesystem driver rather than
the block-device.

So, say, reading/writing the SDcard is normally unbuffered, but the FAT
driver will keep a cache of previously accessed clusters and similar. It
might make sense to move this into a more general-purpose mechanism though.

For FAT though, there may be wonk in that (AFAIK) there is no strict
requirement that the start of the data area be aligned to the cluster
size (so, say, one could potentially have a volume with 32K clusters
aligned on a 2K boundary). Well, unless this is disallowed and I missed it.
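
A minimal sketch of the alignment question above, assuming the usual FAT
layout (reserved sectors, then the FATs, then the FAT12/16 root directory,
then the data area). The function names and the sample volume are
hypothetical; the numbers are picked so that the data area lands on a 2K
boundary but not on a 32K cluster boundary.

```python
# Sketch: does a FAT volume's data area start on a cluster-size boundary?
# Parameter names loosely follow the usual BPB fields; values are made up.

def first_data_sector(reserved_sectors, num_fats, sectors_per_fat,
                      root_entries, bytes_per_sector):
    # The fixed root directory uses 32 bytes per entry, rounded up to whole
    # sectors. On FAT32 root_entries is 0, so this term vanishes.
    root_dir_sectors = (root_entries * 32 + bytes_per_sector - 1) // bytes_per_sector
    return reserved_sectors + num_fats * sectors_per_fat + root_dir_sectors


def cluster_to_sector(cluster, data_start, sectors_per_cluster):
    # Cluster numbering starts at 2, so cluster 2 is the first data cluster.
    return data_start + (cluster - 2) * sectors_per_cluster


def data_area_is_cluster_aligned(data_start, sectors_per_cluster):
    return data_start % sectors_per_cluster == 0


# Hypothetical volume: 512-byte sectors, 64 sectors (32K) per cluster.
start = first_data_sector(reserved_sectors=36, num_fats=2, sectors_per_fat=1000,
                          root_entries=0, bytes_per_sector=512)
print(start, data_area_is_cluster_aligned(start, 64))
```

With a layout like this, every cluster straddles two 32K-aligned regions of
the underlying device, which is what makes a cluster-granular cache in a
general-purpose block layer awkward.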

If I were designing my own filesystem, I would probably have done some
things differently. Though, my ideas didn't really look like EXTn either.

Had previously considered something that would have looked like
something partway between EXT2 and a somewhat simplified NTFS, but not
done much here as it would make a lot of hassle on the Windows side of
things.

Mostly would want a few features that seem a bit lacking in FAT.

Though, did recently discover the existence of the "Projected
FileSystem" API in Windows, which allows the possibility of implementing
custom user-mode filesystems on Windows (sorta; it is a bit wonky).

This does open / re-open some possibilities.

> There is also scatter-gather IO, intended for network cards,
> where the IO is a list of byte sized and aligned virtual buffers.
>
> This all interacts with DMA and page management because the physical
> page frames that contain the bytes must be pinned in memory for the
> duration of the DMA IO. A single virtual buffer becomes a list of
> physical fragments, so a scatter-gather list becomes a list of lists
> of physical byte buffer fragments, called a Memory Descriptor List (MDL)
> in Windows.
>
> And then SR-IOV adds virtual machines to the mix, where a guest OS
> physical address becomes a hypervisor guest virtual address,
> and not only are guest buffers in guest user space, but the guest OS
> MDL's are themselves in hypervisor virtual space and require their own
> hypervisor MDL's (lists of lists of lists of fragments).
>

OK.

I can note that in my project, there is no DMA mechanism as of yet.
Pretty much everything is either MMIO mapped buffers or polling IO.

When I looked at a network card before (once, long ago), IIRC its design
was more like:
There were a pair of ring-buffers, for TX and RX;
One would write frames to the TX buffer, and update the pointers, and
the card would send them;
When a frame arrived, it would add it into the buffer, update the
pointers, and then raise an IRQ.
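
A rough sketch of that TX/RX ring arrangement, written from the description
above rather than from any particular card's datasheet. The class names, the
head/tail convention, and the callback standing in for the IRQ are all
assumptions.

```python
# Sketch of a NIC with a pair of ring buffers (hypothetical model).

class Ring:
    def __init__(self, slots):
        self.slots = [None] * slots
        self.head = 0   # next slot the producer will fill
        self.tail = 0   # next slot the consumer will drain

    def is_empty(self):
        return self.head == self.tail

    def is_full(self):
        # One slot is left unused so that full and empty are distinguishable.
        return (self.head + 1) % len(self.slots) == self.tail

    def push(self, frame):
        if self.is_full():
            return False
        self.slots[self.head] = frame
        self.head = (self.head + 1) % len(self.slots)
        return True

    def pop(self):
        if self.is_empty():
            return None
        frame = self.slots[self.tail]
        self.tail = (self.tail + 1) % len(self.slots)
        return frame


class ToyNic:
    def __init__(self, slots, on_irq):
        self.tx = Ring(slots)
        self.rx = Ring(slots)
        self.on_irq = on_irq   # stands in for raising an interrupt
        self.wire = []         # frames the card has "sent"

    def driver_send(self, frame):
        # Driver side: write the frame into the TX ring and bump the pointer.
        return self.tx.push(frame)

    def card_transmit(self):
        # Card side: drain whatever the driver has queued.
        while (frame := self.tx.pop()) is not None:
            self.wire.append(frame)

    def card_receive(self, frame):
        # Card side: store an arriving frame, then signal the driver.
        if self.rx.push(frame):
            self.on_irq()
```

The driver and the card each own one pointer per ring, so neither side needs
to lock the other out; a dropped frame on a full RX ring is simply not
signalled in this sketch.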

On 12/17/2023 5:23 AM, Andreas Eder wrote:
> On Fr 15 Dez 2023 at 13:05, BGB <cr88192@gmail.com> wrote:
>
>> Also, it would be nice to have a basically usable OS and core software
>> stack in under 1M lines.
>>
>> Say, by not trying to be everything to everyone, and limiting how much
>> is allowed in the core OS (or is allowed within the build process for
>> the core OS).
>>
>> Though, within moderate limits, 1M lines would basically be enough to fit:
>> A basic kernel;
>> (this excludes the Linux kernel, which is well over the size limit).
>> A (moderate sized) C compiler;
>> (but not GCC, which is also well over this size limit).
>> A shell+utils comparable to BusyBox;
>> Various core OS libraries and similar, etc.
>>
>> For this, will assume an at least nominally POSIX like environment.
>>
>> Programs that run on the OS would not be counted in the line-count budget.
>
> Have you had a look at plan9 yet?
>

Have heard of Plan9 before, never really looked at the code nor looked
much into it.

Was also aware of Minix, but what little I looked into it made it seem
fairly limited in some areas (though it seems to have changed things a
fair bit in more recent versions). Seems to be using the BSD userland.

But, yeah, for my project, it might make sense to find some sort of
userland I can port and use on top of TestKern, don't necessarily want
to write all of the userland myself.

Would have the functional limitation that I would need to be able to
build it with my compiler, which means basically "generic C only".

....

> 'Andreas

Re: Whither the Mill?

<ulocca$3amon$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35855&group=comp.arch#35855

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: paaronclayton@gmail.com (Paul A. Clayton)
Newsgroups: comp.arch
Subject: Re: Whither the Mill?
Date: Sun, 17 Dec 2023 21:56:07 -0500
Organization: A noiseless patient Spider
Lines: 68
Message-ID: <ulocca$3amon$1@dont-email.me>
References: <ulclu3$3sglk$1@dont-email.me>
<gp1pni5t4vfqjsp81fogoboeoqe5hrj5pv@4ax.com> <uli82g$217dj$1@dont-email.me>
<a54ae908ce5af533e638e112833b35ea@news.novabbs.com>
<JA4fN.38899$xHn7.23180@fx14.iad>
<2695abc72966c220809e5c6690a8edf6@news.novabbs.com>
<ZP5fN.58208$83n7.3029@fx18.iad>
<ea8c8a6be398fa64936d2da4efc2ca71@news.novabbs.com>
<LQmfN.5865$zqTf.4843@fx35.iad> <ull9he$2j2v6$1@dont-email.me>
<mTGfN.47607$yEgf.35565@fx09.iad> <tVHfN.35563$JLvf.23986@fx44.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 18 Dec 2023 02:56:10 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="cc336373cd61ea6cd9a20d66838d2ebd";
logging-data="3496727"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX180ZqjoMSEUic5xxylrfQZv0keSEG5nby4="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.0
Cancel-Lock: sha1:ULFkU4Mt8y8tnTA8LNLRUnCL1Bw=
In-Reply-To: <tVHfN.35563$JLvf.23986@fx44.iad>

by: Paul A. Clayton - Mon, 18 Dec 2023 02:56 UTC

On 12/17/23 2:24 PM, Scott Lurndal wrote:
> EricP <ThatWouldBeTelling@thevillage.com> writes:
[snip zero-copy and scatter-gather I/O]
>> This all interacts with DMA and page management because the physical
>> page frames that contain the bytes must be pinned in memory for the
>> duration of the DMA IO.
>
> PCI express has an optional feature, PRI (Page Request Interface)
> that allows the hardware to request that a page be 'pinned' just
> for the duration of a DMA operation. The ARM64 server base system
> architecture document requires that the host support PRI. This
> works in conjunction with PCIe ATS (Address Translation Services)
> which allows the endpoint device to ask the host for translations
> and cache them in the endpoint so the endpoint can use physical
> addresses directly. This is usually implemented by the IOMMU
> on the host treating the endpoint as if it had a remote TLB cache.

Interesting. I had proposed some years ago that rather than
pinning a physical page for I/O a page be provided when needed
from a free list (including that the data could be cached/buffered
with a virtual address tag).

The Mill's backless memory is similar, deferring physical memory
allocation until cache eviction using a free list (that is
refilled by a thread that is activated at low water mark).
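
As a toy illustration of a free list with a low-water mark (not a
description of how the Mill or any IOMMU implements it), the idea can be
sketched as below. The class name, the `reclaim` callback, and the
synchronous refill that stands in for waking a refill thread are all
assumptions.

```python
# Sketch: page frames handed out on demand from a free list that is topped
# up once it drops to a low-water mark (hypothetical model).

class PageFreeList:
    def __init__(self, first_frames, low_water, refill_to, reclaim):
        self.free = list(first_frames)
        self.low_water = low_water
        self.refill_to = refill_to
        self.reclaim = reclaim   # returns one reclaimed frame, or None
        self.refills = 0

    def _refill(self):
        # In a real system this would wake a thread; here it runs inline.
        self.refills += 1
        while len(self.free) < self.refill_to:
            frame = self.reclaim()
            if frame is None:
                break            # nothing left to reclaim right now
            self.free.append(frame)

    def allocate(self):
        if not self.free:
            return None          # the consumer would have to fault or retry
        frame = self.free.pop()
        if len(self.free) <= self.low_water:
            self._refill()
        return frame


spare = iter(range(100, 110))
fl = PageFreeList([1, 2, 3, 4], low_water=2, refill_to=4,
                  reclaim=lambda: next(spare, None))
print(fl.allocate(), fl.allocate(), fl.refills, fl.free)
```

The `None` return is the interesting case: it corresponds to the allocator
being outrun by demand, where the requester has to fault or retry.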

Thanks to search on Google Groups I found the message (dated Sep
28, 2010, 6:07:08 PM). I wrote:
-> Could a really smart IOTLB help with this? If the target of
-> a write is a virtual address, the IOTLB might translate it to
-> an IO Hub local memory address (and/or cache it, perhaps using
-> virtual tags). (It seems it might be useful to distinguish
-> between different purposes of non-cacheable storage. I would
-> guess that the DMA kind is primarily meant to avoid cache
-> pollution not ensure that side-effects occur.)
->
-> Along similar lines, I wondered if a smart IOTLB could be
-> used to make page-pinning only a 'kiss of cowpox' not a
-> 'Kiss of Death'. If an IOTLB could dynamically assign
-> pages from a free list, a huge number of virtual pages
-> could be 'locked'. (It would still be possible for a
-> write to page-fault--if the system software could not
-> provide pages to the free list fast enough to meet the
-> demand by the IOTLB--and read page-faults would be
-> possible; but software might be able to retry the IO
-> requests.)
->
-> (I also wonder if processor TLB COW support would be
-> worthwhile. Aside from COW, such might be used by a
-> user-level memory allocator to free and allocate pages.
-> A shared page free list might allow tighter memory
-> usage. [ISTR the BSD malloc tried to free pages back
-> to the OS. The above mechanism would simply put a
-> hardware managed buffer between the use memory
-> management and the OS.])
->
-> (Even further off-topic, could 'cache' pages be useful?
-> I.e., the software handles re-generation/fill and only
-> needs a low-overhead exception when the page has been
-> reclaimed for other uses. Rather than having system
-> software save and then restore the cache page, it
-> could just be dropped. Even if the cost of restoration
-> is greater than the cost of a save and restore, this
-> sort of caching could be a win if the probability of
-> reuse is low enough.)

The Google groups url:
https://groups.google.com/g/comp.arch/c/u7z9E-zvoPo/m/fmGM4_Ih7ywJ

Re: Whither the Mill?

<ulp9d0$3fgcv$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35859&group=comp.arch#35859

Path: i2pn2.org!i2pn.org!news.hispagatos.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: terje.mathisen@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Whither the Mill?
Date: Mon, 18 Dec 2023 12:11:27 +0100
Organization: A noiseless patient Spider
Lines: 47
Message-ID: <ulp9d0$3fgcv$1@dont-email.me>
References: <ulclu3$3sglk$1@dont-email.me>
<gp1pni5t4vfqjsp81fogoboeoqe5hrj5pv@4ax.com> <uli82g$217dj$1@dont-email.me>
<a54ae908ce5af533e638e112833b35ea@news.novabbs.com>
<JA4fN.38899$xHn7.23180@fx14.iad> <ku51hoFaf95U1@mid.individual.net>
<ku6760FivvvU1@mid.individual.net> <ulkr8e$2gtuu$1@dont-email.me>
<ull9ua$vm0s$1@newsreader4.netcologne.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 18 Dec 2023 11:11:28 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="f3c2ca2b59784eb76ba7accbb808f231";
logging-data="3654047"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+IiF/DX+s3PXEnaHu22zggRYH8dvqNT0gU5bWQ5l0fqg=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Firefox/91.0 SeaMonkey/2.53.18
Cancel-Lock: sha1:R9+Sjer2AkozqxfCeUPAUtgrzps=
In-Reply-To: <ull9ua$vm0s$1@newsreader4.netcologne.de>

by: Terje Mathisen - Mon, 18 Dec 2023 11:11 UTC

Thomas Koenig wrote:
> BGB <cr88192@gmail.com> schrieb:
>> Modern PC's are orders of magnitude faster, but still don't have
>> "instant" compile times by any means.
>>
>> Could be faster though, but would likely need languages other than C or
>> (especially) C++.
>
> I assume you never worked with Turbo Pascal.

I was going to bring up TP but you beat me to it. :-)
>
> That was amazing. It compiled code so fast that it was never a
> bother to wait for it, even on an 8088 IBM PC running at 4.7 MHz.
> The first version I ever used, 3.0 (?) compiled from memory to
> memory, so even slow I/O (to floppy disc, at the time) was not
> an issue.

TP1.0 was an executable which in ~37KB managed to fit an IDE,
compiler/linker/loader/debugger and RTL, and if you abstained from
getting human readable error messages you could save about 1.5KB.
>
> This was made possible by using a streamlined one-pass compiler. It
> didn't do much optimization, but when the alternative was BASIC, the
> generated code was still extremely fast by comparison.

That compiler had zero optimization, it was a pure pattern match->emit
code engine that would reload the same variable from RAM on every
statement, but as you said, still far faster than the alternatives.

When speed was an actual issue I would switch to (inline) assembler,
even though that was initially just a way to embed machine code directly
so I had to assemble it in DEBUG.
>
> There were a few drawbacks. The biggest one was that programming errors
> tended to freeze the machine. Another (not so important) was that,
> if you were one of the lucky people to have an 80x87 coprocessor, the
> generated code did not check for overflow of the coprocessor stack.
>
The fp code generated by TP would never overflow the 87 stack afair,
since it would do single operations and pop the results at once?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

On 12/18/2023 5:11 AM, Terje Mathisen wrote:
> Thomas Koenig wrote:
>> BGB <cr88192@gmail.com> schrieb:
>>> Modern PC's are orders of magnitude faster, but still don't have
>>> "instant" compile times by any means.
>>>
>>> Could be faster though, but would likely need languages other than C or
>>> (especially) C++.
>>
>> I assume you never worked with Turbo Pascal.
>
> I was going to bring up TP but you beat me to it. :-)
>>
>> That was amazing. It compiled code so fast that it was never a
>> bother to wait for it, even on an 8088 IBM PC running at 4.7 MHz.
>> The first version I ever used, 3.0 (?) compiled from memory to
>> memory, so even slow I/O (to floppy disc, at the time) was not
>> an issue.
>
> TP1.0 was an executable which in ~37KB managed to fit an IDE,
> compiler/linker/loader/debugger and RTL, and if you abstained from
> getting human readable error messages you could save about 1.5KB.

Yeah, in any case, a small compiler is possible.
And, we don't necessarily need some 10+ MLOC monstrosity to do so...

>>
>> This was made possible by using a streamlined one-pass compiler. It
>> didn't do much optimization, but when the alternative was BASIC, the
>> generated code was still extremely fast by comparison.
>
> That compiler had zero optimization, it was a pure pattern match->emit
> code engine that would reload the same variable from RAM on every
> statement, but as you said, still far faster than the alternatives.
>
> When speed was an actual issue I would switch to (inline) assembler,
> even though that was initially just a way to embed machine code directly
> so I had to assemble it in DEBUG.

Early in my SH/BJX1 project, BGBCC wasn't too far off:
Used R8..R14 for caching variables;
Would often move values into R4..R7 to operate on them.

The variable Load and Store operations would:
Do a MOV if value is in a register;
Load from memory otherwise, putting the value into a register.

With all variables being flushed at the end of a basic block (with any
dirty variables being written back to memory).

In my case, I had used a similar model in my JIT compilers (generally on
x86).

So, an ADD operation might look like:
MOV R8, R4
MOV R9, R5
ADD R5, R4
MOV R4, R10

I then switched to a model which was more like:
Get Var1 as a register for Read;
Get Var2 as a register for Read;
Get Var3 as a register for Write;
Do the operation;
Release Var1, Var2, Var3.

Which could avoid needing a bunch of extra MOV's and similar.
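
A small sketch of the get-for-read / get-for-write model, to show where the
extra MOVs go away. This is not BGBCC's code: the class, the LOAD/STORE
pseudo-mnemonics, and the three-operand "ADD src1, src2, dst" form are
assumptions made for illustration, and there is no spilling when registers
run out.

```python
# Sketch: per-basic-block register cache that operates on variables in place.

class RegCache:
    def __init__(self, regs):
        self.free = list(regs)   # registers not yet bound to a variable
        self.reg_of = {}         # variable name -> register holding it
        self.dirty = set()       # variables written since the last flush
        self.code = []           # emitted pseudo-instructions

    def get_read(self, var):
        # Reuse the register if the variable is already resident,
        # otherwise load it from memory into a fresh register.
        if var not in self.reg_of:
            reg = self.free.pop(0)
            self.reg_of[var] = reg
            self.code.append(f"LOAD {var}, {reg}")
        return self.reg_of[var]

    def get_write(self, var):
        # A variable that is only written needs a register but no load.
        if var not in self.reg_of:
            self.reg_of[var] = self.free.pop(0)
        self.dirty.add(var)
        return self.reg_of[var]

    def flush(self):
        # End of basic block: write back dirty variables, free everything.
        for var in sorted(self.dirty):
            self.code.append(f"STORE {self.reg_of[var]}, {var}")
        self.free.extend(self.reg_of.values())
        self.reg_of.clear()
        self.dirty.clear()


def emit_add(rc, a, b, dst):
    ra = rc.get_read(a)
    rb = rc.get_read(b)
    rd = rc.get_write(dst)
    rc.code.append(f"ADD {ra}, {rb}, {rd}")


rc = RegCache(["R8", "R9", "R10", "R11"])
emit_add(rc, "x", "y", "z")   # loads x and y, then a single ADD
emit_add(rc, "z", "x", "w")   # both inputs are already in registers
rc.flush()
print("\n".join(rc.code))
```

The second ADD emits no loads or moves at all, which is the saving compared
with copying into scratch registers around every operation.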

The idea for the mostly stalled TKUCC effort would be to use a similar
model to the current form of BGBCC, just focusing more on minimalism,
and probably using separate compilation. Though, there are pros/cons for
"generate everything all at once"; which requires more memory, but has
more opportunity for optimizations, or at least for pruning stuff.

Though, have noted that GCC seems to have devised a different mechanism
to prune stuff with separate compilation (via "-ffunction-sections" and
"-fdata-sections"), namely, to put every function and variable into its
own section in the object files. The linker can then prune sections based
on reachability, and combine the surviving ones into a single section.
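
The pruning step amounts to a reachability walk over section references,
which GNU ld performs when given "--gc-sections". A minimal sketch of that
walk follows; the reference table is invented for illustration, and only
the ".text.<name>" / ".data.<name>" naming follows what GCC emits with
those flags.

```python
# Sketch: keep only sections reachable from the root set.

def reachable_sections(refs, roots):
    """refs maps a section name to the sections it references."""
    live = set()
    work = list(roots)
    while work:
        section = work.pop()
        if section in live:
            continue
        live.add(section)
        work.extend(refs.get(section, ()))
    return live


# Hypothetical program: main calls foo, foo uses a table, bar is never called.
refs = {
    ".text.main": [".text.foo"],
    ".text.foo": [".data.table"],
    ".text.bar": [".data.unused"],
}
live = reachable_sections(refs, [".text.main"])
print(sorted(live))
```

Anything not in the live set (here ".text.bar" and ".data.unused") can be
dropped before the remaining per-function sections are merged.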

Did recently notice in some fiddling that some things were invoking GCC
like:
echo ... | $CC -E -xc - | ...

Was kind of a pain, but added similar behavior to BGBCC in the attempt
to make BGBCC better able to mimic GCC's command-lines.

Did need to have it omit line numbers in this case, as BGBCC had used a
different notation for encoding these:
BGBCC:
/*"fname"lnum*/ line
GCC:
# lnum "fname"
line
And the way the commands were doing text parsing was incompatible with
BGBCC's line-numbering scheme.
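
For illustration, translating between the two notations is mechanical. The
sketch below assumes the BGBCC marker always appears at the start of a line
in exactly the form shown above; the regular expression and function name
are hypothetical and not taken from BGBCC.

```python
# Sketch: rewrite a BGBCC-style line marker into a GCC-style one.
import re

# Matches: /*"fname"lnum*/ followed by the rest of the source line.
_BGBCC_MARK = re.compile(r'^/\*"([^"]*)"(\d+)\*/ ?(.*)$')


def bgbcc_to_gcc(line):
    """Return the GCC-style output lines for one preprocessed line."""
    m = _BGBCC_MARK.match(line)
    if m is None:
        return [line]            # no marker, pass the line through
    fname, lnum, rest = m.groups()
    return [f'# {lnum} "{fname}"', rest]


print(bgbcc_to_gcc('/*"foo.c"12*/ int x;'))
```

Note that the GCC form takes two output lines per marked input line, so a
tool that counts lines in the preprocessor output sees different totals for
the two schemes.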

Also early versions of my BJX2 core, in addition to the slow memory bus,
also did not have pipelined memory operations (and memory access
operations used the same OPM/OK signaling scheme as the bus).

In this case, the cost of extra MOV instructions was considered minor
relative to the cost of the memory loads/stores.

Situation has at least improved since then.
Still very often fighting bugs though...

>>
>> There were a few drawbacks. The biggest one was that programming errors
>> tended to freeze the machine. Another (not so important) was that,
>> if you were one of the lucky people to have an 80x87 coprocessor, the
>> generated code did not check for overflow of the coprocessor stack.
>>
> The fp code generated by TP would never overflow the 87 stack afair,
> since it would do single operations and pop the results at once?
>
> Terje
>

"Paul A. Clayton" <paaronclayton@gmail.com> writes:
>On 12/17/23 2:24 PM, Scott Lurndal wrote:
>> EricP <ThatWouldBeTelling@thevillage.com> writes:
>[snip zero-copy and scatter-gather I/O]
>>> This all interacts with DMA and page management because the physical
>>> page frames that contain the bytes must be pinned in memory for the
>>> duration of the DMA IO.
>>
>> PCI express has an optional feature, PRI (Page Request Interface)
>> that allows the hardware to request that a page be 'pinned' just
>> for the duration of a DMA operation. The ARM64 server base system
>> architecture document requires that the host support PRI. This
>> works in conjunction with PCIe ATS (Address Translation Services)
>> which allows the endpoint device to ask the host for translations
>> and cache them in the endpoint so the endpoint can use physical
>> addresses directly. This is usually implemented by the IOMMU
>> on the host treating the endpoint as if it had a remote TLB cache.
>
>Interesting. I had proposed some years ago that rather than
>pinning a physical page for I/O a page be provided when needed
>from a free list (including that the data could be cached/buffered
>with a virtual address tag).

In most usage cases, the page being DMA'd from/to has other
unrelated data in it, rather than being fully dedicated to
a single buffer or set of buffers.

The PRI is more about making sure the OS makes the page present
before the DMA operation begins and ensuring that it won't go
away before the DMA operation ends.

Re: Whither the Mill?

<e76ab4f97a53caa61bbb7b729fcca360@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=35862&group=comp.arch#35862

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Whither the Mill?
Date: Mon, 18 Dec 2023 17:39:01 +0000
Organization: novaBBS
Message-ID: <e76ab4f97a53caa61bbb7b729fcca360@news.novabbs.com>
References: <ulclu3$3sglk$1@dont-email.me> <gp1pni5t4vfqjsp81fogoboeoqe5hrj5pv@4ax.com> <uli82g$217dj$1@dont-email.me> <a54ae908ce5af533e638e112833b35ea@news.novabbs.com> <JA4fN.38899$xHn7.23180@fx14.iad> <2695abc72966c220809e5c6690a8edf6@news.novabbs.com> <ZP5fN.58208$83n7.3029@fx18.iad> <ea8c8a6be398fa64936d2da4efc2ca71@news.novabbs.com> <LQmfN.5865$zqTf.4843@fx35.iad> <ull9he$2j2v6$1@dont-email.me> <mTGfN.47607$yEgf.35565@fx09.iad> <tVHfN.35563$JLvf.23986@fx44.iad> <ulocca$3amon$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="342750"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
X-Rslight-Site: $2y$10$v4N2axRznel6KJXR.72JDeSOY1N.NTnY0LL8mCLfwjmcfMW6SR2p.
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us

by: MitchAlsup - Mon, 18 Dec 2023 17:39 UTC

Paul A. Clayton wrote:

> On 12/17/23 2:24 PM, Scott Lurndal wrote:
>> EricP <ThatWouldBeTelling@thevillage.com> writes:
> [snip zero-copy and scatter-gather I/O]
>>> This all interacts with DMA and page management because the physical
>>> page frames that contain the bytes must be pinned in memory for the
>>> duration of the DMA IO.
>>
>> PCI express has an optional feature, PRI (Page Request Interface)
>> that allows the hardware to request that a page be 'pinned' just
>> for the duration of a DMA operation. The ARM64 server base system
>> architecture document requires that the host support PRI. This
>> works in conjunction with PCIe ATS (Address Translation Services)
>> which allows the endpoint device to ask the host for translations
>> and cache them in the endpoint so the endpoint can use physical
>> addresses directly. This is usually implemented by the IOMMU
>> on the host treating the endpoint as if it had a remote TLB cache.

> Interesting. I had proposed some years ago that rather than
> pinning a physical page for I/O a page be provided when needed
> from a free list (including that the data could be cached/buffered
> with a virtual address tag).

Guest OS can pin a guest physical page, but HyperVisor decides
if the page is present or absent in memory.

> The Mill's backless memory is similar, deferring physical memory
> allocation until cache eviction using a free list (that is
> refilled by a thread that is activated at low water mark).

Re: Whither the Mill?

<VE0gN.6755$Sf59.2927@fx48.iad>

https://news.novabbs.org/devel/article-flat.php?id=35865&group=comp.arch#35865

Path: i2pn2.org!i2pn.org!news.nntp4.net!weretis.net!feeder8.news.weretis.net!3.eu.feeder.erje.net!3.us.feeder.erje.net!feeder.erje.net!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx48.iad.POSTED!not-for-mail
From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Whither the Mill?
References: <ulclu3$3sglk$1@dont-email.me> <gp1pni5t4vfqjsp81fogoboeoqe5hrj5pv@4ax.com> <uli82g$217dj$1@dont-email.me> <a54ae908ce5af533e638e112833b35ea@news.novabbs.com> <JA4fN.38899$xHn7.23180@fx14.iad> <2695abc72966c220809e5c6690a8edf6@news.novabbs.com> <ZP5fN.58208$83n7.3029@fx18.iad> <ea8c8a6be398fa64936d2da4efc2ca71@news.novabbs.com> <LQmfN.5865$zqTf.4843@fx35.iad> <ull9he$2j2v6$1@dont-email.me> <mTGfN.47607$yEgf.35565@fx09.iad> <tVHfN.35563$JLvf.23986@fx44.iad>
In-Reply-To: <tVHfN.35563$JLvf.23986@fx44.iad>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 54
Message-ID: <VE0gN.6755$Sf59.2927@fx48.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Mon, 18 Dec 2023 19:00:05 UTC
Date: Mon, 18 Dec 2023 13:59:34 -0500
X-Received-Bytes: 3730

by: EricP - Mon, 18 Dec 2023 18:59 UTC

Scott Lurndal wrote:
> EricP <ThatWouldBeTelling@thevillage.com> writes:
>> BGB-Alt wrote:
>>> On 12/16/2023 1:25 PM, EricP wrote:
>>>> MitchAlsup wrote:
>>>>> Scott Lurndal wrote:
>>>>>
>>>>>> mitchalsup@aol.com (MitchAlsup) writes:
>>>>>>> Scott Lurndal wrote:
>>>>>>>
>
>>> One thing I don't get here is why there would be direct DMA between
>>> userland and the device (at least for filesystem and similar).
>> Zero-copy IO. That has always been available on WinNT provided hardware
>> supports it. General byte-buffer IO could always do zero-copy DMA,
>> with HW support. For files one can do IO direct to a user buffer with
>> certain restrictions, buffers must be file block size and alignment.
>> I haven't checked but guessing that if the file block is already in file
>> cache it gets copied, otherwise it DMA's directly to/from the user buffer.
>> Normally one wants cached file blocks but there are times when one doesn't
>> and wants the more optimal direct buffer IO (eg, a video player).
>>
>> There is also scatter-gather IO, intended for network cards,
>> where the IO is a list of byte sized and aligned virtual buffers.
>>
>> This all interacts with DMA and page management because the physical
>> page frames that contain the bytes must be pinned in memory for the
>> duration of the DMA IO.
>
> PCI express has an optional feature, PRI (Page Request Interface)
> that allows the hardware to request that a page be 'pinned' just
> for the duration of a DMA operation. The ARM64 server base system
> architecture document requires that the host support PRI. This
> works in conjunction with PCIe ATS (Address Translation Services)
> which allows the endpoint device to ask the host for translations
> and cache them in the endpoint so the endpoint can use physical
> addresses directly. This is usually implemented by the IOMMU
> on the host treating the endpoint as if it had a remote TLB cache.

I don't know how one would make use of that on Windows as it completely
separates the IO off so that the OS can switch to a different process
address space while the DMA takes place. The data structures to support
paging might not be easily accessible which would introduce long latency
in the middle of a DMA - which is exactly why it doesn't do this.
(I don't think Linux allows paging inside the OS or drivers either.)

On Windows you can have paging while managing a device if you put
the driver code in either a privileged user or super mode thread,
and then you deal with any timing issues.
The old floppy driver worked this way - as an OS thread.
But that was a very slow device and used programmed IO not DMA.

Re: Whither the Mill?

<ulqrl2$3of5a$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35867&group=comp.arch#35867

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: paaronclayton@gmail.com (Paul A. Clayton)
Newsgroups: comp.arch
Subject: Re: Whither the Mill?
Date: Mon, 18 Dec 2023 20:29:04 -0500
Organization: A noiseless patient Spider
Lines: 27
Message-ID: <ulqrl2$3of5a$1@dont-email.me>
References: <ulclu3$3sglk$1@dont-email.me>
<gp1pni5t4vfqjsp81fogoboeoqe5hrj5pv@4ax.com> <uli82g$217dj$1@dont-email.me>
<a54ae908ce5af533e638e112833b35ea@news.novabbs.com>
<JA4fN.38899$xHn7.23180@fx14.iad>
<2695abc72966c220809e5c6690a8edf6@news.novabbs.com>
<ZP5fN.58208$83n7.3029@fx18.iad>
<ea8c8a6be398fa64936d2da4efc2ca71@news.novabbs.com>
<LQmfN.5865$zqTf.4843@fx35.iad> <ull9he$2j2v6$1@dont-email.me>
<mTGfN.47607$yEgf.35565@fx09.iad> <tVHfN.35563$JLvf.23986@fx44.iad>
<ulocca$3amon$1@dont-email.me>
<e76ab4f97a53caa61bbb7b729fcca360@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 19 Dec 2023 01:29:06 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="f64f844a06932dce74c88cfaefa61052";
logging-data="3947690"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18eaMBMeafRSS5qsyowO2J47/mLTGnVqRE="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.0
Cancel-Lock: sha1:Yjhr/E25qvlzriZY/j7E932cutg=
In-Reply-To: <e76ab4f97a53caa61bbb7b729fcca360@news.novabbs.com>

by: Paul A. Clayton - Tue, 19 Dec 2023 01:29 UTC

On 12/18/23 12:39 PM, MitchAlsup wrote:
[snip page pinning for DMA]
> Guest OS can pin a guest physical page, but HyperVisor decides
> if the page is present or absent in memory.

Out of curiosity, what happens when an I/O device tries to DMA to
a page which the OS thinks is pinned? I would *guess* that a DMA
operation that fails for an unvirtualized I/O device merely
presents an error. I would also guess that some I/O operations
could be merely retried, but some might just be lost. For a
virtualized I/O device, it would seem that the OS would be
confused if a (virtual) physical page was reported as having an
access error but perhaps there would be some generic transaction
failed indicator with information about retrying.

(Even with a pool of free pages and significant virtually tagged
caching, a page freeing thread could be "outrun" by I/O requesting
new pages. This presents denial of service attack potential as
well as ordinary danger of resource starvation. [For short DMAs,
caching-only might be practical with a main memory page never
being allocated. This would require unpinning/binding the page
after the data was copied; the copy could be "free" since the data
would be transferred to a processor cache anyway.])

Managing/avoiding oversubscription of resources is probably a week
or more of an OS design course. I sometimes wish I could spend a
few hundred years in a time bubble studying some of these things.

Re: Whither the Mill?

<acc8b3ce283d57769c43e13af1499def@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=35874&group=comp.arch#35874

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Whither the Mill?
Date: Tue, 19 Dec 2023 03:40:07 +0000
Organization: novaBBS
Message-ID: <acc8b3ce283d57769c43e13af1499def@news.novabbs.com>
References: <ulclu3$3sglk$1@dont-email.me> <gp1pni5t4vfqjsp81fogoboeoqe5hrj5pv@4ax.com> <uli82g$217dj$1@dont-email.me> <a54ae908ce5af533e638e112833b35ea@news.novabbs.com> <JA4fN.38899$xHn7.23180@fx14.iad> <2695abc72966c220809e5c6690a8edf6@news.novabbs.com> <ZP5fN.58208$83n7.3029@fx18.iad> <ea8c8a6be398fa64936d2da4efc2ca71@news.novabbs.com> <LQmfN.5865$zqTf.4843@fx35.iad> <ull9he$2j2v6$1@dont-email.me> <mTGfN.47607$yEgf.35565@fx09.iad> <tVHfN.35563$JLvf.23986@fx44.iad> <ulocca$3amon$1@dont-email.me> <e76ab4f97a53caa61bbb7b729fcca360@news.novabbs.com> <ulqrl2$3of5a$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="387198"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
X-Rslight-Site: $2y$10$sScNdi5O62oXnGjK0T94DuePAJpNOdpnpBV6SFcXos26AJcw4hgRG
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us

by: MitchAlsup - Tue, 19 Dec 2023 03:40 UTC

Paul A. Clayton wrote:

> On 12/18/23 12:39 PM, MitchAlsup wrote:
> [snip page pinning for DMA]
>> Guest OS can pin a guest physical page, but HyperVisor decides
>> if the page is present or absent in memory.

> Out of curiosity, what happens when an I/O device tries to DMA to
> a page which the OS thinks is pinned. I would *guess* that a DMA
> operation that fails for an unvirtualized I/O device merely
> presents an error.

If the page fault occurs in the level 1 table, Guest OS gets a
device page fault exception; if it happens in the level 2 table,
HyperVisor gets a device page fault exception.
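
A minimal sketch of that routing, assuming a toy two-level walk in which
each level is a plain dictionary. The names (translate_dma, FaultTarget,
level1, level2) are hypothetical and do not come from any real IOMMU
interface:

```python
from enum import Enum


class FaultTarget(Enum):
    NONE = "no fault"
    GUEST_OS = "Guest OS"
    HYPERVISOR = "HyperVisor"


def translate_dma(device_page, level1, level2):
    """Walk two hypothetical translation levels for one device access.

    level1: device-virtual page -> guest-physical page (owned by Guest OS)
    level2: guest-physical page -> host-physical page (owned by HyperVisor)
    Returns (host_page_or_None, who_receives_the_fault).
    """
    guest_page = level1.get(device_page)
    if guest_page is None:
        # Miss in the level 1 table: the Guest OS gets the exception.
        return None, FaultTarget.GUEST_OS
    host_page = level2.get(guest_page)
    if host_page is None:
        # Miss in the level 2 table: the HyperVisor gets the exception.
        return None, FaultTarget.HYPERVISOR
    return host_page, FaultTarget.NONE


# Illustrative tables: page 0x11 is mapped by the guest but not backed
# by the hypervisor; page 0x12 is not mapped by the guest at all.
level1 = {0x10: 0x200, 0x11: 0x201}
level2 = {0x200: 0x9000}

for page in (0x10, 0x11, 0x12):
    print(hex(page), translate_dma(page, level1, level2))
```

The point of the sketch is only that the level at which the walk fails
selects the recipient of the fault.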

If the device can recover from page faults, the proper supervisor
"does OS stuff" and then signals the device to proceed with the
still pending device request. The "does OS stuff" does for the
I/O device pretty much what the proper supervisor does with a
CPU page fault--with all the nuances and idiosyncrasies (or more.)
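
As a rough illustration of the recover-and-proceed flow, here is a toy
model in which the device parks the faulting request and the supervisor
maps a frame before telling it to continue. RecoverableDevice,
supervisor_handle_fault and free_frames are made-up names, and "does OS
stuff" is reduced to taking one frame from a list:

```python
class RecoverableDevice:
    """Toy device that can hold a faulting request until told to proceed."""

    def __init__(self, page_table):
        self.page_table = page_table  # shared dict: page -> frame
        self.pending = None           # request parked after a fault
        self.completed = []           # (frame, data) pairs actually written

    def request(self, page, data):
        if page not in self.page_table:
            self.pending = (page, data)  # keep the request, report a fault
            return "fault"
        self.completed.append((self.page_table[page], data))
        return "done"

    def proceed(self):
        """Retry the parked request after the supervisor fixed the mapping."""
        page, data = self.pending
        self.pending = None
        return self.request(page, data)


def supervisor_handle_fault(device, free_frames):
    """Stand-in for "does OS stuff": map a frame, then resume the device."""
    page, _ = device.pending
    device.page_table[page] = free_frames.pop()
    return device.proceed()


table = {}
dev = RecoverableDevice(table)
first = dev.request(7, b"payload")            # faults: page 7 unmapped
second = supervisor_handle_fault(dev, [0x40])  # maps page 7, resumes
print(first, second, dev.completed)
```

A device that cannot park the request has no equivalent of proceed(),
which is the case the rest of the thread worries about.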

> I would also guess that some I/O operations
> could be merely retried, but some might just be lost. For a
> virtualized I/O device, it would seem that the OS would be
> confused if a (virtual) physical page was reported as having an
> access error but perhaps there would be some generic transaction
> failed indicator with information about retrying.

> (Even with a pool of free pages and significant virtually tagged
> caching, a page freeing thread could be "outrun" by I/O requesting
> new pages.

Let the "proper supervisor" sort it out. Keep HW out of the game.

> This presents denial of service attack potential as
> well as ordinary danger of resource starvation. [For short DMAs,
> caching-only might be practical with a main memory page never
> being allocated. This would require unpinning/binding the page
> after the data was copied; the copy could be "free" since the data
> would be transferred to a processor cache anyway.])

> Managing/avoiding oversubscription of resources is probably a week
> or more of an OS design course. I sometimes wish I could spend a
> few hundred years in a time bubble studying some of these things.

"Paul A. Clayton" <paaronclayton@gmail.com> writes:
>On 12/18/23 12:39 PM, MitchAlsup wrote:
>[snip page pinning for DMA]
>> Guest OS can pin a guest physical page, but HyperVisor decides
>> if the page is present or absent in memory.
>
>Out of curiosity, what happens when an I/O device tries to DMA to
>a page which the OS thinks is pinned.

The I/O device simply pushes data to the physical address. It's
the responsibility of the operating software to ensure the
physical address given to the device (either via ATS where the
device hosts the "tlb" or via the IOMMU) is correct and legal.

If the IOMMU translation tables mark the page as absent, an error response
will be returned to the device. If ATS was used, and the
host didn't invalidate the translation at the host, the
device will DMA to the specified physical address regardless
of whether it is the correct page.
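
A small model of the two cases, assuming a dictionary stands in for the
IOMMU tables and another for the device-side translation cache. The
classes Iommu and AtsDevice and their methods are illustrative only and
are not the PCIe ATS protocol itself:

```python
class Iommu:
    """Toy IOMMU: io_page -> host frame, absent pages are simply missing."""

    def __init__(self):
        self.table = {}

    def translate(self, io_page):
        # None models the error response for a page marked absent.
        return self.table.get(io_page)


class AtsDevice:
    """Toy device that caches translations, as an ATS-capable device would."""

    def __init__(self, iommu):
        self.iommu = iommu
        self.cache = {}  # device-held translations

    def dma_write(self, io_page, memory, data):
        frame = self.cache.get(io_page)
        if frame is None:
            frame = self.iommu.translate(io_page)
            if frame is None:
                return "error"
            self.cache[io_page] = frame
        memory[frame] = data  # device writes wherever its translation says
        return "ok"

    def invalidate(self, io_page):
        self.cache.pop(io_page, None)


iommu, memory = Iommu(), {}
device = AtsDevice(iommu)
iommu.table[1] = 100
device.dma_write(1, memory, "first")   # translation now cached in the device
del iommu.table[1]                     # host unmaps but forgets to invalidate
stale = device.dma_write(1, memory, "second")  # still lands in frame 100
device.invalidate(1)
after = device.dma_write(1, memory, "third")   # now refused by the IOMMU
print(stale, after, memory)
```

The write labelled "second" succeeds even though the host has withdrawn
the mapping, which is the hazard described above when the invalidation
is skipped.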

Re: Whither the Mill?

<un1kjq$2pvtc$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=36507&group=comp.arch#36507

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: paaronclayton@gmail.com (Paul A. Clayton)
Newsgroups: comp.arch
Subject: Re: Whither the Mill?
Date: Mon, 1 Jan 2024 11:48:19 -0500
Organization: A noiseless patient Spider
Lines: 114
Message-ID: <un1kjq$2pvtc$1@dont-email.me>
References: <ulclu3$3sglk$1@dont-email.me>
<gp1pni5t4vfqjsp81fogoboeoqe5hrj5pv@4ax.com> <uli82g$217dj$1@dont-email.me>
<a54ae908ce5af533e638e112833b35ea@news.novabbs.com>
<JA4fN.38899$xHn7.23180@fx14.iad>
<2695abc72966c220809e5c6690a8edf6@news.novabbs.com>
<ZP5fN.58208$83n7.3029@fx18.iad>
<ea8c8a6be398fa64936d2da4efc2ca71@news.novabbs.com>
<LQmfN.5865$zqTf.4843@fx35.iad> <ull9he$2j2v6$1@dont-email.me>
<mTGfN.47607$yEgf.35565@fx09.iad> <tVHfN.35563$JLvf.23986@fx44.iad>
<ulocca$3amon$1@dont-email.me>
<e76ab4f97a53caa61bbb7b729fcca360@news.novabbs.com>
<ulqrl2$3of5a$1@dont-email.me>
<acc8b3ce283d57769c43e13af1499def@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 2 Jan 2024 18:28:10 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="ed57ef1e2c8fc107a6f9e5f0685c36ed";
logging-data="2949036"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX199gQ129aMkRvV8XyXraylpMtiswOnd334="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.0
Cancel-Lock: sha1:qBCp/4xfRE3S+xz/Ahkza2d3d8k=
In-Reply-To: <acc8b3ce283d57769c43e13af1499def@news.novabbs.com>

by: Paul A. Clayton - Mon, 1 Jan 2024 16:48 UTC

On 12/18/23 10:40 PM, MitchAlsup wrote:
> Paul A. Clayton wrote:
>
>> On 12/18/23 12:39 PM, MitchAlsup wrote:
>> [snip page pinning for DMA]
>>> Guest OS can pin a guest physical page, but HyperVisor decides
>>> if the page is present or absent in memory.
>
>> Out of curiosity, what happens when an I/O device tries to DMA to
>> a page which the OS thinks is pinned. I would *guess* that a DMA
>> operation that fails for an unvirtualized I/O device merely
>> presents an error.
>
> If the page fault occurs in the level 1 table, Guest OS gets a
> device page fault exception, if it happens in the level 2 table
> HyperVisor gets a device page fault exception.
>
> If the device can recover from page faults, the proper supervisor
> "does OS stuff" and then signals the device to proceed with the
> still pending device request. The "does OS stuff" does for the I/O
> device pretty much what the proper supervisor does with a
> CPU page fault--with all the nuances and idiosyncrasies (or more.)

If the HV encounters a device that cannot handle a page fault for
a page that the HV decided not to allocate but the OS did (knowing
that that specific device could not handle page faults), what
error status is sent to the OS? The HV cannot simply pass along a
"page fault" error because the OS _knows_ that the page was
allocated; that would break pure virtualization and potentially
seriously confuse the OS if virtualization was not considered as a
possibility (e.g., the OS might assume the device had either a
transient or persistent error that caused the wrong error type to
be returned, confirm it as persistent after the second encounter,
and mark the device as broken).

The HV could allocate the page for any such device, but that
requires the HV to be bloated with device driver specifics and to
check page allocation whenever an OS gave a DMA target to such a
device.

Am I missing something?

[snip previous confusion comment]
>> (Even with a pool of free pages and significant virtually tagged
>> caching, a page freeing thread could be "outrun" by I/O requesting
>> new pages.
>
> Let the "proper supervisor" sort it out. Keep HW out of the game.

Software cannot present the illusion of the page being pinned (for
devices that cannot handle page faults).

Hardware can easily cache data sent from a device at the "I/O Hub"
level using virtual addresses. "Shadow Memory" ("physical"
addresses outside the memory region that have an additional
translation layer using a TLB-like structure; the concept was
proposed in "Increasing TLB Reach Using Superpages Backed by
Shadow Memory" (Mark Swanson et al., 1998)), extended with
delayed-allocation support, would allow physically tagged
processor caches to cache DMA that is not backed by actual main
memory.
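
To make the delayed-allocation idea concrete, here is a toy model of my
reading of it; it is not the Swanson et al. design nor the Mill's
Backless Memory, and every name in it (ShadowMemory, shadow_base,
evict) is made up for illustration. DMA data lands in a cache keyed by
shadow address, and a real frame is bound only when the line is
evicted:

```python
class ShadowMemory:
    """Toy shadow address space whose backing frames are allocated lazily."""

    def __init__(self, shadow_base, free_frames):
        self.shadow_base = shadow_base       # first page of the shadow region
        self.free_frames = list(free_frames)
        self.mapping = {}                    # shadow page -> real frame
        self.cache = {}                      # shadow page -> cached data
        self.dram = {}                       # real frame -> data

    def dma_write(self, shadow_page, data):
        if shadow_page < self.shadow_base:
            raise ValueError("not a shadow address")
        # Data is held in cache only; no main-memory frame is consumed yet.
        self.cache[shadow_page] = data

    def evict(self, shadow_page):
        """Write a cached page back, binding a real frame on first eviction."""
        data = self.cache.pop(shadow_page)
        frame = self.mapping.get(shadow_page)
        if frame is None:
            frame = self.free_frames.pop()
            self.mapping[shadow_page] = frame
        self.dram[frame] = data
        return frame

    def read(self, shadow_page):
        if shadow_page in self.cache:
            return self.cache[shadow_page]
        return self.dram[self.mapping[shadow_page]]


shadow = ShadowMemory(shadow_base=0x1000, free_frames=[5])
shadow.dma_write(0x1001, "packet")
unbacked = dict(shadow.mapping)    # still empty: nothing allocated
frame = shadow.evict(0x1001)       # allocation happens here
print(unbacked, frame, shadow.read(0x1001))
```

A short DMA whose data is consumed before eviction would never reach
evict() in this model, so it would never cost a main-memory page.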

(Of course, elliptical orbits can be approximated to arbitrary
accuracy with epicycles. Accumulating fixes to increasingly rare
behavioral deviations will not lead to an elegant design — even if
it is common practice. On the other hand, just accepting the
coarse approximation of circular orbits, while more elegant, seems
flawed. Some applications will encounter the three body problem
and be forced into a sort of inelegance. For page pinning, I think
hardware page buffering may be worth the complexity, especially
since the functionality presents other opportunities. Since the
Mill provides the same with its Backless Memory, I am not the only
one who thinks the complexity is acceptable — this does not mean
that the complexity cost is not excessive and your experience
indicating it is excessive certainly urges more caution.)

By adding a small pool of free pages and a means to request more
when a low-water mark is reached, hardware/firmware could reduce
the frequency of HV/OS involvement and, except under extreme
utilization when the caches overflow (which could be rather
extreme if even 20% of last-level cache was usable) and the page
free pool is empty, allow "legacy" devices that demand pages be
present to operate as if they were present.
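
A sketch of the low-water-mark pool, assuming the HV/OS is represented
by a refill callback that may or may not keep up. FreePagePool,
low_water and refill are hypothetical names; real hardware would make
the request asynchronously, which this toy version collapses into a
direct call:

```python
class FreePagePool:
    """Toy free-page pool that asks for more pages at a low-water mark."""

    def __init__(self, frames, low_water, refill):
        self.frames = list(frames)
        self.low_water = low_water
        self.refill = refill          # callback standing in for the HV/OS
        self.refill_requests = 0

    def allocate(self):
        if not self.frames:
            # Pool exhausted: the device would see a page fault here.
            return None
        frame = self.frames.pop()
        if len(self.frames) <= self.low_water:
            # Ask the supervisor for more pages; it may return none.
            self.refill_requests += 1
            self.frames.extend(self.refill())
        return frame


# A supervisor that never keeps up: the pool drains and then fails.
starved = FreePagePool([1, 2, 3], low_water=1, refill=lambda: [])
results = [starved.allocate() for _ in range(4)]
print(results, starved.refill_requests)
```

With a refill callback that does return pages, the supervisor is only
involved once per batch rather than once per DMA target, which is the
reduction in HV/OS involvement argued for above.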

I realize that introducing a "bug" (really any behavioral variance
that violates expectations) that only manifests under extreme
circumstances is problematic. Such a variance would at least have
the documentation of an unexpected but sensible error notification
(possibly just the I/O device giving a page fault error when
hardware cannot keep up); since this is like a paravirtualization
feature the OS/HV would not be confused. (A HV using it could
mostly work with an OS that did not use the feature — rarely
encountering impossible I/O device page faults.)

The distinction between hardware, firmware, and paravirtualizing
hypervisor seems somewhat fuzzy, especially if the hypervisor is
provided by the hardware designer. If moving functionality from a
hypervisor to firmware/hardware can make useful new functionality
possible/practical (which appears to be the case with the above
proposal), then I think such an expansion of hardware
responsibility should be considered. (I fear the Mill will not
get even an FPGA implementation that would really test the limits
of Backless Memory, so this seems likely to be a mere "academic"
proposition.)

>> This presents denial of service attack potential as
>> well as ordinary danger of resource starvation. [For short DMAs,
>> caching-only might be practical with a main memory page never
>> being allocated. This would require unpinning/binding the page
>> after the data was copied; the copy could be "free" since the data
>> would be transferred to a processor cache anyway.])
>
>> Managing/avoiding oversubscription of resources is probably a week
>> or more of an OS design course. I sometimes wish I could spend a
>> few hundred years in a time bubble studying some of these things.

"Paul A. Clayton" <paaronclayton@gmail.com> writes:
>On 12/18/23 10:40 PM, MitchAlsup wrote:> Paul A. Clayton wrote:
> >
> >> On 12/18/23 12:39 PM, MitchAlsup wrote:
> >> [snip page pinning for DMA]
> >>> Guest OS can pin a guest physical page, but HyperVisor decides
> >>> if the page is present or absent in memory.
> >
> >> Out of curiosity, what happens when an I/O device tries to DMA to
> >> a page which the OS thinks is pinned. I would *guess* that a DMA
> >> operation that fails for an unvirtualized I/O device merely
> >> presents an error.
> >
> > If the page fault occurs in the level 1 table, Guest OS gets a
> > device page fault exception, if it happens in the level 2 table
> > HyperVisor gets a device page fault exception.
> >
> > If the device can recover from page faults, the proper supervisor
> > "does OS stuff" and then signals the device to proceed with the
> > still pending device request. The "does OS stuff" does for the I/O
> > device pretty much what the proper supervisor does with a
> > CPU page fault--with all the nuances and idiosyncrasies (or more.)
>
>If the HV encounters a device that cannot handle a page fault for
>a page that it decided not to allocate but the OS did (knowing
>that that specific device could not handle page faults), what
>error status is sent to the OS? The HV cannot simply pass along a
>"page fault" error because the OS _knows_ that the page was
>allocated; that would break pure virtualization and potentially
>seriously confuse the OS if virtualization was not considered as a
>possibility (e.g., the OS might assume the device had either a
>transient or persistent error that caused the wrong error type to
>be returned, confirm it as persistent after the second encounter,
>and mark the device as broken).

If the HV is allowing direct access to the device, and allowing
the device to use physical addresses via cached translations,
then the device must support both PCIe ATS and PRI. The former
handles the translations and the latter requests that a page
be "pinned" for a subsequent DMA operation.

The HV controls the IOMMU which provides both the ATS and PRI interfaces
to the device. So the HV can invalidate a translation held in the
device (for ATS) or refuse to pin a page (or unpin a page).

Re: Whither the Mill?

<c655b3864159e9a942f86d1b09453283@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=36567&group=comp.arch#36567

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Whither the Mill?
Date: Wed, 3 Jan 2024 18:00:54 +0000
Organization: novaBBS
Message-ID: <c655b3864159e9a942f86d1b09453283@news.novabbs.com>
References: <ulclu3$3sglk$1@dont-email.me> <2695abc72966c220809e5c6690a8edf6@news.novabbs.com> <ZP5fN.58208$83n7.3029@fx18.iad> <ea8c8a6be398fa64936d2da4efc2ca71@news.novabbs.com> <LQmfN.5865$zqTf.4843@fx35.iad> <ull9he$2j2v6$1@dont-email.me> <mTGfN.47607$yEgf.35565@fx09.iad> <tVHfN.35563$JLvf.23986@fx44.iad> <ulocca$3amon$1@dont-email.me> <e76ab4f97a53caa61bbb7b729fcca360@news.novabbs.com> <ulqrl2$3of5a$1@dont-email.me> <acc8b3ce283d57769c43e13af1499def@news.novabbs.com> <un1kjq$2pvtc$1@dont-email.me> <Y81lN.18613$9cLc.10524@fx02.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="2151448"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Spam-Checker-Version: SpamAssassin 4.0.0
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
X-Rslight-Site: $2y$10$UByehB.PhYN90WhUb7DSwOTJTFVffXJPwERuBUeQKFTP61M70/QfC

by: MitchAlsup - Wed, 3 Jan 2024 18:00 UTC

Scott Lurndal wrote:

> "Paul A. Clayton" <paaronclayton@gmail.com> writes:
>>On 12/18/23 10:40 PM, MitchAlsup wrote:> Paul A. Clayton wrote:
>> >
>> >> On 12/18/23 12:39 PM, MitchAlsup wrote:
>> >> [snip page pinning for DMA]
>> >>> Guest OS can pin a guest physical page, but HyperVisor decides
>> >>> if the page is present or absent in memory.
>> >
>> >> Out of curiosity, what happens when an I/O device tries to DMA to
>> >> a page which the OS thinks is pinned. I would *guess* that a DMA
>> >> operation that fails for an unvirtualized I/O device merely
>> >> presents an error.
>> >
>> > If the page fault occurs in the level 1 table, Guest OS gets a
>> > device page fault exception, if it happens in the level 2 table
>> > HyperVisor gets a device page fault exception.
>> >
>> > If the device can recover from page faults, the proper supervisor
>> > "does OS stuff" and then signals the device to proceed with the
>> > still pending device request. The "does OS stuff" does for the I/O
>> > device pretty much what the proper supervisor does with a
>> > CPU page fault--with all the nuances and idiosyncrasies (or more.)
>>
>>If the HV encounters a device that cannot handle a page fault for
>>a page that it decided not to allocate but the OS did (knowing
>>that that specific device could not handle page faults), what
>>error status is sent to the OS? The HV cannot simply pass along a
>>"page fault" error because the OS _knows_ that the page was
>>allocated; that would break pure virtualization and potentially
>>seriously confuse the OS if virtualization was not considered as a
>>possibility (e.g., the OS might assume the device had either a
>>transient or persistent error that caused the wrong error type to
>>be returned, confirm it as persistent after the second encounter,
>>and mark the device as broken).

> If the HV is allowing direct access to the device, and allowing
> the device to use physical addresses via cached translations,
> then the device must support both PCIe ATS and PRI. The former

Or have a HostBridge that provides translation services to
virtualized devices....

> handles the translations and the latter requests that a page
> be "pinned" for a subsequent DMA operation.

> The HV controls the IOMMU which provides both the ATS and PRI interfaces
> to the device. So the HV can invalidate a translation held in the
> device (for ATS) or refuse to pin a page (or unpin a page).

mitchalsup@aol.com (MitchAlsup) writes:
>Scott Lurndal wrote:
>
>> "Paul A. Clayton" <paaronclayton@gmail.com> writes:
>>>On 12/18/23 10:40 PM, MitchAlsup wrote:> Paul A. Clayton wrote:
>>> >
>>> >> On 12/18/23 12:39 PM, MitchAlsup wrote:
>>> >> [snip page pinning for DMA]
>>> >>> Guest OS can pin a guest physical page, but HyperVisor decides
>>> >>> if the page is present or absent in memory.
>>> >
>>> >> Out of curiosity, what happens when an I/O device tries to DMA to
>>> >> a page which the OS thinks is pinned. I would *guess* that a DMA
>>> >> operation that fails for an unvirtualized I/O device merely
>>> >> presents an error.
>>> >
>>> > If the page fault occurs in the level 1 table, Guest OS gets a
>>> > device page fault exception, if it happens in the level 2 table
>>> > HyperVisor gets a device page fault exception.
>>> >
>>> > If the device can recover from page faults, the proper supervisor
>>> > "does OS stuff" and then signals the device to proceed with the
>>> > still pending device request. The "does OS stuff" does for the I/O
>>> > device pretty much what the proper supervisor does with a
>>> > CPU page fault--with all the nuances and idiosyncrasies (or more.)
>>>
>>>If the HV encounters a device that cannot handle a page fault for
>>>a page that it decided not to allocate but the OS did (knowing
>>>that that specific device could not handle page faults), what
>>>error status is sent to the OS? The HV cannot simply pass along a
>>>"page fault" error because the OS _knows_ that the page was
>>>allocated; that would break pure virtualization and potentially
>>>seriously confuse the OS if virtualization was not considered as a
>>>possibility (e.g., the OS might assume the device had either a
>>>transient or persistent error that caused the wrong error type to
>>>be returned, confirm it as persistent after the second encounter,
>>>and mark the device as broken).
>
>> If the HV is allowing direct access to the device, and allowing
>> the device to use physical addresses via cached translations,
>> then the device must support both PCIe ATS and PRI. The former
>
>Or have a HostBridge that provides translation services to
>virtualized devices....

All of the major operating systems fully support PCIe ATS and PRI
standards.

Leveraging that makes your processor viable; using a custom host
bridge doesn't.
