
Subject (Author)
* hmac hardware acceleration... (Chris M. Thomasson)
`* Re: hmac hardware acceleration... (MitchAlsup)
 `* Re: hmac hardware acceleration... (Terje Mathisen)
  +* Re: hmac hardware acceleration... (EricP)
  |`* Re: hmac hardware acceleration... (Scott Lurndal)
  | `* Re: hmac hardware acceleration... (EricP)
  |  `* Re: hmac hardware acceleration... (MitchAlsup)
  |   `* Re: hmac hardware acceleration... (EricP)
  |    +* Re: hmac hardware acceleration... (MitchAlsup)
  |    |+- Re: hmac hardware acceleration... (Scott Lurndal)
  |    |+- Re: hmac hardware acceleration... (EricP)
  |    |`* Re: hmac hardware acceleration... (Chris M. Thomasson)
  |    | `- Re: hmac hardware acceleration... (Chris M. Thomasson)
  |    `- Re: hmac hardware acceleration... (Scott Lurndal)
  `- Re: hmac hardware acceleration... (Thomas Koenig)

hmac hardware acceleration...

<ul8qlh$3htui$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35692&group=comp.arch#35692
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: hmac hardware acceleration...
Date: Mon, 11 Dec 2023 21:21:52 -0800
Organization: A noiseless patient Spider
Lines: 16
Message-ID: <ul8qlh$3htui$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 12 Dec 2023 05:21:53 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="23d5b978bf91b8a6165d34ef4bea246f";
logging-data="3733458"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/bSAn7+ioLBYJSfVZpA35qMdj3HmQkYqM="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:g0/0u5ueeZuS0GREZrBnfguEqFU=
Content-Language: en-US
 by: Chris M. Thomasson - Tue, 12 Dec 2023 05:21 UTC

Hmm... I am wondering if hardware-based HMAC could help out
one of my encryption experiments, for fun... Here is a hyper-crude little
write-up with some crude Python 3 code in it. It's not all that fast, yikes!

http://funwithfractals.atspace.cc/ct_cipher

Online version of the experiment:

http://fractallife247.com/test/hmac_cipher/ver_0_0_0_1

First of all, never use this cipher, simply because it has not been
properly peer reviewed yet! If interested, experiment with it, but never
use it until it has been deemed worthy of protecting a pet's life, your
Mom's life, your own life, etc.
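
For readers wondering what a hardware HMAC unit would actually accelerate: HMAC (RFC 2104) is two passes of the underlying hash over key-padded input. The sketch below is not the ct_cipher code from the links above; it only rebuilds HMAC-SHA256 from `hashlib.sha256` and checks it against Python's standard `hmac` module. The key and message are made-up values.

```python
import hashlib
import hmac

BLOCK_SIZE = 64  # SHA-256 processes input in 64-byte blocks


def hmac_sha256_by_hand(key: bytes, message: bytes) -> bytes:
    """HMAC-SHA256 written out from the RFC 2104 definition."""
    # Keys longer than one block are hashed first; shorter keys are zero-padded.
    if len(key) > BLOCK_SIZE:
        key = hashlib.sha256(key).digest()
    key = key.ljust(BLOCK_SIZE, b"\x00")

    inner_pad = bytes(b ^ 0x36 for b in key)
    outer_pad = bytes(b ^ 0x5C for b in key)

    # Two hash passes: inner over the message, outer over the inner digest.
    inner = hashlib.sha256(inner_pad + message).digest()
    return hashlib.sha256(outer_pad + inner).digest()


# Illustrative values only.
key = b"example key"
message = b"a very short message"

tag = hmac_sha256_by_hand(key, message)
reference = hmac.new(key, message, hashlib.sha256).digest()

# Constant-time comparison, as one would use when verifying a received tag.
print(hmac.compare_digest(tag, reference))
```

Since the cost is almost entirely the two hash passes, a hardware SHA unit speeds up HMAC more or less for free; a dedicated "HMAC unit" would mostly save the key-padding bookkeeping.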

Re: hmac hardware acceleration...

<aecc5efbb8795bb2ff4b54a2dcf79f1d@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=35744&group=comp.arch#35744
Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: hmac hardware acceleration...
Date: Thu, 14 Dec 2023 20:08:30 +0000
Organization: novaBBS
Message-ID: <aecc5efbb8795bb2ff4b54a2dcf79f1d@news.novabbs.com>
References: <ul8qlh$3htui$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="4122720"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
X-Rslight-Site: $2y$10$r.ZGPH/Y6eEs14A9JUwnQOMhsFjNT4mxigGi.jjTMqvbypqWCQLd6
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
 by: MitchAlsup - Thu, 14 Dec 2023 20:08 UTC

Chris M. Thomasson wrote:

> Humm... I am wondering if hardware based HMAC could possibly help out
> one of my encryption experiments, for fun... A hyper crude little write
> up, has some crude Python 3 code in there. It's not all that fast, yikes!

> http://funwithfractals.atspace.cc/ct_cipher

> Online version of it:

> Online experiment:

> http://fractallife247.com/test/hmac_cipher/ver_0_0_0_1

> First of all never use this cipher simply because it has not been
> properly peer reviewed yet! If interested, experiment with it, never use
> it until it has been deemed worth to protect a pet's life, your Mom's
> life, your own life, ect.

When I looked into this a while back, I came to the conclusion that
incorporating something like SHA256, SHA512, DES, AES, ... encryption
stuff suits an attached processor a lot better than putting it in
the ISA directly.

Why: It is fundamentally difficult to chop up the units of work
to fit in GPRs, and if you run the data through the GPRs (or any
CPU register) you open up holes in your security blanket that
are never open in the attached processor implementation. Perf
will be better in the attached processor version unless the
width of the en/decryption is small.
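
To put rough numbers on the "chop up the units of work to fit in GPRs" point, the sketch below counts how many 64-bit registers the working set of each algorithm would occupy. The block, state, and key-schedule sizes are the standard published ones; the 64-bit GPR width and the choice to count the whole expanded key schedule as live state are assumptions made for illustration.

```python
GPR_BITS = 64  # assumed register width


def gprs_needed(bits: int) -> int:
    """Number of GPR_BITS-wide registers needed to hold `bits` bits."""
    return -(-bits // GPR_BITS)  # ceiling division


# (name, bits of data block, bits of chaining state or key schedule)
WORKING_SETS = [
    ("SHA-256", 512, 256),        # 512-bit message block, 256-bit hash state
    ("SHA-512", 1024, 512),       # 1024-bit message block, 512-bit hash state
    ("AES-128", 128, 11 * 128),   # 128-bit block, 11 round keys
    ("AES-256", 128, 15 * 128),   # 128-bit block, 15 round keys
    ("DES", 64, 16 * 48),         # 64-bit block, 16 subkeys of 48 bits
]

for name, block_bits, state_bits in WORKING_SETS:
    total = gprs_needed(block_bits) + gprs_needed(state_bits)
    print(f"{name:8s} block={gprs_needed(block_bits):2d} GPRs  "
          f"state/keys={gprs_needed(state_bits):2d} GPRs  total={total:2d}")
```

Against a 32-entry register file (an assumed, though common, size), most of these working sets leave little room for anything else, which is one way to read the objection above.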

Re: hmac hardware acceleration...

<ulh5mg$1rhh9$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35754&group=comp.arch#35754
Path: i2pn2.org!i2pn.org!news.samoylyk.net!news.mb-net.net!open-news-network.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: terje.mathisen@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: hmac hardware acceleration...
Date: Fri, 15 Dec 2023 10:19:12 +0100
Organization: A noiseless patient Spider
Lines: 45
Message-ID: <ulh5mg$1rhh9$1@dont-email.me>
References: <ul8qlh$3htui$1@dont-email.me>
<aecc5efbb8795bb2ff4b54a2dcf79f1d@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 15 Dec 2023 09:19:12 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="dab2240ebb002b11283d42f201a52037";
logging-data="1951273"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18DE6bGaboldQxhKBnP5cPJ0UjFj0JukgI9Oyczh4Y1XQ=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Firefox/91.0 SeaMonkey/2.53.17.1
Cancel-Lock: sha1:XSHLd2RzxYGQgA59r6EqPVrIxV8=
In-Reply-To: <aecc5efbb8795bb2ff4b54a2dcf79f1d@news.novabbs.com>
 by: Terje Mathisen - Fri, 15 Dec 2023 09:19 UTC

MitchAlsup wrote:
> Chris M. Thomasson wrote:
>
>> Humm... I am wondering if hardware based HMAC could possibly help out
>> one of my encryption experiments, for fun... A hyper crude little
>> write up, has some crude Python 3 code in there. It's not all that
>> fast, yikes!
>
>> http://funwithfractals.atspace.cc/ct_cipher
>
>> Online version of it:
>
>> Online experiment:
>
>> http://fractallife247.com/test/hmac_cipher/ver_0_0_0_1
>
>> First of all never use this cipher simply because it has not been
>> properly peer reviewed yet! If interested, experiment with it, never
>> use it until it has been deemed worth to protect a pet's life, your
>> Mom's life, your own life, ect.
>
>
> When I looked into this a while back, I came to the conclusion that
> incorporating something like SHA256, SHA512, DES, AES, ... encryption
> stuff suits an attached processor a lot better than putting it in ISA
> directly.
>
> Why: It is fundamentally difficult to chop up the units of work
> to fit in GPRs, and if you run the data through the GPRs (or any CPU
> register) you open up holes in your security blanket that
> are never open in the attached processor implementation. Perf
> will be better in the attached processor version unless the
> width of the en/decryption is small.

I disagree, specifically because these algorithms are used a lot on
short inputs: For a bulk process an attached coprocessor is an excellent
idea, but when you just want to verify the hash of a very short message,
or encrypt a single packet, you do want this to be very close to the cpu.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
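
The trade-off between the two positions above can be written as a simple cost model: each path has a fixed start-up cost plus a per-byte cost, and the coprocessor wins only past a break-even message size. All four numbers below are invented placeholders, not measurements of any real hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    name: str
    setup_cycles: float      # fixed cost to start one operation
    cycles_per_byte: float   # marginal cost per byte processed

    def cost(self, nbytes: int) -> float:
        return self.setup_cycles + self.cycles_per_byte * nbytes


def break_even_bytes(near: Path, far: Path) -> float:
    """Message size at which `far` becomes as cheap as `near`.

    Assumes `far` has the higher setup cost and the lower per-byte cost.
    """
    return (far.setup_cycles - near.setup_cycles) / (
        near.cycles_per_byte - far.cycles_per_byte
    )


# Hypothetical figures chosen only to show the shape of the trade-off.
in_core = Path("in-core instructions", setup_cycles=20, cycles_per_byte=2.0)
attached = Path("attached coprocessor", setup_cycles=2000, cycles_per_byte=0.5)

threshold = break_even_bytes(in_core, attached)
print(f"break-even at about {threshold:.0f} bytes")

for size in (64, 1500, 65536):
    best = min((in_core, attached), key=lambda p: p.cost(size))
    print(f"{size:6d} bytes -> {best.name}")
```

Whatever the real numbers are, the shape is the same: short hashes and single packets sit to the left of the break-even point, bulk jobs to the right.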

Re: hmac hardware acceleration...

<bb_eN.21874$xHn7.20939@fx14.iad>

https://news.novabbs.org/devel/article-flat.php?id=35762&group=comp.arch#35762
Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!peer02.ams1!peer.ams1.xlned.com!news.xlned.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx14.iad.POSTED!not-for-mail
From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: hmac hardware acceleration...
References: <ul8qlh$3htui$1@dont-email.me> <aecc5efbb8795bb2ff4b54a2dcf79f1d@news.novabbs.com> <ulh5mg$1rhh9$1@dont-email.me>
In-Reply-To: <ulh5mg$1rhh9$1@dont-email.me>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 59
Message-ID: <bb_eN.21874$xHn7.20939@fx14.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Fri, 15 Dec 2023 15:22:47 UTC
Date: Fri, 15 Dec 2023 10:22:38 -0500
X-Received-Bytes: 3209
 by: EricP - Fri, 15 Dec 2023 15:22 UTC

Terje Mathisen wrote:
> MitchAlsup wrote:
>> Chris M. Thomasson wrote:
>>
>>> Humm... I am wondering if hardware based HMAC could possibly help out
>>> one of my encryption experiments, for fun... A hyper crude little
>>> write up, has some crude Python 3 code in there. It's not all that
>>> fast, yikes!
>>
>>> http://funwithfractals.atspace.cc/ct_cipher
>>
>>> Online version of it:
>>
>>> Online experiment:
>>
>>> http://fractallife247.com/test/hmac_cipher/ver_0_0_0_1
>>
>>> First of all never use this cipher simply because it has not been
>>> properly peer reviewed yet! If interested, experiment with it, never
>>> use it until it has been deemed worth to protect a pet's life, your
>>> Mom's life, your own life, ect.
>>
>>
>> When I looked into this a while back, I came to the conclusion that
>> incorporating something like SHA256, SHA512, DES, AES, ... encryption
>> stuff suits an attached processor a lot better than putting it in ISA
>> directly.
>>
>> Why: It is fundamentally difficult to chop up the units of work
>> to fit in GPRs, and if you run the data through the GPRs (or any CPU
>> register) you open up holes in your security blanket that
>> are never open in the attached processor implementation. Perf
>> will be better in the attached processor version unless the
>> width of the en/decryption is small.
>
> I disagree, specifically because these algorithms are used a lot on
> short inputs: For a bulk process an attached coprocessor is an excellent
> idea, but when you just want to verify the hash of a very short message,
> or encrypt a single packet, you do want this to be very close to the cpu.
>
> Terje

An issue I see is in thread switching. You don't want a user process
to be able to block the OS thread switching for an arbitrary time
while it syncs with this coprocessor.

It needs a coprocessor which is fully asynchronous for
bulk jobs from multiple processes and threads in the background,
handles high-priority communication packets from drivers,
and, like the x87, is available as a semi-asynchronous resource to the
current thread on zero notice but for limited-size jobs.

Or make the coprocessor jobs interruptible.

Or maybe like a barrel processor where the OS can allocate
as many tasks as it wants, and assign one to each user thread plus
some to itself for high priority comms and low priority background.
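
The "interruptible coprocessor job" idea can be illustrated in software: a hash job whose state survives between bounded slices of work, so a scheduler could switch away after any slice. This is only a model built on `hashlib`'s incremental interface; the class name, slice size, and scheduling loop are invented for the example and say nothing about how real hardware would save its state.

```python
import hashlib


class SlicedHashJob:
    """A SHA-256 job that advances at most `slice_bytes` per step."""

    def __init__(self, data: bytes, slice_bytes: int = 4096):
        self._data = memoryview(data)
        self._offset = 0
        self._slice = slice_bytes
        self._state = hashlib.sha256()  # resumable state between slices

    @property
    def done(self) -> bool:
        return self._offset >= len(self._data)

    def step(self) -> None:
        """Do one bounded unit of work, then return to the caller."""
        chunk = self._data[self._offset:self._offset + self._slice]
        self._state.update(chunk)
        self._offset += len(chunk)

    def digest(self) -> bytes:
        if not self.done:
            raise RuntimeError("job has not finished")
        return self._state.digest()


# Two jobs share the "unit" round-robin; neither can hold it for long.
bulk = SlicedHashJob(b"B" * 100_000)
short = SlicedHashJob(b"short packet")

pending = [bulk, short]
while pending:
    job = pending.pop(0)
    job.step()
    if not job.done:
        pending.append(job)

print(short.digest() == hashlib.sha256(b"short packet").digest())
```

The short job finishes after a single slice even though the bulk job was queued first, which is the behaviour the thread-switching concern asks for.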

Re: hmac hardware acceleration...

<Ry2fN.54371$7sbb.893@fx16.iad>

https://news.novabbs.org/devel/article-flat.php?id=35767&group=comp.arch#35767
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!eternal-september.org!tncsrv06.tnetconsulting.net!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx16.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: hmac hardware acceleration...
Newsgroups: comp.arch
References: <ul8qlh$3htui$1@dont-email.me> <aecc5efbb8795bb2ff4b54a2dcf79f1d@news.novabbs.com> <ulh5mg$1rhh9$1@dont-email.me> <bb_eN.21874$xHn7.20939@fx14.iad>
Lines: 66
Message-ID: <Ry2fN.54371$7sbb.893@fx16.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Fri, 15 Dec 2023 20:21:05 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Fri, 15 Dec 2023 20:21:05 GMT
X-Received-Bytes: 3509
 by: Scott Lurndal - Fri, 15 Dec 2023 20:21 UTC

EricP <ThatWouldBeTelling@thevillage.com> writes:
>Terje Mathisen wrote:
>> MitchAlsup wrote:
>>> Chris M. Thomasson wrote:
>>>
>>>> Humm... I am wondering if hardware based HMAC could possibly help out
>>>> one of my encryption experiments, for fun... A hyper crude little
>>>> write up, has some crude Python 3 code in there. It's not all that
>>>> fast, yikes!
>>>
>>>> http://funwithfractals.atspace.cc/ct_cipher
>>>
>>>> Online version of it:
>>>
>>>> Online experiment:
>>>
>>>> http://fractallife247.com/test/hmac_cipher/ver_0_0_0_1
>>>
>>>> First of all never use this cipher simply because it has not been
>>>> properly peer reviewed yet! If interested, experiment with it, never
>>>> use it until it has been deemed worth to protect a pet's life, your
>>>> Mom's life, your own life, ect.
>>>
>>>
>>> When I looked into this a while back, I came to the conclusion that
>>> incorporating something like SHA256, SHA512, DES, AES, ... encryption
>>> stuff suits an attached processor a lot better than putting it in ISA
>>> directly.
>>>
>>> Why: It is fundamentally difficult to chop up the units of work
>>> to fit in GPRs, and if you run the data through the GPRs (or any CPU
>>> register) you open up holes in your security blanket that
>>> are never open in the attached processor implementation. Perf
>>> will be better in the attached processor version unless the
>>> width of the en/decryption is small.
>>
>> I disagree, specifically because these algorithms are used a lot on
>> short inputs: For a bulk process an attached coprocessor is an excellent
>> idea, but when you just want to verify the hash of a very short message,
>> or encrypt a single packet, you do want this to be very close to the cpu.
>>
>> Terje
>
>An issue I see is in thread switching. You don't want a user process
>to be able to block the OS thread switching for an arbitrary time
>while it syncs with this coprocessor.
>
>It needs a coprocessor which is both fully asynchronous for
>bulk jobs from multiple processes and threads in the background,
>high priority communication packets from drivers,
>and like the x87 available as a semi-asynchronous resource to the
>current thread on zero notice but for limited size jobs.

Our coprocessors are 'virtualized', such that they provide
a physical function and a number of virtual functions; a
virtual function can be assigned (mapped into the
address space directly) to a process and
it can directly access the coprocessor from user mode.

There are no worries about the host scheduling threads
in the process - the process owns the virtual function.

(See PCI Express single-root I/O virtualization (SR-IOV), which
is the model used for standard OS compatibility.)

This model is used in DPDK and ODP, for example.
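
As a rough mental model of what is described above (and not a description of any particular product), each virtual function can be pictured as a private request ring that its owning process fills without entering the kernel, while the device services all rings. The class and method names below are invented, and the "device" is just a Python loop computing HMAC-SHA256.

```python
import hashlib
import hmac
from collections import deque


class VirtualFunction:
    """Per-process request/completion queues (stand-in for mapped device memory)."""

    def __init__(self, owner: str):
        self.owner = owner
        self.requests = deque()
        self.completions = deque()

    def submit(self, key: bytes, message: bytes) -> None:
        # In the real model this would be a store to mapped memory, not a syscall.
        self.requests.append((key, message))


class Coprocessor:
    """Physical function: owns the engine and services every VF's ring."""

    def __init__(self):
        self.vfs = []

    def create_vf(self, owner: str) -> VirtualFunction:
        vf = VirtualFunction(owner)
        self.vfs.append(vf)
        return vf

    def service_once(self) -> int:
        """Take at most one request from each VF; return how many were served."""
        served = 0
        for vf in self.vfs:
            if vf.requests:
                key, message = vf.requests.popleft()
                tag = hmac.new(key, message, hashlib.sha256).digest()
                vf.completions.append(tag)
                served += 1
        return served


device = Coprocessor()
vf_a = device.create_vf("process A")
vf_b = device.create_vf("process B")

vf_a.submit(b"key-a", b"packet 1")
vf_a.submit(b"key-a", b"packet 2")
vf_b.submit(b"key-b", b"packet 3")

# Keep servicing until every ring is empty.
while device.service_once():
    pass

print(len(vf_a.completions), len(vf_b.completions))
```

Because each process only ever touches its own queues, the host scheduler does not need to know or care what is in flight when it switches threads.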

Re: hmac hardware acceleration...

<ulk4is$ur3h$1@newsreader4.netcologne.de>

https://news.novabbs.org/devel/article-flat.php?id=35782&group=comp.arch#35782
Path: i2pn2.org!i2pn.org!news.swapon.de!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-dd23-0-3405-29ed-c929-c26d.ipv6dyn.netcologne.de!not-for-mail
From: tkoenig@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: hmac hardware acceleration...
Date: Sat, 16 Dec 2023 12:18:36 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <ulk4is$ur3h$1@newsreader4.netcologne.de>
References: <ul8qlh$3htui$1@dont-email.me>
<aecc5efbb8795bb2ff4b54a2dcf79f1d@news.novabbs.com>
<ulh5mg$1rhh9$1@dont-email.me>
Injection-Date: Sat, 16 Dec 2023 12:18:36 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-dd23-0-3405-29ed-c929-c26d.ipv6dyn.netcologne.de:2001:4dd7:dd23:0:3405:29ed:c929:c26d";
logging-data="1010801"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Sat, 16 Dec 2023 12:18 UTC

Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
> MitchAlsup wrote:

>> When I looked into this a while back, I came to the conclusion that
>> incorporating something like SHA256, SHA512, DES, AES, ... encryption
>> stuff suits an attached processor a lot better than putting it in ISA
>> directly.

That is the solution that IBM Z is using.

>> Why: It is fundamentally difficult to chop up the units of work
>> to fit in GPRs, and if you run the data through the GPRs (or any CPU
>> register) you open up holes in your security blanket that
>> are never open in the attached processor implementation. Perf
>> will be better in the attached processor version unless the
>> width of the en/decryption is small.
>
> I disagree, specifically because these algorithms are used a lot on
> short inputs: For a bulk process an attached coprocessor is an excellent
> idea, but when you just want to verify the hash of a very short message,
> or encrypt a single packet, you do want this to be very close to the cpu.

And that is what Power does with its vcipher and vcipherlast
instructions, which do a single round of AES.

POWER9 has six cycles of latency and at most one operation per cycle,
Power10 between four and seven cycles, but four in parallel (I
guess they invested some of their silicon there).

AES operates on blocks of 128 bits, so 128-bit registers are quite
natural there. For My 66000, this would require either register
pairs or a variant of Carry, so this is probably not an easy fit.
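
A back-of-the-envelope reading of those figures: AES-128 applies 10 rounds per 128-bit block, so one block issued alone is bound by latency, while several independent blocks interleaved can approach the issue rate. The sketch assumes one round instruction per round, ignores the initial key XOR and all loads and stores, takes the POWER9 rate as one operation per cycle, and treats Power10's "four in parallel" as four issues per cycle at the best-case four-cycle latency. Those readings are assumptions, not vendor data.

```python
import math

AES128_ROUNDS = 10  # rounds per block for a 128-bit key


def cycles_single_block(latency: int) -> int:
    """One block alone: each round waits for the previous round's result."""
    return AES128_ROUNDS * latency


def cycles_per_block_interleaved(latency: int, issue_per_cycle: int, blocks: int) -> float:
    """Average cycles per block when `blocks` independent blocks are interleaved."""
    total_ops = AES128_ROUNDS * blocks
    issue_bound = math.ceil(total_ops / issue_per_cycle)
    latency_bound = AES128_ROUNDS * latency
    return max(issue_bound, latency_bound) / blocks


for name, latency, issue in (("POWER9", 6, 1), ("Power10", 4, 4)):
    # Enough independent blocks to cover the latency at the given issue rate.
    blocks = latency * issue
    print(f"{name}: single block {cycles_single_block(latency)} cycles, "
          f"{blocks} interleaved blocks "
          f"{cycles_per_block_interleaved(latency, issue, blocks):.2f} cycles/block")
```

The gap between the single-block and interleaved figures is why modes that allow independent blocks (such as counter mode) benefit far more from these instructions than chained modes do.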

Re: hmac hardware acceleration...

<8gkfN.51800$Wp_8.38957@fx17.iad>

https://news.novabbs.org/devel/article-flat.php?id=35791&group=comp.arch#35791
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx17.iad.POSTED!not-for-mail
From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: hmac hardware acceleration...
References: <ul8qlh$3htui$1@dont-email.me> <aecc5efbb8795bb2ff4b54a2dcf79f1d@news.novabbs.com> <ulh5mg$1rhh9$1@dont-email.me> <bb_eN.21874$xHn7.20939@fx14.iad> <Ry2fN.54371$7sbb.893@fx16.iad>
In-Reply-To: <Ry2fN.54371$7sbb.893@fx16.iad>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 72
Message-ID: <8gkfN.51800$Wp_8.38957@fx17.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Sat, 16 Dec 2023 16:29:56 UTC
Date: Sat, 16 Dec 2023 11:28:37 -0500
X-Received-Bytes: 4243
 by: EricP - Sat, 16 Dec 2023 16:28 UTC

Scott Lurndal wrote:
> EricP <ThatWouldBeTelling@thevillage.com> writes:
>> Terje Mathisen wrote:
>>> MitchAlsup wrote:
>>>>
>>>> When I looked into this a while back, I came to the conclusion that
>>>> incorporating something like SHA256, SHA512, DES, AES, ... encryption
>>>> stuff suits an attached processor a lot better than putting it in ISA
>>>> directly.
>>>>
>>>> Why: It is fundamentally difficult to chop up the units of work
>>>> to fit in GPRs, and if you run the data through the GPRs (or any CPU
>>>> register) you open up holes in your security blanket that
>>>> are never open in the attached processor implementation. Perf
>>>> will be better in the attached processor version unless the
>>>> width of the en/decryption is small.
>>> I disagree, specifically because these algorithms are used a lot on
>>> short inputs: For a bulk process an attached coprocessor is an excellent
>>> idea, but when you just want to verify the hash of a very short message,
>>> or encrypt a single packet, you do want this to be very close to the cpu.
>>>
>>> Terje
>> An issue I see is in thread switching. You don't want a user process
>> to be able to block the OS thread switching for an arbitrary time
>> while it syncs with this coprocessor.
>>
>> It needs a coprocessor which is both fully asynchronous for
>> bulk jobs from multiple processes and threads in the background,
>> high priority communication packets from drivers,
>> and like the x87 available as a semi-asynchronous resource to the
>> current thread on zero notice but for limited size jobs.
>
> Our coprocessors are 'virtualized', such that they provide
> a physical function and a number of virtual functions; a
> virtual function can be assigned (mapped into the
> address space directly) to a process and
> it can directly access the coprocessor from user mode.
>
> There are no worries about the host scheduling threads
> in the process - the process owns the virtual function.
>
> (see PCI express single-root I/O virtualization (SR-IOV) which
> is the model used for standard OS compatability).
>
> This model is used in DPDK and ODP, for example.

Unfortunately the PCIe specs are all paywalled so I can't get the
real poop on it. Linux doesn't seem to have any documentation on it.
Microsoft only has the Windows driver development guides which
I've had a look at.

Presenting the coprocessor as a Virtual Function (VF) could work but,
from the limited info I have seen, using a VF does seem to be limited
because the SR-IOV device only exports a fixed number of VFs
(e.g. 16, 32, 64), as it is the device that maps from the
Physical Function (PF) to the VF. As SR-IOV was mostly intended for
optimizing paravirtualized network cards, in that context it is a
reasonable limitation. However, it would not be suitable for a
coprocessor to say "access denied, all are in use".
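
The contrast being drawn is between a device that may refuse at open time and a coprocessor that must always be there. A fixed pool of VFs behaves like the toy allocator below: once the quota is used up, the next request gets a refusal. The pool size of 16 is one of the example values above; the class and exception names are invented.

```python
class NoVirtualFunctionAvailable(Exception):
    """Raised when every VF in the fixed pool is already assigned."""


class VfPool:
    def __init__(self, num_vfs: int):
        self._free = list(range(num_vfs))
        self._owner_of = {}

    def acquire(self, owner: str) -> int:
        if not self._free:
            raise NoVirtualFunctionAvailable(f"no VF left for {owner}")
        vf_id = self._free.pop(0)
        self._owner_of[vf_id] = owner
        return vf_id

    def release(self, vf_id: int) -> None:
        del self._owner_of[vf_id]
        self._free.append(vf_id)


pool = VfPool(num_vfs=16)
granted = [pool.acquire(f"thread-{i}") for i in range(16)]

try:
    pool.acquire("thread-16")
    refused = False
except NoVirtualFunctionAvailable:
    refused = True  # acceptable when opening a NIC, awkward for a coprocessor

print(len(granted), refused)
```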

I also could not find out how Windows delivers virtual interrupts
signaling IO completion for SR-IOV devices. Assuming it would use
something called an APC, similar to a *nix signal, that would be
an expensive way to be notified of coprocessor completion.
Again, as SR-IOV was intended for IO virtualization, in that
context the overhead is reasonable.

Otherwise one would have to use SR-IOV polling in a spin loop to
detect completion, whereas a coprocessor like the x87 has the FWAIT
instruction to halt the processor until completion.
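
The two completion styles can be mimicked in user-space Python: spinning on a flag versus blocking until signalled. A worker thread stands in for the coprocessor; none of this touches real SR-IOV hardware, and `threading.Event.wait` is only an analogy for a FWAIT-style stall, not an equivalent.

```python
import hashlib
import hmac
import threading


def run_job(key: bytes, message: bytes, result: dict, done: threading.Event) -> None:
    """Stand-in for the coprocessor: compute the tag, then signal completion."""
    result["tag"] = hmac.new(key, message, hashlib.sha256).digest()
    done.set()


def wait_by_polling(key: bytes, message: bytes):
    """Return (tag, number of polls) after spinning on the completion flag."""
    result, done = {}, threading.Event()
    worker = threading.Thread(target=run_job, args=(key, message, result, done))
    worker.start()
    polls = 0
    while not done.is_set():  # spin loop: burns CPU while waiting
        polls += 1
    worker.join()
    return result["tag"], polls


def wait_by_blocking(key: bytes, message: bytes) -> bytes:
    """Return the tag after sleeping until the worker signals completion."""
    result, done = {}, threading.Event()
    worker = threading.Thread(target=run_job, args=(key, message, result, done))
    worker.start()
    done.wait()  # caller sleeps until the worker signals
    worker.join()
    return result["tag"]


# Illustrative inputs only.
key, message = b"example key", b"x" * 1_000_000
polled_tag, poll_count = wait_by_polling(key, message)
blocked_tag = wait_by_blocking(key, message)

print(polled_tag == blocked_tag, "polls:", poll_count)
```

The poll count varies from run to run and machine to machine; the point is only that every poll is work the waiting thread did instead of something useful.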

Re: hmac hardware acceleration...

<f1ea40112658daba86c6bdffc1950aef@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=35800&group=comp.arch#35800
Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: hmac hardware acceleration...
Date: Sat, 16 Dec 2023 21:46:01 +0000
Organization: novaBBS
Message-ID: <f1ea40112658daba86c6bdffc1950aef@news.novabbs.com>
References: <ul8qlh$3htui$1@dont-email.me> <aecc5efbb8795bb2ff4b54a2dcf79f1d@news.novabbs.com> <ulh5mg$1rhh9$1@dont-email.me> <bb_eN.21874$xHn7.20939@fx14.iad> <Ry2fN.54371$7sbb.893@fx16.iad> <8gkfN.51800$Wp_8.38957@fx17.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="150663"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Rslight-Site: $2y$10$V.n64VKefYvN73eFB/JLpeHSTHRL9dH2m554LpCFdHYQoc82f.mci
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
 by: MitchAlsup - Sat, 16 Dec 2023 21:46 UTC

EricP wrote:

> Scott Lurndal wrote:
>> EricP <ThatWouldBeTelling@thevillage.com> writes:
>>> Terje Mathisen wrote:
>>>> MitchAlsup wrote:
>>>>>
>>>>> When I looked into this a while back, I came to the conclusion that
>>>>> incorporating something like SHA256, SHA512, DES, AES, ... encryption
>>>>> stuff suits an attached processor a lot better than putting it in ISA
>>>>> directly.
>>>>>
>>>>> Why: It is fundamentally difficult to chop up the units of work
>>>>> to fit in GPRs, and if you run the data through the GPRs (or any CPU
>>>>> register) you open up holes in your security blanket that
>>>>> are never open in the attached processor implementation. Perf
>>>>> will be better in the attached processor version unless the
>>>>> width of the en/decryption is small.
>>>> I disagree, specifically because these algorithms are used a lot on
>>>> short inputs: For a bulk process an attached coprocessor is an excellent
>>>> idea, but when you just want to verify the hash of a very short message,
>>>> or encrypt a single packet, you do want this to be very close to the cpu.
>>>>
>>>> Terje
>>> An issue I see is in thread switching. You don't want a user process
>>> to be able to block the OS thread switching for an arbitrary time
>>> while it syncs with this coprocessor.
>>>
>>> It needs a coprocessor which is both fully asynchronous for
>>> bulk jobs from multiple processes and threads in the background,
>>> high priority communication packets from drivers,
>>> and like the x87 available as a semi-asynchronous resource to the
>>> current thread on zero notice but for limited size jobs.
>>
>> Our coprocessors are 'virtualized', such that they provide
>> a physical function and a number of virtual functions; a
>> virtual function can be assigned (mapped into the
>> address space directly) to a process and
>> it can directly access the coprocessor from user mode.
>>
>> There are no worries about the host scheduling threads
>> in the process - the process owns the virtual function.
>>
>> (see PCI express single-root I/O virtualization (SR-IOV) which
>> is the model used for standard OS compatability).
>>
>> This model is used in DPDK and ODP, for example.

> Unfortunately the PCIe specs are all paywalled so I can't get the
> real poop on it. Linux doesn't seem to have any documentation on it.
> Microsoft only has the Windows driver development guides which
> I've had a look at.

> Presenting the coprocessor as a Virtual Function (VF) could work but,
> from the limited info I have seen, using a VF does seem to be limited
> because the SR-IOV device only export a fixed number of VF's,
> (eg 16, 32, 64) as it is the device that maps from the
> Physical Function (PF) to the VF. As SR-IOV was mostly intended for
> optimizing paravirtualized network cards, in that context it is a
> reasonable limitation. However this would not be suitable for a
> coprocessor to say "access denied, all are in use".

Each Guest OS gets one VF.

> I also could not find out how Windows delivers virtual interrupts
> signaling IO completion for SR-IOV devices. Assuming it would use
> something call an APC's, similar to a *nix signal, that would be
> an expensive way to be notified of coprocessor completion.

A sufficiently privileged interrupt dispatching thread receives control.
It examines the pending interrupts and dispatches the interrupt handler.
The interrupt handler then services the interrupt and queues DPCs/softIRQs
for the cleanup activities.
A stack of DPCs/softIRQs wanders through the cleanup work and finally
schedules the user thread (synch) or sends the user thread a signal (asynch).
The scheduler receives control and sooner or later delivers control back
to the user.

> Again as SR-IOV was intended for IO virtualization so in that
> context that overhead is reasonable.

> Otherwise one would have to use SR-IOV polling in a spin loop to
> detect completion, whereas a coprocessor like the x87 has the FWAIT
> instruction to halt the processor until completion.

Re: hmac hardware acceleration...

<v7GfN.23137$LONb.17592@fx08.iad>

https://news.novabbs.org/devel/article-flat.php?id=35833&group=comp.arch#35833
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!eternal-september.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!feeder.usenetexpress.com!tr2.iad1.usenetexpress.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx08.iad.POSTED!not-for-mail
From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: hmac hardware acceleration...
References: <ul8qlh$3htui$1@dont-email.me> <aecc5efbb8795bb2ff4b54a2dcf79f1d@news.novabbs.com> <ulh5mg$1rhh9$1@dont-email.me> <bb_eN.21874$xHn7.20939@fx14.iad> <Ry2fN.54371$7sbb.893@fx16.iad> <8gkfN.51800$Wp_8.38957@fx17.iad> <f1ea40112658daba86c6bdffc1950aef@news.novabbs.com>
In-Reply-To: <f1ea40112658daba86c6bdffc1950aef@news.novabbs.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 119
Message-ID: <v7GfN.23137$LONb.17592@fx08.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Sun, 17 Dec 2023 17:22:35 UTC
Date: Sun, 17 Dec 2023 12:21:57 -0500
X-Received-Bytes: 6540
 by: EricP - Sun, 17 Dec 2023 17:21 UTC

MitchAlsup wrote:
> EricP wrote:
>
>> Scott Lurndal wrote:
>>> EricP <ThatWouldBeTelling@thevillage.com> writes:
>>>> Terje Mathisen wrote:
>>>>> MitchAlsup wrote:
>>>>>>
>>>>>> When I looked into this a while back, I came to the conclusion that
>>>>>> incorporating something like SHA256, SHA512, DES, AES, ... encryption
>>>>>> stuff suits an attached processor a lot better than putting it in
>>>>>> ISA directly.
>>>>>>
>>>>>> Why: It is fundamentally difficult to chop up the units of work
>>>>>> to fit in GPRs, and if you run the data through the GPRs (or any
>>>>>> CPU register) you open up holes in your security blanket that
>>>>>> are never open in the attached processor implementation. Perf
>>>>>> will be better in the attached processor version unless the
>>>>>> width of the en/decryption is small.
>>>>> I disagree, specifically because these algorithms are used a lot on
>>>>> short inputs: For a bulk process an attached coprocessor is an
>>>>> excellent idea, but when you just want to verify the hash of a very
>>>>> short message, or encrypt a single packet, you do want this to be
>>>>> very close to the cpu.
>>>>>
>>>>> Terje
>>>> An issue I see is in thread switching. You don't want a user process
>>>> to be able to block the OS thread switching for an arbitrary time
>>>> while it syncs with this coprocessor.
>>>>
>>>> It needs a coprocessor which is both fully asynchronous for
>>>> bulk jobs from multiple processes and threads in the background,
>>>> high priority communication packets from drivers,
>>>> and like the x87 available as a semi-asynchronous resource to the
>>>> current thread on zero notice but for limited size jobs.
>>>
>>> Our coprocessors are 'virtualized', such that they provide
>>> a physical function and a number of virtual functions; a
>>> virtual function can be assigned (mapped into the
>>> address space directly) to a process and
>>> it can directly access the coprocessor from user mode.
>>>
>>> There are no worries about the host scheduling threads
>>> in the process - the process owns the virtual function.
>>>
>>> (see PCI express single-root I/O virtualization (SR-IOV) which
>>> is the model used for standard OS compatability).
>>>
>>> This model is used in DPDK and ODP, for example.
>
>> Unfortunately the PCIe specs are all paywalled so I can't get the
>> real poop on it. Linux doesn't seem to have any documentation on it.
>> Microsoft only has the Windows driver development guides which
>> I've had a look at.
>
>> Presenting the coprocessor as a Virtual Function (VF) could work but,
>> from the limited info I have seen, using a VF does seem to be limited
>> because the SR-IOV device only export a fixed number of VF's,
>> (eg 16, 32, 64) as it is the device that maps from the
>> Physical Function (PF) to the VF. As SR-IOV was mostly intended for
>> optimizing paravirtualized network cards, in that context it is a
>> reasonable limitation. However this would not be suitable for a
>> coprocessor to say "access denied, all are in use".
>
> Each Guest OS gets one VF.

Right, that is the intended purpose for network cards in virtual machines
although the SR-IOV specs are generalized.

Then there was also this movement that wants to do "zero-copy" network
IO directly from user space IO buffers, which is a VF per process opening
the device.

Then the two combine and it becomes a VF per guest process opening the
device, per guest OS, and that fixed device quota of 16, 32, 64 VF's starts
looking a little sparse.

In either of these cases the IO device has to be opened so it would be ok
to return a status "denied, device not available" as that is already a
possible IO open status.

A coprocessor is intended to be implicitly and immediately available,
under OS control, to the current processor context, be it threads, OS
or drivers. That implies a huge quota of VF's for all threads plus
sundry other uses on all guest OSes just in case they want one.
And, just guessing at the device internals, it implies huge management
tables, CAMs instead of SRAMs, caches, blah, blah, etc.

>> I also could not find out how Windows delivers virtual interrupts
>> signaling IO completion for SR-IOV devices. Assuming it would use
>> something call an APC's, similar to a *nix signal, that would be
>> an expensive way to be notified of coprocessor completion.
>
> A sufficiently privileged interrupt dispatching thread receives control.
> It examines the pending interrupts and dispatches the interrupt handler.
> The interrupt handler then services the interrupt and DPCs/softIRQs cleanup
> activities.
> A stack of PDCs/softIRQs wander through the cleanup work and finally
> schedule the user thread (synch) or send user thread a signal (asynch)
> Scheduler receives control and sooner or later delivers control back
> to user.

I'm familiar with the OS mechanisms, it's the overhead I'm pointing out.
To do this the hypervisor has to dispatch a virtual interrupt to the
guest OS, which converts it to its local delivery mechanism,
on Windows DPC->UAPC, on *nix to softIrq->signal,
and delivers it to the guest thread on the guest OS.

The overhead of the async completion signal would likely be much greater
than the cost of the original coprocessor hash/encrypt.

>> Again as SR-IOV was intended for IO virtualization so in that
>> context that overhead is reasonable.
>
>> Otherwise one would have to use SR-IOV polling in a spin loop to
>> detect completion, whereas a coprocessor like the x87 has the FWAIT
>> instruction to halt the processor until completion.

Re: hmac hardware acceleration...

<80de00eff8dacaa6a481706d844dadc2@news.novabbs.com>


https://news.novabbs.org/devel/article-flat.php?id=35834&group=comp.arch#35834

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: hmac hardware acceleration...
Date: Sun, 17 Dec 2023 17:57:39 +0000
Organization: novaBBS
Message-ID: <80de00eff8dacaa6a481706d844dadc2@news.novabbs.com>
References: <ul8qlh$3htui$1@dont-email.me> <aecc5efbb8795bb2ff4b54a2dcf79f1d@news.novabbs.com> <ulh5mg$1rhh9$1@dont-email.me> <bb_eN.21874$xHn7.20939@fx14.iad> <Ry2fN.54371$7sbb.893@fx16.iad> <8gkfN.51800$Wp_8.38957@fx17.iad> <f1ea40112658daba86c6bdffc1950aef@news.novabbs.com> <v7GfN.23137$LONb.17592@fx08.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="236530"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Rslight-Site: $2y$10$w7DkgFslN5mifLnQUwP7jefFDue8RAyDwZ3ZYwQ1c7x1I8IGlg2oW
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
 by: MitchAlsup - Sun, 17 Dec 2023 17:57 UTC

EricP wrote:

> MitchAlsup wrote:
>> EricP wrote:
>>
>>> Presenting the coprocessor as a Virtual Function (VF) could work but,
>>> from the limited info I have seen, using a VF does seem to be limited
>>> because the SR-IOV device only export a fixed number of VF's,
>>> (eg 16, 32, 64) as it is the device that maps from the
>>> Physical Function (PF) to the VF. As SR-IOV was mostly intended for
>>> optimizing paravirtualized network cards, in that context it is a
>>> reasonable limitation. However this would not be suitable for a
>>> coprocessor to say "access denied, all are in use".
>>
>> Each Guest OS gets one VF.

> Right, that is the intended purpose for network cards in virtual machines
> although the SR-IOV specs are generalized.

> Then there is also was this movement that wants to do "zero-copy" network
> IO directly from user space IO buffers, which is a VF per process opening
> the device.

Why is this not an I/O MMU mapping? The kernel still does setup and
teardown, but the device does DMA directly into user (requestor) memory.

> Then the two combine and it becomes a VF per guest processes opening the
> device per guest OS and that fixed device quota of 16,32,64 VF's starts
> looking a little sparse.

Which is why direct user access to devices will never win.

> In either of these cases the IO device has to be opened so it would be ok
> to return a status "denied, device not available" as that is already a
> possible IO open status.

> A coprocessor is intended to be implicitly immediately available,
> under OS control, to the current processor context, be it threads, OS
> or drivers. That implies huge quota of VF's for all threads plus
> sundry other uses on all guest OS just in case they want one.
> And, just guessing at the device internals, implies huge management tables,
> CAMs instead of SRAMs, caches, blah, blah, etc.

Re: hmac hardware acceleration...

<PiIfN.35633$JLvf.21896@fx44.iad>


https://news.novabbs.org/devel/article-flat.php?id=35844&group=comp.arch#35844

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!panix!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx44.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: hmac hardware acceleration...
Newsgroups: comp.arch
References: <ul8qlh$3htui$1@dont-email.me> <aecc5efbb8795bb2ff4b54a2dcf79f1d@news.novabbs.com> <ulh5mg$1rhh9$1@dont-email.me> <bb_eN.21874$xHn7.20939@fx14.iad> <Ry2fN.54371$7sbb.893@fx16.iad> <8gkfN.51800$Wp_8.38957@fx17.iad> <f1ea40112658daba86c6bdffc1950aef@news.novabbs.com> <v7GfN.23137$LONb.17592@fx08.iad>
Lines: 125
Message-ID: <PiIfN.35633$JLvf.21896@fx44.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Sun, 17 Dec 2023 19:51:11 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Sun, 17 Dec 2023 19:51:11 GMT
X-Received-Bytes: 6699
 by: Scott Lurndal - Sun, 17 Dec 2023 19:51 UTC

EricP <ThatWouldBeTelling@thevillage.com> writes:
>MitchAlsup wrote:
>> EricP wrote:
>>
>>> Scott Lurndal wrote:
>>>> EricP <ThatWouldBeTelling@thevillage.com> writes:
>>>>> Terje Mathisen wrote:
>>>>>> MitchAlsup wrote:
>>>>>>>
>>>>>>> When I looked into this a while back, I came to the conclusion that
>>>>>>> incorporating something like SHA256, SHA512, DES, AES, ... encryption
>>>>>>> stuff suits an attached processor a lot better than putting it in
>>>>>>> ISA directly.
>>>>>>>
>>>>>>> Why: It is fundamentally difficult to chop up the units of work
>>>>>>> to fit in GPRs, and if you run the data through the GPRs (or any
>>>>>>> CPU register) you open up holes in your security blanket that
>>>>>>> are never open in the attached processor implementation. Perf
>>>>>>> will be better in the attached processor version unless the
>>>>>>> width of the en/decryption is small.
>>>>>> I disagree, specifically because these algorithms are used a lot on
>>>>>> short inputs: For a bulk process an attached coprocessor is an
>>>>>> excellent idea, but when you just want to verify the hash of a very
>>>>>> short message, or encrypt a single packet, you do want this to be
>>>>>> very close to the cpu.
>>>>>>
>>>>>> Terje
>>>>> An issue I see is in thread switching. You don't want a user process
>>>>> to be able to block the OS thread switching for an arbitrary time
>>>>> while it syncs with this coprocessor.
>>>>>
>>>>> It needs a coprocessor which is both fully asynchronous for
>>>>> bulk jobs from multiple processes and threads in the background,
>>>>> high priority communication packets from drivers,
>>>>> and like the x87 available as a semi-asynchronous resource to the
>>>>> current thread on zero notice but for limited size jobs.
>>>>
>>>> Our coprocessors are 'virtualized', such that they provide
>>>> a physical function and a number of virtual functions; a
>>>> virtual function can be assigned (mapped into the
>>>> address space directly) to a process and
>>>> it can directly access the coprocessor from user mode.
>>>>
>>>> There are no worries about the host scheduling threads
>>>> in the process - the process owns the virtual function.
>>>>
>>>> (see PCI express single-root I/O virtualization (SR-IOV) which
>>>> is the model used for standard OS compatability).
>>>>
>>>> This model is used in DPDK and ODP, for example.
>>
>>> Unfortunately the PCIe specs are all paywalled so I can't get the
>>> real poop on it. Linux doesn't seem to have any documentation on it.
>>> Microsoft only has the Windows driver development guides which
>>> I've had a look at.
>>
>>> Presenting the coprocessor as a Virtual Function (VF) could work but,
>>> from the limited info I have seen, using a VF does seem to be limited
>>> because the SR-IOV device only export a fixed number of VF's,
>>> (eg 16, 32, 64) as it is the device that maps from the
>>> Physical Function (PF) to the VF. As SR-IOV was mostly intended for
>>> optimizing paravirtualized network cards, in that context it is a
>>> reasonable limitation. However this would not be suitable for a
>>> coprocessor to say "access denied, all are in use".
>>
>> Each Guest OS gets one VF.
>
>Right, that is the intended purpose for network cards in virtual machines
>although the SR-IOV specs are generalized.

Indeed, and the number of VF's is limited by the PCIe specification to
65535 with one PF.

The device is dividing its resources amongst the VF's, so the maximum number
of VF's is controlled by the amount of resources available on the device
and the implementation of the logic on the device.

The number of VF's exposed to the host is controlled by the host driver
(up to the max supported by the device) via stores to the device
configuration space SR-IOV capability.
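
On Linux, that store is normally reached through two sysfs attributes on
the PF, sriov_totalvfs (the device maximum) and sriov_numvfs (the count
actually enabled). A minimal sketch of the clamp-and-write step, assuming
those attributes; the helper name enable_vfs and the fake device directory
are made up for illustration, and on real hardware this needs root and a
PF driver that supports SR-IOV:

```python
import os
import tempfile


def enable_vfs(dev_dir, wanted):
    """Clamp `wanted` to the device maximum and write it to sriov_numvfs."""
    with open(os.path.join(dev_dir, "sriov_totalvfs")) as f:
        total = int(f.read().strip())
    count = max(0, min(wanted, total))
    # As far as I know the kernel refuses to go from one nonzero VF count
    # straight to another, so reset to 0 before writing the new count.
    numvfs = os.path.join(dev_dir, "sriov_numvfs")
    for value in (0, count):
        with open(numvfs, "w") as f:
            f.write(str(value))
    return count


# Dry run against a fake device directory: no hardware or root required.
with tempfile.TemporaryDirectory() as fake_dev:
    with open(os.path.join(fake_dev, "sriov_totalvfs"), "w") as f:
        f.write("64\n")
    granted = enable_vfs(fake_dev, 128)   # ask for more than the device has
    with open(os.path.join(fake_dev, "sriov_numvfs")) as f:
        written = f.read()

print(granted, written)  # 64 64
```

The point of the clamp is the same as in the text above: the request is
capped by what the device implements, not by the 65535 the spec allows.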

>A coprocessor is intended to be implicitly immediately available,
>under OS control, to the current processor context, be it threads, OS
>or drivers. That implies huge quota of VF's for all threads plus
>sundry other uses on all guest OS just in case they want one.

That assumes that the coprocessor will be used by all processes,
which, aside from legacy coprocessors like FPUs, is not generally the case
(even then, most applications didn't actually use floating point and there
are hooks in most major operating systems to detect whether an application
uses floating point so they don't need to save the FPRs over context
switches).

>And, just guessing at the device internals, implies huge management tables,
>CAMs instead of SRAMs, caches, blah, blah, etc.

Certainly in many cases, CAMs are quite useful, particularly on
networking hardware that performs hardware packet classification
based on header fields.

>The overhead of the async completion signal would likely be much greater
>that the cost of the original coprocessor hash/encrypt.

That again, depends on the coprocessor. If the amount of work
that is offloaded isn't large enough to subsume the slight extra
cost for the virtio interrupt (particularly on cpus where the
interrupt overhead is low - e.g. ARMv8), you probably should
couple the coprocessor closer to the CPU, much like ARM Neoverse
cores where the RND instruction interacts with an off-cpu random
number generator (via MMIO).
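
The break-even argument can be put as simple arithmetic. A rough sketch,
assuming a linear cost model (a per-byte cost on each side plus one fixed
submit/interrupt cost for the offload); the function names and every number
below are hypothetical, not measurements of any real part:

```python
def offload_wins(nbytes, cpu_ns_per_byte, dev_ns_per_byte, fixed_ns):
    """True when handing the job to the coprocessor beats doing it on the CPU."""
    cpu_time = nbytes * cpu_ns_per_byte
    dev_time = fixed_ns + nbytes * dev_ns_per_byte
    return dev_time < cpu_time


def break_even_bytes(cpu_ns_per_byte, dev_ns_per_byte, fixed_ns):
    """Smallest job size in bytes for which offload_wins() is true, or None."""
    gain = cpu_ns_per_byte - dev_ns_per_byte
    if gain <= 0:
        return None  # the device is no faster per byte, so it never catches up
    return int(fixed_ns // gain) + 1


# Made-up figures: 2 ns/byte in software, 1 ns/byte on the device,
# 3000 ns of fixed submit + completion-interrupt overhead.
threshold = break_even_bytes(2, 1, 3000)
print(threshold, offload_wins(64, 2, 1, 3000), offload_wins(65536, 2, 1, 3000))
```

With a model like this, a 64-byte hash stays on the CPU and a 64 KB bulk job
goes to the device, which is the split both sides of the thread describe.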

Here's what our chips look like to the kernel/software:

https://doc.dpdk.org/guides-20.05/platform/octeontx2.html

Packet comes in, hardware allocates packet storage from the
NPA (network pool allocator) hardware block. Passes to
NCPC for classification (big CAMs), queues to scheduler,
scheduler may or may not interact with a processor or
one of the many blocks that can be added to the processing
flow for a packet (crypto for IPsec, compression, etc) before
queuing the packet for egress (where shaping occurs) on
a network port.

Re: hmac hardware acceleration...

<2kIfN.35637$JLvf.7856@fx44.iad>


https://news.novabbs.org/devel/article-flat.php?id=35845&group=comp.arch#35845

Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx44.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: hmac hardware acceleration...
Newsgroups: comp.arch
References: <ul8qlh$3htui$1@dont-email.me> <aecc5efbb8795bb2ff4b54a2dcf79f1d@news.novabbs.com> <ulh5mg$1rhh9$1@dont-email.me> <bb_eN.21874$xHn7.20939@fx14.iad> <Ry2fN.54371$7sbb.893@fx16.iad> <8gkfN.51800$Wp_8.38957@fx17.iad> <f1ea40112658daba86c6bdffc1950aef@news.novabbs.com> <v7GfN.23137$LONb.17592@fx08.iad> <80de00eff8dacaa6a481706d844dadc2@news.novabbs.com>
Lines: 38
Message-ID: <2kIfN.35637$JLvf.7856@fx44.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Sun, 17 Dec 2023 19:52:30 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Sun, 17 Dec 2023 19:52:30 GMT
X-Received-Bytes: 2366
 by: Scott Lurndal - Sun, 17 Dec 2023 19:52 UTC

mitchalsup@aol.com (MitchAlsup) writes:
>EricP wrote:
>
>> MitchAlsup wrote:
>>> EricP wrote:
>>>
>>>> Presenting the coprocessor as a Virtual Function (VF) could work but,
>>>> from the limited info I have seen, using a VF does seem to be limited
>>>> because the SR-IOV device only export a fixed number of VF's,
>>>> (eg 16, 32, 64) as it is the device that maps from the
>>>> Physical Function (PF) to the VF. As SR-IOV was mostly intended for
>>>> optimizing paravirtualized network cards, in that context it is a
>>>> reasonable limitation. However this would not be suitable for a
>>>> coprocessor to say "access denied, all are in use".
>>>
>>> Each Guest OS gets one VF.
>
>> Right, that is the intended purpose for network cards in virtual machines
>> although the SR-IOV specs are generalized.
>
>> Then there is also was this movement that wants to do "zero-copy" network
>> IO directly from user space IO buffers, which is a VF per process opening
>> the device.
>
>Why is this not an I/O MMU mapping. Kernel still does setup and teardown
>but device does DMA directly into user (requestor) memory.

It is an IOMMU mapping.

>
>> Then the two combine and it becomes a VF per guest processes opening the
>> device per guest OS and that fixed device quota of 16,32,64 VF's starts
>> looking a little sparse.
>
>Which is why direct user access to devices will never win.

Sorry, they already have for the use cases where it makes sense.

Re: hmac hardware acceleration...

<q4JfN.65313$Wp_8.29954@fx17.iad>


https://news.novabbs.org/devel/article-flat.php?id=35848&group=comp.arch#35848

Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!3.eu.feeder.erje.net!feeder.erje.net!newsreader4.netcologne.de!news.netcologne.de!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx17.iad.POSTED!not-for-mail
From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: hmac hardware acceleration...
References: <ul8qlh$3htui$1@dont-email.me> <aecc5efbb8795bb2ff4b54a2dcf79f1d@news.novabbs.com> <ulh5mg$1rhh9$1@dont-email.me> <bb_eN.21874$xHn7.20939@fx14.iad> <Ry2fN.54371$7sbb.893@fx16.iad> <8gkfN.51800$Wp_8.38957@fx17.iad> <f1ea40112658daba86c6bdffc1950aef@news.novabbs.com> <v7GfN.23137$LONb.17592@fx08.iad> <80de00eff8dacaa6a481706d844dadc2@news.novabbs.com>
In-Reply-To: <80de00eff8dacaa6a481706d844dadc2@news.novabbs.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 65
Message-ID: <q4JfN.65313$Wp_8.29954@fx17.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Sun, 17 Dec 2023 20:44:06 UTC
Date: Sun, 17 Dec 2023 15:43:50 -0500
X-Received-Bytes: 3994
 by: EricP - Sun, 17 Dec 2023 20:43 UTC

MitchAlsup wrote:
> EricP wrote:
>
>> MitchAlsup wrote:
>>> EricP wrote:
>>>
>>>> Presenting the coprocessor as a Virtual Function (VF) could work but,
>>>> from the limited info I have seen, using a VF does seem to be limited
>>>> because the SR-IOV device only export a fixed number of VF's,
>>>> (eg 16, 32, 64) as it is the device that maps from the
>>>> Physical Function (PF) to the VF. As SR-IOV was mostly intended for
>>>> optimizing paravirtualized network cards, in that context it is a
>>>> reasonable limitation. However this would not be suitable for a
>>>> coprocessor to say "access denied, all are in use".
>>>
>>> Each Guest OS gets one VF.
>
>> Right, that is the intended purpose for network cards in virtual machines
>> although the SR-IOV specs are generalized.
>
>> Then there is also was this movement that wants to do "zero-copy" network
>> IO directly from user space IO buffers, which is a VF per process opening
>> the device.
>
> Why is this not an I/O MMU mapping. Kernel still does setup and teardown
> but device does DMA directly into user (requestor) memory.

It is. My understanding from looking at the Windows Driver documents
was it would have to be allocated when the VF pseudo-device is opened
instead of for each individual IO. At FileOpen the OS would need to be
told one or more virtual buffers the pseudo-device will work within.

Leaving HV's out for the moment, the SR-IOV needs to pre-allocate and
prepare any buffer physical memory at the time of pseudo-device open.
Then when you write the pseudo-device control register referencing
a byte range in the virtual buffer, it can validate it and initiate
the IO without a trip through the OS.

At pseudo-device open it would check any pinning quotas, fault in the
buffer pages and pin them, and create a virtual buffer to physical
fragment map, and set up the IOMMU DMA registers (PTE's) which the
device HW uses later.

For networks this is slightly complicated because network cards want to
do lots of scatter-gather IO from many buffers of arbitrary byte size and
alignment, to assemble the TCP/IP packet headers, merge that with the app's
payload, and possibly add a packet trailer for the checksum (which the card
usually adds automatically).

The IO operation should consist of just writing the VF control register a
pointer to an operation, which points to a user space scatter-gather list
of byte buffers inside the pre-allocated and prepared memory areas.

The HV adds one more indirection layer to this because all those
pinned physical addresses and fragments above are actually guest OS
addresses which the HV converts to real physical fragments,
pins real physical frames and sets up real IOMMU maps for them.
Then when you write the VF register the HW card can assemble the
packet direct from the guest user virtual buffers as all the
guest OS and HV management work was done at FileOpen.
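
The split between open-time setup and per-IO validation can be sketched as
a small table of registered regions. This is only a model of the idea, not
any driver's actual interface: the class RegisteredBuffers, its method
names and the addresses are all hypothetical, and the pinning and IOMMU
programming are reduced to a comment.

```python
class RegisteredBuffers:
    """Regions pinned and mapped once, when the pseudo-device is opened."""

    def __init__(self):
        self._regions = []  # list of (base, length) virtual ranges

    def register(self, base, length):
        # Open-time work: quota check, page pinning and IOMMU setup go here.
        if length <= 0:
            raise ValueError("empty region")
        self._regions.append((base, length))

    def covers(self, addr, length):
        # A fragment is valid only if it lies wholly inside one region.
        return any(base <= addr and addr + length <= base + size
                   for base, size in self._regions)

    def submit(self, sg_list):
        """Per-IO fast path: validate every fragment, no trip through the OS."""
        for addr, length in sg_list:
            if length <= 0 or not self.covers(addr, length):
                return False  # reject the whole operation
        return True


bufs = RegisteredBuffers()
bufs.register(0x10000, 0x4000)                         # done once at open
ok = bufs.submit([(0x10000, 54), (0x12000, 1400)])     # header + payload
bad = bufs.submit([(0x13FF0, 64)])                     # runs past the region
print(ok, bad)
```

Everything expensive happens in register(); submit() is just range checks,
which is what lets the device start the IO without kernel involvement.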

Re: hmac hardware acceleration...

<ulnmj5$33r83$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=35849&group=comp.arch#35849

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: hmac hardware acceleration...
Date: Sun, 17 Dec 2023 12:44:20 -0800
Organization: A noiseless patient Spider
Lines: 34
Message-ID: <ulnmj5$33r83$1@dont-email.me>
References: <ul8qlh$3htui$1@dont-email.me>
<aecc5efbb8795bb2ff4b54a2dcf79f1d@news.novabbs.com>
<ulh5mg$1rhh9$1@dont-email.me> <bb_eN.21874$xHn7.20939@fx14.iad>
<Ry2fN.54371$7sbb.893@fx16.iad> <8gkfN.51800$Wp_8.38957@fx17.iad>
<f1ea40112658daba86c6bdffc1950aef@news.novabbs.com>
<v7GfN.23137$LONb.17592@fx08.iad>
<80de00eff8dacaa6a481706d844dadc2@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 17 Dec 2023 20:44:21 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="7dcea355c22f1533bc3a8f39b9e5a9f5";
logging-data="3271939"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX181II9Ekw6LBONDHW4QlifsWFH9LzOluh0="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:+NkXI7N8ANc8h4GxYCIiKad+8XQ=
Content-Language: en-US
In-Reply-To: <80de00eff8dacaa6a481706d844dadc2@news.novabbs.com>
 by: Chris M. Thomasson - Sun, 17 Dec 2023 20:44 UTC

On 12/17/2023 9:57 AM, MitchAlsup wrote:
> EricP wrote:
>
>> MitchAlsup wrote:
>>> EricP wrote:
>>>
>>>> Presenting the coprocessor as a Virtual Function (VF) could work but,
>>>> from the limited info I have seen, using a VF does seem to be limited
>>>> because the SR-IOV device only export a fixed number of VF's,
>>>> (eg 16, 32, 64) as it is the device that maps from the
>>>> Physical Function (PF) to the VF. As SR-IOV was mostly intended for
>>>> optimizing paravirtualized network cards, in that context it is a
>>>> reasonable limitation. However this would not be suitable for a
>>>> coprocessor to say "access denied, all are in use".
>>>
>>> Each Guest OS gets one VF.
>
>> Right, that is the intended purpose for network cards in virtual machines
>> although the SR-IOV specs are generalized.
>
>> Then there is also was this movement that wants to do "zero-copy" network
>> IO directly from user space IO buffers, which is a VF per process opening
>> the device.
>
> Why is this not an I/O MMU mapping. Kernel still does setup and teardown
> but device does DMA directly into user (requestor) memory.

Not exactly sure if this is relevant, but are you familiar with the Cell
processors back on the PlayStation 3? PPE and several SPE's? Iirc there
was a DMA to communicate with the SPE's. Although, some games did not
even use them because they were too difficult to program for.

[...]

Re: hmac hardware acceleration...

<ulnmqs$33r82$4@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=35850&group=comp.arch#35850

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: hmac hardware acceleration...
Date: Sun, 17 Dec 2023 12:48:27 -0800
Organization: A noiseless patient Spider
Lines: 42
Message-ID: <ulnmqs$33r82$4@dont-email.me>
References: <ul8qlh$3htui$1@dont-email.me>
<aecc5efbb8795bb2ff4b54a2dcf79f1d@news.novabbs.com>
<ulh5mg$1rhh9$1@dont-email.me> <bb_eN.21874$xHn7.20939@fx14.iad>
<Ry2fN.54371$7sbb.893@fx16.iad> <8gkfN.51800$Wp_8.38957@fx17.iad>
<f1ea40112658daba86c6bdffc1950aef@news.novabbs.com>
<v7GfN.23137$LONb.17592@fx08.iad>
<80de00eff8dacaa6a481706d844dadc2@news.novabbs.com>
<ulnmj5$33r83$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 17 Dec 2023 20:48:29 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="7dcea355c22f1533bc3a8f39b9e5a9f5";
logging-data="3271938"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18QVHaGbNFN32YlMBaIp7YyaJrz3LvANqY="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:w7n+CVB+RQwtk6HGsAfh5P7PtiY=
Content-Language: en-US
In-Reply-To: <ulnmj5$33r83$1@dont-email.me>
 by: Chris M. Thomasson - Sun, 17 Dec 2023 20:48 UTC

On 12/17/2023 12:44 PM, Chris M. Thomasson wrote:
> On 12/17/2023 9:57 AM, MitchAlsup wrote:
>> EricP wrote:
>>
>>> MitchAlsup wrote:
>>>> EricP wrote:
>>>>
>>>>> Presenting the coprocessor as a Virtual Function (VF) could work but,
>>>>> from the limited info I have seen, using a VF does seem to be limited
>>>>> because the SR-IOV device only export a fixed number of VF's,
>>>>> (eg 16, 32, 64) as it is the device that maps from the
>>>>> Physical Function (PF) to the VF. As SR-IOV was mostly intended for
>>>>> optimizing paravirtualized network cards, in that context it is a
>>>>> reasonable limitation. However this would not be suitable for a
>>>>> coprocessor to say "access denied, all are in use".
>>>>
>>>> Each Guest OS gets one VF.
>>
>>> Right, that is the intended purpose for network cards in virtual
>>> machines
>>> although the SR-IOV specs are generalized.
>>
>>> Then there is also was this movement that wants to do "zero-copy"
>>> network
>>> IO directly from user space IO buffers, which is a VF per process
>>> opening
>>> the device.
>>
>> Why is this not an I/O MMU mapping. Kernel still does setup and teardown
>> but device does DMA directly into user (requestor) memory.
>
> Not exactly sure if this is relevant, but are you familiar with the Cell
> processors back on the PlayStation 3? PPE and several SPE's? Iirc there
> was a DMA to communicate with the SPE's. Although, some games did not
> even use them because they were too difficult to program for.
>
>
> [...]

Wrt my cipher, well, the problem is that it's not really parallel at all.
I cannot complete step a without first processing a - 1. So, shit.
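
To illustrate that kind of serial dependency (this is not the cipher in
question, just a generic chained construction using Python's standard hmac
and hashlib modules; the function name chained_tags is made up): each tag
below is an HMAC-SHA256 over the previous tag plus the current block, so
step a cannot start until step a - 1 has produced its output.

```python
import hashlib
import hmac


def chained_tags(key, blocks):
    """Tag each block with HMAC-SHA256 over (previous tag + block)."""
    prev = b"\x00" * hashlib.sha256().digest_size  # fixed starting value
    tags = []
    for block in blocks:
        # The input to this step includes the output of the step before it,
        # so the loop cannot be split across independent workers.
        prev = hmac.new(key, prev + block, hashlib.sha256).digest()
        tags.append(prev)
    return tags


a = chained_tags(b"key", [b"one", b"two", b"three"])
b = chained_tags(b"key", [b"ONE", b"two", b"three"])
# Changing only the first block changes every later tag as well.
print(a[1] != b[1], a[2] != b[2])
```

A coprocessor can still speed up each individual step, but it cannot
overlap the steps of one stream; any parallelism has to come from running
several independent streams at once.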
