Rocksolid Light




Subject / Author
* A Very Bad IdeaQuadibloc
+- Re: A Very Bad IdeaChris M. Thomasson
+* Vectors (was: A Very Bad Idea)Anton Ertl
|+* Re: Vectors (was: A Very Bad Idea)Quadibloc
||+- Re: Vectors (was: A Very Bad Idea)Anton Ertl
||`- Re: VectorsMitchAlsup1
|`- Re: VectorsMitchAlsup1
+* Re: A Very Bad IdeaBGB
|`* Re: A Very Bad IdeaMitchAlsup1
| `- Re: A Very Bad IdeaBGB-Alt
+- Re: A Very Bad IdeaMitchAlsup1
+* Re: A Very Bad Idea?Lawrence D'Oliveiro
|`* Re: A Very Bad Idea?MitchAlsup1
| `- Re: A Very Bad Idea?BGB-Alt
`* Re: Cray style vectors (was: A Very Bad Idea)Marcus
 +* Re: Cray style vectors (was: A Very Bad Idea)Quadibloc
 |+- Re: Cray style vectors (was: A Very Bad Idea)Quadibloc
 |+* Re: Cray style vectors (was: A Very Bad Idea)Scott Lurndal
 ||`* Re: Cray style vectors (was: A Very Bad Idea)Thomas Koenig
 || `* Re: Cray style vectorsMitchAlsup1
 ||  `- Re: Cray style vectorsQuadibloc
 |`* Re: Cray style vectorsMarcus
 | +- Re: Cray style vectorsMitchAlsup1
 | `* Re: Cray style vectorsQuadibloc
 |  +- Re: Cray style vectorsQuadibloc
 |  +* Re: Cray style vectorsAnton Ertl
 |  |`* Re: Cray style vectorsStephen Fuld
 |  | +* Re: Cray style vectorsAnton Ertl
 |  | |+- Re: Cray style vectorsMitchAlsup1
 |  | |`* Re: Cray style vectorsStephen Fuld
 |  | | `* Re: Cray style vectorsMitchAlsup
 |  | |  `* Re: Cray style vectorsStephen Fuld
 |  | |   `* Re: Cray style vectorsTerje Mathisen
 |  | |    `* Re: Cray style vectorsAnton Ertl
 |  | |     +* Re: Cray style vectorsTerje Mathisen
 |  | |     |+- Re: Cray style vectorsMitchAlsup1
 |  | |     |+* Re: Cray style vectorsTim Rentsch
 |  | |     ||+* Re: Cray style vectorsMitchAlsup1
 |  | |     |||`* Re: Cray style vectorsTim Rentsch
 |  | |     ||| +* Re: Cray style vectorsOpus
 |  | |     ||| |`- Re: Cray style vectorsTim Rentsch
 |  | |     ||| +* Re: Cray style vectorsScott Lurndal
 |  | |     ||| |`- Re: Cray style vectorsTim Rentsch
 |  | |     ||| `* Re: Cray style vectorsMitchAlsup1
 |  | |     |||  `- Re: Cray style vectorsTim Rentsch
 |  | |     ||`* Re: Cray style vectorsTerje Mathisen
 |  | |     || `* Re: Cray style vectorsTim Rentsch
 |  | |     ||  `* Re: Cray style vectorsTerje Mathisen
 |  | |     ||   +* Re: Cray style vectorsTerje Mathisen
 |  | |     ||   |+* Re: Cray style vectorsMichael S
 |  | |     ||   ||`* Re: Cray style vectorsMitchAlsup1
 |  | |     ||   || `- Re: Cray style vectorsScott Lurndal
 |  | |     ||   |`- Re: Cray style vectorsTim Rentsch
 |  | |     ||   `- Re: Cray style vectorsTim Rentsch
 |  | |     |+- Re: Cray style vectorsAnton Ertl
 |  | |     |`* Re: Cray style vectorsDavid Brown
 |  | |     | +* Re: Cray style vectorsTerje Mathisen
 |  | |     | |+* Re: Cray style vectorsMitchAlsup1
 |  | |     | ||+* Re: Cray style vectorsAnton Ertl
 |  | |     | |||`* What integer C type to use (was: Cray style vectors)Anton Ertl
 |  | |     | ||| `* Re: What integer C type to use (was: Cray style vectors)David Brown
 |  | |     | |||  +* Re: What integer C type to use (was: Cray style vectors)Scott Lurndal
 |  | |     | |||  |`* Re: What integer C type to use (was: Cray style vectors)Anton Ertl
 |  | |     | |||  | +* Re: What integer C type to use (was: Cray style vectors)Scott Lurndal
 |  | |     | |||  | |+- Re: What integer C type to useMitchAlsup1
 |  | |     | |||  | |`* Re: What integer C type to use (was: Cray style vectors)Anton Ertl
 |  | |     | |||  | | `* Re: What integer C type to use (was: Cray style vectors)Scott Lurndal
 |  | |     | |||  | |  `* Re: What integer C type to use (was: Cray style vectors)Anton Ertl
 |  | |     | |||  | |   +- Re: What integer C type to use (was: Cray style vectors)Scott Lurndal
 |  | |     | |||  | |   `* Re: What integer C type to use (was: Cray style vectors)Tim Rentsch
 |  | |     | |||  | |    `* Re: What integer C type to use (was: Cray style vectors)Scott Lurndal
 |  | |     | |||  | |     `- Re: What integer C type to use (was: Cray style vectors)Tim Rentsch
 |  | |     | |||  | `- Re: What integer C type to useMitchAlsup1
 |  | |     | |||  +* Re: What integer C type to use (was: Cray style vectors)Anton Ertl
 |  | |     | |||  |+* Re: What integer C type to useMitchAlsup1
 |  | |     | |||  ||+- Re: What integer C type to useDavid Brown
 |  | |     | |||  ||`* Re: What integer C type to useTerje Mathisen
 |  | |     | |||  || `* Re: What integer C type to useTim Rentsch
 |  | |     | |||  ||  `* Re: What integer C type to useMitchAlsup1
 |  | |     | |||  ||   +- Re: What integer C type to useTim Rentsch
 |  | |     | |||  ||   `* Re: What integer C type to useDavid Brown
 |  | |     | |||  ||    `- Re: What integer C type to useThomas Koenig
 |  | |     | |||  |+* Re: What integer C type to use (was: Cray style vectors)David Brown
 |  | |     | |||  ||+* Re: What integer C type to use (was: Cray style vectors)Scott Lurndal
 |  | |     | |||  |||+* Re: What integer C type to use (was: Cray style vectors)Michael S
 |  | |     | |||  ||||+- Re: What integer C type to use (was: Cray style vectors)Scott Lurndal
 |  | |     | |||  ||||`- Re: What integer C type to use (was: Cray style vectors)David Brown
 |  | |     | |||  |||`- Re: What integer C type to use (was: Cray style vectors)Anton Ertl
 |  | |     | |||  ||`* Re: What integer C type to use (was: Cray style vectors)Anton Ertl
 |  | |     | |||  || `* Re: What integer C type to useDavid Brown
 |  | |     | |||  ||  `* Re: What integer C type to useMitchAlsup1
 |  | |     | |||  ||   `- Re: What integer C type to useDavid Brown
 |  | |     | |||  |`* Re: What integer C type to use (was: Cray style vectors)Thomas Koenig
 |  | |     | |||  | +* Re: What integer C type to useMitchAlsup1
 |  | |     | |||  | |+* Re: What integer C type to useDavid Brown
 |  | |     | |||  | ||`* Re: What integer C type to useMitchAlsup1
 |  | |     | |||  | || `* Re: What integer C type to useDavid Brown
 |  | |     | |||  | ||  `* Re: What integer C type to useMichael S
 |  | |     | |||  | ||   +* Re: What integer C type to useMitchAlsup1
 |  | |     | |||  | ||   |`* Re: What integer C type to useMichael S
 |  | |     | |||  | ||   | `* Re: What integer C type to useMitchAlsup1
 |  | |     | |||  | ||   `- Re: What integer C type to useThomas Koenig
 |  | |     | |||  | |`* Re: What integer C type to useThomas Koenig
 |  | |     | |||  | `* Re: What integer C type to use (was: Cray style vectors)Anton Ertl
 |  | |     | |||  +* Re: What integer C type to use (was: Cray style vectors)Brian G. Lucas
 |  | |     | |||  `- Re: What integer C type to useBGB
 |  | |     | ||+- Re: Cray style vectorsDavid Brown
 |  | |     | ||`- Re: Cray style vectorsTim Rentsch
 |  | |     | |+- Re: Cray style vectorsDavid Brown
 |  | |     | |`- Re: Cray style vectorsTim Rentsch
 |  | |     | `* Re: Cray style vectorsThomas Koenig
 |  | |     `* Re: Cray style vectorsBGB
 |  | `- Re: Cray style vectorsMitchAlsup1
 |  +- Re: Cray style vectorsBGB
 |  +* Re: Cray style vectorsMarcus
 |  `* Re: Cray style vectorsMitchAlsup1
 `* Re: Cray style vectors (was: A Very Bad Idea)Michael S

Re: Cray style vectors

<uqmn7c$3n35k$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=37390&group=comp.arch#37390

From: quadibloc@servername.invalid (Quadibloc)
Newsgroups: comp.arch
Subject: Re: Cray style vectors
Date: Fri, 16 Feb 2024 04:10:21 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 83
Message-ID: <uqmn7c$3n35k$1@dont-email.me>
References: <upq0cr$6b5m$1@dont-email.me> <uqge2p$279ql$1@dont-email.me>
<uqhiqb$2grub$1@dont-email.me> <uqlm2c$3e9bp$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 16 Feb 2024 04:10:21 -0000 (UTC)
User-Agent: Pan/0.146 (Hic habitat felicitas; d7a48b4
gitlab.gnome.org/GNOME/pan.git)
 by: Quadibloc - Fri, 16 Feb 2024 04:10 UTC

On Thu, 15 Feb 2024 19:44:27 +0100, Marcus wrote:
> On 2024-02-14, Quadibloc wrote:

>> But there's also one very bad thing about a vector register file.

>> Like any register file, it has to be *saved* and *restored* under
>> certain circumstances. Most especially, it has to be saved before,
>> and restored after, other user-mode programs run, even if they
>> aren't _expected_ to use vectors, as a program interrupted by
>> a real-time-clock interrupt to let other users do stuff has to
>> be able to *rely* on its registers all staying undisturbed, as if
>> no interrupts happened.

> Yes, that is the major drawback of a vector register file, so it has to
> be dealt with somehow.

Yes, and therefore I am looking into ways to deal with it somehow.

Why not just use Mitch Alsup's wonderful VVM?

It is true that the state of the art has advanced since the Cray I
was first introduced. So, perhaps Mitch Alsup has indeed found,
through improving data forwarding, as I understand it, a way to make
the performance of a memory-memory vector machine (like the Control
Data STAR-100) match that of one with vector registers (like the
Cray I, which succeeded where the STAR-100 failed).

But the historical precedent seems to indicate otherwise. And while
data forwarding is very definitely a good thing (and, indeed, necessary
to have for best performance _on_ a vector register machine too), it
has its limits.

What _could_ substitute for vector registers isn't data forwarding,
it's the cache, since that does the same thing vector registers do:
it brings in vector operands closer to the CPU where they're more
quickly accessible. So a STAR-100 with a *really good cache* as well
as data forwarding could, I suppose, compete with a Cray I.

My first question, though, is whether or not we can really make caches
that good.

But skepticism about VVM isn't actually helpful if Cray-style vectors
can no longer be made to work given current memory speeds.

The basic way in which I originally felt I could make it work was really
quite simple. The operating system, from privileged code, could set a
bit in the PSW that turns on, or off, the ability to run instructions that
access the vector registers.

The details of how one may have to make use of that capability... well,
that's software. So maybe the OS has to stipulate that one can only have
one process at a time that uses these vectors - and that process has to
run as a batch process!

Hey, the GPU in a computer these days is also a singular resource.

Having resources that have to be treated that way is not really what
people are used to, but a computer that _can_ run your CFD codes
efficiently is better than a computer that *can't* run your CFD codes.

Given _that_, obviously if VVM is a better fit to the regular computer
model, and it offers nearly the same performance, then what I should do
is offer VVM or something very much like it _in addition_ to Cray-style
vectors, so that the best possible vector performance for conventional
non-batch programs is also available.

Now, what would I think of as being "something very much like VVM" without
actually being VVM?

Basically, Mitch has his architecture designed for implementation on
CPUs that are smart enough to notice certain combinations of instructions
and execute them as though they're single instructions doing the same
thing, which can then be executed more efficiently.

So this makes those exact combinations part of the... ISA syntax...
which I think is too hard for assembler programmers to remember, and
I think it's also too hard for at least some implementors. I see it
as asking for trouble in a way that I'd rather avoid.

So my substitute for VVM should now be obvious - explicit memory-to-memory
vector instructions, like on an old STAR-100.

John Savard

Re: Cray style vectors

<uqmoda$3nbj5$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=37391&group=comp.arch#37391

From: quadibloc@servername.invalid (Quadibloc)
Newsgroups: comp.arch
Subject: Re: Cray style vectors
Date: Fri, 16 Feb 2024 04:30:34 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 42
Message-ID: <uqmoda$3nbj5$1@dont-email.me>
References: <upq0cr$6b5m$1@dont-email.me> <uqge2p$279ql$1@dont-email.me>
<uqhiqb$2grub$1@dont-email.me> <uqlm2c$3e9bp$1@dont-email.me>
<uqmn7c$3n35k$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 16 Feb 2024 04:30:34 -0000 (UTC)
User-Agent: Pan/0.146 (Hic habitat felicitas; d7a48b4
gitlab.gnome.org/GNOME/pan.git)
 by: Quadibloc - Fri, 16 Feb 2024 04:30 UTC

On Fri, 16 Feb 2024 04:10:21 +0000, Quadibloc wrote:

> The basic way in which I originally felt I could make it work was really
> quite simple. The operating system, from privileged code, could set a
> bit in the PSW that turns on, or off, the ability to run instructions that
> access the vector registers.
>
> The details of how one may have to make use of that capability... well,
> that's software. So maybe the OS has to stipulate that one can only have
> one process at a time that uses these vectors - and that process has to
> run as a batch process!

and then I also wrote...

> So my substitute for VVM should now be obvious - explicit memory-to-memory
> vector instructions, like on an old STAR-100.

However, an obvious objection can be raised.

Vector programs that can only be run one at a time on a computer using
your new chip? That's a throwback to ancient times; people using today's
computers with GUI operating systems aren't used to that sort of thing,
and will therefore end up tossing your computer out, thinking that it's
broken!

So there is one more stratagem that I need to employ to avoid that
disaster.

Nothing is stopping the operating system and compilers from supporting
a particular kind of *fat binaries* that addresses this issue, making
it all invisible to the user.

Vector programs would come in a form that includes _both_ Cray I
style code and STAR-100 style code, and the highest-priority
vector program on the machine would get to run in Cray I mode until
it finishes.

Yes, that means that later programs with even higher priority would
be doomed to run slow, but this horse can't be changed in midstream,
and so one just has to live with this limitation.
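The "fat binary" scheme above can be sketched in a few lines of C. Everything here is invented for illustration (vu_try_acquire, the kernel names, the one-owner rule are not any real OS API); it only shows the shape of the idea: one binary carries both code paths, and the loader hands the Cray-style path to the first program that claims the vector register file, with everyone else falling back to the memory-to-memory code.

```c
#include <stdio.h>

/* Hypothetical sketch of the "fat binary" dispatch described above.
 * vu_try_acquire(), kernel_cray() and kernel_star() are invented names. */

typedef void (*vec_kernel)(const double *a, double *c, int n);

/* Cray-style path: would use vector registers (stubbed as scalar here). */
static void kernel_cray(const double *a, double *c, int n) {
    for (int i = 0; i < n; i++) c[i] = a[i] * 2.0;
}

/* STAR-100-style memory-to-memory path (same stub computation). */
static void kernel_star(const double *a, double *c, int n) {
    for (int i = 0; i < n; i++) c[i] = a[i] * 2.0;
}

/* Pretend only one process at a time may own the vector register file. */
static int vu_busy = 0;
static int vu_try_acquire(void) {
    if (vu_busy) return 0;
    vu_busy = 1;
    return 1;
}

vec_kernel select_kernel(void) {
    /* First claimant gets the Cray-style code; later programs get the
     * memory-to-memory code packaged in the same binary. */
    return vu_try_acquire() ? kernel_cray : kernel_star;
}
```

Both entry points compute the same result, so the choice is invisible to the user, which is the whole point of the scheme.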

John Savard

Re: Cray style vectors

<2024Feb16.082736@mips.complang.tuwien.ac.at>


https://news.novabbs.org/devel/article-flat.php?id=37392&group=comp.arch#37392

From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Cray style vectors
Date: Fri, 16 Feb 2024 07:27:36 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 58
Message-ID: <2024Feb16.082736@mips.complang.tuwien.ac.at>
References: <upq0cr$6b5m$1@dont-email.me> <uqge2p$279ql$1@dont-email.me> <uqhiqb$2grub$1@dont-email.me> <uqlm2c$3e9bp$1@dont-email.me> <uqmn7c$3n35k$1@dont-email.me>
X-newsreader: xrn 10.11
 by: Anton Ertl - Fri, 16 Feb 2024 07:27 UTC

Quadibloc <quadibloc@servername.invalid> writes:
>Why not just use Mitch Alsup's wonderful VVM?
>
>It is true that the state of the art has advanced since the Cray I
>was first introduced. So, perhaps Mitch Alsup has indeed found,
>through improving data forwarding, as I understand it, a way to make
>the performance of a memory-memory vector machine (like the Control
>Data STAR-100) match that of one with vector registers (like the
>Cray I, which succeeded where the STAR-100 failed).

I don't think that's a proper characterization of VVM. One advantage
that vector registers have over memory-memory machines is that vector
registers, once loaded, can be used several times. And AFAIK VVM has
that advantage, too. E.g., if you have the loop

for (i=0; i<n; i++) {
  double b = a[i];
  c[i] = b;
  d[i] = b;
}

a[i] is loaded only once (also in VVM), while a memory-memory
formulation would load a[i] twice. And on the microarchitectural
level, VVM may work with vector registers, but the nice part is that
it's only microarchitecture, and it avoids all the nasty consequences
of making it architectural, such as more expensive context switches.
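A quick counting sketch makes the load-reuse point concrete (the load
counter is purely illustrative, not a model of any real hardware):

```c
#include <assert.h>

/* The fused loop touches a[i] once per element; a memory-memory
 * formulation (one vector copy per statement) reads a[i] once per
 * destination. The counter just makes that visible. */

static long loads;
static double load(const double *p) { loads++; return *p; }

void fused(const double *a, double *c, double *d, int n) {
    for (int i = 0; i < n; i++) {
        double b = load(&a[i]);   /* one load, reused for both stores */
        c[i] = b;
        d[i] = b;
    }
}

void mem_to_mem(const double *a, double *c, double *d, int n) {
    for (int i = 0; i < n; i++) c[i] = load(&a[i]);  /* vector copy #1 */
    for (int i = 0; i < n; i++) d[i] = load(&a[i]);  /* vector copy #2 */
}
```

For n elements the fused form issues n loads, the memory-memory form 2n,
which is the advantage vector registers (and, per the above, VVM) keep.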

>Basically, Mitch has his architecture designed for implementation on
>CPUs that are smart enough to notice certain combinations of instructions
>and execute them as though they're single instructions doing the same
>thing, which can then be executed more efficiently.

My understanding is that he requires explicit marking (why?), and that
the loop can do almost anything, but (I think) it must be a simple
loop without further control structures. I think he also allows
recurrences (in particular, reductions), but I don't understand how
his hardware auto-vectorizes that; e.g.:

double r = 0.0;
for (i=0; i<n; i++)
  r += a[i];

This is particularly nasty given that FP addition is not associative;
but even if you allow fast-math-style reassociation, doing this in
hardware seems to be quite a bit harder than the rest of VVM.
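The non-associativity is easy to demonstrate: summing the same four
values left-to-right versus in pairs (as a hardware reduction tree
might group them) gives different answers:

```c
/* FP addition is not associative: the grouping a reduction uses
 * changes the result whenever intermediate sums lose precision. */

double sum_forward(const double *a, int n) {
    double r = 0.0;
    for (int i = 0; i < n; i++)
        r += a[i];                /* strict left-to-right order */
    return r;
}

/* Pairwise grouping, as a two-level reduction tree would do it. */
double sum_pairs(const double *a) {
    return (a[0] + a[2]) + (a[1] + a[3]);
}
```

With a = {1.0, 1e100, 1.0, -1e100} the forward sum absorbs both 1.0
terms into 1e100 and yields 0.0, while the pairwise grouping yields 2.0,
so hardware that reorders the reduction silently changes the answer.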

>So this makes those exact combinations part of the... ISA syntax...
>which I think is too hard for assembler programmers to remember,

My understanding is that there is no need to remember much. Just
remember that it has to be a simple loop, and mark it. But, as in all
auto-vectorization schemes, there are cases where it works better than
in others.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Cray style vectors

<uqn5ct$3p860$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=37393&group=comp.arch#37393

From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Cray style vectors
Date: Fri, 16 Feb 2024 02:10:57 -0600
Organization: A noiseless patient Spider
Lines: 230
Message-ID: <uqn5ct$3p860$1@dont-email.me>
References: <upq0cr$6b5m$1@dont-email.me> <uqge2p$279ql$1@dont-email.me>
<uqhiqb$2grub$1@dont-email.me> <uqlm2c$3e9bp$1@dont-email.me>
<uqmn7c$3n35k$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 16 Feb 2024 08:12:13 -0000 (UTC)
User-Agent: Mozilla Thunderbird
Content-Language: en-US
In-Reply-To: <uqmn7c$3n35k$1@dont-email.me>
 by: BGB - Fri, 16 Feb 2024 08:10 UTC

On 2/15/2024 10:10 PM, Quadibloc wrote:
> On Thu, 15 Feb 2024 19:44:27 +0100, Marcus wrote:
>> On 2024-02-14, Quadibloc wrote:
>
>>> But there's also one very bad thing about a vector register file.
>
>>> Like any register file, it has to be *saved* and *restored* under
>>> certain circumstances. Most especially, it has to be saved before,
>>> and restored after, other user-mode programs run, even if they
>>> aren't _expected_ to use vectors, as a program interrupted by
>>> a real-time-clock interrupt to let other users do stuff has to
>>> be able to *rely* on its registers all staying undisturbed, as if
>>> no interrupts happened.
>
>> Yes, that is the major drawback of a vector register file, so it has to
>> be dealt with somehow.
>
> Yes, and therefore I am looking into ways to deal with it somehow.
>
> Why not just use Mitch Alsup's wonderful VVM?
>
> It is true that the state of the art has advanced since the Cray I
> was first introduced. So, perhaps Mitch Alsup has indeed found,
> through improving data forwarding, as I understand it, a way to make
> the performance of a memory-memory vector machine (like the Control
> Data STAR-100) match that of one with vector registers (like the
> Cray I, which succeeded where the STAR-100 failed).
>
> But because the historical precedent seems to indicate otherwise, and
> because while data forwarding is very definitely a good thing (and,
> indeed, necessary to have for best performance _on_ a vector register
> machine too) it has its limits.
>
> What _could_ substitute for vector registers isn't data forwarding,
> it's the cache, since that does the same thing vector registers do:
> it brings in vector operands closer to the CPU where they're more
> quickly accessible. So a STAR-100 with a *really good cache* as well
> as data forwarding could, I suppose, compete with a Cray I.
>
> My first question, though, is whether or not we can really make caches
> that good.
>
> But skepticism about VVM isn't actually helpful if Cray-style vectors
> are now impossible to be made to work given current memory speeds.
>

Possibly true.

One could push a lot more data through a SIMD unit if one could, say,
perform two memory loads and one memory store per cycle. Then the idea
of vector registers with a memory address and implicit SIMD vectors
makes sense.

But, this still leaves the problem of "what happens when all the data no
longer fits in the L1 cache?..."

Doing vector calculations at memcpy speed doesn't really gain much over
doing SIMD calculations at memcpy speed. SIMD is, meanwhile, easier to
implement, and does have obvious merit over scalar calculations in many
contexts.

So, say, I have a 200 MFLOP SIMD unit, but with Binary16, realistically
I could only get ~35 MFLOP for a large vector operation at L2 speeds
(or 17 MFLOP if using Binary32). The limit may jump to 70 MFLOP if the
operation resembles a large vector dot-product.

But, yeah, if the SIMD code is carefully written, it is possible for the
SIMD code to operate at near memcpy speeds (despite the inherent
inefficiency of needing to "waste" clock-cycles on things like memory
loads).

OTOH, theoretically, a computer pushing 400MB/s for memcpy could get 100
MFLOP at Binary32 precision (but, might end up being less if working on
it using x87). Or, if the calculation is more involved, slower still.

If the operation resembles a 4-element vector multiply-accumulate,
cycling over each element in SIMD-like patterns (say, the "fake the SIMD
operations using structs of floats" strategy), x87 eats it hard; and is
a bigger bottleneck than the memory bandwidth.

OTOH: If one is doing a bunch of large branching scalar calculations
that each produce a single output value, x87 makes a lot more sense
(IOW, the sorts of things people more traditionally think of when they
think of "math", as opposed to a whole lot of "c[i]=a[i]*b[i]+c[i];" or
similar...).
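As a sketch of the "fake the SIMD operations using structs of floats"
pattern mentioned above (the type and function names here are invented
for illustration):

```c
/* A 4-element multiply-accumulate written as plain scalar code over a
 * struct of floats; a compiler may or may not turn this into SIMD, and
 * on x87 each element goes through the scalar FP stack. */

typedef struct { float v[4]; } vec4;

vec4 vec4_mac(vec4 a, vec4 b, vec4 c) {
    vec4 r;
    for (int i = 0; i < 4; i++)
        r.v[i] = a.v[i] * b.v[i] + c.v[i];   /* c[i] = a[i]*b[i] + c[i] */
    return r;
}
```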

> The basic way in which I originally felt I could make it work was really
> quite simple. The operating system, from privileged code, could set a
> bit in the PSW that turns on, or off, the ability to run instructions that
> access the vector registers.
>
> The details of how one may have to make use of that capability... well,
> that's software. So maybe the OS has to stipulate that one can only have
> one process at a time that uses these vectors - and that process has to
> run as a batch process!
>
> Hey, the GPU in a computer these days is also a singular resource.
>
> Having resources that have to be treated that way is not really what
> people are used to, but a computer that _can_ run your CFD codes
> efficiently is better than a computer that *can't* run your CFD codes.
>

If a computer can't effectively run an algorithm due to ISA design, this
is a design fail.

Say, "define mechanism, not policy" and the like. Some vector ISAs seem
unattractive, though, in that the data needs to be made to fit the
operations (even more so than SIMD would otherwise imply), or would
require structuring things in ways that would not be cache friendly.

As I see it, they also don't really "solve" much that couldn't otherwise
be solved with SIMD if one had multiple memory ports, but doing so is
still potentially rendered moot if one doesn't have enough memory
bandwidth to keep everything fed.

Though, luckily, most "common" algorithms can fit most of the code
and data in the L1 cache, and are more limited by processing the data than by
memory bandwidth (and, in this case, it might be easier to justify
having multiple memory ports to get data into and out of registers more
quickly).

> Given _that_, obviously if VVM is a better fit to the regular computer
> model, and it offers nearly the same performance, then what I should do
> is offer VVM or something very much like it _in addition_ to Cray-style
> vectors, so that the best possible vector performance for conventional
> non-batch programs is also available.
>
> Now, what would I think of as being "something very much like VVM" without
> actually being VVM?
>
> Basically, Mitch has his architecture designed for implementation on
> CPUs that are smart enough to notice certain combinations of instructions
> and execute them as though they're single instructions doing the same
> thing, which can then be executed more efficiently.
>
> So this makes those exact combinations part of the... ISA syntax...
> which I think is too hard for assembler programmers to remember, and
> I think it's also too hard for at least some implementors. I see it
> as asking for trouble in a way that I'd rather avoid.
>
> So my substitute for VVM should now be obvious - explicit memory-to-memory
> vector instructions, like on an old STAR-100.
>

One has the issue of likely needing arcane implementation magic.

The other is that it is likely to be limiting and cache unfriendly.

For better/worse, I partly followed the MMX / SSE model.
But, cutting a lot of corners to make things cheaper.

Generally leaving out things like packed byte operations or saturating
arithmetic, which while potentially useful, are not particularly cheap
either.

I generally try to approach SIMD in a similar way to the integer ISA,
where it is usually better to add features reluctantly (well, and/or end
up with the ISA clogged up with stuff that does not meaningfully
contribute to performance).

Well, nevermind stuff that exists mostly as a workaround to other design
choices (say, instructions which only need to exist due to the lack of
an architectural zero register, or because of my choice to use pointer
tagging, ...).

But, as noted, if I were doing it all again, I might consider having ZR,
LR, and GBR, in the GPR space.

But, then I get back to fiddling with my CMP3R extension:
Have noted that enabling it helps with Doom, but somewhat hurts
Dhrystone, for reasons that aren't entirely obvious (enabling it should
presumably not hurt Dhrystone, given all it is really doing is replacing
2-op sequences with 1-op in a way that should theoretically reduce
4-cycle dependent pairs down to 1|2 cycles).

Have replaced the "CMPQGE Rm, Imm5u, Rn" instruction with "CMPQLT Rm,
Imm5u, Rn", noting that the latter can usefully encode more cases than
the former ("GE Imm" can be replaced with "GT Imm-1" to similar effect,
whereas "LT" can encode both the LT and LE cases, even if it breaks
symmetry with the 3R cases, where LT and LE are expressed by flipping
the arguments).
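The immediate rewrites behind this choice ("GE imm" as "GT imm-1", and
likewise "LE imm" as "LT imm+1") rest on two ordinary integer
identities, which a quick exhaustive check over a small immediate range
confirms:

```c
/* The compare-immediate rewrites rest on:
 *   x >= imm  <=>  x >  imm-1
 *   x <= imm  <=>  x <  imm+1
 * valid whenever imm-1 / imm+1 still fit the immediate field. */

int ge_via_gt(long x, long imm) { return x > imm - 1; }
int le_via_lt(long x, long imm) { return x < imm + 1; }

int check_identities(void) {
    for (long imm = 1; imm < 31; imm++)      /* stay inside a 5-bit field */
        for (long x = -64; x <= 64; x++) {
            if ((x >= imm) != ge_via_gt(x, imm)) return 0;
            if ((x <= imm) != le_via_lt(x, imm)) return 0;
        }
    return 1;
}
```

This is why a single LT-style compare-immediate covers both the LT and
LE cases while GE-style covers GE and GT, one of the two encoding the
other at the cost of adjusting the immediate by one.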

Well, and also in the process finding some cases where the existing
compiler logic was compiling stuff inefficiently (it was using FF Op64
encodings where FE jumbo-encodings would have been better in XG2
Mode, and because Op64 only had a 17-bit immediate, this was *also*
leading to loading the immediate values into temporary registers).


Re: Cray style vectors

<uqngut$3r1tr$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=37396&group=comp.arch#37396

From: m.delete@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: Cray style vectors
Date: Fri, 16 Feb 2024 12:29:33 +0100
Organization: A noiseless patient Spider
Lines: 98
Message-ID: <uqngut$3r1tr$1@dont-email.me>
References: <upq0cr$6b5m$1@dont-email.me> <uqge2p$279ql$1@dont-email.me>
<uqhiqb$2grub$1@dont-email.me> <uqlm2c$3e9bp$1@dont-email.me>
<uqmn7c$3n35k$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 16 Feb 2024 11:29:33 -0000 (UTC)
User-Agent: Mozilla Thunderbird
Content-Language: en-US
In-Reply-To: <uqmn7c$3n35k$1@dont-email.me>
 by: Marcus - Fri, 16 Feb 2024 11:29 UTC

On 2024-02-16, Quadibloc wrote:
> On Thu, 15 Feb 2024 19:44:27 +0100, Marcus wrote:
>> On 2024-02-14, Quadibloc wrote:
>
>>> But there's also one very bad thing about a vector register file.
>
>>> Like any register file, it has to be *saved* and *restored* under
>>> certain circumstances. Most especially, it has to be saved before,
>>> and restored after, other user-mode programs run, even if they
>>> aren't _expected_ to use vectors, as a program interrupted by
>>> a real-time-clock interrupt to let other users do stuff has to
>>> be able to *rely* on its registers all staying undisturbed, as if
>>> no interrupts happened.
>
>> Yes, that is the major drawback of a vector register file, so it has to
>> be dealt with somehow.
>
> Yes, and therefore I am looking into ways to deal with it somehow.
>
> Why not just use Mitch Alsup's wonderful VVM?
>
> It is true that the state of the art has advanced since the Cray I
> was first introduced. So, perhaps Mitch Alsup has indeed found,
> through improving data forwarding, as I understand it, a way to make
> the performance of a memory-memory vector machine (like the Control
> Data STAR-100) match that of one with vector registers (like the
> Cray I, which succeeded where the STAR-100 failed).
>
> But because the historical precedent seems to indicate otherwise, and
> because while data forwarding is very definitely a good thing (and,
> indeed, necessary to have for best performance _on_ a vector register
> machine too) it has its limits.
>
> What _could_ substitute for vector registers isn't data forwarding,
> it's the cache, since that does the same thing vector registers do:
> it brings in vector operands closer to the CPU where they're more
> quickly accessible. So a STAR-100 with a *really good cache* as well
> as data forwarding could, I suppose, compete with a Cray I.
>
> My first question, though, is whether or not we can really make caches
> that good.
>

I think that you are missing some of the points that I'm trying to make.
In my recent comments I have been talking about very low end machines,
the kinds that can execute at most one instruction per clock cycle, or
maybe less, and that may not even have a cache at all.

I'm saying that I believe that within this category there is an
opportunity for improving performance with very little cost by adding
vector operations.

E.g. imagine a non-pipelined implementation with a single memory port,
shared by instruction fetch and data load/store, that requires perhaps
two cycles to fetch and decode an instruction, and executes the
instruction in the third cycle (possibly accessing the memory, which
precludes fetching a new instruction until the fourth or even fifth
cycle).

Now imagine if a single instruction could iterate over several elements
of a vector register. The execution unit could then perform up to one
operation every clock cycle, approaching the performance of a pipelined
1 CPI machine. The memory port would also be free for data traffic, as
no new instructions need to be fetched during the vector loop. And so
on.

Similarly, imagine a very simple strictly in-order pipelined
implementation, where you have to resolve hazards by stalling the
pipeline every time there is a RAW hazard, for instance, and you have to
throw away cycles every time you mispredict a branch (which may be
quite often if you only have a very primitive predictor).

With vector operations you pause the front end (fetch and decode) while
iterating over vector elements, which eliminates branch misprediction
penalties. You also magically do away with RAW hazards as by the time
you start issuing a new instruction the vector elements needed from the
previous instruction have already been written to the register file.
And of course you do away with loop overhead instructions (increment,
compare, branch).

As a bonus, I believe that a vector solution like that would be more
energy efficient, as less work has to be done for each operation than if
you have to fetch and decode an instruction for every operation that you
do.

As I said, VVM has many similar properties, but I am currently exploring
whether a VRF solution can be made sufficiently cheap to be feasible in
this very low end space, where I believe that VVM may be a bit too much
(this assumption is mostly based on my own ignorance, so take it with a
grain of salt).

For reference, the microarchitectural complexity that I'm thinking about
is comparable to FemtoRV32 by Bruno Levy (400 LOC, with comments):

https://github.com/BrunoLevy/learn-fpga/blob/master/FemtoRV/RTL/PROCESSOR/femtorv32_quark.v

/Marcus

Re: Cray style vectors

<uqnhej$3r1tr$2@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37397&group=comp.arch#37397
From: m.delete@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: Cray style vectors
Date: Fri, 16 Feb 2024 12:37:55 +0100
Organization: A noiseless patient Spider
Lines: 135
Message-ID: <uqnhej$3r1tr$2@dont-email.me>
References: <upq0cr$6b5m$1@dont-email.me> <uqge2p$279ql$1@dont-email.me>
<20240214111422.0000453c@yahoo.com> <uqln04$3e9bp$2@dont-email.me>
<20240215230033.00000e64@yahoo.com>
In-Reply-To: <20240215230033.00000e64@yahoo.com>
 by: Marcus - Fri, 16 Feb 2024 11:37 UTC

On 2024-02-15, Michael S wrote:
> On Thu, 15 Feb 2024 20:00:20 +0100
> Marcus <m.delete@this.bitsnbites.eu> wrote:
>
>> On 2024-02-14, Michael S wrote:
>>> On Tue, 13 Feb 2024 19:57:28 +0100
>>> Marcus <m.delete@this.bitsnbites.eu> wrote:
>>>
>>>> On 2024-02-05, Quadibloc wrote:
>>>>> I am very fond of the vector architecture of the Cray I and
>>>>> similar machines, because it seems to me the one way of
>>>>> increasing computer performance that proved effective in
>>>>> the past that still isn't being applied to microprocessors
>>>>> today.
>>>>>
>>>>> Mitch Alsup, however, has noted that such an architecture is
>>>>> unworkable today due to memory bandwidth issues. The one
>>>>> extant example of this architecture these days, the NEC
>>>>> SX-Aurora TSUBASA, keeps its entire main memory of up to 48
>>>>> gigabytes on the same card as the CPU, with a form factor
>>>>> resembling a video card - it doesn't try to use the main
>>>>> memory bus of a PC motherboard. So that seems to confirm
>>>>> this.
>>>>>
>>>>
>>>> FWIW I would just like to share my positive experience with MRISC32
>>>> style vectors (very similar to Cray 1, except 32-bit instead of
>>>> 64-bit).
>>>>
>>>
>>> Does it means that you have 8 VRs and each VR is 2048 bits?
>>
>> No. MRISC32 has 32 VRs. I think it's too much, but it was the natural
>> number of registers as I have five-bit vector address fields in the
>> instruction encoding (because 32 scalar registers). I have been
>> thinking about reducing it to 16 vector registers, and find some
>> clever use for the MSB (e.g. '1'=mask, '0'=don't mask), but I'm not
>> there yet.
>>
>> The number of vector elements in each register is implementation
>> defined, but currently the minimum number of vector elements is set to
>> 16 (I wanted to set it relatively high to push myself to come up with
>> solutions to problems related to large vector registers).
>>
>> Each vector element is 32 bits wide.
>>
>> So, in total: 32 x 16 x 32 bits = 16384 bits
>>
>> This is, incidentally, exactly the same as for AVX-512.
>>
>>>> My machine can start and finish at most one 32-bit operation on
>>>> every clock cycle, so it is very simple. The same thing goes for
>>>> vector operations: at most one 32-bit vector element per clock
>>>> cycle.
>>>>
>>>> Thus, it always feels like using vector instructions would not give
>>>> any performance gains. Yet, every time I vectorize a scalar loop
>>>> (basically change scalar registers for vector registers), I see a
>>>> very healthy performance increase.
>>>>
>>>> I attribute this to reduced loop overhead, eliminated hazards,
>>>> reduced I$ pressure and possibly improved cache locality and
>>>> reduced register pressure.
>>>>
>>>> (I know very well that VVM gives similar gains without the VRF)
>>>>
>>>> I guess my point here is that I think that there are opportunities
>>>> in the very low end space (e.g. in order) to improve performance by
>>>> simply adding MRISC32-style vector support. I think that the gains
>>>> would be even bigger for non-pipelined machines, that could start
>>>> "pumping" the execute stage on every cycle when processing vectors,
>>>> skipping the fetch and decode cycles.
>>>>
>>>> BTW, I have also noticed that I often only need a very limited
>>>> number of vector registers in the core vectorized loops (e.g. 2-4
>>>> registers), so I don't think that the VRF has to be excruciatingly
>>>> big to add value to a small core.
>>>
>>> It depends on what you are doing.
>>> If you want good performance in matrix multiply type of algorithm
>>> then 8 VRs would not take you very far. 16 VRs are ALOT better.
>>> More than 16 VR can help somewhat, but the difference between 32
>>> and 16 (in this type of kernels) is much much smaller than
>>> difference between 8 and 16.
>>> Radix-4 and mixed-radix FFT are probably similar except that I never
>>> profiled as thoroughly as I did SGEMM.
>>>
>>
>> I expect that people will want to do such things with an MRISC32 core.
>> However, for the "small cores" that I'm talking about, I doubt that
>> they would even have floating-point support. It's more a question of
>> simple loop optimizations - e.g. the kinds you find in libc or
>> software rasterization kernels. For those you will often get lots of
>> work done with just four vector registers.
>>
>>>> I also envision that for most cases
>>>> you never have to preserve vector registers over function calls.
>>>> I.e. there's really no need to push/pop vector registers to the
>>>> stack, except for context switches (which I believe should be
>>>> optimized by tagging unused vector registers to save on stack
>>>> bandwidth).
>>>>
>>>> /Marcus
>>>
>>> If CRAY-style VRs work for you, it's no proof that lighter VRs, e.g.
>>> ARM Helium-style, would not work as well or better.
>>> My personal opinion is that even for low end in-order cores the
>>> CRAY-like huge ratio between VR width and execution width is far
>>> from optimal. A ratio of 8 looks more optimal when performance of
>>> vectorized loops is a top priority. A ratio of 4 is a wise choice
>>> otherwise.
>>
>> For MRISC32 I'm aiming for splitting a vector operation into four.
>> That seems to eliminate most RAW hazards as execution pipelines tend
>> to be at most four stages long (or thereabout). So, with a pipeline
>> width of 128 bits (which seems to be the goto width for many
>> implementations), you want registers that have 4 x 128 = 512 bits,
>> which is one of the reasons that I mandate at least 512-bit vector
>> registers in MRISC32.
>>
>> Of course, nothing is set in stone, but so far that has been my
>> thinking.
>>
>> /Marcus
>
> Sounds quite reasonable, but I wouldn't call it "Cray-style".
>

Then what would you call it?

I just use the term "Cray-style" to differentiate the style of vector
ISA from explicit SIMD ISAs, GPU-style vector ISAs, STAR-style
memory-memory vector ISAs, etc.

/Marcus

Re: Cray style vectors

<20240216140425.00003744@yahoo.com>

https://news.novabbs.org/devel/article-flat.php?id=37398&group=comp.arch#37398
From: already5chosen@yahoo.com (Michael S)
Newsgroups: comp.arch
Subject: Re: Cray style vectors
Date: Fri, 16 Feb 2024 14:04:25 +0200
Organization: A noiseless patient Spider
Lines: 20
Message-ID: <20240216140425.00003744@yahoo.com>
References: <upq0cr$6b5m$1@dont-email.me>
<uqge2p$279ql$1@dont-email.me>
<20240214111422.0000453c@yahoo.com>
<uqln04$3e9bp$2@dont-email.me>
<20240215230033.00000e64@yahoo.com>
<uqnhej$3r1tr$2@dont-email.me>
 by: Michael S - Fri, 16 Feb 2024 12:04 UTC

On Fri, 16 Feb 2024 12:37:55 +0100
Marcus <m.delete@this.bitsnbites.eu> wrote:

>
> Then what would you call it?
>
> I just use the term "Cray-style" to differentiate the style of vector
> ISA from explicit SIMD ISAs, GPU-style vector ISAs, STAR-style
> memory-memory vector ISAs, etc.
>
> /Marcus

I'd call it a variant of SIMD.
For me, everything with a vector register width to ALU width ratio <= 4
is SIMD. 8 is borderline; above 8 is vector.
It means that sometimes I classify by implementation instead of by
architecture, which in theory is problematic. But I don't care, I am not
in academia.

Re: Cray style vectors

<uqnkak$3rgmh$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37399&group=comp.arch#37399
From: m.delete@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: Cray style vectors
Date: Fri, 16 Feb 2024 13:27:00 +0100
Organization: A noiseless patient Spider
Lines: 34
Message-ID: <uqnkak$3rgmh$1@dont-email.me>
References: <upq0cr$6b5m$1@dont-email.me> <uqge2p$279ql$1@dont-email.me>
<20240214111422.0000453c@yahoo.com> <uqln04$3e9bp$2@dont-email.me>
<20240215230033.00000e64@yahoo.com> <uqnhej$3r1tr$2@dont-email.me>
<20240216140425.00003744@yahoo.com>
In-Reply-To: <20240216140425.00003744@yahoo.com>
 by: Marcus - Fri, 16 Feb 2024 12:27 UTC

On 2024-02-16, Michael S wrote:
> On Fri, 16 Feb 2024 12:37:55 +0100
> Marcus <m.delete@this.bitsnbites.eu> wrote:
>
>>
>> Then what would you call it?
>>
>> I just use the term "Cray-style" to differentiate the style of vector
>> ISA from explicit SIMD ISAs, GPU-style vector ISAs, STAR-style
>> memory-memory vector ISAs, etc.
>>
>> /Marcus
>
> I'd call it a variant of SIMD.
> For me everything with vector register width to ALU width ratio <= 4 is
> SIMD. 8 is borderline, above 8 is vector.
> It means that sometimes I classify by implementation instead of by
> architecture which in theory is problematic. But I don't care, I am not
> in academy.

Ok, I am generally talking about the ISA, which dictates the semantics
and what kinds of implementations are possible (or at least feasible).

For my current MRISC32-A1 implementation, the vector register width to
ALU width ratio is 16, so it would definitely qualify as "vector" then.

The ISA is designed to support wider execution, but the idea is to *not
require* very wide execution and instead encourage sequential execution
(up to the point where things like hazard resolution become less of a
problem and OoO is not really a necessity for high throughput).

/Marcus

Re: Cray style vectors

<uqnmue$3o4m9$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37400&group=comp.arch#37400
From: sfuld@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Cray style vectors
Date: Fri, 16 Feb 2024 05:11:42 -0800
Organization: A noiseless patient Spider
Lines: 75
Message-ID: <uqnmue$3o4m9$1@dont-email.me>
References: <upq0cr$6b5m$1@dont-email.me> <uqge2p$279ql$1@dont-email.me>
<uqhiqb$2grub$1@dont-email.me> <uqlm2c$3e9bp$1@dont-email.me>
<uqmn7c$3n35k$1@dont-email.me> <2024Feb16.082736@mips.complang.tuwien.ac.at>
In-Reply-To: <2024Feb16.082736@mips.complang.tuwien.ac.at>
 by: Stephen Fuld - Fri, 16 Feb 2024 13:11 UTC

On 2/15/2024 11:27 PM, Anton Ertl wrote:
> Quadibloc <quadibloc@servername.invalid> writes:
>> Why not just use Mitch Alsup's wonderful VVM?
>>
>> It is true that the state of the art has advanced since the Cray I
>> was first introduced. So, perhaps Mitch Alsup has indeed found,
>> through improving data forwarding, as I understand it, a way to make
>> the performance of a memory-memory vector machine (like the Control
>> Data STAR-100) match that of one with vector registers (like the
>> Cray I, which succeeded where the STAR-100 failed).
>
> I don't think that's a proper characterization of VVM. One advantage
> that vector registers have over memory-memory machines is that vector
> registers, once loaded, can be used several times. And AFAIK VVM has
> that advantage, too. E.g., if you have the loop
>
> for (i=0; i<n; i++) {
>   double b = a[i];
>   c[i] = b;
>   d[i] = b;
> }
>
> a[i] is loaded only once (also in VVM), while a memory-memory
> formulation would load a[i] twice. And on the microarchitectural
> level, VVM may work with vector registers, but the nice part is that
> it's only microarchitecture, and it avoids all the nasty consequences
> of making it architectural, such as more expensive context switches.
>
>> Basically, Mitch has his architecture designed for implementation on
>> CPUs that are smart enough to notice certain combinations of instructions
>> and execute them as though they're single instructions doing the same
>> thing, which can then be executed more efficiently.
>
> My understanding is that he requires explicit marking (why?),

Of course, Mitch can answer for himself, but ISTM that the explicit
marking allows a more efficient implementation, specifically the
instructions in the loop can be fetched and decoded only once, it allows
the HW to elide some register writes, and saves an instruction by
combining the loop count decrement and test and the return branch into a
single instruction. Perhaps the HW could figure out all of that by
analyzing a "normal" instruction stream, but that seems much harder.

> and that
> the loop can do almost anything, but (I think) it must be a simple
> loop without further control structures.

It allows predicated instructions within the loop.

> I think he also allows
> recurrences (in particular, reductions), but I don't understand how
> his hardware auto-vectorizes that; e.g.:
>
> double r=0.0;
> for (i=0; i<n; i++)
>   r += a[i];
>
> This is particularly nasty given that FP addition is not associative;
> but even if you allow fast-math-style reassociation, doing this in
> hardware seems to be quite a bit harder than the rest of VVM.

 From what I understand, while you can do reductions in a VVM loop, and
they take advantage of wide fetch etc., the hardware doesn't
auto-parallelize the reduction, which avoids the problem you mention.
That does cost performance when the reduction could be parallelized,
e.g. finding the max value in an array.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Cray style vectors

<2024Feb16.152320@mips.complang.tuwien.ac.at>

https://news.novabbs.org/devel/article-flat.php?id=37402&group=comp.arch#37402
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Cray style vectors
Date: Fri, 16 Feb 2024 14:23:20 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 73
Message-ID: <2024Feb16.152320@mips.complang.tuwien.ac.at>
References: <upq0cr$6b5m$1@dont-email.me> <uqge2p$279ql$1@dont-email.me> <uqhiqb$2grub$1@dont-email.me> <uqlm2c$3e9bp$1@dont-email.me> <uqmn7c$3n35k$1@dont-email.me> <2024Feb16.082736@mips.complang.tuwien.ac.at> <uqnmue$3o4m9$1@dont-email.me>
 by: Anton Ertl - Fri, 16 Feb 2024 14:23 UTC

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>On 2/15/2024 11:27 PM, Anton Ertl wrote:
>> Quadibloc <quadibloc@servername.invalid> writes:
>>> Basically, Mitch has his architecture designed for implementation on
>>> CPUs that are smart enough to notice certain combinations of instructions
>>> and execute them as though they're single instructions doing the same
>>> thing, which can then be executed more efficiently.
>>
>> My understanding is that he requires explicit marking (why?),
>
>Of course, Mitch can answer for himself, but ISTM that the explicit
>marking allows a more efficient implementation, specifically the
>instructions in the loop can be fetched and decoded only once, it allows
>the HW to elide some register writes, and saves an instruction by
>combining the loop count decrement and test and the return branch into a
>single instruction. Perhaps the HW could figure out all of that by
>analyzing a "normal" instruction stream, but that seems much harder.

Compared to the rest of the VVM stuff, recognizing it in hardware does
not add much difficulty. Maybe we'll see it in some Intel or AMD CPU
in the coming years.

>> and that
>> the loop can do almost anything, but (I think) it must be a simple
>> loop without further control structures.
>
>It allows predicated instructions within the loop

Sure, predication is not a control structure.

>> I think he also allows
>> recurrences (in particular, reductions), but I don't understand how
>> his hardware auto-vectorizes that; e.g.:
>>
>> double r=0.0;
>> for (i=0; i<n; i++)
>>   r += a[i];
>>
>> This is particularly nasty given that FP addition is not associative;
>> but even if you allow fast-math-style reassociation, doing this in
>> hardware seems to be quite a bit harder than the rest of VVM.
>
> From what I understand, while you can do reductions in a VVM loop, and
>it takes advantage of wide fetch etc., it doesn't auto parallelize the
>reduction, thus avoids the problem you mention. That does cost
>performance if the reduction could be parallelized, e.g. find the max
>value in an array.

My feeling is that, for max, it's relatively easy to perform a wide
reduction in hardware. For an FP addition that must give the same
result as the sequential code, it's probably much harder. Of course,
you can ask the programmer to write:

double r;
double r0=0.0;
....
double r15=0.0;
for (i=0; i<n-15; i+=16) {
  r0 += a[i];
  ...
  r15 += a[i+15];
}
.... deal with the remaining iterations ...
r = r0+...+r15;

But then the point of auto-vectorization is that programmers are
unaware of what's going on behind the curtain, and that promise is not
kept if they have to write code like the above.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Cray style vectors

<1ced67909718e4403a8610ed9daf1f99@www.novabbs.org>

https://news.novabbs.org/devel/article-flat.php?id=37404&group=comp.arch#37404
Date: Fri, 16 Feb 2024 18:35:01 +0000
Subject: Re: Cray style vectors
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
References: <upq0cr$6b5m$1@dont-email.me> <uqge2p$279ql$1@dont-email.me> <uqhiqb$2grub$1@dont-email.me> <uqlm2c$3e9bp$1@dont-email.me> <uqmn7c$3n35k$1@dont-email.me>
Organization: Rocksolid Light
Message-ID: <1ced67909718e4403a8610ed9daf1f99@www.novabbs.org>
 by: MitchAlsup1 - Fri, 16 Feb 2024 18:35 UTC

Quadibloc wrote:

> On Thu, 15 Feb 2024 19:44:27 +0100, Marcus wrote:
>> On 2024-02-14, Quadibloc wrote:

>>> But there's also one very bad thing about a vector register file.

>>> Like any register file, it has to be *saved* and *restored* under
>>> certain circumstances. Most especially, it has to be saved before,
>>> and restored after, other user-mode programs run, even if they
>>> aren't _expected_ to use vectors, as a program interrupted by
>>> a real-time-clock interrupt to let other users do stuff has to
>>> be able to *rely* on its registers all staying undisturbed, as if
>>> no interrupts happened.

>> Yes, that is the major drawback of a vector register file, so it has to
>> be dealt with somehow.

> Yes, and therefore I am looking into ways to deal with it somehow.

> Why not just use Mitch Alsup's wonderful VVM?

> It is true that the state of the art has advanced since the Cray I
> was first introduced. So, perhaps Mitch Alsup has indeed found,
> through improving data forwarding, as I understand it, a way to make
> the performance of a memory-memory vector machine (like the Control
> Data STAR-100) match that of one with vector registers (like the
> Cray I, which succeeded where the STAR-100 failed).

VVM on My 66000 remains a RISC ISA--what it does is provide an
implementation freedom to perform multiple loops (SIMD-style) at the
same time. CRAY nomenclature would call this "lanes".

> But because the historical precedent seems to indicate otherwise, and
> because while data forwarding is very definitely a good thing (and,
> indeed, necessary to have for best performance _on_ a vector register
> machine too) it has its limits.

> What _could_ substitute for vector registers isn't data forwarding,
> it's the cache, since that does the same thing vector registers do:
> it brings in vector operands closer to the CPU where they're more
> quickly accessible. So a STAR-100 with a *really good cache* as well
> as data forwarding could, I suppose, compete with a Cray I.

Cache buffers to be more precise.

> My first question, though, is whether or not we can really make caches
> that good.

Once a memory reference in a vectorized loop starts to miss, you quit
storing the data in the cache and just strip-mine it through the cache
buffers, avoiding polluting the DCache with data that would be displaced
before the loop completes.

> But skepticism about VVM isn't actually helpful if Cray-style vectors
> are now impossible to be made to work given current memory speeds.

> The basic way in which I originally felt I could make it work was really
> quite simple. The operating system, from privileged code, could set a
> bit in the PSW that turns on, or off, the ability to run instructions that
> access the vector registers.

> The details of how one may have to make use of that capability... well,
> that's software. So maybe the OS has to stipulate that one can only have
> one process at a time that uses these vectors - and that process has to
> run as a batch process!

> Hey, the GPU in a computer these days is also a singular resource.

> Having resources that have to be treated that way is not really what
> people are used to, but a computer that _can_ run your CFD codes
> efficiently is better than a computer that *can't* run your CFD codes.

> Given _that_, obviously if VVM is a better fit to the regular computer
> model, and it offers nearly the same performance, then what I should do
> is offer VVM or something very much like it _in addition_ to Cray-style
> vectors, so that the best possible vector performance for conventional
> non-batch programs is also available.

> Now, what would I think of as being "something very much like VVM" without
> actually being VVM?

> Basically, Mitch has his architecture designed for implementation on
> CPUs that are smart enough to notice certain combinations of instructions
> and execute them as though they're single instructions doing the same
> thing, which can then be executed more efficiently.

> So this makes those exact combinations part of the... ISA syntax...
> which I think is too hard for assembler programmers to remember, and
> I think it's also too hard for at least some implementors. I see it
> as asking for trouble in a way that I'd rather avoid.

> So my substitute for VVM should now be obvious - explicit memory-to-memory
> vector instructions, like on an old STAR-100.

Gasp........

> John Savard

Re: Cray style vectors

<16dcb6b6bc6d703cdd95c5f0aea5d164@www.novabbs.org>

https://news.novabbs.org/devel/article-flat.php?id=37405&group=comp.arch#37405
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Cray style vectors
Date: Fri, 16 Feb 2024 18:45:59 +0000
Organization: Rocksolid Light
Message-ID: <16dcb6b6bc6d703cdd95c5f0aea5d164@www.novabbs.org>
References: <upq0cr$6b5m$1@dont-email.me> <uqge2p$279ql$1@dont-email.me> <uqhiqb$2grub$1@dont-email.me> <uqlm2c$3e9bp$1@dont-email.me> <uqmn7c$3n35k$1@dont-email.me> <uqngut$3r1tr$1@dont-email.me>
 by: MitchAlsup1 - Fri, 16 Feb 2024 18:45 UTC

Marcus wrote:

> On 2024-02-16, Quadibloc wrote:
>> On Thu, 15 Feb 2024 19:44:27 +0100, Marcus wrote:
>>> On 2024-02-14, Quadibloc wrote:
>>
>>>> But there's also one very bad thing about a vector register file.
>>
>>>> Like any register file, it has to be *saved* and *restored* under
>>>> certain circumstances. Most especially, it has to be saved before,
>>>> and restored after, other user-mode programs run, even if they
>>>> aren't _expected_ to use vectors, as a program interrupted by
>>>> a real-time-clock interrupt to let other users do stuff has to
>>>> be able to *rely* on its registers all staying undisturbed, as if
>>>> no interrupts happened.
>>
>>> Yes, that is the major drawback of a vector register file, so it has to
>>> be dealt with somehow.
>>
>> Yes, and therefore I am looking into ways to deal with it somehow.
>>
>> Why not just use Mitch Alsup's wonderful VVM?
>>
>> It is true that the state of the art has advanced since the Cray I
>> was first introduced. So, perhaps Mitch Alsup has indeed found,
>> through improving data forwarding, as I understand it, a way to make
>> the performance of a memory-memory vector machine (like the Control
>> Data STAR-100) match that of one with vector registers (like the
>> Cray I, which succeeded where the STAR-100 failed).
>>
>> But because the historical precedent seems to indicate otherwise, and
>> because while data forwarding is very definitely a good thing (and,
>> indeed, necessary to have for best performance _on_ a vector register
>> machine too) it has its limits.
>>
>> What _could_ substitute for vector registers isn't data forwarding,
>> it's the cache, since that does the same thing vector registers do:
>> it brings in vector operands closer to the CPU where they're more
>> quickly accessible. So a STAR-100 with a *really good cache* as well
>> as data forwarding could, I suppose, compete with a Cray I.
>>
>> My first question, though, is whether or not we can really make caches
>> that good.
>>

> I think that you are missing some of the points that I'm trying to make.
> In my recent comments I have been talking about very low end machines,
> the kinds that can execute at most one instruction per clock cycle, or
> maybe less, and that may not even have a cache at all.

> I'm saying that I believe that within this category there is an
> opportunity for improving performance with very little cost by adding
> vector operations.

> E.g. imagine a non-pipelined implementation with a single memory port,
> shared by instruction fetch and data load/store, that requires perhaps
> two cycles to fetch and decode an instruction, and executes the
> instruction in the third cycle (possibly accessing the memory, which
> precludes fetching a new instruction until the fourth or even fifth
> cycle).

> Now imagine if a single instruction could iterate over several elements
> of a vector register. This would mean that the execution unit could
> execute up to one operation every clock cycle, approaching similar
> performance levels as a pipelined 1 CPI machine. The memory port would
> be free for data traffic as no new instructions have to be fetched
> during the vector loop. And so on.

You should think of it like:: VVM can execute as many operations per
cycle as it has function units. In particular, the low end machine
can execute a LD, and FMAC, and the ADD-CMP-BC loop terminator every
cycle. LDs operate at 128-bits wide, so one can execute a LD on even
cycles and a ST on odd cycles--giving 6-IPC on a 1 wide machine.

Bigger implementations can have more cache ports and more FMAC units;
and include "lanes" in SIMD-like fashion.

> Similarly, imagine a very simple strictly in-order pipelined
> implementation, where you have to resolve hazards by stalling the
> pipeline every time there is RAW hazard for instance, and you have to
> throw away cycles every time you mispredict a branch (which may be
> quite often if you only have a very primitive predictor).

> With vector operations you pause the front end (fetch and decode) while
> iterating over vector elements, which eliminates branch misprediction
> penalties. You also magically do away with RAW hazards as by the time
> you start issuing a new instruction the vector elements needed from the
> previous instruction have already been written to the register file.
> And of course you do away with loop overhead instructions (increment,
> compare, branch).

VVM does not use branch prediction--it uses a zero-loss ADD-CMP-BC
instruction I call LOOP.

And you do not have to lose precise exceptions, either.

> As a bonus, I believe that a vector solution like that would be more
> energy efficient, as less work has to be done for each operation than if
> you have to fetch and decode an instruction for every operation that you
> do.

More energy efficient per unit of work, but it draws more power because
it is running more data in less time.

> As I said, VVM has many similar properties, but I am currently exploring
> if a VRF solution can be made sufficiently cheap to be feasible in this
> very low end space, where I believe that VVM may be a bit too much (this
> assumption is mostly based on my own ignorance, so take it with a grain
> of salt).

> For reference, the microarchitectural complexity that I'm thinking about
> is comparable to FemtoRV32 by Bruno Levy (400 LOC, with comments):

> https://github.com/BrunoLevy/learn-fpga/blob/master/FemtoRV/RTL/PROCESSOR/femtorv32_quark.v

> /Marcus

Re: Cray style vectors

https://news.novabbs.org/devel/article-flat.php?id=37406&group=comp.arch#37406

Date: Fri, 16 Feb 2024 18:53:00 +0000
Subject: Re: Cray style vectors
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <upq0cr$6b5m$1@dont-email.me> <uqge2p$279ql$1@dont-email.me> <uqhiqb$2grub$1@dont-email.me> <uqlm2c$3e9bp$1@dont-email.me> <uqmn7c$3n35k$1@dont-email.me> <2024Feb16.082736@mips.complang.tuwien.ac.at> <uqnmue$3o4m9$1@dont-email.me>
Organization: Rocksolid Light
Message-ID: <aa3f01b9f6125e3c5f50117e1c67f6dd@www.novabbs.org>
 by: MitchAlsup1 - Fri, 16 Feb 2024 18:53 UTC

Stephen Fuld wrote:

> On 2/15/2024 11:27 PM, Anton Ertl wrote:
>> Quadibloc <quadibloc@servername.invalid> writes:
>>> Why not just use Mitch Alsup's wonderful VVM?
>>>
>>> It is true that the state of the art has advanced since the Cray I
>>> was first introduced. So, perhaps Mitch Alsup has indeed found,
>>> through improving data forwarding, as I understand it, a way to make
>>> the performance of a memory-memory vector machine (like the Control
>>> Data STAR-100) match that of one with vector registers (like the
>>> Cray I, which succeeded where the STAR-100 failed).
>>
>> I don't think that's a proper characterization of VVM. One advantage
>> that vector registers have over memory-memory machines is that vector
>> registers, once loaded, can be used several times. And AFAIK VVM has
>> that advantage, too. E.g., if you have the loop
>>
>> for (i=0; i<n; i++) {
>>   double b = a[i];
>>   c[i] = b;
>>   d[i] = b;
>> }
>>
>> a[i] is loaded only once (also in VVM), while a memory-memory
>> formulation would load a[i] twice. And on the microarchitectural
>> level, VVM may work with vector registers, but the nice part is that
>> it's only microarchitecture, and it avoids all the nasty consequences
>> of making it architectural, such as more expensive context switches.
>>
>>> Basically, Mitch has his architecture designed for implementation on
>>> CPUs that are smart enough to notice certain combinations of instructions
>>> and execute them as though they're single instructions doing the same
>>> thing, which can then be executed more efficiently.
>>
>> My understanding is that he requires explicit marking (why?),

Bookends on the loop provide the information the HW needs: the VEC
instruction at the top provides the IP for the LOOP instruction at
the bottom to branch to, and also provides a bit map of registers
which are live-out of the loop; other registers used in the loop are
discarded.

> Of course, Mitch can answer for himself, but ISTM that the explicit
> marking allows a more efficient implementation, specifically the
> instructions in the loop can be fetched and decoded only once, it allows
> the HW to elide some register writes, and saves an instruction by
> combining the loop count decrement and test and the return branch into a
> single instruction. Perhaps the HW could figure out all of that by
> analyzing a "normal" instruction stream, but that seems much harder.

All of that is correct.

>> and that
>> the loop can do almost anything, but (I think) it must be a simple
>> loop without further control structures.

> It allows predicated instructions within the loop

Predicated control flow--yes, branch flow-control no.

>> I think he also allows
>> recurrences (in particular, reductions), but I don't understand how
>> his hardware auto-vectorizes that; e.g.:
>>
>> double r=0.0;
>> for (i=0; i<n; i++)
>> r += a[i];
>>
>> This is particularly nasty given that FP addition is not associative;
>> but even if you allow fast-math-style reassociation, doing this in
>> hardware seems to be quite a bit harder than the rest of VVM.

> From what I understand, while you can do reductions in a VVM loop, and
> it takes advantage of wide fetch etc., it doesn't auto parallelize the
> reduction, thus avoids the problem you mention. That does cost
> performance if the reduction could be parallelized, e.g. find the max
> value in an array.

Right, register and memory dependencies are observed and obeyed. So,
in the above loop, the recurrence slows the loop down to the latency of
FADD, but the LD and ADD-CMP-BC run concurrently; so you are still
faster than if you did not VVM the loop.

Re: Cray style vectors

https://news.novabbs.org/devel/article-flat.php?id=37407&group=comp.arch#37407

Date: Fri, 16 Feb 2024 18:57:11 +0000
Subject: Re: Cray style vectors
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <upq0cr$6b5m$1@dont-email.me> <uqge2p$279ql$1@dont-email.me> <uqhiqb$2grub$1@dont-email.me> <uqlm2c$3e9bp$1@dont-email.me> <uqmn7c$3n35k$1@dont-email.me> <2024Feb16.082736@mips.complang.tuwien.ac.at> <uqnmue$3o4m9$1@dont-email.me> <2024Feb16.152320@mips.complang.tuwien.ac.at>
Organization: Rocksolid Light
Message-ID: <bb3e2eddf0b4bcab4d263426b62d8567@www.novabbs.org>
 by: MitchAlsup1 - Fri, 16 Feb 2024 18:57 UTC

Anton Ertl wrote:

> Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>>On 2/15/2024 11:27 PM, Anton Ertl wrote:
>>> Quadibloc <quadibloc@servername.invalid> writes:
>>>> Basically, Mitch has his architecture designed for implementation on
>>>> CPUs that are smart enough to notice certain combinations of instructions
>>>> and execute them as though they're single instructions doing the same
>>>> thing, which can then be executed more efficiently.
>>>
>>> My understanding is that he requires explicit marking (why?),
>>
>>Of course, Mitch can answer for himself, but ISTM that the explicit
>>marking allows a more efficient implementation, specifically the
>>instructions in the loop can be fetched and decoded only once, it allows
>>the HW to elide some register writes, and saves an instruction by
>>combining the loop count decrement and test and the return branch into a
>>single instruction. Perhaps the HW could figure out all of that by
>>analyzing a "normal" instruction stream, but that seems much harder.

> Compared to the rest of the VVM stuff, recognizing it in hardware does
> not add much difficulty. Maybe we'll see it in some Intel or AMD CPU
> in the coming years.

>>> and that
>>> the loop can do almost anything, but (I think) it must be a simple
>>> loop without further control structures.
>>
>>It allows predicated instructions within the loop

> Sure, predication is not a control structure.

>>> I think he also allows
>>> recurrences (in particular, reductions), but I don't understand how
>>> his hardware auto-vectorizes that; e.g.:
>>>
>>> double r=0.0;
>>> for (i=0; i<n; i++)
>>> r += a[i];
>>>
>>> This is particularly nasty given that FP addition is not associative;
>>> but even if you allow fast-math-style reassociation, doing this in
>>> hardware seems to be quite a bit harder than the rest of VVM.
>>
>> From what I understand, while you can do reductions in a VVM loop, and
>>it takes advantage of wide fetch etc., it doesn't auto parallelize the
>>reduction, thus avoids the problem you mention. That does cost
>>performance if the reduction could be parallelized, e.g. find the max
>>value in an array.

> My feeling is that, for max it's relatively easy to perform a wide
> reduction in hardware. For FP addition that should give the same
> result as the sequential code, it's probably much harder. Of course,
> you can ask the programmer to write:

> double r;
> double r0=0.0;
> ....
> double r15=0.0;
> for (i=0; i<n-15; i+=16) {
> r0 += a[i];
> ...
> r15 += a[i+15];
> }
> .... deal with the remaining iterations ...
> r = r0+...+r15;

> But then the point of auto-vectorization is that the programmers are
> unaware of what's going on behind the curtain, and that promise is not
> kept if they have to write code like above.

VVM is also adept at vectorizing str* and mem* functions from the C
library, and as such, you have to do it in a way that even ISRs can
use VVM (when it is to their advantage).

> - anton

Re: Cray style vectors

https://news.novabbs.org/devel/article-flat.php?id=37408&group=comp.arch#37408

From: sfuld@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Cray style vectors
Date: Fri, 16 Feb 2024 11:03:26 -0800
Organization: A noiseless patient Spider
Lines: 100
Message-ID: <uqobhv$3o4m9$2@dont-email.me>
References: <upq0cr$6b5m$1@dont-email.me> <uqge2p$279ql$1@dont-email.me>
<uqhiqb$2grub$1@dont-email.me> <uqlm2c$3e9bp$1@dont-email.me>
<uqmn7c$3n35k$1@dont-email.me> <2024Feb16.082736@mips.complang.tuwien.ac.at>
<uqnmue$3o4m9$1@dont-email.me> <2024Feb16.152320@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
User-Agent: Mozilla Thunderbird
Content-Language: en-US
In-Reply-To: <2024Feb16.152320@mips.complang.tuwien.ac.at>
 by: Stephen Fuld - Fri, 16 Feb 2024 19:03 UTC

On 2/16/2024 6:23 AM, Anton Ertl wrote:
> Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>> On 2/15/2024 11:27 PM, Anton Ertl wrote:
>>> Quadibloc <quadibloc@servername.invalid> writes:
>>>> Basically, Mitch has his architecture designed for implementation on
>>>> CPUs that are smart enough to notice certain combinations of instructions
>>>> and execute them as though they're single instructions doing the same
>>>> thing, which can then be executed more efficiently.
>>>
>>> My understanding is that he requires explicit marking (why?),
>>
>> Of course, Mitch can answer for himself, but ISTM that the explicit
>> marking allows a more efficient implementation, specifically the
>> instructions in the loop can be fetched and decoded only once, it allows
>> the HW to elide some register writes, and saves an instruction by
>> combining the loop count decrement and test and the return branch into a
>> single instruction. Perhaps the HW could figure out all of that by
>> analyzing a "normal" instruction stream, but that seems much harder.
>
> Compared to the rest of the VVM stuff, recognizing it in hardware does
> not add much difficulty.

IANAHG, but if it were that simple, I would think Mitch would have
implemented it that way.

> Maybe we'll see it in some Intel or AMD CPU
> in the coming years.

One can hope!

>
>>> and that
>>> the loop can do almost anything, but (I think) it must be a simple
>>> loop without further control structures.
>>
>> It allows predicated instructions within the loop
>
> Sure, predication is not a control structure.

OK, but my point is that you can do conditional execution within a VVM loop.

>>> I think he also allows
>>> recurrences (in particular, reductions), but I don't understand how
>>> his hardware auto-vectorizes that; e.g.:
>>>
>>> double r=0.0;
>>> for (i=0; i<n; i++)
>>> r += a[i];
>>>
>>> This is particularly nasty given that FP addition is not associative;
>>> but even if you allow fast-math-style reassociation, doing this in
>>> hardware seems to be quite a bit harder than the rest of VVM.
>>
>> From what I understand, while you can do reductions in a VVM loop, and
>> it takes advantage of wide fetch etc., it doesn't auto parallelize the
>> reduction, thus avoids the problem you mention. That does cost
>> performance if the reduction could be parallelized, e.g. find the max
>> value in an array.
>
> My feeling is that, for max it's relatively easy to perform a wide
> reduction in hardware.

Sure. ISTM, and again, IANAHG, that the problem for VVM is the hardware
recognizing that the loop contains no instructions that can't be
parallelized. There are also some issues like doing a sum of signed
integer values and knowing whether overflow occurred, etc. The
programmer may know that overflow cannot occur, but the HW doesn't.

> For FP addition that should give the same
> result as the sequential code, it's probably much harder. Of course,
> you can ask the programmer to write:
>
> double r;
> double r0=0.0;
> ...
> double r15=0.0;
> for (i=0; i<n-15; i+=16) {
> r0 += a[i];
> ...
> r15 += a[i+15];
> }
> ... deal with the remaining iterations ...
> r = r0+...+r15;
>
> But then the point of auto-vectorization is that the programmers are
> unaware of what's going on behind the curtain, and that promise is not
> kept if they have to write code like above.

Agreed.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Cray style vectors

https://news.novabbs.org/devel/article-flat.php?id=37409&group=comp.arch#37409

From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Cray style vectors
References: <upq0cr$6b5m$1@dont-email.me> <uqge2p$279ql$1@dont-email.me> <uqhiqb$2grub$1@dont-email.me> <uqlm2c$3e9bp$1@dont-email.me> <uqmn7c$3n35k$1@dont-email.me> <uqngut$3r1tr$1@dont-email.me> <16dcb6b6bc6d703cdd95c5f0aea5d164@www.novabbs.org>
In-Reply-To: <16dcb6b6bc6d703cdd95c5f0aea5d164@www.novabbs.org>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 18
Message-ID: <AGOzN.80209$SyNd.74562@fx33.iad>
Date: Fri, 16 Feb 2024 14:27:19 -0500
 by: EricP - Fri, 16 Feb 2024 19:27 UTC

MitchAlsup1 wrote:
>
> You should think of it like:: VVM can execute as many operations per
> cycle as it has function units. In particular, the low end machine
> can execute a LD, and FMAC, and the ADD-CMP-BC loop terminator every
> cycle. LDs operate at 128-bits wide, so one can execute a LD on even
> cycles and a ST on odd cycles--giving 6-IPC on a 1 wide machine.
>
> Bigger implementations can have more cache ports and more FMAC units;
> and include "lanes" in SIMD-like fashion.

Regarding the 128-bit LD and ST, are you saying the LSQ recognizes
two consecutive 64-bit LD or ST to consecutive addresses and merges
them into a single cache access?
Is that done by disambiguation logic, checking for same cache line access?

Re: Cray style vectors

https://news.novabbs.org/devel/article-flat.php?id=37416&group=comp.arch#37416

From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Cray style vectors
Date: Fri, 16 Feb 2024 16:18:47 -0600
Organization: A noiseless patient Spider
Lines: 394
Message-ID: <uqon2i$1sp9$1@dont-email.me>
References: <upq0cr$6b5m$1@dont-email.me> <uqge2p$279ql$1@dont-email.me>
<uqhiqb$2grub$1@dont-email.me> <uqlm2c$3e9bp$1@dont-email.me>
<uqmn7c$3n35k$1@dont-email.me> <uqngut$3r1tr$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
User-Agent: Mozilla Thunderbird
In-Reply-To: <uqngut$3r1tr$1@dont-email.me>
Content-Language: en-US
 by: BGB - Fri, 16 Feb 2024 22:18 UTC

On 2/16/2024 5:29 AM, Marcus wrote:
> On 2024-02-16, Quadibloc wrote:
>> On Thu, 15 Feb 2024 19:44:27 +0100, Marcus wrote:
>>> On 2024-02-14, Quadibloc wrote:
>>
>>>> But there's also one very bad thing about a vector register file.
>>
>>>> Like any register file, it has to be *saved* and *restored* under
>>>> certain circumstances. Most especially, it has to be saved before,
>>>> and restored after, other user-mode programs run, even if they
>>>> aren't _expected_ to use vectors, as a program interrupted by
>>>> a real-time-clock interrupt to let other users do stuff has to
>>>> be able to *rely* on its registers all staying undisturbed, as if
>>>> no interrupts happened.
>>
>>> Yes, that is the major drawback of a vector register file, so it has to
>>> be dealt with somehow.
>>
>> Yes, and therefore I am looking into ways to deal with it somehow.
>>
>> Why not just use Mitch Alsup's wonderful VVM?
>>
>> It is true that the state of the art has advanced since the Cray I
>> was first introduced. So, perhaps Mitch Alsup has indeed found,
>> through improving data forwarding, as I understand it, a way to make
>> the performance of a memory-memory vector machine (like the Control
>> Data STAR-100) match that of one with vector registers (like the
>> Cray I, which succeeded where the STAR-100 failed).
>>
>> But the historical precedent seems to indicate otherwise, and
>> while data forwarding is very definitely a good thing (and, indeed,
>> necessary for best performance _on_ a vector register machine too),
>> it has its limits.
>>
>> What _could_ substitute for vector registers isn't data forwarding,
>> it's the cache, since that does the same thing vector registers do:
>> it brings in vector operands closer to the CPU where they're more
>> quickly accessible. So a STAR-100 with a *really good cache* as well
>> as data forwarding could, I suppose, compete with a Cray I.
>>
>> My first question, though, is whether or not we can really make caches
>> that good.
>>
>
> I think that you are missing some of the points that I'm trying to make.
> In my recent comments I have been talking about very low end machines,
> the kinds that can execute at most one instruction per clock cycle, or
> maybe less, and that may not even have a cache at all.
>
> I'm saying that I believe that within this category there is an
> opportunity for improving performance with very little cost by adding
> vector operations.
>
> E.g. imagine a non-pipelined implementation with a single memory port,
> shared by instruction fetch and data load/store, that requires perhaps
> two cycles to fetch and decode an instruction, and executes the
> instruction in the third cycle (possibly accessing the memory, which
> precludes fetching a new instruction until the fourth or even fifth
> cycle).
>
> Now imagine if a single instruction could iterate over several elements
> of a vector register. This would mean that the execution unit could
> execute up to one operation every clock cycle, approaching similar
> performance levels as a pipelined 1 CPI machine. The memory port would
> be free for data traffic as no new instructions have to be fetched
> during the vector loop. And so on.
>

I guess possible.

Ironically, the first designs I did were pipelined, except some early
versions of my core did not use pipelined memory access (so, Load/Store
would have been a lot more expensive), in combination with a painfully
slow bus design.

> Similarly, imagine a very simple strictly in-order pipelined
> implementation, where you have to resolve hazards by stalling the
> pipeline every time there is RAW hazard for instance, and you have to
> throw away cycles every time you mispredict a branch (which may be
> quite often if you only have a very primitive predictor).
>

Yeah, the above is more like my core design...

> With vector operations you pause the front end (fetch and decode) while
> iterating over vector elements, which eliminates branch misprediction
> penalties. You also magically do away with RAW hazards as by the time
> you start issuing a new instruction the vector elements needed from the
> previous instruction have already been written to the register file.
> And of course you do away with loop overhead instructions (increment,
> compare, branch).
>
> As a bonus, I believe that a vector solution like that would be more
> energy efficient, as less work has to be done for each operation than if
> you have to fetch and decode an instruction for every operation that you
> do.
>

FWIW:
The first version of SIMD operations I did was to feed each vector
element through the main FPU.

So:
Scalar, 6-cycles, with 5-cycle internal latency, 1 overhead cycle.
2-wide SIMD, 8 cycles, 1+6+1
4-wide SIMD, 10 cycles, 1+8+1
The relative LUT cost increase of supporting SIMD in this way is fairly
low (and, still worthwhile, as it is faster than non-pipelined FPU ops).

For the later low-precision unit, latency was reduced by implementing
low-precision FADD/FMUL units with a 3-cycle latency, and running 4 sets
in parallel. This is not cheap, but faster.

Another feature was to allow the low-precision unit to accept
double-precision values, though processing them at low precision.

This was the incentive for adding FADDA/FSUBA/FMULA, since these can do
low-precision ops on Binary64 with a 3-cycle latency (with the
tradeoff of being partial precision and truncate-only).

Though, through experimentation, I noted that for single-precision
stuff one needs at least full single precision; say, S.E8.F16.Z7 is not
sufficient...
Both Quake and my experimental BGBTech3 engine started breaking in
obvious ways with a 16-bit mantissa for "float".

Internally, IIRC, I ended up using a 25 bit mantissa, as this was just
enough to get Binary32 working correctly.

Though, the low-precision unit doesn't currently support integer
conversion, where it can be noted that the mantissa needs to be wide
enough to handle the widest supported integer type (so, say, if one
wants Int32, one also needs a 32-bit mantissa).

Going from my original SIMD implementation to a vector implementation
would likely require some mechanism for the FPU to access memory. This,
however, is the "evil" part.

Also, annoyingly, adding memory ports is not as easy as simply adding
duplicate L1 caches, as then one has to deal with memory coherence
between them.

Similarly, Block-RAM arrays don't support multiple read ports, so
adding a second port effectively means duplicating all of the Block-RAM
for the L1 cache (a horrible waste).

Though, simply mirroring the L1 contents across both ports would be the
simplest approach...

One possible way to have a multi-port L1 cache might be, say:
Add an L0 cache level, which deals with the Load/Store mechanism;
Natively caches a small number of cache lines (say, 2..8);
Would likely need to be fully-associative, for "reasons";
A L0 Miss would be handled by initiating a request to the L1 cache.
Each memory port gets its own L0 cache; the L0 caches arbitrate
access to the L1 cache.
A store into an L0 cache causes it to signal the others about it.
Each L0 cache will then need to flush the offending cache-line.

The obvious drawbacks:
More expensive;
One needs to effectively hold several cache lines of data in FFs.
Will add latency.
2x 4-way lines, would miss often;
Will add 1+ cycles for every L0 miss.

One other considered variant being:
The Lane1 port accessed the L1 directly (Does Load/Store);
The Lane2 port has a small asymmetric cache, and only does Load.
Store via Lane 1 port invalidates lines in the Lane 2 L0.
In this case, the Lane2 cache could be bigger and use LUTRAM.
Say, 2x 32-lines, 1K.

However, the harder part is how to best arbitrate misses to the Lane 2 port.

IIRC, I had experimentally implemented this latter strategy, with an
approach like:
A Lane 2 miss signals a cache-miss as before.
During a Lane 2 miss, the Lane 1 array-fetch is redirected to Lane 2's
index;
If Lane 2's miss is serviced from Lane 1, it is copied over to Lane 2;
Else, a L1 miss happens as before, and the line was added to both the
main L1 cache, and the Lane 2 sub-cache.

I think one other idea I had considered was to implement a 2-way
set-associative L1 (rather than direct-mapped), but then allow dual-lane
memory access to use each way of the set independently (in the case of a
single memory access, it functions as a 2-way cache).

If an L1 miss occurs, both "ways" will be brought to look at the same
index (corresponding to the lane that had missed), and then the L1 miss
handling would do its thing (either resolving within the L1 cache, or
sending a request out on the bus, loading the cache-line into the way
associated with the lane that had missed).


Re: Cray style vectors

https://news.novabbs.org/devel/article-flat.php?id=37417&group=comp.arch#37417

From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Cray style vectors
Date: Fri, 16 Feb 2024 23:22:08 +0000
Organization: novaBBS
Message-ID: <1067c5b46cebaa18a0fc50fc423aa86a@www.novabbs.com>
References: <upq0cr$6b5m$1@dont-email.me> <uqge2p$279ql$1@dont-email.me> <uqhiqb$2grub$1@dont-email.me> <uqlm2c$3e9bp$1@dont-email.me> <uqmn7c$3n35k$1@dont-email.me> <2024Feb16.082736@mips.complang.tuwien.ac.at> <uqnmue$3o4m9$1@dont-email.me> <2024Feb16.152320@mips.complang.tuwien.ac.at> <uqobhv$3o4m9$2@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
 by: MitchAlsup - Fri, 16 Feb 2024 23:22 UTC

Stephen Fuld wrote:

> On 2/16/2024 6:23 AM, Anton Ertl wrote:
>> Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>>> On 2/15/2024 11:27 PM, Anton Ertl wrote:
>>>> Quadibloc <quadibloc@servername.invalid> writes:
>>>>> Basically, Mitch has his architecture designed for implementation on
>>>>> CPUs that are smart enough to notice certain combinations of instructions
>>>>> and execute them as though they're single instructions doing the same
>>>>> thing, which can then be executed more efficiently.
>>>>
>>>> My understanding is that he requires explicit marking (why?),
>>>
>>> Of course, Mitch can answer for himself, but ISTM that the explicit
>>> marking allows a more efficient implementation, specifically the
>>> instructions in the loop can be fetched and decoded only once, it allows
>>> the HW to elide some register writes, and saves an instruction by
>>> combining the loop count decrement and test and the return branch into a
>>> single instruction. Perhaps the HW could figure out all of that by
>>> analyzing a "normal" instruction stream, but that seems much harder.
>>
>> Compared to the rest of the VVM stuff, recognizing it in hardware does
>> not add much difficulty.

> IANAHG, but if it were that simple, I would think Mitch would have
> implemented it that way.

>> Maybe we'll see it in some Intel or AMD CPU
>> in the coming years.

> One can hope!

>>
>>>> and that
>>>> the loop can do almost anything, but (I think) it must be a simple
>>>> loop without further control structures.
>>>
>>> It allows predicated instructions within the loop
>>
>> Sure, predication is not a control structure.

> OK, but my point is that you can do conditional execution within a VVM loop.

>>>> I think he also allows
>>>> recurrences (in particular, reductions), but I don't understand how
>>>> his hardware auto-vectorizes that; e.g.:
>>>>
>>>> double r=0.0;
>>>> for (i=0; i<n; i++)
>>>> r += a[i];
>>>>
>>>> This is particularly nasty given that FP addition is not associative;
>>>> but even if you allow fast-math-style reassociation, doing this in
>>>> hardware seems to be quite a bit harder than the rest of VVM.
>>>
>>> From what I understand, while you can do reductions in a VVM loop, and
>>> it takes advantage of wide fetch etc., it doesn't auto parallelize the
>>> reduction, thus avoids the problem you mention. That does cost
>>> performance if the reduction could be parallelized, e.g. find the max
>>> value in an array.
>>
>> My feeling is that, for max it's relatively easy to perform a wide
>> reduction in hardware.

> Sure. ISTM, and again, IANAHG, that the problem for VVM is the hardware
> recognizing that the loop contains no instructions that can't be
> parallelized. There are also some issues like doing a sum of signed
> integer values and knowing whether overflow occurred, etc. The
> programmer may know that overflow cannot occur, but the HW doesn't.

The HW does not need preceding knowledge. If an exception happens, the
vectorized loop collapses into a scalar loop precisely, and can be
handled in the standard fashion.

>> For FP addition that should give the same
>> result as the sequential code, it's probably much harder. Of course,
>> you can ask the programmer to write:
>>
>> double r;
>> double r0=0.0;
>> ...
>> double r15=0.0;
>> for (i=0; i<n-15; i+=16) {
>> r0 += a[i];
>> ...
>> r15 += a[i+15];
>> }
>> ... deal with the remaining iterations ...
>> r = r0+...+r15;
>>
>> But then the point of auto-vectorization is that the programmers are
>> unaware of what's going on behind the curtain, and that promise is not
>> kept if they have to write code like above.

> Agreed.

Re: Cray style vectors

https://news.novabbs.org/devel/article-flat.php?id=37418&group=comp.arch#37418

 by: MitchAlsup - Fri, 16 Feb 2024 23:34 UTC

EricP wrote:

> MitchAlsup1 wrote:
>>
>> You should think of it like:: VVM can execute as many operations per
>> cycle as it has function units. In particular, the low end machine
>> can execute a LD, and FMAC, and the ADD-CMP-BC loop terminator every
>> cycle. LDs operate at 128-bits wide, so one can execute a LD on even
>> cycles and a ST on odd cycles--giving 6-IPC on a 1 wide machine.
>>
>> Bigger implementations can have more cache ports and more FMAC units;
>> and include "lanes" in SIMD-like fashion.

> Regarding the 128-bit LD and ST, are you saying the LSQ recognizes
> two consecutive 64-bit LD or ST to consecutive addresses and merges
> them into a single cache access?

First: memory is inherently misaligned in the My 66000 architecture. So, since
the width of the machine is 64 bits, we read or write in 128-bit quantities
so that we have enough bits to extract misaligned data from, or a container
large enough to store a 64-bit value into. {{And there are all the associated
corner cases}}

Second: over in VVM-land, the implementation can decide to read and write
wider, but is architecturally constrained not to shrink below 128-bits.

A 1-wide My66160 would read pairs of double-precision FP values, or quads
of 32-bit values, octets of 16-bit values, and sixteens of 8-bit values.
This supports loops of 6 IPC or greater in a 1-wide machine. This machine
would process suitable loops at 128-bits per cycle--depending on "other
things" that are generally allowable.

A 6-wide My66650 would read a cache line at a time, and has 3 cache ports
per cycle. This supports 20 IPC or greater in the 6-wide machine. As many as
8 DP FP calculations per cycle are possible, with adequate LD/ST bandwidths
to support this rate.

> Is that done by disambiguation logic, checking for same cache line access?

As I have said before, the front end observes the first iteration of the
loop and makes some determinations as to how wide the loop can be run on
the machine at hand. Among those observations are whether memory addresses
are dense, whether they all go in the same direction, and which registers
carry loop-to-loop dependencies.

Re: Cray style vectors

https://news.novabbs.org/devel/article-flat.php?id=37430&group=comp.arch#37430

 by: BGB - Sat, 17 Feb 2024 03:37 UTC

On 2/16/2024 6:27 AM, Marcus wrote:
> On 2024-02-16, Michael S wrote:
>> On Fri, 16 Feb 2024 12:37:55 +0100
>> Marcus <m.delete@this.bitsnbites.eu> wrote:
>>
>>>
>>> Then what would you call it?
>>>
>>> I just use the term "Cray-style" to differentiate the style of vector
>>> ISA from explicit SIMD ISA:s, GPU-style vector ISA:s and STAR-style
>>> memory-memory vector ISA:s, etc.
>>>
>>> /Marcus
>>
>> I'd call it a variant of SIMD.
>> For me everything with vector register width to ALU width ratio <= 4 is
>> SIMD. 8 is borderline, above 8 is vector.
>> It means that sometimes I classify by implementation instead of by
>> architecture which in theory is problematic. But I don't care, I am not
>> in academy.
>
> Ok, I am generally talking about the ISA, which dictates the semantics
> and what kind of implementations that are possible (or at least
> feasible).
>
> For my current MRISC32-A1 implementation, the vector register width to
> ALU width ratio is 16, so it would definitely qualify as "vector" then.
>
> The ISA is designed, however, to support wider execution, but the idea
> is to *not require* very wide execution, but rather encourage sequential
> execution (up to a point where things like hazard resolution become
> less of a problem and OoO is not really a necessity for high
> throughput).
>

Hmm, in my case the ratio would be 1 or 2:
64-bit ALU, 128-bit via the ALUX extension;
64 or 128 bit vectors, via 1 or 2 GPRs, 2 or 4 elements.

But, generally I consider it as SIMD:
Int16 , Int32
Binary16 , Binary32

For Int cases:
No packed byte;
No saturate;
Various converter ops / etc.

Some Packed-Byte use-cases can be faked with packed Int16 ops, but
packed-byte as a working ALU type was not particularly compelling (and,
ironically, having byte ops would have led to a stronger need for
saturating arithmetic).

But, eliminating these sorts of cases significantly reduces the number
of implied SIMD instructions that need to exist.

So, one might instead have converter ops, say:
4x Signed Byte -> 4x Int16 (Sign Extend)
4x Unsigned Byte -> 4x Int16 (Zero Extend)
4x Byte -> 4x Int16 (Copy into high bits, zero low bits).
4x Int16 -> 4x Byte (Low bits)
4x Int16 -> 4x Byte (High bits)

Some converters for middle bits could make sense, say:
zzzz-zzzz -> 0000-zzzz-zzzz-0000
But, not added as of yet.

For RGBA, I had typically been using the high-bits / unit-range
approach, as this is better for interpolation, but it currently has no
way to deal with overflow.

One strategy is to map 8->16 bit values as:
zzzz-zzzz -> zzzz-zzzz-1000-0000
Which provided enough space on the upper and lower end that interpolated
values were unlikely to go out of range.

Besides ADD/SUB, there are packed multiply cases for Int16:
Low bits (modulo);
High bits (sign extended);
High bits (zero extended).

FP-SIMD:
PADD.H, PSUB.H, PMUL.H
PADD.F, PSUB.F, PMUL.F
PADDX.F, PSUBX.F, PMULX.F
And, various converter ops.

Unlike SSE, I did not add redundant operations based on element type, if
the observable effect was the same. For example, things like
MOV/Load/Store/Shuffle/etc, do not need to care about the expected
element types.

Many of the ops end up being converter ops, but I did not go down the
N^2 path of converter ops; rather, the ops are generally only between
associated formats (such as Binary16<->Binary32), and if one needs to
get between more distantly related or unrelated element types, they need
to chain conversions.

This generally keeps the complexity more manageable.

For example, with the available instructions, it is possible to support
A-Law SIMD vectors in memory.

But, am I going to go and add dedicated SIMD instructions for doing math
on A-Law vectors? No. These sorts of cases can be left to converter ops
(and, say, if one needs to chain multiple instructions to get from
Packed Int16 to A-Law; they can at least try to be happy that it is a
lot faster to use a chain of SIMD converter ops to do this, than it is
to do Int -> A-Law via a function call and normal integer code).

Though, some of this is relevant partly because my audio hardware design
was primarily using A-Law (allows for quality closer to 16-bit PCM at a
storage size comparable to 8-bit PCM).

But, the same operations can be used to support unit-range
floating-point vectors in A-Law format if one feels so inclined:
S.E3.F4, Bias=8;
Can represent roughly +/- 1.0;
No concept of NaN or Inf;
Theoretically, the format has denormals, but these converters ignore them;
rather, 0 is treated as a special case.
Smallest non-zero value: ~ 0.0042
Though, within WAV files or similar, it is usually stored XOR'ed with
0x55, but this isn't a big issue.

For audio, I had also used ADPCM a lot in the past, but ADPCM is a bit
more involved (and isn't great for direct use by hardware).

Had come up with a "simple" small-block audio format, but quality wasn't
great, say:
Encode a Start/End sample (A-Law), Center Exponent, and Side Exponent;
Use 2-bits/sample to encode center channel relative to line;
Use 2 bits/sample (subsampled) to encode a side channel.
Did have the merit though that one could extract sample values at
arbitrary sample positions with hardware, at a similar bitrate to ADPCM
(say, 16 mono|stereo samples per 64-bit block).

Had imagined this partly as a format for things like tracker or
wavetable mixing, but ended up mostly still just using A-Law
(which had the advantage of better audio quality).

....

> /Marcus

Re: Cray style vectors

https://news.novabbs.org/devel/article-flat.php?id=37431&group=comp.arch#37431

 by: Quadibloc - Sat, 17 Feb 2024 04:30 UTC

On Fri, 16 Feb 2024 18:35:01 +0000, MitchAlsup1 wrote:
> Quadibloc wrote:

>> So my substitute for VVM should now be obvious - explicit memory-to-memory
>> vector instructions, like on an old STAR-100.

> Gasp........

Oh, dear. But, yes, old-style memory-to-memory vector instructions omit
at least one very important thing that VVM provides, which I do indeed
want to make sure I include.

So there would need to be instructions like

multiply v1 by v2 giving scratch-1
add scratch-1 to scratch-2 giving scratch-3
divide scratch-2 by v1 giving v4

.... that is, instead of vector registers, there would still be another
kind of thing that isn't a vector in memory, but instead an *explicit*
reference to a forwarding node.

And so these vector instructions would have to be in explicitly
delimited groups (since forwarding nodes, unlike vector registers, aren't
intended to be _persistent_, so a group of vector instructions would have
to combine into a clause which for some purposes acts like a single
instruction)... which then makes it look a whole lot _more_ like VVM,
even though the inside of the sandwich is now special instructions,
instead of ordinary arithmetic instructions as in VVM.

I think there _may_ have been something like this already in the
original Concertina.

John Savard

Re: Cray style vectors

https://news.novabbs.org/devel/article-flat.php?id=37433&group=comp.arch#37433

 by: Stephen Fuld - Sat, 17 Feb 2024 07:33 UTC

On 2/16/2024 3:22 PM, MitchAlsup wrote:
> Stephen Fuld wrote:
>
>> On 2/16/2024 6:23 AM, Anton Ertl wrote:
>>> Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>>>> On 2/15/2024 11:27 PM, Anton Ertl wrote:

snip

>>>>> I think he also allows
>>>>> recurrences (in particular, reductions), but I don't understand how
>>>>> his hardware auto-vectorizes that; e.g.:
>>>>>
>>>>> double r=0.0;
>>>>> for (i=0; i<n; i++)
>>>>>     r += a[i];
>>>>>
>>>>> This is particularly nasty given that FP addition is not associative;
>>>>> but even if you allow fast-math-style reassociation, doing this in
>>>>> hardware seems to be quite a bit harder than the rest of VVM.
>>>>
>>>>  From what I understand, while you can do reductions in a VVM loop, and
>>>> it takes advantage of wide fetch etc., it doesn't auto parallelize the
>>>> reduction, thus avoids the problem you mention.  That does cost
>>>> performance if the reduction could be parallelized, e.g. find the max
>>>> value in an array.
>>>
>>> My feeling is that, for max it's relatively easy to perform a wide
>>> reduction in hardware.
>
>> Sure.  ISTM, and again, IANAHG, that the problem for VVM is the
>> hardware recognizing that the loop contains no instructions that can't
>> be parallelized.  There are also some issues like doing a sum of
>> signed integer values and knowing whether overflow occurred, etc.  The
>> programmer may know that overflow cannot occur, but the HW doesn't.
>
> The HW does not need preceding knowledge. If an exception happens, the
> vectorized loop collapses into a scalar loop precisely, and can be
> handled in the standard fashion.

I think you might have missed my point. If you are summing the signed
integer elements of an array, whether you get an overflow or not can
depend on the order the additions are done. Thus, without knowledge
that only the programmer has (i.e. that with the size of the actual data
used, overflow is impossible) the hardware cannot parallelize such an
operation. If the programmer knows that overflow cannot occur, he has
no way to communicate that to the VVM hardware, such that the HW could
parallelize the summation.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Cray style vectors

https://news.novabbs.org/devel/article-flat.php?id=37434&group=comp.arch#37434

 by: Terje Mathisen - Sat, 17 Feb 2024 09:20 UTC

BGB wrote:
> On 2/16/2024 5:29 AM, Marcus wrote:
>> I'm saying that I believe that within this category there is an
>> opportunity for improving performance with very little cost by adding
>> vector operations.
>>
>> E.g. imagine a non-pipelined implementation with a single memory port,
>> shared by instruction fetch and data load/store, that requires perhaps
>> two cycles to fetch and decode an instruction, and executes the
>> instruction in the third cycle (possibly accessing the memory, which
>> precludes fetching a new instruction until the fourth or even fifth
>> cycle).
>>
>> Now imagine if a single instruction could iterate over several elements
>> of a vector register. This would mean that the execution unit could
>> execute up to one operation every clock cycle, approaching similar
>> performance levels as a pipelined 1 CPI machine. The memory port would
>> be free for data traffic as no new instructions have to be fetched
>> during the vector loop. And so on.
>>
>
> I guess possible.

Absolutely possible. After all, the IBM block move and all the 1978 x86
string ops were designed to run as an internal, interruptible loop. No
need to load more instructions; just let the internal state machine run
until completion.

The current state of the art (i.e. VVM) is of course far more capable,
but the original idea is old.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Cray style vectors

https://news.novabbs.org/devel/article-flat.php?id=37435&group=comp.arch#37435

 by: Terje Mathisen - Sat, 17 Feb 2024 09:34 UTC

Stephen Fuld wrote:
> On 2/16/2024 3:22 PM, MitchAlsup wrote:
>> Stephen Fuld wrote:
>>
>>> On 2/16/2024 6:23 AM, Anton Ertl wrote:
>>>> Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>>>>> On 2/15/2024 11:27 PM, Anton Ertl wrote:
>
> snip
>
>>>>>> I think he also allows
>>>>>> recurrences (in particular, reductions), but I don't understand how
>>>>>> his hardware auto-vectorizes that; e.g.:
>>>>>>
>>>>>> double r=0.0;
>>>>>> for (i=0; i<n; i++)
>>>>>>     r += a[i];
>>>>>>
>>>>>> This is particularly nasty given that FP addition is not associative;
>>>>>> but even if you allow fast-math-style reassociation, doing this in
>>>>>> hardware seems to be quite a bit harder than the rest of VVM.
>>>>>
>>>>>  From what I understand, while you can do reductions in a VVM
>>>>> loop, and
>>>>> it takes advantage of wide fetch etc., it doesn't auto parallelize the
>>>>> reduction, thus avoids the problem you mention.  That does cost
>>>>> performance if the reduction could be parallelized, e.g. find the max
>>>>> value in an array.
>>>>
>>>> My feeling is that, for max it's relatively easy to perform a wide
>>>> reduction in hardware.
>>
>>> Sure.  ISTM, and again, IANAHG, that the problem for VVM is the
>>> hardware recognizing that the loop contains no instructions that
>>> can't be parallelized.  There are also some issues like doing a sum
>>> of signed integer values and knowing whether overflow occurred,
>>> etc.  The programmer may know that overflow cannot occur, but the HW
>>> doesn't.
>>
>> The HW does not need preceding knowledge. If an exception happens, the
>> vectorized loop collapses into a scalar loop precisely, and can be
>> handled in the standard fashion.
>
> I think you might have missed my point.  If you are summing the signed
> integer elements of an array, whether you get an overflow or not can
> depend on the order the additions are done.  Thus, without knowledge
> that only the programmer has (i.e. that with the size of the actual data
> used, overflow is impossible) the hardware cannot parallelize such an
> operation.  If the programmer knows that overflow cannot occur, he has
> no way to communicate that to the VVM hardware, such that the HW could
> parallelize the summation.
>
I am not sure, but I strongly believe that VVM cannot be caught out this
way, simply because it would observe the accumulator loop dependency.

I.e. it could do all the other loop instructions (load / add / loop-counter
decrement & branch) completely overlapped, but the actual adds to the
accumulator register would limit total throughput to the ADD-to-ADD latency.

So, on the first hand, VVM cannot automagically parallelize this to use
multiple accumulators; on the other hand, a programmer would be free to
use a pair of wider accumulators to sidestep the issue.

On the third (i.e. gripping) hand, you could have a language like Java
where it would be illegal to transform a temporarily trapping loop into
one that does not trap and gives the mathematically correct answer.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Cray style vectors

https://news.novabbs.org/devel/article-flat.php?id=37437&group=comp.arch#37437

 by: EricP - Sat, 17 Feb 2024 12:49 UTC

MitchAlsup wrote:
> EricP wrote:
>
>> MitchAlsup1 wrote:
>>>
>>> You should think of it like:: VVM can execute as many operations per
>>> cycle as it has function units. In particular, the low end machine
>>> can execute a LD, and FMAC, and the ADD-CMP-BC loop terminator every
>>> cycle. LDs operate at 128-bits wide, so one can execute a LD on even
>>> cycles and a ST on odd cycles--giving 6-IPC on a 1 wide machine.
>>>
>>> Bigger implementations can have more cache ports and more FMAC units;
>>> and include "lanes" in SIMD-like fashion.
>
>> Regarding the 128-bit LD and ST, are you saying the LSQ recognizes
>> two consecutive 64-bit LD or ST to consecutive addresses and merges
>> them into a single cache access?
>
> first: memory is inherently misaligned in My 66000 architecture. So, since
> the width of the machine is 64-bits, we read or write in 128-bit quantities
> so that we have enough bits to extract the misaligned data from or a
> container
> large enough to store a 64-bit value into. {{And there are all the
> associated
> corner cases}}
>
> Second: over in VVM-land, the implementation can decide to read and write
> wider, but is architecturally constrained not to shrink below 128-bits.
>
> A 1-wide My66160 would read pairs of double precision FP values, or quads
> of 32-bit values, octets of 16-bit values, and sixteens of 8-bit values.
> This supports loops of 6 IPC or greater in a 1-wide machine. This machine
> would process suitable loops at 128-bits per cycle--depending on "other
> things" that are generally allowable.
>
> A 6-wide My66650 would read a cache line at a time, and has 3 cache ports
> per cycle. This supports 20 IPC or greater in the 6-wide machine. As
> many as
> 8 DP FP calculations per cycle are possible, with adequate LD/ST bandwidths
> to support this rate.

Ah, so it can emit Load/Store-Pair (LDP/STP, or wider) uOps inside the loop.
That's more straightforward than fusing LDs or STs in the LSQ.

>> Is that done by disambiguation logic, checking for same cache line
>> access?
>
> Before I have said that the front end observes the first iteration of
> the loop and makes some determinations as to how wide the loop can be
> run on
> the machine at hand. One of those observations is whether memory addresses
> are dense, whether they all go in the same direction, and what registers
> carry loop-to-loop dependencies.

How does it know when to use LDP/STP uOps?
That decision would have to be made early in the front end, likely Decode
and before Rename because you have to know how many dest registers you need.

But the decision on the legality to use LDP/STP depends on knowing the
current loop counter >= 2 and address(es) aligned on a 16 byte boundary,
which are multiple dynamic, possibly calculated, values only available
much later to the back end.


devel / comp.arch / Re: Cray style vectors
