Rocksolid Light

devel / comp.arch / A Very Bad Idea

Subject  (Author)
* A Very Bad Idea  (Quadibloc)
+- Re: A Very Bad Idea  (Chris M. Thomasson)
+* Vectors (was: A Very Bad Idea)  (Anton Ertl)
|+* Re: Vectors (was: A Very Bad Idea)  (Quadibloc)
||+- Re: Vectors (was: A Very Bad Idea)  (Anton Ertl)
||`- Re: Vectors  (MitchAlsup1)
|`- Re: Vectors  (MitchAlsup1)
+* Re: A Very Bad Idea  (BGB)
|`* Re: A Very Bad Idea  (MitchAlsup1)
| `- Re: A Very Bad Idea  (BGB-Alt)
+- Re: A Very Bad Idea  (MitchAlsup1)
+* Re: A Very Bad Idea?  (Lawrence D'Oliveiro)
|`* Re: A Very Bad Idea?  (MitchAlsup1)
| `- Re: A Very Bad Idea?  (BGB-Alt)
`* Re: Cray style vectors (was: A Very Bad Idea)  (Marcus)
 +* Re: Cray style vectors (was: A Very Bad Idea)  (Quadibloc)
 |+- Re: Cray style vectors (was: A Very Bad Idea)  (Quadibloc)
 |+* Re: Cray style vectors (was: A Very Bad Idea)  (Scott Lurndal)
 ||`* Re: Cray style vectors (was: A Very Bad Idea)  (Thomas Koenig)
 || `* Re: Cray style vectors  (MitchAlsup1)
 ||  `- Re: Cray style vectors  (Quadibloc)
 |`* Re: Cray style vectors  (Marcus)
 | +- Re: Cray style vectors  (MitchAlsup1)
 | `* Re: Cray style vectors  (Quadibloc)
 |  +- Re: Cray style vectors  (Quadibloc)
 |  +* Re: Cray style vectors  (Anton Ertl)
 |  |`* Re: Cray style vectors  (Stephen Fuld)
 |  | +* Re: Cray style vectors  (Anton Ertl)
 |  | |+- Re: Cray style vectors  (MitchAlsup1)
 |  | |`* Re: Cray style vectors  (Stephen Fuld)
 |  | | `* Re: Cray style vectors  (MitchAlsup)
 |  | |  `* Re: Cray style vectors  (Stephen Fuld)
 |  | |   `* Re: Cray style vectors  (Terje Mathisen)
 |  | |    `* Re: Cray style vectors  (Anton Ertl)
 |  | |     +* Re: Cray style vectors  (Terje Mathisen)
 |  | |     |+- Re: Cray style vectors  (MitchAlsup1)
 |  | |     |+* Re: Cray style vectors  (Tim Rentsch)
 |  | |     ||+* Re: Cray style vectors  (MitchAlsup1)
 |  | |     |||`* Re: Cray style vectors  (Tim Rentsch)
 |  | |     ||| +* Re: Cray style vectors  (Opus)
 |  | |     ||| |`- Re: Cray style vectors  (Tim Rentsch)
 |  | |     ||| +* Re: Cray style vectors  (Scott Lurndal)
 |  | |     ||| |`- Re: Cray style vectors  (Tim Rentsch)
 |  | |     ||| `* Re: Cray style vectors  (MitchAlsup1)
 |  | |     |||  `- Re: Cray style vectors  (Tim Rentsch)
 |  | |     ||`* Re: Cray style vectors  (Terje Mathisen)
 |  | |     || `* Re: Cray style vectors  (Tim Rentsch)
 |  | |     ||  `* Re: Cray style vectors  (Terje Mathisen)
 |  | |     ||   +* Re: Cray style vectors  (Terje Mathisen)
 |  | |     ||   |+* Re: Cray style vectors  (Michael S)
 |  | |     ||   ||`* Re: Cray style vectors  (MitchAlsup1)
 |  | |     ||   || `- Re: Cray style vectors  (Scott Lurndal)
 |  | |     ||   |`- Re: Cray style vectors  (Tim Rentsch)
 |  | |     ||   `- Re: Cray style vectors  (Tim Rentsch)
 |  | |     |+- Re: Cray style vectors  (Anton Ertl)
 |  | |     |`* Re: Cray style vectors  (David Brown)
 |  | |     | +* Re: Cray style vectors  (Terje Mathisen)
 |  | |     | |+* Re: Cray style vectors  (MitchAlsup1)
 |  | |     | ||+* Re: Cray style vectors  (Anton Ertl)
 |  | |     | |||`* What integer C type to use (was: Cray style vectors)  (Anton Ertl)
 |  | |     | ||| `* Re: What integer C type to use (was: Cray style vectors)  (David Brown)
 |  | |     | |||  +* Re: What integer C type to use (was: Cray style vectors)  (Scott Lurndal)
 |  | |     | |||  |`* Re: What integer C type to use (was: Cray style vectors)  (Anton Ertl)
 |  | |     | |||  | +* Re: What integer C type to use (was: Cray style vectors)  (Scott Lurndal)
 |  | |     | |||  | |+- Re: What integer C type to use  (MitchAlsup1)
 |  | |     | |||  | |`* Re: What integer C type to use (was: Cray style vectors)  (Anton Ertl)
 |  | |     | |||  | | `* Re: What integer C type to use (was: Cray style vectors)  (Scott Lurndal)
 |  | |     | |||  | |  `* Re: What integer C type to use (was: Cray style vectors)  (Anton Ertl)
 |  | |     | |||  | |   +- Re: What integer C type to use (was: Cray style vectors)  (Scott Lurndal)
 |  | |     | |||  | |   `* Re: What integer C type to use (was: Cray style vectors)  (Tim Rentsch)
 |  | |     | |||  | |    `* Re: What integer C type to use (was: Cray style vectors)  (Scott Lurndal)
 |  | |     | |||  | |     `- Re: What integer C type to use (was: Cray style vectors)  (Tim Rentsch)
 |  | |     | |||  | `- Re: What integer C type to use  (MitchAlsup1)
 |  | |     | |||  +* Re: What integer C type to use (was: Cray style vectors)  (Anton Ertl)
 |  | |     | |||  |+* Re: What integer C type to use  (MitchAlsup1)
 |  | |     | |||  ||+- Re: What integer C type to use  (David Brown)
 |  | |     | |||  ||`* Re: What integer C type to use  (Terje Mathisen)
 |  | |     | |||  || `* Re: What integer C type to use  (Tim Rentsch)
 |  | |     | |||  ||  `* Re: What integer C type to use  (MitchAlsup1)
 |  | |     | |||  ||   +- Re: What integer C type to use  (Tim Rentsch)
 |  | |     | |||  ||   `* Re: What integer C type to use  (David Brown)
 |  | |     | |||  ||    `- Re: What integer C type to use  (Thomas Koenig)
 |  | |     | |||  |+* Re: What integer C type to use (was: Cray style vectors)  (David Brown)
 |  | |     | |||  ||+* Re: What integer C type to use (was: Cray style vectors)  (Scott Lurndal)
 |  | |     | |||  |||+* Re: What integer C type to use (was: Cray style vectors)  (Michael S)
 |  | |     | |||  ||||+- Re: What integer C type to use (was: Cray style vectors)  (Scott Lurndal)
 |  | |     | |||  ||||`- Re: What integer C type to use (was: Cray style vectors)  (David Brown)
 |  | |     | |||  |||`- Re: What integer C type to use (was: Cray style vectors)  (Anton Ertl)
 |  | |     | |||  ||`* Re: What integer C type to use (was: Cray style vectors)  (Anton Ertl)
 |  | |     | |||  || `* Re: What integer C type to use  (David Brown)
 |  | |     | |||  ||  `* Re: What integer C type to use  (MitchAlsup1)
 |  | |     | |||  ||   `- Re: What integer C type to use  (David Brown)
 |  | |     | |||  |`* Re: What integer C type to use (was: Cray style vectors)  (Thomas Koenig)
 |  | |     | |||  | +* Re: What integer C type to use  (MitchAlsup1)
 |  | |     | |||  | |+* Re: What integer C type to use  (David Brown)
 |  | |     | |||  | ||`* Re: What integer C type to use  (MitchAlsup1)
 |  | |     | |||  | || `* Re: What integer C type to use  (David Brown)
 |  | |     | |||  | ||  `* Re: What integer C type to use  (Michael S)
 |  | |     | |||  | ||   +* Re: What integer C type to use  (MitchAlsup1)
 |  | |     | |||  | ||   |`* Re: What integer C type to use  (Michael S)
 |  | |     | |||  | ||   | `* Re: What integer C type to use  (MitchAlsup1)
 |  | |     | |||  | ||   `- Re: What integer C type to use  (Thomas Koenig)
 |  | |     | |||  | |`* Re: What integer C type to use  (Thomas Koenig)
 |  | |     | |||  | `* Re: What integer C type to use (was: Cray style vectors)  (Anton Ertl)
 |  | |     | |||  +* Re: What integer C type to use (was: Cray style vectors)  (Brian G. Lucas)
 |  | |     | |||  `- Re: What integer C type to use  (BGB)
 |  | |     | ||+- Re: Cray style vectors  (David Brown)
 |  | |     | ||`- Re: Cray style vectors  (Tim Rentsch)
 |  | |     | |+- Re: Cray style vectors  (David Brown)
 |  | |     | |`- Re: Cray style vectors  (Tim Rentsch)
 |  | |     | `* Re: Cray style vectors  (Thomas Koenig)
 |  | |     `* Re: Cray style vectors  (BGB)
 |  | `- Re: Cray style vectors  (MitchAlsup1)
 |  +- Re: Cray style vectors  (BGB)
 |  +* Re: Cray style vectors  (Marcus)
 |  `* Re: Cray style vectors  (MitchAlsup1)
 `* Re: Cray style vectors (was: A Very Bad Idea)  (Michael S)

A Very Bad Idea

<upq0cr$6b5m$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37212&group=comp.arch#37212

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: quadibloc@servername.invalid (Quadibloc)
Newsgroups: comp.arch
Subject: A Very Bad Idea
Date: Mon, 5 Feb 2024 06:48:59 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 36
Message-ID: <upq0cr$6b5m$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 5 Feb 2024 06:48:59 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="9dfbb09dfb45af57278db46b8f87cdcc";
logging-data="208054"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+LFNkqwg5QQa9ClmHzl/q/mmhciVuvK94="
User-Agent: Pan/0.146 (Hic habitat felicitas; d7a48b4
gitlab.gnome.org/GNOME/pan.git)
Cancel-Lock: sha1:zAq/XD36pHNxNypfv5nz5Fu7jZs=
 by: Quadibloc - Mon, 5 Feb 2024 06:48 UTC

I am very fond of the vector architecture of the Cray I and
similar machines, because it seems to me the one way of
increasing computer performance that proved effective in
the past that still isn't being applied to microprocessors
today.

Mitch Alsup, however, has noted that such an architecture is
unworkable today due to memory bandwidth issues. The one
extant example of this architecture these days, the NEC
SX-Aurora TSUBASA, keeps its entire main memory of up to 48
gigabytes on the same card as the CPU, with a form factor
resembling a video card - it doesn't try to use the main
memory bus of a PC motherboard. So that seems to confirm
this.

These days, Moore's Law has limped along well enough to allow
putting a lot of cache memory on a single die and so on.

So, perhaps it might be possible to design a chip that is
basically similar to the IBM/SONY CELL microprocessor,
except that the satellite processors handle Cray-style vectors,
and have multiple megabytes of individual local storage.

It might be possible to design such a chip. The main processor
with access to external DRAM would be a conventional processor,
with only ordinary SIMD vector capabilities. And such a chip
might well be able to execute lots of instructions if one runs
a suitable benchmark on it.

But try as I might, I can't see a useful application for such
a chip. The restricted access to memory would basically hobble
it for anything but a narrow class of embarrassingly parallel
applications. The original CELL was thought of as being useful
for graphics applications, but GPUs are much better at that.

John Savard

Re: A Very Bad Idea

<upq0gs$6e8m$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37213&group=comp.arch#37213

Path: i2pn2.org!i2pn.org!news.samoylyk.net!nntp.comgw.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: A Very Bad Idea
Date: Sun, 4 Feb 2024 22:51:08 -0800
Organization: A noiseless patient Spider
Lines: 9
Message-ID: <upq0gs$6e8m$1@dont-email.me>
References: <upq0cr$6b5m$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 5 Feb 2024 06:51:08 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="e85c8ff53645b7d4213f2214e8730b82";
logging-data="211222"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/xuWVopK7OEfJrrPv/HxBKiRahSyFUJRc="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:+u2cJfALsqtU+L7eT4nDGNspVaE=
In-Reply-To: <upq0cr$6b5m$1@dont-email.me>
Content-Language: en-US
 by: Chris M. Thomasson - Mon, 5 Feb 2024 06:51 UTC

On 2/4/2024 10:48 PM, Quadibloc wrote:
[...]

> The original CELL was thought of as being useful
> for graphics applications, but GPUs are much better at that.

The CELL wrt the PlayStation was an interesting arch: DMA to the Cell
processors. Some games did not even use them; they used the PPC instead.

Vectors (was: A Very Bad Idea)

<2024Feb5.084424@mips.complang.tuwien.ac.at>

https://news.novabbs.org/devel/article-flat.php?id=37215&group=comp.arch#37215

Path: i2pn2.org!i2pn.org!news.swapon.de!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Vectors (was: A Very Bad Idea)
Date: Mon, 05 Feb 2024 07:44:24 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 104
Message-ID: <2024Feb5.084424@mips.complang.tuwien.ac.at>
References: <upq0cr$6b5m$1@dont-email.me>
Injection-Info: dont-email.me; posting-host="21e8e08af84274a3b4770214f7c13a8a";
logging-data="250493"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/WMAKZ8hO0inl9SWLpYgdr"
Cancel-Lock: sha1:t0vPCzmPz/bNHsjXpSI7jDGxLyE=
X-newsreader: xrn 10.11
 by: Anton Ertl - Mon, 5 Feb 2024 07:44 UTC

Quadibloc <quadibloc@servername.invalid> writes:
>I am very fond of the vector architecture of the Cray I and
>similar machines, because it seems to me the one way of
>increasing computer performance that proved effective in
>the past that still isn't being applied to microprocessors
>today.

To some extent, it is: Zen4 performs 512-bit SIMD by feeding its
512-bit registers to the 256-bit units in two successive cycles.
Earlier Zen used 2 physical 128-bit registers as one logical 256-bit
register and AFAIK it split 256-bit operations into two 128-bit
operations that could be scheduled arbitrarily by the OoO engine
(while Zen4 treats the 512-bit operation as a unit that consumes two
cycles of a pipelined 256-bit unit). Similar things have been done by
Intel and AMD in other CPUs, implementing 256-bit operations with
128-bit units (Gracemont, Bulldozer-Excavator, Jaguar and Puma), or
implementing 128-bit operations with 64-bit units (e.g., on the K8).
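
(For concreteness, a minimal AVX-512 loop of the kind in question; on a
Zen4-class core the single 512-bit add below reportedly occupies a 256-bit
unit for two cycles, while an AVX2-only core needs two 256-bit instructions
for the same work. A sketch only, assuming an AVX-512F-capable compiler and
target.)

  #include <immintrin.h>
  #include <stddef.h>

  /* c[i] = a[i] + b[i], 16 floats (512 bits) per iteration. */
  void add_f32(float *c, const float *a, const float *b, size_t n)
  {
      for (size_t i = 0; i + 16 <= n; i += 16) {
          __m512 va = _mm512_loadu_ps(a + i);
          __m512 vb = _mm512_loadu_ps(b + i);
          _mm512_storeu_ps(c + i, _mm512_add_ps(va, vb));
      }
      /* tail elements left to scalar code for brevity */
  }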

Why are they not using longer vectors with the same FUs or narrower
FUs? For Gracemont, that's really the question; they even disabled
AVX-512 on Alder Lake and Raptor Lake completely (even on Xeon CPUs
with disabled Gracemont) because Gracemont does not do AVX-512.
Supposedly the reason is that Gracemont does not have enough physical
128-bit registers for AVX-512 (128 such registers would be needed to
implement the 32 logical ZMM registers, and probably some more to
avoid deadlocks and maybe for some microcoded operations;
<https://chipsandcheese.com/2021/12/21/gracemont-revenge-of-the-atom-cores/>
reports 191+16 XMM registers and 95+16 YMM registers, which makes me
doubt that explanation).

Anyway, the size of the register files is one reason for avoiding
longer vectors.

Also, the question is how much it buys. For Zen4, I remember seeing
results that coding the same stuff as using two 256-bit instructions
rather than one 512-bit instruction increased power consumption a
little, resulting in the CPU (running at the power limit) lowering the
clock rate of the cores from IIRC 3700MHz to 3600MHz; not a very big
benefit. How much would the benefit be from longer vectors? Probably
not more than another 100MHz: From 256-bit instructions to 512-bit
instructions already halves the number of instructions to process in
the front end; eliminating the other half would require infinitely
long vectors.

>Mitch Alsup, however, has noted that such an architecture is
>unworkable today due to memory bandwidth issues.

My memory says that he mentioned memory latency. He did not
explain why he thinks so, but caches and prefetchers seem to be doing
ok for bridging the latency from DRAM to L2 or L1.

As for main memory bandwidth, that is certainly a problem for
applications that have frequent cache misses (many, but not all HPC
applications are among them). And once you are limited by main memory
bandwidth, the ISA makes little difference.

But for those applications where caches work (e.g., dense matrix
multiplication in the HPC realm), I don't see a reason why a
long-vector architecture would be unworkable. It's just that, as
discussed above, the benefits are small.

>The one
>extant example of this architecture these days, the NEC
>SX-Aurora TSUBASA, keeps its entire main memory of up to 48
>gigabytes on the same card as the CPU, with a form factor
>resembling a video card - it doesn't try to use the main
>memory bus of a PC motherboard. So that seems to confirm
>this.

Caches work well for most applications. So mainstream CPUs are
designed with a certain amount of cache and enough main-memory
bandwidth to satisfy most applications. For the niche that needs more
main-memory bandwidth, there are GPGPUs which have high bandwidth
because their original application needs it (and AFAIK GPGPUs have
long vectors). For the remaining niche, having a CPU with several
stacks of HBM memory attached (like the NEC vector CPUs) is a good
idea; and given that there is legacy software for NEC vector CPUs,
providing that ISA also covers that need.

>So, perhaps it might be possible to design a chip that is
>basically similar to the IBM/SONY CELL microprocessor,
>except that the satellite processors handle Cray-style vectors,
>and have multiple megabytes of individual local storage.

Who would buy such a microprocessor? Megabytes? Laughable. If
that's intended to be a buffer for main memory, you need the
main-memory bandwidth; and why would you go for explicitly managed
local memory (which deservedly vanished from the market, see below)
rather than the well-working setup of cache and prefetchers? BTW,
Raptor Cove gives you 2MB of private L2.

>The original CELL was thought of as being useful
>for graphics applications, but GPUs are much better at that.

The Playstation 3 has a separate GPU based on the Nvidia G70
<https://en.wikipedia.org/wiki/PlayStation_3_technical_specifications#Graphics_processing_unit>.

What I heard/read about the Cell CPU is that the SPEs were too hard to
make good use of and that consequently they were not used much.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: A Very Bad Idea

<upq9va$7s40$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37216&group=comp.arch#37216

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: A Very Bad Idea
Date: Mon, 5 Feb 2024 03:32:16 -0600
Organization: A noiseless patient Spider
Lines: 116
Message-ID: <upq9va$7s40$1@dont-email.me>
References: <upq0cr$6b5m$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 5 Feb 2024 09:32:26 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="8f3543bcc87a2e7786a95b59cae0d67a";
logging-data="258176"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/z50UaYJYAQ1UKJX4orsyZ"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:Bt5bF0L8cTz1PIbCt8StkNwfGq0=
Content-Language: en-US
In-Reply-To: <upq0cr$6b5m$1@dont-email.me>
 by: BGB - Mon, 5 Feb 2024 09:32 UTC

On 2/5/2024 12:48 AM, Quadibloc wrote:
> I am very fond of the vector architecture of the Cray I and
> similar machines, because it seems to me the one way of
> increasing computer performance that proved effective in
> the past that still isn't being applied to microprocessors
> today.
>
> Mitch Alsup, however, has noted that such an architecture is
> unworkable today due to memory bandwidth issues. The one
> extant example of this architecture these days, the NEC
> SX-Aurora TSUBASA, keeps its entire main memory of up to 48
> gigabytes on the same card as the CPU, with a form factor
> resembling a video card - it doesn't try to use the main
> memory bus of a PC motherboard. So that seems to confirm
> this.
>
> These days, Moore's Law has limped along well enough to allow
> putting a lot of cache memory on a single die and so on.
>
> So, perhaps it might be possible to design a chip that is
> basically similar to the IBM/SONY CELL microprocessor,
> except that the satellite processors handle Cray-style vectors,
> and have multiple megabytes of individual local storage.
>
> It might be possible to design such a chip. The main processor
> with access to external DRAM would be a conventional processor,
> with only ordinary SIMD vector capabilities. And such a chip
> might well be able to execute lots of instructions if one runs
> a suitable benchmark on it.
>

One doesn't need to disallow access to external RAM, but maybe:
Memory coherence is fairly weak for these cores;
The local RAM addresses are treated as "strongly preferable".

Or, say, there is a region of RAM that is divided among the cores, where
the core has fast access to its own local chunk, but slow access to any
of the other chunks (which are treated more like external RAM).

Here, threads would be assigned to particular cores, and the scheduler
would not be allowed to move a thread from one core to another once it
is assigned to a given core.
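
(On existing systems that kind of pinning is usually done with an affinity
call; a minimal sketch, assuming Linux/glibc and its pthread_setaffinity_np
extension:)

  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>

  /* Pin the calling thread to one core so that its "local" chunk of RAM
     stays local for the thread's lifetime. Returns 0 on success. */
  static int pin_self_to_core(int core)
  {
      cpu_set_t set;
      CPU_ZERO(&set);
      CPU_SET(core, &set);
      return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
  }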

As for SIMD vs vectors, as I see it, SIMD seems to make sense in that it
is cheap and simple.

The Cell cores were, if anything, more of a "SIMD First, ALU Second"
approach, building it around 128-bit registers but only using part of
these for integer code.

I went a slightly different direction, using 64-bit registers that may
be used in pairs for 128-bit ops. This may make more sense if one
assumes that the core is going to be used for a lot more general purpose
code, rather than used almost entirely for SIMD.

I have some hesitation about "vector processing", as it seems fairly
alien to how this stuff normally sort of works; seems more complicated
than SIMD for an implementation; ...

It is arguably more scalable, but as I see it, much past 64 or 128 bit
vectors, SIMD rapidly goes into diminishing returns, and it makes more
sense to be like "128-bit is good enough" than to try to chase after
ever wider SIMD vectors.

Or, maybe a hybrid strategy, where the vector operations are applied
over types that may just so happen to include SIMD vectors.

Granted, with the usual caveat that one needs to be careful in the
design of SIMD to not allow it to eat too much of one's encoding space.

Well, and there seems to be a trap of people trying to stick "extra
stuff" into the SIMD ops that bloats their requirements, such as gluing
range-saturation or type-conversions onto the SIMD operations.

Granted, I did experiment with bolting shuffle onto SIMD ops, but this
was exclusive to larger encodings (64 and 96 bit operations). And,
likely, not really worth it outside of some niche cases.

> But try as I might, I can't see a useful application for such
> a chip. The restricted access to memory would basically hobble
> it for anything but a narrow class of embarassingly parallel
> applications. The original CELL was thought of as being useful
> for graphics applications, but GPUs are much better at that.
>

Yes, tradeoffs.

But, I can also note that even for semi-general use, an ISA design like
RV64G suffers a significant disadvantage, say, vs my own ISA, in the
area of doing an OpenGL implementation. Needing to fall back to scalar
operations really hurts things (even as much as the RV64 F/D extensions
are "fairly aggressive" in some areas, with a lot of features that I
would have otherwise deemed "mostly unnecessary").

Or, some features it does have, partly backfire:
The ABI and compiler treat 'float' and 'double' as distinct types;
In turn requiring a lot of conversion back and forth.

My own approach was to always keep scalar values as Binary64 in
registers, since this limits having a bunch of back-and-forth conversion
for 'float' in local variables (this type was more relevant to in-memory
storage, and as a hint for when and where it is safe to trade off
precision for speed).

Though, it looks like "-ffast-math" in GCC does help this issue (the float
variables remain as float, rather than endlessly bouncing back and forth
between float and double).
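
(One common source of that back-and-forth, independent of ISA, is
double-typed constants in otherwise-float expressions; a tiny sketch of the
difference:)

  /* Under plain C promotion rules the double constant drags the whole
     expression up to Binary64 and back (fcvt + fmul.d + fcvt on RV64);
     the 'f' suffix keeps it in Binary32 (a single fmul.s). */
  float scale_slow(float x) { return x * 1.5;  }
  float scale_fast(float x) { return x * 1.5f; }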

Re: Vectors (was: A Very Bad Idea)

<upqn9d$a0i2$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37221&group=comp.arch#37221

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: quadibloc@servername.invalid (Quadibloc)
Newsgroups: comp.arch
Subject: Re: Vectors (was: A Very Bad Idea)
Date: Mon, 5 Feb 2024 13:19:41 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 30
Message-ID: <upqn9d$a0i2$1@dont-email.me>
References: <upq0cr$6b5m$1@dont-email.me>
<2024Feb5.084424@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 5 Feb 2024 13:19:41 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="9dfbb09dfb45af57278db46b8f87cdcc";
logging-data="328258"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18Odb2ffE1o/PFEtvw5VBS/13AbtMrZdvc="
User-Agent: Pan/0.146 (Hic habitat felicitas; d7a48b4
gitlab.gnome.org/GNOME/pan.git)
Cancel-Lock: sha1:7LN3LCAGfWs5dKg8BzjunIYhES4=
 by: Quadibloc - Mon, 5 Feb 2024 13:19 UTC

On Mon, 05 Feb 2024 07:44:24 +0000, Anton Ertl wrote:

> Who would buy such a microprocessor? Megabytes? Laughable. If
> that's intended to be a buffer for main memory, you need the
> main-memory bandwidth;

Well, the original Cray I had a main memory of eight megabytes, and the
Cray Y-MP had up to 512 megabytes of memory.

I was keeping as close to the original CELL design as possible, but
certainly one could try to improve. After all, if Intel could make
a device like the Xeon Phi, having multiple CPUs on a chip all sharing
access to external memory, however inadequate, could still be done (but
then I wouldn't be addressing Mitch Alsup's objection).

Instead of imitating the CELL, or the Xeon Phi, for that matter, what
I think of as a more practical way to make a consumer Cray-like chip
would be to put only one core in a package, and give that core an
eight-channel memory bus.

Some older NEC designs used a sixteen-channel memory bus, but I felt
that eight channels would already be expensive for a consumer product.

Given Mitch Alsup's objection, though, I threw out the opposite kind
of design, one patterned after the CELL, as one that maybe could allow
a vector CPU to churn out more FLOPs. But as I noted, it seems to have
the fatal flaw of very little capacity for any kind of useful work...
which is kind of the whole point of any CPU.

John Savard

Re: Vectors (was: A Very Bad Idea)

<2024Feb5.144456@mips.complang.tuwien.ac.at>

https://news.novabbs.org/devel/article-flat.php?id=37223&group=comp.arch#37223

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Vectors (was: A Very Bad Idea)
Date: Mon, 05 Feb 2024 13:44:56 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 72
Message-ID: <2024Feb5.144456@mips.complang.tuwien.ac.at>
References: <upq0cr$6b5m$1@dont-email.me> <2024Feb5.084424@mips.complang.tuwien.ac.at> <upqn9d$a0i2$1@dont-email.me>
Injection-Info: dont-email.me; posting-host="21e8e08af84274a3b4770214f7c13a8a";
logging-data="360092"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18wG2pVwZdvID1KAkVeBFnZ"
Cancel-Lock: sha1:f+U0+Bbq+ssdOIlnsNUH4hZImR4=
X-newsreader: xrn 10.11
 by: Anton Ertl - Mon, 5 Feb 2024 13:44 UTC

Quadibloc <quadibloc@servername.invalid> writes:
>On Mon, 05 Feb 2024 07:44:24 +0000, Anton Ertl wrote:
>
>> Who would buy such a microprocessor? Megabytes? Laughable. If
>> that's intended to be a buffer for main memory, you need the
>> main-memory bandwidth;
>
>Well, the original Cray I had a main memory of eight megabytes

If you want to compete with a 1976 supercomputer, Megabytes may be
enough. However, if you want to compete with something from 2024,
better look at how much local memory the likes of these NEC cards, or
Nvidia or AMD GPGPUs provide. And that's Gigabytes.

>I was keeping as close to the original CELL design as possible, but
>certainly one could try to improve. After all, if Intel could make
>a device like the Xeon Phi, having multiple CPUs on a chip all sharing
>access to external memory, however inadequate, could still be done

You don't have to look for the Xeon Phi. The lowly Athlon 64 X2 or
Pentium D from 2005 already have several cores sharing access to the
external memory (and the UltraSPARC T1 from the same year even has 8
cores).

The Xeon Phis are interesting:

* Knight's Corner is a PCIe card with up to 16GB local memory and
bandwidths up to 352GB/s (plus access to the host system's DRAM at
anemic bandwidth (PCIe 2.0 x16)).

* Knight's Landing was available as PCIe card or as socketed CPU with
16GB of local memory with "400+ GB/s" bandwidth and up to 384GB of
DDR4 memory with 102.4GB/s.

* Knight's Mill was only available in a socketed version with similar
specs.

* Eventually they were replaced by the big mainstream Xeons without
local memory, stuff like the Xeon Platinum 8180 with about 128GB/s
DRAM bandwidth.

It seems that running the HPC processor as a coprocessor was not good
enough for the Xeon Phi, and that the applications that needed lots of
bandwidth to local memory also did not provide enough revenue to
sustain Xeon Phi development; OTOH, Nvidia has great success with its
GPGPU line, so maybe the market is there, but the Xeon Phi was
uncompetitive.

If you are interested in such things, the recently announced AMD
Instinct MI300A (CPUs+GPUs) with 128GB local memory or MI300X (GPUs
only) with 192GB local memory with 5300GB/s bandwidth may be of
interest to you.

>Instead of imitating the CELL, or the Xeon Phi, for that matter, what
>I think of as a more practical way to make a consumer Cray-like chip
>would be to put only one core in a package, and give that core an
>eight-channel memory bus.

IBM sells Power systems with few cores and the full memory system.
Similarly, you can buy AMD EPYCs with few active cores and the full
memory system. Some of them have a lot of cache, too (e.g., 72F3 and
7373X).

>Some older NEC designs used a sixteen-channel memory bus, but I felt
>that eight channels will already be expensive for a consumer product.

If you want high bandwidth in a consumer product, buy a graphics card.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: A Very Bad Idea

<6860be35f9ff44a0a2cc01a4dac81fac@www.novabbs.org>

https://news.novabbs.org/devel/article-flat.php?id=37226&group=comp.arch#37226

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: A Very Bad Idea
Date: Mon, 5 Feb 2024 19:20:51 +0000
Organization: Rocksolid Light
Message-ID: <6860be35f9ff44a0a2cc01a4dac81fac@www.novabbs.org>
References: <upq0cr$6b5m$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="1807823"; mail-complaints-to="usenet@i2pn2.org";
posting-account="PGd4t4cXnWwgUWG9VtTiCsm47oOWbHLcTr4rYoM0Edo";
User-Agent: Rocksolid Light
X-Spam-Checker-Version: SpamAssassin 4.0.0
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
X-Rslight-Site: $2y$10$LBPltiuBSZTpnFQu4MzSfevLv19bphnf09AUNNCcrNlMnjHo2ITc6
 by: MitchAlsup1 - Mon, 5 Feb 2024 19:20 UTC

Quadibloc wrote:

> I am very fond of the vector architecture of the Cray I and
> similar machines, because it seems to me the one way of
> increasing computer performance that proved effective in
> the past that still isn't being applied to microprocessors
> today.

> Mitch Alsup, however, has noted that such an architecture is
> unworkable today due to memory bandwidth issues. The one

Memory LATENCY issues, not BW issues. The length of the vector
has to be able to absorb a miss at all cache levels without
stalling the core. 5GHz processors and 60 ns DRAM access times
mean the minimum vector length is 300 registers in a single
vector. Which also means it takes a loop of count 300+ to
reach peak efficiency.
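
(Spelled out: at 5 GHz one cycle is 0.2 ns, so a 60 ns DRAM access is
60 / 0.2 = 300 cycles; the vector therefore needs on the order of 300
elements in flight to cover a full miss, which is where the 300-register
figure and the 300+ loop count come from.)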

To a certain extent the B registers of the CRAY 2 were meant to
do that (absorb longer and longer memory latencies), but
this B register set is now considered a failure.

> extant example of this architecture these days, the NEC
> SX-Aurora TSUBASA, keeps its entire main memory of up to 48
> gigabytes on the same card as the CPU, with a form factor
> resembling a video card - it doesn't try to use the main
> memory bus of a PC motherboard. So that seems to confirm
> this.

They also increased vector length as memory latency increased.
Ending up at (IIRC) 256 entry VRF[k].

> These days, Moore's Law has limped along well enough to allow
> putting a lot of cache memory on a single die and so on.

Consider FFT:: sooner or later you are reading and writing
vastly scattered memory containers considerably smaller than
any cache line. FFT is one you want peak efficiency on !
So, if you want FFT to run at core peak efficiency, your
interconnect has to be able to pass either containers from
different memory banks on alternating cycles, or whole
cache lines in a single cycle. {{The latter is easier to do}}

A vector machine (done properly) is a bandwidth machine rather
than a latency based machine (which can be optimized by cache
hierarchy).

> So, perhaps it might be possible to design a chip that is
> basically similar to the IBM/SONY CELL microprocessor,
> except that the satellite processors handle Cray-style vectors,
> and have multiple megabytes of individual local storage.

Generally microprocessors are pin limited as are DRAM chips,
so in order to get the required BW--2LD1ST per cycle continuously
with latency less than vector length--you end up needing a way
to access 16-to-64 DRAM DIMMs simultaneously. You might be able
to do this with PCIe 6.0 if you have 64 twisted quads, one for
each DRAM DIMM. Minimum memory size is 64 DIMMs !

A processor box with 64 DIMMs (as its minimum) is not mass market.

One reason CRAY sold a lot of supercomputers is that its I/O
system was also up to the task--CRAY YMP had 4× the I/O BW
of NEC SX{4,5,6} so when the application became I/O bound
the 6ns YMP was faster than the SX.

It is perfectly OK to try to build a CRAY-like vector processor.
But designing a vector processor is a lot more about the memory
system (feeding the beast) than about the processor (the beast).

> It might be possible to design such a chip. The main processor
> with access to external DRAM would be a conventional processor,
> with only ordinary SIMD vector capabilities. And such a chip
> might well be able to execute lots of instructions if one runs
> a suitable benchmark on it.

If you figure this out, there is a market for 100-200 vector
supercomputer mainframes per year. If you can build a company
that makes money on this volume-- go for it !

> But try as I might, I can't see a useful application for such
> a chip. The restricted access to memory would basically hobble
> it for anything but a narrow class of embarassingly parallel
> applications. The original CELL was thought of as being useful
> for graphics applications, but GPUs are much better at that.

> John Savard

Re: Vectors

<acd169068436b7c7988e46b4bce954cd@www.novabbs.org>

https://news.novabbs.org/devel/article-flat.php?id=37227&group=comp.arch#37227

Date: Mon, 5 Feb 2024 19:30:20 +0000
Subject: Re: Vectors
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$g1Prj72OgoFlBIrw69MPWuQyGq.FlY8edv/NnN.9TwNyhCjt3eSl6
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <upq0cr$6b5m$1@dont-email.me> <2024Feb5.084424@mips.complang.tuwien.ac.at>
Organization: Rocksolid Light
Message-ID: <acd169068436b7c7988e46b4bce954cd@www.novabbs.org>
 by: MitchAlsup1 - Mon, 5 Feb 2024 19:30 UTC

Anton Ertl wrote:

> Quadibloc <quadibloc@servername.invalid> writes:
>>I am very fond of the vector architecture of the Cray I and
>>similar machines, because it seems to me the one way of
>>increasing computer performance that proved effective in
>>the past that still isn't being applied to microprocessors
>>today.

> To some extent, it is: Zen4 performs 512-bit SIMD by feeding its
> 512-bit registers to the 256-bit units in two successive cycles.
> Earlier Zen used 2 physical 128-bit registers as one logical 256-bit
> register and AFAIK it split 256-bit operations into two 128-bit
> operations that could be scheduled arbitrarily by the OoO engine
> (while Zen4 treats the 512-bit operation as a unit that consumes two
> cycles of a pipelined 256-bit unit). Similar things have been done by
> Intel and AMD in other CPUs, implementing 256-bit operations with
> 128-bit units (Gracemont, Bulldozer-Excavator, Jaguar and Puma), or
> implementing 128-bit operations with 64-bit units (e.g., on the K8).

> Why are they not using longer vectors with the same FUs or narrower
> FUs? For Gracemont, that's really the question; they even disabled
> AVX-512 on Alder Lake and Raptor Lake completely (even on Xeon CPUs
> with disabled Gracemont) because Gracemont does not do AVX-512.

They wanted to keep core power under some <thermal> limit; 256 bits
fit under this limit, 512 did not.

> Supposedly the reason is that Gracemont does not have enough physical
> 128-bit registers for AVX-512 (128 such registers would be needed to
> implement the 32 logical ZMM registers, and probably some more to
> avoid deadlocks and maybe for some microcoded operations;
> <https://chipsandcheese.com/2021/12/21/gracemont-revenge-of-the-atom-cores/>
> reports 191+16 XMM registers and 95+16 YMM registers, which makes me
> doubt that explanation).

> Anyway, the size of the register files is one reason for avoiding
> longer vectors.

> Also, the question is how much it buys. For Zen4, I remember seeing
> results that coding the same stuff as using two 256-bit instructions
> rather than one 512-bit instruction increased power consumption a
> little, resulting in the CPU (running at the power limit) lowering the
> clock rate of the cores from IIRC 3700MHz to 3600MHz; not a very big
> benefit. How much would the benefit be from longer vectors? Probably
> not more than another 100MHz: From 256-bit instructions to 512-bit
> instructions already halves the number of instructions to process in
> the front end; eliminating the other half would require infinitely
> long vectors.

>>Mitch Alsup, however, has noted that such an architecture is
>>unworkable today due to memory bandwidth issues.

> My memory says is that he mentioned memory latency. He did not
> explain why he thinks so, but caches and prefetchers seem to be doing
> ok for bridging the latency from DRAM to L2 or L1.

As seen by scalar cores, yes, as seen by vector cores (like CRAY) no.

I might note:: RISC-V has a CRAY-like vector extension and a SIMD-like
vector extension. ... make of that what you may.

> As for main memory bandwidth, that is certainly a problem for
> applications that have frequent cache misses (many, but not all HPC
> applications are among them). And once you are limited by main memory
> bandwidth, the ISA makes little difference.

My point in the previous post.

> But for those applications where caches work (e.g., dense matrix
> multiplication in the HPC realm), I don't see a reason why a
> long-vector architecture would be unworkable. It's just that, as
> discussed above, the benefits are small.

TeraByte 2D and 3D FFTs are not cache friendly...

>>The one
>>extant example of this architecture these days, the NEC
>>SX-Aurora TSUBASA, keeps its entire main memory of up to 48
>>gigabytes on the same card as the CPU, with a form factor
>>resembling a video card - it doesn't try to use the main
>>memory bus of a PC motherboard. So that seems to confirm
>>this.

> Caches work well for most applications. So mainstream CPUs are
> designed with a certain amount of cache and enough main-memory
> bandwidth to satisfy most applications. For the niche that needs more
> main-memory bandwidth, there are GPGPUs which have high bandwidth
> because their original application needs it (and AFAIK GPGPUs have

And can afford to absorb the latency.

> long vectors). For the remaining niche, having a CPU with several
> stacks of HBM memory attached (like the NEC vector CPUs) is a good
> idea; and given that there is legacy software for NEC vector CPUs,
> providing that ISA also covers that need.

>>So, perhaps it might be possible to design a chip that is
>>basically similar to the IBM/SONY CELL microprocessor,
>>except that the satellite processors handle Cray-style vectors,
>>and have multiple megabytes of individual local storage.

> Who would buy such a microprocessor? Megabytes? Laughable. If
> that's intended to be a buffer for main memory, you need the
> main-memory bandwidth; and why would you go for explicitly managed
> local memory (which deservedly vanished from the market, see below)
> rather than the well-working setup of cache and prefetchers? BTW,
> Raptor Cove gives you 2MB of private L2.

>>The original CELL was thought of as being useful
>>for graphics applications, but GPUs are much better at that.

> The Playstation 3 has a separate GPU based on the Nvidia G70
> <https://en.wikipedia.org/wiki/PlayStation_3_technical_specifications#Graphics_processing_unit>.

> What I heard/read about the Cell CPU is that the SPEs were too hard to
> make good use of and that consequently they were not used much.

> - anton

Re: A Very Bad Idea

<694fe8eeffcb95294990ef3c8e3e5494@www.novabbs.org>

https://news.novabbs.org/devel/article-flat.php?id=37228&group=comp.arch#37228

Date: Mon, 5 Feb 2024 19:43:15 +0000
Subject: Re: A Very Bad Idea
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$.cj/QDNMcpVJt9TvVOJwGeXAlnNFFlU9Qc1YcLrd0I.UN.IJoOeEm
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <upq0cr$6b5m$1@dont-email.me> <upq9va$7s40$1@dont-email.me>
Organization: Rocksolid Light
Message-ID: <694fe8eeffcb95294990ef3c8e3e5494@www.novabbs.org>
 by: MitchAlsup1 - Mon, 5 Feb 2024 19:43 UTC

BGB wrote:

> On 2/5/2024 12:48 AM, Quadibloc wrote:
>> I am very fond of the vector architecture of the Cray I and
>> similar machines, because it seems to me the one way of
>> increasing computer performance that proved effective in
>> the past that still isn't being applied to microprocessors
>> today.
>>
>> Mitch Alsup, however, has noted that such an architecture is
>> unworkable today due to memory bandwidth issues. The one
>> extant example of this architecture these days, the NEC
>> SX-Aurora TSUBASA, keeps its entire main memory of up to 48
>> gigabytes on the same card as the CPU, with a form factor
>> resembling a video card - it doesn't try to use the main
>> memory bus of a PC motherboard. So that seems to confirm
>> this.
>>
>> These days, Moore's Law has limped along well enough to allow
>> putting a lot of cache memory on a single die and so on.
>>
>> So, perhaps it might be possible to design a chip that is
>> basically similar to the IBM/SONY CELL microprocessor,
>> except that the satellite processors handle Cray-style vectors,
>> and have multiple megabytes of individual local storage.
>>
>> It might be possible to design such a chip. The main processor
>> with access to external DRAM would be a conventional processor,
>> with only ordinary SIMD vector capabilities. And such a chip
>> might well be able to execute lots of instructions if one runs
>> a suitable benchmark on it.
>>

> One doesn't need to disallow access to external RAM, but maybe:
> Memory coherence is fairly weak for these cores;
> The local RAM addresses are treated as "strongly preferable".

> Or, say, there is a region on RAM that is divided among the cores, where
> the core has fast access to its own local chunk, but slow access to any
> of the other chunks (which are treated more like external RAM).

Large FFTs do not fit in this category. FFTs are one of the most valuable
means of calculating Great Big Physics "stuff". We used FFT back in the
NMR lab to change a BigO( n^3 ) problem into a 2×BigO( n×log(n) ) problem.
Very many big physics simulations do similarly.

That problem was matrix-matrix multiplication !!

MultipliedMatrix = IFFT( ConjugateMultiply( FFT( matrix ), pattern ) );

{where pattern was FFTd earlier }

Look up the data access pattern and apply that knowledge to TB-sized
matrices, and then ask yourself if caches bring anything to the party?
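
(For concreteness, a sketch of that transform-domain multiply, with FFTW3
standing in for the FFT library and the pattern assumed to be
forward-transformed already, as above; FFTW's inverse is unnormalized,
hence the 1/n scale.)

  #include <complex.h>   /* before fftw3.h so fftw_complex is C99 complex */
  #include <stddef.h>
  #include <fftw3.h>

  /* matrix <- IFFT( ConjugateMultiply( FFT(matrix), pattern ) ) */
  void fft_multiply(int n0, int n1, fftw_complex *matrix,
                    const fftw_complex *pattern /* already FFT'd */)
  {
      ptrdiff_t n = (ptrdiff_t)n0 * n1;
      fftw_plan fwd = fftw_plan_dft_2d(n0, n1, matrix, matrix,
                                       FFTW_FORWARD, FFTW_ESTIMATE);
      fftw_plan inv = fftw_plan_dft_2d(n0, n1, matrix, matrix,
                                       FFTW_BACKWARD, FFTW_ESTIMATE);
      fftw_execute(fwd);                       /* FFT(matrix)           */
      for (ptrdiff_t i = 0; i < n; i++)        /* conjugate multiply    */
          matrix[i] *= conj(pattern[i]) / n;   /* fold in 1/n normalize */
      fftw_execute(inv);                       /* IFFT(...)             */
      fftw_destroy_plan(fwd);
      fftw_destroy_plan(inv);
  }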

> Here, threads would be assigned to particular cores, and the scheduler
> may not move a thread from one core to another if it is assigned to a
> given core.

> As for SIMD vs vectors, as I see it, SIMD seems to make sense in that it
> is cheap and simple.

If you are happy adding 1,000+ instructions to your ISA, then yes.

> The Cell cores were, if anything, more of a "SIMD First, ALU Second"
> approach, building it around 128-bit registers but only using part of
> these for integer code.

> I went a slightly different direction, using 64-bit registers that may
> be used in pairs for 128-bit ops. This may make more sense if one
> assumes that the core is going to be used for a lot more general purpose
> code, rather than used almost entirely for SIMD.

> I have some hesitation about "vector processing", as it seems fairly
> alien to how this stuff normally sort of works; seems more complicated
> than SIMD for an implementation; ...

Vector design is a lot more about the memory system (feeding the beast)
than the core (the beast) consuming memory BW.

> It is arguably more scalable, but as I see it, much past 64 or 128 bit
> vectors, SIMD rapidly goes into diminishing returns, and it makes more
> sense to be like "128-bit is good enough" than to try to chase after
> ever wider SIMD vectors.

Architecture is more about "what to leave OUT" than about "what to put in".

> But, I can also note that even for semi-general use, an ISA design like
> RV64G is suffers a significant disadvantage, say, vs my own ISA, in the

They disobeyed the "what to leave out" and "what to put in" rules.

>

Re: Vectors

<5ab4fe83b3768e97b884e3ab04956d36@www.novabbs.org>

https://news.novabbs.org/devel/article-flat.php?id=37229&group=comp.arch#37229

Date: Mon, 5 Feb 2024 19:46:55 +0000
Subject: Re: Vectors
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$IeBU1pHbHpBZIZWSth35EekGH.L.RnxiCOwDEILWitumXuOwwQgxC
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <upq0cr$6b5m$1@dont-email.me> <2024Feb5.084424@mips.complang.tuwien.ac.at> <upqn9d$a0i2$1@dont-email.me>
Organization: Rocksolid Light
Message-ID: <5ab4fe83b3768e97b884e3ab04956d36@www.novabbs.org>
 by: MitchAlsup1 - Mon, 5 Feb 2024 19:46 UTC

Quadibloc wrote:

> On Mon, 05 Feb 2024 07:44:24 +0000, Anton Ertl wrote:

>> Who would buy such a microprocessor? Megabytes? Laughable. If
>> that's intended to be a buffer for main memory, you need the
>> main-memory bandwidth;

> Well, the original Cray I had a main memory of eight megabytes, and the
> Cray Y-MP had up to 512 megabytes of memory.

CRAY-1 could access one 64-bit memory container per cycle continuously.
CRAY-XMP could access 3 64-bit memory containers (2 LD, 1 ST) per cycle
continuously.
Where memory started at about 16 cycles away (12.5ns version) and ended
up about 30 cycles away (6ns version) and a memory bank could be accessed
about every 7 cycles.

> I was keeping as close to the original CELL design as possible, but
> certainly one could try to improve. After all, if Intel could make
> a device like the Xeon Phi, having multiple CPUs on a chip all sharing
> access to external memory, however inadequate, could still be done (but
> then I wouldn't be addressing Mitch Alsup's objection).

> Instead of imitating the CELL, or the Xeon Phi, for that matter, what
> I think of as a more practical way to make a consumer Cray-like chip
> would be to put only one core in a package, and give that core an
> eight-channel memory bus.

> Some older NEC designs used a sixteen-channel memory bus, but I felt
> that eight channels will already be expensive for a consumer product.

> Given Mitch Alsup's objection, though, I threw out the opposite kind
> of design, one patterned after the CELL, as one that maybe could allow
> a vector CPU to churn out more FLOPs. But as I noted, it seems to have
> the fatal flaw of very little capacity for any kind of useful work...
> which is kind of the whole point of any CPU.

> John Savard

Re: A Very Bad Idea

<upropl$g561$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37235&group=comp.arch#37235

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: bohannonindustriesllc@gmail.com (BGB-Alt)
Newsgroups: comp.arch
Subject: Re: A Very Bad Idea
Date: Mon, 5 Feb 2024 16:51:33 -0600
Organization: A noiseless patient Spider
Lines: 243
Message-ID: <upropl$g561$1@dont-email.me>
References: <upq0cr$6b5m$1@dont-email.me> <upq9va$7s40$1@dont-email.me>
<694fe8eeffcb95294990ef3c8e3e5494@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 5 Feb 2024 22:51:33 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="812c9dbf097286612dcdf56c29f941a7";
logging-data="529601"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/y50X2nZAz8mIlynvVXQfGLAP9adA//co="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:v+dOjF9blil7JTkFKf/eaeiPK9c=
Content-Language: en-US
In-Reply-To: <694fe8eeffcb95294990ef3c8e3e5494@www.novabbs.org>
 by: BGB-Alt - Mon, 5 Feb 2024 22:51 UTC

On 2/5/2024 1:43 PM, MitchAlsup1 wrote:
> BGB wrote:
>
>> On 2/5/2024 12:48 AM, Quadibloc wrote:
>>> I am very fond of the vector architecture of the Cray I and
>>> similar machines, because it seems to me the one way of
>>> increasing computer performance that proved effective in
>>> the past that still isn't being applied to microprocessors
>>> today.
>>>
>>> Mitch Alsup, however, has noted that such an architecture is
>>> unworkable today due to memory bandwidth issues. The one
>>> extant example of this architecture these days, the NEC
>>> SX-Aurora TSUBASA, keeps its entire main memory of up to 48
>>> gigabytes on the same card as the CPU, with a form factor
>>> resembling a video card - it doesn't try to use the main
>>> memory bus of a PC motherboard. So that seems to confirm
>>> this.
>>>
>>> These days, Moore's Law has limped along well enough to allow
>>> putting a lot of cache memory on a single die and so on.
>>>
>>> So, perhaps it might be possible to design a chip that is
>>> basically similar to the IBM/SONY CELL microprocessor,
>>> except that the satellite processors handle Cray-style vectors,
>>> and have multiple megabytes of individual local storage.
>>>
>>> It might be possible to design such a chip. The main processor
>>> with access to external DRAM would be a conventional processor,
>>> with only ordinary SIMD vector capabilities. And such a chip
>>> might well be able to execute lots of instructions if one runs
>>> a suitable benchmark on it.
>>>
>
>> One doesn't need to disallow access to external RAM, but maybe:
>> Memory coherence is fairly weak for these cores;
>> The local RAM addresses are treated as "strongly preferable".
>
>> Or, say, there is a region on RAM that is divided among the cores,
>> where the core has fast access to its own local chunk, but slow access
>> to any of the other chunks (which are treated more like external RAM).
>
> Large FFTs do not fit n this category. FFTs are one of the most valuable
> means of calculating Great Big Physics "stuff". We used FFT back in the
> NMR lab to change a BigO( n^3 ) problem into 2×BigO( n×log(n) ) problem.
> VERY Many big physics simulations do similarly.
> That problem was matrix-matrix multiplication !!
> MultipliedMatrix = IFFT( ConjugateMultiply( FFT( matrix ), pattern ) );
>
> {where pattern was FFTd earlier }
>
> Lookup the data access pattern and apply that knowledge to TB sized
> matrixes and then ask yourself if caches bring anything to the party ?
>

AFAIK, it is typical for FFT-style transforms (such as DCT) beyond a
certain size to decompose them into a grid and then perform them in
multiple stages and layers?...

Say, if you want a 64x64, you decompose it into 8x8, then apply a second
level on the DC coefficients of the first level.

>> Here, threads would be assigned to particular cores, and the scheduler
>> may not move a thread from one core to another if it is assigned to a
>> given core.
>
>
>> As for SIMD vs vectors, as I see it, SIMD seems to make sense in that
>> it is cheap and simple.
>
> If you are happy adding 1,000+ instructions to your ISA, then yes.
>

If you leave out stuff like op+convert, saturating arithmetic, ...
The number of needed instructions drops significantly.

Basically, most of the stuff that leads to an MxN expansion in the
number of instructions.

So, for example, I left out packed-byte operations and saturating ops,
instead:
  Packed Integer:
    4x Int16
    2x|4x Int32
  Packed Float:
    4x Binary16
    2x|4x Binary32
    2x Binary64

I also decided that there would be no native vector sizes other than 64
and 128 bits.

>> The Cell cores were, if anything, more of a "SIMD First, ALU Second"
>> approach, building it around 128-bit registers but only using part of
>> these for integer code.
>
>> I went a slightly different direction, using 64-bit registers that may
>> be used in pairs for 128-bit ops. This may make more sense if one
>> assumes that the core is going to be used for a lot more general
>> purpose code, rather than used almost entirely for SIMD.
>
>
>> I have some hesitation about "vector processing", as it seems fairly
>> alien to how this stuff normally sort of works; seems more complicated
>> than SIMD for an implementation; ...
>
> Vector design is a lot more about the memory system (feeding the beast)
> than the core (the beast) consuming memory BW.
>

Yeah.

Seems to have the weirdness that the vectors are often more defined in
terms of memory arrays, rather than the SIMD-like "well, this register
has 4 datums".

SIMD works well if one also has a lot of data with 3 or 4 elements,
which is typical.

A lot of the larger-vector SIMD and autovectorization stuff seems more
focused on the vector-style usage, rather than datasets based on 3D and
4D vector math though (say, with a lot of DotProduct and CrossProduct
and similar).
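
(The 4-element case is the one SIMD registers map onto directly; for
example, a 4D dot product in SSE intrinsics, as a sketch assuming an
SSE4.1-capable target:)

  #include <immintrin.h>

  /* Dot product of two (x, y, z, w) vectors: one multiply-and-sum op. */
  float dot4(const float *a, const float *b)
  {
      __m128 va = _mm_loadu_ps(a);
      __m128 vb = _mm_loadu_ps(b);
      return _mm_cvtss_f32(_mm_dp_ps(va, vb, 0xF1));
  }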

But, as for implementation:
SIMD is basically the same as normal scalar stuff...
Vector, is not.

>> It is arguably more scalable, but as I see it, much past 64 or 128 bit
>> vectors, SIMD rapidly goes into diminishing returns, and it makes more
>> sense to be like "128-bit is good enough" than to try to chase after
>> ever wider SIMD vectors.
>
> Architecture is more about "what to leave OUT" as about "what to put in".
>

Yeah.

As I see it, AVX-256 was already in diminishing returns.
Then AVX-512 is more sort of a gimmick that ends up left out of most of
the CPUs anyways, and for most embedded CPUs, they are 64 or 128-bit
vectors and call it done.

>> But, I can also note that even for semi-general use, an ISA design
>> like RV64G is suffers a significant disadvantage, say, vs my own ISA,
>> in the
>
> They disobeyed the "what to leave out" and "what to put in" rules.
>

As I can note, needing to add a whole bunch of stuff to the FPU, and
still not seeing much gain (even if I simulate superscalar operation in
the emulator, still "not particularly fast" at this).

Granted, GCC seems pretty bent on using FMADD.S and similar, even though
in my core this is slower than separate FMUL.S+FADD.S would be, and it
seems that options for other targets (like "-fno-fused-multiply-add" or
"-fno-madd4" and similar) are not recognized for RISC-V.

It is kind of a crappy balance in some areas.
Does pretty good for code-size at least, but performance seems lacking
in everything I am testing.

Though, ironically, it is not so much a "uniform badness", as often the
"hot spots" seems to be more concentrated, like, over-all it is doing
OK, except in the spots where it ends up shooting itself in the foot...

Contrast, for a lot of the same programs on BJX2, things are slightly
less concentrated on the hot-spots.

Well, and the other difference that code for BJX2 seems to spend roughly
3x as many clock-cycles in memory ops (rather than being more ALU bound;
even after adding a special-case to reduce ALU ops to 1 cycle).

So, Doom's breakdown for RV64 was more like:
ADD/Shift
Branch
Mem
Quake's is more like:
ADD
FMADD/etc, FDIV.S
Mem
Branch

For BJX2, it was generally:
Mem
Branch
ALU and CMP
...

For Software Quake, it should have been a fairer comparison, since both
are mostly using the C version of Quake's software renderer, which
doesn't really make much use of any special features in my ISA.

RV64G still doesn't win...

For scalar-FPU-intensive stuff, RV64G could theoretically have had an
advantage, as it has a lot more special-case ops, vs, say:
Well, you get Binary64 FADD, FSUB, and FMUL...

Well, and compare, and a few converter ops, ...
And then all the "float" operations nominally run through the
double-precision FPU operations, ...

Meanwhile, for GLQuake, TKRA-GL makes more significant use of features
in my ISA, and it shows. It is mostly high-single-digit to
low-double-digit fps on BJX2, and around 0-1 fps on RV64G (basically
entirely unplayable; by another stat, averaging roughly 0.8 fps).

Maybe one could do a decent OpenGL implementation with the 'V'
extension, but I have my doubts, and don't currently have any intent to
mess with the V extension.

Main differences between RV64G and BJX2 regarding OpenGL:
Mostly that BJX2 has SIMD ops;
Well, and twice the GPRs, since this code is very register-hungry,
with some functions having 100+ local variables.


Re: A Very Bad Idea?

<uprp2k$g01t$5@dont-email.me>

 by: Lawrence D'Oliveiro - Mon, 5 Feb 2024 22:56 UTC

On Mon, 5 Feb 2024 06:48:59 -0000 (UTC), Quadibloc wrote:

> I am very fond of the vector architecture of the Cray I and similar
> machines, because it seems to me the one way of increasing computer
> performance that proved effective in the past that still isn't being
> applied to microprocessors today.
>
> Mitch Alsup, however, has noted that such an architecture is unworkable
> today due to memory bandwidth issues.

RISC-V has a long-vector feature very consciously modelled on the Cray
one. It eschews the short-vector SIMD fashion that has infested so many
architectures these days precisely because the resulting combinatorial
explosion in added instructions makes a mockery of the “R” in “RISC”.

Re: A Very Bad Idea?

<931eb925d511299017093364c2ac3b63@www.novabbs.org>

 by: MitchAlsup1 - Sat, 10 Feb 2024 23:27 UTC

Lawrence D'Oliveiro wrote:

> On Mon, 5 Feb 2024 06:48:59 -0000 (UTC), Quadibloc wrote:

>> I am very fond of the vector architecture of the Cray I and similar
>> machines, because it seems to me the one way of increasing computer
>> performance that proved effective in the past that still isn't being
>> applied to microprocessors today.
>>
>> Mitch Alsup, however, has noted that such an architecture is unworkable
>> today due to memory bandwidth issues.

> RISC-V has a long-vector feature very consciously modelled on the Cray
> one. It eschews the short-vector SIMD fashion that has infested so many
> architectures these days precisely because the resulting combinatorial
> explosion in added instructions makes a mockery of the “R” in “RISC”.

So does the C extension--it's all redundant...

Re: Cray style vectors (was: A Very Bad Idea)

<uqge2p$279ql$1@dont-email.me>

 by: Marcus - Tue, 13 Feb 2024 18:57 UTC

On 2024-02-05, Quadibloc wrote:
> I am very fond of the vector architecture of the Cray I and
> similar machines, because it seems to me the one way of
> increasing computer performance that proved effective in
> the past that still isn't being applied to microprocessors
> today.
>
> Mitch Alsup, however, has noted that such an architecture is
> unworkable today due to memory bandwidth issues. The one
> extant example of this architecture these days, the NEC
> SX-Aurora TSUBASA, keeps its entire main memory of up to 48
> gigabytes on the same card as the CPU, with a form factor
> resembling a video card - it doesn't try to use the main
> memory bus of a PC motherboard. So that seems to confirm
> this.
>

FWIW I would just like to share my positive experience with MRISC32
style vectors (very similar to Cray 1, except 32-bit instead of 64-bit).

My machine can start and finish at most one 32-bit operation on every
clock cycle, so it is very simple. The same thing goes for vector
operations: at most one 32-bit vector element per clock cycle.

Thus, it always feels like using vector instructions would not give any
performance gains. Yet, every time I vectorize a scalar loop (basically
change scalar registers for vector registers), I see a very healthy
performance increase.

I attribute this to reduced loop overhead, eliminated hazards, reduced
I$ pressure and possibly improved cache locality and reduced register
pressure.
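
A back-of-the-envelope for why that happens, using SAXPY as the example
(plain C; a VL of 16 elements per vector register is an assumed figure,
and the per-iteration instruction counts are rough):

/* Scalar form: per element roughly
      load x[i], load y[i], mul, add, store y[i], bump i, compare+branch
   i.e. about 7 dynamic instructions per element, of which only 2 do
   arithmetic. */
void saxpy(long n, float a, const float *x, float *y)
{
    for (long i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Vector form (not spelled out here): the same body becomes roughly
      2 vector loads + vector mul + vector add + vector store + loop control
   per VL elements, so the loop-control and addressing overhead drops by
   about a factor of VL even though the machine still retires only one
   32-bit element per cycle. */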

(I know very well that VVM gives similar gains without the VRF)

I guess my point here is that I think that there are opportunities in
the very low-end space (e.g. in-order cores) to improve performance by
simply adding MRISC32-style vector support. I think that the gains would
be even bigger for non-pipelined machines, which could start "pumping"
the execute stage on every cycle when processing vectors, skipping the
fetch and decode cycles.

BTW, I have also noticed that I often only need a very limited number of
vector registers in the core vectorized loops (e.g. 2-4 registers), so I
don't think that the VRF has to be excruciatingly big to add value to a
small core. I also envision that for most cases you never have to
preserve vector registers over function calls. I.e. there's really no
need to push/pop vector registers to the stack, except for context
switches (which I believe should be optimized by tagging unused vector
registers to save on stack bandwidth).

/Marcus

Re: Cray style vectors (was: A Very Bad Idea)

<uqhiqb$2grub$1@dont-email.me>

 by: Quadibloc - Wed, 14 Feb 2024 05:24 UTC

On Tue, 13 Feb 2024 19:57:28 +0100, Marcus wrote:

> (I know very well that VVM gives similar gains without the VRF)

Other than the Cray I being around longer than VVM, what good is
a vector register file?

The obvious answer is that it's internal storage, rather than main
memory, so it's useful for the same reason that cache memory is
useful - access to frequently used values is much faster.

But there's also one very bad thing about a vector register file.

Like any register file, it has to be *saved* and *restored* under
certain circumstances. Most especially, it has to be saved before,
and restored after, other user-mode programs run, even if they
aren't _expected_ to use vectors, as a program interrupted by
a real-time-clock interrupt to let other users do stuff has to
be able to *rely* on its registers all staying undisturbed, as if
no interrupts happened.

So, the vector register file being a _large shared resource_, one
faces the dilemma... make extra copies for as many programs as may
be running, or save and restore it.
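
For scale, a back-of-the-envelope using the original Cray-1 register
file (8 vector registers of 64 elements x 64 bits; these are the Cray-1
figures, not numbers from any proposal in this thread):

/* Vector state a context switch would have to preserve on a
   Cray-1-sized register file: 8 x 64 x 64 bits = 4096 bytes,
   on top of the ordinary scalar state. */
#include <stdio.h>

int main(void)
{
    const int vregs = 8, elems = 64, bits = 64;
    printf("%d bytes of vector state\n", vregs * elems * bits / 8);
    return 0;
}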

I've come up with _one_ possible solution. Remember the Texas Instruments
9900, which kept its registers in memory, because it was a 16-bit CPU
back when there weren't really enough gates on a die to make one
possible... leading to fast context switching?

Well, why not have an on-chip memory, smaller than L2 cache but made
of similar memory cells, and use it for multiple vector register files,
indicated by a pointer register?

But then the on-chip memory has to be divided into areas locked off
from different users, just like external DRAM, and _that_ becomes
a bit painful to contemplate.

The Cray I was intended to be used basically in *batch* mode. Having
a huge vector register file in an ISA meant for *timesharing* is the
problem.

Perhaps what is really needed is VVM combined with some very good
cache hinting mechanisms. I don't have the expertise needed to work
that out, so I'll have to settle for something rather more kludgey
instead.

Of course, if a Cray I is a *batch* processing computer, that sort
of justifies the notion I came up with earlier - in a thread I
aptly titled "A Very Bad Idea" - of making a Cray I-like CPU with
vector registers an auxiliary processor after the fashion of those
in the IBM/Sony CELL processor. But one wants high-bandwidth access
to DRAM, not no access to DRAM!

The NEC SX-Aurora TSUBASA solves the issue by putting all its DRAM
inside a module that looks a lot like a video card. You just have to
settle for 48 gigabytes of memory that won't be expandable.

Some database computers, of course, have as much as a terabyte of
DRAM - which used to be the size of a large magnetic hard drive.

People who can afford a terabyte of DRAM can also afford an eight-channel
memory bus, so it should be possible to manage something.

John Savard

Re: Cray style vectors (was: A Very Bad Idea)

<uqhkft$2h7hd$1@dont-email.me>

 by: Quadibloc - Wed, 14 Feb 2024 05:53 UTC

On Wed, 14 Feb 2024 05:24:27 +0000, Quadibloc wrote:

> Of course, if a Cray I is a *batch* processing computer, that sort
> of justifies the notion I came up with earlier - in a thread I
> aptly titled "A Very Bad Idea"

Didn't look very carefully. That's _this_ thread.

John Savard

Re: Cray style vectors (was: A Very Bad Idea)

<20240214111422.0000453c@yahoo.com>

 by: Michael S - Wed, 14 Feb 2024 09:14 UTC

On Tue, 13 Feb 2024 19:57:28 +0100
Marcus <m.delete@this.bitsnbites.eu> wrote:

> On 2024-02-05, Quadibloc wrote:
> > I am very fond of the vector architecture of the Cray I and
> > similar machines, because it seems to me the one way of
> > increasing computer performance that proved effective in
> > the past that still isn't being applied to microprocessors
> > today.
> >
> > Mitch Alsup, however, has noted that such an architecture is
> > unworkable today due to memory bandwidth issues. The one
> > extant example of this architecture these days, the NEC
> > SX-Aurora TSUBASA, keeps its entire main memory of up to 48
> > gigabytes on the same card as the CPU, with a form factor
> > resembling a video card - it doesn't try to use the main
> > memory bus of a PC motherboard. So that seems to confirm
> > this.
> >
>
> FWIW I would just like to share my positive experience with MRISC32
> style vectors (very similar to Cray 1, except 32-bit instead of
> 64-bit).
>

Does it mean that you have 8 VRs and each VR is 2048 bits?

> My machine can start and finish at most one 32-bit operation on every
> clock cycle, so it is very simple. The same thing goes for vector
> operations: at most one 32-bit vector element per clock cycle.
>
> Thus, it always feels like using vector instructions would not give
> any performance gains. Yet, every time I vectorize a scalar loop
> (basically change scalar registers for vector registers), I see a
> very healthy performance increase.
>
> I attribute this to reduced loop overhead, eliminated hazards, reduced
> I$ pressure and possibly improved cache locality and reduced register
> pressure.
>
> (I know very well that VVM gives similar gains without the VRF)
>
> I guess my point here is that I think that there are opportunities in
> the very low end space (e.g. in order) to improve performance by
> simply adding MRISC32-style vector support. I think that the gains
> would be even bigger for non-pipelined machines, that could start
> "pumping" the execute stage on every cycle when processing vectors,
> skipping the fetch and decode cycles.
>
> BTW, I have also noticed that I often only need a very limited number
> of vector registers in the core vectorized loops (e.g. 2-4
> registers), so I don't think that the VRF has to be excruciatingly
> big to add value to a small core.

It depends on what you are doing.
If you want good performance in matrix-multiply type algorithms, then
8 VRs would not take you very far; 16 VRs are a LOT better. More than 16
VRs can help somewhat, but the difference between 32 and 16 (in this
type of kernel) is much, much smaller than the difference between 8 and
16.
Radix-4 and mixed-radix FFT are probably similar, except that I never
profiled them as thoroughly as I did SGEMM.
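
A plain-C sketch of why the VR count matters here (conceptual only:
each "vreg" stands for one vector register of VL floats, and VL and the
4x2 tile shape are assumed numbers): the accumulator tile of an SGEMM
micro-kernel has to stay resident in vector registers for the whole
K loop, so the register file size directly bounds the tile size, and a
bigger tile means each loaded element of A and B is reused more times.

#include <string.h>

#define VL 8                        /* elements per vector register (assumed) */
typedef struct { float e[VL]; } vreg;

static vreg vfma(vreg acc, float a, vreg b)   /* acc += a * b, per lane */
{
    for (int i = 0; i < VL; i++)
        acc.e[i] += a * b.e[i];
    return acc;
}

/* 4 x (2*VL) micro-kernel: 8 accumulator "registers" plus 2 for B.
   A 16-entry VRF allows a larger accumulator tile, and hence more
   reuse of each loaded A and B element, which is where the jump from
   8 to 16 VRs pays off. */
void sgemm_4x2(const float *A, const float *B, float *C,
               int K, int lda, int ldb, int ldc)
{
    vreg c[4][2];
    memset(c, 0, sizeof c);
    for (int k = 0; k < K; k++) {
        vreg b0, b1;
        for (int i = 0; i < VL; i++) {
            b0.e[i] = B[k * ldb + i];
            b1.e[i] = B[k * ldb + VL + i];
        }
        for (int m = 0; m < 4; m++) {
            float a = A[m * lda + k];
            c[m][0] = vfma(c[m][0], a, b0);
            c[m][1] = vfma(c[m][1], a, b1);
        }
    }
    for (int m = 0; m < 4; m++)
        for (int j = 0; j < 2; j++)
            for (int i = 0; i < VL; i++)
                C[m * ldc + j * VL + i] += c[m][j].e[i];
}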

> I also envision that for most cases
> you never have to preserve vector registers over function calls. I.e.
> there's really no need to push/pop vector registers to the stack,
> except for context switches (which I believe should be optimized by
> tagging unused vector registers to save on stack bandwidth).
>
> /Marcus

If CRAY-style VRs work for you, that's no proof that lighter VRs, e.g.
ARM Helium-style, would not work as well or better.
My personal opinion is that even for low-end in-order cores the
CRAY-like huge ratio between VR width and execution width is far from
optimal. A ratio of 8 looks closer to optimal when performance of
vectorized loops is a top priority; a ratio of 4 is a wise choice
otherwise.

Re: Cray style vectors (was: A Very Bad Idea)

<_65zN.88066$GX69.80216@fx46.iad>

 by: Scott Lurndal - Wed, 14 Feb 2024 15:37 UTC

Quadibloc <quadibloc@servername.invalid> writes:
>On Tue, 13 Feb 2024 19:57:28 +0100, Marcus wrote:
>
>> (I know very well that VVM gives similar gains without the VRF)
>
>Other than the Cray I being around longer than VVM, what good is
>a vector register file?
>
>The obvious answer is that it's internal storage, rather than main
>memory, so it's useful for the same reason that cache memory is
>useful - access to frequently used values is much faster.
>
>But there's also one very bad thing about a vector register file.
>
>Like any register file, it has to be *saved* and *restored* under
>certain circumstances.

The Cray systems weren't used as general purpose timesharing systems.

Re: Cray style vectors (was: A Very Bad Idea)

<uqisbo$56hp$1@newsreader4.netcologne.de>

 by: Thomas Koenig - Wed, 14 Feb 2024 17:13 UTC

Scott Lurndal <scott@slp53.sl.home> schrieb:
> Quadibloc <quadibloc@servername.invalid> writes:
>>On Tue, 13 Feb 2024 19:57:28 +0100, Marcus wrote:
>>
>>> (I know very well that VVM gives similar gains without the VRF)
>>
>>Other than the Cray I being around longer than VVM, what good is
>>a vector register file?
>>
>>The obvious answer is that it's internal storage, rather than main
>>memory, so it's useful for the same reason that cache memory is
>>useful - access to frequently used values is much faster.
>>
>>But there's also one very bad thing about a vector register file.
>>
>>Like any register file, it has to be *saved* and *restored* under
>>certain circumstances.
>
> The Cray systems weren't used as general purpose timesharing systems.

They were used as database servers, though - fast I/O, cheaper than
an IBM machine of the same performance.

Or so I heard, ~ 30 years ago.

Re: Cray style vectors

<d98606045bb45556ec1c1b5fc6ef9aae@www.novabbs.org>

 by: MitchAlsup1 - Wed, 14 Feb 2024 20:45 UTC

Thomas Koenig wrote:

> Scott Lurndal <scott@slp53.sl.home> schrieb:
>> Quadibloc <quadibloc@servername.invalid> writes:
>>>On Tue, 13 Feb 2024 19:57:28 +0100, Marcus wrote:
>>>
>>>> (I know very well that VVM gives similar gains without the VRF)
>>>
>>>Other than the Cray I being around longer than VVM, what good is
>>>a vector register file?
>>>
>>>The obvious answer is that it's internal storage, rather than main
>>>memory, so it's useful for the same reason that cache memory is
>>>useful - access to frequently used values is much faster.
>>>
>>>But there's also one very bad thing about a vector register file.
>>>
>>>Like any register file, it has to be *saved* and *restored* under
>>>certain circumstances.
>>
>> The Cray systems weren't used as general purpose timesharing systems.

> They were used as database server, though - fast I/O, cheaper than
> an IBM machine of the same performance.

The only thing they lacked for timesharing was paging: CRAYs had a
base-and-bounds memory map. They made up for the lack of paging with a
stupidly fast I/O system.
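
For reference, base-and-bounds translation in a few lines of C (a
generic illustration of the scheme, not of any specific CRAY register
layout): every access is checked against a per-process limit and offset
by a base, with no page tables or TLB involved, which keeps context
small but rules out demand paging.

#include <stdbool.h>
#include <stdint.h>

typedef struct { uint64_t base, bound; } bb_map;

bool bb_translate(const bb_map *m, uint64_t vaddr, uint64_t *paddr)
{
    if (vaddr >= m->bound)
        return false;          /* outside the segment: fault */
    *paddr = m->base + vaddr;  /* simple relocation */
    return true;
}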

> Or so I heard, ~ 30 years ago.

Should be closer to 40 years ago.

Re: Cray style vectors

<uqks3a$39imk$1@dont-email.me>

 by: Quadibloc - Thu, 15 Feb 2024 11:21 UTC

On Wed, 14 Feb 2024 20:45:36 +0000, MitchAlsup1 wrote:
> Thomas Koenig wrote:
>> Scott Lurndal <scott@slp53.sl.home> schrieb:
>>> Quadibloc <quadibloc@servername.invalid> writes:

>>>>But there's also one very bad thing about a vector register file.
>>>>
>>>>Like any register file, it has to be *saved* and *restored* under
>>>>certain circumstances.
>>>
>>> The Cray systems weren't used as general purpose timesharing systems.

I wasn't intending this as a criticism of the Cray systems, but
of my plan to copy their vector architecture in a chip intended
for general purpose desktop computer use.

>> They were used as database server, though - fast I/O, cheaper than
>> an IBM machine of the same performance.

Interesting.

> The only thing they lacked for timesharing was paging:: CRAYs had a
> base and bounds memory map. They made up for lack of paging with an
> stupidly fast I/O system.

Good to know; the Cray I was a success, so it's good to learn from
it.

John Savard

Re: Cray style vectors

<uqlm2c$3e9bp$1@dont-email.me>

 by: Marcus - Thu, 15 Feb 2024 18:44 UTC

On 2024-02-14, Quadibloc wrote:
> On Tue, 13 Feb 2024 19:57:28 +0100, Marcus wrote:
>
>> (I know very well that VVM gives similar gains without the VRF)
>
> Other than the Cray I being around longer than VVM, what good is
> a vector register file?
>
> The obvious answer is that it's internal storage, rather than main
> memory, so it's useful for the same reason that cache memory is
> useful - access to frequently used values is much faster.
>
> But there's also one very bad thing about a vector register file.
>
> Like any register file, it has to be *saved* and *restored* under
> certain circumstances. Most especially, it has to be saved before,
> and restored after, other user-mode programs run, even if they
> aren't _expected_ to use vectors, as a program interrupted by
> a real-time-clock interrupt to let other users do stuff has to
> be able to *rely* on its registers all staying undisturbed, as if
> no interrupts happened.
>

Yes, that is the major drawback of a vector register file, so it has to
be dealt with somehow.

My current vision (not MRISC32), which is a very simple
microcontroller type implementation (basically in the same ballpark as
Cortex-M or small RV32I implementations), would have a relatively
limited vector register file.

I scribbled down a suggestion here:

* https://gitlab.com/-/snippets/3673883

In particular, pay attention to the sections "Vector state on context
switches" and "Thread context".

My idea is not new, but I think that it takes some old ideas a few steps
further. So here goes...

There are four vector registers (V1-V4), each consisting of 8 x 32 bits,
for a grand total of 128 bytes of vector thread context state. To start
with, this is not an enormous amount of state (it's the same size as the
integer register file of RV32I).

Each vector register is associated with a "vector in use" flag, which is
set as soon as the vector register is written to.

The novel part (AFAIK) is that all "vector in use" flags are cleared as
soon as a function returns (rts) or another function is called (bl/jl),
which takes advantage of the ABI that says that all vector registers are
scratch registers.

I then predict that the ISA will have some sort of intelligent store
and restore state instructions, which will only spend memory cycles
on vector registers that are marked as "in use". I also predict that
most vector registers will be unused most of the time (except for
threads that use up 100% CPU time with heavy data processing, which
should hopefully be in the minority - especially in the kind of systems
where you want to put a microcontroller-style CPU).

I do not yet know if this will fly, though...
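
A very rough plain-C model of the scheme just described, under the
assumptions stated above (four vector registers of 8 x 32 bits, an
"in use" bit per register that is set on write and cleared on
call/return, and a save path that only spills the marked registers);
purely illustrative, not any real ISA's state layout:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NVREG  4
#define VELEMS 8

typedef struct {
    uint32_t v[NVREG][VELEMS];   /* V1..V4 contents */
    uint8_t  in_use;             /* one "in use" bit per vector register */
} vctx;

/* The hardware would do these two implicitly. */
void vreg_write(vctx *c, int r, const uint32_t *val)
{
    memcpy(c->v[r], val, sizeof c->v[r]);
    c->in_use |= 1u << r;        /* set on write */
}

void on_call_or_return(vctx *c)
{
    c->in_use = 0;               /* ABI: all vector registers are scratch */
}

/* "Intelligent" store-state: spill only the registers marked in use.
   For most interrupted threads this saves nothing at all, which is
   the point. */
size_t save_vector_state(const vctx *c, uint32_t *save_area)
{
    size_t words = 0;
    for (int r = 0; r < NVREG; r++) {
        if (c->in_use & (1u << r)) {
            memcpy(save_area + words, c->v[r], sizeof c->v[r]);
            words += VELEMS;
        }
    }
    return words;
}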

> So, the vector register file being a _large shared resource_, one
> faces the dilemma... make extra copies for as many programs as may
> be running, or save and restore it.
>
> I've come up with _one_ possible solution. Remember the Texas Instruments
> 9900, which kept its registers in memory, because it was a 16-bit CPU
> back when there weren't really enough gates on a die to make one
> possible... leading to fast context switching?
>
> Well, why not have an on-chip memory, smaller than L2 cache but made
> of similar memory cells, and use it for multiple vector register files,
> indicated by a pointer register?
>

I have had a similar idea for "big" implementations that have a huge
vector register file. My idea, though, is more of a hybrid: Basically
keep a few copies (e.g. 4-8 copies?) of vector registers for hot threads
that can be quickly switched between (no cost - just a logical "vector
register file ID" that is changed), and then have a more or less
separate memory path to a bigger vector register file cache, and swap
register file copies in/out of the hot storage asynchronously.

I'm not sure if it would be feasible to either implement next-thread
prediction in hardware, or get help from the OS in the form of hints
about the next likely thread(s) to execute, but the idea is that it
should be possible to hide most of the context switch overhead this way.

> But then the on-chip memory has to be divided into areas locked off
> from different users, just like external DRAM, and _that_ becomes
> a bit painful to contemplate.
>

Wouldn't a kernel space "thread ID" or "vector register file ID" do?

> The Cray I was intended to be used basically in *batch* mode. Having
> a huge vector register file in an ISA meant for *timesharing* is the
> problem.
>
> Perhaps what is really needed is VVM combined with some very good
> cache hinting mechanisms. I don't have the expertise needed to work
> that out, so I'll have to settle for something rather more kludgey
> instead.
>
> Of course, if a Cray I is a *batch* processing computer, that sort
> of justifies the notion I came up with earlier - in a thread I
> aptly titled "A Very Bad Idea" - of making a Cray I-like CPU with
> vector registers an auxilliary processor after the fashion of those
> in the IBM/Sony CELL processor. But one wants high-bandwidth access
> to DRAM, not no access to DRAM!
>
> The NEC SX-Aurora TSUBASA solves the issue by putting all its DRAM
> inside a module that looks a lot like a video card. You just have to
> settle for 48 gigabytes of memory that won't be expandable.
>
> Some database computers, of course, have as much as a terabyte of
> DRAM - which used to be the size of a large magnetic hard drive.
>
> People who can afford a terabyte of DRAM can also afford an eight-channel
> memory bus, so it should be possible to manage something.
>
> John Savard

Re: Cray style vectors

<uqln04$3e9bp$2@dont-email.me>

 by: Marcus - Thu, 15 Feb 2024 19:00 UTC

On 2024-02-14, Michael S wrote:
> On Tue, 13 Feb 2024 19:57:28 +0100
> Marcus <m.delete@this.bitsnbites.eu> wrote:
>
>> On 2024-02-05, Quadibloc wrote:
>>> I am very fond of the vector architecture of the Cray I and
>>> similar machines, because it seems to me the one way of
>>> increasing computer performance that proved effective in
>>> the past that still isn't being applied to microprocessors
>>> today.
>>>
>>> Mitch Alsup, however, has noted that such an architecture is
>>> unworkable today due to memory bandwidth issues. The one
>>> extant example of this architecture these days, the NEC
>>> SX-Aurora TSUBASA, keeps its entire main memory of up to 48
>>> gigabytes on the same card as the CPU, with a form factor
>>> resembling a video card - it doesn't try to use the main
>>> memory bus of a PC motherboard. So that seems to confirm
>>> this.
>>>
>>
>> FWIW I would just like to share my positive experience with MRISC32
>> style vectors (very similar to Cray 1, except 32-bit instead of
>> 64-bit).
>>
>
> Does it means that you have 8 VRs and each VR is 2048 bits?

No. MRISC32 has 32 VRs. I think that's too many, but it was the natural
number of registers as I have five-bit vector address fields in the
instruction encoding (because 32 scalar registers). I have been thinking
about reducing it to 16 vector registers and finding some clever use for
the MSB (e.g. '1'=mask, '0'=don't mask), but I'm not there yet.

The number of vector elements in each register is implementation
defined, but currently the minimum number of vector elements is set to
16 (I wanted to set it relatively high to push myself to come up with
solutions to problems related to large vector registers).

Each vector element is 32 bits wide.

So, in total: 32 x 16 x 32 bits = 16384 bits

This is, incidentally, exactly the same as for AVX-512.

>> My machine can start and finish at most one 32-bit operation on every
>> clock cycle, so it is very simple. The same thing goes for vector
>> operations: at most one 32-bit vector element per clock cycle.
>>
>> Thus, it always feels like using vector instructions would not give
>> any performance gains. Yet, every time I vectorize a scalar loop
>> (basically change scalar registers for vector registers), I see a
>> very healthy performance increase.
>>
>> I attribute this to reduced loop overhead, eliminated hazards, reduced
>> I$ pressure and possibly improved cache locality and reduced register
>> pressure.
>>
>> (I know very well that VVM gives similar gains without the VRF)
>>
>> I guess my point here is that I think that there are opportunities in
>> the very low end space (e.g. in order) to improve performance by
>> simply adding MRISC32-style vector support. I think that the gains
>> would be even bigger for non-pipelined machines, that could start
>> "pumping" the execute stage on every cycle when processing vectors,
>> skipping the fetch and decode cycles.
>>
>> BTW, I have also noticed that I often only need a very limited number
>> of vector registers in the core vectorized loops (e.g. 2-4
>> registers), so I don't think that the VRF has to be excruciatingly
>> big to add value to a small core.
>
> It depends on what you are doing.
> If you want good performance in matrix multiply type of algorithm then
> 8 VRs would not take you very far. 16 VRs are ALOT better. More than 16
> VR can help somewhat, but the difference between 32 and 16 (in this
> type of kernels) is much much smaller than difference between 8 and
> 16.
> Radix-4 and mixed-radix FFT are probably similar except that I never
> profiled as thoroughly as I did SGEMM.
>

I expect that people will want to do such things with an MRISC32 core.
However, for the "small cores" that I'm talking about, I doubt that they
would even have floating-point support. It's more a question of simple
loop optimizations - e.g. the kinds you find in libc or software
rasterization kernels. For those you will often get lots of work done
with just four vector registers.

>> I also envision that for most cases
>> you never have to preserve vector registers over function calls. I.e.
>> there's really no need to push/pop vector registers to the stack,
>> except for context switches (which I believe should be optimized by
>> tagging unused vector registers to save on stack bandwidth).
>>
>> /Marcus
>
> If CRAY-style VRs work for you it's no proof than lighter VRs, e.g. ARM
> Helium-style, would not work as well or better.
> My personal opinion is that even for low ens in-order cores the
> CRAY-like huge ratio between VR width and execution width is far from
> optimal. Ratio of 8 looks like more optimal in case when performance of
> vectorized loops is a top priority. Ratio of 4 is a wise choice
> otherwise.

For MRISC32 I'm aiming for splitting a vector operation into four. That
seems to eliminate most RAW hazards, as execution pipelines tend to be at
most four stages long (or thereabouts). So, with a pipeline width of 128
bits (which seems to be the go-to width for many implementations), you
want registers that have 4 x 128 = 512 bits, which is one of the reasons
that I mandate at least 512-bit vector registers in MRISC32.

Of course, nothing is set in stone, but so far that has been my
thinking.

/Marcus

Re: Cray style vectors

<33223fbf85b2f6b478a658b186f3e9cd@www.novabbs.org>

 by: MitchAlsup1 - Thu, 15 Feb 2024 19:12 UTC

Marcus wrote:

> On 2024-02-14, Quadibloc wrote:
>>
>> Like any register file, it has to be *saved* and *restored* under
>> certain circumstances. Most especially, it has to be saved before,
>> and restored after, other user-mode programs run, even if they
>> aren't _expected_ to use vectors, as a program interrupted by
>> a real-time-clock interrupt to let other users do stuff has to
>> be able to *rely* on its registers all staying undisturbed, as if
>> no interrupts happened.
>>

> Yes, that is the major drawback of a vector register file, so it has to
> be dealt with somehow.

> My current vision (not MRISC32), which is a very simple
> microcontroller type implementation (basically in the same ballpark as
> Cortex-M or small RV32I implementations), would have a relatively
> limited vector register file.

> I scribbled down a suggestion here:

> * https://gitlab.com/-/snippets/3673883

> In particular, pay attention to the sections "Vector state on context
> switches" and "Thread context".

> My idea is not new, but I think that it takes some old ideas a few steps
> further. So here goes...

> There are four vector registers (V1-V4), each consisting of 8 x 32 bits,
> for a grand total of 128 bytes of vector thread context state. To start
> with, this is not an enormous amount of state (it's the same size as the
> integer register file of RV32I).

> Each vector register is associated with a "vector in use" flag, which is
> set as soon as the vector register is written to.

> The novel part (AFAIK) is that all "vector in use" flags are cleared as
> soon as a function returns (rts) or another function is called (bl/jl),
> which takes advantage of the ABI that says that all vector registers are
> scratch registers.

> I then predict that the ISA will have some sort of intelligent store
> and restore state instructions, that will only waste memory cycles
> for vector registers that are marked as "in use". I also predict that
> most vector registers will be unused most of the time (except for
> threads that use up 100% CPU time with heavy data processing, which
> should hopefully be in minority - especially in the kind of systems
> where you want to put a microcontroller style CPU).

VVM is designed such that even ISRs can use the vectorized parts of the
implementation. Move data, clear pages, string.h, ... so allowing GuestOSs
to use vectorization falls out for free.

> I do not yet know if this will fly, though...

Re: Cray style vectors

<20240215230033.00000e64@yahoo.com>

 by: Michael S - Thu, 15 Feb 2024 21:00 UTC

On Thu, 15 Feb 2024 20:00:20 +0100
Marcus <m.delete@this.bitsnbites.eu> wrote:

> On 2024-02-14, Michael S wrote:
> > On Tue, 13 Feb 2024 19:57:28 +0100
> > Marcus <m.delete@this.bitsnbites.eu> wrote:
> >
> >> On 2024-02-05, Quadibloc wrote:
> >>> I am very fond of the vector architecture of the Cray I and
> >>> similar machines, because it seems to me the one way of
> >>> increasing computer performance that proved effective in
> >>> the past that still isn't being applied to microprocessors
> >>> today.
> >>>
> >>> Mitch Alsup, however, has noted that such an architecture is
> >>> unworkable today due to memory bandwidth issues. The one
> >>> extant example of this architecture these days, the NEC
> >>> SX-Aurora TSUBASA, keeps its entire main memory of up to 48
> >>> gigabytes on the same card as the CPU, with a form factor
> >>> resembling a video card - it doesn't try to use the main
> >>> memory bus of a PC motherboard. So that seems to confirm
> >>> this.
> >>>
> >>
> >> FWIW I would just like to share my positive experience with MRISC32
> >> style vectors (very similar to Cray 1, except 32-bit instead of
> >> 64-bit).
> >>
> >
> > Does it means that you have 8 VRs and each VR is 2048 bits?
>
> No. MRISC32 has 32 VRs. I think it's too much, but it was the natural
> number of registers as I have five-bit vector address fields in the
> instruction encoding (because 32 scalar registers). I have been
> thinking about reducing it to 16 vector registers, and find some
> clever use for the MSB (e.g. '1'=mask, '0'=don't mask), but I'm not
> there yet.
>
> The number of vector elements in each register is implementation
> defined, but currently the minimum number of vector elements is set to
> 16 (I wanted to set it relatively high to push myself to come up with
> solutions to problems related to large vector registers).
>
> Each vector element is 32 bits wide.
>
> So, in total: 32 x 16 x 32 bits = 16384 bits
>
> This is, incidentally, exactly the same as for AVX-512.
>
> >> My machine can start and finish at most one 32-bit operation on
> >> every clock cycle, so it is very simple. The same thing goes for
> >> vector operations: at most one 32-bit vector element per clock
> >> cycle.
> >>
> >> Thus, it always feels like using vector instructions would not give
> >> any performance gains. Yet, every time I vectorize a scalar loop
> >> (basically change scalar registers for vector registers), I see a
> >> very healthy performance increase.
> >>
> >> I attribute this to reduced loop overhead, eliminated hazards,
> >> reduced I$ pressure and possibly improved cache locality and
> >> reduced register pressure.
> >>
> >> (I know very well that VVM gives similar gains without the VRF)
> >>
> >> I guess my point here is that I think that there are opportunities
> >> in the very low end space (e.g. in order) to improve performance by
> >> simply adding MRISC32-style vector support. I think that the gains
> >> would be even bigger for non-pipelined machines, that could start
> >> "pumping" the execute stage on every cycle when processing vectors,
> >> skipping the fetch and decode cycles.
> >>
> >> BTW, I have also noticed that I often only need a very limited
> >> number of vector registers in the core vectorized loops (e.g. 2-4
> >> registers), so I don't think that the VRF has to be excruciatingly
> >> big to add value to a small core.
> >
> > It depends on what you are doing.
> > If you want good performance in matrix multiply type of algorithm
> > then 8 VRs would not take you very far. 16 VRs are ALOT better.
> > More than 16 VR can help somewhat, but the difference between 32
> > and 16 (in this type of kernels) is much much smaller than
> > difference between 8 and 16.
> > Radix-4 and mixed-radix FFT are probably similar except that I never
> > profiled as thoroughly as I did SGEMM.
> >
>
> I expect that people will want to do such things with an MRISC32 core.
> However, for the "small cores" that I'm talking about, I doubt that
> they would even have floating-point support. It's more a question of
> simple loop optimizations - e.g. the kinds you find in libc or
> software rasterization kernels. For those you will often get lots of
> work done with just four vector registers.
>
> >> I also envision that for most cases
> >> you never have to preserve vector registers over function calls.
> >> I.e. there's really no need to push/pop vector registers to the
> >> stack, except for context switches (which I believe should be
> >> optimized by tagging unused vector registers to save on stack
> >> bandwidth).
> >>
> >> /Marcus
> >
> > If CRAY-style VRs work for you it's no proof than lighter VRs, e.g.
> > ARM Helium-style, would not work as well or better.
> > My personal opinion is that even for low ens in-order cores the
> > CRAY-like huge ratio between VR width and execution width is far
> > from optimal. Ratio of 8 looks like more optimal in case when
> > performance of vectorized loops is a top priority. Ratio of 4 is a
> > wise choice otherwise.
>
> For MRISC32 I'm aiming for splitting a vector operation into four.
> That seems to eliminate most RAW hazards as execution pipelines tend
> to be at most four stages long (or thereabout). So, with a pipeline
> width of 128 bits (which seems to be the goto width for many
> implementations), you want registers that have 4 x 128 = 512 bits,
> which is one of the reasons that I mandate at least 512-bit vector
> registers in MRISC32.
>
> Of course, nothing is set in stone, but so far that has been my
> thinking.
>
> /Marcus

Sounds quite reasonable, but I wouldn't call it "Cray-style".
