Rocksolid Light



devel / comp.arch / Adding Flexibility to the BSP

Subject  /  Author
* Adding Flexibility to the BSP  (Quadibloc)
`* Re: Adding Flexibility to the BSP  (Anton Ertl)
 +- Re: Adding Flexibility to the BSP  (Terje Mathisen)
 `* Re: Adding Flexibility to the BSP  (Michael S)
  +* Re: Adding Flexibility to the BSP  (Anton Ertl)
  |`* Re: Adding Flexibility to the BSP  (Michael S)
  | `* Re: Adding Flexibility to the BSP  (Thomas Koenig)
  |  `* Re: Adding Flexibility to the BSP  (Michael S)
  |   `- Re: Adding Flexibility to the BSP  (Terje Mathisen)
  `* Re: Adding Flexibility to the BSP  (Quadibloc)
   `- Re: Adding Flexibility to the BSP  (MitchAlsup)

Adding Flexibility to the BSP

<78a98e65-1043-4691-833f-32e048c22f03n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33791&group=comp.arch#33791

Newsgroups: comp.arch
From: jsavard@ecn.ab.ca (Quadibloc)
Subject: Adding Flexibility to the BSP
Date: Wed, 23 Aug 2023 17:33:05 -0700 (PDT)
 by: Quadibloc - Thu, 24 Aug 2023 00:33 UTC

The Burroughs Scientific Processor was a supercomputer design
that attempted to solve the problem of accessing memory with
stride through a unique memory design.
There were seventeen memory banks, each with its own data bus
and address bus, and consecutive words in memory rotated through
the memory banks.
So if one was retrieving a large block of consecutive words from the
memory, one could always use the full width of the bus on every
cycle (except the last). And if one wanted to access words occurring
at regular intervals - with a stride - as long as the displacement
between consecutive words wasn't a multiple of 17, one could also
always use the full width.
So unlike a conventional memory organization, doing matrix
multiplication on a 2048 by 2048 matrix would be no problem;
2048, as a power of two, isn't relatively prime to 16, but it is
relatively prime to 17.
An easier fix would be to store the matrix in a (2048,2049) matrix -
or (2049,2048) depending on whether your language is row-major
or column-major - but, hey, it was a thought.
Now, I've noted that a 36-bit single-precision float, using the IEEE-754
hidden first bit, could combine the precision of IBM 7090 floats with
the exponent range of IBM 360 floats. And that the Control Data
6600 showed that 60-bit double precision is accurate.
But there's no way to generalize the BSP design to handle both of
these widths with the same advantage for stride, unless one accepts
a serious inefficiency (i.e. have 17 memory banks, each with a 12-bit
data bus, store successive words of each float in the same memory
bank).
But if one used different lengths of numbers, maybe something is
possible.
37 and 61 are both prime numbers. So one just has a 37 times 61-bit
data bus chopped both ways!
But that's awfully wide. Can we do better?
38 and 62 are 2*19 and 2*31. The 2 is embedded in both widths,
so it doesn't interfere with avoiding issues: have a memory system that
looks like 31 38-bit-wide memory banks, or 19 62-bit-wide memory
banks. That cuts the width almost in half!
We can go for a third of the width: 39 is 3 times 13, and 63 is 3 times 21.
21 isn't prime (it's 3 times 7), but 3 and 7 are also tolerable as values
that can be excluded as factors of efficient strides; it's really only 2 and 5
that we're worried about. So now the width is 13 times 63, or 21 times 39.
I was hoping for something that would also allow an
intermediate-length float, with a precision of about 10 decimal digits,
to work as well. I could see an arrangement handling intermediates
and doubles: 49 bits and 63 bits, so now we have 9 times 49, or
7 times 63.

John Savard

Re: Adding Flexibility to the BSP

<2023Aug24.095257@mips.complang.tuwien.ac.at>

https://news.novabbs.org/devel/article-flat.php?id=33793&group=comp.arch#33793

Newsgroups: comp.arch
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Subject: Re: Adding Flexibility to the BSP
Date: Thu, 24 Aug 2023 07:52:57 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
References: <78a98e65-1043-4691-833f-32e048c22f03n@googlegroups.com>
 by: Anton Ertl - Thu, 24 Aug 2023 07:52 UTC

Quadibloc <jsavard@ecn.ab.ca> writes:
>So unlike a conventional memory organization, doing matrix
>multiplication on a 2048 by 2048 matrix would be no problem;

A correct statement would be: "Like a conventional memory
organization, ...".

Here's a way to implement matrix multiplication that uses stride-1
accesses on the c and b arrays and always accesses the same element of
a in the inner loop:

for (i=0; i<n; i++)
  for (k=0; k<m; k++)
    for (j=0; j<p; j++)
      c[i*p+j] += a[i*m+k]*b[k*p+j];

Here the two-dimensional array accesses are explicitly converted into
one-dimensional accesses to make it more obvious what is going on.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Adding Flexibility to the BSP

<uc77lk$3dhsb$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33794&group=comp.arch#33794

Newsgroups: comp.arch
From: terje.mathisen@tmsw.no (Terje Mathisen)
Subject: Re: Adding Flexibility to the BSP
Date: Thu, 24 Aug 2023 11:25:07 +0200
References: <78a98e65-1043-4691-833f-32e048c22f03n@googlegroups.com>
 <2023Aug24.095257@mips.complang.tuwien.ac.at>
 by: Terje Mathisen - Thu, 24 Aug 2023 09:25 UTC

Anton Ertl wrote:
> Quadibloc <jsavard@ecn.ab.ca> writes:
>> So unlike a conventional memory organization, doing matrix
>> multiplication on a 2048 by 2048 matrix would be no problem;
>
> A correct statement would be: "Like a conventional memory
> organization, ...".
>
> Here's a way to implement matrix multiplication that uses stride-1
> accesses on the c and b arrays and always accesses the same element of
> a in the inner loop:
>
> for (i=0; i<n; i++)
> for (k=0; k<m; k++)
> for (j=0; j<p; j++)
> c[i*p+j]+=a[i*m+k]*b[k*p+j];
>
> Here the two-dimensional array accesses are explicitly converted into
> one-dimensional accesses to make it more obvious what is going on.

That is a nice way to organize it.

A compiler would probably convert/optimize it to something similar to this?

ai = a;
ci = c;
for (i=0; i<n; i++) {
  bk = b;
  for (k=0; k<m; k++) {
    aik = ai[k];
    for (j=0; j<p; j++) {
      ci[j] += aik * bk[j];
    }
    bk += p;
  }
  ai += m;
  ci += p;
}

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Adding Flexibility to the BSP

<8b6be7d2-58fd-4f29-bfa2-25dff7acc48dn@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33795&group=comp.arch#33795

Newsgroups: comp.arch
From: already5chosen@yahoo.com (Michael S)
Subject: Re: Adding Flexibility to the BSP
Date: Thu, 24 Aug 2023 03:00:32 -0700 (PDT)
In-Reply-To: <2023Aug24.095257@mips.complang.tuwien.ac.at>
 by: Michael S - Thu, 24 Aug 2023 10:00 UTC

On Thursday, August 24, 2023 at 11:04:35 AM UTC+3, Anton Ertl wrote:
> Quadibloc <jsa...@ecn.ab.ca> writes:
> >So unlike a conventional memory organization, doing matrix
> >multiplication on a 2048 by 2048 matrix would be no problem;
> A correct statement would be: "Like a conventional memory
> organization, ...".
>
> Here's a way to implement matrix multiplication that uses stride-1
> accesses on the c and b arrays and always accesses the same element of
> a in the inner loop:
>
> for (i=0; i<n; i++)
> for (k=0; k<m; k++)
> for (j=0; j<p; j++)
> c[i*p+j]+=a[i*m+k]*b[k*p+j];
>
> Here the two-dimensional array accesses are explicitly converted into
> one-dimensional accesses to make it more obvious what is going on.
>
> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

DAXPY-based matrix multiplication is simple and its performance is relatively
robust, but it is not likely to reach top performance on any modern
general-purpose CPU. There are too many stores in the inner loop :(
Last time I tried it on Skylake(C), even more sophisticated variations of
this idea, like updating 2 lines of c[] with 3-4 lines of b[] in the common
inner loop, were 20-30% slower than the best schemes.

For non-small matrices, all the best schemes were variations of multiple dot
products: m rows of a[] multiplied by n columns of b[] in the inner loop,
where n is typically 4 or 5 times the number of lanes in a SIMD register,
i.e. n=16 or 20 for DGEMM on AVX2.
These schemes are relatively fragile due to two factors:
- they contain non-unit-stride accesses to b[] in the inner loop
- one has to have enough accumulators to hide the latency of FMA.
Despite the fragility, with proper tuning they are the fastest.
There are two main approaches for making the non-unit-stride access bearable:
1. Create a compact copy of n columns of b[] and reuse it over many
rows of a[] and c[].
2. Divide a[] into vertical stripes and b[] into horizontal stripes.
Select the stripe size Ks such that n*Ks easily fits in the L1D cache.

The 1st approach is best when all dimensions of a[] and b[] are big.
The 2nd approach can be preferable for relatively smaller matrices,
but it is sensitive to the physical width of matrix b[]: it fails badly
when the width of b[] is an exact multiple of the L1D way size.

I'd imagine that on something like POWER7/8/9/10, because of the
write-through policy of their L1D cache, the disadvantage of DAXPY vs. dot
is even bigger than on Skylake(C).

Re: Adding Flexibility to the BSP

<2023Aug24.133019@mips.complang.tuwien.ac.at>

https://news.novabbs.org/devel/article-flat.php?id=33796&group=comp.arch#33796

Newsgroups: comp.arch
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Subject: Re: Adding Flexibility to the BSP
Date: Thu, 24 Aug 2023 11:30:19 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
 by: Anton Ertl - Thu, 24 Aug 2023 11:30 UTC

Michael S <already5chosen@yahoo.com> writes:
>On Thursday, August 24, 2023 at 11:04:35 AM UTC+3, Anton Ertl wrote:
>> Quadibloc <jsa...@ecn.ab.ca> writes:
>> >So unlike a conventional memory organization, doing matrix
>> >multiplication on a 2048 by 2048 matrix would be no problem;
>> A correct statement would be: "Like a conventional memory
>> organization, ...".
....
>Daxpy-based matrix multiplication is simple and performance is relatively
>robust, but it is not likely to reach top performance on any of modern
>general-purpose CPUs.

My point was that you don't need funny memory organizations in order
to get good matrix multiplication performance.

"DAXPY-based" matrix multiplication shows that in four lines of code.

If I use five lines of code to implement a "DDOT-based" matrix
multiplication (in both cases I call neither DAXPY nor DDOT), I fall
prey to the limitations of the memory subsystem:

for (i=0; i<n; i++)
  for (j=0; j<p; j++) {
    for (k=0, r=0.0; k<m; k++)
      r += a[i*m+k]*b[k*p+j];
    c[i*p+j] = r;
  }

On an Ivy Bridge this takes 23.3 cycles per inner-loop iteration with
n,p,m=700 (probably similar for n,p,m=2048 without transparent huge
pages). THP can be hit or miss; when they hit, it's 4.4 cycles per
iteration.

By contrast, the "DAXPY-based" variant I showed earlier takes
1.4 cycles per original iteration with auto-vectorization from gcc -O3.

OpenBLAS with one thread takes 0.36 cycles/original iteration, so yes,
there is room for improvement beyond the 4-line variant shown above,
but OpenBLAS is much bigger and more complex than the 4-line variant.

0.36 cycles/original iteration means 11.11 DP FLOPS/cycle.
<https://en.wikichip.org/wiki/flops> reports that Ivy Bridge can
perform no more than 8 DP FLOPS per cycle, so this makes me wonder
what's wrong.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Adding Flexibility to the BSP

<5d1bf48d-b534-4100-a12a-681de71c0aebn@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33797&group=comp.arch#33797

Newsgroups: comp.arch
From: already5chosen@yahoo.com (Michael S)
Subject: Re: Adding Flexibility to the BSP
Date: Thu, 24 Aug 2023 06:26:53 -0700 (PDT)
In-Reply-To: <2023Aug24.133019@mips.complang.tuwien.ac.at>
 by: Michael S - Thu, 24 Aug 2023 13:26 UTC

On Thursday, August 24, 2023 at 3:12:33 PM UTC+3, Anton Ertl wrote:
> Michael S <already...@yahoo.com> writes:
> >On Thursday, August 24, 2023 at 11:04:35 AM UTC+3, Anton Ertl wrote:
> >> Quadibloc <jsa...@ecn.ab.ca> writes:
> >> >So unlike a conventional memory organization, doing matrix
> >> >multiplication on a 2048 by 2048 matrix would be no problem;
> >> A correct statement would be: "Like a conventional memory
> >> organization, ...".
> ...
> >Daxpy-based matrix multiplication is simple and performance is relatively
> >robust, but it is not likely to reach top performance on any of modern
> >general-purpose CPUs.
> My point was that you don't need funny memory organizations in order
> to get good matrix multiplication performance.
>
> "DAXPY-based" matrix multiplication shows that in four lines of code.
>
> If I use five lines of code to implement a "DDOT-based" matrix
> multiplication (in both cases I call neither DAXPY nor DDOT), I fall
> prey to the limitations of the memory subsystem:
> for (i=0; i<n; i++)
> for (j=0; j<p; j++) {
> for (k=0, r=0.0; k<m; k++)
> r += a[i*m+k]*b[k*p+j];
> c[i*p+j]=r;
> }
>
> On an Ivy Bridge this takes 23.3 cycles per inner-loop iteration with
> n,p,m=700 (probably similar for n,p,m=2048 without transparent huge
> pages. THP can be hit or miss; when they hit, it's 4.4 cycles per
> iteration.
>
> By contrast, the "DAXPY-based" variant I showed earlier takes
> 1.4cycles per original iteration with auto-vectorization from gcc -O3.
>
> OpenBLAS with one thread takes 0.36cycles/original iteration, so yes,
> there is room for improvement beyond the 4-line variant shown above,
> but OpenBLAS is much bigger and more complex than the 4-line variant.
>
> 0.36 cycles/original iteration means 11.11 DP FLOPS/cycle.
> <https://en.wikichip.org/wiki/flops> reports that Ivy Bridge can
> perform no more than 8 DP FLOPS per cycle, so this makes me wonder
> what's wrong.

2 FLOP / 0.36 = 5.56 DP FLOPS/cycle. Rmax/Rpeak = 69%.
I didn't try, but I would think that for n,p,m=700 I can do a little better,
maybe Rmax/Rpeak = 75%.
Of course, it's much easier to optimize for a specific matrix size than to
write a DGEMM that is not too bad at all sizes and shapes.

To your bigger point, I fully agree that achieving reasonable performance
with the DAXPY approach is much easier. I am just saying that it does not
cut it for the very best. And what does cut it for the best is somewhat
sensitive to stride aliasing. Not horribly so, but you can't ignore the
issue completely.

> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Re: Adding Flexibility to the BSP

<uc7sjg$2rtq5$1@newsreader4.netcologne.de>

https://news.novabbs.org/devel/article-flat.php?id=33798&group=comp.arch#33798

Newsgroups: comp.arch
From: tkoenig@netcologne.de (Thomas Koenig)
Subject: Re: Adding Flexibility to the BSP
Date: Thu, 24 Aug 2023 15:22:24 -0000 (UTC)
Organization: news.netcologne.de
 by: Thomas Koenig - Thu, 24 Aug 2023 15:22 UTC

Michael S <already5chosen@yahoo.com> schrieb:

> Of course, it's much easier to optimize for specific matrix size than to write
> DGEMM that is not too bad at all sizes and shapes.

For small sizes (2,3,4,5) a straightforward loop is usually better,
because the highly-tuned BLAS libraries have a big startup time.
This is especially true if the size and strides are known at
compile time.

Example:

subroutine mm(a,b,c)
double precision, dimension(2,2) :: a, b, c
c = matmul(a,b)
end subroutine mm

will be translated, on my Rocket Lake home system, with -Ofast
-march=native, to

mm_:
.LFB0:
.cfi_startproc
vmovupd (%rdi), %ymm0
vmovupd (%rsi), %ymm1
vpermpd $68, %ymm0, %ymm3
vpermilpd $0, %ymm1, %ymm2
vpermpd $238, %ymm0, %ymm0
vpermilpd $15, %ymm1, %ymm1
vmulpd %ymm1, %ymm0, %ymm0
vfmadd231pd %ymm3, %ymm2, %ymm0
vmovupd %ymm0, (%rdx)
vzeroupper
ret

(I am confident that you or Terje will point out numerous glaring
inefficiencies in the compiler-generated code :-)

Re: Adding Flexibility to the BSP

<4d70c688-daf1-4965-aa36-6e1a9fe3dbc8n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33801&group=comp.arch#33801

Newsgroups: comp.arch
From: already5chosen@yahoo.com (Michael S)
Subject: Re: Adding Flexibility to the BSP
Date: Thu, 24 Aug 2023 09:43:49 -0700 (PDT)
In-Reply-To: <uc7sjg$2rtq5$1@newsreader4.netcologne.de>
 by: Michael S - Thu, 24 Aug 2023 16:43 UTC

On Thursday, August 24, 2023 at 6:22:28 PM UTC+3, Thomas Koenig wrote:
> Michael S <already...@yahoo.com> schrieb:
> > Of course, it's much easier to optimize for specific matrix size than to write
> > DGEMM that is not too bad at all sizes and shapes.
> For small sizes (2,3,4,5) a straightfoward loop is usually better,
> becaue the highly-tuned BLAS libraries have a big startup time.
> This is especially true if the size and strides are known at
> compile time.
>
> Example:
>
> subroutine mm(a,b,c)
> double precision, dimension(2,2) :: a, b, c
> c = matmul(a,b)
> end subroutine mm
>
> will be translated, on my Rocket Lake home system, with -Ofast
> -march=native, to
>
> mm_:
> .LFB0:
> .cfi_startproc
> vmovupd (%rdi), %ymm0
> vmovupd (%rsi), %ymm1
> vpermpd $68, %ymm0, %ymm3
> vpermilpd $0, %ymm1, %ymm2
> vpermpd $238, %ymm0, %ymm0
> vpermilpd $15, %ymm1, %ymm1
> vmulpd %ymm1, %ymm0, %ymm0
> vfmadd231pd %ymm3, %ymm2, %ymm0
> vmovupd %ymm0, (%rdx)
> vzeroupper
> ret
>
> (I am confident that you or Terje will point out numerous glaring
> inefficiencies in the compiler-generated code :-)

It is actually not bad at all.
I think that I can improve it a little by replacing
vmovupd (%rdi), %ymm0
vpermpd $68, %ymm0, %ymm3
vpermpd $238, %ymm0, %ymm0
by
vbroadcastf128 (%rdi), %ymm3
vbroadcastf128 16(%rdi), %ymm0

But it would be very close and in some less common situations
my variant could end up slower.

Also instead of
vpermilpd $0, %ymm1, %ymm2
vpermilpd $15, %ymm1, %ymm1
I'd write
vunpcklpd %ymm1,%ymm1,%ymm2
vunpckhpd %ymm1,%ymm1,%ymm1
It is not faster, but 4 bytes shorter, so it can be an advantage
when the function is called rarely.

But good work by compiler nevertheless.

Re: Adding Flexibility to the BSP

<uc9k47$3trlo$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=33809&group=comp.arch#33809

Newsgroups: comp.arch
From: terje.mathisen@tmsw.no (Terje Mathisen)
Subject: Re: Adding Flexibility to the BSP
Date: Fri, 25 Aug 2023 09:09:59 +0200
In-Reply-To: <4d70c688-daf1-4965-aa36-6e1a9fe3dbc8n@googlegroups.com>
 by: Terje Mathisen - Fri, 25 Aug 2023 07:09 UTC

Michael S wrote:
> On Thursday, August 24, 2023 at 6:22:28 PM UTC+3, Thomas Koenig wrote:
>> Michael S <already...@yahoo.com> schrieb:
>>> Of course, it's much easier to optimize for specific matrix size than to write
>>> DGEMM that is not too bad at all sizes and shapes.
>> For small sizes (2,3,4,5) a straightfoward loop is usually better,
>> becaue the highly-tuned BLAS libraries have a big startup time.
>> This is especially true if the size and strides are known at
>> compile time.
>>
>> Example:
>>
>> subroutine mm(a,b,c)
>> double precision, dimension(2,2) :: a, b, c
>> c = matmul(a,b)
>> end subroutine mm
>>
>> will be translated, on my Rocket Lake home system, with -Ofast
>> -march=native, to
>>
>> mm_:
>> .LFB0:
>> .cfi_startproc
>> vmovupd (%rdi), %ymm0
>> vmovupd (%rsi), %ymm1
>> vpermpd $68, %ymm0, %ymm3
>> vpermilpd $0, %ymm1, %ymm2
>> vpermpd $238, %ymm0, %ymm0
>> vpermilpd $15, %ymm1, %ymm1
>> vmulpd %ymm1, %ymm0, %ymm0
>> vfmadd231pd %ymm3, %ymm2, %ymm0
>> vmovupd %ymm0, (%rdx)
>> vzeroupper
>> ret
>>
>> (I am confident that you or Terje will point out numerous glaring
>> inefficiencies in the compiler-generated code :-)
>
>
> It is actually not bad at all.
> I think that I can improve it a little by replacing
> vmovupd (%rdi), %ymm0
> vpermpd $68, %ymm0, %ymm3
> vpermpd $238, %ymm0, %ymm0
> by
> vbroadcastf128 (%rdi), %ymm3
> vbroadcastf128 16(%rdi), %ymm0
>
> But it would be very close and in some less common situations
> my variant could end up slower.
>
> Also instead of
> vpermilpd $0, %ymm1, %ymm2
> vpermilpd $15, %ymm1, %ymm1
> I'd write
> vunpcklpd %ymm1,%ymm1,%ymm2
> vunpckhpd %ymm1,%ymm1,%ymm1
> It is not faster, but 4 bytes shorter, so can be of advantage
> when function is used rarely.
>
> But good work by compiler nevertheless.
>
I agree. Any time a compiler gets within 80-90% of my best hand-tuned
effort, I stop worrying about that particular algorithm and simply move
on to more useful challenges, leaving the compiler to do what it does so
well. (*1)

On a few (rare, but they do happen!) occasions, I have even learned new
approaches from looking at compiler output.

(*1) The only exception is programming competitions, where every single
cycle matters. For real code, even 50% of the speed of light/theoretical
max is usually fine.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Adding Flexibility to the BSP

<25265258-e03d-4a21-8e88-d8a6bf3c723cn@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33810&group=comp.arch#33810

Newsgroups: comp.arch
From: jsavard@ecn.ab.ca (Quadibloc)
Subject: Re: Adding Flexibility to the BSP
Date: Fri, 25 Aug 2023 06:28:07 -0700 (PDT)
In-Reply-To: <8b6be7d2-58fd-4f29-bfa2-25dff7acc48dn@googlegroups.com>
 by: Quadibloc - Fri, 25 Aug 2023 13:28 UTC

On Thursday, August 24, 2023 at 4:00:35 AM UTC-6, Michael S wrote:

> Daxpy-based matrix multiplication is simple and its performance is relatively
> robust, but it is not likely to reach top performance on any modern
> general-purpose CPU. There are too many stores in the inner loop :(

True, but my proposed memory architecture also doesn't work on modern
CPUs, because it depends on accessing individual array elements from
memory, so it's mismatched to DRAM. In any event, on further reflection,
the idea proposed here is simply too elaborate to be useful. If a derivative
of the BSP is to be used, it will have to be a much simpler one.
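For reference, the daxpy-style multiplication under discussion can be sketched
as follows (illustrative Python of my own; the thread gives no code). Every pass
of the inner loop both loads and stores a row of C, which is exactly the store
traffic the quoted remark points at:

```python
def matmul_daxpy(A, B, n):
    """C = A*B built from daxpy (y := y + a*x) updates on rows of C."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            a = A[i][k]
            # Inner loop: one load of B[k][j] plus a load AND a store of
            # C[i][j] per iteration -- the stores noted above.
            for j in range(n):
                C[i][j] += a * B[k][j]
    return C
```

A blocked or outer-product formulation would keep C[i][j] in a register
across the k loop instead, trading the stores for more complex indexing.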

Basically, this kind of memory structure would need to be applied to
static RAM inside a module, stacked with a packaging technique such as
that used for HBM.

My first thought of a simplification was to allow some memory wastage.

19*5 = 95, 23*4 = 92, and 31*3 = 93, so one could have a memory array
with 95 12-bit data lanes as its width that would have good non-interfering
sizes for double (multiples of 19 excluded), intermediate (multiples of 23
excluded), and single (multiples of 31 excluded).

However, it makes more sense to just offer 60 lanes, each 12 bits wide, and
simply add one to the row (or column, if the language is column-major, as
Fortran is) size when it's a round number that would cause conflicts.
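The padding rule can be sanity-checked numerically. In my own simplified model
(not from the post): with 60 lanes and elements w lanes wide (5 for double, 4
for intermediate, 3 for single), walking down a column of an n-wide row-major
array advances n*w lanes per step, and all 60/w lane groups get used before any
repeats exactly when gcd(n*w, 60) == w. Round sizes fail the test; adding one
to n fixes them:

```python
from math import gcd

LANES = 60  # sixty 12-bit lanes, as proposed above

def conflict_free(n, elem_lanes):
    """True if a column walk through an n-wide row-major array of
    elem_lanes-wide elements cycles through all LANES // elem_lanes
    distinct lane groups before any lane repeats."""
    stride = n * elem_lanes                   # lanes advanced per column step
    return gcd(stride, LANES) == elem_lanes   # same as gcd(n, LANES // elem_lanes) == 1

# A "round" row length of 100 conflicts for all three element widths,
# while padding it to 101 is conflict-free for all of them:
for w in (5, 4, 3):
    assert not conflict_free(100, w)
    assert conflict_free(101, w)
```

With 60 = 2*2*3*5 the bad row lengths are simply those sharing a factor
with 12, 15, or 20 (for w = 5, 4, 3 respectively), which is why a +1 on a
round size is enough.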

John Savard

Re: Adding Flexibility to the BSP

<268a78a2-21b8-4fe4-a8ea-a636fb8cf895n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=33813&group=comp.arch#33813

Newsgroups: comp.arch
 by: MitchAlsup - Fri, 25 Aug 2023 18:01 UTC

On Friday, August 25, 2023 at 8:28:10 AM UTC-5, Quadibloc wrote:
> On Thursday, August 24, 2023 at 4:00:35 AM UTC-6, Michael S wrote:
>
> > Daxpy-based matrix multiplication is simple and its performance is relatively
> > robust, but it is not likely to reach top performance on any modern
> > general-purpose CPU. There are too many stores in the inner loop :(
<
> True, but my proposed memory architecture also doesn't work on modern
> CPUs, because it depends on accessing individual array elements from
> memory, so it's mismatched to DRAM.
<
And especially mismatched to cache line sizes, sets, and capacities that make
DRAM accesses unpredictable. Also note: the BSP inherently had 16 CPUs
making 1 data access each per cycle, whereas a modern chip with 16
CPUs makes at most 1 line access every 4-8 beats.
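To put rough numbers on that comparison (my own back-of-envelope figures, not
from the post): if a strided column walk uses only one 8-byte word out of each
64-byte line, 16 modern cores fetching a line every ~6 beats deliver far fewer
useful words per cycle than 16 BSP CPUs each making one word access per cycle:

```python
# BSP-style: 16 CPUs, one useful word access each, every cycle.
bsp_words_per_cycle = 16 * 1

# Modern-style (assumed figures): 16 cores, one 64-byte line per ~6 beats
# ("every 4-8 beats" -> take 6), of which a strided access pattern may
# touch only a single 8-byte word.
cores, cycles_per_line = 16, 6
useful_words_per_line = 1
modern_words_per_cycle = cores * useful_words_per_line / cycles_per_line
# Roughly 2.7 useful words per cycle versus 16 for the BSP arrangement,
# about a 6x gap before clock-speed differences are factored in.
```

The gap closes only when the access pattern uses the whole line, which is
the point of padding row sizes rather than scattering elements.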
<
> In any event, on further reflection,
> the idea proposed here is simply too elaborate to be useful. If a derivative
> of the BSP is to be used, it will have to be a much simpler one.
>
> Basically, this kind of memory structure would need to be applied to
> static RAM inside a module, using a technique such as HBM.
>
> My first thought of a simplification was to allow some memory wastage.
>
> 19*5 = 95, 23*4 = 92, and 31*3 = 93, so one could have a memory array
> with 95 12-bit data lanes as its width that would have good non-interfering
> sizes for double (multiples of 19 excluded), intermediate (multiples of 23
> excluded), and single (multiples of 31 excluded).
>
> However, it makes more sense to just offer 60 lanes, each 12 bits wide, and
> simply add one to the row (or column, if the language is column-major, as
> Fortran is) size when it's a round number that would cause conflicts.
>
> John Savard
