Rocksolid Light



devel / comp.arch / hw gather vs emulation (was: Pros/cons of load/store many registers)

Subject (Author)
* hw gather vs emulation (was: Pros/cons of load/store many registers) (Michael S)
`- Re: hw gather vs emulation (was: Pros/cons of load/store many registers) (MitchAlsup)

hw gather vs emulation (was: Pros/cons of load/store many registers)

<74719232-5e78-410d-815e-af8f72060e20n@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=33749&group=comp.arch#33749

Newsgroups: comp.arch
Date: Tue, 22 Aug 2023 08:00:53 -0700 (PDT)
In-Reply-To: <2022Jul18.164832@mips.complang.tuwien.ac.at>
References: <t9d33r$n7t3$1@dont-email.me> <ba6d11ba-a703-480e-bc2f-8532c2774025n@googlegroups.com>
<2022Jul10.152942@mips.complang.tuwien.ac.at> <af0e4e8e-3131-4869-88aa-f9caf3fbe693n@googlegroups.com>
<2022Jul10.183037@mips.complang.tuwien.ac.at> <c8110e88-20af-4493-a454-7c249176273bn@googlegroups.com>
<2022Jul11.092211@mips.complang.tuwien.ac.at> <tah10g$q5n$1@gioia.aioe.org>
<5652880f-e4f3-42d2-87f7-ed0431b9dfa9n@googlegroups.com> <099f21c1-69ee-4d82-9dea-435af7fdc470n@googlegroups.com>
<2022Jul18.164832@mips.complang.tuwien.ac.at>
Message-ID: <74719232-5e78-410d-815e-af8f72060e20n@googlegroups.com>
Subject: hw gather vs emulation (was: Pros/cons of load/store many registers)
From: already5chosen@yahoo.com (Michael S)
 by: Michael S - Tue, 22 Aug 2023 15:00 UTC

On Monday, July 18, 2022 at 6:06:02 PM UTC+3, Anton Ertl wrote:
> Michael S <already...@yahoo.com> writes:
> >> The best I came up with is this:
> >> void bar(struct foo *a) {
> >>     for (int i = 0; i < 100; i += 4) {
> >>         double *src = (double *)&a[i];
> >>         __m128d x0y0 = _mm_loadu_pd(&src[0]);
> >>         __m128d z0x1 = _mm_loadu_pd(&src[2]);
> >>         __m128d y1z1 = _mm_loadu_pd(&src[4]);
> >>
> >>         __m128d x2y2 = _mm_loadu_pd(&src[6]);
> >>         __m128d z2x3 = _mm_loadu_pd(&src[8]);
> >>         __m128d y3z3 = _mm_loadu_pd(&src[10]);
> >>
> >>         __m128d sum01 = _mm_hadd_pd(x0y0, y1z1); // x0+y0 y1+z1
> >>         __m128d sum23 = _mm_hadd_pd(x2y2, y3z3); // x2+y2 y3+z3
> >>
> >>         sum01 = _mm_add_pd(sum01, z0x1); // x0+y0+z0 x1+y1+z1
> >>         sum23 = _mm_add_pd(sum23, z2x3); // x2+y2+z2 x3+y3+z3
> >>
> >>         _mm_store_sd (&a[i+0].x, sum01);
> >>         _mm_storeh_pd(&a[i+1].x, sum01);
> >>         _mm_store_sd (&a[i+2].x, sum23);
> >>         _mm_storeh_pd(&a[i+3].x, sum23);
> >>     }
> >> }
> >>
> >> It is compiled with '-O2 -msse3'.
> ...
> >BTW, I managed to write a variant that is slower than '-O3 -mavx2':
> >void bar(struct foo *a) {
> >    __m128i idx = _mm_setr_epi32(0, 3, 6, 9);
> >    for (int i = 0; i < 100; i += 4) {
> >        __m256d x = _mm256_i32gather_pd(&a[i].x, idx, 8); // x0 x1 x2 x3
> >        __m256d y = _mm256_i32gather_pd(&a[i].y, idx, 8); // y0 y1 y2 y3
> >        __m256d z = _mm256_i32gather_pd(&a[i].z, idx, 8); // z0 z1 z2 z3
> >
> >        __m256d sum = _mm256_add_pd(_mm256_add_pd(x, y), z);
> >
> >        __m128d sum01 = _mm256_castpd256_pd128(sum);
> >        __m128d sum23 = _mm256_extractf128_pd (sum, 1);
> >        _mm_store_sd (&a[i+0].x, sum01);
> >        _mm_storeh_pd(&a[i+1].x, sum01);
> >        _mm_store_sd (&a[i+2].x, sum23);
> >        _mm_storeh_pd(&a[i+3].x, sum23);
> >    }
> >}
> >
> >Almost 3 times slower than the fastest variant, and 1.9 times slower than scalar.
> >Conclusion: in theory gather is nice, but a slow gather is useless.
>
> Not completely: it means that people can write code with gather, and
> it will also run on CPUs with slow gather.
>
> >Maybe on Rocket Lake it is different...
>
> I measured these two variants on Rocket Lake and inserted them in the
> table:
>
> cycles
>       triples           soa  SIMD variant
> 3,321,291,963 3,307,463,383  scalar (-fno-tree-vectorize)
> 3,671,075,802 2,180,052,867  SSE (default)
> 3,419,182,771 1,985,494,293  AVX256 (-mavx)
> 3,186,257,089 1,239,727,179  AVX256 (-mavx -mtune=icelake-client)
> 2,232,211,548             -  manual AVX2 (-O2 -mavx2 -mtune=icelake-client)
> 2,220,491,357             -  manual sse3 (-O2 -msse3)
> 5,400,347,654             -  manual sg (-O2 -mavx2 -mtune=icelake-client)
> 2,142,562,755   506,829,616  AVX512 (-mavx512f)
>
> instructions
>        triples            soa  SIMD variant
> 12,726,107,863 12,762,107,866  scalar (-fno-tree-vectorize)
> 14,526,107,591  8,370,107,504  SSE (default)
> 12,744,107,411  6,210,107,350  AVX256 (-mavx)
> 11,394,108,660  3,510,108,566  AVX256 (-mavx -mtune=icelake-client)
>  6,894,109,062              -  manual AVX2 (-O2 -mavx2 -mtune=icelake-client)
>  7,776,109,058              -  manual sse3 (-O2 -msse3)
>  7,848,109,210              -  manual sg (-O2 -mavx2 -mtune=icelake-client)
>  6,138,107,244  1,368,107,189  AVX512 (-mavx512f)
>
> So gather is very slow on Rocket Lake, too. Your SSE3 version is in
> the same ballpark as your AVX2 version and the automatic AVX512F
> version.
>
> In any case, we see that auto-vectorization alone does not cut it; in this example,
> a manual layout change is necessary, and once we have that,
> auto-vectorization does ok.
> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Resurrecting last year's thread.
Last week, in another comp.arch thread, I figured out a cheap way to emulate gather
on AVX/AVX2. Maybe not cheap in absolute terms, but cheaper than any method
I had been able to think of before.
So I dug out this old microbenchmark and rewrote it with the new toy.

void bar(struct foo *a) {
    for (int i = 0; i < 100; i += 4) {
        __m256d x0 = _mm256_broadcast_sd(&a[i+0].x);  // x0 x0 x0 x0
        __m256d x1 = _mm256_broadcast_sd(&a[i+1].x);  // x1 x1 x1 x1
        __m256d x2 = _mm256_broadcast_sd(&a[i+2].x);  // x2 x2 x2 x2
        __m256d x3 = _mm256_broadcast_sd(&a[i+3].x);  // x3 x3 x3 x3
        __m256d x01 = _mm256_blend_pd(x0, x1, 2);     // x0 x1 x0 x0
        __m256d x23 = _mm256_blend_pd(x2, x3, 8);     // x2 x2 x2 x3
        __m256d x   = _mm256_blend_pd(x01, x23, 12);  // x0 x1 x2 x3

        __m256d y0 = _mm256_broadcast_sd(&a[i+0].y);  // y0 y0 y0 y0
        __m256d y1 = _mm256_broadcast_sd(&a[i+1].y);  // y1 y1 y1 y1
        __m256d y2 = _mm256_broadcast_sd(&a[i+2].y);  // y2 y2 y2 y2
        __m256d y3 = _mm256_broadcast_sd(&a[i+3].y);  // y3 y3 y3 y3
        __m256d y01 = _mm256_blend_pd(y0, y1, 2);     // y0 y1 y0 y0
        __m256d y23 = _mm256_blend_pd(y2, y3, 8);     // y2 y2 y2 y3
        __m256d y   = _mm256_blend_pd(y01, y23, 12);  // y0 y1 y2 y3

        __m256d z0 = _mm256_broadcast_sd(&a[i+0].z);  // z0 z0 z0 z0
        __m256d z1 = _mm256_broadcast_sd(&a[i+1].z);  // z1 z1 z1 z1
        __m256d z2 = _mm256_broadcast_sd(&a[i+2].z);  // z2 z2 z2 z2
        __m256d z3 = _mm256_broadcast_sd(&a[i+3].z);  // z3 z3 z3 z3
        __m256d z01 = _mm256_blend_pd(z0, z1, 2);     // z0 z1 z0 z0
        __m256d z23 = _mm256_blend_pd(z2, z3, 8);     // z2 z2 z2 z3
        __m256d z   = _mm256_blend_pd(z01, z23, 12);  // z0 z1 z2 z3

        __m256d sum = _mm256_add_pd(_mm256_add_pd(x, y), z);

        __m128d sum01 = _mm256_castpd256_pd128(sum);
        __m128d sum23 = _mm256_extractf128_pd (sum, 1);
        _mm_store_sd (&a[i+0].x, sum01);
        _mm_storeh_pd(&a[i+1].x, sum01);
        _mm_store_sd (&a[i+2].x, sum23);
        _mm_storeh_pd(&a[i+3].x, sum23);
    }
}

On Skylake client, the emulated-gather variant ended up 1.8 times faster
than "real" gather. On Haswell it was 3.6 times faster, and on Zen3
2.8 times faster.

Even considering that our case is rather bad for hardware gather relative to
emulation, these numbers are not pretty.

I wonder whether Rocket Lake's hardware gather fares any better.

Re: hw gather vs emulation (was: Pros/cons of load/store many registers)

<d311af20-fd8f-405f-b14d-c9ef7a96f5b8n@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=33750&group=comp.arch#33750

Newsgroups: comp.arch
Date: Tue, 22 Aug 2023 08:19:34 -0700 (PDT)
In-Reply-To: <74719232-5e78-410d-815e-af8f72060e20n@googlegroups.com>
References: <t9d33r$n7t3$1@dont-email.me> <ba6d11ba-a703-480e-bc2f-8532c2774025n@googlegroups.com>
<2022Jul10.152942@mips.complang.tuwien.ac.at> <af0e4e8e-3131-4869-88aa-f9caf3fbe693n@googlegroups.com>
<2022Jul10.183037@mips.complang.tuwien.ac.at> <c8110e88-20af-4493-a454-7c249176273bn@googlegroups.com>
<2022Jul11.092211@mips.complang.tuwien.ac.at> <tah10g$q5n$1@gioia.aioe.org>
<5652880f-e4f3-42d2-87f7-ed0431b9dfa9n@googlegroups.com> <099f21c1-69ee-4d82-9dea-435af7fdc470n@googlegroups.com>
<2022Jul18.164832@mips.complang.tuwien.ac.at> <74719232-5e78-410d-815e-af8f72060e20n@googlegroups.com>
Message-ID: <d311af20-fd8f-405f-b14d-c9ef7a96f5b8n@googlegroups.com>
Subject: Re: hw gather vs emulation (was: Pros/cons of load/store many registers)
From: MitchAlsup@aol.com (MitchAlsup)
 by: MitchAlsup - Tue, 22 Aug 2023 15:19 UTC

On Tuesday, August 22, 2023 at 10:00:57 AM UTC-5, Michael S wrote:
> > [snip: quoted exchange with Anton Ertl, unchanged from the post above]
> Resurrecting last year's thread.
> Last week, in another comp.arch thread, I figured out a cheap way to emulate gather
> on AVX/AVX2. Maybe not cheap in absolute terms, but cheaper than any method
> I had been able to think of before.
> So I dug out this old microbenchmark and rewrote it with the new toy.
>
> [snip: emulated-gather code, unchanged from the post above]
> On Skylake client, the emulated-gather variant ended up 1.8 times faster
> than "real" gather. On Haswell it was 3.6 times faster, and on Zen3
> 2.8 times faster.
>
> Even considering that our case is rather bad for hardware gather relative to
> emulation, these numbers are not pretty.
>
> I wonder whether Rocket Lake's hardware gather fares any better.
<
Interesting::
<
One must make an intellectual distinction between a dense set of memory
references A[k..k+n] and a sparse set of memory references, as provided
by gather/scatter. The former needs only (1 AGEN per cache line + 1) versus
(1 AGEN per data access). This distinction divides the AGEN load by 8x or
more, and divides cache tag and TLB accesses correspondingly.
<
There are other complications for gather/scatter, such as one instruction needing
to access multiple AGEN and cache ports per cycle; depending on the
microarchitecture, this may be easy (seldom) or near impossible (more often).
<
On the other hand, dense linear accesses can often be performed alongside
normal memory references because of that density: access the whole cache
line, then shuttle it out to the RF based on the available result/store-bus
bandwidth.
<
This reminds me of some code I wrote in 1982, where an application could
sustain higher file bandwidth when I read the file in, en masse, through a cache
I had programmed, and then accessed characters and/or data structures
from that cache {{all written in FORTRAN 77 using FORTRAN I/O}}.
