Rocksolid Light - comp.lang.asm.x86

CLZERO

<t5tea0$6hu$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=777&group=comp.lang.asm.x86#777


Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: Bonita.Montero@nospicedham.gmail.com (Bonita Montero)
Newsgroups: comp.lang.asm.x86
Subject: CLZERO
Date: Mon, 16 May 2022 13:58:56 +0200
Organization: A noiseless patient Spider
Lines: 71
Approved: fbkotler@myfairpoint.net - comp.lang.asm.x86 moderation team.
Message-ID: <t5tea0$6hu$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: reader02.eternal-september.org; posting-host="81d3b7cd7886e6ef1b29044dc9819df3";
logging-data="8021"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX192ot6Jpc4miTqSQ5V8Ha8MQ0SHkJV1pKM="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.9.0
Cancel-Lock: sha1:lMwJzxRR/Nkwb9Lr/TCDHBW9/48=

by: Bonita Montero - Mon, 16 May 2022 11:58 UTC

AMD x86 CPUs since Zen 1 have an instruction called CLZERO.
According to WikiChip it is meant for recovering from certain memory
errors, but that is pure nonsense. A posting on the LKML reveals
the actual purpose: zeroing memory quickly without polluting the
cache, i.e. CLZERO is non-temporal.
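
Since CLZERO only exists on Zen and later, it may help to check for it before executing it. Below is a minimal sketch, assuming the feature flag is bit 0 of EBX in CPUID leaf 0x80000008 as AMD's CPUID documentation describes. The names clzeroBitSet and cpuHasClzero are made up for this illustration.

```cpp
#include <cstdint>
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    #include <cpuid.h>
#elif defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    #include <intrin.h>
#endif

// Pure helper (hypothetical name): CLZERO support is reported in
// bit 0 of EBX for CPUID leaf 0x80000008.
constexpr bool clzeroBitSet( std::uint32_t ebx )
{
    return (ebx & 1u) != 0;
}

// Hypothetical wrapper: returns false on non-x86 targets or when the
// extended leaf is not available.
bool cpuHasClzero()
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    // __get_cpuid() returns 0 if the requested leaf is unsupported
    if( !__get_cpuid( 0x80000008u, &eax, &ebx, &ecx, &edx ) )
        return false;
    return clzeroBitSet( ebx );
#elif defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    int regs[4];
    __cpuid( regs, (int)0x80000000u );      // highest extended leaf
    if( (unsigned)regs[0] < 0x80000008u )
        return false;
    __cpuid( regs, (int)0x80000008u );
    return clzeroBitSet( (std::uint32_t)regs[1] );
#else
    return false;
#endif
}
```

A caller could test cpuHasClzero() once at startup and fall back to memset() when it returns false.
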

I thought it would be nice to have a comparison between a looped
clzero and a plain memset, which is usually very well optimized
by today's compilers and runtime libraries. So I wrote a small
benchmark in C++20 to compare the two:

#include <iostream>
#include <chrono>
#include <vector>
#include <memory>
#include <cstring>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#if defined(_MSC_VER)
    #include <intrin.h>
#elif defined(__GNUC__) || defined(__clang__)
    #include <x86intrin.h>  // GCC/Clang: build with -mclzero
#endif

using namespace std;
using namespace chrono;

// Zeroes the 64-byte-aligned part of [p, p + n) and returns its size.
// MemSet = true uses memset() instead of the CLZERO loop.
template<bool MemSet = false>
size_t clZeroRange( void *p, size_t n );

int main()
{
    constexpr size_t
        N = 0x4000000,      // 64 MiB buffer
        ROUNDS = 1'000;
    vector<char> vc( N, 0 );
    auto bench = [&]<bool MemSet>( bool_constant<MemSet> )
    {
        auto start = high_resolution_clock::now();
        size_t n = 0;
        for( size_t r = ROUNDS; r--; )
            n += clZeroRange<MemSet>( to_address( vc.begin() ), N );
        // total amount zeroed in GiB, then divided by elapsed seconds
        double GBS = (double)(ptrdiff_t)n / 0x1.0p30;
        cout << GBS / ((double)(int64_t)duration_cast<nanoseconds>(
            high_resolution_clock::now() - start ).count() / 1.0e9) << endl;
    };
    bench( false_type() );  // CLZERO loop
    bench( true_type() );   // memset()
}

template<bool MemSet>
size_t clZeroRange( void *p, size_t n )
{
    // round the start up to the next cache-line boundary
    char *pAlign = (char *)(((uintptr_t)p + 63) & ~(uintptr_t)63);
    n -= pAlign - (char *)p;
    n &= ~(size_t)63;       // whole cache lines only
    if constexpr( !MemSet )
        for( char *end = pAlign + n; pAlign != end; pAlign += 64 )
            _mm_clzero( pAlign );
    else
        memset( pAlign, 0, n ); // same range as the CLZERO loop
    return n;
}
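
The alignment arithmetic in clZeroRange can be checked in isolation on plain integers. The following is only a sketch: AlignedRange and alignTo64 are hypothetical names that are not part of the benchmark, and the guard against a range shorter than the alignment gap is an addition the original function does not have.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper type: start address and size of the
// 64-byte-aligned sub-range.
struct AlignedRange
{
    std::uintptr_t start;
    std::size_t bytes;
};

// Computes the largest run of whole 64-byte lines inside
// [addr, addr + n), mirroring the arithmetic in clZeroRange.
constexpr AlignedRange alignTo64( std::uintptr_t addr, std::size_t n )
{
    constexpr std::uintptr_t LINE = 64;
    // round the start up to the next multiple of 64
    std::uintptr_t start = (addr + LINE - 1) & ~(LINE - 1);
    std::size_t skipped = (std::size_t)(start - addr);
    // added guard: without it n - skipped would wrap around
    if( skipped >= n )
        return { start, 0 };
    // round the remaining length down to a multiple of 64
    return { start, (n - skipped) & ~(std::size_t)(LINE - 1) };
}

// 0x1001 rounds up to 0x1040 (63 bytes skipped); of the remaining
// 137 bytes, two whole lines (128 bytes) fit.
static_assert( alignTo64( 0x1001, 200 ).start == 0x1040 );
static_assert( alignTo64( 0x1001, 200 ).bytes == 128 );
```

For the 64 MiB vector in the benchmark the allocator most likely returns an already aligned block, so nothing is skipped there. The rounding matters for arbitrary pointers.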

Interestingly, I get the same performance for both variants with
MSVC++ 2022. With g++ / glibc, memset() reaches only about one
third of the throughput of the clzero() loop. I think glibc's
memset() is simply not as well optimized. The memset() of Visual
C++ uses non-temporal SSE stores, which explains its good
performance.

Would someone here be so kind as to post their results?

