Rocksolid Light

Welcome to Rocksolid Light



devel / comp.arch / AMD Cache speed funny

Subject -- Author
* AMD Cache speed funny -- Vir Campestris
+- Re: AMD Cache speed funny -- Anton Ertl
+- Re: AMD Cache speed funny -- Michael S
+* Re: AMD Cache speed funny -- MitchAlsup1
|`- Re: AMD Cache speed funny -- Michael S
`* Re: AMD Cache speed funny -- Terje Mathisen
 +- Re: AMD Cache speed funny -- Anton Ertl
 `* Re: AMD Cache speed funny -- Michael S
  +- Re: AMD Cache speed funny -- Scott Lurndal
  +* Rowhammer and CLFLUSH (was: AMD Cache speed funny) -- Anton Ertl
  |+* Re: Rowhammer and CLFLUSH -- MitchAlsup
  ||`* Re: Rowhammer and CLFLUSH -- EricP
  || `* Re: Rowhammer and CLFLUSH -- MitchAlsup
  ||  +* Re: Rowhammer and CLFLUSH -- EricP
  ||  |`- Re: Rowhammer and CLFLUSH -- EricP
  ||  `- Re: Rowhammer and CLFLUSH -- EricP
  |+* Re: Rowhammer and CLFLUSH (was: AMD Cache speed funny) -- Michael S
  ||+* Re: Rowhammer and CLFLUSH -- MitchAlsup
  |||`* Re: Rowhammer and CLFLUSH -- EricP
  ||| +* Re: Rowhammer and CLFLUSH -- Thomas Koenig
  ||| |`- Re: Rowhammer and CLFLUSH -- MitchAlsup
  ||| `* Re: Rowhammer and CLFLUSH -- Anton Ertl
  |||  +* Re: Rowhammer and CLFLUSH -- MitchAlsup
  |||  |`* Re: Rowhammer and CLFLUSH -- Anton Ertl
  |||  | `* Re: Rowhammer and CLFLUSH -- MitchAlsup
  |||  |  `* Re: Rowhammer and CLFLUSH -- Anton Ertl
  |||  |   `- Re: Rowhammer and CLFLUSH -- MitchAlsup1
  |||  +* Re: Rowhammer and CLFLUSH -- EricP
  |||  |`* Re: Rowhammer and CLFLUSH -- Anton Ertl
  |||  | `- Re: Rowhammer and CLFLUSH -- MitchAlsup1
  |||  `* Re: Rowhammer and CLFLUSH -- MitchAlsup1
  |||   `* Re: Rowhammer and CLFLUSH -- Anton Ertl
  |||    `* Re: Rowhammer and CLFLUSH -- EricP
  |||     `* Re: Rowhammer and CLFLUSH -- MitchAlsup1
  |||      +- Re: Rowhammer and CLFLUSH -- Anton Ertl
  |||      `* Re: Rowhammer and CLFLUSH -- EricP
  |||       `* Re: Rowhammer and CLFLUSH -- MitchAlsup1
  |||        +- Re: Rowhammer and CLFLUSH -- Michael S
  |||        `* Re: Rowhammer and CLFLUSH -- Michael S
  |||         +* Re: Rowhammer and CLFLUSH -- Scott Lurndal
  |||         |`- Re: Rowhammer and CLFLUSH -- Michael S
  |||         `* Re: Rowhammer and CLFLUSH -- MitchAlsup1
  |||          `* Re: Rowhammer and CLFLUSH -- Michael S
  |||           `* Re: Rowhammer and CLFLUSH -- MitchAlsup1
  |||            `* Re: Rowhammer and CLFLUSH -- Michael S
  |||             +* Re: Rowhammer and CLFLUSH -- EricP
  |||             |`* Re: Rowhammer and CLFLUSH -- Michael S
  |||             | `* Re: Rowhammer and CLFLUSH -- EricP
  |||             |  `* Re: Rowhammer and CLFLUSH -- Michael S
  |||             |   `- Re: Rowhammer and CLFLUSH -- EricP
  |||             `* Re: Rowhammer and CLFLUSH -- EricP
  |||              `- Re: Rowhammer and CLFLUSH -- Michael S
  ||+- Re: Rowhammer and CLFLUSH -- EricP
  ||`* Re: Rowhammer and CLFLUSH (was: AMD Cache speed funny) -- Anton Ertl
  || `- Re: Rowhammer and CLFLUSH -- MitchAlsup
  |`* Re: Rowhammer and CLFLUSH -- EricP
  | +- Re: Rowhammer and CLFLUSH -- Michael S
  | `* Re: Rowhammer and CLFLUSH -- Chris M. Thomasson
  |  `- Re: Rowhammer and CLFLUSH -- Chris M. Thomasson
  `* Re: AMD Cache speed funny -- aph
   `* Re: AMD Cache speed funny -- Michael S
    `- Re: AMD Cache speed funny -- aph

AMD Cache speed funny

<upb8i3$12emv$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37152&group=comp.arch#37152

From: vir.campestris@invalid.invalid (Vir Campestris)
Newsgroups: comp.arch
Subject: AMD Cache speed funny
Date: Tue, 30 Jan 2024 16:36:17 +0000
Organization: A noiseless patient Spider
Message-ID: <upb8i3$12emv$1@dont-email.me>
User-Agent: Mozilla Thunderbird
 by: Vir Campestris - Tue, 30 Jan 2024 16:36 UTC

I've knocked up a little utility program to try to work out some
performance figures for my CPU.

It's an AMD Ryzen™ 5 3400G. It says on the spec it has:
4MB L3 cache
2MB L2 cache
384kb L1 cache

What I do is to xor a location in memory in an array many times.
The size of the area I xor over is set by a mask on the store index.
The words in the store are 64 bit.

A C++ fragment is this. I can post the whole thing if it would help.

// Calculate a bit mask for the entire store
Word mask = storeWordCount - 1;

Stopwatch s;
s.start();
while (1)       // until break when mask runs out
{
        for (size_t index = 0; index < storeWordCount; ++index)
        {
                // read and write a word in store.
                Raw[index & mask] ^= index;
        }
        s.lap(mask);            // records the current time

        if (mask == 0) break;   // Stop if we've run out of mask

        mask >>= 1;             // shrink the mask
}

As you can see it starts with a large mask (in fact for a whole GB) and
halves it as it goes around.

All looks fine at first. I get about 8GB per second with a large mask,
at 4MB it goes up to 15GB/s, at 8MB up to 23. It holds that as the mask
gets smaller. No apparent effect when it gets under the L1 cache size.

But...
When the mask is very small (3) it slows to 18GB/s. With 1 it halves
again, and with zero (so it only operates on the same word over and
over) it's half again. A fifth of the size with a large block.

Something odd is happening here when I hammer the same location (32
bytes and on down) so that it's slower. Yet this ought to be in the L1
data cache.

A late thought was to replace that ^= index with something that reads
the memory only, or that writes it only, instead of doing a
read-modify-write cycle. That gives me much faster performance with
writes than reads. And neither read only, nor write only, show this odd
slow down with small masks.

What am I missing?

Thanks
Andy

Re: AMD Cache speed funny

<2024Jan30.182059@mips.complang.tuwien.ac.at>

https://news.novabbs.org/devel/article-flat.php?id=37153&group=comp.arch#37153

From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: AMD Cache speed funny
Date: Tue, 30 Jan 2024 17:20:59 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2024Jan30.182059@mips.complang.tuwien.ac.at>
References: <upb8i3$12emv$1@dont-email.me>
X-newsreader: xrn 10.11
 by: Anton Ertl - Tue, 30 Jan 2024 17:20 UTC

Vir Campestris <vir.campestris@invalid.invalid> writes:
> for (size_t index = 0; index < storeWordCount; ++index)
> {
> // read and write a word in store.
> Raw[index & mask] ^= index;
> }
[...]
>When the mask is very small (3) it slows to 18GB/s. With 1 it halves
>again, and with zero (so it only operates on the same word over and
>over) it's half again. A fifth of the size with a large block.
>
>Something odd is happening here when I hammer the same location (32
>bytes and on down) so that it's slower. Yet this ought to be in the L1
>data cache.
>
>A late thought was to replace that ^= index with something that reads
>the memory only, or that writes it only, instead of doing a
>read-modify-write cycle. That gives me much faster performance with
>writes than reads. And neither read only, nor write only, show this odd
>slow down with small masks.
>
>What am I missing?

When you do

raw[0] ^= index;

in every step you read the result of the previous iteration, xor it,
and store it again. This means that you have one chain of RMW data
dependences, with one RMW per iteration. On the Zen2 (which your
3400G has), this requires 8 cycles (see column H of
<http://www.complang.tuwien.ac.at/anton/memdep/>). With mask=1, you
get 2 chains, each with one 8-cycle RMW every second iteration, so you
need 4 cycles per iteration (see my column C). With mask=3, you get 4
chains and 2 cycles per iteration. Looking at my results, I would
expect another doubling with mask=7, but maybe your loop is running
into resource limits at that point (mine does 4 RMWs per iteration).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: AMD Cache speed funny

<20240130193815.00003f26@yahoo.com>

https://news.novabbs.org/devel/article-flat.php?id=37154&group=comp.arch#37154

From: already5chosen@yahoo.com (Michael S)
Newsgroups: comp.arch
Subject: Re: AMD Cache speed funny
Date: Tue, 30 Jan 2024 19:38:15 +0200
Organization: A noiseless patient Spider
Message-ID: <20240130193815.00003f26@yahoo.com>
References: <upb8i3$12emv$1@dont-email.me>
X-Newsreader: Claws Mail 3.19.1 (GTK+ 2.24.33; x86_64-w64-mingw32)
 by: Michael S - Tue, 30 Jan 2024 17:38 UTC

On Tue, 30 Jan 2024 16:36:17 +0000
Vir Campestris <vir.campestris@invalid.invalid> wrote:

> I've knocked up a little utility program to try to work out some
> performance figures for my CPU.
>
> It's an AMD Ryzen™ 5 3400G. It says on the spec it has:
> 4MB L3 cache
> 2MB L2 cache
> 384kb L1 cache
>

That's for the whole chip, and it includes the L1I caches.
For an individual core, excluding L1I, the numbers are:
4MB L3 cache
512 KB L2 cache
32 KB L1D cache

> What I do is to xor a location in memory in an array many times.
> The size of the area I xor over is set by a mask on the store index.
> The words in the store are 64 bit.
>
> A C++ fragment is this. I can post the whole thing if it would help.
>
> // Calculate a bit mask for the entire store
> Word mask = storeWordCount - 1;
>
> Stopwatch s;
> s.start();
> while (1) // until break when mask runs out
> {
> for (size_t index = 0; index < storeWordCount; ++index)
> {
> // read and write a word in store.
> Raw[index & mask] ^= index;
> }
> s.lap(mask); // records the current time
>
> if (mask == 0) break; // Stop if we've run out of mask
>
> mask >>= 1; // shrink the mask
> }
>
> As you can see it starts with a large mask (in fact for a whole GB)
> and halves it as it goes around.
>
> All looks fine at first. I get about 8GB per second with a large
> mask, at 4MB it goes up to 15GB/s, at 8MB up to 23. It holds that as
> the mask gets smaller. No apparent effect when it gets under the L1
> cache size.
>
> But...
> When the mask is very small (3) it slows to 18GB/s. With 1 it halves
> again, and with zero (so it only operates on the same word over and
> over) it's half again. A fifth of the size with a large block.
>
> Something odd is happening here when I hammer the same location (32
> bytes and on down) so that it's slower. Yet this ought to be in the
> L1 data cache.
>
> A late thought was to replace that ^= index with something that reads
> the memory only, or that writes it only, instead of doing a
> read-modify-write cycle. That gives me much faster performance with
> writes than reads. And neither read only, nor write only, show this
> odd slow down with small masks.
>
> What am I missing?
>
> Thanks
> Andy

First, I'd look at the generated asm.
If the compiler was doing a good job, then at mask <= 4095 (32 KB) you
should see slightly less than 1 iteration of the loop per cycle, i.e.,
assuming a 4.2 GHz clock, approximately 30 GB/s.
Since you see less, it's a sign that the compiler did a less than
perfect job. Try to help it with manual loop unrolling.

As to the problem with lower performance at very small masks, it's
expected. The CPU tries to execute loads speculatively out of order
under the assumption that they don't alias with preceding stores. So
actual loads run a few loop iterations ahead of the stores. We can't
say for sure how many iterations ahead, but 7 to 10 iterations sounds
like a good guess. When your mask=7 (64 bytes) then aliasing starts to
happen. On old primitive CPUs, like the Pentium 4, it causes a massive
slowdown, because those early loads have to be replayed after a rather
significant delay of about 20 cycles (the length of the pipeline). Your
Zen1+ CPU is much smarter; it detects that things are no good and stops
the wild speculation, so you don't see a huge slowdown. But without
speculation, every load starts only after all stores that preceded it
in program order were either committed into the L1D cache or had their
addresses checked against the speculative load address with no aliasing
found. Since you see only a mild slowdown, it seems that the latter is
done rather effectively and your CPU is still able to run loads
speculatively, but now only 2 or 3 steps ahead, which is not enough to
get the same performance as before.

Re: AMD Cache speed funny

<daff11f49fd13567c56c0a872fae7735@www.novabbs.org>

https://news.novabbs.org/devel/article-flat.php?id=37155&group=comp.arch#37155

From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: AMD Cache speed funny
Date: Tue, 30 Jan 2024 20:11:42 +0000
Organization: Rocksolid Light
Message-ID: <daff11f49fd13567c56c0a872fae7735@www.novabbs.org>
References: <upb8i3$12emv$1@dont-email.me>
User-Agent: Rocksolid Light
 by: MitchAlsup1 - Tue, 30 Jan 2024 20:11 UTC

Vir Campestris wrote:

> As you can see it starts with a large mask (in fact for a whole GB) and
> halves it as it goes around.

> All looks fine at first. I get about 8GB per second with a large mask,
> at 4MB it goes up to 15GB/s, at 8MB up to 23. It holds that as the mask
> gets smaller. No apparent effect when it gets under the L1 cache size.

The execution window is apparently able to absorb the latency of L3 miss,
and stream L3->L1 accesses.

Anton answered the question regarding small masks.

Re: AMD Cache speed funny

<20240130223705.00001d96@yahoo.com>

https://news.novabbs.org/devel/article-flat.php?id=37156&group=comp.arch#37156

From: already5chosen@yahoo.com (Michael S)
Newsgroups: comp.arch
Subject: Re: AMD Cache speed funny
Date: Tue, 30 Jan 2024 22:37:05 +0200
Organization: A noiseless patient Spider
Message-ID: <20240130223705.00001d96@yahoo.com>
References: <upb8i3$12emv$1@dont-email.me>
 <daff11f49fd13567c56c0a872fae7735@www.novabbs.org>
X-Newsreader: Claws Mail 4.1.1 (GTK 3.24.34; x86_64-w64-mingw32)
 by: Michael S - Tue, 30 Jan 2024 20:37 UTC

On Tue, 30 Jan 2024 20:11:42 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:

> Vir Campestris wrote:
>
> > As you can see it starts with a large mask (in fact for a whole GB)
> > and halves it as it goes around.
>
> > All looks fine at first. I get about 8GB per second with a large
> > mask, at 4MB it goes up to 15GB/s, at 8MB up to 23. It holds that
> > as the mask gets smaller. No apparent effect when it gets under the
> > L1 cache size.
>
> The execution window is apparently able to absorb the latency of L3
> miss, and stream L3->L1 accesses.
>

That sounds unlikely. L3 latency is too big to be covered by the
execution window. Much more likely they have adequate HW prefetch from
L3 to L2, and maybe (less likely) even to L1D.

> Anton answered the question regarding small masks.

Re: AMD Cache speed funny

<upcr4t$1drbk$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37157&group=comp.arch#37157

From: terje.mathisen@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: AMD Cache speed funny
Date: Wed, 31 Jan 2024 07:59:41 +0100
Organization: A noiseless patient Spider
Message-ID: <upcr4t$1drbk$1@dont-email.me>
References: <upb8i3$12emv$1@dont-email.me>
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0 SeaMonkey/2.53.18.1
 by: Terje Mathisen - Wed, 31 Jan 2024 06:59 UTC

Vir Campestris wrote:
> I've knocked up a little utility program to try to work out some
> performance figures for my CPU.
>
> It's an AMD Ryzen™ 5 3400G. It says on the spec it has:
> 4MB L3 cache
> 2MB L2 cache
> 384kb L1 cache
>
> What I do is to xor a location in memory in an array many times.
> The size of the area I xor over is set by a mask on the store index.
> The words in the store are 64 bit.
>
> A C++ fragment is this. I can post the whole thing if it would help.
>
> // Calculate a bit mask for the entire store
> Word mask = storeWordCount - 1;
>
> Stopwatch s;
> s.start();
> while (1)       // until break when mask runs out
> {
>         for (size_t index = 0; index < storeWordCount; ++index)
>         {
>                 // read and write a word in store.
>                 Raw[index & mask] ^= index;
>         }
>         s.lap(mask);            // records the current time
>
>         if (mask == 0) break;   // Stop if we've run out of mask
>
>         mask >>= 1;             // shrink the mask
> }
>
> As you can see it starts with a large mask (in fact for a whole GB) and
> halves it as it goes around.
>
> All looks fine at first. I get about 8GB per second with a large mask,
> at 4MB it goes up to 15GB/s, at 8MB up to 23. It holds that as the mask
> gets smaller. No apparent effect when it gets under the L1 cache size.
>
> But...
> When the mask is very small (3) it slows to 18GB/s. With 1 it halves
> again, and with zero (so it only operates on the same word over and
> over) it's half again. A fifth of the size with a large block.
>
> Something odd is happening here when I hammer the same location (32
> bytes and on down) so that it's slower. Yet this ought to be in the L1
> data cache.
>
> A late thought was to replace that ^= index with something that reads
> the memory only, or that writes it only, instead of doing a
> read-modify-write cycle. That gives me much faster performance with
> writes than reads. And neither read only, nor write only, show this odd
> slow down with small masks.

Mitch, Anton and Michael have already answered, I just want to add that
we have one additional potential factor:

Rowhammer protection:

It is possible that the pattern of re-XORing the same or a small number
of locations over and over could trigger a pattern detector which was
designed to mitigate against Rowhammer.

OTOH, this would much more easily be handled with memory range based
coalescing of write operations in the last level cache, right?

I.e. for normal (write combining) memory, it would (afaik) be legal to
delay the actual writes to RAM for a significant time, long enough to
merge multiple memory writes.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: AMD Cache speed funny

<2024Jan31.091713@mips.complang.tuwien.ac.at>

https://news.novabbs.org/devel/article-flat.php?id=37158&group=comp.arch#37158

From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: AMD Cache speed funny
Date: Wed, 31 Jan 2024 08:17:13 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2024Jan31.091713@mips.complang.tuwien.ac.at>
References: <upb8i3$12emv$1@dont-email.me> <upcr4t$1drbk$1@dont-email.me>
X-newsreader: xrn 10.11
 by: Anton Ertl - Wed, 31 Jan 2024 08:17 UTC

Terje Mathisen <terje.mathisen@tmsw.no> writes:
>Rowhammer protection:
>
>It is possible that the pattern of re-XORing the same or a small number
>of locations over and over could trigger a pattern detector which was
>designed to mitigate against Rowhammer.

I don't think that memory controller designers have actually
implemented Rowhammer protection: I would expect that the processor
manufacturers would have bragged about that if they had. They have
not. And even RAM manufacturers have stopped mentioning anything
about Rowhammer in their specs. It seems that all hardware
manufacturers have decided that Rowhammer is something that will just
disappear from public knowledge (and therefore from what they have to
deal with) if they just ignore it long enough. It appears that they
are right.

They seem to take the same approach wrt Spectre-family attacks. In
that case, however, new variants appear all the time, so maybe the
approach won't work here.

However, in the present case "the same small number of locations" is
not hammered, because a small number of memory locations fits into the
cache in the adjacent access pattern that this test uses, and all
writes will just be to the cache.

>OTOH, this would much more easily be handled with memory range based
>coalescing of write operations in the last level cache, right?

We have had write-back caches (at the L2 or L1 level, and certainly at
the LLC level) since the later 486 years.

>I.e. for normal (write combining) memory

Normal memory is write-back. AFAIK write combining is for stuff like
graphics card memory.

>it would (afaik) be legal to
>delay the actual writes to RAM for a significant time, long enough to
>merge multiple memory writes.

And this is what actually happens, through the magic of write-back
caches.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: AMD Cache speed funny

<20240131131353.0000688c@yahoo.com>

https://news.novabbs.org/devel/article-flat.php?id=37159&group=comp.arch#37159

From: already5chosen@yahoo.com (Michael S)
Newsgroups: comp.arch
Subject: Re: AMD Cache speed funny
Date: Wed, 31 Jan 2024 13:13:53 +0200
Organization: A noiseless patient Spider
Message-ID: <20240131131353.0000688c@yahoo.com>
References: <upb8i3$12emv$1@dont-email.me>
 <upcr4t$1drbk$1@dont-email.me>
X-Newsreader: Claws Mail 3.19.1 (GTK+ 2.24.33; x86_64-w64-mingw32)
 by: Michael S - Wed, 31 Jan 2024 11:13 UTC

On Wed, 31 Jan 2024 07:59:41 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:

> Vir Campestris wrote:
> > I've knocked up a little utility program to try to work out some
> > performance figures for my CPU.
> >
> > It's an AMD Ryzen™ 5 3400G. It says on the spec it has:
> > 4MB L3 cache
> > 2MB L2 cache
> > 384kb L1 cache
> >
> > What I do is to xor a location in memory in an array many times.
> > The size of the area I xor over is set by a mask on the store index.
> > The words in the store are 64 bit.
> >
> > A C++ fragment is this. I can post the whole thing if it would help.
> >
> > // Calculate a bit mask for the entire store
> > Word mask = storeWordCount - 1;
> >
> > Stopwatch s;
> > s.start();
> > while (1)       // until break when mask runs out
> > {
> >         for (size_t index = 0; index < storeWordCount; ++index)
> >         {
> >                 // read and write a word in store.
> >                 Raw[index & mask] ^= index;
> >         }
> >         s.lap(mask);            // records the current time
> >
> >         if (mask == 0) break;   // Stop if we've run out of mask
> >
> >         mask >>= 1;             // shrink the mask
> > }
> >
> > As you can see it starts with a large mask (in fact for a whole GB)
> > and halves it as it goes around.
> >
> > All looks fine at first. I get about 8GB per second with a large
> > mask, at 4MB it goes up to 15GB/s, at 8MB up to 23. It holds that
> > as the mask gets smaller. No apparent effect when it gets under the
> > L1 cache size.
> >
> > But...
> > When the mask is very small (3) it slows to 18GB/s. With 1 it
> > halves again, and with zero (so it only operates on the same word
> > over and over) it's half again. A fifth of the size with a large
> > block.
> >
> > Something odd is happening here when I hammer the same location (32
> > bytes and on down) so that it's slower. Yet this ought to be in the
> > L1 data cache.
> >
> > A late thought was to replace that ^= index with something that
> > reads the memory only, or that writes it only, instead of doing a
> > read-modify-write cycle. That gives me much faster performance with
> > writes than reads. And neither read only, nor write only, show this
> > odd slow down with small masks.
>
> Mitch, Anton and Michael have already answered, I just want to add
> that we have one additional potential factor:
>
> Rowhammer protection:
>
> It is possible that the pattern of re-XORing the same or a small
> number of locations over and over could trigger a pattern detector
> which was designed to mitigate against Rowhammer.
>
> OTOH, this would much more easily be handled with memory range based
> coalescing of write operations in the last level cache, right?
>
> I.e. for normal (write combining) memory, it would (afaik) be legal
> to delay the actual writes to RAM for a significant time, long enough
> to merge multiple memory writes.
>
> Terje
>
>

I have very little to add to the very good response by Anton.
That little addition is: most if not all Rowhammer POC examples rely
on CLFLUSH. This is what the manual says about it:
"Executions of the CLFLUSH instruction are ordered with respect to each
other and with respect to writes, locked read-modify-write
instructions, fence instructions, and executions of CLFLUSHOPT to the
same cache line. They are not ordered with respect to executions of
CLFLUSHOPT to different cache lines."

By now, it seems obvious that making the CLFLUSH instruction
non-privileged and pretty much unrestricted by memory range/page
attributes was a mistake, but that mistake can't be fixed without
breaking things. Considering that CLFLUSH has existed since the very
early 2000s, it is understandable.
IIRC, ARMv8 made the same mistake a decade later. It is less
understandable.

Re: AMD Cache speed funny

<mktuN.273502$Wp_8.214627@fx17.iad>

https://news.novabbs.org/devel/article-flat.php?id=37160&group=comp.arch#37160

From: scott@slp53.sl.home (Scott Lurndal)
Newsgroups: comp.arch
Subject: Re: AMD Cache speed funny
Date: Wed, 31 Jan 2024 15:04:50 GMT
Organization: UsenetServer - www.usenetserver.com
Message-ID: <mktuN.273502$Wp_8.214627@fx17.iad>
References: <upb8i3$12emv$1@dont-email.me> <upcr4t$1drbk$1@dont-email.me> <20240131131353.0000688c@yahoo.com>
X-newsreader: xrn 9.03-beta-14-64bit
 by: Scott Lurndal - Wed, 31 Jan 2024 15:04 UTC

Michael S <already5chosen@yahoo.com> writes:
>On Wed, 31 Jan 2024 07:59:41 +0100
>Terje Mathisen <terje.mathisen@tmsw.no> wrote:
>
>
>By now, it seems obvious that making the CLFLUSH instruction
>non-privileged and pretty much unrestricted by memory range/page
>attributes was a mistake, but that mistake can't be fixed without
>breaking things. Considering that CLFLUSH has existed since the very
>early 2000s, it is understandable.
>IIRC, ARMv8 made the same mistake a decade later. It is less
>understandable.

ARMv8 has a control bit that can be set to allow EL0 access
to the DC system instructions. By default it is a privileged
instruction. It is up to the operating software to enable
it for user-mode code.

Rowhammer and CLFLUSH (was: AMD Cache speed funny)

<2024Jan31.181721@mips.complang.tuwien.ac.at>

https://news.novabbs.org/devel/article-flat.php?id=37161&group=comp.arch#37161

From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Rowhammer and CLFLUSH (was: AMD Cache speed funny)
Date: Wed, 31 Jan 2024 17:17:21 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2024Jan31.181721@mips.complang.tuwien.ac.at>
References: <upb8i3$12emv$1@dont-email.me> <upcr4t$1drbk$1@dont-email.me> <20240131131353.0000688c@yahoo.com>
X-newsreader: xrn 10.11
 by: Anton Ertl - Wed, 31 Jan 2024 17:17 UTC

Michael S <already5chosen@yahoo.com> writes:
>I have very little to add to the very good response by Anton.
>That little addition is: most if not all Rowhammer POC examples rely
>on CLFLUSH. This is what the manual says about it:
>"Executions of the CLFLUSH instruction are ordered with respect to each
>other and with respect to writes, locked read-modify-write
>instructions, fence instructions, and executions of CLFLUSHOPT to the
>same cache line. They are not ordered with respect to executions of
>CLFLUSHOPT to different cache lines."
>
>By now, it seems obvious that making CLFLUSH instruction non-privilaged
>and pretty much non-restricted by memory range/page attributes was a
>mistake, but that mistake can't be fixed without breaking things.
>Considering that CLFLUSH exists since very early 2000s, it is
>understandable.
>IIRC, ARMv8 did the same mistake a decade later. It is less
>understandable.

Ideally caches are fully transparent microarchitecture, then you don't
need stuff like CLFLUSH. My guess is that CLFLUSH is there for
getting DRAM up-to-date for DMA from I/O devices. An alternative
would be to let the memory controller remember which lines are
modified, and if the I/O device asks for that line, get the up-to-date
data from the cache line using the cache-consistency protocol. This
would turn CLFLUSH into a noop (at least as far as writing to DRAM is
concerned, the ordering constraints may still be relevant), so there
is a way to fix this mistake (if it is one).

However, AFAIK this is insufficient for fixing Rowhammer. Caches have
relatively limited associativity, up to something like 16-way
set-associativity, so if you write to the same set 17 times, you are
guaranteed to miss the cache. With 3 levels of cache you may need 49
accesses (probably less), but I expect that the resulting DRAM
accesses to a cache line are still not rare enough that Rowhammer
cannot happen.

The first paper on Rowhammer already outlined how the memory
controller could count how often adjacent DRAM rows are accessed and
thus weaken the row under consideration. This approach needs a little
adjustment for Double Rowhammer and not immediately neighbouring rows,
but otherwise seems to me to be the way to go. With autorefresh in
the DRAM devices these days, the DRAM manufacturers could implement
this on their own, without needing to coordinate with memory
controller designers. But apparently they think that the customers
don't care, so they can save the expense.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Rowhammer and CLFLUSH

https://news.novabbs.org/devel/article-flat.php?id=37162&group=comp.arch#37162

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Rowhammer and CLFLUSH
Date: Wed, 31 Jan 2024 20:12:15 +0000
Organization: novaBBS
Message-ID: <6e4daf9e72082682c2fe92fa3054ccad@www.novabbs.com>
References: <upb8i3$12emv$1@dont-email.me> <upcr4t$1drbk$1@dont-email.me> <20240131131353.0000688c@yahoo.com> <2024Jan31.181721@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="1235007"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Spam-Checker-Version: SpamAssassin 4.0.0
X-Rslight-Site: $2y$10$GYsHGZz2ZsWYaa1UOYjhse.WacVHI5KmfRn/m3sDyH5IM/RlrUmVm
X-Rslight-Posting-User: ebd9cd10a9ebda631fbccab5347a0f771d5a2118
 by: MitchAlsup - Wed, 31 Jan 2024 20:12 UTC

Anton Ertl wrote:

> Michael S <already5chosen@yahoo.com> writes:
>>I have very little to add to very good response by Anton.
>>That little addition is: the most if not all Rowhammer POC examples rely
>>on CLFLUSH. That's what the manual says about it:
>>"Executions of the CLFLUSH instruction are ordered with respect to each
>>other and with respect to writes, locked read-modify-write
>>instructions, fence instructions, and executions of CLFLUSHOPT to the
>>same cache line.1 They are not ordered with respect to executions of
>>CLFLUSHOPT to different cache lines."
>>
>>By now, it seems obvious that making CLFLUSH instruction non-privilaged
>>and pretty much non-restricted by memory range/page attributes was a
>>mistake, but that mistake can't be fixed without breaking things.
>>Considering that CLFLUSH exists since very early 2000s, it is
>>understandable.
>>IIRC, ARMv8 did the same mistake a decade later. It is less
>>understandable.

> Ideally caches are fully transparent microarchitecture, then you don't
> need stuff like CLFLUSH. My guess is that CLFLUSH is there for
> getting DRAM up-to-date for DMA from I/O devices.

I have wondered for a while why device access is not to coherent
space. If it were, then no CLFLUSH functionality would be needed; I/O
could just read/write an address and always get the freshest copy.
{{Maybe not the device itself, but the PCIe Root could translate from
device access space to memory access space (coherent).}}
> An alternative
> would be to let the memory controller remember which lines are
> modified, and if the I/O device asks for that line, get the up-to-date
> data from the cache line using the cache-consistency protocol. This
> would turn CLFLUSH into a noop (at least as far as writing to DRAM is
> concerned, the ordering constraints may still be relevant), so there
> is a way to fix this mistake (if it is one).

> However, AFAIK this is insufficient for fixing Rowhammer.

If L3 (LLC) is not a processor cache but a great big read/write buffer
for DRAM, then Rowhammering is significantly harder to accomplish.
> Caches have
> relatively limited associativity, up to something like 16-way
> set-associativity, so if you write to the same set 17 times, you are
> guaranteed to miss the cache. With 3 levels of cache you may need 49
> accesses (probably less), but I expect that the resulting DRAM
> accesses to a cache line are still not rare enough that Rowhammer
> cannot happen.

Rowhammer happens when you beat on the same cache line multiple times,
causing a charge-sharing problem on the word lines. Every time you cause
the DRAM to precharge (deactivate), you lose the count of how many times
you have to bang on the same word line to disrupt the stored cells.

So, the trick is to detect the Rowhammering and insert refresh commands.

> The first paper on Rowhammer already outlined how the memory
> controller could count how often adjacent DRAM rows are accessed and
> thus weaken the row under consideration. This approach needs a little
> adjustment for Double Rowhammer and not immediately neighbouring rows,
> but otherwise seems to me to be the way to go. With autorefresh in
> the DRAM devices these days, the DRAM manufacturers could implement
> this on their own, without needing to coordinate with memory
> controller designers. But apparently they think that the customers
> don't care, so they can save the expense.

> - anton

Re: Rowhammer and CLFLUSH (was: AMD Cache speed funny)

https://news.novabbs.org/devel/article-flat.php?id=37163&group=comp.arch#37163

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: already5chosen@yahoo.com (Michael S)
Newsgroups: comp.arch
Subject: Re: Rowhammer and CLFLUSH (was: AMD Cache speed funny)
Date: Wed, 31 Jan 2024 22:49:15 +0200
Organization: A noiseless patient Spider
Lines: 93
Message-ID: <20240131224915.000063d9@yahoo.com>
References: <upb8i3$12emv$1@dont-email.me>
<upcr4t$1drbk$1@dont-email.me>
<20240131131353.0000688c@yahoo.com>
<2024Jan31.181721@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Info: dont-email.me; posting-host="458330ddd435bd841a14cd8bf85c2e26";
logging-data="1769985"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/DN1UN4C4Hv0yxDiNHMbuLig31INxJkao="
Cancel-Lock: sha1:LJzvPi5d4c+4I8vWhJV978XWVRw=
X-Newsreader: Claws Mail 4.1.1 (GTK 3.24.34; x86_64-w64-mingw32)
 by: Michael S - Wed, 31 Jan 2024 20:49 UTC

On Wed, 31 Jan 2024 17:17:21 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> Michael S <already5chosen@yahoo.com> writes:
> >I have very little to add to very good response by Anton.
> >That little addition is: the most if not all Rowhammer POC examples
> >rely on CLFLUSH. That's what the manual says about it:
> >"Executions of the CLFLUSH instruction are ordered with respect to
> >each other and with respect to writes, locked read-modify-write
> >instructions, fence instructions, and executions of CLFLUSHOPT to the
> >same cache line.1 They are not ordered with respect to executions of
> >CLFLUSHOPT to different cache lines."
> >
> >By now, it seems obvious that making CLFLUSH instruction
> >non-privilaged and pretty much non-restricted by memory range/page
> >attributes was a mistake, but that mistake can't be fixed without
> >breaking things. Considering that CLFLUSH exists since very early
> >2000s, it is understandable.
> >IIRC, ARMv8 did the same mistake a decade later. It is less
> >understandable.
>
> Ideally caches are fully transparent microarchitecture, then you don't
> need stuff like CLFLUSH. My guess is that CLFLUSH is there for
> getting DRAM up-to-date for DMA from I/O devices. An alternative
> would be to let the memory controller remember which lines are
> modified, and if the I/O device asks for that line, get the up-to-date
> data from the cache line using the cache-consistency protocol.

Considering that CLFLUSH was introduced by Intel in year 2000 or 2001
and that at that time all Intel's PCI/AGP root hubs were already fully
I/O-coherent for several years, I find your theory unlikely.

Myself, I don't know the original reason, but I do know a use case
where CLFLUSH, while not strictly necessary, simplifies things greatly:
entering a deep sleep state in which the CPU caches are powered down and
DRAM is put in self-refresh mode.

Of course, this particular use case does not require a *non-privileged*
CLFLUSH, so obviously Intel had a different reason.

> This
> would turn CLFLUSH into a noop (at least as far as writing to DRAM is
> concerned, the ordering constraints may still be relevant), so there
> is a way to fix this mistake (if it is one).
>
> However, AFAIK this is insufficient for fixing Rowhammer. Caches have
> relatively limited associativity, up to something like 16-way
> set-associativity, so if you write to the same set 17 times, you are
> guaranteed to miss the cache. With 3 levels of cache you may need 49
> accesses (probably less), but I expect that the resulting DRAM
> accesses to a cache line are still not rare enough that Rowhammer
> cannot happen.
>

Original RH required a very high hammering rate that certainly can't be
achieved by playing with the associativity of the L3 cache.

Newer multi-sided hammering probably can do it in theory, but it would be
very difficult in practice.

Today we have yet another variant, called RowPress, that bypasses TRR
mitigation more reliably than multi-sided RH. I think this one would be
practically impossible without CLFLUSH, esp. when the system under attack
carries out other DRAM accesses in parallel with the attacker's code.

> The first paper on Rowhammer already outlined how the memory
> controller could count how often adjacent DRAM rows are accessed and
> thus weaken the row under consideration. This approach needs a little
> adjustment for Double Rowhammer and not immediately neighbouring rows,
> but otherwise seems to me to be the way to go.

IMHO, all these solutions are pure fantasy, because the memory
controller does not even know which rows are physically adjacent. POC
authors typically run lengthy tests in order to figure it out.

> With autorefresh in
> the DRAM devices these days, the DRAM manufacturers could implement
> this on their own, without needing to coordinate with memory
> controller designers. But apparently they think that the customers
> don't care, so they can save the expense.
>
> - anton

They cared enough to implement the simplest of proposed solutions - TRR.
Yes, it was quickly found insufficient, but at least there was a
demonstration of good intentions.

Re: Rowhammer and CLFLUSH

https://news.novabbs.org/devel/article-flat.php?id=37165&group=comp.arch#37165

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Rowhammer and CLFLUSH
Date: Wed, 31 Jan 2024 23:22:38 +0000
Organization: novaBBS
Message-ID: <aba96427866214c1b2981a75ecc45db7@www.novabbs.com>
References: <upb8i3$12emv$1@dont-email.me> <upcr4t$1drbk$1@dont-email.me> <20240131131353.0000688c@yahoo.com> <2024Jan31.181721@mips.complang.tuwien.ac.at> <20240131224915.000063d9@yahoo.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="1250463"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Rslight-Posting-User: ebd9cd10a9ebda631fbccab5347a0f771d5a2118
X-Spam-Checker-Version: SpamAssassin 4.0.0
X-Rslight-Site: $2y$10$N0JST5qB3PSRJc1C7lgxD.7py7y422f4/6FYy9M0gStJuLDxgGBnG
 by: MitchAlsup - Wed, 31 Jan 2024 23:22 UTC

Michael S wrote:

> On Wed, 31 Jan 2024 17:17:21 GMT
> anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

>> Michael S <already5chosen@yahoo.com> writes:
>> >I have very little to add to very good response by Anton.
>> >That little addition is: the most if not all Rowhammer POC examples
>> >rely on CLFLUSH. That's what the manual says about it:
>> >"Executions of the CLFLUSH instruction are ordered with respect to
>> >each other and with respect to writes, locked read-modify-write
>> >instructions, fence instructions, and executions of CLFLUSHOPT to the
>> >same cache line.1 They are not ordered with respect to executions of
>> >CLFLUSHOPT to different cache lines."
>> >
>> >By now, it seems obvious that making CLFLUSH instruction
>> >non-privilaged and pretty much non-restricted by memory range/page
>> >attributes was a mistake, but that mistake can't be fixed without
>> >breaking things. Considering that CLFLUSH exists since very early
>> >2000s, it is understandable.
>> >IIRC, ARMv8 did the same mistake a decade later. It is less
>> >understandable.
>>
>> Ideally caches are fully transparent microarchitecture, then you don't
>> need stuff like CLFLUSH. My guess is that CLFLUSH is there for
>> getting DRAM up-to-date for DMA from I/O devices. An alternative
>> would be to let the memory controller remember which lines are
>> modified, and if the I/O device asks for that line, get the up-to-date
>> data from the cache line using the cache-consistency protocol.

> Considering that CLFLUSH was introduced by Intel in year 2000 or 2001
> and that at that time all Intel's PCI/AGP root hubs were already fully
> I/O-coherent for several years, I find your theory unlikely.

> Myself, I don't know the original reason, but I do know a use case
> where CLFLUSH, while not strictly necessary, simplifies things greatly
> - entering deep sleep state in which CPU caches are powered down and
> DRAM put in self-refresh mode.

> Of course, this particular use case does not require *non-priviledged*
> CLFLUSH, so obviously Intel had different reason.

There was no assumption that this could result in a side channel or
attack vector at the time of its non-privileged inclusion. Afterwards
there was no reason to make it privileged until 2017, and by then the
ability to do anything about it had vanished.

Me, personally, I see this as a violation of the principle that the
cache is there to reduce memory latency and thereby improve performance.

>> This
>> would turn CLFLUSH into a noop (at least as far as writing to DRAM is
>> concerned, the ordering constraints may still be relevant), so there
>> is a way to fix this mistake (if it is one).
>>
>> However, AFAIK this is insufficient for fixing Rowhammer. Caches have
>> relatively limited associativity, up to something like 16-way
>> set-associativity, so if you write to the same set 17 times, you are
>> guaranteed to miss the cache. With 3 levels of cache you may need 49
>> accesses (probably less), but I expect that the resulting DRAM
>> accesses to a cache line are still not rare enough that Rowhammer
>> cannot happen.
>>

> Original RH required very high hammering rate that certainly can't be
> achieved by playing with associativity of L3 cache.

> Newer multiside hammering probably can do it in theory, but it would be
> very difficult in practice.

The problem here is the fact that DRAMs do not use linear decoders, so
address X and address X+1 do not necessarily have paired word lines.
The word lines could be as far as ½ the block away from each other.

The DRAM decoders are faster and smaller when there is a Gray-like code
imposed on the logical-address to physical-word-line mapping. This also
happens in SRAM decoders. Going back and looking at the most used
logical-to-physical mappings shows that while X and X+1 can
(occasionally) be side by side, X, X+1 and X+2 should never be 3 word
lines in a row.

> Today we have yet another variant called RowPress that bypasses TRR
> mitigation more reliably than mult-rate RH. I think this one would be
> practically impossible without CLFLUSH., esp. when system under attack
> carries other DRAM accesses in parallel with attackers code.

>> The first paper on Rowhammer already outlined how the memory
>> controller could count how often adjacent DRAM rows are accessed and
>> thus weaken the row under consideration. This approach needs a little
>> adjustment for Double Rowhammer and not immediately neighbouring rows,
>> but otherwise seems to me to be the way to go.

> IMHO, all these solutions are pure fantasy, because the memory
> controller does not even know which rows are physically adjacent.

Different DIMMs and even different DRAMs on the same DIMM may not
share that correspondence. {There is a lot of bit line and a little
word line repair done at the tester.}

> POC authors
> typically run lengthy tests in order to figure it out.

>> With autorefresh in
>> the DRAM devices these days, the DRAM manufacturers could implement
>> this on their own, without needing to coordinate with memory
>> controller designers. But apparently they think that the customers
>> don't care, so they can save the expense.
>>
>> - anton

> They cared enough to implement the simplest of proposed solutions - TRR.
> Yes, it was quickly found insufficient, but at least there was a
> demonstration of good intentions.

Re: AMD Cache speed funny

https://news.novabbs.org/devel/article-flat.php?id=37173&group=comp.arch#37173

Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!3.eu.feeder.erje.net!feeder.erje.net!newsreader4.netcologne.de!news.netcologne.de!peer01.ams1!peer.ams1.xlned.com!news.xlned.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!Xl.tags.giganews.com!local-1.nntp.ord.giganews.com!nntp.supernews.com!news.supernews.com.POSTED!not-for-mail
NNTP-Posting-Date: Thu, 01 Feb 2024 09:39:13 +0000
Sender: Andrew Haley <aph@zarquon.pink>
From: aph@littlepinkcloud.invalid
Subject: Re: AMD Cache speed funny
Newsgroups: comp.arch
References: <upb8i3$12emv$1@dont-email.me> <upcr4t$1drbk$1@dont-email.me> <20240131131353.0000688c@yahoo.com>
User-Agent: tin/1.9.2-20070201 ("Dalaruan") (UNIX) (Linux/4.18.0-477.27.1.el8_8.x86_64 (x86_64))
Message-ID: <P-qdnZrfj5Fc-yb4nZ2dnZfqn_SdnZ2d@supernews.com>
Date: Thu, 01 Feb 2024 09:39:13 +0000
Lines: 23
X-Trace: sv3-IgoqdE6rCl57AHfg0ARskDoG0/W6jnvp2ki0Qe0ecgEV/U0eI9q/X2FAIESlqv3T7vc0J/KlfdTPyBC!hAu7RmusBA9sMai4FVELMfQUi2L3XCAeT2/DyY6SInvOzNw2HJ8dO0zU4EYITXqZmR1jZsF8wEJ4!op3Vj94LnGM=
X-Complaints-To: www.supernews.com/docs/abuse.html
X-DMCA-Complaints-To: www.supernews.com/docs/dmca.html
X-Abuse-and-DMCA-Info: Please be sure to forward a copy of ALL headers
X-Abuse-and-DMCA-Info: Otherwise we will be unable to process your complaint properly
X-Postfilter: 1.3.40
X-Received-Bytes: 2306
 by: aph@littlepinkcloud.invalid - Thu, 1 Feb 2024 09:39 UTC

Michael S <already5chosen@yahoo.com> wrote:
>
> By now, it seems obvious that making CLFLUSH instruction non-privilaged
> and pretty much non-restricted by memory range/page attributes was a
> mistake, but that mistake can't be fixed without breaking things.
> Considering that CLFLUSH exists since very early 2000s, it is
> understandable.
> IIRC, ARMv8 did the same mistake a decade later. It is less
> understandable.

For Arm, with its non-coherent data and instruction caches, you need
some way to flush the dcache to the point of unification in order to make
instruction changes visible. Also, regardless of icache coherence, when
using non-volatile memory you need an efficient way to flush the dcache
to the point of persistence. You need that in order to make sure that a
transaction has been written to a log.

With the latter, you could restrict dcache flushes to pages with a
particular non-volatile attribute. I don't think there's anything you
can do about the former, short of simply making i- and d-cache
coherent. Which is a good idea, but not everyone does it.

Andrew.

Re: AMD Cache speed funny

https://news.novabbs.org/devel/article-flat.php?id=37176&group=comp.arch#37176

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: already5chosen@yahoo.com (Michael S)
Newsgroups: comp.arch
Subject: Re: AMD Cache speed funny
Date: Thu, 1 Feb 2024 15:36:46 +0200
Organization: A noiseless patient Spider
Lines: 41
Message-ID: <20240201153646.00006e78@yahoo.com>
References: <upb8i3$12emv$1@dont-email.me>
<upcr4t$1drbk$1@dont-email.me>
<20240131131353.0000688c@yahoo.com>
<P-qdnZrfj5Fc-yb4nZ2dnZfqn_SdnZ2d@supernews.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Info: dont-email.me; posting-host="3c6a247926d5d4ebdb68b70146492706";
logging-data="2204384"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+wEmVk2ll5F05GuA8ci4O8Dl54ukQIGHU="
Cancel-Lock: sha1:e4xhEFFNHMReL1PFUkcGJk7IZH4=
X-Newsreader: Claws Mail 3.19.1 (GTK+ 2.24.33; x86_64-w64-mingw32)
 by: Michael S - Thu, 1 Feb 2024 13:36 UTC

On Thu, 01 Feb 2024 09:39:13 +0000
aph@littlepinkcloud.invalid wrote:

> Michael S <already5chosen@yahoo.com> wrote:
> >
> > By now, it seems obvious that making CLFLUSH instruction
> > non-privilaged and pretty much non-restricted by memory range/page
> > attributes was a mistake, but that mistake can't be fixed without
> > breaking things. Considering that CLFLUSH exists since very early
> > 2000s, it is understandable.
> > IIRC, ARMv8 did the same mistake a decade later. It is less
> > understandable.
>
> For Arm, with its non-coherent data and instruction caches, you need
> some way to flush dcache to the point of unification in order to make
> instruction changes visible. Also, regardless of icache coherence,
> when using non-volatile memory you need an efficient way to flush
> dcache to the point of peristence. You need that in order to make
> sure that a transaction has been written to a log.
>
> With the latter, you could restrict dcache flushes to pages with a
> particular non-volatile attribute. I don't think there's anything you
> can do about the former, short of simply making i- and d-cache
> coherent.

For the latter, a privileged flush instruction sounds sufficient.

For the former, ARMv8 appears to have a special instruction (or you can
call it a special variant of the DC instruction): Clean by virtual address
to point of unification (DC CVAU). This instruction alone would not
make an RH attack much easier. The problem is that the privilege control
of this instruction is governed by the same bit as that of two much more
dangerous variants of DC (DC CVAC and DC CIVAC).

> Which is a good idea, but not everyone does it.
>
> Andrew.

Neoverse N1 had it. I don't know about the rest of the Neoverse series.

Re: Rowhammer and CLFLUSH

https://news.novabbs.org/devel/article-flat.php?id=37177&group=comp.arch#37177

Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx09.iad.POSTED!not-for-mail
From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Rowhammer and CLFLUSH
References: <upb8i3$12emv$1@dont-email.me> <upcr4t$1drbk$1@dont-email.me> <20240131131353.0000688c@yahoo.com> <2024Jan31.181721@mips.complang.tuwien.ac.at>
In-Reply-To: <2024Jan31.181721@mips.complang.tuwien.ac.at>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 39
Message-ID: <VyNuN.188050$yEgf.160511@fx09.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Thu, 01 Feb 2024 14:05:41 UTC
Date: Thu, 01 Feb 2024 09:05:19 -0500
X-Received-Bytes: 2750
 by: EricP - Thu, 1 Feb 2024 14:05 UTC

Anton Ertl wrote:
> Michael S <already5chosen@yahoo.com> writes:
>> I have very little to add to very good response by Anton.
>> That little addition is: the most if not all Rowhammer POC examples rely
>> on CLFLUSH. That's what the manual says about it:
>> "Executions of the CLFLUSH instruction are ordered with respect to each
>> other and with respect to writes, locked read-modify-write
>> instructions, fence instructions, and executions of CLFLUSHOPT to the
>> same cache line.1 They are not ordered with respect to executions of
>> CLFLUSHOPT to different cache lines."
>>
>> By now, it seems obvious that making CLFLUSH instruction non-privilaged
>> and pretty much non-restricted by memory range/page attributes was a
>> mistake, but that mistake can't be fixed without breaking things.
>> Considering that CLFLUSH exists since very early 2000s, it is
>> understandable.
>> IIRC, ARMv8 did the same mistake a decade later. It is less
>> understandable.
>
> Ideally caches are fully transparent microarchitecture, then you don't
> need stuff like CLFLUSH. My guess is that CLFLUSH is there for
> getting DRAM up-to-date for DMA from I/O devices. An alternative
> would be to let the memory controller remember which lines are
> modified, and if the I/O device asks for that line, get the up-to-date
> data from the cache line using the cache-consistency protocol. This
> would turn CLFLUSH into a noop (at least as far as writing to DRAM is
> concerned, the ordering constraints may still be relevant), so there
> is a way to fix this mistake (if it is one).

The text in the Intel Vol. 1 Architecture manual indicates they viewed
all these cache control instructions (PREFETCH, CLFLUSH, and CLFLUSHOPT)
as part of SSE, for use by graphics applications that want to take
manual control of their caching and minimize cache pollution.

Note that the non-temporal move instructions MOVNTxx were also part of
that SSE bunch and could also be used to force a write to DRAM.

Re: Rowhammer and CLFLUSH

https://news.novabbs.org/devel/article-flat.php?id=37178&group=comp.arch#37178

Path: i2pn2.org!i2pn.org!news.1d4.us!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx16.iad.POSTED!not-for-mail
From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Rowhammer and CLFLUSH
References: <upb8i3$12emv$1@dont-email.me> <upcr4t$1drbk$1@dont-email.me> <20240131131353.0000688c@yahoo.com> <2024Jan31.181721@mips.complang.tuwien.ac.at> <20240131224915.000063d9@yahoo.com>
In-Reply-To: <20240131224915.000063d9@yahoo.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 47
Message-ID: <PMNuN.258835$7sbb.122888@fx16.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Thu, 01 Feb 2024 14:20:31 UTC
Date: Thu, 01 Feb 2024 09:20:24 -0500
X-Received-Bytes: 3125
 by: EricP - Thu, 1 Feb 2024 14:20 UTC

Michael S wrote:
> On Wed, 31 Jan 2024 17:17:21 GMT
> anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>
>> Michael S <already5chosen@yahoo.com> writes:
>>> I have very little to add to very good response by Anton.
>>> That little addition is: the most if not all Rowhammer POC examples
>>> rely on CLFLUSH. That's what the manual says about it:
>>> "Executions of the CLFLUSH instruction are ordered with respect to
>>> each other and with respect to writes, locked read-modify-write
>>> instructions, fence instructions, and executions of CLFLUSHOPT to the
>>> same cache line.1 They are not ordered with respect to executions of
>>> CLFLUSHOPT to different cache lines."
>>>
>>> By now, it seems obvious that making CLFLUSH instruction
>>> non-privilaged and pretty much non-restricted by memory range/page
>>> attributes was a mistake, but that mistake can't be fixed without
>>> breaking things. Considering that CLFLUSH exists since very early
>>> 2000s, it is understandable.
>>> IIRC, ARMv8 did the same mistake a decade later. It is less
>>> understandable.
>> Ideally caches are fully transparent microarchitecture, then you don't
>> need stuff like CLFLUSH. My guess is that CLFLUSH is there for
>> getting DRAM up-to-date for DMA from I/O devices. An alternative
>> would be to let the memory controller remember which lines are
>> modified, and if the I/O device asks for that line, get the up-to-date
>> data from the cache line using the cache-consistency protocol.
>
> Considering that CLFLUSH was introduced by Intel in year 2000 or 2001
> and that at that time all Intel's PCI/AGP root hubs were already fully
> I/O-coherent for several years, I find your theory unlikely.
>
> Myself, I don't know the original reason, but I do know a use case
> where CLFLUSH, while not strictly necessary, simplifies things greatly
> - entering deep sleep state in which CPU caches are powered down and
> DRAM put in self-refresh mode.

CLFLUSH wouldn't be useful for that, as it flushes by virtual address.
It also allows all sorts of reorderings that you don't want to think
about during a (possibly emergency) cache sync.

The privileged WBINVD and WBNOINVD instructions are intended for that.
It sounds like they basically halt the core for the duration of the
write-back of all modified lines.

Re: Rowhammer and CLFLUSH

https://news.novabbs.org/devel/article-flat.php?id=37179&group=comp.arch#37179

Path: i2pn2.org!i2pn.org!news.swapon.de!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: already5chosen@yahoo.com (Michael S)
Newsgroups: comp.arch
Subject: Re: Rowhammer and CLFLUSH
Date: Thu, 1 Feb 2024 16:30:27 +0200
Organization: A noiseless patient Spider
Lines: 50
Message-ID: <20240201163027.000003f4@yahoo.com>
References: <upb8i3$12emv$1@dont-email.me>
<upcr4t$1drbk$1@dont-email.me>
<20240131131353.0000688c@yahoo.com>
<2024Jan31.181721@mips.complang.tuwien.ac.at>
<VyNuN.188050$yEgf.160511@fx09.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Info: dont-email.me; posting-host="3c6a247926d5d4ebdb68b70146492706";
logging-data="2204384"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+qzUPLBfMkBrep3iE750d6Jeyj9Ef72sQ="
Cancel-Lock: sha1:Bp2jnCxBIYojHXbmb+ayHK62TCM=
X-Newsreader: Claws Mail 3.19.1 (GTK+ 2.24.33; x86_64-w64-mingw32)
 by: Michael S - Thu, 1 Feb 2024 14:30 UTC

On Thu, 01 Feb 2024 09:05:19 -0500
EricP <ThatWouldBeTelling@thevillage.com> wrote:

> Anton Ertl wrote:
> > Michael S <already5chosen@yahoo.com> writes:
> >> I have very little to add to very good response by Anton.
> >> That little addition is: the most if not all Rowhammer POC
> >> examples rely on CLFLUSH. That's what the manual says about it:
> >> "Executions of the CLFLUSH instruction are ordered with respect to
> >> each other and with respect to writes, locked read-modify-write
> >> instructions, fence instructions, and executions of CLFLUSHOPT to
> >> the same cache line.1 They are not ordered with respect to
> >> executions of CLFLUSHOPT to different cache lines."
> >>
> >> By now, it seems obvious that making CLFLUSH instruction
> >> non-privilaged and pretty much non-restricted by memory range/page
> >> attributes was a mistake, but that mistake can't be fixed without
> >> breaking things. Considering that CLFLUSH exists since very early
> >> 2000s, it is understandable.
> >> IIRC, ARMv8 did the same mistake a decade later. It is less
> >> understandable.
> >
> > Ideally caches are fully transparent microarchitecture, then you
> > don't need stuff like CLFLUSH. My guess is that CLFLUSH is there
> > for getting DRAM up-to-date for DMA from I/O devices. An
> > alternative would be to let the memory controller remember which
> > lines are modified, and if the I/O device asks for that line, get
> > the up-to-date data from the cache line using the cache-consistency
> > protocol. This would turn CLFLUSH into a noop (at least as far as
> > writing to DRAM is concerned, the ordering constraints may still be
> > relevant), so there is a way to fix this mistake (if it is one).
>
> The text in Intel Vol 1 Architecture manual indicates they viewed all
> these cache control instruction PREFETCH, CLFLUSH, and CLFLUSHOPT
> as part of SSE for use by graphics applications that want to take
> manual control of their caching and minimize cache pollution.
>
> Note that the non-temporal move instructions MOVNTxx were also part of
> that SSE bunch and could also be used to force a write to DRAM.
>

According to Wikipedia, CLFLUSH was not introduced with SSE.
It was introduced together with SSE2, but formally it is not part of it.
CLFLUSHOPT came much, much later and was likely related to the Optane
DIMM aspirations of the late 2010s.

Re: Rowhammer and CLFLUSH

<uph03r$27kba$2@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37180&group=comp.arch#37180

From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Rowhammer and CLFLUSH
Date: Thu, 1 Feb 2024 12:48:59 -0800
Organization: A noiseless patient Spider
Lines: 43
Message-ID: <uph03r$27kba$2@dont-email.me>
References: <upb8i3$12emv$1@dont-email.me> <upcr4t$1drbk$1@dont-email.me>
<20240131131353.0000688c@yahoo.com>
<2024Jan31.181721@mips.complang.tuwien.ac.at>
<VyNuN.188050$yEgf.160511@fx09.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 1 Feb 2024 20:48:59 -0000 (UTC)
User-Agent: Mozilla Thunderbird
Content-Language: en-US
In-Reply-To: <VyNuN.188050$yEgf.160511@fx09.iad>
 by: Chris M. Thomasson - Thu, 1 Feb 2024 20:48 UTC

On 2/1/2024 6:05 AM, EricP wrote:
> Anton Ertl wrote:
>> Michael S <already5chosen@yahoo.com> writes:
>>> I have very little to add to very good response by Anton.
>>> That little addition is: the most if not all Rowhammer POC examples rely
>>> on CLFLUSH. That's what the manual says about it:
>>> "Executions of the CLFLUSH instruction are ordered with respect to each
>>> other and with respect to writes, locked read-modify-write
>>> instructions, fence instructions, and executions of CLFLUSHOPT to the
>>> same cache line.1 They are not ordered with respect to executions of
>>> CLFLUSHOPT to different cache lines."
>>>
>>> By now, it seems obvious that making CLFLUSH instruction non-privilaged
>>> and pretty much non-restricted by memory range/page attributes was a
>>> mistake, but that mistake can't be fixed without breaking things.
>>> Considering that CLFLUSH exists since very early 2000s, it is
>>> understandable.
>>> IIRC, ARMv8 did the same mistake a decade later. It is less
>>> understandable.
>>
>> Ideally caches are fully transparent microarchitecture, then you don't
>> need stuff like CLFLUSH.  My guess is that CLFLUSH is there for
>> getting DRAM up-to-date for DMA from I/O devices.  An alternative
>> would be to let the memory controller remember which lines are
>> modified, and if the I/O device asks for that line, get the up-to-date
>> data from the cache line using the cache-consistency protocol.  This
>> would turn CLFLUSH into a noop (at least as far as writing to DRAM is
>> concerned, the ordering constraints may still be relevant), so there
>> is a way to fix this mistake (if it is one).
>
> The text in Intel Vol 1 Architecture manual indicates they viewed all
> these cache control instruction PREFETCH, CLFLUSH, and CLFLUSHOPT
> as part of SSE for use by graphics applications that want to take
> manual control of their caching and minimize cache pollution.
>
> Note that the non-temporal move instructions MOVNTxx were also part of
> that SSE bunch and could also be used to force a write to DRAM.
>
>
>

Then there are the LFENCE, SFENCE and MFENCE for write back memory.
Non-temporal stores, iirc.

Re: Rowhammer and CLFLUSH

<uph05a$27kba$3@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37181&group=comp.arch#37181

From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Rowhammer and CLFLUSH
Date: Thu, 1 Feb 2024 12:49:46 -0800
Organization: A noiseless patient Spider
Lines: 47
Message-ID: <uph05a$27kba$3@dont-email.me>
References: <upb8i3$12emv$1@dont-email.me> <upcr4t$1drbk$1@dont-email.me>
<20240131131353.0000688c@yahoo.com>
<2024Jan31.181721@mips.complang.tuwien.ac.at>
<VyNuN.188050$yEgf.160511@fx09.iad> <uph03r$27kba$2@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 1 Feb 2024 20:49:46 -0000 (UTC)
User-Agent: Mozilla Thunderbird
Content-Language: en-US
In-Reply-To: <uph03r$27kba$2@dont-email.me>
 by: Chris M. Thomasson - Thu, 1 Feb 2024 20:49 UTC

On 2/1/2024 12:48 PM, Chris M. Thomasson wrote:
> On 2/1/2024 6:05 AM, EricP wrote:
>> Anton Ertl wrote:
>>> Michael S <already5chosen@yahoo.com> writes:
>>>> I have very little to add to very good response by Anton.
>>>> That little addition is: the most if not all Rowhammer POC examples
>>>> rely
>>>> on CLFLUSH. That's what the manual says about it:
>>>> "Executions of the CLFLUSH instruction are ordered with respect to each
>>>> other and with respect to writes, locked read-modify-write
>>>> instructions, fence instructions, and executions of CLFLUSHOPT to the
>>>> same cache line.1 They are not ordered with respect to executions of
>>>> CLFLUSHOPT to different cache lines."
>>>>
>>>> By now, it seems obvious that making CLFLUSH instruction non-privilaged
>>>> and pretty much non-restricted by memory range/page attributes was a
>>>> mistake, but that mistake can't be fixed without breaking things.
>>>> Considering that CLFLUSH exists since very early 2000s, it is
>>>> understandable.
>>>> IIRC, ARMv8 did the same mistake a decade later. It is less
>>>> understandable.
>>>
>>> Ideally caches are fully transparent microarchitecture, then you don't
>>> need stuff like CLFLUSH.  My guess is that CLFLUSH is there for
>>> getting DRAM up-to-date for DMA from I/O devices.  An alternative
>>> would be to let the memory controller remember which lines are
>>> modified, and if the I/O device asks for that line, get the up-to-date
>>> data from the cache line using the cache-consistency protocol.  This
>>> would turn CLFLUSH into a noop (at least as far as writing to DRAM is
>>> concerned, the ordering constraints may still be relevant), so there
>>> is a way to fix this mistake (if it is one).
>>
>> The text in Intel Vol 1 Architecture manual indicates they viewed all
>> these cache control instruction PREFETCH, CLFLUSH, and CLFLUSHOPT
>> as part of SSE for use by graphics applications that want to take
>> manual control of their caching and minimize cache pollution.
>>
>> Note that the non-temporal move instructions MOVNTxx were also part of
>> that SSE bunch and could also be used to force a write to DRAM.
>>
>>
>>
>
> Then there are the LFENCE, SFENCE and MFENCE for write back memory.
> Non-temporal stores, iirc.

Oops, non-write back memory! IIRC. Sorry.

Re: AMD Cache speed funny

<Pb6cnSOq7p98XCH4nZ2dnZfqnPqdnZ2d@supernews.com>

https://news.novabbs.org/devel/article-flat.php?id=37183&group=comp.arch#37183

NNTP-Posting-Date: Fri, 02 Feb 2024 10:20:17 +0000
Sender: Andrew Haley <aph@zarquon.pink>
From: aph@littlepinkcloud.invalid
Subject: Re: AMD Cache speed funny
Newsgroups: comp.arch
References: <upb8i3$12emv$1@dont-email.me> <upcr4t$1drbk$1@dont-email.me> <20240131131353.0000688c@yahoo.com> <P-qdnZrfj5Fc-yb4nZ2dnZfqn_SdnZ2d@supernews.com> <20240201153646.00006e78@yahoo.com>
User-Agent: tin/1.9.2-20070201 ("Dalaruan") (UNIX) (Linux/4.18.0-477.27.1.el8_8.x86_64 (x86_64))
Message-ID: <Pb6cnSOq7p98XCH4nZ2dnZfqnPqdnZ2d@supernews.com>
Date: Fri, 02 Feb 2024 10:20:17 +0000
Lines: 43
 by: aph@littlepinkcloud.invalid - Fri, 2 Feb 2024 10:20 UTC

Michael S <already5chosen@yahoo.com> wrote:
> On Thu, 01 Feb 2024 09:39:13 +0000
> aph@littlepinkcloud.invalid wrote:
>
>> Michael S <already5chosen@yahoo.com> wrote:
>> >
>> > By now, it seems obvious that making CLFLUSH instruction
>> > non-privilaged and pretty much non-restricted by memory range/page
>> > attributes was a mistake, but that mistake can't be fixed without
>> > breaking things. Considering that CLFLUSH exists since very early
>> > 2000s, it is understandable.
>> > IIRC, ARMv8 did the same mistake a decade later. It is less
>> > understandable.
>>
>> For Arm, with its non-coherent data and instruction caches, you need
>> some way to flush dcache to the point of unification in order to make
>> instruction changes visible. Also, regardless of icache coherence,
>> when using non-volatile memory you need an efficient way to flush
>> dcache to the point of peristence. You need that in order to make
>> sure that a transaction has been written to a log.
>>
>> With the latter, you could restrict dcache flushes to pages with a
>> particular non-volatile attribute. I don't think there's anything you
>> can do about the former, short of simply making i- and d-cache
>> coherent.
>
> For the later, privileged flush instruction sounds sufficient.

Does it? You're trying for high throughput, and a full system call
wouldn't help with that. And besides, if userspace can ask the kernel to
do something on its behalf, you haven't added any security by making
it privileged.

> For the former, ARMv8 appears to have a special instruction (or you can
> call it a special variant of DC instruction) - Clean by virtual address
> to point of unification (DC CVAU). This instruction alone would not
> make RH attack much easier. The problem is that privilagability of this
> instruction controlled by the same bit as privilagability of two much
> more dangerous variations of DC (DC CVAC and DC CIVAC).

Ah, thanks.

Andrew.

Re: Rowhammer and CLFLUSH

<5g9vN.55388$6ePe.50431@fx42.iad>

https://news.novabbs.org/devel/article-flat.php?id=37184&group=comp.arch#37184

From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Rowhammer and CLFLUSH
References: <upb8i3$12emv$1@dont-email.me> <upcr4t$1drbk$1@dont-email.me> <20240131131353.0000688c@yahoo.com> <2024Jan31.181721@mips.complang.tuwien.ac.at> <6e4daf9e72082682c2fe92fa3054ccad@www.novabbs.com>
In-Reply-To: <6e4daf9e72082682c2fe92fa3054ccad@www.novabbs.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 25
Message-ID: <5g9vN.55388$6ePe.50431@fx42.iad>
NNTP-Posting-Date: Fri, 02 Feb 2024 17:04:01 UTC
Date: Fri, 02 Feb 2024 12:03:41 -0500
 by: EricP - Fri, 2 Feb 2024 17:03 UTC

MitchAlsup wrote:
> Anton Ertl wrote:
>
>
> Rowhammer happens when you beat on the same cache line multiple times
> {causing a charge sharing problem on the word lines. Every time you cause
> the DRAM to precharge (deActivate) you lose the count on how many times
> you have to bang on the same word line to disrupt the stored cells.
>
> So, the trick is to detect the RowHammering and insert refresh commands.

It's not just the immediately physically adjacent rows -
I think I read that the effect falls off for up to +-3 rows away.

Also it may be data dependent - 0's bleed into adjacent 1's and 1's into 0's.

And the threshold at which it triggers has been dropping as DRAMs become
more dense. In 2014, when this was first encountered, it took 139K
activations. By 2020 that was down to 4.8K.

So figuring out how much a row has been damaged is complicated,
and the window for detecting it is getting smaller.

Re: Rowhammer and CLFLUSH

<br9vN.95150$STLe.2753@fx34.iad>

https://news.novabbs.org/devel/article-flat.php?id=37185&group=comp.arch#37185

From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Rowhammer and CLFLUSH
References: <upb8i3$12emv$1@dont-email.me> <upcr4t$1drbk$1@dont-email.me> <20240131131353.0000688c@yahoo.com> <2024Jan31.181721@mips.complang.tuwien.ac.at> <20240131224915.000063d9@yahoo.com> <aba96427866214c1b2981a75ecc45db7@www.novabbs.com>
In-Reply-To: <aba96427866214c1b2981a75ecc45db7@www.novabbs.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 32
Message-ID: <br9vN.95150$STLe.2753@fx34.iad>
NNTP-Posting-Date: Fri, 02 Feb 2024 17:15:51 UTC
Date: Fri, 02 Feb 2024 12:15:21 -0500
 by: EricP - Fri, 2 Feb 2024 17:15 UTC

MitchAlsup wrote:
> Michael S wrote:
>
>> Original RH required very high hammering rate that certainly can't be
>> achieved by playing with associativity of L3 cache.
>
>> Newer multiside hammering probably can do it in theory, but it would be
>> very difficult in practice.
>
> The problem here is the fact that DRAMs do not use linear decoders, so
> address X and address X+1 do not necessarily shared paired word lines.
> The word lines could be as far as ½ the block away from each other.
>
> The DRAM decoders are faster and smaller when there is a grey-like-code
> imposed on the logical-address to physical-word-line. This also happens
> in SRAM decoders. Going back and looking at the most used logical to
> physical mapping shows that while X and X+1 can (occasionally) be side
> by side, X, X+1 and X+2 should never be 3 words lines in a row.

A 16 Gb DRAM with 8 Kb rows has 2^21 = 2 million rows,
so having a counter for each row is impractical.

I was wondering if each row could have a "canary" bit,
a specially weakened bit that always flips early.
This would also intrinsically handle the cases of effects
falling off over the +-3 adjacent rows.

Then a giant 2 million input OR gate would tell us if any row's
canary had flipped.

Re: Rowhammer and CLFLUSH

<upjb4a$3gug8$1@newsreader4.netcologne.de>

https://news.novabbs.org/devel/article-flat.php?id=37186&group=comp.arch#37186

From: tkoenig@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Rowhammer and CLFLUSH
Date: Fri, 2 Feb 2024 18:09:14 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <upjb4a$3gug8$1@newsreader4.netcologne.de>
References: <upb8i3$12emv$1@dont-email.me> <upcr4t$1drbk$1@dont-email.me>
<20240131131353.0000688c@yahoo.com>
<2024Jan31.181721@mips.complang.tuwien.ac.at>
<20240131224915.000063d9@yahoo.com>
<aba96427866214c1b2981a75ecc45db7@www.novabbs.com>
<br9vN.95150$STLe.2753@fx34.iad>
Injection-Date: Fri, 2 Feb 2024 18:09:14 -0000 (UTC)
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Fri, 2 Feb 2024 18:09 UTC

EricP <ThatWouldBeTelling@thevillage.com> schrieb:

> Then a giant 2 million input OR gate would tell us if any row's
> canary had flipped.

That would look... interesting.

How are large OR gates actually constructed? I would assume that an
eight-input OR gate could look something like

nand(nor(a,b),nor(c,d),nor(e,f),nor(g,h))

which would reduce the number of inputs by a factor of 2^3, so
seven layers of these OR gates would be needed.

Wiring would be interesting as well...

Re: Rowhammer and CLFLUSH

<6832320db42374bad77a88e774aaec5e@www.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=37187&group=comp.arch#37187

From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Rowhammer and CLFLUSH
Date: Fri, 2 Feb 2024 19:18:12 +0000
Organization: novaBBS
Message-ID: <6832320db42374bad77a88e774aaec5e@www.novabbs.com>
References: <upb8i3$12emv$1@dont-email.me> <upcr4t$1drbk$1@dont-email.me> <20240131131353.0000688c@yahoo.com> <2024Jan31.181721@mips.complang.tuwien.ac.at> <20240131224915.000063d9@yahoo.com> <aba96427866214c1b2981a75ecc45db7@www.novabbs.com> <br9vN.95150$STLe.2753@fx34.iad> <upjb4a$3gug8$1@newsreader4.netcologne.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
 by: MitchAlsup - Fri, 2 Feb 2024 19:18 UTC

Thomas Koenig wrote:

> EricP <ThatWouldBeTelling@thevillage.com> schrieb:

>> Then a giant 2 million input OR gate would tell us if any row's
>> canary had flipped.

> That would look... interesting.

> How are large OR gates actually constructed? I would assume that an
> eight-input OR gate could look something like

> nand(nor(a,b),nor(c,d),nor(e,f),nor(g,h))

Close, but NANDs come with 4 inputs and NORs come with 3 (*), so you get
a 3×4 = 12:1 reduction per pair of stages.

2985984->248832->20736->1728->144->12->1

> which would reduce the number of inputs by a factor of 2^3, so
> seven layers of these OR gates would be needed.

6, not 7.

> Wiring would be interesting as well...

That is why we have 10 layers of metal--oh wait DRAMs don't have that
much metal.....

(*) NANDs having 4 inputs while NORs only have 3 is a consequence of
P-channel transistors having lower transconductance and higher body
effects, and there are differences between planar transistors and
finFETs here, too.
