Rocksolid Light

Welcome to Rocksolid Light



devel / comp.arch / AMD Cache speed funny

Subject -- Author
* AMD Cache speed funny -- Vir Campestris
+- Re: AMD Cache speed funny -- Anton Ertl
+- Re: AMD Cache speed funny -- Michael S
+* Re: AMD Cache speed funny -- MitchAlsup1
|`- Re: AMD Cache speed funny -- Michael S
`* Re: AMD Cache speed funny -- Terje Mathisen
 +- Re: AMD Cache speed funny -- Anton Ertl
 `* Re: AMD Cache speed funny -- Michael S
  +- Re: AMD Cache speed funny -- Scott Lurndal
  +* Rowhammer and CLFLUSH (was: AMD Cache speed funny) -- Anton Ertl
  |+* Re: Rowhammer and CLFLUSH -- MitchAlsup
  ||`* Re: Rowhammer and CLFLUSH -- EricP
  || `* Re: Rowhammer and CLFLUSH -- MitchAlsup
  ||  +* Re: Rowhammer and CLFLUSH -- EricP
  ||  |`- Re: Rowhammer and CLFLUSH -- EricP
  ||  `- Re: Rowhammer and CLFLUSH -- EricP
  |+* Re: Rowhammer and CLFLUSH (was: AMD Cache speed funny) -- Michael S
  ||+* Re: Rowhammer and CLFLUSH -- MitchAlsup
  |||`* Re: Rowhammer and CLFLUSH -- EricP
  ||| +* Re: Rowhammer and CLFLUSH -- Thomas Koenig
  ||| |`- Re: Rowhammer and CLFLUSH -- MitchAlsup
  ||| `* Re: Rowhammer and CLFLUSH -- Anton Ertl
  |||  +* Re: Rowhammer and CLFLUSH -- MitchAlsup
  |||  |`* Re: Rowhammer and CLFLUSH -- Anton Ertl
  |||  | `* Re: Rowhammer and CLFLUSH -- MitchAlsup
  |||  |  `* Re: Rowhammer and CLFLUSH -- Anton Ertl
  |||  |   `- Re: Rowhammer and CLFLUSH -- MitchAlsup1
  |||  +* Re: Rowhammer and CLFLUSH -- EricP
  |||  |`* Re: Rowhammer and CLFLUSH -- Anton Ertl
  |||  | `- Re: Rowhammer and CLFLUSH -- MitchAlsup1
  |||  `* Re: Rowhammer and CLFLUSH -- MitchAlsup1
  |||   `* Re: Rowhammer and CLFLUSH -- Anton Ertl
  |||    `* Re: Rowhammer and CLFLUSH -- EricP
  |||     `* Re: Rowhammer and CLFLUSH -- MitchAlsup1
  |||      +- Re: Rowhammer and CLFLUSH -- Anton Ertl
  |||      `* Re: Rowhammer and CLFLUSH -- EricP
  |||       `* Re: Rowhammer and CLFLUSH -- MitchAlsup1
  |||        +- Re: Rowhammer and CLFLUSH -- Michael S
  |||        `* Re: Rowhammer and CLFLUSH -- Michael S
  |||         +* Re: Rowhammer and CLFLUSH -- Scott Lurndal
  |||         |`- Re: Rowhammer and CLFLUSH -- Michael S
  |||         `* Re: Rowhammer and CLFLUSH -- MitchAlsup1
  |||          `* Re: Rowhammer and CLFLUSH -- Michael S
  |||           `* Re: Rowhammer and CLFLUSH -- MitchAlsup1
  |||            `* Re: Rowhammer and CLFLUSH -- Michael S
  |||             +* Re: Rowhammer and CLFLUSH -- EricP
  |||             |`* Re: Rowhammer and CLFLUSH -- Michael S
  |||             | `* Re: Rowhammer and CLFLUSH -- EricP
  |||             |  `* Re: Rowhammer and CLFLUSH -- Michael S
  |||             |   `- Re: Rowhammer and CLFLUSH -- EricP
  |||             `* Re: Rowhammer and CLFLUSH -- EricP
  |||              `- Re: Rowhammer and CLFLUSH -- Michael S
  ||+- Re: Rowhammer and CLFLUSH -- EricP
  ||`* Re: Rowhammer and CLFLUSH (was: AMD Cache speed funny) -- Anton Ertl
  || `- Re: Rowhammer and CLFLUSH -- MitchAlsup
  |`* Re: Rowhammer and CLFLUSH -- EricP
  | +- Re: Rowhammer and CLFLUSH -- Michael S
  | `* Re: Rowhammer and CLFLUSH -- Chris M. Thomasson
  |  `- Re: Rowhammer and CLFLUSH -- Chris M. Thomasson
  `* Re: AMD Cache speed funny -- aph
   `* Re: AMD Cache speed funny -- Michael S
    `- Re: AMD Cache speed funny -- aph

AMD Cache speed funny

<upb8i3$12emv$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37152&group=comp.arch#37152

From: vir.campestris@invalid.invalid (Vir Campestris)
Newsgroups: comp.arch
Subject: AMD Cache speed funny
Date: Tue, 30 Jan 2024 16:36:17 +0000
Organization: A noiseless patient Spider
Message-ID: <upb8i3$12emv$1@dont-email.me>
User-Agent: Mozilla Thunderbird
 by: Vir Campestris - Tue, 30 Jan 2024 16:36 UTC

I've knocked up a little utility program to try to work out some
performance figures for my CPU.

It's an AMD Ryzen™ 5 3400G. It says on the spec it has:
4MB L3 cache
2MB L2 cache
384kb L1 cache

What I do is to xor a location in memory in an array many times.
The size of the area I xor over is set by a mask on the store index.
The words in the store are 64 bit.

A C++ fragment is this. I can post the whole thing if it would help.

// Calculate a bit mask for the entire store
Word mask = storeWordCount - 1;

Stopwatch s;
s.start();
while (1)       // until break when mask runs out
{
        for (size_t index = 0; index < storeWordCount; ++index)
        {
                // read and write a word in store.
                Raw[index & mask] ^= index;
        }
        s.lap(mask);            // records the current time

        if (mask == 0) break;   // Stop if we've run out of mask

        mask >>= 1;             // shrink the mask
}

As you can see it starts with a large mask (in fact for a whole GB) and
halves it as it goes around.

All looks fine at first. I get about 8GB per second with a large mask,
at 4MB it goes up to 15GB/s, at 8MB up to 23. It holds that as the mask
gets smaller. No apparent effect when it gets under the L1 cache size.

But...
When the mask is very small (3) it slows to 18GB/s. With 1 it halves
again, and with zero (so it only operates on the same word over and
over) it's half again. A fifth of the size with a large block.

Something odd is happening here when I hammer the same location (32
bytes and on down) so that it's slower. Yet this ought to be in the L1
data cache.

A late thought was to replace that ^= index with something that reads
the memory only, or that writes it only, instead of doing a
read-modify-write cycle. That gives me much faster performance with
writes than reads. And neither read only, nor write only, show this odd
slow down with small masks.

What am I missing?

Thanks
Andy

Re: AMD Cache speed funny

<2024Jan30.182059@mips.complang.tuwien.ac.at>

https://news.novabbs.org/devel/article-flat.php?id=37153&group=comp.arch#37153

From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: AMD Cache speed funny
Date: Tue, 30 Jan 2024 17:20:59 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2024Jan30.182059@mips.complang.tuwien.ac.at>
References: <upb8i3$12emv$1@dont-email.me>
X-newsreader: xrn 10.11
 by: Anton Ertl - Tue, 30 Jan 2024 17:20 UTC

Vir Campestris <vir.campestris@invalid.invalid> writes:
> for (size_t index = 0; index < storeWordCount; ++index)
> {
> // read and write a word in store.
> Raw[index & mask] ^= index;
> }
[...]
>When the mask is very small (3) it slows to 18GB/s. With 1 it halves
>again, and with zero (so it only operates on the same word over and
>over) it's half again. A fifth of the size with a large block.
>
>Something odd is happening here when I hammer the same location (32
>bytes and on down) so that it's slower. Yet this ought to be in the L1
>data cache.
>
>A late thought was to replace that ^= index with something that reads
>the memory only, or that writes it only, instead of doing a
>read-modify-write cycle. That gives me much faster performance with
>writes than reads. And neither read only, nor write only, show this odd
>slow down with small masks.
>
>What am I missing?

When you do

raw[0] ^= index;

in every step you read the result of the previous iteration, xor it,
and store it again. This means that you have one chain of RMW data
dependences, with one RMW per iteration. On the Zen2 (which your
3400G has), this requires 8 cycles (see column H of
<http://www.complang.tuwien.ac.at/anton/memdep/>). With mask=1, you
get 2 chains, each with one 8-cycle RMW every second iteration, so you
need 4 cycles per iteration (see my column C). With mask=3, you get 4
chains and 2 cycles per iteration. Looking at my results, I would
expect another doubling with mask=7, but maybe your loop is running
into resource limits at that point (mine does 4 RMWs per iteration).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: AMD Cache speed funny

<20240130193815.00003f26@yahoo.com>

https://news.novabbs.org/devel/article-flat.php?id=37154&group=comp.arch#37154

From: already5chosen@yahoo.com (Michael S)
Newsgroups: comp.arch
Subject: Re: AMD Cache speed funny
Date: Tue, 30 Jan 2024 19:38:15 +0200
Organization: A noiseless patient Spider
Message-ID: <20240130193815.00003f26@yahoo.com>
References: <upb8i3$12emv$1@dont-email.me>
X-Newsreader: Claws Mail 3.19.1 (GTK+ 2.24.33; x86_64-w64-mingw32)
 by: Michael S - Tue, 30 Jan 2024 17:38 UTC

On Tue, 30 Jan 2024 16:36:17 +0000
Vir Campestris <vir.campestris@invalid.invalid> wrote:

> I've knocked up a little utility program to try to work out some
> performance figures for my CPU.
>
> It's an AMD Ryzen™ 5 3400G. It says on the spec it has:
> 4MB L3 cache
> 2MB L2 cache
> 384kb L1 cache
>

That's for the whole chip, and it includes the L1I caches.
For an individual core, excluding L1I, the numbers are:
4MB L3 cache
512 KB L2 cache
32 KB L1D cache

> What I do is to xor a location in memory in an array many times.
> The size of the area I xor over is set by a mask on the store index.
> The words in the store are 64 bit.
>
> A C++ fragment is this. I can post the whole thing if it would help.
>
> // Calculate a bit mask for the entire store
> Word mask = storeWordCount - 1;
>
> Stopwatch s;
> s.start();
> while (1) // until break when mask runs out
> {
> for (size_t index = 0; index < storeWordCount; ++index)
> {
> // read and write a word in store.
> Raw[index & mask] ^= index;
> }
> s.lap(mask); // records the current time
>
> if (mask == 0) break; // Stop if we've run out of mask
>
> mask >>= 1; // shrink the mask
> }
>
> As you can see it starts with a large mask (in fact for a whole GB)
> and halves it as it goes around.
>
> All looks fine at first. I get about 8GB per second with a large
> mask, at 4MB it goes up to 15GB/s, at 8MB up to 23. It holds that as
> the mask gets smaller. No apparent effect when it gets under the L1
> cache size.
>
> But...
> When the mask is very small (3) it slows to 18GB/s. With 1 it halves
> again, and with zero (so it only operates on the same word over and
> over) it's half again. A fifth of the size with a large block.
>
> Something odd is happening here when I hammer the same location (32
> bytes and on down) so that it's slower. Yet this ought to be in the
> L1 data cache.
>
> A late thought was to replace that ^= index with something that reads
> the memory only, or that writes it only, instead of doing a
> read-modify-write cycle. That gives me much faster performance with
> writes than reads. And neither read only, nor write only, show this
> odd slow down with small masks.
>
> What am I missing?
>
> Thanks
> Andy

First, I'd look at the generated asm.
If the compiler was doing a good job, then at mask <= 4095 (32 KB) you
should see slightly less than 1 iteration of the loop per cycle, i.e.,
assuming a 4.2 GHz clock, approximately 30 GB/s.
Since you see less, it's a sign that the compiler did a less than
perfect job. Try to help it with manual loop unrolling.

As to the problem with lower performance at very small masks, it's
expected. The CPU tries to execute loads speculatively out of order
under the assumption that they don't alias with preceding stores. So
actual loads run a few loop iterations ahead of the stores. We can't
say for sure how many iterations ahead, but 7 to 10 iterations sounds
like a good guess. When your mask=7 (64 bytes) then aliasing starts to
happen. On old primitive CPUs, like the Pentium 4, it causes a massive
slowdown, because those early loads have to be replayed after a rather
significant delay of about 20 cycles (the length of the pipeline). Your
Zen1+ CPU is much smarter; it detects that things are no good and stops
the wild speculation, so you don't see a huge slowdown. But without
speculation, every load starts only after all stores that preceded it
in program order were either committed into the L1D cache or had their
addresses checked against the speculative load address with no aliasing
found. Since you see only a mild slowdown, it seems that the latter is
done rather effectively and your CPU is still able to run loads
speculatively, but now only 2 or 3 steps ahead, which is not enough to
get the same performance as before.

Re: AMD Cache speed funny

<daff11f49fd13567c56c0a872fae7735@www.novabbs.org>

https://news.novabbs.org/devel/article-flat.php?id=37155&group=comp.arch#37155

From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: AMD Cache speed funny
Date: Tue, 30 Jan 2024 20:11:42 +0000
Organization: Rocksolid Light
Message-ID: <daff11f49fd13567c56c0a872fae7735@www.novabbs.org>
References: <upb8i3$12emv$1@dont-email.me>
User-Agent: Rocksolid Light
 by: MitchAlsup1 - Tue, 30 Jan 2024 20:11 UTC

Vir Campestris wrote:

> As you can see it starts with a large mask (in fact for a whole GB) and
> halves it as it goes around.

> All looks fine at first. I get about 8GB per second with a large mask,
> at 4MB it goes up to 15GB/s, at 8MB up to 23. It holds that as the mask
> gets smaller. No apparent effect when it gets under the L1 cache size.

The execution window is apparently able to absorb the latency of L3 miss,
and stream L3->L1 accesses.

Anton answered the question regarding small masks.

Re: AMD Cache speed funny

<20240130223705.00001d96@yahoo.com>

https://news.novabbs.org/devel/article-flat.php?id=37156&group=comp.arch#37156

From: already5chosen@yahoo.com (Michael S)
Newsgroups: comp.arch
Subject: Re: AMD Cache speed funny
Date: Tue, 30 Jan 2024 22:37:05 +0200
Organization: A noiseless patient Spider
Message-ID: <20240130223705.00001d96@yahoo.com>
References: <upb8i3$12emv$1@dont-email.me>
 <daff11f49fd13567c56c0a872fae7735@www.novabbs.org>
X-Newsreader: Claws Mail 4.1.1 (GTK 3.24.34; x86_64-w64-mingw32)
 by: Michael S - Tue, 30 Jan 2024 20:37 UTC

On Tue, 30 Jan 2024 20:11:42 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:

> Vir Campestris wrote:
>
> > As you can see it starts with a large mask (in fact for a whole GB)
> > and halves it as it goes around.
>
> > All looks fine at first. I get about 8GB per second with a large
> > mask, at 4MB it goes up to 15GB/s, at 8MB up to 23. It holds that
> > as the mask gets smaller. No apparent effect when it gets under the
> > L1 cache size.
>
> The execution window is apparently able to absorb the latency of L3
> miss, and stream L3->L1 accesses.
>

That sounds unlikely. L3 latency is too big to be covered by the
execution window. Much more likely they have adequate HW prefetch from
L3 to L2, and maybe (less likely) even to L1D.

> Anton answered the question regarding small masks.

Re: AMD Cache speed funny

<upcr4t$1drbk$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37157&group=comp.arch#37157

From: terje.mathisen@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: AMD Cache speed funny
Date: Wed, 31 Jan 2024 07:59:41 +0100
Organization: A noiseless patient Spider
Message-ID: <upcr4t$1drbk$1@dont-email.me>
References: <upb8i3$12emv$1@dont-email.me>
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0 SeaMonkey/2.53.18.1
 by: Terje Mathisen - Wed, 31 Jan 2024 06:59 UTC

Vir Campestris wrote:
> I've knocked up a little utility program to try to work out some
> performance figures for my CPU.
>
> It's an AMD Ryzen™ 5 3400G. It says on the spec it has:
> 4MB L3 cache
> 2MB L2 cache
> 384kb L1 cache
>
> What I do is to xor a location in memory in an array many times.
> The size of the area I xor over is set by a mask on the store index.
> The words in the store are 64 bit.
>
> A C++ fragment is this. I can post the whole thing if it would help.
>
> // Calculate a bit mask for the entire store
> Word mask = storeWordCount - 1;
>
> Stopwatch s;
> s.start();
> while (1)       // until break when mask runs out
> {
>         for (size_t index = 0; index < storeWordCount; ++index)
>         {
>                 // read and write a word in store.
>                 Raw[index & mask] ^= index;
>         }
>         s.lap(mask);            // records the current time
>
>         if (mask == 0) break;   // Stop if we've run out of mask
>
>         mask >>= 1;             // shrink the mask
> }
>
> As you can see it starts with a large mask (in fact for a whole GB) and
> halves it as it goes around.
>
> All looks fine at first. I get about 8GB per second with a large mask,
> at 4MB it goes up to 15GB/s, at 8MB up to 23. It holds that as the mask
> gets smaller. No apparent effect when it gets under the L1 cache size.
>
> But...
> When the mask is very small (3) it slows to 18GB/s. With 1 it halves
> again, and with zero (so it only operates on the same word over and
> over) it's half again. A fifth of the size with a large block.
>
> Something odd is happening here when I hammer the same location (32
> bytes and on down) so that it's slower. Yet this ought to be in the L1
> data cache.
>
> A late thought was to replace that ^= index with something that reads
> the memory only, or that writes it only, instead of doing a
> read-modify-write cycle. That gives me much faster performance with
> writes than reads. And neither read only, nor write only, show this odd
> slow down with small masks.

Mitch, Anton and Michael have already answered, I just want to add that
we have one additional potential factor:

Rowhammer protection:

It is possible that the pattern of re-XORing the same or a small number
of locations over and over could trigger a pattern detector which was
designed to mitigate against Rowhammer.

OTOH, this would much more easily be handled with memory range based
coalescing of write operations in the last level cache, right?

I.e. for normal (write combining) memory, it would (afaik) be legal to
delay the actual writes to RAM for a significant time, long enough to
merge multiple memory writes.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: AMD Cache speed funny

<2024Jan31.091713@mips.complang.tuwien.ac.at>

https://news.novabbs.org/devel/article-flat.php?id=37158&group=comp.arch#37158

From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: AMD Cache speed funny
Date: Wed, 31 Jan 2024 08:17:13 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2024Jan31.091713@mips.complang.tuwien.ac.at>
References: <upb8i3$12emv$1@dont-email.me> <upcr4t$1drbk$1@dont-email.me>
X-newsreader: xrn 10.11
 by: Anton Ertl - Wed, 31 Jan 2024 08:17 UTC

Terje Mathisen <terje.mathisen@tmsw.no> writes:
>Rowhammer protection:
>
>It is possible that the pattern of re-XORing the same or a small number
>of locations over and over could trigger a pattern detector which was
>designed to mitigate against Rowhammer.

I don't think that memory controller designers have actually
implemented Rowhammer protection: I would expect that the processor
manufacturers would have bragged about that if they had. They have
not. And even RAM manufacturers have stopped mentioning anything
about Rowhammer in their specs. It seems that all hardware
manufacturers have decided that Rowhammer is something that will just
disappear from public knowledge (and therefore from what they have to
deal with) if they just ignore it long enough. It appears that they
are right.

They seem to take the same approach wrt Spectre-family attacks. In
that case, however, new variants appear all the time, so maybe the
approach won't work here.

However, in the present case "the same small number of locations" is
not hammered, because a small number of memory locations fits into the
cache in the adjacent access pattern that this test uses, and all
writes will just be to the cache.

>OTOH, this would much more easily be handled with memory range based
>coalescing of write operations in the last level cache, right?

We have had write-back caches (at the L2 or L1 level, and certainly at
the LLC level) since the later 486 years.

>I.e. for normal (write combining) memory

Normal memory is write-back. AFAIK write combining is for stuff like
graphics card memory.

>it would (afaik) be legal to
>delay the actual writes to RAM for a significant time, long enough to
>merge multiple memory writes.

And this is what actually happens, through the magic of write-back
caches.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: AMD Cache speed funny

<20240131131353.0000688c@yahoo.com>

https://news.novabbs.org/devel/article-flat.php?id=37159&group=comp.arch#37159

From: already5chosen@yahoo.com (Michael S)
Newsgroups: comp.arch
Subject: Re: AMD Cache speed funny
Date: Wed, 31 Jan 2024 13:13:53 +0200
Organization: A noiseless patient Spider
Message-ID: <20240131131353.0000688c@yahoo.com>
References: <upb8i3$12emv$1@dont-email.me>
 <upcr4t$1drbk$1@dont-email.me>
X-Newsreader: Claws Mail 3.19.1 (GTK+ 2.24.33; x86_64-w64-mingw32)
 by: Michael S - Wed, 31 Jan 2024 11:13 UTC

On Wed, 31 Jan 2024 07:59:41 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:

> Vir Campestris wrote:
> > I've knocked up a little utility program to try to work out some
> > performance figures for my CPU.
> >
> > It's an AMD Ryzen™ 5 3400G. It says on the spec it has:
> > 4MB L3 cache
> > 2MB L2 cache
> > 384kb L1 cache
> >
> > What I do is to xor a location in memory in an array many times.
> > The size of the area I xor over is set by a mask on the store index.
> > The words in the store are 64 bit.
> >
> > A C++ fragment is this. I can post the whole thing if it would help.
> >
> > // Calculate a bit mask for the entire store
> > Word mask = storeWordCount - 1;
> >
> > Stopwatch s;
> > s.start();
> > while (1)       // until break when mask runs out
> > {
> >         for (size_t index = 0; index < storeWordCount; ++index)
> >         {
> >                 // read and write a word in store.
> >                 Raw[index & mask] ^= index;
> >         }
> >         s.lap(mask);            // records the current time
> >
> >         if (mask == 0) break;   // Stop if we've run out of mask
> >
> >         mask >>= 1;             // shrink the mask
> > }
> >
> > As you can see it starts with a large mask (in fact for a whole GB)
> > and halves it as it goes around.
> >
> > All looks fine at first. I get about 8GB per second with a large
> > mask, at 4MB it goes up to 15GB/s, at 8MB up to 23. It holds that
> > as the mask gets smaller. No apparent effect when it gets under the
> > L1 cache size.
> >
> > But...
> > When the mask is very small (3) it slows to 18GB/s. With 1 it
> > halves again, and with zero (so it only operates on the same word
> > over and over) it's half again. A fifth of the size with a large
> > block.
> >
> > Something odd is happening here when I hammer the same location (32
> > bytes and on down) so that it's slower. Yet this ought to be in the
> > L1 data cache.
> >
> > A late thought was to replace that ^= index with something that
> > reads the memory only, or that writes it only, instead of doing a
> > read-modify-write cycle. That gives me much faster performance with
> > writes than reads. And neither read only, nor write only, show this
> > odd slow down with small masks.
>
> Mitch, Anton and Michael have already answered, I just want to add
> that we have one additional potential factor:
>
> Rowhammer protection:
>
> It is possible that the pattern of re-XORing the same or a small
> number of locations over and over could trigger a pattern detector
> which was designed to mitigate against Rowhammer.
>
> OTOH, this would much more easily be handled with memory range based
> coalescing of write operations in the last level cache, right?
>
> I.e. for normal (write combining) memory, it would (afaik) be legal
> to delay the actual writes to RAM for a significant time, long enough
> to merge multiple memory writes.
>
> Terje
>
>

I have very little to add to the very good response by Anton.
That little addition is: most if not all Rowhammer POC examples rely
on CLFLUSH. This is what the manual says about it:
"Executions of the CLFLUSH instruction are ordered with respect to each
other and with respect to writes, locked read-modify-write
instructions, fence instructions, and executions of CLFLUSHOPT to the
same cache line. They are not ordered with respect to executions of
CLFLUSHOPT to different cache lines."

By now, it seems obvious that making the CLFLUSH instruction
non-privileged and pretty much unrestricted by memory range/page
attributes was a mistake, but that mistake can't be fixed without
breaking things. Considering that CLFLUSH has existed since the very
early 2000s, it is understandable.
IIRC, ARMv8 made the same mistake a decade later. It is less
understandable.

Re: AMD Cache speed funny

<mktuN.273502$Wp_8.214627@fx17.iad>

https://news.novabbs.org/devel/article-flat.php?id=37160&group=comp.arch#37160

From: scott@slp53.sl.home (Scott Lurndal)
Newsgroups: comp.arch
Subject: Re: AMD Cache speed funny
Date: Wed, 31 Jan 2024 15:04:50 GMT
Organization: UsenetServer - www.usenetserver.com
Message-ID: <mktuN.273502$Wp_8.214627@fx17.iad>
References: <upb8i3$12emv$1@dont-email.me> <upcr4t$1drbk$1@dont-email.me> <20240131131353.0000688c@yahoo.com>
X-newsreader: xrn 9.03-beta-14-64bit
 by: Scott Lurndal - Wed, 31 Jan 2024 15:04 UTC

Michael S <already5chosen@yahoo.com> writes:
>On Wed, 31 Jan 2024 07:59:41 +0100
>Terje Mathisen <terje.mathisen@tmsw.no> wrote:
>
>
>By now, it seems obvious that making the CLFLUSH instruction
>non-privileged and pretty much unrestricted by memory range/page
>attributes was a mistake, but that mistake can't be fixed without
>breaking things. Considering that CLFLUSH has existed since the very
>early 2000s, it is understandable.
>IIRC, ARMv8 made the same mistake a decade later. It is less
>understandable.

ARMv8 has a control bit that can be set to allow EL0 access
to the DC system instructions. By default it is a privileged
instruction. It is up to the operating software to enable
it for user-mode code.

Rowhammer and CLFLUSH (was: AMD Cache speed funny)

<2024Jan31.181721@mips.complang.tuwien.ac.at>

https://news.novabbs.org/devel/article-flat.php?id=37161&group=comp.arch#37161

From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Rowhammer and CLFLUSH (was: AMD Cache speed funny)
Date: Wed, 31 Jan 2024 17:17:21 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2024Jan31.181721@mips.complang.tuwien.ac.at>
References: <upb8i3$12emv$1@dont-email.me> <upcr4t$1drbk$1@dont-email.me> <20240131131353.0000688c@yahoo.com>
X-newsreader: xrn 10.11
 by: Anton Ertl - Wed, 31 Jan 2024 17:17 UTC

Michael S <already5chosen@yahoo.com> writes:
>I have very little to add to the very good response by Anton.
>That little addition is: most if not all Rowhammer POC examples rely
>on CLFLUSH. This is what the manual says about it:
>"Executions of the CLFLUSH instruction are ordered with respect to each
>other and with respect to writes, locked read-modify-write
>instructions, fence instructions, and executions of CLFLUSHOPT to the
>same cache line. They are not ordered with respect to executions of
>CLFLUSHOPT to different cache lines."
>
>By now, it seems obvious that making CLFLUSH instruction non-privilaged
>and pretty much non-restricted by memory range/page attributes was a
>mistake, but that mistake can't be fixed without breaking things.
>Considering that CLFLUSH exists since very early 2000s, it is
>understandable.
>IIRC, ARMv8 did the same mistake a decade later. It is less
>understandable.

Ideally caches are fully transparent microarchitecture, then you don't
need stuff like CLFLUSH. My guess is that CLFLUSH is there for
getting DRAM up-to-date for DMA from I/O devices. An alternative
would be to let the memory controller remember which lines are
modified, and if the I/O device asks for that line, get the up-to-date
data from the cache line using the cache-consistency protocol. This
would turn CLFLUSH into a noop (at least as far as writing to DRAM is
concerned, the ordering constraints may still be relevant), so there
is a way to fix this mistake (if it is one).

However, AFAIK this is insufficient for fixing Rowhammer. Caches have
relatively limited associativity, up to something like 16-way
set-associativity, so if you write to the same set 17 times, you are
guaranteed to miss the cache. With 3 levels of cache you may need 49
accesses (probably less), but I expect that the resulting DRAM
accesses to a cache line are still not rare enough that Rowhammer
cannot happen.

The first paper on Rowhammer already outlined how the memory
controller could count how often adjacent DRAM rows are accessed and
thus weaken the row under consideration. This approach needs a little
adjustment for Double Rowhammer and not immediately neighbouring rows,
but otherwise seems to me to be the way to go. With autorefresh in
the DRAM devices these days, the DRAM manufacturers could implement
this on their own, without needing to coordinate with memory
controller designers. But apparently they think that the customers
don't care, so they can save the expense.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Rowhammer and CLFLUSH

https://news.novabbs.org/devel/article-flat.php?id=37162&group=comp.arch#37162

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Rowhammer and CLFLUSH
Date: Wed, 31 Jan 2024 20:12:15 +0000
Organization: novaBBS
Message-ID: <6e4daf9e72082682c2fe92fa3054ccad@www.novabbs.com>
References: <upb8i3$12emv$1@dont-email.me> <upcr4t$1drbk$1@dont-email.me> <20240131131353.0000688c@yahoo.com> <2024Jan31.181721@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="1235007"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Spam-Checker-Version: SpamAssassin 4.0.0
X-Rslight-Site: $2y$10$GYsHGZz2ZsWYaa1UOYjhse.WacVHI5KmfRn/m3sDyH5IM/RlrUmVm
X-Rslight-Posting-User: ebd9cd10a9ebda631fbccab5347a0f771d5a2118
 by: MitchAlsup - Wed, 31 Jan 2024 20:12 UTC

Anton Ertl wrote:

> Michael S <already5chosen@yahoo.com> writes:
>>I have very little to add to very good response by Anton.
>>That little addition is: the most if not all Rowhammer POC examples rely
>>on CLFLUSH. That's what the manual says about it:
>>"Executions of the CLFLUSH instruction are ordered with respect to each
>>other and with respect to writes, locked read-modify-write
>>instructions, fence instructions, and executions of CLFLUSHOPT to the
>>same cache line.1 They are not ordered with respect to executions of
>>CLFLUSHOPT to different cache lines."
>>
>>By now, it seems obvious that making CLFLUSH instruction non-privilaged
>>and pretty much non-restricted by memory range/page attributes was a
>>mistake, but that mistake can't be fixed without breaking things.
>>Considering that CLFLUSH exists since very early 2000s, it is
>>understandable.
>>IIRC, ARMv8 did the same mistake a decade later. It is less
>>understandable.

> Ideally caches are fully transparent microarchitecture, then you don't
> need stuff like CLFLUSH. My guess is that CLFLUSH is there for
> getting DRAM up-to-date for DMA from I/O devices.

I have wondered for a while why device access is not to coherent
space. If it were, then no CLFLUSH functionality would be needed; I/O
could just read/write an address and always get the freshest copy.
{{Maybe not the device itself, but the PCIe Root could translate from
device access space to memory access space (coherent).}}
> An alternative
> would be to let the memory controller remember which lines are
> modified, and if the I/O device asks for that line, get the up-to-date
> data from the cache line using the cache-consistency protocol. This
> would turn CLFLUSH into a noop (at least as far as writing to DRAM is
> concerned, the ordering constraints may still be relevant), so there
> is a way to fix this mistake (if it is one).

> However, AFAIK this is insufficient for fixing Rowhammer.

If L3 (LLC) is not a processor cache but a great big read/write buffer
for DRAM, then Rowhammering is significantly harder to accomplish.
> Caches have
> relatively limited associativity, up to something like 16-way
> set-associativity, so if you write to the same set 17 times, you are
> guaranteed to miss the cache. With 3 levels of cache you may need 49
> accesses (probably less), but I expect that the resulting DRAM
> accesses to a cache line are still not rare enough that Rowhammer
> cannot happen.

Rowhammer happens when you beat on the same cache line multiple times,
causing a charge-sharing problem on the word lines. Every time you cause
the DRAM to precharge (deactivate), you lose the count of how many times
you have to bang on the same word line to disrupt the stored cells.

So, the trick is to detect the Rowhammering and insert refresh commands.

> The first paper on Rowhammer already outlined how the memory
> controller could count how often adjacent DRAM rows are accessed and
> thus weaken the row under consideration. This approach needs a little
> adjustment for Double Rowhammer and not immediately neighbouring rows,
> but otherwise seems to me to be the way to go. With autorefresh in
> the DRAM devices these days, the DRAM manufacturers could implement
> this on their own, without needing to coordinate with memory
> controller designers. But apparently they think that the customers
> don't care, so they can save the expense.

> - anton

Re: Rowhammer and CLFLUSH (was: AMD Cache speed funny)

https://news.novabbs.org/devel/article-flat.php?id=37163&group=comp.arch#37163

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: already5chosen@yahoo.com (Michael S)
Newsgroups: comp.arch
Subject: Re: Rowhammer and CLFLUSH (was: AMD Cache speed funny)
Date: Wed, 31 Jan 2024 22:49:15 +0200
Organization: A noiseless patient Spider
Lines: 93
Message-ID: <20240131224915.000063d9@yahoo.com>
References: <upb8i3$12emv$1@dont-email.me>
<upcr4t$1drbk$1@dont-email.me>
<20240131131353.0000688c@yahoo.com>
<2024Jan31.181721@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Info: dont-email.me; posting-host="458330ddd435bd841a14cd8bf85c2e26";
logging-data="1769985"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/DN1UN4C4Hv0yxDiNHMbuLig31INxJkao="
Cancel-Lock: sha1:LJzvPi5d4c+4I8vWhJV978XWVRw=
X-Newsreader: Claws Mail 4.1.1 (GTK 3.24.34; x86_64-w64-mingw32)
 by: Michael S - Wed, 31 Jan 2024 20:49 UTC

On Wed, 31 Jan 2024 17:17:21 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> Michael S <already5chosen@yahoo.com> writes:
> >I have very little to add to very good response by Anton.
> >That little addition is: the most if not all Rowhammer POC examples
> >rely on CLFLUSH. That's what the manual says about it:
> >"Executions of the CLFLUSH instruction are ordered with respect to
> >each other and with respect to writes, locked read-modify-write
> >instructions, fence instructions, and executions of CLFLUSHOPT to the
> >same cache line.1 They are not ordered with respect to executions of
> >CLFLUSHOPT to different cache lines."
> >
> >By now, it seems obvious that making CLFLUSH instruction
> >non-privilaged and pretty much non-restricted by memory range/page
> >attributes was a mistake, but that mistake can't be fixed without
> >breaking things. Considering that CLFLUSH exists since very early
> >2000s, it is understandable.
> >IIRC, ARMv8 did the same mistake a decade later. It is less
> >understandable.
>
> Ideally caches are fully transparent microarchitecture, then you don't
> need stuff like CLFLUSH. My guess is that CLFLUSH is there for
> getting DRAM up-to-date for DMA from I/O devices. An alternative
> would be to let the memory controller remember which lines are
> modified, and if the I/O device asks for that line, get the up-to-date
> data from the cache line using the cache-consistency protocol.

Considering that CLFLUSH was introduced by Intel in year 2000 or 2001
and that at that time all Intel's PCI/AGP root hubs were already fully
I/O-coherent for several years, I find your theory unlikely.

Myself, I don't know the original reason, but I do know a use case
where CLFLUSH, while not strictly necessary, simplifies things greatly:
entering a deep sleep state in which the CPU caches are powered down and
DRAM is put in self-refresh mode.

Of course, this particular use case does not require a *non-privileged*
CLFLUSH, so obviously Intel had a different reason.

> This
> would turn CLFLUSH into a noop (at least as far as writing to DRAM is
> concerned, the ordering constraints may still be relevant), so there
> is a way to fix this mistake (if it is one).
>
> However, AFAIK this is insufficient for fixing Rowhammer. Caches have
> relatively limited associativity, up to something like 16-way
> set-associativity, so if you write to the same set 17 times, you are
> guaranteed to miss the cache. With 3 levels of cache you may need 49
> accesses (probably less), but I expect that the resulting DRAM
> accesses to a cache line are still not rare enough that Rowhammer
> cannot happen.
>

Original RH required a very high hammering rate that certainly can't be
achieved by playing with the associativity of the L3 cache.

Newer multi-sided hammering probably can do it in theory, but it would be
very difficult in practice.

Today we have yet another variant, called RowPress, that bypasses TRR
mitigation more reliably than multi-sided RH. I think this one would be
practically impossible without CLFLUSH, esp. when the system under attack
carries out other DRAM accesses in parallel with the attacker's code.

> The first paper on Rowhammer already outlined how the memory
> controller could count how often adjacent DRAM rows are accessed and
> thus weaken the row under consideration. This approach needs a little
> adjustment for Double Rowhammer and not immediately neighbouring rows,
> but otherwise seems to me to be the way to go.

IMHO, all these solutions are pure fantasy, because the memory
controller does not even know which rows are physically adjacent. POC
authors typically run lengthy tests in order to figure it out.

> With autorefresh in
> the DRAM devices these days, the DRAM manufacturers could implement
> this on their own, without needing to coordinate with memory
> controller designers. But apparently they think that the customers
> don't care, so they can save the expense.
>
> - anton

They cared enough to implement the simplest of proposed solutions - TRR.
Yes, it was quickly found insufficient, but at least there was a
demonstration of good intentions.

Re: Rowhammer and CLFLUSH

https://news.novabbs.org/devel/article-flat.php?id=37165&group=comp.arch#37165

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Rowhammer and CLFLUSH
Date: Wed, 31 Jan 2024 23:22:38 +0000
Organization: novaBBS
Message-ID: <aba96427866214c1b2981a75ecc45db7@www.novabbs.com>
References: <upb8i3$12emv$1@dont-email.me> <upcr4t$1drbk$1@dont-email.me> <20240131131353.0000688c@yahoo.com> <2024Jan31.181721@mips.complang.tuwien.ac.at> <20240131224915.000063d9@yahoo.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="1250463"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Rslight-Posting-User: ebd9cd10a9ebda631fbccab5347a0f771d5a2118
X-Spam-Checker-Version: SpamAssassin 4.0.0
X-Rslight-Site: $2y$10$N0JST5qB3PSRJc1C7lgxD.7py7y422f4/6FYy9M0gStJuLDxgGBnG
 by: MitchAlsup - Wed, 31 Jan 2024 23:22 UTC

Michael S wrote:

> On Wed, 31 Jan 2024 17:17:21 GMT
> anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

>> Michael S <already5chosen@yahoo.com> writes:
>> >I have very little to add to very good response by Anton.
>> >That little addition is: the most if not all Rowhammer POC examples
>> >rely on CLFLUSH. That's what the manual says about it:
>> >"Executions of the CLFLUSH instruction are ordered with respect to
>> >each other and with respect to writes, locked read-modify-write
>> >instructions, fence instructions, and executions of CLFLUSHOPT to the
>> >same cache line.1 They are not ordered with respect to executions of
>> >CLFLUSHOPT to different cache lines."
>> >
>> >By now, it seems obvious that making CLFLUSH instruction
>> >non-privilaged and pretty much non-restricted by memory range/page
>> >attributes was a mistake, but that mistake can't be fixed without
>> >breaking things. Considering that CLFLUSH exists since very early
>> >2000s, it is understandable.
>> >IIRC, ARMv8 did the same mistake a decade later. It is less
>> >understandable.
>>
>> Ideally caches are fully transparent microarchitecture, then you don't
>> need stuff like CLFLUSH. My guess is that CLFLUSH is there for
>> getting DRAM up-to-date for DMA from I/O devices. An alternative
>> would be to let the memory controller remember which lines are
>> modified, and if the I/O device asks for that line, get the up-to-date
>> data from the cache line using the cache-consistency protocol.

> Considering that CLFLUSH was introduced by Intel in year 2000 or 2001
> and that at that time all Intel's PCI/AGP root hubs were already fully
> I/O-coherent for several years, I find your theory unlikely.

> Myself, I don't know the original reason, but I do know a use case
> where CLFLUSH, while not strictly necessary, simplifies things greatly
> - entering deep sleep state in which CPU caches are powered down and
> DRAM put in self-refresh mode.

> Of course, this particular use case does not require *non-priviledged*
> CLFLUSH, so obviously Intel had different reason.

There was no assumption that this could result in a side channel or
attack vector at the time of its non-privileged inclusion. Afterwards
there was no reason to make it privileged until 2017, and by then the
ability to do anything about it had vanished.

Me, personally, I see this as a violation of the principle that the
cache is there to reduce memory latency and thereby improve performance.

>> This
>> would turn CLFLUSH into a noop (at least as far as writing to DRAM is
>> concerned, the ordering constraints may still be relevant), so there
>> is a way to fix this mistake (if it is one).
>>
>> However, AFAIK this is insufficient for fixing Rowhammer. Caches have
>> relatively limited associativity, up to something like 16-way
>> set-associativity, so if you write to the same set 17 times, you are
>> guaranteed to miss the cache. With 3 levels of cache you may need 49
>> accesses (probably less), but I expect that the resulting DRAM
>> accesses to a cache line are still not rare enough that Rowhammer
>> cannot happen.
>>

> Original RH required very high hammering rate that certainly can't be
> achieved by playing with associativity of L3 cache.

> Newer multiside hammering probably can do it in theory, but it would be
> very difficult in practice.

The problem here is the fact that DRAMs do not use linear decoders, so
address X and address X+1 do not necessarily have paired word lines.
The word lines could be as far as ½ the block away from each other.

The DRAM decoders are faster and smaller when there is a Gray-like code
imposed on the logical-address to physical-word-line mapping. This also
happens in SRAM decoders. Going back and looking at the most used
logical-to-physical mappings shows that while X and X+1 can
(occasionally) be side by side, X, X+1 and X+2 should never be 3 word
lines in a row.

> Today we have yet another variant called RowPress that bypasses TRR
> mitigation more reliably than mult-rate RH. I think this one would be
> practically impossible without CLFLUSH., esp. when system under attack
> carries other DRAM accesses in parallel with attackers code.

>> The first paper on Rowhammer already outlined how the memory
>> controller could count how often adjacent DRAM rows are accessed and
>> thus weaken the row under consideration. This approach needs a little
>> adjustment for Double Rowhammer and not immediately neighbouring rows,
>> but otherwise seems to me to be the way to go.

> IMHO, all these solutions are pure fantasy, because the memory
> controller does not even know which rows are physically adjacent.

Different DIMMs and even different DRAMs on the same DIMM may not
share that correspondence. {There is a lot of bit line and a little
word line repair done at the tester.}

> POC authors
> typically run lengthy tests in order to figure it out.

>> With autorefresh in
>> the DRAM devices these days, the DRAM manufacturers could implement
>> this on their own, without needing to coordinate with memory
>> controller designers. But apparently they think that the customers
>> don't care, so they can save the expense.
>>
>> - anton

> They cared enough to implement the simplest of proposed solutions - TRR.
> Yes, it was quickly found insufficient, but at least there was a
> demonstration of good intentions.

Re: AMD Cache speed funny

https://news.novabbs.org/devel/article-flat.php?id=37173&group=comp.arch#37173

Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!3.eu.feeder.erje.net!feeder.erje.net!newsreader4.netcologne.de!news.netcologne.de!peer01.ams1!peer.ams1.xlned.com!news.xlned.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!Xl.tags.giganews.com!local-1.nntp.ord.giganews.com!nntp.supernews.com!news.supernews.com.POSTED!not-for-mail
NNTP-Posting-Date: Thu, 01 Feb 2024 09:39:13 +0000
Sender: Andrew Haley <aph@zarquon.pink>
From: aph@littlepinkcloud.invalid
Subject: Re: AMD Cache speed funny
Newsgroups: comp.arch
References: <upb8i3$12emv$1@dont-email.me> <upcr4t$1drbk$1@dont-email.me> <20240131131353.0000688c@yahoo.com>
User-Agent: tin/1.9.2-20070201 ("Dalaruan") (UNIX) (Linux/4.18.0-477.27.1.el8_8.x86_64 (x86_64))
Message-ID: <P-qdnZrfj5Fc-yb4nZ2dnZfqn_SdnZ2d@supernews.com>
Date: Thu, 01 Feb 2024 09:39:13 +0000
Lines: 23
X-Trace: sv3-IgoqdE6rCl57AHfg0ARskDoG0/W6jnvp2ki0Qe0ecgEV/U0eI9q/X2FAIESlqv3T7vc0J/KlfdTPyBC!hAu7RmusBA9sMai4FVELMfQUi2L3XCAeT2/DyY6SInvOzNw2HJ8dO0zU4EYITXqZmR1jZsF8wEJ4!op3Vj94LnGM=
X-Complaints-To: www.supernews.com/docs/abuse.html
X-DMCA-Complaints-To: www.supernews.com/docs/dmca.html
X-Abuse-and-DMCA-Info: Please be sure to forward a copy of ALL headers
X-Abuse-and-DMCA-Info: Otherwise we will be unable to process your complaint properly
X-Postfilter: 1.3.40
X-Received-Bytes: 2306
 by: aph@littlepinkcloud.invalid - Thu, 1 Feb 2024 09:39 UTC

Michael S <already5chosen@yahoo.com> wrote:
>
> By now, it seems obvious that making CLFLUSH instruction non-privilaged
> and pretty much non-restricted by memory range/page attributes was a
> mistake, but that mistake can't be fixed without breaking things.
> Considering that CLFLUSH exists since very early 2000s, it is
> understandable.
> IIRC, ARMv8 did the same mistake a decade later. It is less
> understandable.

For Arm, with its non-coherent data and instruction caches, you need
some way to flush the dcache to the point of unification in order to make
instruction changes visible. Also, regardless of icache coherence, when
using non-volatile memory you need an efficient way to flush the dcache
to the point of persistence. You need that in order to make sure that a
transaction has been written to a log.

With the latter, you could restrict dcache flushes to pages with a
particular non-volatile attribute. I don't think there's anything you
can do about the former, short of simply making i- and d-cache
coherent. Which is a good idea, but not everyone does it.

Andrew.

Re: AMD Cache speed funny

https://news.novabbs.org/devel/article-flat.php?id=37176&group=comp.arch#37176

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: already5chosen@yahoo.com (Michael S)
Newsgroups: comp.arch
Subject: Re: AMD Cache speed funny
Date: Thu, 1 Feb 2024 15:36:46 +0200
Organization: A noiseless patient Spider
Lines: 41
Message-ID: <20240201153646.00006e78@yahoo.com>
References: <upb8i3$12emv$1@dont-email.me>
<upcr4t$1drbk$1@dont-email.me>
<20240131131353.0000688c@yahoo.com>
<P-qdnZrfj5Fc-yb4nZ2dnZfqn_SdnZ2d@supernews.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Info: dont-email.me; posting-host="3c6a247926d5d4ebdb68b70146492706";
logging-data="2204384"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+wEmVk2ll5F05GuA8ci4O8Dl54ukQIGHU="
Cancel-Lock: sha1:e4xhEFFNHMReL1PFUkcGJk7IZH4=
X-Newsreader: Claws Mail 3.19.1 (GTK+ 2.24.33; x86_64-w64-mingw32)
 by: Michael S - Thu, 1 Feb 2024 13:36 UTC

On Thu, 01 Feb 2024 09:39:13 +0000
aph@littlepinkcloud.invalid wrote:

> Michael S <already5chosen@yahoo.com> wrote:
> >
> > By now, it seems obvious that making CLFLUSH instruction
> > non-privilaged and pretty much non-restricted by memory range/page
> > attributes was a mistake, but that mistake can't be fixed without
> > breaking things. Considering that CLFLUSH exists since very early
> > 2000s, it is understandable.
> > IIRC, ARMv8 did the same mistake a decade later. It is less
> > understandable.
>
> For Arm, with its non-coherent data and instruction caches, you need
> some way to flush dcache to the point of unification in order to make
> instruction changes visible. Also, regardless of icache coherence,
> when using non-volatile memory you need an efficient way to flush
> dcache to the point of peristence. You need that in order to make
> sure that a transaction has been written to a log.
>
> With the latter, you could restrict dcache flushes to pages with a
> particular non-volatile attribute. I don't think there's anything you
> can do about the former, short of simply making i- and d-cache
> coherent.

For the latter, a privileged flush instruction sounds sufficient.

For the former, ARMv8 appears to have a special instruction (or you can
call it a special variant of the DC instruction): Clean by virtual address
to point of unification (DC CVAU). This instruction alone would not
make an RH attack much easier. The problem is that the privilege control
of this instruction is governed by the same bit as that of two much more
dangerous variants of DC (DC CVAC and DC CIVAC).

> Which is a good idea, but not everyone does it.
>
> Andrew.

Neoverse N1 had it. I don't know about the rest of the Neoverse series.

Re: Rowhammer and CLFLUSH

https://news.novabbs.org/devel/article-flat.php?id=37177&group=comp.arch#37177

Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx09.iad.POSTED!not-for-mail
From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Rowhammer and CLFLUSH
References: <upb8i3$12emv$1@dont-email.me> <upcr4t$1drbk$1@dont-email.me> <20240131131353.0000688c@yahoo.com> <2024Jan31.181721@mips.complang.tuwien.ac.at>
In-Reply-To: <2024Jan31.181721@mips.complang.tuwien.ac.at>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 39
Message-ID: <VyNuN.188050$yEgf.160511@fx09.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Thu, 01 Feb 2024 14:05:41 UTC
Date: Thu, 01 Feb 2024 09:05:19 -0500
X-Received-Bytes: 2750
 by: EricP - Thu, 1 Feb 2024 14:05 UTC

Anton Ertl wrote:
> Michael S <already5chosen@yahoo.com> writes:
>> I have very little to add to very good response by Anton.
>> That little addition is: the most if not all Rowhammer POC examples rely
>> on CLFLUSH. That's what the manual says about it:
>> "Executions of the CLFLUSH instruction are ordered with respect to each
>> other and with respect to writes, locked read-modify-write
>> instructions, fence instructions, and executions of CLFLUSHOPT to the
>> same cache line.1 They are not ordered with respect to executions of
>> CLFLUSHOPT to different cache lines."
>>
>> By now, it seems obvious that making CLFLUSH instruction non-privilaged
>> and pretty much non-restricted by memory range/page attributes was a
>> mistake, but that mistake can't be fixed without breaking things.
>> Considering that CLFLUSH exists since very early 2000s, it is
>> understandable.
>> IIRC, ARMv8 did the same mistake a decade later. It is less
>> understandable.
>
> Ideally caches are fully transparent microarchitecture, then you don't
> need stuff like CLFLUSH. My guess is that CLFLUSH is there for
> getting DRAM up-to-date for DMA from I/O devices. An alternative
> would be to let the memory controller remember which lines are
> modified, and if the I/O device asks for that line, get the up-to-date
> data from the cache line using the cache-consistency protocol. This
> would turn CLFLUSH into a noop (at least as far as writing to DRAM is
> concerned, the ordering constraints may still be relevant), so there
> is a way to fix this mistake (if it is one).

The text in the Intel Vol. 1 Architecture manual indicates they viewed
all these cache control instructions (PREFETCH, CLFLUSH, and CLFLUSHOPT)
as part of SSE, for use by graphics applications that want to take
manual control of their caching and minimize cache pollution.

Note that the non-temporal move instructions MOVNTxx were also part of
that SSE bunch and could also be used to force a write to DRAM.

Re: Rowhammer and CLFLUSH

https://news.novabbs.org/devel/article-flat.php?id=37178&group=comp.arch#37178

Path: i2pn2.org!i2pn.org!news.1d4.us!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx16.iad.POSTED!not-for-mail
From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Rowhammer and CLFLUSH
References: <upb8i3$12emv$1@dont-email.me> <upcr4t$1drbk$1@dont-email.me> <20240131131353.0000688c@yahoo.com> <2024Jan31.181721@mips.complang.tuwien.ac.at> <20240131224915.000063d9@yahoo.com>
In-Reply-To: <20240131224915.000063d9@yahoo.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 47
Message-ID: <PMNuN.258835$7sbb.122888@fx16.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Thu, 01 Feb 2024 14:20:31 UTC
Date: Thu, 01 Feb 2024 09:20:24 -0500
X-Received-Bytes: 3125
 by: EricP - Thu, 1 Feb 2024 14:20 UTC

Michael S wrote:
> On Wed, 31 Jan 2024 17:17:21 GMT
> anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>
>> Michael S <already5chosen@yahoo.com> writes:
>>> I have very little to add to very good response by Anton.
>>> That little addition is: the most if not all Rowhammer POC examples
>>> rely on CLFLUSH. That's what the manual says about it:
>>> "Executions of the CLFLUSH instruction are ordered with respect to
>>> each other and with respect to writes, locked read-modify-write
>>> instructions, fence instructions, and executions of CLFLUSHOPT to the
>>> same cache line.1 They are not ordered with respect to executions of
>>> CLFLUSHOPT to different cache lines."
>>>
>>> By now, it seems obvious that making CLFLUSH instruction
>>> non-privilaged and pretty much non-restricted by memory range/page
>>> attributes was a mistake, but that mistake can't be fixed without
>>> breaking things. Considering that CLFLUSH exists since very early
>>> 2000s, it is understandable.
>>> IIRC, ARMv8 did the same mistake a decade later. It is less
>>> understandable.
>> Ideally caches are fully transparent microarchitecture, then you don't
>> need stuff like CLFLUSH. My guess is that CLFLUSH is there for
>> getting DRAM up-to-date for DMA from I/O devices. An alternative
>> would be to let the memory controller remember which lines are
>> modified, and if the I/O device asks for that line, get the up-to-date
>> data from the cache line using the cache-consistency protocol.
>
> Considering that CLFLUSH was introduced by Intel in year 2000 or 2001
> and that at that time all Intel's PCI/AGP root hubs were already fully
> I/O-coherent for several years, I find your theory unlikely.
>
> Myself, I don't know the original reason, but I do know a use case
> where CLFLUSH, while not strictly necessary, simplifies things greatly
> - entering deep sleep state in which CPU caches are powered down and
> DRAM put in self-refresh mode.

CLFLUSH wouldn't be useful for that, as it flushes by virtual address.
It also allows all sorts of reorderings that you don't want to think
about during a (possibly emergency) cache sync.

The privileged WBINVD and WBNOINVD instructions are intended for that.
It sounds like they basically halt the core for the duration of the
write-back of all modified lines.

Re: Rowhammer and CLFLUSH

https://news.novabbs.org/devel/article-flat.php?id=37179&group=comp.arch#37179

Path: i2pn2.org!i2pn.org!news.swapon.de!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: already5chosen@yahoo.com (Michael S)
Newsgroups: comp.arch
Subject: Re: Rowhammer and CLFLUSH
Date: Thu, 1 Feb 2024 16:30:27 +0200
Organization: A noiseless patient Spider
Lines: 50
Message-ID: <20240201163027.000003f4@yahoo.com>
References: <upb8i3$12emv$1@dont-email.me>
<upcr4t$1drbk$1@dont-email.me>
<20240131131353.0000688c@yahoo.com>
<2024Jan31.181721@mips.complang.tuwien.ac.at>
<VyNuN.188050$yEgf.160511@fx09.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Info: dont-email.me; posting-host="3c6a247926d5d4ebdb68b70146492706";
logging-data="2204384"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+qzUPLBfMkBrep3iE750d6Jeyj9Ef72sQ="
Cancel-Lock: sha1:Bp2jnCxBIYojHXbmb+ayHK62TCM=
X-Newsreader: Claws Mail 3.19.1 (GTK+ 2.24.33; x86_64-w64-mingw32)
 by: Michael S - Thu, 1 Feb 2024 14:30 UTC

On Thu, 01 Feb 2024 09:05:19 -0500
EricP <ThatWouldBeTelling@thevillage.com> wrote:

> Anton Ertl wrote:
> > Michael S <already5chosen@yahoo.com> writes:
> >> I have very little to add to very good response by Anton.
> >> That little addition is: the most if not all Rowhammer POC
> >> examples rely on CLFLUSH. That's what the manual says about it:
> >> "Executions of the CLFLUSH instruction are ordered with respect to
> >> each other and with respect to writes, locked read-modify-write
> >> instructions, fence instructions, and executions of CLFLUSHOPT to
> >> the same cache line.1 They are not ordered with respect to
> >> executions of CLFLUSHOPT to different cache lines."
> >>
> >> By now, it seems obvious that making CLFLUSH instruction
> >> non-privilaged and pretty much non-restricted by memory range/page
> >> attributes was a mistake, but that mistake can't be fixed without
> >> breaking things. Considering that CLFLUSH exists since very early
> >> 2000s, it is understandable.
> >> IIRC, ARMv8 did the same mistake a decade later. It is less
> >> understandable.
> >
> > Ideally caches are fully transparent microarchitecture, then you
> > don't need stuff like CLFLUSH. My guess is that CLFLUSH is there
> > for getting DRAM up-to-date for DMA from I/O devices. An
> > alternative would be to let the memory controller remember which
> > lines are modified, and if the I/O device asks for that line, get
> > the up-to-date data from the cache line using the cache-consistency
> > protocol. This would turn CLFLUSH into a noop (at least as far as
> > writing to DRAM is concerned, the ordering constraints may still be
> > relevant), so there is a way to fix this mistake (if it is one).
>
> The text in Intel Vol 1 Architecture manual indicates they viewed all
> these cache control instruction PREFETCH, CLFLUSH, and CLFLUSHOPT
> as part of SSE for use by graphics applications that want to take
> manual control of their caching and minimize cache pollution.
>
> Note that the non-temporal move instructions MOVNTxx were also part of
> that SSE bunch and could also be used to force a write to DRAM.
>

According to Wikipedia, CLFLUSH was not introduced with SSE.
It was introduced together with SSE2, but formally it is not part of it.
CLFLUSHOPT came much, much later and was likely related to the Optane
DIMM aspirations of the late 2010s.

Re: Rowhammer and CLFLUSH

<uph03r$27kba$2@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37180&group=comp.arch#37180

From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Rowhammer and CLFLUSH
Date: Thu, 1 Feb 2024 12:48:59 -0800
Organization: A noiseless patient Spider
Lines: 43
Message-ID: <uph03r$27kba$2@dont-email.me>
References: <upb8i3$12emv$1@dont-email.me> <upcr4t$1drbk$1@dont-email.me>
<20240131131353.0000688c@yahoo.com>
<2024Jan31.181721@mips.complang.tuwien.ac.at>
<VyNuN.188050$yEgf.160511@fx09.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 1 Feb 2024 20:48:59 -0000 (UTC)
User-Agent: Mozilla Thunderbird
Content-Language: en-US
In-Reply-To: <VyNuN.188050$yEgf.160511@fx09.iad>
 by: Chris M. Thomasson - Thu, 1 Feb 2024 20:48 UTC

On 2/1/2024 6:05 AM, EricP wrote:
> Anton Ertl wrote:
>> Michael S <already5chosen@yahoo.com> writes:
>>> I have very little to add to very good response by Anton.
>>> That little addition is: the most if not all Rowhammer POC examples rely
>>> on CLFLUSH. That's what the manual says about it:
>>> "Executions of the CLFLUSH instruction are ordered with respect to each
>>> other and with respect to writes, locked read-modify-write
>>> instructions, fence instructions, and executions of CLFLUSHOPT to the
>>> same cache line.1 They are not ordered with respect to executions of
>>> CLFLUSHOPT to different cache lines."
>>>
>>> By now, it seems obvious that making CLFLUSH instruction non-privilaged
>>> and pretty much non-restricted by memory range/page attributes was a
>>> mistake, but that mistake can't be fixed without breaking things.
>>> Considering that CLFLUSH exists since very early 2000s, it is
>>> understandable.
>>> IIRC, ARMv8 did the same mistake a decade later. It is less
>>> understandable.
>>
>> Ideally caches are fully transparent microarchitecture, then you don't
>> need stuff like CLFLUSH.  My guess is that CLFLUSH is there for
>> getting DRAM up-to-date for DMA from I/O devices.  An alternative
>> would be to let the memory controller remember which lines are
>> modified, and if the I/O device asks for that line, get the up-to-date
>> data from the cache line using the cache-consistency protocol.  This
>> would turn CLFLUSH into a noop (at least as far as writing to DRAM is
>> concerned, the ordering constraints may still be relevant), so there
>> is a way to fix this mistake (if it is one).
>
> The text in Intel Vol 1 Architecture manual indicates they viewed all
> these cache control instruction PREFETCH, CLFLUSH, and CLFLUSHOPT
> as part of SSE for use by graphics applications that want to take
> manual control of their caching and minimize cache pollution.
>
> Note that the non-temporal move instructions MOVNTxx were also part of
> that SSE bunch and could also be used to force a write to DRAM.
>
>
>

Then there are the LFENCE, SFENCE and MFENCE for write back memory.
Non-temporal stores, iirc.

Re: Rowhammer and CLFLUSH

<uph05a$27kba$3@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=37181&group=comp.arch#37181

From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Rowhammer and CLFLUSH
Date: Thu, 1 Feb 2024 12:49:46 -0800
Organization: A noiseless patient Spider
Lines: 47
Message-ID: <uph05a$27kba$3@dont-email.me>
References: <upb8i3$12emv$1@dont-email.me> <upcr4t$1drbk$1@dont-email.me>
<20240131131353.0000688c@yahoo.com>
<2024Jan31.181721@mips.complang.tuwien.ac.at>
<VyNuN.188050$yEgf.160511@fx09.iad> <uph03r$27kba$2@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 1 Feb 2024 20:49:46 -0000 (UTC)
User-Agent: Mozilla Thunderbird
Content-Language: en-US
In-Reply-To: <uph03r$27kba$2@dont-email.me>
 by: Chris M. Thomasson - Thu, 1 Feb 2024 20:49 UTC

On 2/1/2024 12:48 PM, Chris M. Thomasson wrote:
> On 2/1/2024 6:05 AM, EricP wrote:
>> Anton Ertl wrote:
>>> Michael S <already5chosen@yahoo.com> writes:
>>>> I have very little to add to very good response by Anton.
>>>> That little addition is: the most if not all Rowhammer POC examples
>>>> rely
>>>> on CLFLUSH. That's what the manual says about it:
>>>> "Executions of the CLFLUSH instruction are ordered with respect to each
>>>> other and with respect to writes, locked read-modify-write
>>>> instructions, fence instructions, and executions of CLFLUSHOPT to the
>>>> same cache line.1 They are not ordered with respect to executions of
>>>> CLFLUSHOPT to different cache lines."
>>>>
>>>> By now, it seems obvious that making CLFLUSH instruction non-privilaged
>>>> and pretty much non-restricted by memory range/page attributes was a
>>>> mistake, but that mistake can't be fixed without breaking things.
>>>> Considering that CLFLUSH exists since very early 2000s, it is
>>>> understandable.
>>>> IIRC, ARMv8 did the same mistake a decade later. It is less
>>>> understandable.
>>>
>>> Ideally caches are fully transparent microarchitecture, then you don't
>>> need stuff like CLFLUSH.  My guess is that CLFLUSH is there for
>>> getting DRAM up-to-date for DMA from I/O devices.  An alternative
>>> would be to let the memory controller remember which lines are
>>> modified, and if the I/O device asks for that line, get the up-to-date
>>> data from the cache line using the cache-consistency protocol.  This
>>> would turn CLFLUSH into a noop (at least as far as writing to DRAM is
>>> concerned, the ordering constraints may still be relevant), so there
>>> is a way to fix this mistake (if it is one).
>>
>> The text in Intel Vol 1 Architecture manual indicates they viewed all
>> these cache control instruction PREFETCH, CLFLUSH, and CLFLUSHOPT
>> as part of SSE for use by graphics applications that want to take
>> manual control of their caching and minimize cache pollution.
>>
>> Note that the non-temporal move instructions MOVNTxx were also part of
>> that SSE bunch and could also be used to force a write to DRAM.
>>
>>
>>
>
> Then there are the LFENCE, SFENCE and MFENCE for write back memory.
> Non-temporal stores, iirc.

Oops, non-write back memory! IIRC. Sorry.

Re: AMD Cache speed funny

<Pb6cnSOq7p98XCH4nZ2dnZfqnPqdnZ2d@supernews.com>

https://news.novabbs.org/devel/article-flat.php?id=37183&group=comp.arch#37183

NNTP-Posting-Date: Fri, 02 Feb 2024 10:20:17 +0000
Sender: Andrew Haley <aph@zarquon.pink>
From: aph@littlepinkcloud.invalid
Subject: Re: AMD Cache speed funny
Newsgroups: comp.arch
References: <upb8i3$12emv$1@dont-email.me> <upcr4t$1drbk$1@dont-email.me> <20240131131353.0000688c@yahoo.com> <P-qdnZrfj5Fc-yb4nZ2dnZfqn_SdnZ2d@supernews.com> <20240201153646.00006e78@yahoo.com>
User-Agent: tin/1.9.2-20070201 ("Dalaruan") (UNIX) (Linux/4.18.0-477.27.1.el8_8.x86_64 (x86_64))
Message-ID: <Pb6cnSOq7p98XCH4nZ2dnZfqnPqdnZ2d@supernews.com>
Date: Fri, 02 Feb 2024 10:20:17 +0000
Lines: 43
 by: aph@littlepinkcloud.invalid - Fri, 2 Feb 2024 10:20 UTC

Michael S <already5chosen@yahoo.com> wrote:
> On Thu, 01 Feb 2024 09:39:13 +0000
> aph@littlepinkcloud.invalid wrote:
>
>> Michael S <already5chosen@yahoo.com> wrote:
>> >
>> > By now, it seems obvious that making CLFLUSH instruction
>> > non-privilaged and pretty much non-restricted by memory range/page
>> > attributes was a mistake, but that mistake can't be fixed without
>> > breaking things. Considering that CLFLUSH exists since very early
>> > 2000s, it is understandable.
>> > IIRC, ARMv8 did the same mistake a decade later. It is less
>> > understandable.
>>
>> For Arm, with its non-coherent data and instruction caches, you need
>> some way to flush dcache to the point of unification in order to make
>> instruction changes visible. Also, regardless of icache coherence,
>> when using non-volatile memory you need an efficient way to flush
>> dcache to the point of peristence. You need that in order to make
>> sure that a transaction has been written to a log.
>>
>> With the latter, you could restrict dcache flushes to pages with a
>> particular non-volatile attribute. I don't think there's anything you
>> can do about the former, short of simply making i- and d-cache
>> coherent.
>
> For the later, privileged flush instruction sounds sufficient.

Does it? You're trying for high throughput, and a full system call
wouldn't help with that. And besides, if userspace can ask the kernel to
do something on its behalf, you haven't added any security by making
it privileged.

> For the former, ARMv8 appears to have a special instruction (or you can
> call it a special variant of DC instruction) - Clean by virtual address
> to point of unification (DC CVAU). This instruction alone would not
> make RH attack much easier. The problem is that privilagability of this
> instruction controlled by the same bit as privilagability of two much
> more dangerous variations of DC (DC CVAC and DC CIVAC).

Ah, thanks.

Andrew.

Re: Rowhammer and CLFLUSH

<5g9vN.55388$6ePe.50431@fx42.iad>

https://news.novabbs.org/devel/article-flat.php?id=37184&group=comp.arch#37184

From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Rowhammer and CLFLUSH
References: <upb8i3$12emv$1@dont-email.me> <upcr4t$1drbk$1@dont-email.me> <20240131131353.0000688c@yahoo.com> <2024Jan31.181721@mips.complang.tuwien.ac.at> <6e4daf9e72082682c2fe92fa3054ccad@www.novabbs.com>
In-Reply-To: <6e4daf9e72082682c2fe92fa3054ccad@www.novabbs.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 25
Message-ID: <5g9vN.55388$6ePe.50431@fx42.iad>
NNTP-Posting-Date: Fri, 02 Feb 2024 17:04:01 UTC
Date: Fri, 02 Feb 2024 12:03:41 -0500
 by: EricP - Fri, 2 Feb 2024 17:03 UTC

MitchAlsup wrote:
> Anton Ertl wrote:
>
>
> Rowhammer happens when you beat on the same cache line multiple times
> {causing a charge sharing problem on the word lines. Every time you cause
> the DRAM to precharge (deActivate) you lose the count on how many times
> you have to bang on the same word line to disrupt the stored cells.
>
> So, the trick is to detect the RowHammering and insert refresh commands.

It's not just the immediately physically adjacent rows -
I think I read that the effect falls off for up to +-3 rows away.

Also it may be data dependent - 0's bleed into adjacent 1's and 1's into 0's.

And the threshold at which it triggers has been dropping as DRAMs become
more dense. In 2014, when this was first encountered, it took 139K
activations. By 2020 that was down to 4.8K.

So figuring out how much a row has been damaged is complicated,
and the window for detecting it is getting smaller.

Re: Rowhammer and CLFLUSH

<br9vN.95150$STLe.2753@fx34.iad>

https://news.novabbs.org/devel/article-flat.php?id=37185&group=comp.arch#37185

From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Rowhammer and CLFLUSH
References: <upb8i3$12emv$1@dont-email.me> <upcr4t$1drbk$1@dont-email.me> <20240131131353.0000688c@yahoo.com> <2024Jan31.181721@mips.complang.tuwien.ac.at> <20240131224915.000063d9@yahoo.com> <aba96427866214c1b2981a75ecc45db7@www.novabbs.com>
In-Reply-To: <aba96427866214c1b2981a75ecc45db7@www.novabbs.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 32
Message-ID: <br9vN.95150$STLe.2753@fx34.iad>
NNTP-Posting-Date: Fri, 02 Feb 2024 17:15:51 UTC
Date: Fri, 02 Feb 2024 12:15:21 -0500
 by: EricP - Fri, 2 Feb 2024 17:15 UTC

MitchAlsup wrote:
> Michael S wrote:
>
>> Original RH required very high hammering rate that certainly can't be
>> achieved by playing with associativity of L3 cache.
>
>> Newer multiside hammering probably can do it in theory, but it would be
>> very difficult in practice.
>
> The problem here is the fact that DRAMs do not use linear decoders, so
> address X and address X+1 do not necessarily shared paired word lines.
> The word lines could be as far as ½ the block away from each other.
>
> The DRAM decoders are faster and smaller when there is a grey-like-code
> imposed on the logical-address to physical-word-line. This also happens
> in SRAM decoders. Going back and looking at the most used logical to
> physical mapping shows that while X and X+1 can (occasionally) be side
> by side, X, X+1 and X+2 should never be 3 words lines in a row.

A 16 Gb DRAM with 8 Kb rows has 2^21 = 2 million rows,
so having a counter for each row is impractical.

I was wondering if each row could have a "canary" bit,
a specially weakened bit that always flips early.
This would also intrinsically handle the cases of effects
falling off over the +-3 adjacent rows.

Then a giant 2 million input OR gate would tell us if any row's
canary had flipped.

Re: Rowhammer and CLFLUSH

<upjb4a$3gug8$1@newsreader4.netcologne.de>

https://news.novabbs.org/devel/article-flat.php?id=37186&group=comp.arch#37186

From: tkoenig@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Rowhammer and CLFLUSH
Date: Fri, 2 Feb 2024 18:09:14 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <upjb4a$3gug8$1@newsreader4.netcologne.de>
References: <upb8i3$12emv$1@dont-email.me> <upcr4t$1drbk$1@dont-email.me>
<20240131131353.0000688c@yahoo.com>
<2024Jan31.181721@mips.complang.tuwien.ac.at>
<20240131224915.000063d9@yahoo.com>
<aba96427866214c1b2981a75ecc45db7@www.novabbs.com>
<br9vN.95150$STLe.2753@fx34.iad>
Injection-Date: Fri, 2 Feb 2024 18:09:14 -0000 (UTC)
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Fri, 2 Feb 2024 18:09 UTC

EricP <ThatWouldBeTelling@thevillage.com> schrieb:

> Then a giant 2 million input OR gate would tell us if any row's
> canary had flipped.

That would look... interesting.

How are large OR gates actually constructed? I would assume that an
eight-input OR gate could look something like

nand(nor(a,b),nor(c,d),nor(e,f),nor(g,h))

which would reduce the number of inputs by a factor of 2^3, so
seven layers of these OR gates would be needed.

Wiring would be interesting as well...

Re: Rowhammer and CLFLUSH

<6832320db42374bad77a88e774aaec5e@www.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=37187&group=comp.arch#37187

From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Rowhammer and CLFLUSH
Date: Fri, 2 Feb 2024 19:18:12 +0000
Organization: novaBBS
Message-ID: <6832320db42374bad77a88e774aaec5e@www.novabbs.com>
References: <upb8i3$12emv$1@dont-email.me> <upcr4t$1drbk$1@dont-email.me> <20240131131353.0000688c@yahoo.com> <2024Jan31.181721@mips.complang.tuwien.ac.at> <20240131224915.000063d9@yahoo.com> <aba96427866214c1b2981a75ecc45db7@www.novabbs.com> <br9vN.95150$STLe.2753@fx34.iad> <upjb4a$3gug8$1@newsreader4.netcologne.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
 by: MitchAlsup - Fri, 2 Feb 2024 19:18 UTC

Thomas Koenig wrote:

> EricP <ThatWouldBeTelling@thevillage.com> schrieb:

>> Then a giant 2 million input OR gate would tell us if any row's
>> canary had flipped.

> That would look... interesting.

> How are large OR gates actually constructed? I would assume that an
> eight-input OR gate could look something like

> nand(nor(a,b),nor(c,d),nor(e,f),nor(g,h))

Close, but NANDs come with 4 inputs and NORs come with 3 (*), so you get
a 3×4 = 12:1 reduction per pair of stages.

2985984->248832->20736->1728->144->12->1

> which would reduce the number of inputs by a factor of 2^3, so
> seven layers of these OR gates would be needed.

6, not 7.

> Wiring would be interesting as well...

That is why we have 10 layers of metal--oh wait DRAMs don't have that
much metal.....

(*) NANDs having 4 inputs while NORs only have 3 is a consequence of
P-channel transistors having lower transconductance and higher body
effects, and there are differences between planar transistors and
finFETs here, too.
