Rocksolid Light




Subject -- Author

* Memory dependency microbenchmark -- Anton Ertl
+* Re: Memory dependency microbenchmark -- EricP
|`* Re: Memory dependency microbenchmark -- Anton Ertl
| `* Re: Memory dependency microbenchmark -- EricP
|  `* Re: Memory dependency microbenchmark -- Chris M. Thomasson
|   `* Re: Memory dependency microbenchmark -- EricP
|    +* Re: Memory dependency microbenchmark -- MitchAlsup
|    |`* Re: Memory dependency microbenchmark -- EricP
|    | `- Re: Memory dependency microbenchmark -- MitchAlsup
|    `* Re: Memory dependency microbenchmark -- Chris M. Thomasson
|     `* Re: Memory dependency microbenchmark -- MitchAlsup
|      `* Re: Memory dependency microbenchmark -- Chris M. Thomasson
|       `* Re: Memory dependency microbenchmark -- MitchAlsup
|        `* Re: Memory dependency microbenchmark -- Chris M. Thomasson
|         `* Re: Memory dependency microbenchmark -- Kent Dickey
|          +- Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          +* Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          |+* Re: Memory dependency microbenchmark -- MitchAlsup
|          ||`* Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          || `* Re: Memory dependency microbenchmark -- Kent Dickey
|          ||  +* Re: Memory dependency microbenchmark -- aph
|          ||  |+- Re: Memory dependency microbenchmark -- MitchAlsup
|          ||  |`* Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||  | `* Re: Memory dependency microbenchmark -- aph
|          ||  |  +- Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||  |  `* Re: Memory dependency microbenchmark -- Kent Dickey
|          ||  |   +- Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||  |   +* Re: Memory dependency microbenchmark -- MitchAlsup
|          ||  |   |`* Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||  |   | `* Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||  |   |  `* Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||  |   |   +- Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||  |   |   `* Re: Memory dependency microbenchmark -- MitchAlsup
|          ||  |   |    `* Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||  |   |     `- Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||  |   +* Re: Memory dependency microbenchmark -- aph
|          ||  |   |`* Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||  |   | `* Re: Memory dependency microbenchmark -- aph
|          ||  |   |  `- Re: Memory dependency microbenchmark -- MitchAlsup
|          ||  |   `* Re: Memory dependency microbenchmark -- Stefan Monnier
|          ||  |    `* Re: Memory dependency microbenchmark -- MitchAlsup
|          ||  |     +- Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||  |     +* Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||  |     |`* Re: Memory dependency microbenchmark -- MitchAlsup
|          ||  |     | `* Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||  |     |  `* Re: Memory dependency microbenchmark -- aph
|          ||  |     |   +- Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||  |     |   `- Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||  |     +* Re: Memory dependency microbenchmark -- Scott Lurndal
|          ||  |     |`* Re: Memory dependency microbenchmark -- MitchAlsup
|          ||  |     | `* Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||  |     |  `* Re: Memory dependency microbenchmark -- MitchAlsup
|          ||  |     |   `* Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||  |     |    `* Re: Memory dependency microbenchmark -- MitchAlsup
|          ||  |     |     `* Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||  |     |      `* Re: Memory dependency microbenchmark -- MitchAlsup
|          ||  |     |       `* Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||  |     |        `* Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||  |     |         `- Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||  |     `- Re: Memory dependency microbenchmark -- Stefan Monnier
|          ||  `* Re: Memory dependency microbenchmark -- EricP
|          ||   +* Re: Memory dependency microbenchmark -- MitchAlsup
|          ||   |`* Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||   | `* Re: Memory dependency microbenchmark -- Branimir Maksimovic
|          ||   |  `- Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||   `* Re: Memory dependency microbenchmark -- Paul A. Clayton
|          ||    +* Re: Memory dependency microbenchmark -- Scott Lurndal
|          ||    |+* Re: Memory dependency microbenchmark -- MitchAlsup
|          ||    ||`* Re: Memory dependency microbenchmark -- EricP
|          ||    || `- Re: Memory dependency microbenchmark -- MitchAlsup
|          ||    |`* Re: Memory dependency microbenchmark -- Paul A. Clayton
|          ||    | `- Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||    `* Re: Memory dependency microbenchmark -- EricP
|          ||     +* Re: Memory dependency microbenchmark -- aph
|          ||     |`* Re: Memory dependency microbenchmark -- EricP
|          ||     | +* Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||     | |`- Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||     | `* Re: Memory dependency microbenchmark -- aph
|          ||     |  +* Re: Memory dependency microbenchmark -- MitchAlsup
|          ||     |  |+- Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||     |  |`* Re: Memory dependency microbenchmark -- EricP
|          ||     |  | +- Re: Memory dependency microbenchmark -- MitchAlsup
|          ||     |  | +- Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||     |  | `* Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||     |  |  `* Re: Memory dependency microbenchmark -- MitchAlsup
|          ||     |  |   `- Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||     |  `* Re: Memory dependency microbenchmark -- EricP
|          ||     |   `* Re: Memory dependency microbenchmark -- aph
|          ||     |    +* Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||     |    |`* Re: Memory dependency microbenchmark -- aph
|          ||     |    | +* Re: Memory dependency microbenchmark -- Terje Mathisen
|          ||     |    | |`- Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||     |    | `* Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||     |    |  `- Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          ||     |    `- Re: Memory dependency microbenchmark -- EricP
|          ||     `* Re: Memory dependency microbenchmark -- Paul A. Clayton
|          ||      `- Re: Memory dependency microbenchmark -- Chris M. Thomasson
|          |`* weak consistency and the supercomputer attitude (was: Memory dependency microbenchmark) -- Anton Ertl
|          | +- Re: weak consistency and the supercomputer attitude -- Stefan Monnier
|          | +- Re: weak consistency and the supercomputer attitude -- MitchAlsup
|          | `* Re: weak consistency and the supercomputer attitude -- Paul A. Clayton
|          `* Re: Memory dependency microbenchmark -- MitchAlsup
+* Re: Memory dependency microbenchmark -- Chris M. Thomasson
+- Re: Memory dependency microbenchmark -- MitchAlsup
+* Re: Memory dependency microbenchmark -- Anton Ertl
`* Alder Lake results for the memory dependency microbenchmark -- Anton Ertl

Memory dependency microbenchmark

<2023Nov3.101558@mips.complang.tuwien.ac.at>


https://news.novabbs.org/devel/article-flat.php?id=34839&group=comp.arch#34839

From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Memory dependency microbenchmark
Date: Fri, 03 Nov 2023 09:15:58 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 210
Message-ID: <2023Nov3.101558@mips.complang.tuwien.ac.at>
X-newsreader: xrn 10.11
 by: Anton Ertl - Fri, 3 Nov 2023 09:15 UTC

I have written a microbenchmark for measuring how memory dependencies
affect the performance of various microarchitectures. You can find it
along with a description and results on
<http://www.complang.tuwien.ac.at/anton/memdep/>.

You can find the text portion of this page below; there are a few
links on the page that are not produced below.

If you have information on any of the things I don't know (or don't
remember), please let me know about it.

- anton

Memory Dependency Microbenchmark

This microbenchmark tests some aspects of how hardware deals with
dependences through memory. It performs the following loop:

for (i=0; i<250000000; i++) {
*b = *a+1;
*d = *c+1;
*f = *e+1;
*h = *g+1;
}

The eight parameters of the binary allow you to determine to which
memory location the pointers a b c d e f g h refer, in particular, if
some of them refer to the same memory location or not. Each parameter
corresponds to one pointer, and two pointers refer to the same memory
location iff the corresponding parameters are the same.

This microbenchmark uses pointers that have been in their registers
for a long time, so it does not test speculative alias detection by
the hardware.

I have measured using the following parameter combinations (in the
order a b c d e f g h):

* A 0 1 2 3 4 5 6 7: All computations (statements in the loop body)
are completely independent. Ideally they can be performed as fast
as the resources allow.
* B 0 1 1 2 2 3 3 4: A sequence of 4 dependent computations, but
the next iteration does not depend on the results of the previous
one, so again, the work can be performed as fast as the resources
allow. However, in order to achieve this performance, the loads
of the next iteration have to be started while several
architecturally earlier stores still have to be executed, and,
comparing the results of B to those of A, all measured cores have
more difficulty with that, resulting in slower performance.
* C 0 0 2 2 4 4 6 6: 1-computation recurrences (i.e., the
computation in the next iteration depends on a computation in the
current iteration), four of those. So at most 4 of these
computations (plus the loop overhead) can be performed in
parallel.
* D 0 1 1 0 2 3 3 2: 2-computation recurrences (i.e., two dependent
computations in an iteration, and the first of those in the
current iteration depends on the second one in the previous
iteration), 2 of those. So at most two of these computations
(plus the loop overhead) can be performed in parallel.
* E 0 1 2 3 1 0 3 2: The same data flow as D, but the computations
are arranged differently: Here we first have two independent
computations, and then two computations that depend on the
earlier computations.
* F 0 1 1 2 2 0 3 3: A 3-computation recurrence and a 1-computation
recurrence. In the best case you see the latency of three
computations per iteration.
* G 0 1 1 2 3 3 2 0: The same data flow as F, but the computations
are arranged differently.
* H 0 0 0 0 0 0 0 0: One 4-computation recurrence. These
computations can only be performed sequentially; only the loop
overhead can be performed in parallel to them.

The results for different CPU cores are shown in the following. The
numbers are cycles per computation (statement in the loop body). To
get the cycles per iteration, multiply by 4.

A B C D E F G H microarchitecture CPU
1.17 2.11 1.46 3.28 3.67 4.50 4.50 6.93 K8 Athlon 64 X2 4600+
1.00 1.26 2.00 4.00 4.00 6.01 6.01 8.00 Zen Ryzen 5 1600X
1.00 1.19 2.00 4.00 4.00 6.00 6.01 8.00 Zen2 Ryzen 9 3900X
0.75 1.00 1.00 1.00 1.00 1.00 1.00 1.00 Zen3 Ryzen 7 5800X
1.00 1.52 1.99 3.70 3.66 4.65 4.62 7.42 Sandy Bridge Xeon E3-1220
1.00 1.26 1.61 3.06 3.00 4.50 4.50 6.85 Haswell Core i7-4790K
1.00 1.12 1.43 2.90 2.84 4.03 4.05 5.53 Skylake Core i5-6600K
0.78 0.92 1.01 1.81 1.13 2.00 2.00 2.25 Rocket Lake Xeon W-1370P
0.81 0.92 0.83 0.86 0.81 0.82 0.81 1.01 Tiger Lake Core i5-1135G7
3.25 4.00 3.25 3.75 3.25 3.75 3.50 4.00 Cortex-A53 Amlogic S905
3.25 4.00 3.25 3.75 3.25 3.75 3.50 4.00 Cortex-A55 RK3588
1.25 2.26 2.06 3.95 4.00 6.18 6.18 7.67 Cortex-A72 RK3399
1.50 1.77 1.63 3.00 3.03 4.10 4.07 5.51 Cortex-A76 RK3588
1.03 1.66 1.50 3.00 3.00 4.50 4.50 7.84 IceStorm Apple M1 (variations from 6.0-8.75 for H)
3.01 4.52 3.01 4.01 3.01 4.01 4.01 5.02 U74 JH7100 (RISC-V)

There can be different sophistication levels of CPU cores, both in
terms of dealing with aliases, and in terms of forwarding from stores
to loads. And of course there is the difference between in-order
execution and out-of-order execution.

Aliases

The simplest approach would be to assume that all memory accesses may
refer to the same memory location. In that case we would expect that
the core performs all parameter combinations as badly as H. None of
the measured cores exhibits this behaviour, so obviously they are
more sophisticated than that.

The next approach is to allow architecturally later loads to be performed
at the same time, or, with OoO, earlier than stores to a different
address. All tested cores seem to do this.

Finally, there is the issue of what to do when the load is to the
same address as an architecturally preceding store. This brings us to
the next section:

Store to load forwarding

The simplest approach is to wait until the data arrives in the cache
and then load it from there. I dimly remember seeing 15 cycles per
iteration for a loop that incremented a memory location in every
iteration, but none of the cores measured above take that long
(although, for Zen and Zen2 one might wonder).

The next approach is to let the load read from the store buffer if
the data is there (and first wait until it is there). In this case
the whole sequence of store-load has a total latency that is not much
higher than the load latency. It seems that most cores measured here
do that. We see this best in the H results; e.g., a Skylake has 4
cycles of load latency and 1 cycle of add latency, and we see 5.53
cycles of latency for store-load-add, meaning that the store
contributes an average latency of 0.53 cycles. There are some
complications in case the load only partially overlaps the store (I
should put an appropriate chipsncheese link here).

Finally, the core might detect that the data that is loaded is coming
from a store that has not been retired yet, so the physical register
used by the store can be directly renamed into the register of the
load result, and the load does not need to access the store buffer or
cache at all (what is the name of this feature?). As a result, in the
ideal case we see only the 1-cycle latency of the add in case H. In
the measured cores, Zen3 and Tiger Lake exhibit this behaviour fully;
Rocket Lake probably also does that, but either does not succeed in
all cases (the different results between D and E speak for this
theory), or there is an additional source of latency.

I expect that Firestorm (the performance core of the Apple M1) also
has this feature, but unfortunately the cycles performance counter
does not work for Firestorm on Linux 6.4.0-asahi-00571-gda70cd78bc50

In-order vs. out-of-order execution

With in-order execution (on Cortex A53 and A55, and on the U74), the
loads cannot be executed before architecturally earlier stores, even
if both refer to different memory locations. So even A is relatively
slow on in-order cores. In-order execution also means that B is
almost as slow as H, while with out-of-order execution it can
theoretically be executed as fast as A (in practice it is slightly
slower, but certainly much faster than H).

With OoO, we see much better performance in cases where there are
independent computation chains. Given the size of the buffers in the
various OoO microarchitectures (hundreds of instructions in the
reorder buffer, dozens in schedulers), it is surprising that B is
slower than A given the small size of each iteration (~15
instructions); and even D is slower than E on some
microarchitectures, most notably Rocket Lake.

Measuring your own hardware

You can download the contents of this directory and run the benchmark
with

wget http://www.complang.tuwien.ac.at/anton/memdep/memdep.zip
unzip memdep.zip
cd memdep
make

If you want to do your own parameter combinations, you can run the
binary with

./memdep-`uname -m` a b c d e f g h

where a b c d e f g h are integers and correspond to the pointers in
the loop. If you want to get results like in the table above, run it
like this:

perf stat --log-fd 3 -x, -e $(CYCLES) ./memdep-$(ARCH) a b c d e f g h 3>&1 | awk -F, '{printf("%5.2f\n",$$1/1000000000)}'

Future work

Make a good microbenchmark that produces the addresses so late that
either the core has to wait for the addresses or has to speculate
whether the load accesses the same address as the store or not. Do it
for predictable aliasing and for unpredictable aliasing. I remember a
good article about this predictor (one that actually looked at Intel's
patent), but don't remember the URL; Intel calls this technique, as
implemented in Ice Lake and later P-cores, the Fast Store Forwarding
Predictor (FSFP) (but my memory says that the article I read
looked at older microarchitectures that have a similar feature). AMD
describes such a hardware feature under the name predictive store
forwarding (PSF), which they added in Zen3.


Re: Memory dependency microbenchmark

<yT91N.150749$HwD9.28213@fx11.iad>


https://news.novabbs.org/devel/article-flat.php?id=34841&group=comp.arch#34841

From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
In-Reply-To: <2023Nov3.101558@mips.complang.tuwien.ac.at>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 222
Message-ID: <yT91N.150749$HwD9.28213@fx11.iad>
Date: Fri, 03 Nov 2023 13:13:41 -0400
 by: EricP - Fri, 3 Nov 2023 17:13 UTC

Anton Ertl wrote:
> I have written a microbenchmark for measuring how memory dependencies
> affect the performance of various microarchitectures. You can find it
> along with a description and results on
> <http://www.complang.tuwien.ac.at/anton/memdep/>.
[...]
> Related
>
> One thing I remember is that I have done a microbenchmark that was
> intended to measure predictive store forwarding, but it (also?)
> measured the forward-at-register-level technique described above.


Re: Memory dependency microbenchmark

<ui3s8b$2vc30$12@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=34842&group=comp.arch#34842

From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Fri, 3 Nov 2023 15:29:32 -0700
Organization: A noiseless patient Spider
Lines: 10
Message-ID: <ui3s8b$2vc30$12@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 3 Nov 2023 22:29:31 -0000 (UTC)
User-Agent: Mozilla Thunderbird
In-Reply-To: <2023Nov3.101558@mips.complang.tuwien.ac.at>
Content-Language: en-US
 by: Chris M. Thomasson - Fri, 3 Nov 2023 22:29 UTC

On 11/3/2023 2:15 AM, Anton Ertl wrote:
> I have written a microbenchmark for measuring how memory dependencies
> affect the performance of various microarchitectures. You can find it
> along with a description and results on
> <http://www.complang.tuwien.ac.at/anton/memdep/>.
[...]

Is the DEC Alpha the only arch out there that requires an explicit
memory barrier even for data-dependent loads? I think so.

Re: Memory dependency microbenchmark

<2023Nov4.180132@mips.complang.tuwien.ac.at>


https://news.novabbs.org/devel/article-flat.php?id=34843&group=comp.arch#34843

From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Sat, 04 Nov 2023 17:01:32 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 55
Message-ID: <2023Nov4.180132@mips.complang.tuwien.ac.at>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <yT91N.150749$HwD9.28213@fx11.iad>
X-newsreader: xrn 10.11
 by: Anton Ertl - Sat, 4 Nov 2023 17:01 UTC

EricP <ThatWouldBeTelling@thevillage.com> writes:
>I notice that all your 8 references are within the same cache line.
>Store to load forwarding might behave differently across lines
>and 4kB page boundaries.
>Instead of 0..7 you might try 0*64..7*64 and 0*4096..7*4096

The numbers are in 8-byte granularity, so to get different cache lines
n*8 is good enough, and for different pages n*512. However, the
array is only 8000 bytes long, so you can use only numbers in the
range 0...999. Anyway, I made another make target "ericp", so when
you "make ericp", you get the following parameter combinations:

X 0 8 16 24 32 40 48 56

Different cache lines, but always the same "bank" in a cache line;
this should hurt K8, maybe also others.

Y 0 9 18 27 36 45 54 63

Different cache lines, and different banks for different accesses.

Z 0 513 994 11 524 989 22 535

All different cache lines and different banks; the second access is on
a different page than the first, and the third is likely on a
different page. Then start with the first page again. These are all
independent accesses.
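The index arithmetic above (8-byte units, 64-byte cache lines, 4096-byte pages) can be sanity-checked with a small C sketch; the helper names here are mine, not from the benchmark:

```c
#include <assert.h>

/* Indices are in 8-byte units, as in the benchmark. Assuming 64-byte
   cache lines and 4096-byte pages (the sizes discussed above), these
   helpers (hypothetical names) map an index to its cache line, its
   8-byte slot ("bank") within the line, and its page. */
static long line_of(long n) { return n * 8 / 64; }
static long slot_of(long n) { return n * 8 % 64; }
static long page_of(long n) { return n * 8 / 4096; }

/* Pattern X (0 8 16 ... 56): consecutive lines, always slot 0.
   Pattern Y (0 9 18 ... 63): consecutive lines, all eight slots.
   Pattern Z: index 513 is already on the next page (513*8 = 4104). */
static void check_patterns(void)
{
    for (int i = 0; i < 8; i++) {
        assert(line_of(i * 8) == i && slot_of(i * 8) == 0);     /* X */
        assert(line_of(i * 9) == i && slot_of(i * 9) == i * 8); /* Y */
    }
    assert(page_of(0) == 0 && page_of(513) == 1);               /* Z */
}
```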

Results:

  X     Y     Z     A     B     C     D     E     F     G     H
 1.00  1.00  1.00  0.75  1.00  1.00  1.00  1.00  1.00  1.00  1.00  Zen3
 1.00  1.00  1.00  0.78  0.91  1.02  1.81  1.13  2.00  2.00  2.25  Rocketlake
 2.00  1.17  1.17  1.17  2.03  1.25  3.30  3.69  4.50  4.50  6.91  K8

We indeed see a slowdown on Zen3 and Rocketlake on X Y Z compared to
the dependence-wise equivalent A. My guess is that these CPUs can
only store to one cache line per cycle, and can optimize the stores in
some cases for A. If that is the case, that is a resource thing, not
a failure to recognize independence.

We see a slowdown for X on K8, which is somewhat expected; however,
thinking about it, I wonder: It seems as if the K8 can do only one load
or store per cycle if they are to the same bank, but several to
different banks; it has been too long since I read how it worked. Y
and Z are the same speed as A, showing that the distance between
addresses does not influence the no-alias detection (at least for
these distances).

Is this what you had in mind?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Memory dependency microbenchmark

<2023Nov4.184057@mips.complang.tuwien.ac.at>

https://news.novabbs.org/devel/article-flat.php?id=34844&group=comp.arch#34844

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!news.hispagatos.org!eternal-september.org!feeder2.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Sat, 04 Nov 2023 17:40:57 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 51
Message-ID: <2023Nov4.184057@mips.complang.tuwien.ac.at>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <ui3s8b$2vc30$12@dont-email.me>
Injection-Info: dont-email.me; posting-host="dc56501592f0e0e8845495ac2fd2cef2";
logging-data="3655096"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/N8kvOWJm9BQf8Y6WSmfix"
Cancel-Lock: sha1:a86mPhbiA3GyPQQ2tX3QLK2tXWc=
X-newsreader: xrn 10.11
 by: Anton Ertl - Sat, 4 Nov 2023 17:40 UTC

"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
>On 11/3/2023 2:15 AM, Anton Ertl wrote:
>> I have written a microbenchmark for measuring how memory dependencies
>> affect the performance of various microarchitectures. You can find it
>> along with a description and results on
>> <http://www.complang.tuwien.ac.at/anton/memdep/>.
>[...]
>
>Is the only arch out there that does not require an explicit memory
>barrier for data-dependent loads a DEC Alpha? I think so.

I don't know any architecture that requires memory barriers for
single-threaded programs that access just memory, not even Alpha.

You may be thinking of the memory consistency model of Alpha, which is
even weaker than everything else I know of. This is not surprising,
given that a prominent advocacy paper for weak consistency
[adve&gharachorloo95] came out of DEC.
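The multithreaded case behind the Alpha question can be sketched in C11 atomics; this is an illustrative publish/consume pattern, not code from the benchmark. On most weakly ordered machines (ARM, POWER) the consumer's address dependency alone orders the two loads; Alpha is the classic exception, where the consumer also needs a read barrier, which memory_order_consume (in practice promoted to acquire by compilers) makes explicit:

```c
#include <stdatomic.h>
#include <stddef.h>

struct node { int payload; };
static struct node n;
static _Atomic(struct node *) shared = NULL;

void producer(void)
{
    n.payload = 42;
    /* release: the payload store is visible before the pointer is */
    atomic_store_explicit(&shared, &n, memory_order_release);
}

int consumer(void)
{
    /* the load of p->payload depends on the load of p; only Alpha
       needs an extra barrier between them */
    struct node *p =
        atomic_load_explicit(&shared, memory_order_consume);
    return p ? p->payload : -1;
}
```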

@TechReport{adve&gharachorloo95,
author = {Sarita V. Adve and Kourosh Gharachorloo},
title = {Shared Memory Consistency Models: A Tutorial},
institution = {Digital Western Research Lab},
year = {1995},
type = {WRL Research Report},
number = {95/7},
annote = {Gives an overview of architectural features of
shared-memory computers such as independent memory
banks and per-CPU caches, and how they make the (for
programmers) most natural consistency model hard to
implement, giving examples of programs that can fail
with weaker consistency models. It then discusses
several categories of weaker consistency models and
actual consistency models in these categories, and
which ``safety net'' (e.g., memory barrier
instructions) programmers need to use to work around
the deficiencies of these models. While the authors
recognize that programmers find it difficult to use
these safety nets correctly and efficiently, it
still advocates weaker consistency models, claiming
that sequential consistency is too inefficient, by
outlining an inefficient implementation (which is of
course no proof that no efficient implementation
exists). Still the paper is a good introduction to
the issues involved.}
}

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Memory dependency microbenchmark

<ui679v$3hfer$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=34845&group=comp.arch#34845

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!news.hispagatos.org!eternal-september.org!feeder2.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Sat, 4 Nov 2023 14:48:04 -0500
Organization: A noiseless patient Spider
Lines: 152
Message-ID: <ui679v$3hfer$1@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<ui3s8b$2vc30$12@dont-email.me> <2023Nov4.184057@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 4 Nov 2023 19:50:23 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="518ae8cdc79d437ac6ed56e6fd0b170f";
logging-data="3718619"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+UXvEEQzuZHTDISzrQcoAm"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.15.1
Cancel-Lock: sha1:Mo0xHZaFHz0tU55jTe29zguV938=
Content-Language: en-US
In-Reply-To: <2023Nov4.184057@mips.complang.tuwien.ac.at>
 by: BGB - Sat, 4 Nov 2023 19:48 UTC

On 11/4/2023 12:40 PM, Anton Ertl wrote:
> "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
>> On 11/3/2023 2:15 AM, Anton Ertl wrote:
>>> I have written a microbenchmark for measuring how memory dependencies
>>> affect the performance of various microarchitectures. You can find it
>>> along with a description and results on
>>> <http://www.complang.tuwien.ac.at/anton/memdep/>.
>> [...]
>>
>> Is the only arch out there that does not require an explicit memory
>> barrier for data-dependent loads a DEC Alpha? I think so.
>
> I don't know any architecture that requires memory barriers for
> single-threaded programs that access just memory, not even Alpha.
>
> You may be thinking of the memory consistency model of Alpha, which is
> even weaker than everything else I know of. This is not surprising,
> given that a prominent advocacy paper for weak consistency
> [adve&gharachorloo95] came out of DEC.
>

Hmm, one could maybe save some logic and also make timing a little
easier if they disallowed RAW and WAW (in the same cache line), and made
them undefined behavior...

However, this would suck for software (and would effectively either
mandate strict stack alignment to keep prologs/epilogs working, or
require saving registers in a convoluted order). And, in my case, I can
tell that code steps on these scenarios very often (particularly in
prolog sequences).

This is why I bother with a "somewhat expensive" feature of forwarding
stored cache lines back to load in cases where one is accessing a "just
accessed" cache line.

Disabling this forwarding (and forcing a stall instead), tends to have a
more noticeable adverse effect on performance (but is often needed to
try to pass timing at higher clock speeds).

Granted, I guess it is possible I could fiddle with the compiler to try
to improve this situation, say:
Only use MOV.X with a 16B alignment;
Stagger even/odd stores in pairs when possible.

Say, as opposed to:
MOV.X R12, (SP, 160)
MOV.X R10, (SP, 144)
MOV.X R8, (SP, 128)
MOV.X R30, (SP, 112)
MOV.X R28, (SP, 96)
MOV.X R26, (SP, 80)
MOV.X R24, (SP, 64)
Say:
MOV.X R10, (SP, 144)
MOV.X R12, (SP, 160)
MOV.X R30, (SP, 112)
MOV.X R8, (SP, 128)
MOV.X R26, (SP, 80)
MOV.X R28, (SP, 96)
MOV.X R24, (SP, 64)

Say, to try to avoid two adjacent stores within the same 32-byte
paired-line (for 64-bit load, one would try to avoid two adjacent stores
within the same 32 bytes).

But, I sort of ended my 75MHz experiment for now, and fell back to
50MHz, where I can more easily afford to have this forwarding (in which
case the above becomes mostly N/A).

As for cache-coherence between cores, this basically still does not
exist yet.

It also seems like it would require a bit more ringbus traffic to pull
off, say:
If a core wants to write to a line, it needs to flag its intention to do so;
If a line was previously fetched for read, but is now being written, the
line would need to be flushed and re-fetched with the new intention;
If a line is being fetched for write, we somehow need to signal it to be
flushed from whatever L1 cache was last holding it;
....

But, the alternative (what I currently have), effectively disallows
traditional forms of (writable) memory sharing between threads (and
"volatile" memory access seems to be pretty slow at present). So, for
multithreaded code, one would need to have the threads work mostly
independently (without any shared mutable structures), and then trigger
an L1 flush when it is done with its work (and the receiving thread also
needs to trigger an L1 flush).

Granted, similar ended up happening with the rasterizer module with
TKRA-GL, where cache-flushing is needed whenever handing off the
framebuffer between the main CPU and the rasterizer module (otherwise,
graphical garbage and incompletely drawn geometry may result); and one
also needs to flush the L1 cache and trigger the rasterizer module to
flush its texture cache whenever uploading a new texture.

And, as-is, texture uploading has ended up being horridly slow. So, even
if I can (sort of) afford to run GLQuake with lightmaps now, dynamic
lightmaps need to be disabled as it is too slow.

Though, an intermediate possibility could be to store lightmaps with
dynamic lights in (only) one of 2 states, with two different textures
(Dynamic light ON/Max, Dynamic light OFF/Min). However, this would
effectively disallow effects like "slow pulsate" (which could only
alternate between fully-on and fully-off); as well as updates from
dynamic light sources like rockets (where, as-is, firing a rocket with
dynamic lightmaps, eats the CPU something hard).

Not entirely sure how something like a Pentium-1 managed all this either.

Granted, faster in this case would still be to do everything with vertex
lighting (as is currently still the default in my GLQuake port), if
albeit (not as good), and a bit of a hack as Quake wasn't really
designed for vertex lighting.

> @TechReport{adve&gharachorloo95,
> author = {Sarita V. Adve and Kourosh Gharachorloo},
> title = {Shared Memory Consistency Models: A Tutorial},
> institution = {Digital Western Research Lab},
> year = {1995},
> type = {WRL Research Report},
> number = {95/7},
> annote = {Gives an overview of architectural features of
> shared-memory computers such as independent memory
> banks and per-CPU caches, and how they make the (for
> programmers) most natural consistency model hard to
> implement, giving examples of programs that can fail
> with weaker consistency models. It then discusses
> several categories of weaker consistency models and
> actual consistency models in these categories, and
> which ``safety net'' (e.g., memory barrier
> instructions) programmers need to use to work around
> the deficiencies of these models. While the authors
> recognize that programmers find it difficult to use
> these safety nets correctly and efficiently, it
> still advocates weaker consistency models, claiming
> that sequential consistency is too inefficient, by
> outlining an inefficient implementation (which is of
> course no proof that no efficient implementation
> exists). Still the paper is a good introduction to
> the issues involved.}
> }
>
> - anton

Re: Memory dependency microbenchmark

<RfO1N.368064$w4ec.211729@fx14.iad>

https://news.novabbs.org/devel/article-flat.php?id=34847&group=comp.arch#34847

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx14.iad.POSTED!not-for-mail
From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <yT91N.150749$HwD9.28213@fx11.iad> <2023Nov4.180132@mips.complang.tuwien.ac.at>
In-Reply-To: <2023Nov4.180132@mips.complang.tuwien.ac.at>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 75
Message-ID: <RfO1N.368064$w4ec.211729@fx14.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Sun, 05 Nov 2023 15:10:41 UTC
Date: Sun, 05 Nov 2023 10:09:34 -0500
X-Received-Bytes: 3988
 by: EricP - Sun, 5 Nov 2023 15:09 UTC

Anton Ertl wrote:
> EricP <ThatWouldBeTelling@thevillage.com> writes:
>> I notice that all your 8 references are within the same cache line.
>> Store to load forwarding might behave differently across lines
>> and 4kB page boundaries.
>> Instead of 0..7 you might try 0*64..7*64 and 0*4096..7*4096
>
> The numbers are in 8-byte granularity, so to get different cache lines
> n*8 is good enough, and for different pages n*512. However, the
> array is only 8000 bytes long, so you can use only numbers in the
> range 0...999. Anyway, I made another make target "ericp", so when
> you "make ericp", you get the following parameter combinations:
>
> X 0 8 16 24 32 40 48 56
>
> Different cache lines, but always the same "bank" in a cache line;
> this should hurt K8, maybe also others.
>
> Y 0 9 18 27 36 45 54 63
>
> Different cache lines, and different banks for different accesses.
>
> Z 0 513 994 11 524 989 22 535
>
> All different cache lines and different banks; the second access is on
> a different page than the first, and the third is likely on a
> different page. Then start with the first page again. These are all
> independent accesses.
>
> Results:
>
> X Y Z A B C D E F G H
> 1.00 1.00 1.00 0.75 1.00 1.00 1.00 1.00 1.00 1.00 1.00 Zen3
> 1.00 1.00 1.00 0.78 0.91 1.02 1.81 1.13 2.00 2.00 2.25 Rocketlake
> 2.00 1.17 1.17 1.17 2.03 1.25 3.30 3.69 4.50 4.50 6.91 K8
>
> We indeed see a slowdown on Zen3 and Rocketlake on X Y Z compared to
> the dependence-wise equivalent A. My guess is that these CPUs can
> only store to one cache line per cycle, and can optimize the stores in
> some cases for A. If that is the case, that is a resource thing, not
> a failure to recognize independence.

It might be that A is occasionally combining the multiple stores
to the same line in the store buffer whereas X Y Z do not.
So maybe A needs 25% less cache accesses.

> We see a slowdown for X on K8, which is somewhat expected; however,
> thinking about it, I wonder: It seems as if the K8 can do only one load
> or store per cycle if they are to the same bank, but several to
> different banks; it has been too long since I read how it worked. Y
> and Z are the same speed as A, showing that the distance between
> addresses does not influence the no-alias detection (at least for
> these distances).
>
> Is this what you had in mind?
>
> - anton

Yes. My thought was that by targeting the same cache line you
might be triggering alternate mechanisms that cause serialization.

First was that x86-TSO coherence allows a younger load to bypass (execute
before) an older store to a non-overlapping address, otherwise it is serial.
The detection of "same address" could be as high resolution as 8-byte
operand or as low as a cache line. So by targeting separate cache lines
it could allow more load-store bypassing and concurrency.

Also, as you noted, by targeting the same cache line it would serialize
on the same cache bank port, if it has multiple banks.

And I just suggested different pages to disable any "same page"
virtual address translation optimizations (if there are any).

Re: Memory dependency microbenchmark

<ui8uvc$4c78$9@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=34850&group=comp.arch#34850

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder2.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Sun, 5 Nov 2023 12:46:36 -0800
Organization: A noiseless patient Spider
Lines: 78
Message-ID: <ui8uvc$4c78$9@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<yT91N.150749$HwD9.28213@fx11.iad>
<2023Nov4.180132@mips.complang.tuwien.ac.at>
<RfO1N.368064$w4ec.211729@fx14.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 5 Nov 2023 20:46:36 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="c368fd4043ce1ecca96d5dea385e0752";
logging-data="143592"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19mn8fTy40tAGmXfHysdiS8pfB3tT4B4jI="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:Ma6FYPIeM46mOKLrl0Y0+ufbeKg=
Content-Language: en-US
In-Reply-To: <RfO1N.368064$w4ec.211729@fx14.iad>
 by: Chris M. Thomasson - Sun, 5 Nov 2023 20:46 UTC

On 11/5/2023 7:09 AM, EricP wrote:
> Anton Ertl wrote:
>> EricP <ThatWouldBeTelling@thevillage.com> writes:
>>> I notice that all your 8 references are within the same cache line.
>>> Store to load forwarding might behave differently across lines
>>> and 4kB page boundaries.
>>> Instead of 0..7 you might try 0*64..7*64 and 0*4096..7*4096
>>
>> The numbers are in 8-byte granularity, so to get different cache lines
>> n*8 is good enough, and for different pages n*512.  However, the
>> array is only 8000 bytes long, so you can use only numbers in the
>> range 0...999.  Anyway, I made another make target "ericp", so when
>> you "make ericp", you get the following parameter combinations:
>>
>> X 0 8 16 24 32 40 48 56
>>
>> Different cache lines, but always the same "bank" in a cache line;
>> this should hurt K8, maybe also others.
>>
>> Y 0 9 18 27 36 45 54 63
>>
>> Different cache lines, and different banks for different accesses.
>>
>> Z 0 513 994 11 524 989 22 535
>>
>> All different cache lines and different banks; the second access is on
>> a different page than the first, and the third is likely on a
>> different page.  Then start with the first page again.  These are all
>> independent accesses.
>>
>> Results:
>>
>>   X     Y     Z     A     B     C     D     E     F     G     H
>>  1.00  1.00  1.00  0.75  1.00  1.00  1.00  1.00  1.00  1.00  1.00 Zen3
>>  1.00  1.00  1.00  0.78  0.91  1.02  1.81  1.13  2.00  2.00  2.25 Rocketlake
>>  2.00  1.17  1.17  1.17  2.03  1.25  3.30  3.69  4.50  4.50  6.91 K8
>> We indeed see a slowdown on Zen3 and Rocketlake on X Y Z compared to
>> the dependence-wise equivalent A.  My guess is that these CPUs can
>> only store to one cache line per cycle, and can optimize the stores in
>> some cases for A.  If that is the case, that is a resource thing, not
>> a failure to recognize independence.
>
> It might be that A is occasionally combining the multiple stores
> to the same line in the store buffer whereas X Y Z do not.
> So maybe A needs 25% less cache accesses.
>
>> We see a slowdown for X on K8, which is somewhat expected; however,
>> thinking about it, I wonder: It seems as if the K8 can do only one load
>> or store per cycle if they are to the same bank, but several to
>> different banks; it has been too long since I read how it worked.  Y
>> and Z are the same speed as A, showing that the distance between
>> addresses does not influence the no-alias detection (at least for
>> these distances).
>>
>> Is this what you had in mind?
>>
>> - anton
>
> Yes. My thought was that by targeting the same cache line you
> might be triggering alternate mechanisms that cause serialization.
>
> First was that x86-TSO coherence allows a younger load to bypass (execute
> before) an older store to a non-overlapping address, otherwise it is
> serial.
> The detection of "same address" could be as high resolution as 8-byte
> operand or as low as a cache line. So by targeting separate cache lines
> it could allow more load-store bypassing and concurrency.
>
> Also, as you noted, by targeting the same cache line it would serialize
> on the same cache bank port, if it has multiple banks.
>
> And I just suggested different pages to disable any "same page"
> virtual address translation optimizations (if there are any).

Iirc, one can release a spinlock using an atomic store on x86, no
LOCK'ED RMW. Btw, have you ever tried to implement hazard pointers on an
x86? It requires an explicit memory barrier.
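Both points can be sketched in C11 atomics rather than raw x86; this is an illustrative outline (the names are mine, not from any particular hazard-pointer library). Acquiring a spinlock needs a LOCK'ed RMW, releasing it is a plain store; a hazard-pointer publication needs a store->load ordering that TSO does not give for free:

```c
#include <stdatomic.h>
#include <stddef.h>

static atomic_int spin = 0;

void spin_lock(void)
{
    /* acquire requires an atomic RMW (xchg / lock cmpxchg on x86)... */
    while (atomic_exchange_explicit(&spin, 1, memory_order_acquire))
        ;
}

void spin_unlock(void)
{
    /* ...but release is a plain store: under x86-TSO it becomes
       visible after all prior stores in the critical section. */
    atomic_store_explicit(&spin, 0, memory_order_release);
}

static _Atomic(void *) hazard = NULL;
static _Atomic(void *) shared_ptr = NULL;

void *acquire_hazard(void)
{
    /* The store publishing the hazard must be globally visible
       before the validating re-load, so it needs seq_cst (an MFENCE
       or LOCK'ed op on x86); TSO alone lets the load pass the store. */
    void *p;
    do {
        p = atomic_load(&shared_ptr);
        atomic_store_explicit(&hazard, p, memory_order_seq_cst);
    } while (atomic_load(&shared_ptr) != p);  /* validate */
    return p;
}
```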

Re: Memory dependency microbenchmark

<DUM2N.293169$2fS.117686@fx16.iad>

https://news.novabbs.org/devel/article-flat.php?id=34853&group=comp.arch#34853

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!newsfeed.endofthelinebbs.com!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx16.iad.POSTED!not-for-mail
From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <yT91N.150749$HwD9.28213@fx11.iad> <2023Nov4.180132@mips.complang.tuwien.ac.at> <RfO1N.368064$w4ec.211729@fx14.iad> <ui8uvc$4c78$9@dont-email.me>
In-Reply-To: <ui8uvc$4c78$9@dont-email.me>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 65
Message-ID: <DUM2N.293169$2fS.117686@fx16.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Wed, 08 Nov 2023 14:26:43 UTC
Date: Wed, 08 Nov 2023 09:26:23 -0500
X-Received-Bytes: 3573
 by: EricP - Wed, 8 Nov 2023 14:26 UTC

Chris M. Thomasson wrote:
> On 11/5/2023 7:09 AM, EricP wrote:
>>
>> Yes. My thought was that by targeting the same cache line you
>> might be triggering alternate mechanisms that cause serialization.
>>
>> First was that x86-TSO coherence allows a younger load to bypass (execute
>> before) an older store to a non-overlapping address, otherwise it is
>> serial.
>> The detection of "same address" could be as high resolution as 8-byte
>> operand or as low as a cache line. So by targeting separate cache lines
>> it could allow more load-store bypassing and concurrency.
>>
>> Also, as you noted, by targeting the same cache line it would serialize
>> on the same cache bank port, if it has multiple banks.
>>
>> And I just suggested different pages to disable any "same page"
>> virtual address translation optimizations (if there are any).
>
> Iirc, one can release a spinlock using an atomic store on x86, no
> LOCK'ED RMW.

Sure, because its not a RMW. Its just a store which under x86-TSO
becomes visible after prior stores to the protected section.

This rule seems to impose some particular design complexities on the
order in which a Load-Store Queue and cache can perform store operations.

Say we have two cache lines A and B.

If there is a store to cache line ST [A+0]
then to a different line ST [B+0], then another ST [A+1],
and if the first ST [A+0] hits cache but second ST [B+0] misses,
then under TSO the third ST [A+1] must appear to stall so that it
does not become visible until after the ST [B+0] has been performed,
even though line A is in the cache.

ST [A+0],r1 <- this hits cache
ST [B+0],r2 <- this misses cache
ST [A+1],r3 <- this waits for B to arrive and store to [B+0] to finish

On core C1, if ST [A+1] was allowed to be performed before [B+0] then an
invalidate msg might get in and transfer ownership of line A to a different
core C2, allowing the new value of [A+1] to be visible at C2 before [B+0].

In order to prevent this under TSO, either LSQ actually stalls ST [A+1],
or it allows it to proceed to the cache but pins line A until the
update to B is done. If it uses the second pinning approach then it
must also deal with all the potential deadlock/livelock possibilities.

And the cache access is pipelined so all of this is asynchronous.
When ST [B+0] misses cache, ST [A+1] might already be in the pipeline.
So even in the simple "stall until older stores done" approach it needs
even more logic to detect this and NAK the following stores back to LSQ,
and later wake them up and replay them when the ST [B+0] is done.
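The ordering rule above can be written as a litmus-style test, with C11 seq_cst operations standing in for x86 stores under TSO (a sketch, not a proof): if an observer sees the third store, the second must be visible too, even though the third targets an already resident line.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Stand-ins for [A+0], [B+0], [A+1] in the example above. */
static atomic_int a0, b0, a1;

static void *writer(void *arg)
{
    (void)arg;
    atomic_store(&a0, 1);   /* hits cache                          */
    atomic_store(&b0, 1);   /* may miss                            */
    atomic_store(&a1, 1);   /* must not become visible before b0   */
    return NULL;
}

static void *observer(void *arg)
{
    int *bad = arg;
    /* Seeing a1 == 1 while b0 == 0 is the forbidden outcome. */
    if (atomic_load(&a1) == 1 && atomic_load(&b0) == 0)
        *bad = 1;
    return NULL;
}

int run_once(void)
{
    int bad = 0;
    pthread_t w, o;
    atomic_store(&a0, 0); atomic_store(&b0, 0); atomic_store(&a1, 0);
    pthread_create(&w, NULL, writer, NULL);
    pthread_create(&o, NULL, observer, &bad);
    pthread_join(w, NULL);
    pthread_join(o, NULL);
    return bad;   /* always 0 under TSO / seq_cst */
}
```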

> Btw, have you ever tried to implement hazard pointers on an
> x86? It requires an explicit memory barrier.

That lock-free stuff makes my brain hurt.

Re: Memory dependency microbenchmark

<22648de76fcb4b5ff4fdcaae1db56c99@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=34867&group=comp.arch#34867

Newsgroups: comp.arch
Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Fri, 10 Nov 2023 01:36:31 +0000
Organization: novaBBS
Message-ID: <22648de76fcb4b5ff4fdcaae1db56c99@news.novabbs.com>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <ui3s8b$2vc30$12@dont-email.me> <2023Nov4.184057@mips.complang.tuwien.ac.at> <ui679v$3hfer$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="405462"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Rslight-Site: $2y$10$SEuwNdAVpi6Lz.dNzSV6Gu3xQOaHw84LDyNHLirck6OGMPqaojieq
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
 by: MitchAlsup - Fri, 10 Nov 2023 01:36 UTC

BGB wrote:

> On 11/4/2023 12:40 PM, Anton Ertl wrote:
>> "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
>>> On 11/3/2023 2:15 AM, Anton Ertl wrote:
>>>> I have written a microbenchmark for measuring how memory dependencies
>>>> affect the performance of various microarchitectures. You can find it
>>>> along with a description and results on
>>>> <http://www.complang.tuwien.ac.at/anton/memdep/>.
>>> [...]
>>>
>>> Is the only arch out there that does not require an explicit memory
>>> barrier for data-dependent loads a DEC Alpha? I think so.
>>
>> I don't know any architecture that requires memory barriers for
>> single-threaded programs that access just memory, not even Alpha.
>>
>> You may be thinking of the memory consistency model of Alpha, which is
>> even weaker than everything else I know of. This is not surprising,
>> given that a prominent advocacy paper for weak consistency
>> [adve&gharachorloo95] came out of DEC.
>>

> Hmm, one could maybe save some logic and also make timing a little
> easier if they disallowed RAW and WAW (in the same cache line), and made
> them undefined behavior...
<
Intractable in practice.
<
What you CAN do is to dynamically solve this problem in HW. By constructing
a memory dependency matrix and not allowing a memory reference to complete
unless it is safe to do so.
<
> However, this would suck for software (and would effectively either
> mandate strict stack alignment to keep prologs/epilogs working, or
> require saving registers in a convoluted order). And, in my case, I can
> tell that code steps on these scenarios very often (particularly in
> prolog sequences).
<
It only sucks when there is a REAL dependency, and saves your neck every time
a dependency is not found (mid-to-high 90%-ile).
<
> This is why I bother with a "somewhat expensive" feature of forwarding
> stored cache lines back to load in cases where one is accessing a "just
> accessed" cache line.
<
I use a MDM (above) attached to a temporal cache.
<
When an instruction is inserted in the MDM/TC==CC, it is made dependent on all
instructions that are already in the CC that have not completed, and is installed
in the Unknown Address state.
<
When an address is AGENed, the address is sent to the data cache and a portion
is sent to the CC and that portion CAMs against all the other addresses. The
result of the comparison is either "Is the same line" or "Can't be the same line"
and the state changes to Known Address.
<
While the compares are taking place, the MDM is relaxed. Any entry that "can't
be the same line" is removed as a dependency. When all dependencies have been
removed, the instruction can complete.
<
If either CC or DC returns with data, the state advances to Known Data. Stores
with Known data are allowed to complete into CC. Modified CC data migrates back
to DC when the ST instruction becomes retireable.
<
At this point the ST has completed as seen by the CPU, but has not completed
as seen by "interested Observers", and is subject to prediction repair, AGEN
replay, and those rare situations where one changes the memory ordering model
{ATOMICs, MMI/O, Config space}. This is the key point--the CPU can think "its
done" while the external world can think "It has not happened yet".
<
Thus, memory references that alias on a cache line basis are performed in order,
while those that are not run independently.
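The insert/AGEN/relax cycle described above can be sketched as a toy model; this is purely illustrative (names and sizes are mine, not the actual design). dep[i][j] means entry i must wait for older entry j:

```c
#include <stdbool.h>

/* Toy memory dependency matrix: a new reference starts dependent on
   every older incomplete reference; once its address is generated,
   dependencies on provably different cache lines are dropped. */
#define N 8
enum state { UNKNOWN_ADDR, KNOWN_ADDR };

static enum state st[N];
static unsigned long line[N];   /* cache-line address once AGENed */
static bool dep[N][N];
static int count = 0;

int insert(void)                /* Unknown Address: depend on all older */
{
    int i = count++;
    st[i] = UNKNOWN_ADDR;
    for (int j = 0; j < i; j++)
        dep[i][j] = true;
    return i;
}

void agen(int i, unsigned long addr)   /* address known: relax matrix */
{
    st[i] = KNOWN_ADDR;
    line[i] = addr / 64;
    for (int j = 0; j < i; j++)        /* older entries vs. this one */
        if (dep[i][j] && st[j] == KNOWN_ADDR && line[j] != line[i])
            dep[i][j] = false;         /* "can't be the same line"   */
    for (int k = i + 1; k < count; k++) /* younger entries vs. this   */
        if (dep[k][i] && st[k] == KNOWN_ADDR && line[k] != line[i])
            dep[k][i] = false;
}

bool can_complete(int i)        /* free of all remaining dependencies? */
{
    for (int j = 0; j < i; j++)
        if (dep[i][j])
            return false;
    return true;
}
```

For example, a store to 0x1000 followed by a load from 0x2000 and a load from 0x1008: after AGEN, the 0x2000 load runs independently, while the 0x1008 load (same 64-byte line as the store) stays ordered behind it.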
<
The conditional (temporal) cache holds the address and <at least a portion> of
the associated cache line. Any reference can hit on that data in CC even if it
is port or bank blocked in the DC. CC hits a bit over 50% of the time, effectively
reducing port/bank conflicts for lines that have been touched within the
lifetime of the temporal cache.
<
> Disabling this forwarding (and forcing a stall instead), tends to have a
> more noticeable adverse effect on performance (but is often needed to
> try to pass timing at higher clock speeds).

> Granted, I guess it is possible I could fiddle with the compiler to try
> to improve this situation, say:
> Only use MOV.X with a 16B alignment;
> Stagger even/odd stores in pairs when possible.

> Say, as opposed to:
> MOV.X R12, (SP, 160)
> MOV.X R10, (SP, 144)
> MOV.X R8, (SP, 128)
> MOV.X R30, (SP, 112)
> MOV.X R28, (SP, 96)
> MOV.X R26, (SP, 80)
> MOV.X R24, (SP, 64)
> Say:
> MOV.X R10, (SP, 144)
> MOV.X R12, (SP, 160)
> MOV.X R30, (SP, 112)
> MOV.X R8, (SP, 128)
> MOV.X R26, (SP, 80)
> MOV.X R28, (SP, 96)
> MOV.X R24, (SP, 64)
<
I will use this as an example as to Why you want save/restore instructions
in ISA::
a) so the compiler does not need to deal with ordering problems
b) so fewer instructions are produced.

> Say, to try to avoid two adjacent stores within the same 32-byte
> paired-line (for 64-bit load, one would try to avoid two adjacent stores
> within the same 32 bytes).
<
Solved by CC in my case.
<
>

Re: Memory dependency microbenchmark

<ed96637cbebec7131e08fd3e9a10442a@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=34868&group=comp.arch#34868

Newsgroups: comp.arch
Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Fri, 10 Nov 2023 01:42:15 +0000
Organization: novaBBS
Message-ID: <ed96637cbebec7131e08fd3e9a10442a@news.novabbs.com>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <yT91N.150749$HwD9.28213@fx11.iad> <2023Nov4.180132@mips.complang.tuwien.ac.at> <RfO1N.368064$w4ec.211729@fx14.iad> <ui8uvc$4c78$9@dont-email.me> <DUM2N.293169$2fS.117686@fx16.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="405462"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
X-Rslight-Site: $2y$10$pX13wAS8zdeaHuc2mku.De.ZTmbefTgdFEUCtswkJOT8b3CXw4r7K
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
 by: MitchAlsup - Fri, 10 Nov 2023 01:42 UTC

EricP wrote:

> Chris M. Thomasson wrote:
>> On 11/5/2023 7:09 AM, EricP wrote:
>>>
>>> Yes. My thought was that by targeting the same cache line you
>>> might be triggering alternate mechanisms that cause serialization.
>>>
>>> First was that x86-TSO coherence allows a younger load to bypass (execute
>>> before) an older store to a non-overlapping address, otherwise it is
>>> serial.
>>> The detection of "same address" could be as high resolution as 8-byte
>>> operand or as low as a cache line. So by targeting separate cache lines
>>> it could allow more load-store bypassing and concurrency.
>>>
>>> Also, as you noted, by targeting the same cache line it would serialize
>>> on the same cache bank port, if it has multiple banks.
>>>
>>> And I just suggested different pages to disable any "same page"
>>> virtual address translation optimizations (if there are any).
>>
>> Iirc, one can release a spinlock using an atomic store on x86, no
>> LOCK'ED RMW.

> Sure, because it's not a RMW. It's just a store which under x86-TSO
> becomes visible after prior stores to the protected section.

> This rule seems to impose some particular design complexities on the
> order in which a Load-Store Queue and cache can perform store operations.

> Say we have two cache lines A and B.

> If there is a store to cache line ST [A+0]
> then one to a different line ST [B+0], then another ST [A+1],
> and if the first ST [A+0] hits cache but the second ST [B+0] misses,
> then under TSO the third ST [A+1] must appear to stall so that it
> does not become visible until after the ST [B+0] has been performed,
> even though line A is in the cache.

> ST [A+0],r1 <- this hits cache
> ST [B+0],r2 <- this misses cache
> ST [A+1],r3 <- this waits for B to arrive and store to [B+0] to finish

> On core C1, if ST [A+1] was allowed to be performed before [B+0] then an
> invalidate msg might get in and transfer ownership of line A to a different
> core C2, allowing the new value of [A+1] to be visible at C2 before [B+0].

> In order to prevent this under TSO, either LSQ actually stalls ST [A+1],
<
If you have an MDM+TC == CC, the CPU can perform the ST into CC where it
awaits "ordering" while the external world is left believing it has not
started yet {This is 1991 technology}. CC can effectively eliminate ST
ordering stalls seen from the CPU while preserving all of the TSO-ness
the external observers need.
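For reference, the TSO-ness the external observers need is the property this
small litmus test checks: a reader that observes the younger store must also
observe the older one. This is only a portable sketch using C++
release/acquire atomics to model the guarantee, not the CC hardware itself;
all names are invented here.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0}, y{0};

bool older_store_visible_with_younger() {
    x.store(0); y.store(0);
    std::thread writer([] {
        x.store(1, std::memory_order_release);  // older store
        y.store(1, std::memory_order_release);  // younger store
    });
    int seen_x = -1;
    std::thread reader([&] {
        while (y.load(std::memory_order_acquire) == 0) {}  // wait for y
        seen_x = x.load(std::memory_order_acquire);        // must see x == 1
    });
    writer.join();
    reader.join();
    return seen_x == 1;
}
```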
<
> or it allows it to proceed to the cache but pins line A until the
> update to B is done. If it uses the second pinning approach then it
> must also deal with all the potential deadlock/livelock possibilities.
<
In a Conditional Cache, every instruction has (at least a portion of)
its associated Data Cache line. So every ST has a place to deposit its
data; and that place can be subject to backup and cancellation (based on
external stuff happening).
<
After the ST reaches the complete state (ready to retire), CC data is
migrated to DC data as porting and banking permit.
<
> And the cache access is pipelined so all of this is asynchronous.
> When ST [B+0] misses cache, ST [A+1] might already be in the pipeline.
> So even in the simple "stall until older stores done" approach it needs
> even more logic to detect this and NAK the following stores back to LSQ,
> and later wake them up and replay them when the ST [B+0] is done.

>> Btw, have you ever tried to implement hazard pointers on an
>> x86? It requires an explicit memory barrier.

> That lock-free stuff makes my brain hurt.

Re: Memory dependency microbenchmark

<7a05598d1200e6e8b71a01ee5c0035d4@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=34870&group=comp.arch#34870

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Fri, 10 Nov 2023 01:43:04 +0000
Organization: novaBBS
Message-ID: <7a05598d1200e6e8b71a01ee5c0035d4@news.novabbs.com>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="405656"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
X-Rslight-Site: $2y$10$ifo1DWxpUs4bLc4Ea37w1OqaszIy7JJQmqcK.Q/r.BTwq0Tuws5z.
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
 by: MitchAlsup - Fri, 10 Nov 2023 01:43 UTC

Anton Ertl wrote:

> I have written a microbenchmark for measuring how memory dependencies
> affect the performance of various microarchitectures.
<
Absolutely Brilliant.
<
Well Done, and Thanks.

Re: Memory dependency microbenchmark

<uikaks$2lcnt$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=34872&group=comp.arch#34872

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Thu, 9 Nov 2023 22:10:54 -0600
Organization: A noiseless patient Spider
Lines: 180
Message-ID: <uikaks$2lcnt$1@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<ui3s8b$2vc30$12@dont-email.me> <2023Nov4.184057@mips.complang.tuwien.ac.at>
<ui679v$3hfer$1@dont-email.me>
<22648de76fcb4b5ff4fdcaae1db56c99@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 10 Nov 2023 04:13:16 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="b3b72785317a8a803b52477c3548dfe9";
logging-data="2798333"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18Y4/V6BGdM84yEqL8LLI77"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:Vp/QxZwtiy4AAxWLPeIrkmv4pI0=
In-Reply-To: <22648de76fcb4b5ff4fdcaae1db56c99@news.novabbs.com>
Content-Language: en-US
 by: BGB - Fri, 10 Nov 2023 04:10 UTC

On 11/9/2023 7:36 PM, MitchAlsup wrote:
> BGB wrote:
>
>> On 11/4/2023 12:40 PM, Anton Ertl wrote:
>>> "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
>>>> On 11/3/2023 2:15 AM, Anton Ertl wrote:
>>>>> I have written a microbenchmark for measuring how memory dependencies
>>>>> affect the performance of various microarchitectures.  You can find it
>>>>> along with a description and results on
>>>>> <http://www.complang.tuwien.ac.at/anton/memdep/>.
>>>> [...]
>>>>
>>>> Is the only arch out there that does not require an explicit memory
>>>> barrier for data-dependent loads a DEC Alpha? I think so.
>>>
>>> I don't know any architecture that requires memory barriers for
>>> single-threaded programs that access just memory, not even Alpha.
>>>
>>> You may be thinking of the memory consistency model of Alpha, which is
>>> even weaker than everything else I know of.  This is not surprising,
>>> given that a prominent advocacy paper for weak consistency
>>> [adve&gharachorloo95] came out of DEC.
>>>
>
>> Hmm, one could maybe save some logic and also make timing a little
>> easier if they disallowed RAW and WAW (in the same cache line), and
>> made them undefined behavior...
> <
> Intractable in practice.
> <
> What you CAN do is to dynamically solve this problem in HW. By constructing
> a memory dependency matrix and not allowing a memory reference to complete
> unless it is safe to do so.
> <

Probably true.

My options were one of:
Forward results from an in-progress Store back to a following Load/Store
(more expensive, but faster);
Stall the pipeline until the prior Store can complete (cheaper but slower).

Granted, the "not bother" option could be cheaper still, but carries an
unreasonable level of risk (and in practice would likely mean that any
store through a free pointer followed by another memory access would end
up needing to use NOP padding or similar).

Or, handling it in the CPU, by generating an interlock stall whenever a
Store is followed by another memory operation. But, this would probably
be the worst possible option in terms of performance...

>> However, this would suck for software (and would effectively either
>> mandate strict stack alignment to keep prologs/epilogs working, or
>> require saving registers in a convoluted order). And, in my case, I
>> can tell that code steps on these scenarios very often (particularly
>> in prolog sequences).
> <
> It only suck when there is a REAL dependency and saves you neck every time
> a dependency is not found (mid-to-high 90%-ile).
> <

OK.

>> This is why I bother with a "somewhat expensive" feature of forwarding
>> stored cache lines back to load in cases where one is accessing a
>> "just accessed" cache line.
> <
> I use a MDM (above) attached to a temporal cache.
> <
> When an instruction is inserted in the MDM/TC==CC, it is made dependent
> on all instructions that are already in the CC that have not completed,
> and is installed in the Unknown Address state.
> <
> When an address is AGENed, the address is sent to the data cache and a
> portion is sent to the CC, and that portion CAMs against all the other
> addresses. The result of the comparison is either "Is the same line" or
> "Can't be the same line", and the state changes to Known Address.
> <
> While the compares are taking place, the MDM is relaxed. Any entry that
> "can't be the same line" is removed as a dependency. When all dependencies
> have been removed, the instruction can complete.
> <
> If either CC or DC returns with data, the state advances to Known Data.
> Stores with Known Data are allowed to complete into CC. Modified CC data
> migrates back to DC when the ST instruction becomes retireable.
> <
> At this point the ST has completed as seen by the CPU, but has not
> completed as seen by "interested Observers", and is subject to prediction
> repair, AGEN replay, and those rare situations where one changes the
> memory ordering model {ATOMICs, MMI/O, Config space}. This is the key
> point--the CPU can think "it's done" while the external world can think
> "It has not happened yet".
> <
> Thus, memory references that alias on a cache line basis are performed
> in order, while those that are not run independently.
> <
> The conditional (temporal) cache holds the address and <at least a
> portion> of the associated cache line. Any reference can hit on that data
> in CC even if it is port- or bank-blocked in the DC. CC hits a bit over
> 50% of the time, effectively reducing port/bank conflicts on lines that
> have been touched within the lifetime of the temporal cache.
> <

In my pipeline, it is simpler...

Just, the L1 cache sees that the next stage holds the result of a store
to the same cache line, and either forwards the result or generates a
stall until the store completes.

Where, forwarding is faster, but stalling is cheaper.
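For what it's worth, the MDM relaxation quoted above can be sketched in
software roughly like this (structure and names invented here; real hardware
does the compares with CAMs in parallel, and removal of completed ops is
omitted for brevity):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Each new memory op enters depending on every op already in the CC and
// starts in the Unknown Address state; when its address is AGENed, any
// dependency that "can't be the same line" is dropped, and the op may
// complete once no dependencies remain.
struct MemOp {
    bool addr_known = false;
    uint64_t line = 0;            // cache-line address, valid once AGENed
    std::vector<int> deps;        // indices of older ops this one waits on
};

struct Mdm {
    std::vector<MemOp> ops;

    int insert() {                // depend on every op already present
        MemOp op;
        for (int i = 0; i < (int)ops.size(); ++i) op.deps.push_back(i);
        ops.push_back(op);
        return (int)ops.size() - 1;
    }

    void agen(int idx, uint64_t line) {  // address known: relax the row
        ops[idx].addr_known = true;
        ops[idx].line = line;
        auto &d = ops[idx].deps;
        d.erase(std::remove_if(d.begin(), d.end(), [&](int j) {
                    return ops[j].addr_known && ops[j].line != line;
                }), d.end());
    }

    bool can_complete(int idx) const { return ops[idx].deps.empty(); }
};
```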

>> Disabling this forwarding (and forcing a stall instead), tends to have
>> a more noticeable adverse effect on performance (but is often needed
>> to try to pass timing at higher clock speeds).
>
>> Granted, I guess it is possible I could fiddle with the compiler to
>> try to improve this situation, say:
>>    Only use MOV.X with a 16B alignment;
>>    Stagger even/odd stores in pairs when possible.
>
>> Say, as opposed to:
>>    MOV.X R12, (SP, 160)
>>    MOV.X R10, (SP, 144)
>>    MOV.X R8,  (SP, 128)
>>    MOV.X R30, (SP, 112)
>>    MOV.X R28, (SP, 96)
>>    MOV.X R26, (SP, 80)
>>    MOV.X R24, (SP, 64)
>> Say:
>>    MOV.X R10, (SP, 144)
>>    MOV.X R12, (SP, 160)
>>    MOV.X R30, (SP, 112)
>>    MOV.X R8,  (SP, 128)
>>    MOV.X R26, (SP, 80)
>>    MOV.X R28, (SP, 96)
>>    MOV.X R24, (SP, 64)
> <
> I will use this as an example as to Why you want save/restore instructions
> in ISA::
> a) so the compiler does not need to deal with ordering problems
> b) so fewer instructions are produced.
>

Still haven't gotten around to this, but it could potentially help
performance with the cheaper cache options (namely, configurations
without the internal forwarding).

But, yeah, the compiler does still need to deal with this sometimes.

>> Say, to try to avoid two adjacent stores within the same 32-byte
>> paired-line (for 64-bit load, one would try to avoid two adjacent
>> stores within the same 32 bytes).
> <
> Solved by CC in my case.

OK.

> <
>>

Re: Memory dependency microbenchmark

<uim5g0$30pg4$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=34886&group=comp.arch#34886

Path: i2pn2.org!i2pn.org!paganini.bofh.team!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Fri, 10 Nov 2023 12:57:36 -0800
Organization: A noiseless patient Spider
Lines: 66
Message-ID: <uim5g0$30pg4$1@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<yT91N.150749$HwD9.28213@fx11.iad>
<2023Nov4.180132@mips.complang.tuwien.ac.at>
<RfO1N.368064$w4ec.211729@fx14.iad> <ui8uvc$4c78$9@dont-email.me>
<DUM2N.293169$2fS.117686@fx16.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 10 Nov 2023 20:57:37 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="a600364f4e67105389993083039debad";
logging-data="3171844"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+EuSrbHw2CDSvRrEySs0pvfAUQSoYTDuo="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:cWaluA8vmy8zJBuT/BYJBLW8inE=
In-Reply-To: <DUM2N.293169$2fS.117686@fx16.iad>
Content-Language: en-US
 by: Chris M. Thomasson - Fri, 10 Nov 2023 20:57 UTC

On 11/8/2023 6:26 AM, EricP wrote:
> Chris M. Thomasson wrote:
>> On 11/5/2023 7:09 AM, EricP wrote:
>>>
>>> Yes. My thought was that by targeting the same cache line you
>>> might be triggering alternate mechanisms that cause serialization.
>>>
>>> First was that x86-TSO coherence allows a younger load to bypass
>>> (execute
>>> before) an older store to a non-overlapping address, otherwise it is
>>> serial.
>>> The detection of "same address" could be as high resolution as 8-byte
>>> operand or as low as a cache line. So by targeting separate cache lines
>>> it could allow more load-store bypassing and concurrency.
>>>
>>> Also, as you noted, by targeting the same cache line it would serialize
>>> on the same cache bank port, if it has multiple banks.
>>>
>>> And I just suggested different pages to disable any "same page"
>>> virtual address translation optimizations (if there are any).
>>
>> Iirc, one can release a spinlock using an atomic store on x86, no
>> LOCK'ED RMW.
>
> Sure, because it's not a RMW. It's just a store which under x86-TSO
> becomes visible after prior stores to the protected section.
>
> This rule seems to impose some particular design complexities on the
> order in which a Load-Store Queue and cache can perform store operations.
>
> Say we have two cache lines A and B.
>
> If there is a store to cache line ST [A+0]
> then one to a different line ST [B+0], then another ST [A+1],
> and if the first ST [A+0] hits cache but the second ST [B+0] misses,
> then under TSO the third ST [A+1] must appear to stall so that it
> does not become visible until after the ST [B+0] has been performed,
> even though line A is in the cache.
>
>   ST [A+0],r1   <- this hits cache
>   ST [B+0],r2   <- this misses cache
>   ST [A+1],r3   <- this waits for B to arrive and store to [B+0] to finish
>
> On core C1, if ST [A+1] was allowed to be performed before [B+0] then an
> invalidate msg might get in and transfer ownership of line A to a different
> core C2, allowing the new value of [A+1] to be visible at C2 before [B+0].
>
> In order to prevent this under TSO, either LSQ actually stalls ST [A+1],
> or it allows it to proceed to the cache but pins line A until the
> update to B is done. If it uses the second pinning approach then it
> must also deal with all the potential deadlock/livelock possibilities.
>
> And the cache access is pipelined so all of this is asynchronous.
> When ST [B+0] misses cache, ST [A+1] might already be in the pipeline.
> So even in the simple "stall until older stores done" approach it needs
> even more logic to detect this and NAK the following stores back to LSQ,
> and later wake them up and replay them when the ST [B+0] is done.
>
>> Btw, have you ever tried to implement hazard pointers on an x86? It
>> requires an explicit memory barrier.
>
> That lock-free stuff makes my brain hurt.

Iirc, hazard pointers require that a store followed by a load to another
location be honored in that order. This requires a membar on x86.
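A minimal sketch of that store-then-validate protocol (all names here,
g_shared and g_hazard, are invented for illustration, not from any particular
hazard-pointer library). The store that publishes the hazard pointer must be
ordered before the load that re-validates it; store-load order is exactly
what x86-TSO does not give for free, hence the full fence (MFENCE or a
lock'ed RMW in practice):

```cpp
#include <atomic>
#include <cassert>

std::atomic<int*> g_shared{nullptr};  // pointer being protected
std::atomic<int*> g_hazard{nullptr};  // this thread's hazard-pointer slot

int* acquire_hazard() {
    int* p;
    do {
        p = g_shared.load(std::memory_order_relaxed);
        g_hazard.store(p, std::memory_order_relaxed);        // publish (store)
        std::atomic_thread_fence(std::memory_order_seq_cst); // the membar
    } while (p != g_shared.load(std::memory_order_relaxed)); // validate (load)
    return p;  // now safe: reclaimers scanning g_hazard will see p
}
```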

Re: Memory dependency microbenchmark

<1fb02e8e9f97392a42c47daf0b4b5145@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=34890&group=comp.arch#34890

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Fri, 10 Nov 2023 23:35:55 +0000
Organization: novaBBS
Message-ID: <1fb02e8e9f97392a42c47daf0b4b5145@news.novabbs.com>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <yT91N.150749$HwD9.28213@fx11.iad> <2023Nov4.180132@mips.complang.tuwien.ac.at> <RfO1N.368064$w4ec.211729@fx14.iad> <ui8uvc$4c78$9@dont-email.me> <DUM2N.293169$2fS.117686@fx16.iad> <uim5g0$30pg4$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="503707"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
X-Rslight-Site: $2y$10$KOkvCAnkk.SuxkB94pHQTeC6PRsPtYgZ03LQ5vRyHcDaW1iLeEzTe
 by: MitchAlsup - Fri, 10 Nov 2023 23:35 UTC

Chris M. Thomasson wrote:

> On 11/8/2023 6:26 AM, EricP wrote:
>
>>
>>> Btw, have you ever tried to implement hazard pointers on an x86? It
>>> requires an explicit memory barrier.
>>
>> That lock-free stuff makes my brain hurt.

> Iirc, hazard pointers require a store followed by a load to another
> location to be honored. This requires a membar on x86.
<
It is stuff like this that made My 66000 architecture define changes
to memory order based on several pieces of state (thus no membars)
<
Accesses to ROM are unordered
Accesses to config space are strongly ordered
Accesses to MMI/O space are sequentially consistent
Participating accesses (ATOMIC) are sequentially consistent
everything else is causal.
And the HW tracks this on a per memory reference basis--in effect
all orderings are in effect all the time.
<
Performance guys get what they want,
Lamport guys (atomic) get what they want
Device drivers get what they want on the accesses that need it
...
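A sketch of that per-reference selection as a lookup (the enum names are
invented for illustration; in My 66000 the hardware derives this for every
memory reference, which is why no membar instructions are needed):

```cpp
#include <cassert>

enum class Space { ROM, Config, MMIO, Atomic, Other };
enum class Order { Unordered, StronglyOrdered, SeqCst, Causal };

Order order_for(Space s) {
    switch (s) {
        case Space::ROM:    return Order::Unordered;       // accesses to ROM
        case Space::Config: return Order::StronglyOrdered; // config space
        case Space::MMIO:   return Order::SeqCst;          // MMI/O space
        case Space::Atomic: return Order::SeqCst;          // participating (ATOMIC)
        default:            return Order::Causal;          // everything else
    }
}
```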

Re: Memory dependency microbenchmark

<f84329c7e7d27c044cba9753bbc1140f@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=34891&group=comp.arch#34891

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Fri, 10 Nov 2023 23:31:20 +0000
Organization: novaBBS
Message-ID: <f84329c7e7d27c044cba9753bbc1140f@news.novabbs.com>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <ui3s8b$2vc30$12@dont-email.me> <2023Nov4.184057@mips.complang.tuwien.ac.at> <ui679v$3hfer$1@dont-email.me> <22648de76fcb4b5ff4fdcaae1db56c99@news.novabbs.com> <uikaks$2lcnt$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="503707"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
X-Rslight-Site: $2y$10$Azm6C2pHWeg4ZNKTxbSWBu5/P4uM0Biq/NiOfbnMn/OeARHW31aY6
X-Spam-Level: *
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
 by: MitchAlsup - Fri, 10 Nov 2023 23:31 UTC

BGB wrote:

> On 11/9/2023 7:36 PM, MitchAlsup wrote:

>>> Granted, I guess it is possible I could fiddle with the compiler to
>>> try to improve this situation, say:
>>>    Only use MOV.X with a 16B alignment;
>>>    Stagger even/odd stores in pairs when possible.
>>
>>> Say, as opposed to:
>>>    MOV.X R12, (SP, 160)
>>>    MOV.X R10, (SP, 144)
>>>    MOV.X R8,  (SP, 128)
>>>    MOV.X R30, (SP, 112)
>>>    MOV.X R28, (SP, 96)
>>>    MOV.X R26, (SP, 80)
>>>    MOV.X R24, (SP, 64)
>>> Say:
>>>    MOV.X R10, (SP, 144)
>>>    MOV.X R12, (SP, 160)
>>>    MOV.X R30, (SP, 112)
>>>    MOV.X R8,  (SP, 128)
>>>    MOV.X R26, (SP, 80)
>>>    MOV.X R28, (SP, 96)
>>>    MOV.X R24, (SP, 64)
>> <
>> I will use this as an example as to Why you want save/restore instructions
>> in ISA::
>> a) so the compiler does not need to deal with ordering problems
>> b) so fewer instructions are produced.
>>

> Still haven't gotten around to this, but it could potentially help
> performance with the cheaper cache options (namely, configurations
> without the internal forwarding).
<
Imagine a scenario where you are returning from one function call only
to call another function*. EXIT performs all the state reload (and the
RET), and the return IP is read first so the front end can fetch-decode
to feed the machine..... (*) possibly moving registers
around to form the argument list.
<
NOW imagine that the front end encounters a CALL and the instruction
at the target of the CALL is ENTER.
a) the restore process can stop-or-complete
b) the save process can mostly be skipped since memory already contains
the correct bit patterns (My 66000 ABI)
<
Saving cycles in ways a compiler cannot.....
<
> But, yeah, the compiler does still need to deal with this sometimes.

Re: Memory dependency microbenchmark

<uimulp$38v2q$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=34893&group=comp.arch#34893

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Fri, 10 Nov 2023 20:07:21 -0800
Organization: A noiseless patient Spider
Lines: 32
Message-ID: <uimulp$38v2q$1@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<yT91N.150749$HwD9.28213@fx11.iad>
<2023Nov4.180132@mips.complang.tuwien.ac.at>
<RfO1N.368064$w4ec.211729@fx14.iad> <ui8uvc$4c78$9@dont-email.me>
<DUM2N.293169$2fS.117686@fx16.iad> <uim5g0$30pg4$1@dont-email.me>
<1fb02e8e9f97392a42c47daf0b4b5145@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 11 Nov 2023 04:07:21 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="b95bdab731045aab656ae9b9e9ad38d8";
logging-data="3439706"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/q6XJQLVV7ltMGttqMDlXNfMx2meyJR2I="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:CqMAG8fqPZdZBwsimoweDZu5mE4=
In-Reply-To: <1fb02e8e9f97392a42c47daf0b4b5145@news.novabbs.com>
Content-Language: en-US
 by: Chris M. Thomasson - Sat, 11 Nov 2023 04:07 UTC

On 11/10/2023 3:35 PM, MitchAlsup wrote:
> Chris M. Thomasson wrote:
>
>> On 11/8/2023 6:26 AM, EricP wrote:
>>
>>>
>>>> Btw, have you ever tried to implement hazard pointers on an x86? It
>>>> requires an explicit memory barrier.
>>>
>>> That lock-free stuff makes my brain hurt.
>
>> Iirc, hazard pointers require a store followed by a load to another
>> location to be honored. This requires a membar on x86.
> <
> It is stuff like this that made My 66000 architecture define changes to
> memory order based on several pieces of state (thus no membars)
> <
> Accesses to ROM are unordered
> Accesses to config space are strongly ordered
> Accesses to MMI/O space are sequentially consistent
> Participating accesses (ATOMIC) are sequentially consistent
> everything else is causal.
> And the HW tracks this on a per memory reference basis--in effect
> all orderings are in effect all the time.
> <
> Performance guys get what they want,
> Lamport guys (atomic) get what they want
> Device drivers get what they want on the accesses that need it
> ..

Nice! Well, there is a way to avoid the explicit membar on x86. It
involves a marriage of RCU and Hazard Pointers.

Re: Memory dependency microbenchmark

<QAS3N.14349$Ee89.322@fx17.iad>

https://news.novabbs.org/devel/article-flat.php?id=34910&group=comp.arch#34910

Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx17.iad.POSTED!not-for-mail
From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <yT91N.150749$HwD9.28213@fx11.iad> <2023Nov4.180132@mips.complang.tuwien.ac.at> <RfO1N.368064$w4ec.211729@fx14.iad> <ui8uvc$4c78$9@dont-email.me> <DUM2N.293169$2fS.117686@fx16.iad> <ed96637cbebec7131e08fd3e9a10442a@news.novabbs.com>
In-Reply-To: <ed96637cbebec7131e08fd3e9a10442a@news.novabbs.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 118
Message-ID: <QAS3N.14349$Ee89.322@fx17.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Sat, 11 Nov 2023 21:44:16 UTC
Date: Sat, 11 Nov 2023 16:43:54 -0500
X-Received-Bytes: 6132
 by: EricP - Sat, 11 Nov 2023 21:43 UTC

MitchAlsup wrote:
> EricP wrote:
>
>> Chris M. Thomasson wrote:
>>> On 11/5/2023 7:09 AM, EricP wrote:
>>>>
>>>> Yes. My thought was that by targeting the same cache line you
>>>> might be triggering alternate mechanisms that cause serialization.
>>>>
>>>> First was that x86-TSO coherence allows a younger load to bypass
>>>> (execute
>>>> before) an older store to a non-overlapping address, otherwise it is
>>>> serial.
>>>> The detection of "same address" could be as high resolution as 8-byte
>>>> operand or as low as a cache line. So by targeting separate cache lines
>>>> it could allow more load-store bypassing and concurrency.
>>>>
>>>> Also, as you noted, by targeting the same cache line it would serialize
>>>> on the same cache bank port, if it has multiple banks.
>>>>
>>>> And I just suggested different pages to disable any "same page"
>>>> virtual address translation optimizations (if there are any).
>>>
>>> Iirc, one can release a spinlock using an atomic store on x86, no
>>> LOCK'ED RMW.
>
>> Sure, because it's not a RMW. It's just a store which under x86-TSO
>> becomes visible after prior stores to the protected section.
>
>> This rule seems to impose some particular design complexities on the
>> order in which a Load-Store Queue and cache can perform store operations.
>
>> Say we have two cache lines A and B.
>
>> If there is a store to cache line ST [A+0]
>> then one to a different line ST [B+0], then another ST [A+1],
>> and if the first ST [A+0] hits cache but the second ST [B+0] misses,
>> then under TSO the third ST [A+1] must appear to stall so that it
>> does not become visible until after the ST [B+0] has been performed,
>> even though line A is in the cache.
>
>> ST [A+0],r1 <- this hits cache
>> ST [B+0],r2 <- this misses cache
>> ST [A+1],r3 <- this waits for B to arrive and store to [B+0] to
>> finish
>
>> On core C1, if ST [A+1] was allowed to be performed before [B+0] then an
>> invalidate msg might get in and transfer ownership of line A to a
>> different
>> core C2, allowing the new value of [A+1] to be visible at C2 before
>> [B+0].
>
>> In order to prevent this under TSO, either LSQ actually stalls ST [A+1],
> <
> If you have an MDM+TC == CC, the CPU can perform the ST into CC where it
> awaits "ordering" while the external world is left believing it has not
> started yet {This is 1991 technology}. CC can effectively eliminate ST
> ordering stalls seen from the CPU while preserving all of the TSO-ness
> the external observers need.

MDM = Memory Dependency Matrix
TC = Temporal Cache
CC = Conditional Cache

The Conditional Cache sounds similar to the Store Buffers that other
designs refer to but with multiple versions, as you outline below.
This seems to duplicate many existing functions of the Load Store Queue.

I'm thinking it would be simpler to keep everything in one circular
load/store queue and hold store data there after retire until it
can be handed off to the cache. See below.

> <
>> or it allows it to proceed to the cache but pins line A until the
>> update to B is done. If it uses the second pinning approach then it
>> must also deal with all the potential deadlock/livelock possibilities.
> <
> In a Conditional Cache, every instruction has (at least a portion of)
> its associated Data Cache line. So every ST has a place to deposit its
> data; and that place can be subject to backup and cancellation (based on
> external stuff happening).
> <
> After the ST reaches the complete state (ready to retire), CC data is
> migrated to DC data as porting and banking permit.
> <

Yes, but that seems a rather expensive approach.
The problem I have with the CC is that it requires some pretty complex
logic to track multiple versions of cache lines, journal before-image copies,
track their inter-dependencies, and decide when to commit updates.
And much of this duplicates functionality that the LSQ already has to
support store-to-load forwarding and load-store bypass address matching.
All those buffers need free lists and allocators, another dependency matrix,
and CAMs to match addresses to the assigned buffers.

It's not clear to me that all this is significantly better than
a simpler approach.

Instead I was thinking of having a unified LSQ as a single circular buffer
with all the load and store entries in (circular) program order.
Store data is held in the LSQ after the store instruction retires
until it is accepted by the cache. While it is in the LSQ it can still be
forwarded to younger loads to the same address.

LSQ stores are sent to the pipelined cache which indicates back to LSQ
a hit/miss after the pipeline latency.
If it hits then the entry is removed from LSQ.
If it misses then the store data is held in LSQ and cache blocks further
stores until the miss resolves. However the LSQ continues to send future
store addresses to trigger line prefetches until all miss buffers are busy.
If a load misses then all further loads must stop because
load-load bypassing is not allowed under TSO.

When the missed line arrives, cache sends a wakeup signal to LSQ
which restarts sending entries from where it left off.
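The unified-LSQ scheme above can be sketched in software: entries sit in a single buffer in program order, a load first searches backward for the youngest older store to the same address (store-to-load forwarding), and only falls through to the cache on no match. This is a behavioral sketch only; the names (`LsqEntry`, `UnifiedLsq`) are invented for illustration, and real hardware would also handle partial overlap, access size, and retirement state.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// One load/store queue entry; stores carry data until the cache
// accepts them, even after the store instruction retires.
struct LsqEntry {
    bool     is_store;
    uint64_t addr;
    uint64_t data;   // valid for stores only
};

class UnifiedLsq {
public:
    void push_store(uint64_t addr, uint64_t data) {
        q_.push_back({true, addr, data});
    }
    // Store-to-load forwarding: scan from youngest to oldest for an
    // older store to the same address. nullopt means "go to the cache".
    std::optional<uint64_t> load(uint64_t addr) {
        for (auto it = q_.rbegin(); it != q_.rend(); ++it)
            if (it->is_store && it->addr == addr)
                return it->data;
        return std::nullopt;
    }
    // The cache accepted the oldest store (a hit): drop it from the LSQ.
    void accept_oldest() { if (!q_.empty()) q_.erase(q_.begin()); }
private:
    std::vector<LsqEntry> q_;  // oldest at front, youngest at back
};
```

Note the exact-address match here is a simplification; as discussed earlier in the thread, real designs may match at anywhere from 8-byte operand up to cache-line resolution.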

Re: Memory dependency microbenchmark

<21fcdf1d3ff4ba1115b5a9382990afd4@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=34918&group=comp.arch#34918

From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Sat, 11 Nov 2023 23:32:30 +0000
Organization: novaBBS
Message-ID: <21fcdf1d3ff4ba1115b5a9382990afd4@news.novabbs.com>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <yT91N.150749$HwD9.28213@fx11.iad> <2023Nov4.180132@mips.complang.tuwien.ac.at> <RfO1N.368064$w4ec.211729@fx14.iad> <ui8uvc$4c78$9@dont-email.me> <DUM2N.293169$2fS.117686@fx16.iad> <ed96637cbebec7131e08fd3e9a10442a@news.novabbs.com> <QAS3N.14349$Ee89.322@fx17.iad>
 by: MitchAlsup - Sat, 11 Nov 2023 23:32 UTC

EricP wrote:

> MitchAlsup wrote:
>> EricP wrote:
>>
>>> Chris M. Thomasson wrote:
>>>> On 11/5/2023 7:09 AM, EricP wrote:
>>>>>
>>>>> Yes. My thought was that by targeting the same cache line you
>>>>> might be triggering alternate mechanisms that cause serialization.
>>>>>
>>>>> First was that x86-TSO coherence allows a younger load to bypass
>>>>> (execute
>>>>> before) an older store to a non-overlapping address, otherwise it is
>>>>> serial.
>>>>> The detection of "same address" could be as high resolution as 8-byte
>>>>> operand or as low as a cache line. So by targeting separate cache lines
>>>>> it could allow more load-store bypassing and concurrency.
>>>>>
>>>>> Also, as you noted, by targeting the same cache line it would serialize
>>>>> on the same cache bank port, if it has multiple banks.
>>>>>
>>>>> And I just suggested different pages to disable any "same page"
>>>>> virtual address translation optimizations (if there are any).
>>>>
>>>> Iirc, one can release a spinlock using an atomic store on x86, no
>>>> LOCK'ED RMW.
>>
>>> Sure, because its not a RMW. Its just a store which under x86-TSO
>>> becomes visible after prior stores to the protected section.
>>
>>> This rule seems to impose some particular design complexities on the
>>> order in which a Load-Store Queue and cache can perform store operations.
>>
>>> Say we have two cache lines A and B.
>>
>>> If there is a store to cache line ST [A+0]
>>> then to a different line ST [B+0], then a another ST [A+1],
>>> and if the first ST [A+0] hits cache but second ST [B+0] misses,
>>> then under TSO the third ST [A+1] must appear to stall so that it
>>> does not become visible until after the ST [B+0] has been performed,
>>> even though line A is in the cache.
>>
>>> ST [A+0],r1 <- this hits cache
>>> ST [B+0],r2 <- this misses cache
>>> ST [A+1],r3 <- this waits for B to arrive and store to [B+0] to
>>> finish
>>
>>> On core C1, if ST [A+1] was allowed to be performed before [B+0] then an
>>> invalidate msg might get in and transfer ownership of line A to a
>>> different
>>> core C2, allowing the new value of [A+1] to be visible at C2 before
>>> [B+0].
>>
>>> In order to prevent this under TSO, either LSQ actually stalls ST [A+1],
>> <
>> If you have an MDM+TC == CC, the CPU can perform the ST into CC where it
>> awaits "ordering" while the external world is left believing it has not
>> started yet {This is 1991 technology}. CC can effectively eliminate ST
>> ordering stalls seen from the CPU while preserving all of the TSO-ness
>> the external observers need.

> MDM = Memory Dependency Matrix
> TC = Temporal Cache
> CC = Conditional Cache

> The Conditional Cache sounds similar to the Store Buffers that other
> designs refer to but with multiple versions, as you outline below.
> This seems to duplicate many existing functions of the Load Store Queue.

> I'm thinking it would be simpler to keep everything in one circular
> load/store queue and hold store data there after retire until it
> can be handed off to the cache. See below.
<
a) it is circular and tied intimately to the execution window.
b) it integrates the LDs into the queue {in case a store has to be
replayed, all dependent LDs get replayed too, and for solving memory
order problems dynamically}.

>> <
>>> or it allows it to proceed to the cache but pins line A until the
>>> update to B is done. If it uses the second pinning approach then it
>>> must also deal with all the potential deadlock/livelock possibilities.
>> <
>> In a conditional Cache, every instruction has (at least a portion of)
>> its Data Cache associated line. So every ST has a place to deposit its
>> data; and that place can be subject to backup and cancellation (based on
>> external stuff happening).
>> <
>> After the ST reaches the complete state (ready to retire), CC data is
>> migrated to DC data as porting and banking permit.
>> <

> Yes, but that seems a rather expensive approach.
> The problem I have with the CC is that it requires some pretty complex
> logic to track multiple versions of cache lines, journal before-image copies,
> track their inter-dependencies, and decide when to commit updates.
<
We did not track all that stuff (Mc 88120, circa 1991). What we did was
copy data from the cache/memory into everybody needing that data; then
stores could overwrite that data in the CC entries of younger accesses, so
most of the STs did not actually write into the DC because we knew a
younger store would do that for us.
<
> And much of this duplicates functionality that the LSQ already has to
> support store-to-load forwarding and load-store bypass address matching.
<
The CC is the LD/ST queue, but by integrating certain features, most of the
control logic vanishes, and the component "pretty much" manages itself. It
does speak back to the Execution window logic to control the advancement of
the consistent point and the retire point of the window.
<
> All those buffers need free lists and allocators, another dependency matrix,
> and CAMs to match addresses to the assigned buffers.
<
Nope:: dedicated storage. If the storage is not available, insert stalls.
<
> It's not clear to me that all this is significantly better than
> a simpler approach.
<
Think of it like a reservation station for a horizon of memory reference
instructions. Instead of CAMing on renamed registers, you CAM on a portion
of the VA from the CPU side, and on the PA for the snooping side. The MDM
handles the memory renaming in a way that can be undone {branch prediction
repair, and a snoop discovering that memory order has been violated}.

> Instead I was thinking of having a unified LSQ as a single circular buffer
> with all the load and store entries in (circular) program order.
<
Yep, that is what a CC does. However, once 2 addresses discover "we cannot
refer to the same cache line container" they become independent.
<
> Store data is held in the LSQ after the store instruction retires
<
The ST must deliver its data to DC before the ST can retire. Until the ST
delivers its data to DC, the ST is subject to being replayed upon detection
of memory order violations.
<
> until it is accepted by the cache. While it is in the LSQ it can still be
> forwarded to younger loads to the same address.

> LSQ stores are sent to the pipelined cache which indicates back to LSQ
> a hit/miss after the pipeline latency.
<
And hit data is written into all CC entries waiting for it.
<
> If it hits then the entry is removed from LSQ.
<
Entries are inserted into CC at issue and removed from CC at retire. They
remain replayable until consistent.
<
> If it misses then the store data is held in LSQ and cache blocks further
> stores until the miss resolves. However LSQ continues to send future
<
Mc 88120 kept byte writes in the CC.data so STs could write data into CC
even before data arrives from the memory hierarchy.
<
> store addresses to trigger line prefetches until all miss buffers are busy.
<
The CC is the Miss Buffer !! and entries that missed but got translated
are selected for external access in much the same way that instructions
are selected from reservation stations.
<
> If a load misses then all further loads must stop because
> load-load bypassing is not allowed under TSO.
<
But the MDM effectively allows multiple LDs to miss and be serviced
independently {except config, MMI/O, ATOMIC} and repairs back to TSO
only if an external event requires actual TSO behavior.
<
> When the missed line arrives, cache sends a wakeup signal to LSQ
> which restarts sending entries from where it left off.
<
When miss data returns with the PA, a portion of the PA is used and
everybody waiting on the returning data snarfs it up in the same write
(into CC) cycle.
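The fill-time "snarf" described here can be sketched behaviorally: missed entries are parked per physical line address, and when the line arrives every waiter on that address captures the data in the same fill cycle. The names (`CcEntry`, `MissTracker`) are invented for illustration; this is not the Mc 88120 design itself, just the wake-everyone-at-once idea.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// A conditional-cache entry waiting on a line fill.
struct CcEntry {
    uint64_t line_pa;       // physical address of the missed line
    bool     ready = false; // set when the fill satisfies this entry
    uint64_t data  = 0;
};

class MissTracker {
public:
    // Park an entry that missed; the CC itself is the miss buffer.
    void park(CcEntry* e) { waiters_[e->line_pa].push_back(e); }

    // Line `pa` returns from the hierarchy: every waiter matching that
    // PA snarfs the data in this one fill "cycle". Returns the count.
    size_t fill(uint64_t pa, uint64_t data) {
        auto it = waiters_.find(pa);
        if (it == waiters_.end()) return 0;
        for (CcEntry* e : it->second) { e->data = data; e->ready = true; }
        size_t n = it->second.size();
        waiters_.erase(it);
        return n;
    }
private:
    std::unordered_map<uint64_t, std::vector<CcEntry*>> waiters_;
};
```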

Re: Memory dependency microbenchmark

<79bcf6d10b3c231ba7e3693bad8db3c1@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=34932&group=comp.arch#34932

From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Sun, 12 Nov 2023 19:13:07 +0000
Organization: novaBBS
Message-ID: <79bcf6d10b3c231ba7e3693bad8db3c1@news.novabbs.com>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <yT91N.150749$HwD9.28213@fx11.iad> <2023Nov4.180132@mips.complang.tuwien.ac.at> <RfO1N.368064$w4ec.211729@fx14.iad> <ui8uvc$4c78$9@dont-email.me> <DUM2N.293169$2fS.117686@fx16.iad> <uim5g0$30pg4$1@dont-email.me> <1fb02e8e9f97392a42c47daf0b4b5145@news.novabbs.com> <uimulp$38v2q$1@dont-email.me>
 by: MitchAlsup - Sun, 12 Nov 2023 19:13 UTC

Chris M. Thomasson wrote:

> On 11/10/2023 3:35 PM, MitchAlsup wrote:
>> Chris M. Thomasson wrote:
>>
>>> On 11/8/2023 6:26 AM, EricP wrote:
>>>
>>>>
>>>>> Btw, have you ever tried to implement hazard pointers on an x86? It
>>>>> requires an explicit memory barrier.
>>>>
>>>> That lock-free stuff makes my brain hurt.
>>
>>> Iirc, hazard pointers require a store followed by a load to another
>>> location to be honored. This requires a membar on x86.
>> <
>> It is stuff like this that made My 66000 architecture define changes to
>> memory order based on several pieces of state (thus no membars)
>> <
>> Accesses to ROM are unordered
>> Accesses to config space is strongly ordered
>> Accesses to MMI/O space is sequentially consistent
>> Participating accesses (ATOMIC) are sequentially consistent
>> everything else is causal.
>> And the HW tracks this on a per memory reference basis--in effect
>> all orderings are in effect all the time.
>> <
>> Performance guys get what they want,
>> Lamport guys (atomic) get what they want
>> Device drivers get what they want on the accesses that need it
>> ..

> Nice! Well, there is a way to avoid the explicit membar on x86. It
> involves a marriage of RCU and Hazard Pointers.
<
I watched membar evolve during my time at AMD and decided to make a
machine that never needs to use them--but still gets the right thing
done.
<

Re: Memory dependency microbenchmark

<uiri0a$85mp$2@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=34940&group=comp.arch#34940

From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Sun, 12 Nov 2023 14:01:46 -0800
Organization: A noiseless patient Spider
Lines: 49
Message-ID: <uiri0a$85mp$2@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<yT91N.150749$HwD9.28213@fx11.iad>
<2023Nov4.180132@mips.complang.tuwien.ac.at>
<RfO1N.368064$w4ec.211729@fx14.iad> <ui8uvc$4c78$9@dont-email.me>
<DUM2N.293169$2fS.117686@fx16.iad> <uim5g0$30pg4$1@dont-email.me>
<1fb02e8e9f97392a42c47daf0b4b5145@news.novabbs.com>
<uimulp$38v2q$1@dont-email.me>
<79bcf6d10b3c231ba7e3693bad8db3c1@news.novabbs.com>
 by: Chris M. Thomasson - Sun, 12 Nov 2023 22:01 UTC

On 11/12/2023 11:13 AM, MitchAlsup wrote:
> Chris M. Thomasson wrote:
>
>> On 11/10/2023 3:35 PM, MitchAlsup wrote:
>>> Chris M. Thomasson wrote:
>>>
>>>> On 11/8/2023 6:26 AM, EricP wrote:
>>>>
>>>>>
>>>>>> Btw, have you ever tried to implement hazard pointers on an x86?
>>>>>> It requires an explicit memory barrier.
>>>>>
>>>>> That lock-free stuff makes my brain hurt.
>>>
>>>> Iirc, hazard pointers require a store followed by a load to another
>>>> location to be honored. This requires a membar on x86.
>>> <
>>> It is stuff like this that made My 66000 architecture define changes
>>> to memory order based on several pieces of state (thus no membars)
>>> <
>>> Accesses to ROM are unordered
>>> Accesses to config space is strongly ordered
>>> Accesses to MMI/O space is sequentially consistent
>>> Participating accesses (ATOMIC) are sequentially consistent
>>> everything else is causal.
>>> And the HW tracks this on a per memory reference basis--in effect
>>> all orderings are in effect all the time.
>>> <
>>> Performance guys get what they want,
>>> Lamport guys (atomic) get what they want
>>> Device drivers get what they want on the accesses that need it
>>> ..
>
>> Nice! Well, there is a way to avoid the explicit membar on x86. It
>> involves a marriage of RCU and Hazard Pointers.
> <
> I watched membar evolve during my time at AMD and decided to make a
> machine that never needs to use them--but still gets the right thing
> done.
> <

TSO? Oh, wait. I need to go think back about your system. It's been some
years since we conversed about it. Btw, what happened to the Mill?
Anyway, I digress.

Fwiw, imho, it helps to think about proper alignment and boundaries, try
to work with a given architecture, not against it.

A highly relaxed memory model can be beneficial for certain workloads.

Re: Memory dependency microbenchmark

<uirjr7$8dlh$2@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=34942&group=comp.arch#34942

From: kegs@provalid.com (Kent Dickey)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Sun, 12 Nov 2023 22:33:11 -0000 (UTC)
Organization: provalid.com
Lines: 12
Message-ID: <uirjr7$8dlh$2@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <uimulp$38v2q$1@dont-email.me> <79bcf6d10b3c231ba7e3693bad8db3c1@news.novabbs.com> <uiri0a$85mp$2@dont-email.me>
 by: Kent Dickey - Sun, 12 Nov 2023 22:33 UTC

In article <uiri0a$85mp$2@dont-email.me>,
Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:

>A highly relaxed memory model can be beneficial for certain workloads.

I know a lot of people believe that statement to be true. In general, it
is assumed to be true without proof.

I believe that statement to be false. Can you describe some of these
workloads?

Kent

Re: Memory dependency microbenchmark

<uirk6f$8hef$2@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=34943&group=comp.arch#34943

From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Sun, 12 Nov 2023 14:39:11 -0800
Organization: A noiseless patient Spider
Lines: 16
Message-ID: <uirk6f$8hef$2@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<uimulp$38v2q$1@dont-email.me>
<79bcf6d10b3c231ba7e3693bad8db3c1@news.novabbs.com>
<uiri0a$85mp$2@dont-email.me> <uirjr7$8dlh$2@dont-email.me>
 by: Chris M. Thomasson - Sun, 12 Nov 2023 22:39 UTC

On 11/12/2023 2:33 PM, Kent Dickey wrote:
> In article <uiri0a$85mp$2@dont-email.me>,
> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>
>> A highly relaxed memory model can be beneficial for certain workloads.
>
> I know a lot of people believe that statement to be true. In general, it
> is assumed to be true without proof.
>
> I believe that statement to be false. Can you describe some of these
> workloads?

One example... Basically, think along the lines of RCU. Usually,
read-mostly, write-rather-rarely workloads can _greatly_ benefit from
not using any memory barriers. And this is on rather relaxed systems.
Think SPARC in RMO mode. DEC Alpha is a "special case" wrt RCU.
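The read-mostly benefit comes from the reader's fast path needing no full barrier: the writer fully initializes a node and publishes it with a release store, and an acquire (or dependency-ordered) load is enough on the reader side. A minimal sketch in C++ `std::atomic` terms, with invented names (`g_node`, `publish`, `read_payload`); real RCU also handles grace periods and reclamation, which are omitted here.

```cpp
#include <atomic>

// Minimal publication idiom in the spirit of RCU readers.
struct Node { int payload; };

std::atomic<Node*> g_node{nullptr};

void publish(Node* n) {
    // All prior stores to *n happen-before this release store,
    // so any reader that sees the pointer sees the payload.
    g_node.store(n, std::memory_order_release);
}

int read_payload() {
    // On most hardware this is an ordinary load; no mfence-class
    // barrier is needed on the reader's fast path.
    Node* n = g_node.load(std::memory_order_acquire);
    return n ? n->payload : -1;
}
```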

Re: Memory dependency microbenchmark

<uirke6$8hef$3@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=34944&group=comp.arch#34944

From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Sun, 12 Nov 2023 14:43:18 -0800
Organization: A noiseless patient Spider
Lines: 16
Message-ID: <uirke6$8hef$3@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<uimulp$38v2q$1@dont-email.me>
<79bcf6d10b3c231ba7e3693bad8db3c1@news.novabbs.com>
<uiri0a$85mp$2@dont-email.me> <uirjr7$8dlh$2@dont-email.me>
 by: Chris M. Thomasson - Sun, 12 Nov 2023 22:43 UTC

On 11/12/2023 2:33 PM, Kent Dickey wrote:
> In article <uiri0a$85mp$2@dont-email.me>,
> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>
>> A highly relaxed memory model can be beneficial for certain workloads.
>
> I know a lot of people believe that statement to be true. In general, it
> is assumed to be true without proof.
>
> I believe that statement to be false. Can you describe some of these
> workloads?

Also, think about converting any sound lock-free algorithm's finely
tuned memory barriers to _all_ sequential consistency... That would ruin
performance right off the bat... Think about it.
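The cost of that conversion can be made concrete with a spinlock unlock: under x86-TSO a release store compiles to a plain store, while a seq_cst store costs a full fence (a locked operation or MFENCE). A sketch, with invented names; both paths are functionally equivalent here, the difference is the ordering tax.

```cpp
#include <atomic>

// A simple test-and-set style spinlock with two unlock flavors.
std::atomic<int> g_lock{0};

void lock() {
    int expected = 0;
    // Acquire ordering is all a lock needs on entry.
    while (!g_lock.compare_exchange_weak(expected, 1,
                                         std::memory_order_acquire))
        expected = 0;
}

// Tuned unlock: a release store -- a plain MOV on x86.
void unlock_release() { g_lock.store(0, std::memory_order_release); }

// "Everything seq_cst" unlock: same semantics for the lock, but
// the compiler must emit a full fence / locked op on x86.
void unlock_seq_cst() { g_lock.store(0, std::memory_order_seq_cst); }
```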

Re: Memory dependency microbenchmark

<36011a9597060e08d46db0eddfed0976@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=34950&group=comp.arch#34950

From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Mon, 13 Nov 2023 00:20:51 +0000
Organization: novaBBS
Message-ID: <36011a9597060e08d46db0eddfed0976@news.novabbs.com>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <uimulp$38v2q$1@dont-email.me> <79bcf6d10b3c231ba7e3693bad8db3c1@news.novabbs.com> <uiri0a$85mp$2@dont-email.me> <uirjr7$8dlh$2@dont-email.me> <uirke6$8hef$3@dont-email.me>
 by: MitchAlsup - Mon, 13 Nov 2023 00:20 UTC

Chris M. Thomasson wrote:

> On 11/12/2023 2:33 PM, Kent Dickey wrote:
>> In article <uiri0a$85mp$2@dont-email.me>,
>> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>
>>> A highly relaxed memory model can be beneficial for certain workloads.
>>
>> I know a lot of people believe that statement to be true. In general, it
>> is assumed to be true without proof.
>>
>> I believe that statement to be false. Can you describe some of these
>> workloads?

> Also, think about converting any sound lock-free algorithm's finely
> tuned memory barriers to _all_ sequential consistency... That would ruin
> performance right off the bat... Think about it.
<
Assuming you are willing to accept the wrong answer fast, rather than the
right answer later. There are very few algorithms with this property.
