devel / comp.arch / Re: Memory dependency microbenchmark

Re: Memory dependency microbenchmark

<7bT5N.54414$BbXa.29985@fx16.iad>

https://news.novabbs.org/devel/article-flat.php?id=35075&group=comp.arch#35075

Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx16.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Memory dependency microbenchmark
Newsgroups: comp.arch
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com> <uiu3dp$svfh$1@dont-email.me> <u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com> <uj39sg$1stm2$1@dont-email.me> <jwvleaw1j9b.fsf-monnier+comp.arch@gnu.org> <491d288a9d5a47236c979622b79db056@news.novabbs.com>
Lines: 75
Message-ID: <7bT5N.54414$BbXa.29985@fx16.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Sat, 18 Nov 2023 00:03:15 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Sat, 18 Nov 2023 00:03:15 GMT
X-Received-Bytes: 3348
 by: Scott Lurndal - Sat, 18 Nov 2023 00:03 UTC

mitchalsup@aol.com (MitchAlsup) writes:
>Stefan Monnier wrote:
>
>>> As long as you fully analyze your program, ensure all multithreaded accesses
>>> are only through atomic variables, and you label every access to an
>>> atomic variable properly (although my point is: exactly what should that
>>> be??), then there is no problem.
>
>> BTW, the above sounds daunting when writing in C because you have to do
>> that analysis yourself, but there are programming languages out there
>> which will do that analysis for you as part of type checking.
>> I'm thinking here of languages like Rust or the STM library of
>> Haskell. This also solves the problem that memory accesses can be
>> reordered by the compiler, since in that case the compiler is fully
>> aware of which accesses can be reordered and which can't.
>
>
>> Stefan
><
>I created the Exotic Synchronization Method such that you could just
>write the code needed to do the work, and then decorate those accesses
>which are participating in the ATOMIC event. So, let's say you want to
>move an element from one doubly linked list to another place in some
>other doubly linked list:: you would write::
><
>BOOLEAN MoveElement( Element *fr, Element *to )
>{
> fn = fr->next;
> fp = fr->prev;
> tn = to->next;
>
>
>
> if( TRUE )
> {
> fp->next = fn;
> fn->prev = fp;
> to->next = fr;
> tn->prev = fr;
> fr->prev = to;
> fr->next = tn;
> return TRUE;
> }
> return FALSE;
>}
>
>In order to change this into a fully qualified ATOMIC event, the code
>is decorated as::
>
>BOOLEAN MoveElement( Element *fr, Element *to )
>{
> esmLOCK( fn = fr->next ); // get data
> esmLOCK( fp = fr->prev );
> esmLOCK( tn = to->next );
> esmLOCK( fn ); // touch data
> esmLOCK( fp );
> esmLOCK( tn );
> if( !esmINTERFERENCE() )
> {
> fp->next = fn; // move the bits around
> fn->prev = fp;
> to->next = fr;
> tn->prev = fr;
> fr->prev = to;
> esmLOCK( fr->next = tn );
> return TRUE;
> }
> return FALSE;
>}
>
>Having a multiplicity of containers participate in an ATOMIC event
>is key to making ATOMIC stuff fast and needing fewer ATOMICs to
>get the job(s) done.

That looks suspiciously like transactional memory.

Re: Memory dependency microbenchmark

<47787581fe3402eb5170fda771088ef7@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=35077&group=comp.arch#35077

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Sat, 18 Nov 2023 01:03:26 +0000
Organization: novaBBS
Message-ID: <47787581fe3402eb5170fda771088ef7@news.novabbs.com>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com> <uiu3dp$svfh$1@dont-email.me> <u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com> <uj39sg$1stm2$1@dont-email.me> <jwvleaw1j9b.fsf-monnier+comp.arch@gnu.org> <491d288a9d5a47236c979622b79db056@news.novabbs.com> <7bT5N.54414$BbXa.29985@fx16.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="1275632"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Rslight-Site: $2y$10$gcQavjB1HF/cQDTjmCYHeeUutF9mGEMbKtG4SzJBYHAaeMBs0Y1GK
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
 by: MitchAlsup - Sat, 18 Nov 2023 01:03 UTC

Scott Lurndal wrote:

> mitchalsup@aol.com (MitchAlsup) writes:
>>Stefan Monnier wrote:
>>
>>>> As long as you fully analyze your program, ensure all multithreaded accesses
>>>> are only through atomic variables, and you label every access to an
>>>> atomic variable properly (although my point is: exactly what should that
>>>> be??), then there is no problem.
>>
>>> BTW, the above sounds daunting when writing in C because you have to do
>>> that analysis yourself, but there are programming languages out there
>>> which will do that analysis for you as part of type checking.
>>> I'm thinking here of languages like Rust or the STM library of
>>> Haskell. This also solves the problem that memory accesses can be
>>> reordered by the compiler, since in that case the compiler is fully
>>> aware of which accesses can be reordered and which can't.
>>
>>
>>> Stefan
>><
>>I created the Exotic Synchronization Method such that you could just
>>write the code needed to do the work, and then decorate those accesses
>>which are participating in the ATOMIC event. So, let's say you want to
>>move an element from one doubly linked list to another place in some
>>other doubly linked list:: you would write::
>><
>>BOOLEAN MoveElement( Element *fr, Element *to )
>>{
>> fn = fr->next;
>> fp = fr->prev;
>> tn = to->next;
>>
>>
>>
>> if( TRUE )
>> {
>> fp->next = fn;
>> fn->prev = fp;
>> to->next = fr;
>> tn->prev = fr;
>> fr->prev = to;
>> fr->next = tn;
>> return TRUE;
>> }
>> return FALSE;
>>}
>>
>>In order to change this into a fully qualified ATOMIC event, the code
>>is decorated as::
>>
>>BOOLEAN MoveElement( Element *fr, Element *to )
>>{
>> esmLOCK( fn = fr->next ); // get data
>> esmLOCK( fp = fr->prev );
>> esmLOCK( tn = to->next );
>> esmLOCK( fn ); // touch data
>> esmLOCK( fp );
>> esmLOCK( tn );
>> if( !esmINTERFERENCE() )
>> {
>> fp->next = fn; // move the bits around
>> fn->prev = fp;
>> to->next = fr;
>> tn->prev = fr;
>> fr->prev = to;
>> esmLOCK( fr->next = tn );
>> return TRUE;
>> }
>> return FALSE;
>>}
>>
>>Having a multiplicity of containers participate in an ATOMIC event
>>is key to making ATOMIC stuff fast and needing fewer ATOMICs to
>>get the job(s) done.

> That looks suspiciously like transactional memory.

It has some flavors of such, but::
it has no nesting,
it has a strict limit of 8 participating cache lines,
it automagically transfers control when disruptive interference is detected,
it is subject to timeouts;

But does have the property that all interested 3rd parties see participating
memory only in the before or only in the completely after states.
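
For readers without ESM hardware, a rough software analogue of the decorated MoveElement is to wrap the same six pointer updates in one lock. This is only a sketch, not the My 66000 mechanism: `MoveElementLocked`, `list_lock`, and the `Element` layout are assumptions made for illustration. The before-or-after guarantee here holds only for threads that take the same lock, whereas ESM is described above as giving it to all interested third parties.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical element layout; the post does not show the real one. */
typedef struct Element {
    struct Element *next;
    struct Element *prev;
} Element;

/* One global spin lock stands in for the ESM atomic event (assumption). */
static atomic_flag list_lock = ATOMIC_FLAG_INIT;

/* Unlink fr from its list and relink it right after to.
 * Assumes fr and to are not neighbours and that the touched
 * elements have non-NULL next/prev (e.g. circular lists). */
bool MoveElementLocked(Element *fr, Element *to)
{
    while (atomic_flag_test_and_set(&list_lock))
        ;                       /* spin until the lock is free */

    Element *fn = fr->next;     /* get data */
    Element *fp = fr->prev;
    Element *tn = to->next;

    fp->next = fn;              /* move the bits around */
    fn->prev = fp;
    to->next = fr;
    tn->prev = fr;
    fr->prev = to;
    fr->next = tn;

    atomic_flag_clear(&list_lock);
    return true;
}
```

Unlike the ESM version, this never returns FALSE: a lock makes contenders wait instead of reporting interference, which is exactly the serialization ESM is meant to avoid.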

Re: Memory dependency microbenchmark

<20231118211047.00002521@yahoo.com>

https://news.novabbs.org/devel/article-flat.php?id=35090&group=comp.arch#35090

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: already5chosen@yahoo.com (Michael S)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Sat, 18 Nov 2023 21:10:47 +0200
Organization: A noiseless patient Spider
Lines: 114
Message-ID: <20231118211047.00002521@yahoo.com>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<uj3c29$1t9an$1@dont-email.me>
<uj3d0a$1tb8u$1@dont-email.me>
<jwv4jhm4pk7.fsf-monnier+comp.arch@gnu.org>
<uj3s14$1vjlh$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Info: dont-email.me; posting-host="b47cc13774b8a2a9a1b1e461ea869f5e";
logging-data="3598409"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18XfIZq/Tf3fGHWYOOeBzo44dXsUXBtebY="
Cancel-Lock: sha1:RPB3Gj/pbDY/ufbahs6Cjp5XIzA=
X-Newsreader: Claws Mail 4.1.1 (GTK 3.24.34; x86_64-w64-mingw32)
 by: Michael S - Sat, 18 Nov 2023 19:10 UTC

On Thu, 16 Nov 2023 01:41:56 -0000 (UTC)
kegs@provalid.com (Kent Dickey) wrote:

> In article <jwv4jhm4pk7.fsf-monnier+comp.arch@gnu.org>,
> Stefan Monnier <monnier@iro.umontreal.ca> wrote:
> >> Are you thinking about a magic arch that we cannot use right now?
> >
> >Yes, he is, obviously.
> >So when you say "it's bad", please tell us why.
> >
> >We know it would run slow on existing CPUs, that's not the question.
> >The question is: why would it be impossible or very hard to
> >make a CPU that could execute such code efficiently.
> >
> >I suspect there can be very valid reasons, maybe for the same
> >kinds of reasons why some systems allow nested transactions (e.g.
> >when you have a transaction with two calls to `gensym`: it doesn't
> >matter whether the two calls really return consecutive symbols (as
> >would be guaranteed if the code were truly run atomically), all that
> >matters is that those symbols are unique).
> >
> >So maybe with sequential consistency, there could be some forms of
> >parallelism which we'd completely disallow, whereas a weaker form of
> >consistency would allow it. I'm having a hard time imagining what it
> >could be, tho.
> >
> >BTW, IIRC, SGI's MIPS was defined to offer sequential consistency on
> >their big supercomputers, no?
> >
> >
> > Stefan
>
> Every HP PA-RISC system, from workstation to the largest server, was
> sequentially consistent and needed no barriers. It was not a problem,
> I never thought it was even considered any sort of performance issue.
> Once you decide to support it, you just throw some hardware at it,
> and you're done.
>
> Since the original HP PA-RISC MP designs were sequentially
> consistent, all the implementations afterward kept it up since no one
> wanted any existing code to break. The architects defined the
> architecture to allow weak ordering, but no implementation (by HP at
> least) did so. These architects then went on to IA-64, where it
> really is weak, since IA64 was in order, so it has more of a payoff
> there since IA64 didn't want to spend hardware on this (they had 50
> other things to waste it on), and IA-64 is full of bad ideas and
> ideas just not implemented well.
>
> Sun was TSO, which is weaker. Sun was never a performance champion,
> other than by throwing the most cores at a parallel problem. So being
> TSO relative to Sequential Consistency didn't seem to buy Sun much.
> DEC Alpha was the poster child of weakly ordered (so weak they didn't
> maintain single CPU consistent ordering with itself) and it was often
> a performance champion on tech workloads, but that had more to do
> with their much higher clock frequencies, and that edge went away
> once out-of-order took off. DEC Alpha was never a big player in
> TPC-C workloads, where everybody made their money in the 90's.
> Technical computing is nice and fun, but there was not a lot of
> profit in it compared to business workloads in the 90s. IA64's
> reason to exist was to run SGI/SUN/DEC out of business, and it
> effectively did (while hurting HP about as much).
>
> I don't know if SGI was sequentially consistent.

AFAIK, Stefan Monnier is correct. SGI MIPS gear was SC.
Their later gear (Itanium and x86) was not, because of the underlying
cores.

> It's possible, since
> software developed for other systems might have pushed them to
> support it, but the academic RISC folks were pretty big on weakly
> ordered.
>
> A problem with weakly ordered is no implementation is THAT weakly
> ordered. To maximize doing things in a bad order requires "unlucky"
> cache misses, and these are just not common in practice. So "Store
> A; Store B" often appears to be done in that order with no barriers
> on weakly ordered systems. It's hard to feel confident you've
> written anything complex right, so most algorithms are kept
> relatively simple to make it more likely they are well tested.
>
> HP PA-RISC had poor atomic support, so HP-UX used a simple spin-lock
> using just load and store. I forget the details, but it was something
> like this: Each of 4 CPUs got one byte in a 4-byte word. First, set
> your byte to "iwant", then read the word. If everyone else is 0, then
> set your byte to "iwon", then read the word. If everyone else is
> still 0, you've won, do what you want. And if you see other bytes
> getting set, then you move to a backoff algorithm to determine the
> winner (you have 256 states you can move through). Note that when
> you release the lock, you can pick the next winner with a single
> store. What I understood was this was actually faster than a simple
> compare-and-swap since it let software immediately see if there was
> contention, and move to a back-off algorithm right away (and you can
> see who's contending, and deal with that as well). Spinlocks tend to
> lead to cache invalidation storms, and it's hard to tune, but this
> was much more tuneable. It scaled to 64-CPUs by doing the lock in
> two steps, and moving to a 64-bit word.
>
> Kent
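
The load/store-only lock Kent describes can be sketched in C11 roughly as follows. This is a guess at the shape of the algorithm, not the HP-UX code: the names (`slot`, `try_acquire`, `release`) are invented here, four separate atomic bytes replace the single 4-byte word because C11 has no portable mixed-size atomic access, and the back-off protocol is reduced to "clear my byte and report failure". It relies on sequentially consistent loads and stores, which is the C11 default for `atomic_load` and `atomic_store`.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

enum { NCPU = 4, IDLE = 0, IWANT = 1, IWON = 2 };

/* One state byte per CPU; the post packs these into one 4-byte word. */
static _Atomic uint8_t slot[NCPU];

/* True if any CPU other than `me` currently has a non-zero byte. */
static bool others_active(int me)
{
    for (int i = 0; i < NCPU; i++)
        if (i != me && atomic_load(&slot[i]) != IDLE)
            return true;
    return false;
}

/* One acquisition attempt using only loads and stores. */
bool try_acquire(int me)
{
    atomic_store(&slot[me], IWANT);      /* announce interest */
    if (!others_active(me)) {
        atomic_store(&slot[me], IWON);   /* claim the lock */
        if (!others_active(me))
            return true;                 /* still alone: we hold it */
    }
    atomic_store(&slot[me], IDLE);       /* contention: withdraw */
    return false;                        /* caller backs off, retries */
}

void release(int me)
{
    atomic_store(&slot[me], IDLE);
}
```

A caller would loop on `try_acquire` with some delay between attempts. The real scheme apparently used the remaining byte values to pick a winner under contention and to hand the lock to a chosen successor on release, which this sketch does not attempt.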

I don't see how a high-end SC core can match the performance of a
high-end TSO-or-weaker core. The weak point of SC is single-thread
performance when the code encounters a few L1D misses intermixed
with plenty of L1D hits.
Of course, I don't know all the tricks that high-end core designers are
using today, or even 10% of them. So maybe what I consider
impossible is in fact possible.
But as a matter of fact, nobody has designed a brand-new high-end SC
core in this century. The closest were probably the MIPS cores from Scott
Lurndal's employer Cavium, but I think they were always at least a factor
of 1.5x slower than the contemporary state of the art in single-thread
performance, and more commonly a factor of 2+.
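
The difference between SC and TSO that this argument turns on shows up in the classic store-buffering litmus test. Below is a sketch in C11 with POSIX threads; all names (`sb_both_zero`, `x`, `y`, `r0`, `r1`) are chosen here for illustration. Each thread stores to one variable and then loads the other. Under sequential consistency at least one thread must observe the other's store, so the core has to keep the younger load from visibly passing the older store. A TSO core may let the load complete while the store still sits in the store buffer, so both loads can return 0.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static atomic_int x, y;   /* shared flags */
static int r0, r1;        /* what each thread observed */

static void *thread0(void *arg)
{
    (void)arg;
    atomic_store(&x, 1);      /* seq_cst store */
    r0 = atomic_load(&y);     /* seq_cst load  */
    return NULL;
}

static void *thread1(void *arg)
{
    (void)arg;
    atomic_store(&y, 1);
    r1 = atomic_load(&x);
    return NULL;
}

/* Runs the litmus test once. Returns true if both threads read 0,
 * the outcome that sequential consistency forbids. */
bool sb_both_zero(void)
{
    pthread_t a, b;
    atomic_store(&x, 0);
    atomic_store(&y, 0);
    pthread_create(&a, NULL, thread0, NULL);
    pthread_create(&b, NULL, thread1, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return r0 == 0 && r1 == 0;
}
```

With the default `memory_order_seq_cst` the compiler must emit whatever fence or locked instruction the target needs, so `sb_both_zero` should never return true. Switching the four accesses to `memory_order_relaxed` removes that guarantee and exposes the hardware's native ordering.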

Re: Memory dependency microbenchmark

<ujbbdj$3fmi7$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35092&group=comp.arch#35092

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Sat, 18 Nov 2023 13:47:31 -0800
Organization: A noiseless patient Spider
Lines: 104
Message-ID: <ujbbdj$3fmi7$1@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com>
<uiu3dp$svfh$1@dont-email.me>
<u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>
<uj39sg$1stm2$1@dont-email.me> <jwvleaw1j9b.fsf-monnier+comp.arch@gnu.org>
<491d288a9d5a47236c979622b79db056@news.novabbs.com>
<7bT5N.54414$BbXa.29985@fx16.iad>
<47787581fe3402eb5170fda771088ef7@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 18 Nov 2023 21:47:31 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="c6bb7d11208b1d65e158106be7298c0f";
logging-data="3660359"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19LTNgzSEExHlBiCPqGCUG44ykLPrmE9Vs="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:Gu5V71uIcenLW9zde6bgw7sJElY=
In-Reply-To: <47787581fe3402eb5170fda771088ef7@news.novabbs.com>
Content-Language: en-US
 by: Chris M. Thomasson - Sat, 18 Nov 2023 21:47 UTC

On 11/17/2023 5:03 PM, MitchAlsup wrote:
> Scott Lurndal wrote:
>
>> mitchalsup@aol.com (MitchAlsup) writes:
>>> Stefan Monnier wrote:
>>>
>>>>> As long as you fully analyze your program, ensure all multithreaded
>>>>> accesses
>>>>> are only through atomic variables, and you label every access to an
>>>>> atomic variable properly (although my point is: exactly what should
>>>>> that
>>>>> be??), then there is no problem.
>>>
>>>> BTW, the above sounds daunting when writing in C because you have to do
>>>> that analysis yourself, but there are programming languages out there
>>>> which will do that analysis for you as part of type checking.
>>>> I'm thinking here of languages like Rust or the STM library of
>>>> Haskell.  This also solves the problem that memory accesses can be
>>>> reordered by the compiler, since in that case the compiler is fully
>>>> aware of which accesses can be reordered and which can't.
>>>
>>>
>>>>         Stefan
>>> <
>>> I created the Exotic Synchronization Method such that you could just
>>> write the code needed to do the work, and then decorate those accesses
>>> which are participating in the ATOMIC event. So, let's say you want to
>>> move an element from one doubly linked list to another place in some
>>> other doubly linked list:: you would write::
>>> <
>>> BOOLEAN MoveElement( Element *fr, Element *to )
>>> {
>>>     fn = fr->next;
>>>     fp = fr->prev;
>>>     tn = to->next;
>>>
>>>
>>>
>>>     if( TRUE )
>>>     {
>>>              fp->next = fn;
>>>              fn->prev = fp;
>>>              to->next = fr;
>>>              tn->prev = fr;
>>>              fr->prev = to;
>>>              fr->next = tn;
>>>              return TRUE;
>>>     }
>>>     return FALSE;
>>> }
>>>
>>> In order to change this into a fully qualified ATOMIC event, the code
>>> is decorated as::
>>>
>>> BOOLEAN MoveElement( Element *fr, Element *to )
>>> {
>>>     esmLOCK( fn = fr->next );         // get data
>>>     esmLOCK( fp = fr->prev );
>>>     esmLOCK( tn = to->next );
>>>     esmLOCK( fn );                    // touch data
>>>     esmLOCK( fp );
>>>     esmLOCK( tn );
>>>     if( !esmINTERFERENCE() )
>>>     {
>>>              fp->next = fn;           // move the bits around
>>>              fn->prev = fp;
>>>              to->next = fr;
>>>              tn->prev = fr;
>>>              fr->prev = to;
>>>     esmLOCK( fr->next = tn );
>>>              return TRUE;
>>>     }
>>>     return FALSE;
>>> }
>>>
>>> Having a multiplicity of containers participate in an ATOMIC event
>>> is key to making ATOMIC stuff fast and needing fewer ATOMICs to
>>> get the job(s) done.
>
>> That looks suspiciously like transactional memory.

Indeed, it does. Worried about live lock wrt esmINTERFERENCE().

>
> It has some flavors of such, but::
> it has no nesting,
> it has a strict limit of 8 participating cache lines,
> it automagically transfers control when disruptive interference is
> detected,
> it is subject to timeouts;
>
> But does have the property that all interested 3rd parties see
> participating
> memory only in the before or only in the completely after states.

Are you familiar with KCSS? K-Compare Single Swap?

https://people.csail.mit.edu/shanir/publications/K-Compare.pdf

?

Check out this old thread:

https://groups.google.com/g/comp.arch/c/shshLdF1uqs/m/VLmZSCBGDTkJ

Re: Memory dependency microbenchmark

<ujc0bj$3lu85$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35096&group=comp.arch#35096

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Sat, 18 Nov 2023 19:44:50 -0800
Organization: A noiseless patient Spider
Lines: 4
Message-ID: <ujc0bj$3lu85$1@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<uj3c29$1t9an$1@dont-email.me> <uj3d0a$1tb8u$1@dont-email.me>
<jwv4jhm4pk7.fsf-monnier+comp.arch@gnu.org> <uj3s14$1vjlh$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 19 Nov 2023 03:44:51 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="26803417fac4b0c1b76764e54e58e1a6";
logging-data="3864837"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19pAT+ikImZDS6aRQeHr8K6IRIMxA7DDWE="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:opEsEU3/fioTQeAIOJX2WZeg0t4=
In-Reply-To: <uj3s14$1vjlh$1@dont-email.me>
Content-Language: en-US
 by: Chris M. Thomasson - Sun, 19 Nov 2023 03:44 UTC

On 11/15/2023 5:41 PM, Kent Dickey wrote:
[...]

Btw, thank you for Kegs! Excellent.

Re: Memory dependency microbenchmark

<cc3d63d482758c3fbc4b884d77cbf608@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=35108&group=comp.arch#35108

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Sun, 19 Nov 2023 19:32:07 +0000
Organization: novaBBS
Message-ID: <cc3d63d482758c3fbc4b884d77cbf608@news.novabbs.com>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com> <uiu3dp$svfh$1@dont-email.me> <u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com> <uj39sg$1stm2$1@dont-email.me> <jwvleaw1j9b.fsf-monnier+comp.arch@gnu.org> <491d288a9d5a47236c979622b79db056@news.novabbs.com> <7bT5N.54414$BbXa.29985@fx16.iad> <47787581fe3402eb5170fda771088ef7@news.novabbs.com> <ujbbdj$3fmi7$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="1464121"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Spam-Level: *
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
X-Rslight-Site: $2y$10$C6IAQ5ltmz8s4pcGEs1E0.DHy7UU5Fjh5fBf1eEOPm/cWMVQ.CBNi
 by: MitchAlsup - Sun, 19 Nov 2023 19:32 UTC

Chris M. Thomasson wrote:

> On 11/17/2023 5:03 PM, MitchAlsup wrote:
>> Scott Lurndal wrote:
>>
>>> mitchalsup@aol.com (MitchAlsup) writes:
>>>> Stefan Monnier wrote:
>>>>
>>>>>> As long as you fully analyze your program, ensure all multithreaded
>>>>>> accesses
>>>>>> are only through atomic variables, and you label every access to an
>>>>>> atomic variable properly (although my point is: exactly what should
>>>>>> that
>>>>>> be??), then there is no problem.
>>>>
>>>>> BTW, the above sounds daunting when writing in C because you have to do
>>>>> that analysis yourself, but there are programming languages out there
>>>>> which will do that analysis for you as part of type checking.
>>>>> I'm thinking here of languages like Rust or the STM library of
>>>>> Haskell.  This also solves the problem that memory accesses can be
>>>>> reordered by the compiler, since in that case the compiler is fully
>>>>> aware of which accesses can be reordered and which can't.
>>>>
>>>>
>>>>>         Stefan
>>>> <
>>>> I created the Exotic Synchronization Method such that you could just
>>>> write the code needed to do the work, and then decorate those accesses
>>>> which are participating in the ATOMIC event. So, let's say you want to
>>>> move an element from one doubly linked list to another place in some
>>>> other doubly linked list:: you would write::
>>>> <
>>>> BOOLEAN MoveElement( Element *fr, Element *to )
>>>> {
>>>>     fn = fr->next;
>>>>     fp = fr->prev;
>>>>     tn = to->next;
>>>>
>>>>
>>>>
>>>>     if( TRUE )
>>>>     {
>>>>              fp->next = fn;
>>>>              fn->prev = fp;
>>>>              to->next = fr;
>>>>              tn->prev = fr;
>>>>              fr->prev = to;
>>>>              fr->next = tn;
>>>>              return TRUE;
>>>>     }
>>>>     return FALSE;
>>>> }
>>>>
>>>> In order to change this into a fully qualified ATOMIC event, the code
>>>> is decorated as::
>>>>
>>>> BOOLEAN MoveElement( Element *fr, Element *to )
>>>> {
>>>>     esmLOCK( fn = fr->next );         // get data
>>>>     esmLOCK( fp = fr->prev );
>>>>     esmLOCK( tn = to->next );
>>>>     esmLOCK( fn );                    // touch data
>>>>     esmLOCK( fp );
>>>>     esmLOCK( tn );
>>>>     if( !esmINTERFERENCE() )
>>>>     {
>>>>              fp->next = fn;           // move the bits around
>>>>              fn->prev = fp;
>>>>              to->next = fr;
>>>>              tn->prev = fr;
>>>>              fr->prev = to;
>>>>     esmLOCK( fr->next = tn );
>>>>              return TRUE;
>>>>     }
>>>>     return FALSE;
>>>> }
>>>>
>>>> Having a multiplicity of containers participate in an ATOMIC event
>>>> is key to making ATOMIC stuff fast and needing fewer ATOMICs to
>>>> get the job(s) done.
>>
>>> That looks suspiciously like transactional memory.

> Indeed, it does. Worried about live lock wrt esmINTERFERENCE().

esmINTERFERENCE() is an actual instruction in My 66000 ISA. It is a
conditional branch where the condition is delivered from the miss
buffer (where I detect interference wrt participating cache lines.)

>>
>> It has some flavors of such, but::
>> it has no nesting,
>> it has a strict limit of 8 participating cache lines,
>> it automagically transfers control when disruptive interference is
>> detected,
>> it is subject to timeouts;
>>
>> But does have the property that all interested 3rd parties see
>> participating
>> memory only in the before or only in the completely after states.

> Are you familiar with KCSS? K-Compare Single Swap?

> https://people.csail.mit.edu/shanir/publications/K-Compare.pdf

Easily done:
esmLOCK( c1 = p1->condition1 );
esmLOCK( c2 = p2->condition2 );
...
if( c1 == C1 && c2 == C2 && c3 == C3 ... )
...
esmLOCK( some data );

Esm was designed to allow any known synchronization means (in 2013)
to be directly implemented in esm either inline or via subroutine
calls.
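
For readers who have not met KCSS, its intended semantics (compare K locations, swap one) can be modelled in ordinary C. This is only a reference model guarded by a C11 spinlock; it is neither esm nor the algorithm from the linked paper, and the names kcss and kcss_guard are made up for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical global guard standing in for hardware atomicity. */
static atomic_flag kcss_guard = ATOMIC_FLAG_INIT;

/* Compare k locations against their expected values; if every one
   matches, store new_val into *addr[0]. Returns true on success. */
bool kcss(long *addr[], const long expected[], size_t k, long new_val)
{
    assert(addr != NULL && expected != NULL && k > 0);

    while (atomic_flag_test_and_set_explicit(&kcss_guard, memory_order_acquire))
        ;                                   /* spin until we own the guard */

    bool ok = true;
    for (size_t i = 0; ok && i < k; i++)
        ok = (*addr[i] == expected[i]);     /* every comparison must hold */
    if (ok)
        *addr[0] = new_val;                 /* the single swap */

    atomic_flag_clear_explicit(&kcss_guard, memory_order_release);
    return ok;
}
```

A caller passes an array of K pointers plus the K expected values; a second call with the same expected values fails once the first location has been changed.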

> ?

> Check this out the old thread:

> https://groups.google.com/g/comp.arch/c/shshLdF1uqs/m/VLmZSCBGDTkJ

Re: Memory dependency microbenchmark

<def7206542bfd9715d59357e69b413d6@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=35109&group=comp.arch#35109

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Sun, 19 Nov 2023 19:38:47 +0000
Organization: novaBBS
Message-ID: <def7206542bfd9715d59357e69b413d6@news.novabbs.com>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <uj3c29$1t9an$1@dont-email.me> <uj3d0a$1tb8u$1@dont-email.me> <jwv4jhm4pk7.fsf-monnier+comp.arch@gnu.org> <uj3s14$1vjlh$1@dont-email.me> <20231118211047.00002521@yahoo.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="1464838"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
X-Rslight-Site: $2y$10$6xROBvpDpXKvOIxIVPQytOZqUYBMfyNvFOrJNRj0.veiksTiJPVpu
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
 by: MitchAlsup - Sun, 19 Nov 2023 19:38 UTC

Michael S wrote:

> On Thu, 16 Nov 2023 01:41:56 -0000 (UTC)
> kegs@provalid.com (Kent Dickey) wrote:

>> Kent

> I don't see how high-end SC core can match performance of high-end
> TSO-or-weaker core with weak point of SC being single-thread
> performance in the case where code encounters few L1D misses intermixed
> with plenty of L1D hits.

They cannot--but does it really matter ??

Back in 1993 I ran an experiment on the Mc 88120 where we did not consider
a memory reference performed until we got a signal back from one of the
multiple DRAM controllers (compared to fire and forget). On big,
complicated applications the loss in performance was about 2%. That 2%
perf loss allowed one to be in a position to recover from an ECC failure
in the transport of write data to DRAM.

This was still not TSO but it is a relevant data point.

> Of course, I don't know all tricks that high-end core designers are
> using today or even 10% of their tricks. So, may be, what I consider
> impossible is in fact possible.

> But as a matter of fact nobody designed brand new high-end SC core in
> this century. The closest were probably MIPS cores from Scott
> Lurndal's employer Cavium, but I think they were always at least factor
> of 1.5x slower than contemporary state of the art in single-thread
> performance and more commonly factor of 2+.

Right now, nobody is competitive with x86-64 at the high end. You are
going to need a 5GHz core operating 4 instructions per cycle to be
within spitting distance.

Re: Memory dependency microbenchmark

<ujdonm$3u8mv$2@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35110&group=comp.arch#35110

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Sun, 19 Nov 2023 11:47:01 -0800
Organization: A noiseless patient Spider
Lines: 142
Message-ID: <ujdonm$3u8mv$2@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com>
<uiu3dp$svfh$1@dont-email.me>
<u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>
<uj39sg$1stm2$1@dont-email.me> <jwvleaw1j9b.fsf-monnier+comp.arch@gnu.org>
<491d288a9d5a47236c979622b79db056@news.novabbs.com>
<7bT5N.54414$BbXa.29985@fx16.iad>
<47787581fe3402eb5170fda771088ef7@news.novabbs.com>
<ujbbdj$3fmi7$1@dont-email.me>
<cc3d63d482758c3fbc4b884d77cbf608@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 19 Nov 2023 19:47:02 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="26803417fac4b0c1b76764e54e58e1a6";
logging-data="4137695"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/sfCPGB4zZxaUnlmLhywlzbxbsEEdfxYM="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:c4CYy9CaID9rQnGN6jFLSmBfao0=
Content-Language: en-US
In-Reply-To: <cc3d63d482758c3fbc4b884d77cbf608@news.novabbs.com>
 by: Chris M. Thomasson - Sun, 19 Nov 2023 19:47 UTC

On 11/19/2023 11:32 AM, MitchAlsup wrote:
> Chris M. Thomasson wrote:
>
>> On 11/17/2023 5:03 PM, MitchAlsup wrote:
>>> Scott Lurndal wrote:
>>>
>>>> mitchalsup@aol.com (MitchAlsup) writes:
>>>>> Stefan Monnier wrote:
>>>>>
>>>>>>> As long as you fully analyze your program, ensure all
>>>>>>> multithreaded accesses
>>>>>>> are only through atomic variables, and you label every access to an
>>>>>>> atomic variable properly (although my point is: exactly what
>>>>>>> should that
>>>>>>> be??), then there is no problem.
>>>>>
>>>>>> BTW, the above sounds daunting when writing in C because you have
>>>>>> to do
>>>>>> that analysis yourself, but there are programming languages out there
>>>>>> which will do that analysis for you as part of type checking.
>>>>>> I'm thinking here of languages like Rust or the STM library of
>>>>>> Haskell.  This also solves the problem that memory accesses can be
>>>>>> reordered by the compiler, since in that case the compiler is fully
>>>>>> aware of which accesses can be reordered and which can't.
>>>>>
>>>>>
>>>>>>         Stefan
>>>>> <
>>>>> I created the Exotic Synchronization Method such that you could just
>>>>> write the code needed to do the work, and then decorate those accesses
>>>>> which are participating in the ATOMIC event. So, lets say you want
>>>>> to move an element from one doubly linked list to another place in
>>>>> some
>>>>> other doubly linked list:: you would write::
>>>>> <
>>>>> BOOLEAN MoveElement( Element *fr, Element *to )
>>>>> {
>>>>>     fn = fr->next;
>>>>>     fp = fr->prev;
>>>>>     tn = to->next;
>>>>>
>>>>>
>>>>>
>>>>>     if( TRUE )
>>>>>     {
>>>>>              fp->next = fn;
>>>>>              fn->prev = fp;
>>>>>              to->next = fr;
>>>>>              tn->prev = fr;
>>>>>              fr->prev = to;
>>>>>              fr->next = tn;
>>>>>              return TRUE;
>>>>>     }
>>>>>     return FALSE;
>>>>> }
>>>>>
>>>>> In order to change this into a fully qualified ATOMIC event, the code
>>>>> is decorated as::
>>>>>
>>>>> BOOLEAN MoveElement( Element *fr, Element *to )
>>>>> {
>>>>>     esmLOCK( fn = fr->next );         // get data
>>>>>     esmLOCK( fp = fr->prev );
>>>>>     esmLOCK( tn = to->next );
>>>>>     esmLOCK( fn );                    // touch data
>>>>>     esmLOCK( fp );
>>>>>     esmLOCK( tn );
>>>>>     if( !esmINTERFERENCE() )
>>>>>     {
>>>>>              fp->next = fn;           // move the bits around
>>>>>              fn->prev = fp;
>>>>>              to->next = fr;
>>>>>              tn->prev = fr;
>>>>>              fr->prev = to;
>>>>>     esmLOCK( fr->next = tn );
>>>>>              return TRUE;
>>>>>     }
>>>>>     return FALSE;
>>>>> }
>>>>>
>>>>> Having a multiplicity of containers participate in an ATOMIC event
>>>>> is key to making ATOMIC stuff fast and needing fewer ATOMICs to
>>>>> get the job(s) done.
>>>
>>>> That looks suspiciously like transactional memory.
>
>> Indeed, it does. Worried about live lock wrt esmINTERFERENCE().
>
> esmINTERFERENCE() is an actual instruction in My 66000 ISA. It is a
> conditional branch where the condition is delivered from the miss
> buffer (where I detect interference wrt participating cache lines.)

So, can false sharing on a participating cache line make
esmINTERFERENCE() return true?
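
Whatever the hardware answer is, software can usually sidestep the question by keeping unrelated data off the participating lines. A minimal C11 sketch, assuming 64-byte cache lines (the My 66000 line size is not stated in this thread) and a hypothetical Element layout:

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumption: cache line size in bytes */

/* Hypothetical list element. alignas on the first member raises the
   alignment of the whole struct, and sizeof is rounded up to a multiple
   of that alignment, so adjacent array elements never share a line. */
typedef struct Element {
    alignas(CACHE_LINE) struct Element *next;
    struct Element *prev;
    unsigned        Key;
} Element;

/* True when two addresses fall within the same cache line. */
int same_line(const void *a, const void *b)
{
    return ((uintptr_t)a / CACHE_LINE) == ((uintptr_t)b / CACHE_LINE);
}

static_assert(sizeof(Element) % CACHE_LINE == 0, "element fills whole lines");
static_assert(alignof(Element) == CACHE_LINE, "element is line-aligned");
```

The cost is memory: each element now occupies a full line, which is the usual trade against false sharing.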

>
>>>
>>> It has some flavors of such, but::
>>> it has no nesting,
>>> it has a strict limit of 8 participating cache lines,
>>> it automagically transfers control when disruptive interference is
>>> detected,
>>> it is subject to timeouts;
>>>
>>> But does have the property that all interested 3rd parties see
>>> participating
>>> memory only in the before or only in the completely after states.
>
>> Are you familiar with KCSS? K-Compare Single Swap?
>
>> https://people.csail.mit.edu/shanir/publications/K-Compare.pdf
>
> Easily done:
>     esmLOCK( c1 = p1->condition1 );
>     esmLOCK( c2 = p2->condition2 );
>     ...
>     if( c1 == C1 && c2 == C2 && c3 == C3 ... )
>         ...
>         esmLOCK( some data );
>
> Esm was designed to allow any known synchronization means (in 2013)
> to be directly implemented in esm either inline or via subroutine
> calls.

I can see how that would work. The problem is that I am not exactly sure
how esmINTERFERENCE works internally... Can it detect/prevent live lock?
I remember hearing from my friend Joe Seigh, who worked at IBM, that
they had some sort of logic that would prevent live lock in a compare
and swap wrt their free pool manipulation logic. Iirc, it was somewhat
related to ABA, hard to remember right now, sorry. I need to find that
old thread back in comp.programming.threads.
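
The IBM logic is not described in this thread, so the following is not it. It is a generic sketch of the common version-counter remedy for ABA on a free pool, written with C11 atomics. The pool size, the index-based links, and all names (pool_init, pool_pop, pool_push) are assumptions made for illustration. Each successful CAS bumps a tag stored next to the head index, so a head that was popped and pushed back no longer compares equal to a stale snapshot.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define POOL_SIZE 4u
#define NIL 0xFFFFFFFFu

/* Free slots are chained by index; head packs {tag:32, index:32}. */
static _Atomic uint32_t next_slot[POOL_SIZE];
static _Atomic uint64_t head;

static uint64_t pack(uint32_t tag, uint32_t idx)
{
    return ((uint64_t)tag << 32) | idx;
}

/* Chain all slots 0 -> 1 -> ... -> NIL and point head at slot 0. */
void pool_init(void)
{
    for (uint32_t i = 0; i < POOL_SIZE; i++)
        atomic_store(&next_slot[i], (i + 1 < POOL_SIZE) ? i + 1 : NIL);
    atomic_store(&head, pack(0, 0));
}

/* Pop a free slot index, or NIL when the pool is empty. */
uint32_t pool_pop(void)
{
    uint64_t old = atomic_load(&head);
    for (;;) {
        uint32_t idx = (uint32_t)old;
        if (idx == NIL)
            return NIL;
        uint64_t desired = pack((uint32_t)(old >> 32) + 1,
                                atomic_load(&next_slot[idx]));
        if (atomic_compare_exchange_weak(&head, &old, desired))
            return idx;
        /* CAS failed: old was refreshed with the current head; retry. */
    }
}

/* Return slot idx to the pool. */
void pool_push(uint32_t idx)
{
    assert(idx < POOL_SIZE);
    uint64_t old = atomic_load(&head);
    uint64_t desired;
    do {
        atomic_store(&next_slot[idx], (uint32_t)old);
        desired = pack((uint32_t)(old >> 32) + 1, idx);
    } while (!atomic_compare_exchange_weak(&head, &old, desired));
}
```

Note that this addresses ABA only. The retry loops are lock-free (some thread always makes progress) but an individual thread can still lose the race repeatedly, which is the live-lock concern raised above.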

>
>> ?
>
>> Check this out the old thread:
>
>> https://groups.google.com/g/comp.arch/c/shshLdF1uqs/m/VLmZSCBGDTkJ

Re: Memory dependency microbenchmark

<ujdu10$3hgra$1@newsreader4.netcologne.de>

https://news.novabbs.org/devel/article-flat.php?id=35112&group=comp.arch#35112

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-2533-0-68ec-5669-f850-19c5.ipv6dyn.netcologne.de!not-for-mail
From: tkoenig@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Sun, 19 Nov 2023 21:17:20 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <ujdu10$3hgra$1@newsreader4.netcologne.de>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<uj3c29$1t9an$1@dont-email.me> <uj3d0a$1tb8u$1@dont-email.me>
<jwv4jhm4pk7.fsf-monnier+comp.arch@gnu.org> <uj3s14$1vjlh$1@dont-email.me>
<20231118211047.00002521@yahoo.com>
<def7206542bfd9715d59357e69b413d6@news.novabbs.com>
Injection-Date: Sun, 19 Nov 2023 21:17:20 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-2533-0-68ec-5669-f850-19c5.ipv6dyn.netcologne.de:2001:4dd7:2533:0:68ec:5669:f850:19c5";
logging-data="3720042"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Sun, 19 Nov 2023 21:17 UTC

MitchAlsup <mitchalsup@aol.com> schrieb:

> Right now, nobody is competitive with x86-64 at the high end. You are
> going to need a 5GHz core operating 4 instructions per cycle to be
> within spitting distance.

Power10 isn't doing badly at 4 GHz if I read the SPECint values
right, but it (very probably) cannot compete on performance per
currency unit.

Re: Memory dependency microbenchmark

<20231120001053.00005acf@yahoo.com>

https://news.novabbs.org/devel/article-flat.php?id=35114&group=comp.arch#35114

Path: i2pn2.org!rocksolid2!news.neodome.net!news.mixmin.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: already5chosen@yahoo.com (Michael S)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Mon, 20 Nov 2023 00:10:53 +0200
Organization: A noiseless patient Spider
Lines: 30
Message-ID: <20231120001053.00005acf@yahoo.com>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<uj3c29$1t9an$1@dont-email.me>
<uj3d0a$1tb8u$1@dont-email.me>
<jwv4jhm4pk7.fsf-monnier+comp.arch@gnu.org>
<uj3s14$1vjlh$1@dont-email.me>
<20231118211047.00002521@yahoo.com>
<def7206542bfd9715d59357e69b413d6@news.novabbs.com>
<ujdu10$3hgra$1@newsreader4.netcologne.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Info: dont-email.me; posting-host="8253098a13aa9de832b791faba8c7fbd";
logging-data="4178071"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+kQ73MF8a6bNA5HO8bUJK7al6YWnroJWI="
Cancel-Lock: sha1:AnvaQFcEAW7WumD0V+uNkU2ABEU=
X-Newsreader: Claws Mail 4.1.1 (GTK 3.24.34; x86_64-w64-mingw32)
 by: Michael S - Sun, 19 Nov 2023 22:10 UTC

On Sun, 19 Nov 2023 21:17:20 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:

> MitchAlsup <mitchalsup@aol.com> schrieb:
>
> > Right now, nobody is competitive with x86-64 at the high end. You
> > are going to need a 5GHz core operating 4 instructions per cycle to
> > be within spitting distance.
>
> Power10 isn't doing badly at 4 GHz if I read the SPECint values
> right, but it (very probably) cannot compete on performance per
> currency unit.

You do not read the SPECint values right.
For starters, there are no POWER "speed" submissions at all. All
published numbers are embarrassingly parallel "rate" scores where, since
POWER7, POWER has had the huge advantage of 8 hardware threads vs. all
competitors except Oracle having only 1 or 2 threads.
Since POWER8, IBM has applied another trick of calling what is essentially 2
cores a single core.
My uneducated guess is that in single-threaded integer performance
POWER10 should be approximately on par with the best Intel Skylake, i.e.
below the fastest Intel, AMD and Apple chips from 2021,
and even more so relative to 2023.
By 2023Q4, apart from those three mentioned above, we have Qualcomm
near the top, and even ARM Inc (Cortex X3), while behind the other four,
is probably ahead of IBM.

Re: Memory dependency microbenchmark

<d0d4b97887504ac3464fa158e8a4d45a@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=35115&group=comp.arch#35115

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Sun, 19 Nov 2023 22:23:21 +0000
Organization: novaBBS
Message-ID: <d0d4b97887504ac3464fa158e8a4d45a@news.novabbs.com>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com> <uiu3dp$svfh$1@dont-email.me> <u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com> <uj39sg$1stm2$1@dont-email.me> <jwvleaw1j9b.fsf-monnier+comp.arch@gnu.org> <491d288a9d5a47236c979622b79db056@news.novabbs.com> <7bT5N.54414$BbXa.29985@fx16.iad> <47787581fe3402eb5170fda771088ef7@news.novabbs.com> <ujbbdj$3fmi7$1@dont-email.me> <cc3d63d482758c3fbc4b884d77cbf608@news.novabbs.com> <ujdonm$3u8mv$2@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="1477412"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Rslight-Site: $2y$10$dGJkjTFg.plwVZQX431dt.Q1qQxbRa.Rk6TkPTl6DlCpxO7/FbX2S
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
X-Spam-Level: *
 by: MitchAlsup - Sun, 19 Nov 2023 22:23 UTC

Chris M. Thomasson wrote:

> On 11/19/2023 11:32 AM, MitchAlsup wrote:
>> Chris M. Thomasson wrote:
>>
>>> On 11/17/2023 5:03 PM, MitchAlsup wrote:
>>>> Scott Lurndal wrote:
>>>>
>>>>> mitchalsup@aol.com (MitchAlsup) writes:
>>>>>> Stefan Monnier wrote:
>>>>>>
>>>>>>>> As long as you fully analyze your program, ensure all
>>>>>>>> multithreaded accesses
>>>>>>>> are only through atomic variables, and you label every access to an
>>>>>>>> atomic variable properly (although my point is: exactly what
>>>>>>>> should that
>>>>>>>> be??), then there is no problem.
>>>>>>
>>>>>>> BTW, the above sounds daunting when writing in C because you have
>>>>>>> to do
>>>>>>> that analysis yourself, but there are programming languages out there
>>>>>>> which will do that analysis for you as part of type checking.
>>>>>>> I'm thinking here of languages like Rust or the STM library of
>>>>>>> Haskell.  This also solves the problem that memory accesses can be
>>>>>>> reordered by the compiler, since in that case the compiler is fully
>>>>>>> aware of which accesses can be reordered and which can't.
>>>>>>
>>>>>>
>>>>>>>         Stefan
>>>>>> <
>>>>>> I created the Exotic Synchronization Method such that you could just
>>>>>> write the code needed to do the work, and then decorate those accesses
>>>>>> which are participating in the ATOMIC event. So, lets say you want
>>>>>> to move an element from one doubly linked list to another place in
>>>>>> some
>>>>>> other doubly linked list:: you would write::
>>>>>> <
>>>>>> BOOLEAN MoveElement( Element *fr, Element *to )
>>>>>> {
>>>>>>     fn = fr->next;
>>>>>>     fp = fr->prev;
>>>>>>     tn = to->next;
>>>>>>
>>>>>>
>>>>>>
>>>>>>     if( TRUE )
>>>>>>     {
>>>>>>              fp->next = fn;
>>>>>>              fn->prev = fp;
>>>>>>              to->next = fr;
>>>>>>              tn->prev = fr;
>>>>>>              fr->prev = to;
>>>>>>              fr->next = tn;
>>>>>>              return TRUE;
>>>>>>     }
>>>>>>     return FALSE;
>>>>>> }
>>>>>>
>>>>>> In order to change this into a fully qualified ATOMIC event, the code
>>>>>> is decorated as::
>>>>>>
>>>>>> BOOLEAN MoveElement( Element *fr, Element *to )
>>>>>> {
>>>>>>     esmLOCK( fn = fr->next );         // get data
>>>>>>     esmLOCK( fp = fr->prev );
>>>>>>     esmLOCK( tn = to->next );
>>>>>>     esmLOCK( fn );                    // touch data
>>>>>>     esmLOCK( fp );
>>>>>>     esmLOCK( tn );
>>>>>>     if( !esmINTERFERENCE() )
>>>>>>     {
>>>>>>              fp->next = fn;           // move the bits around
>>>>>>              fn->prev = fp;
>>>>>>              to->next = fr;
>>>>>>              tn->prev = fr;
>>>>>>              fr->prev = to;
>>>>>>     esmLOCK( fr->next = tn );
>>>>>>              return TRUE;
>>>>>>     }
>>>>>>     return FALSE;
>>>>>> }
>>>>>>
>>>>>> Having a multiplicity of containers participate in an ATOMIC event
>>>>>> is key to making ATOMIC stuff fast and needing fewer ATOMICs to
>>>>>> get the job(s) done.
>>>>
>>>>> That looks suspiciously like transactional memory.
>>
>>> Indeed, it does. Worried about live lock wrt esmINTERFERENCE().
>>
>> esmINTERFERENCE() is an actual instruction in My 66000 ISA. It is a
>> conditional branch where the condition is delivered from the miss
>> buffer (where I detect interference wrt participating cache lines.)

> So, can false sharing on a participating cache line make
> esmINTERFERENCE() return true?

>>
>>>>
>>>> It has some flavors of such, but::
>>>> it has no nesting,
>>>> it has a strict limit of 8 participating cache lines,
>>>> it automagically transfers control when disruptive interference is
>>>> detected,
>>>> it is subject to timeouts;
>>>>
>>>> But does have the property that all interested 3rd parties see
>>>> participating
>>>> memory only in the before or only in the completely after states.
>>
>>> Are you familiar with KCSS? K-Compare Single Swap?
>>
>>> https://people.csail.mit.edu/shanir/publications/K-Compare.pdf
>>
>> Easily done:
>>     esmLOCK( c1 = p1->condition1 );
>>     esmLOCK( c2 = p2->condition2 );
>>     ...
>>     if( c1 == C1 && c2 == C2 && c3 == C3 ... )
>>         ...
>>         esmLOCK( some data );
>>
>> Esm was designed to allow any known synchronization means (in 2013)
>> to be directly implemented in esm either inline or via subroutine
>> calls.

> I can see how that would work. The problem is that I am not exactly sure
> how esmINTERFERENCE works internally... Can it detect/prevent live lock?

esmINTERFERENCE is a branch-on-interference instruction. This is a conditional
branch instruction that queries whether any of the participating cache lines
has seen a read-with-intent or coherent-invalidate. In effect the branch
logic reaches out to the miss buffer and asks if any of the participating
cache lines has been snooped-for-write:: and you can't do this in 2
instructions or you lose ATOMICity.

If it has, control is transferred and the ATOMIC event is failed.
If it has not, and all participating cache lines are present, then this
core is allowed to NAK all requests to those participating cache lines
{and control is not transferred}.

So, you gain control over where flow goes on failure, and essentially
commit the whole event to finish.

So, while it does not eliminate live/dead-lock situations, it allows SW
to be constructed to avoid live/dead-lock situations:: WHY is a value
which is provided when an ATOMIC event fails. 0 means success, negative
values are spurious (buffer overflows,...) while positives represent
the number of competing threads, so the following case skips elements
on a linked list to decrease future interference.

Element* getElement( unsigned Key )
{
    Element *p, *prev, *next;
    int count = 0;                          // drives how many matches get skipped
    for( p = structure.head; p ; p = p->next )
    {
        if( p->Key == Key )
        {
            if( count-- < 0 )
            {
                esmLOCK( p );               // p, prev and next participate
                prev = p->prev;
                esmLOCK( prev );
                next = p->next;
                esmLOCK( next );
                if( !esmINTERFERENCE() )
                {
                    prev->next = next;      // unlink p from its neighbours
                    next->prev = prev;
                    p->prev = NULL;
                    esmLOCK( p->next = NULL );
                    return p;
                }
                else
                {
                    count = esmWHY();       // more competitors => skip more
                    p = structure.head;     // restart the search
                }
            }
        }
    }
    return NULL;
}
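
For comparison, the plain single-threaded unlink that the ATOMIC event is meant to perform can be sketched in ordinary C as follows. The Element layout is an assumption (the thread never shows it), and with no esm decoration this version is only safe when no other thread touches the list.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed element layout; not given in the thread. */
typedef struct Element {
    struct Element *next;
    struct Element *prev;
    unsigned        Key;
} Element;

/* Detach p from its neighbours and clear its own links.
   Assumes p is an interior element, i.e. both neighbours exist. */
Element *unlink_element(Element *p)
{
    assert(p != NULL && p->prev != NULL && p->next != NULL);
    Element *prev = p->prev;
    Element *next = p->next;
    prev->next = next;      /* predecessor now skips over p */
    next->prev = prev;      /* successor points back past p */
    p->prev = NULL;
    p->next = NULL;
    return p;
}
```

The four stores touch up to three elements, which is why the decorated version needs several cache lines to participate in one event.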

Doing ATOMIC things like this means one can take the BigO( n^3 ) activity
that happens when a timer goes off and n threads all want access to the
work queue, down to BigO( 3 ) {yes, a constant}, but in practice it is reduced
to BigO( ln( n ) ) when requesters arrive in random order at random times.

> I remember hearing from my friend Joe Seigh, who worked at IBM, that
> they had some sort of logic that would prevent live lock in a compare
> and swap wrt their free pool manipulation logic. Iirc, it was somewhat
> related to ABA, hard to remember right now, sorry. I need to find that
> old thread back in comp.programming.threads.


Re: Memory dependency microbenchmark

<2023Nov20.083409@mips.complang.tuwien.ac.at>

https://news.novabbs.org/devel/article-flat.php?id=35119&group=comp.arch#35119

Path: i2pn2.org!i2pn.org!news.furie.org.uk!usenet.goja.nl.eu.org!weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Mon, 20 Nov 2023 07:34:09 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 69
Message-ID: <2023Nov20.083409@mips.complang.tuwien.ac.at>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <uj3c29$1t9an$1@dont-email.me> <uj3d0a$1tb8u$1@dont-email.me> <jwv4jhm4pk7.fsf-monnier+comp.arch@gnu.org> <uj3s14$1vjlh$1@dont-email.me> <20231118211047.00002521@yahoo.com> <def7206542bfd9715d59357e69b413d6@news.novabbs.com> <ujdu10$3hgra$1@newsreader4.netcologne.de> <20231120001053.00005acf@yahoo.com>
Injection-Info: dont-email.me; posting-host="675554203234385e5bf21c1ea33282ab";
logging-data="267916"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/R6cEJM+ssoCpEfJJQXkqX"
Cancel-Lock: sha1:M7sLSRTaYE+QDRwdsv04u+UNDDQ=
X-newsreader: xrn 10.11
 by: Anton Ertl - Mon, 20 Nov 2023 07:34 UTC

Michael S <already5chosen@yahoo.com> writes:
>My uneducated guess is that in single-threaded integer performance
>POWER10 should be approximately on par with the best Intel Skylake

Slightly more educated:

gforth-fast onebench.fs (development version of Gforth), lower is better:

 sieve bubble matrix   fib   fft version;  machine
 0.075  0.099  0.042 0.112 0.033 20231116; Power10 3900MHz; gcc-11.4.1
 0.061  0.090  0.019 0.056 0.021 20231116; Core i5-6600K 4000MHz; gcc-10.1

Another benchmark I can run quickly is the LaTeX benchmark
<http://www.complang.tuwien.ac.at/anton/latex-bench/>:

On the Power10:

        487.14 msec task-clock        #    0.998 CPUs utilized
             3      context-switches  #    6.158 /sec
             1      cpu-migrations    #    2.053 /sec
           590      page-faults       #    1.211 K/sec
    1897813538      cycles            #    3.896 GHz
    4411333480      instructions      #    2.32  insn per cycle
     542280135      branches          #    1.113 G/sec
      11170586      branch-misses     #    2.06% of all branches

0.488029554 seconds time elapsed

0.477731000 seconds user
0.010164000 seconds sys

On the Core i5-6600K (Skylake):

        396.04 msec task-clock        #    0.999 CPUs utilized
             2      context-switches  #    0.005 K/sec
             0      cpu-migrations    #    0.000 K/sec
          7091      page-faults       #    0.018 M/sec
    1584116718      cycles            #    4.000 GHz
    3015599019      instructions      #    1.90  insn per cycle
     564834472      branches          # 1426.217 M/sec
       8230928      branch-misses     #    1.46% of all branches

0.396422458 seconds time elapsed

0.380444000 seconds user
0.016018000 seconds sys

One disadvantage of this LaTeX benchmark is that it depends on the
LaTeX version, and on the installed packages how much it has to do.
In the present case both systems use TeX Live 2020, so that should
make no difference. The Skylake has a lot of packages installed and
executes more instructions for the LaTeX benchmark than any other
AMD64 system where I measured the executed instructions. I do not
know how many packages are installed on the Power10 system.

Comparing the number of branches is interesting. If we assume that
compiler transformations like if-conversion and loop unrolling do
*not* cause a significant difference in branches, the similar number
of branches suggests that the workload is comparable.

In any case, both results indicate that the 3900MHz Power10 has a
lower single-thread performance than a 4GHz Skylake. So you buy a
Power10 if you want to process large numbers of threads or processes
and want to put in lots of RAM.
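
The derived figures behind that comparison can be recomputed from the perf counters quoted above. A small C sketch; the struct and function names are illustrative only, and the numbers are copied from the two listings:

```c
#include <assert.h>
#include <stdio.h>

/* Counter values copied from the two perf stat listings. */
typedef struct {
    const char *machine;
    double task_clock_ms;
    double cycles;
    double instructions;
    double branches;
} PerfRun;

const PerfRun power10 = { "Power10 3900MHz", 487.14,
                          1897813538.0, 4411333480.0, 542280135.0 };
const PerfRun skylake = { "Core i5-6600K 4000MHz", 396.04,
                          1584116718.0, 3015599019.0, 564834472.0 };

/* Instructions executed per cycle. */
double ipc(const PerfRun *r)
{
    assert(r->cycles > 0.0);
    return r->instructions / r->cycles;
}

/* Average clock in GHz: cycles divided by nanoseconds of task clock. */
double clock_ghz(const PerfRun *r)
{
    return r->cycles / (r->task_clock_ms * 1e6);
}

/* Print the per-machine figures and the Power10/Skylake ratios. */
void report(void)
{
    printf("%-22s IPC %.2f  clock %.3f GHz\n",
           power10.machine, ipc(&power10), clock_ghz(&power10));
    printf("%-22s IPC %.2f  clock %.3f GHz\n",
           skylake.machine, ipc(&skylake), clock_ghz(&skylake));
    printf("time ratio         %.2f\n",
           power10.task_clock_ms / skylake.task_clock_ms);
    printf("instruction ratio  %.2f\n",
           power10.instructions / skylake.instructions);
    printf("branch ratio       %.2f\n",
           power10.branches / skylake.branches);
}
```

The IPC values reproduce the 2.32 and 1.90 that perf reports. The Power10 run takes about 1.23 times as long while executing roughly 4% fewer branches, so its higher IPC is more than offset by the larger instruction count.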

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Memory dependency microbenchmark

<20231120164005.000074d3@yahoo.com>

https://news.novabbs.org/devel/article-flat.php?id=35122&group=comp.arch#35122

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: already5chosen@yahoo.com (Michael S)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Mon, 20 Nov 2023 16:40:05 +0200
Organization: A noiseless patient Spider
Lines: 86
Message-ID: <20231120164005.000074d3@yahoo.com>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<uj3c29$1t9an$1@dont-email.me>
<uj3d0a$1tb8u$1@dont-email.me>
<jwv4jhm4pk7.fsf-monnier+comp.arch@gnu.org>
<uj3s14$1vjlh$1@dont-email.me>
<20231118211047.00002521@yahoo.com>
<def7206542bfd9715d59357e69b413d6@news.novabbs.com>
<ujdu10$3hgra$1@newsreader4.netcologne.de>
<20231120001053.00005acf@yahoo.com>
<2023Nov20.083409@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Info: dont-email.me; posting-host="0b2ecea226e69bfc947d787beb42585e";
logging-data="375643"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18TOpuUc9yhGaATJkwh+PAAHTEKCMM1Ykc="
Cancel-Lock: sha1:9CB+7iBOhb6rgF/SlXZdPx+MUtQ=
X-Newsreader: Claws Mail 3.19.1 (GTK+ 2.24.33; x86_64-w64-mingw32)
 by: Michael S - Mon, 20 Nov 2023 14:40 UTC

On Mon, 20 Nov 2023 07:34:09 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> Michael S <already5chosen@yahoo.com> writes:
> >My uneducated guess is that in single-threaded integer performance
> >POWER10 should be approximately on par with the best Intel Skylake
>
> Slightly more educated:
>
> gforth-fast onebench.fs (development version of Gforth), lower is
> better:
>
>  sieve bubble matrix   fib   fft version;  machine
>  0.075  0.099  0.042 0.112 0.033 20231116; Power10 3900MHz; gcc-11.4.1
>  0.061  0.090  0.019 0.056 0.021 20231116; Core i5-6600K 4000MHz; gcc-10.1
>
> Another benchmark I can run quickly is the LaTeX benchmark
> <http://www.complang.tuwien.ac.at/anton/latex-bench/>:
>
> On the Power10:
>
>         487.14 msec task-clock        #    0.998 CPUs utilized
>              3      context-switches  #    6.158 /sec
>              1      cpu-migrations    #    2.053 /sec
>            590      page-faults       #    1.211 K/sec
>     1897813538      cycles            #    3.896 GHz
>     4411333480      instructions      #    2.32  insn per cycle
>      542280135      branches          #    1.113 G/sec
>       11170586      branch-misses     #    2.06% of all branches
>
> 0.488029554 seconds time elapsed
>
> 0.477731000 seconds user
> 0.010164000 seconds sys
>
> On the Core i5-6600K (Skylake):
>
>         396.04 msec task-clock        #    0.999 CPUs utilized
>              2      context-switches  #    0.005 K/sec
>              0      cpu-migrations    #    0.000 K/sec
>           7091      page-faults       #    0.018 M/sec
>     1584116718      cycles            #    4.000 GHz
>     3015599019      instructions      #    1.90  insn per cycle
>      564834472      branches          # 1426.217 M/sec
>        8230928      branch-misses     #    1.46% of all branches
>
> 0.396422458 seconds time elapsed
>
> 0.380444000 seconds user
> 0.016018000 seconds sys
>
> One disadvantage of this LaTeX benchmark is that it depends on the
> LaTeX version, and on the installed packages how much it has to do.
> In the present case both systems use Tex Live 2020, so that should
> make no difference. The Skylake has a lot of packages installed and
> executes more instructions for the LaTeX benchmark than any other
> AMD64 system where I measured the executed instructions. I do not
> know how many packages are installed on the Power10 system.
>
> Comparing the number of branches is interesting. If we assume that
> compiler transformations like if-conversion and loop unrolling do
> *not* cause a significant difference in branches, the similar number
> of branches suggests that the workload is comparable.
>
> In any case, both results indicate that the 3900MHz Power10 has a
> lower single-thread performance than a 4GHz Skylake. So you buy a
> Power10 if you want to process large numbers of threads or processes
> and want to put in lots of RAM.
>
> - anton

That's interesting, thank you.

Don't your benchmarks have rather small datasets?
If so, that puts the likes of POWER10 (or Skylake-SP/Skylake-X),
with their big, slow L2 caches, at a disadvantage relative to Skylake
Client, which has a small, fast L2 cache.

But if your numbers are representative then POWER10 can be matched by
rather ancient Intel CPUs, e.g. by the 9.5-year-old i7-4790. Even my Xeon
E3-1271 v3 would stand a chance. Or AMD Zen1.

Re: weak consistency and the supercomputer attitude

<ujfv8l$ca4g$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35124&group=comp.arch#35124
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: paaronclayton@gmail.com (Paul A. Clayton)
Newsgroups: comp.arch
Subject: Re: weak consistency and the supercomputer attitude
Date: Mon, 20 Nov 2023 10:50:34 -0500
Organization: A noiseless patient Spider
Lines: 195
Message-ID: <ujfv8l$ca4g$1@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<uimulp$38v2q$1@dont-email.me>
<79bcf6d10b3c231ba7e3693bad8db3c1@news.novabbs.com>
<uiri0a$85mp$2@dont-email.me> <uirjr7$8dlh$2@dont-email.me>
<uirke6$8hef$3@dont-email.me> <2023Nov13.084835@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 20 Nov 2023 15:50:45 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="7c50aecd46a7733eb37093e73ca1b1cb";
logging-data="403600"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+aJ+qaKn4J+fLBFo0fAMn3K06NJ7webrY="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.0
Cancel-Lock: sha1:6x5ShhtnAYhVzHjGzd4FcDgWid0=
In-Reply-To: <2023Nov13.084835@mips.complang.tuwien.ac.at>
 by: Paul A. Clayton - Mon, 20 Nov 2023 15:50 UTC

On 11/13/23 2:48 AM, Anton Ertl wrote:
[snip]
> I think about several similar instances, where people went for
> simple-minded hardware designs and threw the complexity over the wall
> to the software people, and claimed that it was for performance; I
> call that the "supercomputing attitude", and it may work in areas
> where the software crisis has not yet struck[1], but is a bad attitude
> in areas like general-purpose computing where it has struck.

This is not just a hardware-software wall problem, though that
wall and its abuse are usually well-established. As someone with a
micro-optimization orientation, I know I need more external
awareness, but as a non-practicing entity what I think or present
has little effect/danger. Even in my case, there is some danger of
spreading falsehoods (or dangerously incomplete truths), so
external correction is valuable (and I value it myself as I
dislike being wrong; being corrected early hurts, but hurts less
than being corrected after the inaccuracy has been
well-established in my own and others' minds).

System-aware optimization also interacts with interface layering.
Isolating concerns reduces design complexity and, for a given
complexity, allows exploiting "don't care" aspects. The "don't
care" aspects can be painful when the interface user does care;
sometimes these can force a violation of the abstraction,
introducing a dependency on a specific implementation (which can
then introduce an informal interface [performance compatibility
is a common informal interface]).

> 1) People thought that they could achieve faster hardware by throwing
> the task of scheduling instructions for maximum instruction-level
> parallelism over to the compiler people. Several companies (in
> particular, Intel, HP, and Transmeta) invested a lot of money into
> this dream (and the Mill project relives this dream), but it turned
> out that doing the scheduling in hardware is faster.

Yet there does not seem to be a strong push to develop a
dataflow-oriented interface/ISA (one that does not require genius
programmers or super-genius compilers). I am not certain what such
an interface would look like, but I suspect something closer to a
transport-triggered architecture (TTA) would be an early step. A
TTA-like architecture would compactly encode single use values and
provide some routing information while supporting (possible)
multiple use and some sense of use deferment (loads and stores).

Value prediction (including branch/predicate prediction) also
seems to be required to be included in design considerations.

Such an ISA would also probably blur the boundaries between
threads and naturally support speculative multithreading, which is
in some sense a distant/variable deferment communication/dataflow.

[snip]
> Meanwhile, Mitch Alsup also has posted that he can
> implement fast denormal numbers with IIRC 30 extra gates (which is
> probably less than what is needed for implementing the trap barrier).

I think that cost estimate assumes the inclusion of (single
rounding) FMADD. Single-rounding FMADD was not common for RISCs
when the Alpha designers made their choice.

I am **certainly not** a numerical analyst, but I had the
impression that flush-to-zero was not horrible for analyzing for
correctness and (for double precision) not commonly a problem. Yet
I also think that having multiply round based on an "integer"
power-of-two high result (without carry-in) — where the hardware
could also be used for integer multiply by reciprocal — might have
been "better", so my opinion should probably be taken with a mine
of salt.
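
For readers less familiar with the trade-off, here is a minimal sketch of what flush-to-zero gives up compared with gradual underflow. It assumes IEEE 754 double precision with subnormals enabled (the usual default); flush_to_zero() and sub_ftz() are hypothetical helpers that only emulate the flushing in software, they are not any hardware's actual mode switch.

```c
#include <assert.h>
#include <float.h>

/* True if x is a nonzero value below the smallest normal double. */
static int is_subnormal(double x)
{
    return x != 0.0 && x > -DBL_MIN && x < DBL_MIN;
}

/* Hypothetical emulation of flush-to-zero: subnormal results are
   replaced by a zero of the same sign, everything else passes through. */
static double flush_to_zero(double x)
{
    if (is_subnormal(x))
        return x < 0.0 ? -0.0 : 0.0;
    return x;
}

/* Subtraction as a flush-to-zero machine would deliver it. */
static double sub_ftz(double a, double b)
{
    return flush_to_zero(a - b);
}

/* With gradual underflow, a != b implies a - b != 0.
   With flush-to-zero that implication no longer holds. */
static void check_underflow_behaviour(void)
{
    double a = 1.5 * DBL_MIN;   /* normal number                 */
    double b = DBL_MIN;         /* smallest positive normal      */

    assert(a != b);
    assert(is_subnormal(a - b));    /* IEEE result: 0.5 * DBL_MIN  */
    assert(a - b != 0.0);
    assert(sub_ftz(a, b) == 0.0);   /* flushed result              */
}
```

The case only arises within a factor of two or so of the underflow threshold, which fits the impression above that it is rarely a problem in double precision.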

I would not be surprised if special-purpose low-power DSPs not
only use non-IEEE formats but also use inexact rounding. Even using
inexact computation might be justified for extreme cases.

> 3) The Alpha is a rich source of examples of the supercomputer
> attitude: It started out without instructions for accessing 8-bit and
> 16-bit data in memory. Instead, the idea was that for accessing
> memory, you would use instruction sequences, and for accessing I/O
> devices, the device was mapped three times or so: In one address range
> you performed bytewise access, in another address range 16-bit
> accesses, and in the third address range 32-bit and 64-bit accesses;
> I/O driver writers had to write or modify their drivers for this
> model. The rationale for that was that they required ECC for
> permanent storage and that would supposedly require slow RMW accesses
> for writing bytes to write-back caches. Now the 21064 and 21164 had a
> write-through D-cache. That made it easy to add byte and word
> accesses (BWX) in the 21164A (released 1996), but they could have done
> it from the start. The 21164A is in no way slower than the 21164; it
> has the same IPC and a higher clock rate.

Yet Intel has been using byte parity for L1 Dcaches, so that
design choice was perhaps not *entirely* irrational. (I disagree
with that choice, having hindsight, but I can appreciate the
reasoning.) Parity-only L1 Dcaches are not that bad since the
SRAM design will likely be more robust to allow faster access (I
think) and dirty values will tend to be either evicted quickly or
checked often.

If smaller writes are rare, hardware RMW in a writeback cache
would not have been that expensive, but the cost would have no
value if smaller writes are never necessary.
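
As an illustration of the read-modify-write being discussed, here is a rough C sketch of a byte store built only from full 64-bit loads and stores, in the spirit of the instruction sequences mentioned in the quoted text. The function names are hypothetical and byte 0 is assumed to be the least significant byte; it is not the actual Alpha sequence.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: store one byte into a 64-bit word using only
   full-word accesses. byte_index 0 is the least significant byte. */
static void store_byte_rmw(uint64_t *word, unsigned byte_index, uint8_t value)
{
    assert(byte_index < 8);
    unsigned shift = byte_index * 8;

    uint64_t old = *word;                                 /* read   */
    uint64_t cleared = old & ~((uint64_t)0xFF << shift);  /* modify */
    *word = cleared | ((uint64_t)value << shift);         /* write  */
}

/* Hypothetical helper: extract one byte from a 64-bit word. */
static uint8_t load_byte(const uint64_t *word, unsigned byte_index)
{
    assert(byte_index < 8);
    return (uint8_t)(*word >> (byte_index * 8));
}
```

Note that the sequence is not atomic: two threads storing to different bytes of the same word can lose one of the updates, which is one reason the software-only approach is painful for shared data and for I/O.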

(I do wonder if there is an interface that would allow software to
reduce hardware RMW costs — often a value is read before being
modified — without introducing more complexity than benefit.
Exploiting the standard double-wide read used for unaligned
accesses to access a double-wide aligned memory seems similarly
desirable. While idiom-detection would allow this to be done in
hardware without changing the interface, idiom detection is more
complex than direct encoding and typically relies on software to
reduce that complexity — e.g., only detecting short contiguous
idioms.)

The different memory regions trick is also used for bit-granular
accesses in some ISAs (e.g., ARM) mainly for I/O device accesses.
Even without side-effects for accesses, non-atomicity might be a
concern. (Of course, one could architect that all simple
load-op-store sequences on that type of memory are atomic, using
three-instruction idiom detection.)
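
A small sketch of the address arithmetic behind such a region, assuming the ARM reference is to Cortex-M3/M4 bit-banding, where each bit of the 1 MB SRAM bit-band region at 0x20000000 has its own 32-bit word in the alias region at 0x22000000. The macro and function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Cortex-M3/M4 SRAM bit-band region and its alias region. */
#define BITBAND_SRAM_BASE   0x20000000u
#define BITBAND_SRAM_SIZE   0x00100000u   /* 1 MB */
#define BITBAND_SRAM_ALIAS  0x22000000u

/* Hypothetical helper: address of the alias word that maps to bit `bit`
   of the byte at `byte_addr`. Each bit gets one 32-bit alias word. */
static uint32_t bitband_alias(uint32_t byte_addr, unsigned bit)
{
    assert(bit < 8);
    assert(byte_addr >= BITBAND_SRAM_BASE);
    assert(byte_addr - BITBAND_SRAM_BASE < BITBAND_SRAM_SIZE);

    uint32_t byte_offset = byte_addr - BITBAND_SRAM_BASE;
    return BITBAND_SRAM_ALIAS + byte_offset * 32u + bit * 4u;
}
```

A word store to the alias address sets or clears just that one bit, so the software-visible operation is a single store rather than a load-op-store sequence.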

> Some people welcome and celebrate the challenges that the
> supercomputer attitude poses for software, and justify it with
> "performance", but as the examples above show, such claims often turn
> out to be false when you actually invest effort into more capable
> hardware.

The tricky part seems to be in discerning when (and where) extra
effort is justified. This also depends on how easily the
difficulty can be encapsulated. Can a compiler reliably "do the
right thing" (without having to have been written by a supergenius
AI)? Can a library reliably provide the necessary extra
functionality — splitting the difficulty between application
programmer discipline and difficulty of developing the system
software — without requiring genius system programmers and highly
competent application programmers?

Someone who writes lock-free methods for fun is probably not well-
positioned to estimate the difficulty/lack-of-fun of such for most
programmers. Communication between different interest groups seems
critical, but communication also requires data and not just
anecdotes or traditional wisdom. (Anecdotes and traditional wisdom
do have value!)

[snip]
> But if you look at it from an architecture (i.e., hardware/software
> interface) perspective, weak consistency is just bad architecture:
> good architecture says what happens to the architectural state when
> software performs some instruction. From that perspective sequential
> consistency is architecturally best. Weaker consistency models
> describe how the architecture does not provide the sequential
> consistency guarantees that are so easy to describe; the weaker the
> model, the more deviations it has to describe.

I am not convinced that sequential consistency is the best
interface. My66000 does not provide sequential consistency for
ordinary memory. While Mitch Alsup would have difficulty
empathizing with most programmers, he has enough experience to
write specifications for "hostile" engineers so he probably
understands the tradeoffs on both sides of the interface fairly
well.

When an effort is considered hard, like parallel programming, there
seems to be a spectrum of viewpoints, from the UNIX/"real
programmers" perspective of limiting the effort to experts to
simplifying the system interface so that almost anyone can do almost
anything. The extreme positions have obvious cultural issues (where
expertise is either required for worth or despised as
arrogance) as well as mechanical issues (expertise is naturally
limited by finite knowledge, where vast knowledge implies
communication overhead even within a single supercomputer
complex).
> [1] The software crisis is that software costs are higher than
> hardware costs, and supercomputing with its gigantic hardware costs
> notices the software crisis much later than general-purpose computing.


Re: Memory dependency microbenchmark

<ujfvr0$cda9$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35125&group=comp.arch#35125
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: paaronclayton@gmail.com (Paul A. Clayton)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Mon, 20 Nov 2023 11:00:30 -0500
Organization: A noiseless patient Spider
Lines: 73
Message-ID: <ujfvr0$cda9$1@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<uirke6$8hef$3@dont-email.me>
<36011a9597060e08d46db0eddfed0976@news.novabbs.com>
<uirqj3$9q9q$1@dont-email.me> <uisd4b$gd4s$1@dont-email.me>
<sQt4N.2604$rx%7.497@fx47.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 20 Nov 2023 16:00:32 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="7c50aecd46a7733eb37093e73ca1b1cb";
logging-data="406857"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+5nWiN3+qpTC4Z89egUCia3bQCwvFZ1Lg="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.0
Cancel-Lock: sha1:qR01EsLZoAQfyMZfaF4ASvAGvwc=
In-Reply-To: <sQt4N.2604$rx%7.497@fx47.iad>
 by: Paul A. Clayton - Mon, 20 Nov 2023 16:00 UTC

On 11/13/23 1:22 PM, EricP wrote:
> Kent Dickey wrote:
[snip]
>> Thus, the people trapped in Relaxed Ordering Hell then push weird
>> schemes on everyone else to try to come up with algorithms which need
>> fewer barriers.  It's crazy.
>>
>> Relaxed Ordering is a mistake.
>>
>> Kent
>
> I suggest something different: the ability to switch between TSO and
> relaxed with non-privileged user mode instructions.
>
> Non-concurrent code does not see the relaxed ordering, and should
> benefit from extra concurrency in the Load Store Queue and cache that
> relaxed rules allow, because the local core always sees its own memory
> as consistent.
> For example, relaxed ordering allows multiple LD and ST to be in
> multiple pipelines to multiple cache banks at once without regard
> as to the exact order the operations are applied.
>
> This is fine for non concurrently accessed data structures,
> either non-shared data areas or shared but guarded by mutexes.
>
> But relaxed is hard for people to reason about for concurrently
> accessed lock free data structures. Now these don't just appear out of
> thin air so it is reasonable for a program to emit TSO_START and
> TSO_END instructions.
>
> On the other hand, almost no code is lock-free or ever will be.
> So why have all the extra HW logic to support TSO if its only really
> needed for this rare kind of programming.
>
> But there is also a category of memory area that is not covered by
> the above rules, one where one core thinks its memory is local and not
> shared but in fact it is being accessed concurrently.
>
> If thread T1 (say an app) on core C1 says its memory is relaxed,
> and calls a subroutine passing a pointer to a value on T1's stack, and
> that pointer is passed to thread T2 (a driver) on core C2 which
> accesses that memory, then even if T2 declared itself to be using TSO
> rules it would not force T1 on C1 obey them.
>
> Where this approach could fail is the kind of laissez-faire
> sharing done by many apps, libraries, and OS's behind the scenes in
> the real world.

Another possibility is for non-shared memory to be handled
differently. (This is similar to My 66000's handling of memory
types and things mentioned by Mitch Alsup here.)

Even with a multithreaded program, stack and TLS would be "thread
private" and not require the same consistency guarantees.

Various memory partitioning schemes theoretically can provide
similar benefits for systems with shared memory controllers where
programs do not share most modifiable data with other programs.
Even something like a web hosting system might be able to benefit
from lack of coherence (much less consistency) between different
web hosts.

Re: Memory dependency microbenchmark

<2023Nov20.165008@mips.complang.tuwien.ac.at>

https://news.novabbs.org/devel/article-flat.php?id=35127&group=comp.arch#35127
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Mon, 20 Nov 2023 15:50:08 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 85
Message-ID: <2023Nov20.165008@mips.complang.tuwien.ac.at>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <uj3c29$1t9an$1@dont-email.me> <uj3d0a$1tb8u$1@dont-email.me> <jwv4jhm4pk7.fsf-monnier+comp.arch@gnu.org> <uj3s14$1vjlh$1@dont-email.me> <20231118211047.00002521@yahoo.com> <def7206542bfd9715d59357e69b413d6@news.novabbs.com> <ujdu10$3hgra$1@newsreader4.netcologne.de> <20231120001053.00005acf@yahoo.com> <2023Nov20.083409@mips.complang.tuwien.ac.at> <20231120164005.000074d3@yahoo.com>
Injection-Info: dont-email.me; posting-host="675554203234385e5bf21c1ea33282ab";
logging-data="413895"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+HCFmJjsIwTtNlkZD0MH06"
Cancel-Lock: sha1:LMI4vR1ZpZvPnk/AsCEpBwjoit8=
X-newsreader: xrn 10.11
 by: Anton Ertl - Mon, 20 Nov 2023 15:50 UTC

Michael S <already5chosen@yahoo.com> writes:
>Don't your benchmarks have rather small datasets?
>If so, that puts the likes of POWER10 (or Skylake-SP/Skylake-X),
>with their big, slow L2 caches, at a disadvantage relative to Skylake
>Client, which has a small, fast L2 cache.

These benchmarks don't miss L1 much, so the size and latency of the L2
have little influence. For the LaTeX benchmark:

Performance counter stats for 'latex bench':

2963866155 instructions:u
874909912 L1-dcache-loads
13846145 L1-dcache-load-misses #1.58% of all L1-dcache accesses
2754917 LLC-loads
227951 LLC-stores

0.394554268 seconds time elapsed

0.370677000 seconds user
0.023914000 seconds sys

Interestingly, ~20% of the D-cache load misses become LLC loads
(i.e. L2 load misses), so a larger L2 may have an advantage. Let's
see on a Tiger Lake:

2955784200 instructions:u
864268345 L1-dcache-loads
11751007 L1-dcache-load-misses #1.36% of all L1-dcache accesses
393018 LLC-loads
13868 LLC-stores

0.306192273 seconds time elapsed

0.289893000 seconds user
0.016105000 seconds sys

So, going from 256KB L2 to 1280KB L2 reduces the number of L2 load
misses by a factor of 7 on this benchmark (well, there is TeX Live
2022 and possibly a different set of packages on the Tiger Lake
machine, so there is also the question of comparability).
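
The ratios quoted here can be rechecked from the raw counters. A small sketch, assuming (as stated above) that LLC-loads can be read as L2 load misses; the struct and function names are invented for this example, the numbers are copied from the perf output.

```c
#include <assert.h>
#include <stdint.h>

/* Counter values copied from the perf output in this post. */
struct cache_counts {
    uint64_t l1_loads;
    uint64_t l1_load_misses;
    uint64_t llc_loads;        /* read here as L2 load misses */
};

static const struct cache_counts skylake    = { 874909912ULL, 13846145ULL, 2754917ULL };
static const struct cache_counts tiger_lake = { 864268345ULL, 11751007ULL,  393018ULL };

static double ratio(uint64_t numerator, uint64_t denominator)
{
    assert(denominator != 0);
    return (double)numerator / (double)denominator;
}

/* Fraction of L1 D-cache loads that miss. */
static double l1_miss_rate(const struct cache_counts *c)
{
    return ratio(c->l1_load_misses, c->l1_loads);
}

/* Fraction of L1 load misses that also miss the L2. */
static double l2_miss_fraction(const struct cache_counts *c)
{
    return ratio(c->llc_loads, c->l1_load_misses);
}

/* How many times fewer L2 load misses `b` has than `a`. */
static double l2_miss_reduction(const struct cache_counts *a,
                                const struct cache_counts *b)
{
    return ratio(a->llc_loads, b->llc_loads);
}
```

With these inputs l1_miss_rate(&skylake) is about 1.58%, l2_miss_fraction(&skylake) about 20%, l2_miss_fraction(&tiger_lake) a bit over 3%, and l2_miss_reduction(&skylake, &tiger_lake) about 7, matching the figures in the text.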

The results I see on the Power10 machine are strange:

Performance counter stats for 'latex bench':

9622223 L1-dcache-load-misses
9847374 LLC-loads

0.476681995 seconds time elapsed

0.466409000 seconds user
0.010139000 seconds sys

Maybe prefetches are counted as LLC-loads.

For the small gforth benchmarks on Skylake:

2205651226 instructions:u
793549469 L1-dcache-loads
4115040 L1-dcache-load-misses #0.52% of all L1-dcache accesses
852553 LLC-loads
26356 LLC-stores

0.265190457 seconds time elapsed

0.260044000 seconds user
0.004036000 seconds sys

An even smaller D-cache miss rate, and a similar 20% L2 miss rate
(which constitutes an even smaller proportion of execution time).

>But if your numbers are representative, then POWER10 can be matched by
>rather ancient Intel CPUs, e.g. the 9.5-year-old i7-4790. Even my Xeon
>E3-1271 v3 would stand a chance. Or AMD Zen1.

Certainly on these benchmarks. There may be applications that benefit
from what Power10 has to offer, but my guess is that for most
applications, you are better off with a somewhat recent Intel or AMD
CPU.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Memory dependency microbenchmark

<jwvzfz8xunl.fsf-monnier+comp.arch@gnu.org>

https://news.novabbs.org/devel/article-flat.php?id=35128&group=comp.arch#35128
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: monnier@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Mon, 20 Nov 2023 11:32:00 -0500
Organization: A noiseless patient Spider
Lines: 49
Message-ID: <jwvzfz8xunl.fsf-monnier+comp.arch@gnu.org>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com>
<uiu3dp$svfh$1@dont-email.me>
<u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>
<uj39sg$1stm2$1@dont-email.me>
<jwvleaw1j9b.fsf-monnier+comp.arch@gnu.org>
<491d288a9d5a47236c979622b79db056@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Info: dont-email.me; posting-host="6e7c686f2608a5ce4c1dea596b8cea11";
logging-data="417389"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX184f/ltbhcgjvu/gVuiGf0KuinoE2rh+q8="
User-Agent: Gnus/5.13 (Gnus v5.13)
Cancel-Lock: sha1:9z4X3skiDlUcpVHKuRbcrZBemFQ=
sha1:fvpbsiO58aQX2xoLyIpQWbcUcBs=
 by: Stefan Monnier - Mon, 20 Nov 2023 16:32 UTC

MitchAlsup [2023-11-17 18:37:17] wrote:
> Stefan Monnier wrote:
>> BTW, the above sounds daunting when writing in C because you have to do
>> that analysis yourself, but there are programming languages out there
>> which will do that analysis for you as part of type checking.
[...]
> I created the Exotic Synchronization Method such that you could just
> write the code needed to do the work, and then decorate those accesses
> which are participating in the ATOMIC event.
[...]
> In order to change this into a fully qualified ATOMIC event, the code
> is decorated as::
>
> BOOLEAN MoveElement( Element *fr, Element *to )
> {
>     esmLOCK( fn = fr->next ); // get data
>     esmLOCK( fp = fr->prev );
>     esmLOCK( tn = to->next );
>     esmLOCK( fn ); // touch data
>     esmLOCK( fp );
>     esmLOCK( tn );
>     if( !esmINTERFERENCE() )
>     {
>         fp->next = fn; // move the bits around
>         fn->prev = fp;
>         to->next = fr;
>         tn->prev = fr;
>         fr->prev = to;
>         esmLOCK( fr->next = tn );
>         return TRUE;
>     }
>     return FALSE;
> }

This is nice, but the onus is still on the programmer to manually make
sure they don't forget to always `esmLOCK` all the shared data, that
they don't `esmLOCK` the data that doesn't need it, ...

In contrast, Haskell's STM will emit a type error if you ever try to use
a shared var in a non-atomic sequence (or if you try to do things like
`printf` from within an atomic sequence, ...).

I see Haskell's STM as a kind of "ideal API" for the programmer.
And it works tolerably in many cases. But reconciling this kind of
abstraction with the performance you can get by low-level twiddling is
the hard part :-)

Stefan

Re: Memory dependency microbenchmark

<nFM6N.14353$Ubzd.11432@fx36.iad>

https://news.novabbs.org/devel/article-flat.php?id=35130&group=comp.arch#35130
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx36.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Memory dependency microbenchmark
Newsgroups: comp.arch
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <uirke6$8hef$3@dont-email.me> <36011a9597060e08d46db0eddfed0976@news.novabbs.com> <uirqj3$9q9q$1@dont-email.me> <uisd4b$gd4s$1@dont-email.me> <sQt4N.2604$rx%7.497@fx47.iad> <ujfvr0$cda9$1@dont-email.me>
Lines: 26
Message-ID: <nFM6N.14353$Ubzd.11432@fx36.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Mon, 20 Nov 2023 17:26:43 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Mon, 20 Nov 2023 17:26:43 GMT
X-Received-Bytes: 1784
 by: Scott Lurndal - Mon, 20 Nov 2023 17:26 UTC

"Paul A. Clayton" <paaronclayton@gmail.com> writes:
>On 11/13/23 1:22 PM, EricP wrote:
>> Kent Dickey wrote:
>[snip]
>>> Thus, the people trapped in Relaxed Ordering Hell then push
>>> weird schemes
>>> on everyone else to try to come up with algorithms which need fewer
>>> barriers.  It's crazy.
>>>
>>> Relaxed Ordering is a mistake.
>>>
>>> Kent
>>
>> I suggest something different: the ability to switch between TSO and
>> relaxed with non-privileged user mode instructions.

>Even with a multithreaded program, stack and TLS would be "thread
>private" and not require the same consistency guarantees.

Why do you think that 'stack' would be thread private? It's
quite common to allocate long-lived data structures on the
stack and pass the address of the object to code that may
be executing in the context of other threads. So long as the lifetime of
the object extends beyond the last reference, of course.

Objects allocated on the stack in 'main', for instance.

Re: Memory dependency microbenchmark

<5yN6N.8007$DADd.5269@fx38.iad>

https://news.novabbs.org/devel/article-flat.php?id=35132&group=comp.arch#35132
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx38.iad.POSTED!not-for-mail
From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <uirke6$8hef$3@dont-email.me> <36011a9597060e08d46db0eddfed0976@news.novabbs.com> <uirqj3$9q9q$1@dont-email.me> <uisd4b$gd4s$1@dont-email.me> <sQt4N.2604$rx%7.497@fx47.iad> <ujfvr0$cda9$1@dont-email.me>
In-Reply-To: <ujfvr0$cda9$1@dont-email.me>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 92
Message-ID: <5yN6N.8007$DADd.5269@fx38.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Mon, 20 Nov 2023 18:27:13 UTC
Date: Mon, 20 Nov 2023 13:26:39 -0500
X-Received-Bytes: 4871
 by: EricP - Mon, 20 Nov 2023 18:26 UTC

Paul A. Clayton wrote:
> On 11/13/23 1:22 PM, EricP wrote:
>> Kent Dickey wrote:
> [snip]
>>> Thus, the people trapped in Relaxed Ordering Hell then push weird
>>> schemes
>>> on everyone else to try to come up with algorithms which need fewer
>>> barriers. It's crazy.
>>>
>>> Relaxed Ordering is a mistake.
>>>
>>> Kent
>>
>> I suggest something different: the ability to switch between TSO and
>> relaxed with non-privileged user mode instructions.
>>
>> Non-concurrent code does not see the relaxed ordering, and should benefit
>> from extra concurrency in the Load Store Queue and cache that relaxed
>> rules
>> allow, because the local core always sees its own memory as consistent.
>> For example, relaxed ordering allows multiple LD and ST to be in
>> multiple pipelines to multiple cache banks at once without regard
>> as to the exact order the operations are applied.
>>
>> This is fine for non concurrently accessed data structures,
>> either non-shared data areas or shared but guarded by mutexes.
>>
>> But relaxed is hard for people to reason about for concurrently accessed
>> lock free data structures. Now these don't just appear out of thin air so
>> it is reasonable for a program to emit TSO_START and TSO_END
>> instructions.
>>
>> On the other hand, almost no code is lock-free or ever will be.
>> So why have all the extra HW logic to support TSO if its only really
>> needed for this rare kind of programming.
>>
>> But there is also a category of memory area that is not covered by the
>> above rules, one where one core thinks its memory is local and not shared
>> but in fact it is being accessed concurrently.
>>
>> If thread T1 (say an app) on core C1 says its memory is relaxed, and
>> calls
>> a subroutine passing a pointer to a value on T1's stack, and that pointer
>> is passed to thread T2 (a driver) on core C2 which accesses that memory,
>> then even if T2 declared itself to be using TSO rules it would not force
>> T1 on C1 obey them.
>>
>> Where this approach could fail is the kind of laissez-faire sharing done
>> by many apps, libraries, and OS's behind the scenes in the real world.
>
> Another possibility is for non-shared memory to be handled
> differently. (This is similar to My 66000's handling of memory
> types and things mentioned by Mitch Alsup here.)
>
> Even with a multithreaded program, stack and TLS would be "thread
> private" and not require the same consistency guarantees.
>
> Various memory partitioning schemes theoretically can provide
> similar benefits for systems with shared memory controllers where
> programs do not share most modifiable data with other programs.
> Even something like a web hosting system might be able to benefit
> from lack of coherence (much less consistency) between different
> web hosts.

But this is my point: in many programs there is no memory that
you can point to and say it is always private to a single thread.
And this is independent of language; it has to do with program structure.

You can say a certain memory range is shared and guarded by locks,
or shared and managed by lock-free code.
And we can point to this because the code is modularized this way.

But the opposite of 'definitely shared' is not 'definitely private',
it's 'don't know' or 'sometimes'.

E.g.: your code pops up a dialog box, prompts the user for an integer value,
and writes that value to a program global variable.
Do you know whether that dialog box is a separate thread?
Should you care? What if the dialog starts out in the same thread,
then a new release changes it to a separate thread?
What if the variable is on the thread stack or in a heap?

On a weakly ordered system I definitely need to know because
I need barriers to ensure I can read the variable properly.

And this is where TSO makes the difference.
I don't have to know exactly every byte that might be updated concurrently
in some context at some time (and that context can change dynamically).
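
To make the dialog-box case concrete, here is a minimal sketch in portable C11 of what "knowing" costs on a weakly ordered machine: the writer has to publish with a release store and the reader has to check with an acquire load. The mailbox type and function names are hypothetical; the mailbox may live in a global, on a heap, or on some thread's stack.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mailbox: `value` is ordinary data, `ready` publishes it. */
struct mailbox {
    int value;
    atomic_bool ready;
};

/* Writer side (for example the dialog code, possibly another thread). */
static void mailbox_publish(struct mailbox *m, int v)
{
    m->value = v;
    /* Release: the store to value becomes visible before ready does. */
    atomic_store_explicit(&m->ready, true, memory_order_release);
}

/* Reader side: returns true and fills *out once a value was published. */
static bool mailbox_try_read(struct mailbox *m, int *out)
{
    assert(out != NULL);
    /* Acquire: the load of value cannot be reordered before this load. */
    if (!atomic_load_explicit(&m->ready, memory_order_acquire))
        return false;
    *out = m->value;
    return true;
}
```

On a TSO machine such as x86 the release store and acquire load compile to plain stores and loads, so code written without them tends to work anyway (as far as the hardware is concerned; the compiler is a separate matter). On weakly ordered machines they become barriers or special load/store instructions, and leaving them out is a real bug.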

Re: weak consistency and the supercomputer attitude

<a06b915a631d2adcf4f4d57440e6577b@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=35133&group=comp.arch#35133
Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: weak consistency and the supercomputer attitude
Date: Mon, 20 Nov 2023 18:51:49 +0000
Organization: novaBBS
Message-ID: <a06b915a631d2adcf4f4d57440e6577b@news.novabbs.com>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <uimulp$38v2q$1@dont-email.me> <79bcf6d10b3c231ba7e3693bad8db3c1@news.novabbs.com> <uiri0a$85mp$2@dont-email.me> <uirjr7$8dlh$2@dont-email.me> <uirke6$8hef$3@dont-email.me> <2023Nov13.084835@mips.complang.tuwien.ac.at> <ujfv8l$ca4g$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="1567880"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
X-Rslight-Site: $2y$10$Dq/nyvU.iR6AUyLMsRI.bOpQGBUUAfz5HldoTwEACHHxVSbkItcVy
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
 by: MitchAlsup - Mon, 20 Nov 2023 18:51 UTC

Paul A. Clayton wrote:

> [snip]
>> But if you look at it from an architecture (i.e., hardware/software
>> interface) perspective, weak consistency is just bad architecture:
>> good architecture says what happens to the architectural state when
>> software performs some instruction. From that perspective sequential
>> consistency is architecturally best. Weaker consistency models
>> describe how the architecture does not provide the sequential
>> consistency guarantees that are so easy to describe; the weaker the
>> model, the more deviations it has to describe.

> I am not convinced that sequential consistency is the best
> interface. My66000 does not provide sequential consistency for
> ordinary memory. While Mitch Alsup would have difficulty
> empathizing with most programmers, he has enough experience to
> write specifications for "hostile" engineers so he probably
> understands the tradeoffs on both sides of the interface fairly
> well.

All accesses being universally sequentially consistent is way too
much ordering; however, the ability to detect the start and end of
ATOMIC events and switch to SC gives the programmer all the
order he needs without constraining the non-concurrent memory
at all.

Over at config-space control registers--these need more than TSO or SC,
these need strong ordering.

On the other hand true ROM needs no ordering whatsoever--so why
impose any ??

One size does not fit all !!


Re: weak consistency and the supercomputer attitude

<ujgfie$evq1$2@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35134&group=comp.arch#35134
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: weak consistency and the supercomputer attitude
Date: Mon, 20 Nov 2023 12:29:01 -0800
Organization: A noiseless patient Spider
Lines: 41
Message-ID: <ujgfie$evq1$2@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<uimulp$38v2q$1@dont-email.me>
<79bcf6d10b3c231ba7e3693bad8db3c1@news.novabbs.com>
<uiri0a$85mp$2@dont-email.me> <uirjr7$8dlh$2@dont-email.me>
<uirke6$8hef$3@dont-email.me> <2023Nov13.084835@mips.complang.tuwien.ac.at>
<ujfv8l$ca4g$1@dont-email.me>
<a06b915a631d2adcf4f4d57440e6577b@news.novabbs.com>
In-Reply-To: <a06b915a631d2adcf4f4d57440e6577b@news.novabbs.com>
 by: Chris M. Thomasson - Mon, 20 Nov 2023 20:29 UTC

On 11/20/2023 10:51 AM, MitchAlsup wrote:
> Paul A. Clayton wrote:
>
>>
>
>> [snip]
>>> But if you look at it from an architecture (i.e., hardware/software
>>> interface) perspective, weak consistency is just bad architecture:
>>> good architecture says what happens to the architectural state when
>>> software performs some instruction.  From that perspective sequential
>>> consistency is architecturally best.  Weaker consistency models
>>> describe how the architecture does not provide the sequential
>>> consistency guarantees that are so easy to describe; the weaker the
>>> model, the more deviations it has to describe.
>
>> I am not convinced that sequential consistency is the best
>> interface. My66000 does not provide sequential consistency for
>> ordinary memory. While Mitch Alsup would have difficulty
>> empathizing with most programmers, he has enough experience to
>> write specifications for "hostile" engineers so he probably
>> understands the tradeoffs on both sides of the interface fairly
>> well.
>
> All accesses being universally sequentially consistent is way too
> much ordering, however, the ability to detect the start-end of
> ATOMIC events and switching to SC gives the programmer all the
> order he needs without constraining the non-concurrent memory
> at all.
>
> Over at config-space control registers--these need more than TSO or SC,
> these need strong ordering.
>
> On the other hand true ROM needs no ordering whatsoever--so why
> impose any ??
>
> One size does not fit all !!
>
>>

Fwiw, I remember posting an idea for so-called tagged memory barriers on
this group some years ago. I need to try to dig it up.

Re: Memory dependency microbenchmark

<141f558462e91c6b25336c5824ddf734@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=35139&group=comp.arch#35139

From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Tue, 21 Nov 2023 00:52:08 +0000
Organization: novaBBS
Message-ID: <141f558462e91c6b25336c5824ddf734@news.novabbs.com>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <uirke6$8hef$3@dont-email.me> <36011a9597060e08d46db0eddfed0976@news.novabbs.com> <uirqj3$9q9q$1@dont-email.me> <uisd4b$gd4s$1@dont-email.me> <sQt4N.2604$rx%7.497@fx47.iad> <ujfvr0$cda9$1@dont-email.me> <nFM6N.14353$Ubzd.11432@fx36.iad>
 by: MitchAlsup - Tue, 21 Nov 2023 00:52 UTC

Scott Lurndal wrote:

> "Paul A. Clayton" <paaronclayton@gmail.com> writes:
>>On 11/13/23 1:22 PM, EricP wrote:
>>> Kent Dickey wrote:
>>[snip]
>>>> Thus, the people trapped in Relaxed Ordering Hell then push
>>>> weird schemes
>>>> on everyone else to try to come up with algorithms which need fewer
>>>> barriers.  It's crazy.
>>>>
>>>> Relaxed Ordering is a mistake.
>>>>
>>>> Kent
>>>
>>> I suggest something different: the ability to switch between TSO and
>>> relaxed with non-privileged user mode instructions.

>>Even with a multithreaded program, stack and TLS would be "thread
>>private" and not require the same consistency guarantees.

> Why do you think that 'stack' would be thread private? It's
> quite common to allocate long-lived data structures on the
> stack and pass the address of the object to code that may
> be executing in the context of other threads. So long as the lifetime of
> the object extends beyond the last reference, of course.

Even Thread Local Store is not private to the thread if the thread
creates a pointer into it and allows others to see the pointer.

The only thing the HW can validate as non-shared is that portion of
the stack containing callee save registers (and the return address),
but only 2 known architectures keep these chunks of memory in an
address space where threads cannot read, write, or execute that chunk.

> Objects allocated on the stack in 'main', for instance.
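
A minimal sketch of the quoted point about stack objects, in Rust with scoped threads. The function name `sum_into_stack_counter` and its parameters are hypothetical. The counter is a local variable in the caller's frame, yet every worker thread reads and writes it, and the scope guarantees the object outlives the last reference.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Each of `workers` threads adds 1..=n_per_worker into a counter that is
/// a local variable of this function, so it lives in the caller's frame.
fn sum_into_stack_counter(workers: u64, n_per_worker: u64) -> u64 {
    let total = AtomicU64::new(0); // local to this frame, not heap-allocated
    thread::scope(|s| {
        for _ in 0..workers {
            // The closure borrows `total`: other threads now hold a
            // reference into this thread's stack.
            s.spawn(|| {
                for i in 1..=n_per_worker {
                    total.fetch_add(i, Ordering::Relaxed);
                }
            });
        }
    }); // all workers are joined here, before `total` goes out of scope
    total.load(Ordering::Relaxed)
}
```

For example, `sum_into_stack_counter(4, 100)` yields 20200 (four times 5050). Hardware looking only at the address cannot tell this frame apart from one that really is private.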

Re: Memory dependency microbenchmark

<NbacnZYbq7jFKcH4nZ2dnZfqn_idnZ2d@supernews.com>

https://news.novabbs.org/devel/article-flat.php?id=35140&group=comp.arch#35140

NNTP-Posting-Date: Tue, 21 Nov 2023 13:44:56 +0000
Sender: Andrew Haley <aph@zarquon.pink>
From: aph@littlepinkcloud.invalid
Subject: Re: Memory dependency microbenchmark
Newsgroups: comp.arch
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <uirke6$8hef$3@dont-email.me> <36011a9597060e08d46db0eddfed0976@news.novabbs.com> <uirqj3$9q9q$1@dont-email.me> <uisd4b$gd4s$1@dont-email.me> <sQt4N.2604$rx%7.497@fx47.iad> <ujfvr0$cda9$1@dont-email.me> <5yN6N.8007$DADd.5269@fx38.iad>
User-Agent: tin/1.9.2-20070201 ("Dalaruan") (UNIX) (Linux/4.18.0-477.27.1.el8_8.x86_64 (x86_64))
Message-ID: <NbacnZYbq7jFKcH4nZ2dnZfqn_idnZ2d@supernews.com>
Date: Tue, 21 Nov 2023 13:44:56 +0000
Lines: 36
 by: aph@littlepinkcloud.invalid - Tue, 21 Nov 2023 13:44 UTC

EricP <ThatWouldBeTelling@thevillage.com> wrote:
>
> But this is my point: in many programs there is no memory that
> you can point to and say it is always private to a single thread.

That's more about program design. In a multi-threaded program this is
something you really should know.

> And this is independent of language, it's to do with program structure.
>
> You can say a certain memory range is shared and guarded by locks,
> or shared and managed by lock-free code.
> And we can point to this because the code is modularized this way.
>
> But the opposite of 'definitely shared' is not 'definitely private',
> it's 'don't know' or 'sometimes'.
>
> Eg: Your code pops up a dialog box, prompts the user for an integer value,
> and writes that value to a program global variable.
> Do you know whether that dialog box is a separate thread?
> Should you care? What if the dialog starts out in the same thread,
> then a new release changes it to a separate thread.
> What if the variable is on the thread stack or in a heap?
>
> On a weak ordered system I definitely need to know because
> I need barriers to ensure I can read the variable properly.

In this case, no, I don't think you do. Barriers only control the
ordering between accesses, not when they become visible, and here
there's only one access. If there are at least two, and you really
need to see one before the other, then you need a barrier. And even on
a TSO machine, you're going to have to do something on both the reader
and the writer sides if you need ordering to be protected from a
compiler.
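
A minimal sketch of the two-access case, assuming Rust's release/acquire atomics as the "something on both sides" (the function name `message_passing` is made up for illustration). The writer must make `data` visible before `ready`, and the reader must not read `data` before it has seen `ready`. The same annotations also stop the compiler from reordering the accesses, which is why they are still needed on a TSO machine.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::thread;

/// Passes `value` from a writer thread to a reader thread through a
/// data word plus a ready flag, and returns what the reader saw.
fn message_passing(value: u32) -> u32 {
    let data = AtomicU32::new(0);
    let ready = AtomicBool::new(false);
    thread::scope(|s| {
        s.spawn(|| {
            data.store(value, Ordering::Relaxed); // first access
            ready.store(true, Ordering::Release); // ordered after the data store
        });
        let reader = s.spawn(|| {
            // Acquire: the data load below cannot move ahead of this load.
            while !ready.load(Ordering::Acquire) {
                std::hint::spin_loop();
            }
            data.load(Ordering::Relaxed) // second access
        });
        reader.join().unwrap()
    })
}
```

With the release/acquire pair in place, `message_passing(42)` returns 42. If both flag accesses were `Relaxed`, the language model would allow the reader to see `ready == true` and still read the old `data`.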

Andrew.

Re: Memory dependency microbenchmark

<M937N.14399$cnze.2278@fx35.iad>

https://news.novabbs.org/devel/article-flat.php?id=35141&group=comp.arch#35141

From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <uirke6$8hef$3@dont-email.me> <36011a9597060e08d46db0eddfed0976@news.novabbs.com> <uirqj3$9q9q$1@dont-email.me> <uisd4b$gd4s$1@dont-email.me> <sQt4N.2604$rx%7.497@fx47.iad> <ujfvr0$cda9$1@dont-email.me> <nFM6N.14353$Ubzd.11432@fx36.iad> <141f558462e91c6b25336c5824ddf734@news.novabbs.com>
In-Reply-To: <141f558462e91c6b25336c5824ddf734@news.novabbs.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 44
Message-ID: <M937N.14399$cnze.2278@fx35.iad>
Date: Tue, 21 Nov 2023 09:29:43 -0500
 by: EricP - Tue, 21 Nov 2023 14:29 UTC

MitchAlsup wrote:
> Scott Lurndal wrote:
>
>> "Paul A. Clayton" <paaronclayton@gmail.com> writes:
>>> On 11/13/23 1:22 PM, EricP wrote:
>>>> Kent Dickey wrote:
>>> [snip]
>>>>> Thus, the people trapped in Relaxed Ordering Hell then push weird
>>>>> schemes
>>>>> on everyone else to try to come up with algorithms which need fewer
>>>>> barriers. It's crazy.
>>>>>
>>>>> Relaxed Ordering is a mistake.
>>>>>
>>>>> Kent
>>>>
>>>> I suggest something different: the ability to switch between TSO and
>>>> relaxed with non-privileged user mode instructions.
>
>>> Even with a multithreaded program, stack and TLS would be "thread
>>> private" and not require the same consistency guarantees.
>
>> Why do you think that 'stack' would be thread private? It's
>> quite common to allocate long-lived data structures on the
>> stack and pass the address of the object to code that may
>> be executing in the context of other threads. So long as the
>> lifetime of
>> the object extends beyond the last reference, of course.
>
> Even Thread Local Store is not private to the thread if the thread
> creates a pointer into it and allows others to see the pointer.
>
> The only thing the HW can validate as non-shared is that portion of
> the stack containing callee save registers (and the return address),
> but only 2 known architectures keep these chunks of memory in an
> address space where threads cannot read, write, or execute that chunk.

The callee save area may be R-W-E page protected against its own thread,
but that doesn't prevent a privileged thread from concurrently accessing
that save area (say, to edit the stack to deliver a signal),
so the same coherence applies there too.

Re: Memory dependency microbenchmark

<Fx57N.20583$BSkc.9831@fx06.iad>

https://news.novabbs.org/devel/article-flat.php?id=35142&group=comp.arch#35142

From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <uirke6$8hef$3@dont-email.me> <36011a9597060e08d46db0eddfed0976@news.novabbs.com> <uirqj3$9q9q$1@dont-email.me> <uisd4b$gd4s$1@dont-email.me> <sQt4N.2604$rx%7.497@fx47.iad> <ujfvr0$cda9$1@dont-email.me> <5yN6N.8007$DADd.5269@fx38.iad> <NbacnZYbq7jFKcH4nZ2dnZfqn_idnZ2d@supernews.com>
In-Reply-To: <NbacnZYbq7jFKcH4nZ2dnZfqn_idnZ2d@supernews.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 64
Message-ID: <Fx57N.20583$BSkc.9831@fx06.iad>
Date: Tue, 21 Nov 2023 12:11:00 -0500
 by: EricP - Tue, 21 Nov 2023 17:11 UTC

aph@littlepinkcloud.invalid wrote:
> EricP <ThatWouldBeTelling@thevillage.com> wrote:
>> But this is my point: in many programs there is no memory that
>> you can point to and say it is always private to a single thread.
>
> That's more about program design. In a multi-threaded program this is
> something you really should know.
>
>> And this is independent of language, it's to do with program structure.
>>
>> You can say a certain memory range is shared and guarded by locks,
>> or shared and managed by lock-free code.
>> And we can point to this because the code is modularized this way.
>>
>> But the opposite of 'definitely shared' is not 'definitely private',
>> it's 'don't know' or 'sometimes'.
>>
>> Eg: Your code pops up a dialog box, prompts the user for an integer value,
>> and writes that value to a program global variable.
>> Do you know whether that dialog box is a separate thread?
>> Should you care? What if the dialog starts out in the same thread,
>> then a new release changes it to a separate thread.
>> What if the variable is on the thread stack or in a heap?
>>
>> On a weak ordered system I definitely need to know because
>> I need barriers to ensure I can read the variable properly.
>
> In this case, no, I don't think you do. Barriers only control the
> ordering between accesses, not when they become visible, and here
> there's only one access. If there are at least two, and you really
> need to see one before the other, then you need a barrier.

The barriers also ensure the various local buffers, pipelines and
inbound and outbound comms command and reply message queues are drained.
A barrier ensures that the operations that came before it have reached
their coherency point - the cache controller - and that any
outstanding asynchronous operations are complete.
And that in turn controls when values become visible.

On a weak order cpu with no store ordering, the cpu is not required
to propagate any store into the cache within any period of time.
It can stash it in a write combine buffer waiting to see if more
updates to the same line appear.

Weak order requires a membar after a store to force it into the cache,
triggering the coherence handshake which invalidates other copies,
so that when remote cores reread a line they see the updated value.

In other words, to retire the membar instruction the core must force the
prior store values into the coherent cache making them globally visible.

The difference for TSO is that a store has implied membars before it to
prevent it bypassing (executing before) older loads and stores.
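
The single-variable case under discussion can be written down as a minimal sketch in Rust. The name `publish_single_value` and the `fence_after_store` switch are made up for illustration, and `fence(Ordering::SeqCst)` is only assumed to stand in for a hardware membar. The sketch shows the "membar after a store" pattern next to the version without it. It cannot measure whether the fence shortens the time until the store becomes visible, since that depends on the hardware.

```rust
use std::sync::atomic::{fence, AtomicU32, Ordering};
use std::thread;

/// Writer stores `value` once; reader polls until it sees a non-zero value.
/// `fence_after_store` selects whether the writer issues a full fence
/// right after the store.
fn publish_single_value(value: u32, fence_after_store: bool) -> u32 {
    assert!(value != 0, "0 is reserved to mean 'not written yet'");
    let slot = AtomicU32::new(0);
    thread::scope(|s| {
        s.spawn(|| {
            slot.store(value, Ordering::Relaxed);
            if fence_after_store {
                // Full fence after the store (the membar in the text above).
                fence(Ordering::SeqCst);
            }
        });
        let reader = s.spawn(|| loop {
            let seen = slot.load(Ordering::Relaxed);
            if seen != 0 {
                break seen;
            }
            std::hint::spin_loop();
        });
        reader.join().unwrap()
    })
}
```

Both `publish_single_value(7, true)` and `publish_single_value(7, false)` return 7. The language-level model only promises that the store becomes visible eventually; how quickly a given core drains its store or write-combine buffers is outside what this sketch can demonstrate.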

> And even on
> a TSO machine, you're going to have to do something on both the reader
> and the writer sides if you need ordering to be protected from a
> compiler.
>
> Andrew.

Compilers are a different discussion.

