

Re: Memory dependency microbenchmark

https://news.novabbs.org/devel/article-flat.php?id=35024&group=comp.arch#35024

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Wed, 15 Nov 2023 13:27:59 -0800
Organization: A noiseless patient Spider
Lines: 126
Message-ID: <uj3d50$1tb8u$2@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<82b3b3b710652e607dac6cec2064c90b@news.novabbs.com>
<uisdmn$gd4s$2@dont-email.me> <uiu4t5$t4c2$2@dont-email.me>
<uj3c29$1t9an$1@dont-email.me> <uj3d0a$1tb8u$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 15 Nov 2023 21:28:00 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="c1eb1e7d2816baa503f549a694bca0e3";
logging-data="2010398"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+k57iSATowLAuYL4yYG1qw8AGwZJtuELk="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:sEGvWDlKxu8LMNBHscH+SCLmyFg=
Content-Language: en-US
In-Reply-To: <uj3d0a$1tb8u$1@dont-email.me>
 by: Chris M. Thomasson - Wed, 15 Nov 2023 21:27 UTC

On 11/15/2023 1:25 PM, Chris M. Thomasson wrote:
> On 11/15/2023 1:09 PM, Kent Dickey wrote:
>> In article <uiu4t5$t4c2$2@dont-email.me>,
>> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>> On 11/12/2023 9:54 PM, Kent Dickey wrote:
>>>> In article <82b3b3b710652e607dac6cec2064c90b@news.novabbs.com>,
>>>> MitchAlsup <mitchalsup@aol.com> wrote:
>>>>> Kent Dickey wrote:
>>>>>
>>>>>> In article <uiri0a$85mp$2@dont-email.me>,
>>>>>> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>>>>
>>>>>>> A highly relaxed memory model can be beneficial for certain
>>>>>>> workloads.
>>>>>
>>>>>> I know a lot of people believe that statement to be true.  In
>>>>>> general, it
>>>>>> is assumed to be true without proof.
>>>>> <
>>>>> In its most general case, relaxed order only provides a performance
>>>>> advantage
>>>>> when the code is single threaded.
>>>>
>>>> I believe a Relaxed Memory model provides a small performance
>>>> improvement
>>>> ONLY to simple in-order CPUs in an MP system (if you're a single CPU,
>>>> there's nothing to order).
>>>>
>>>> Relaxed Memory ordering provides approximately zero performance
>>>> improvement
>>>> to an OoO CPU, and in fact, might actually lower performance
>>>> (depends on
>>>> how barriers are done--if done poorly, it could be a big negative).
>>>>
>>>> Yes, the system designers of the world have said: let's slow down our
>>>> fastest most expensive most profitable CPUs, so we can speed up our
>>>> cheapest
>>>> lowest profit CPUs a few percent, and push a ton of work onto software
>>>> developers.
>>>>
>>>> It's crazy.
>>>>
>>>>>> I believe that statement to be false.  Can you describe some of these
>>>>>> workloads?
>>>>> <
>>>>> Relaxed memory order fails spectacularly when multiple threads are
>>>>> accessing
>>>>> data.
>>>>
>>>> Probably need to clarify with "accessing modified data".
>>>>
>>>> Kent
>>>
>>> Huh? So, C++ is crazy for allowing std::memory_order_relaxed to even
>>> exist? I must be misunderstanding your point here. Sorry if I am. ;^o
>>
>> You have internalized weakly ordered memory, and you're having trouble
>> seeing beyond it.
>
> Really? Don't project yourself on me. Altering all of the memory
> barriers of a finely tuned lock-free algorithm to seq_cst is VERY bad.
>
>
>>
>> CPUs with weakly ordered memory are the ones that need all those flags.
>> Yes, you need the flags if you want to use those CPUs.  I'm pointing out:
>> we could all just require better memory ordering and get rid of all this
>> cruft.  Give the flag, don't give the flag, the program is still correct
>> and works properly.
>
> Huh? Just cruft? Wow. Just because it seems hard for you does not mean
> we should eliminate it. Believe it or not, there are people out there
> that know how to use memory barriers. I suppose you would use seq_cst to
> load each node of a lock-free stack iteration in an RCU read-side region.
> This is terrible! Really bad, bad, BAD! Afaict, it kinda, sorta seems
> like you do not have all that much experience with them. Humm...
>
>
>>
>> It's like FP denorms--it's generally been decided the hardware cost
>> to implement it is small, so hardware needs to support it at full speed.
>> No need to write code in a careful way to avoid denorms, to use funky
>> CPU-
>> specific calls to turn on flush-to-0, etc., it just works, we move on to
>> other topics.  But we still have flush-to-0 calls available--but you
>> don't
>> need to bother to use them.  In my opinion, memory ordering is much more
>> complex for programmers to handle.  I maintain it's actually so
>> complex most people cannot get it right in software for non-trivial
>> interactions.  I've found many hardware designers have a very hard time
>> reasoning about this as well when I report bugs (since the rules are so
>> complex and poorly described).  There are over 100 pages describing
>> memory
>> ordering in the Arm Architecture Reference Manual, and it is very
>> complex (Dependency through registers and memory; Basic Dependency;
>> Address Dependency; Data Dependency; Control Dependency; Pick Basic
>> dependency; Pick Address Dependency; Pick Data Dependency; Pick
>> Control Dependency, Pick Dependency...and this is just from the
>> definition
>> of terms).  It's all very abstract and difficult to follow.  I'll be
>> honest: I do not understand all of these rules, and I don't care to.
>> I know how to implement a CPU, so I know what they've done, and that's
>> much simpler to understand.  But writing a threaded application is much
>> more complex than it should be for software.
>>
>> The cost to do TSO is some out-of-order tracking structures need to get
>> a little bigger, and some instructions have to stay in queues longer
>> (which is why they may need to get bigger), and allow re-issuing loads
>> which now have stale data.  The difference between TSO and Sequential
>> Consistency is to just disallow loads seeing stores queued before they
>> write to the data cache (well, you can speculatively let loads happen,
>> but you need to be able to walk it back, which is not difficult).  This
>> is why I say the performance cost is low--normal code missing caches and
>> not being pestered by other CPUs can run at the same speed.  But when
>> other CPUs begin pestering us, the interference can all be worked out as
>> efficiently as possible using hardware, and barriers just do not
>> compete.
>
> Having access to fine grain memory barriers is a very good thing. Of
> course we can use C++ right now and make everything seq_cst, but that is
> moronic. Why would you want to use seq_cst everywhere when you do not
> have to? There are rather massive performance implications.
>
> Are you thinking about a magic arch that we cannot use right now?

https://youtu.be/DZJPqTTt7MA

Re: Memory dependency microbenchmark

https://news.novabbs.org/devel/article-flat.php?id=35025&group=comp.arch#35025

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Wed, 15 Nov 2023 13:35:41 -0800
Organization: A noiseless patient Spider
Lines: 104
Message-ID: <uj3djd$1tb8u$4@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com>
<uiu3dp$svfh$1@dont-email.me>
<u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>
<uj39sg$1stm2$1@dont-email.me>
<394f2ed4be2eeb98d7d6c24735182b86@news.novabbs.com>
<uj3bgm$1t58r$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 15 Nov 2023 21:35:41 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="c1eb1e7d2816baa503f549a694bca0e3";
logging-data="2010398"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/hTHKfpgBkNtE8Z+OZNWLarCA8X1hVOfc="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:choJ90QarGjlXLD1JI9VfAyCLgc=
Content-Language: en-US
In-Reply-To: <uj3bgm$1t58r$1@dont-email.me>
 by: Chris M. Thomasson - Wed, 15 Nov 2023 21:35 UTC

On 11/15/2023 1:00 PM, Chris M. Thomasson wrote:
> On 11/15/2023 12:56 PM, MitchAlsup wrote:
>> Kent Dickey wrote:
>>
>>> In article <u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>,
>>>  <aph@littlepinkcloud.invalid> wrote:
>>>> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>>>> On 11/13/2023 3:49 AM, aph@littlepinkcloud.invalid wrote:
>>>>>>
>>>>>> I don't think that's really true. The reorderings we see in
>>>>>> currently-
>>>>>> produced hardware are, more or less, a subset of the same reorderings
>>>>>> that C compilers perform. Therefore, if you see a confusing hardware
>>>>>> reordering in a multi-threaded C program it may well be (probably
>>>>>> is!)
>>>>>> a bug (according to the C standard) *even on a TSO machine*. The only
>>>>>> common counter-example to this is for volatile accesses.
>>>>>
>>>>> Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
>>>>
>>>> Maybe I wasn't clear enough. If you use std::atomic and
>>>> std::memory_order_* in such a way that there are no data races, your
>>>> concurrent program will be fine on both TSO and relaxed memory
>>>> ordering. If you try to fix data races with volatile instead of
>>>> std::atomic and std::memory_order_*, that'll mostly fix things on a
>>>> TSO machine, but not on a machine with relaxed memory ordering.
>>
>>> What you are saying is:
>>
>>> As long as you fully analyze your program, ensure all multithreaded
>>> accesses
>>> are only through atomic variables, and you label every access to an
>>> atomic variable properly (although my point is: exactly what should that
>>> be??), then there is no problem.
>>
>>> If you assume you've already solved the problem, then you find the
>>> problem is solved!  Magic!
>>
>>> What I'm arguing is: the CPU should behave as if memory_order_seq_cst
>>> is set on all accesses with no special trickery.
>> <
>> Should appear as if.....not behave as if.
>> <
>>>                                                  This acquire/release
>>> nonsense is all weakly ordered brain damage.
>> <
>> Agreed ...
>> <
>
> Why do you "seem" to think it's brain damage? Knowing how to use them
> properly is a good thing.
>
>
>
>>>                                               The problem is on weakly
>>> ordered CPUs, performance definitely does matter in terms of getting
>>> this
>>> stuff right, but that's their problem.
>> <
>> Not necessarily !! If you have a weakly ordered machine and start doing
>> something ATOMIC, the processor can switch to sequential consistency upon
>> the detection of the ATOMIC event starting (LL for example) stay SC until
>> the ATOMIC event is done, then revert back to weakly ordered as it
>> pleases.
>> The single threaded code gets its performance while the multithreaded code
>> gets its SC (as Lamport demonstrated was necessary).
>> <
>>>                                         Being weakly ordered makes them
>>> slower when they have to execute barriers for correctness, but it's the
>>> barriers themselves that are the slow down, not ordering the requests
>>> properly.

Huh? Avoiding as many memory barriers as possible (aka... ideally
relaxed) is key to making sync algorithms faster! Or using an acquire
instead of seq_cst is great. Only use seq_cst when you absolutely have to.
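
A minimal sketch of the "acquire instead of seq_cst" point, assuming a single producer publishing one plain value to a single consumer. The names payload, ready and publish_and_consume are hypothetical and exist only for this illustration. The release store paired with the acquire load is enough to make the plain write visible, so no seq_cst operation is needed for this pattern:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names, used only for this illustration.
int payload = 0;                  // plain, non-atomic data
std::atomic<bool> ready{false};   // flag that publishes the payload

// Runs one producer and one consumer thread and returns the value
// the consumer observed after seeing the flag.
int publish_and_consume(int value)
{
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;

    std::thread consumer([&] {
        // The acquire load pairs with the producer's release store.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;  // ordered after the producer's write to payload
    });

    std::thread producer([&] {
        payload = value;                               // plain write
        ready.store(true, std::memory_order_release);  // publish it
    });

    producer.join();
    consumer.join();
    assert(seen == value);
    return seen;
}
```

Calling publish_and_consume(42) should hand back 42. Algorithms that need a single total order across several independent flags are a different case, and that is where seq_cst earns its cost.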

>> <
>> But You see, doing it my way gets rid of the MemBar instructions but not
>> their necessary effects. In addition, in my model, every access within
>> an ATOMIC event is SC not just a MemBar at the front and end.
>> <
>>> If the CPU takes ownership of ordering, then the only rule is: you just
>>> have to use atomic properly (even then, you can often get away with
>>> volatile for most producer/consumer cases), and these subtypes for all
>>> accesses don't matter for correctness or performance.
>> <
>> But a CPU is not in a position to determine Memory Order, the
>> (multiplicity of) memory controllers do, the CPUs just have to figure
>> out how to put up with
>> the imposed order based on the kind of things the CPU is doing at that
>> instant.
>> <
>>> It also would be nice if multithreaded programs could be written in C99,
>>> or pre-C++11.  It's kinda surprising we've only been able to write
>>> threaded
>>> programs for about 10 years.
>> <
>> I wrote multithreaded programs on an 8085.....
>> <
>>> Kent
>

Re: Memory dependency microbenchmark

https://news.novabbs.org/devel/article-flat.php?id=35026&group=comp.arch#35026

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Wed, 15 Nov 2023 13:41:06 -0800
Organization: A noiseless patient Spider
Lines: 138
Message-ID: <uj3dti$1tb8u$5@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com>
<uiu3dp$svfh$1@dont-email.me>
<u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>
<uj39sg$1stm2$1@dont-email.me>
<394f2ed4be2eeb98d7d6c24735182b86@news.novabbs.com>
<uj3bgm$1t58r$1@dont-email.me> <uj3djd$1tb8u$4@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 15 Nov 2023 21:41:06 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="c1eb1e7d2816baa503f549a694bca0e3";
logging-data="2010398"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX199ok2KOHA4iU7XMPvK/VGvnU2Mi4Ya/mQ="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:KZj1f29o3uCiuh1DicBzUE3Y42E=
In-Reply-To: <uj3djd$1tb8u$4@dont-email.me>
Content-Language: en-US
 by: Chris M. Thomasson - Wed, 15 Nov 2023 21:41 UTC

On 11/15/2023 1:35 PM, Chris M. Thomasson wrote:
> On 11/15/2023 1:00 PM, Chris M. Thomasson wrote:
>> On 11/15/2023 12:56 PM, MitchAlsup wrote:
>>> Kent Dickey wrote:
>>>
>>>> In article <u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>,
>>>>  <aph@littlepinkcloud.invalid> wrote:
>>>>> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>>>>> On 11/13/2023 3:49 AM, aph@littlepinkcloud.invalid wrote:
>>>>>>>
>>>>>>> I don't think that's really true. The reorderings we see in
>>>>>>> currently-
>>>>>>> produced hardware are, more or less, a subset of the same
>>>>>>> reorderings
>>>>>>> that C compilers perform. Therefore, if you see a confusing hardware
>>>>>>> reordering in a multi-threaded C program it may well be (probably
>>>>>>> is!)
>>>>>>> a bug (according to the C standard) *even on a TSO machine*. The
>>>>>>> only
>>>>>>> common counter-example to this is for volatile accesses.
>>>>>>
>>>>>> Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
>>>>>
>>>>> Maybe I wasn't clear enough. If you use std::atomic and
>>>>> std::memory_order_* in such a way that there are no data races, your
>>>>> concurrent program will be fine on both TSO and relaxed memory
>>>>> ordering. If you try to fix data races with volatile instead of
>>>>> std::atomic and std::memory_order_*, that'll mostly fix things on a
>>>>> TSO machine, but not on a machine with relaxed memory ordering.
>>>
>>>> What you are saying is:
>>>
>>>> As long as you fully analyze your program, ensure all multithreaded
>>>> accesses
>>>> are only through atomic variables, and you label every access to an
>>>> atomic variable properly (although my point is: exactly what should
>>>> that
>>>> be??), then there is no problem.
>>>
>>>> If you assume you've already solved the problem, then you find the
>>>> problem is solved!  Magic!
>>>
>>>> What I'm arguing is: the CPU should behave as if memory_order_seq_cst
>>>> is set on all accesses with no special trickery.
>>> <
>>> Should appear as if.....not behave as if.
>>> <
>>>>                                                  This acquire/release
>>>> nonsense is all weakly ordered brain damage.
>>> <
>>> Agreed ...
>>> <
>>
>> Why do you "seem" to think it's brain damage? Knowing how to use them
>> properly is a good thing.
>>
>>
>>
>>>>                                               The problem is on weakly
>>>> ordered CPUs, performance definitely does matter in terms of getting
>>>> this
>>>> stuff right, but that's their problem.
>>> <
>>> Not necessarily !! If you have a weakly ordered machine and start doing
>>> something ATOMIC, the processor can switch to sequential consistency
>>> upon
>>> the detection of the ATOMIC event starting (LL for example) stay SC
>>> until
>>> the ATOMIC event is done, then revert back to weakly ordered as it
>>> pleases.
>>> The single threaded code gets its performance while the multithreaded
>>> code
>>> gets its SC (as Lamport demonstrated was necessary).
>>> <
>>>>                                         Being weakly ordered makes them
>>>> slower when they have to execute barriers for correctness, but it's the
>>>> barriers themselves that are the slow down, not ordering the requests
>>>> properly.
>
> Huh? Avoiding as many memory barriers as possible (aka... ideally
> relaxed) is key to making sync algorithms faster! Or using an acquire
> instead of seq_cst is great. Only use seq_cst when you absolutely have to.
>
>
>>> <
>>> But You see, doing it my way gets rid of the MemBar instructions but not
>>> their necessary effects. In addition, in my model, every access within
>>> an ATOMIC event is SC not just a MemBar at the front and end.
>>> <
>>>> If the CPU takes ownership of ordering, then the only rule is: you just
>>>> have to use atomic properly (even then, you can often get away with
>>>> volatile for most producer/consumer cases), and these subtypes for all
>>>> accesses don't matter for correctness or performance.
>>> <
>>> But a CPU is not in a position to determine Memory Order, the
>>> (multiplicity of) memory controllers do, the CPUs just have to figure
>>> out how to put up with
>>> the imposed order based on the kind of things the CPU is doing at
>>> that instant.
>>> <
>>>> It also would be nice if multithreaded programs could be written in
>>>> C99,
>>>> or pre-C++11.  It's kinda surprising we've only been able to write
>>>> threaded
>>>> programs for about 10 years.
>>> <
>>> I wrote multithreaded programs on an 8085.....
>>> <
>>>> Kent
>>
>
>
pseudo-code:

std::atomic<node*> m_node = create_stack();

// Read-Copy Update (RCU)
// reader thread iteration...

node* current = m_node.load(std::memory_order_relaxed);
while (current)
{
    // grab the link before handing the node to compute()
    node* next = current->m_next.load(std::memory_order_relaxed);
    compute(current);
    current = next;
}

Now, you are trying to tell me that I should use seq_cst instead of
those relaxed loads for the iteration? That would DESTROY
performance, big time... Really BAD!
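
For anyone who wants to compile the loop, here is a self-contained sketch under stated assumptions: the node layout (an int value plus an atomic m_next link) and the helpers sum_list and demo_sum are hypothetical, and the list is built and walked by a single thread. It only shows that the same traversal can be parameterized on the memory order. It does not measure anything, and it does not model a concurrent writer:

```cpp
#include <atomic>
#include <cassert>

// Hypothetical node layout; the pseudo-code only implies an atomic m_next.
struct node {
    int value;
    std::atomic<node*> m_next{nullptr};
    explicit node(int v) : value(v) {}
};

// Walk the list, reading every link with the requested memory order,
// and return the sum of the node values.
long sum_list(const std::atomic<node*>& head, std::memory_order order)
{
    long total = 0;
    node* current = head.load(order);
    while (current) {
        node* next = current->m_next.load(order);
        total += current->value;  // stands in for compute(current)
        current = next;
    }
    return total;
}

// Build the three-node list 1 -> 2 -> 3 on the stack and sum it.
long demo_sum(std::memory_order order)
{
    node c(3), b(2), a(1);
    a.m_next.store(&b, std::memory_order_relaxed);
    b.m_next.store(&c, std::memory_order_relaxed);
    std::atomic<node*> head{&a};
    return sum_list(head, order);
}
```

demo_sum(std::memory_order_relaxed) and demo_sum(std::memory_order_seq_cst) both return 6. The difference is in the generated code: on x86-64 both kinds of load are typically a plain mov, while on AArch64 compilers typically emit ldr for the relaxed load and ldar for the seq_cst one, which is where the per-node cost in a read-mostly traversal comes from.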

Listen to all of this:

https://youtu.be/DZJPqTTt7MA

Re: Memory dependency microbenchmark

https://news.novabbs.org/devel/article-flat.php?id=35027&group=comp.arch#35027

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Wed, 15 Nov 2023 13:56:31 -0800
Organization: A noiseless patient Spider
Lines: 150
Message-ID: <uj3eqg$1tl39$1@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com>
<uiu3dp$svfh$1@dont-email.me>
<u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>
<uj39sg$1stm2$1@dont-email.me>
<394f2ed4be2eeb98d7d6c24735182b86@news.novabbs.com>
<uj3bgm$1t58r$1@dont-email.me> <uj3djd$1tb8u$4@dont-email.me>
<uj3dti$1tb8u$5@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 15 Nov 2023 21:56:32 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="c1eb1e7d2816baa503f549a694bca0e3";
logging-data="2020457"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+zgaKAEYhU5iYQJ0joV0iuXE7c93jjhSA="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:cnV8GkwMqQyW/3D2QbydsueMnpo=
Content-Language: en-US
In-Reply-To: <uj3dti$1tb8u$5@dont-email.me>
 by: Chris M. Thomasson - Wed, 15 Nov 2023 21:56 UTC

On 11/15/2023 1:41 PM, Chris M. Thomasson wrote:
> On 11/15/2023 1:35 PM, Chris M. Thomasson wrote:
>> On 11/15/2023 1:00 PM, Chris M. Thomasson wrote:
>>> On 11/15/2023 12:56 PM, MitchAlsup wrote:
>>>> Kent Dickey wrote:
>>>>
>>>>> In article <u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>,
>>>>>  <aph@littlepinkcloud.invalid> wrote:
>>>>>> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>>>>>> On 11/13/2023 3:49 AM, aph@littlepinkcloud.invalid wrote:
>>>>>>>>
>>>>>>>> I don't think that's really true. The reorderings we see in
>>>>>>>> currently-
>>>>>>>> produced hardware are, more or less, a subset of the same
>>>>>>>> reorderings
>>>>>>>> that C compilers perform. Therefore, if you see a confusing
>>>>>>>> hardware
>>>>>>>> reordering in a multi-threaded C program it may well be
>>>>>>>> (probably is!)
>>>>>>>> a bug (according to the C standard) *even on a TSO machine*. The
>>>>>>>> only
>>>>>>>> common counter-example to this is for volatile accesses.
>>>>>>>
>>>>>>> Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
>>>>>>
>>>>>> Maybe I wasn't clear enough. If you use std::atomic and
>>>>>> std::memory_order_* in such a way that there are no data races, your
>>>>>> concurrent program will be fine on both TSO and relaxed memory
>>>>>> ordering. If you try to fix data races with volatile instead of
>>>>>> std::atomic and std::memory_order_*, that'll mostly fix things on a
>>>>>> TSO machine, but not on a machine with relaxed memory ordering.
>>>>
>>>>> What you are saying is:
>>>>
>>>>> As long as you fully analyze your program, ensure all multithreaded
>>>>> accesses
>>>>> are only through atomic variables, and you label every access to an
>>>>> atomic variable properly (although my point is: exactly what should
>>>>> that
>>>>> be??), then there is no problem.
>>>>
>>>>> If you assume you've already solved the problem, then you find the
>>>>> problem is solved!  Magic!
>>>>
>>>>> What I'm arguing is: the CPU should behave as if memory_order_seq_cst
>>>>> is set on all accesses with no special trickery.
>>>> <
>>>> Should appear as if.....not behave as if.
>>>> <
>>>>>                                                  This acquire/release
>>>>> nonsense is all weakly ordered brain damage.
>>>> <
>>>> Agreed ...
>>>> <
>>>
>>> Why do you "seem" to think it's brain damage? Knowing how to use them
>>> properly is a good thing.
>>>
>>>
>>>
>>>>>                                               The problem is on weakly
>>>>> ordered CPUs, performance definitely does matter in terms of
>>>>> getting this
>>>>> stuff right, but that's their problem.
>>>> <
>>>> Not necessarily !! If you have a weakly ordered machine and start doing
>>>> something ATOMIC, the processor can switch to sequential consistency
>>>> upon
>>>> the detection of the ATOMIC event starting (LL for example) stay SC
>>>> until
>>>> the ATOMIC event is done, then revert back to weakly ordered as it
>>>> pleases.
>>>> The single threaded code gets its performance while the multithreaded
>>>> code
>>>> gets its SC (as Lamport demonstrated was necessary).
>>>> <
>>>>>                                         Being weakly ordered makes
>>>>> them
>>>>> slower when they have to execute barriers for correctness, but it's
>>>>> the
>>>>> barriers themselves that are the slow down, not ordering the requests
>>>>> properly.
>>
>> Huh? Avoiding as many memory barriers as possible (aka... ideally
>> relaxed) is key to making sync algorithms faster! Or using an acquire
>> instead of seq_cst is great. Only use seq_cst when you absolutely have
>> to.
>>
>>
>>>> <
>>>> But You see, doing it my way gets rid of the MemBar instructions but
>>>> not
>>>> their necessary effects. In addition, in my model, every access within
>>>> an ATOMIC event is SC not just a MemBar at the front and end.
>>>> <
>>>>> If the CPU takes ownership of ordering, then the only rule is: you
>>>>> just
>>>>> have to use atomic properly (even then, you can often get away with
>>>>> volatile for most producer/consumer cases), and these subtypes for all
>>>>> accesses don't matter for correctness or performance.
>>>> <
>>>> But a CPU is not in a position to determine Memory Order, the
>>>> (multiplicity of) memory controllers do, the CPUs just have to
>>>> figure out how to put up with
>>>> the imposed order based on the kind of things the CPU is doing at
>>>> that instant.
>>>> <
>>>>> It also would be nice if multithreaded programs could be written in
>>>>> C99,
>>>>> or pre-C++11.  It's kinda surprising we've only been able to write
>>>>> threaded
>>>>> programs for about 10 years.
>>>> <
>>>> I wrote multithreaded programs on an 8085.....
>>>> <
>>>>> Kent
>>>
>>
>>
> pseudo-code:
>
> std::atomic<node*> m_node = create_stack();
>
>
>
> // Read-Copy Update (RCU)
> // reader thread iteration...
>
> node* current = m_node.load(std::memory_order_relaxed);
> while (current)
> {
>    node* next = current->m_next.load(std::memory_order_relaxed);
>    compute(current);
>    current = next;
> }
>
>
> Now, you are trying to tell me that I should use seq_cst instead of
> those relaxed memory barriers for the iteration? That would DESTROY
> performance, big time... Really BAD!

Heck, if I knew I was on a DEC Alpha and depending on the writer side of
the algorithm I was using, std::memory_order_consume would be in order.

>
>
> Listen to all of this:
>
> https://youtu.be/DZJPqTTt7MA

Re: Memory dependency microbenchmark

<jwv4jhm4pk7.fsf-monnier+comp.arch@gnu.org>

https://news.novabbs.org/devel/article-flat.php?id=35029&group=comp.arch#35029
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: monnier@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Wed, 15 Nov 2023 17:40:15 -0500
Organization: A noiseless patient Spider
Lines: 26
Message-ID: <jwv4jhm4pk7.fsf-monnier+comp.arch@gnu.org>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<82b3b3b710652e607dac6cec2064c90b@news.novabbs.com>
<uisdmn$gd4s$2@dont-email.me> <uiu4t5$t4c2$2@dont-email.me>
<uj3c29$1t9an$1@dont-email.me> <uj3d0a$1tb8u$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Info: dont-email.me; posting-host="2bc0d8aa427f82d4336cc91899762e96";
logging-data="2036555"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19gttPLI5qNlL/zSc+M5AmLSkPIzhMX6Y8="
User-Agent: Gnus/5.13 (Gnus v5.13)
Cancel-Lock: sha1:h6j4asijKcwesOWMFP/pjeRUpTg=
sha1:5HOA9YyO6I5Mn6MIww+3+jyINJw=
 by: Stefan Monnier - Wed, 15 Nov 2023 22:40 UTC

> Are you thinking about a magic arch that we cannot use right now?

Yes, he is, obviously.
So when you say "it's bad", please tell us why.

We know it would run slow on existing CPUs, that's not the question.
The question is: why would it be impossible or very hard to
make a CPU that could execute such code efficiently.

I suspect there can be very valid reasons, maybe for the same kinds of
reasons why some systems allow nested transactions (e.g. when you have
a transaction with two calls to `gensym`: it doesn't matter whether the
two calls really return consecutive symbols (as would be guaranteed if
the code were truly run atomically), all that matters is that those
symbols are unique).

So maybe with sequential consistency, there could be some forms of
parallelism which we'd completely disallow, whereas a weaker form of
consistency would allow it. I'm having a hard time imagining what it
could be, tho.

BTW, IIRC, SGI's MIPS was defined to offer sequential consistency on
their big supercomputers, no?

Stefan

Re: Memory dependency microbenchmark

<uj3ikm$1u83m$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35031&group=comp.arch#35031
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Wed, 15 Nov 2023 15:01:42 -0800
Organization: A noiseless patient Spider
Lines: 40
Message-ID: <uj3ikm$1u83m$1@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<82b3b3b710652e607dac6cec2064c90b@news.novabbs.com>
<uisdmn$gd4s$2@dont-email.me> <uiu4t5$t4c2$2@dont-email.me>
<uj3c29$1t9an$1@dont-email.me> <uj3d0a$1tb8u$1@dont-email.me>
<jwv4jhm4pk7.fsf-monnier+comp.arch@gnu.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 15 Nov 2023 23:01:42 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="3af96b48525dafea7d7b80955cb07c1c";
logging-data="2039926"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/gQZyWQsfFG3ImbdR1VLO8q3yKIc75ueI="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:NugaHmJEjrDxDJZoVcDqupqhtBY=
In-Reply-To: <jwv4jhm4pk7.fsf-monnier+comp.arch@gnu.org>
Content-Language: en-US
 by: Chris M. Thomasson - Wed, 15 Nov 2023 23:01 UTC

On 11/15/2023 2:40 PM, Stefan Monnier wrote:
>> Are you thinking about a magic arch that we cannot use right now?
>
> Yes, he is, obviously.
> So when you say "it's bad", please tell us why.

Oh, shit. This is the main point of my confusion. When I say it's bad, I
am referring to using std::memory_order_seq_cst all over the place on
_existing_ architectures.

>
> We know it would run slow on existing CPUs, that's not the question.
> The question is: why would it be impossible or very hard to
> make a CPU that could execute such code efficiently.
>
> I suspect there can be a very valid reasons, maybe for the same kinds of
> reasons why some systems allow nested transactions (e.g. when you have
> a transaction with two calls to `gensym`: it doesn't matter whether the
> two calls really return consecutive symbols (as would be guaranteed if
> the code were truly run atomically), all that matters is that those
> symbols are unique).
>
> So maybe with sequential consistency, there could be some forms of
> parallelism which we'd completely disallow, whereas a weaker form of
> consistency would allow it. I'm having a hard time imagining what it
> could be, tho.
>
> BTW, IIRC, SGI's MIPS was defined to offer sequential consistency on
> their big supercomputers, no?

Offering sequential consistency and mandating it are different things.
We can all get sequential consistency just by using
std::memory_order_seq_cst all of the time. If an arch can be created
that is 100% sequentially consistent and beats existing finely tuned
algorithms, like RCU-based lock-free algorithms, then I would love to
see it.

As of now, wrt our current archs, say, iterating a lock-free stack in an
RCU read-side region using seq_cst is horrible.

Re: Memory dependency microbenchmark

<uj3jdv$1uebr$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35032&group=comp.arch#35032
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Wed, 15 Nov 2023 15:15:10 -0800
Organization: A noiseless patient Spider
Lines: 45
Message-ID: <uj3jdv$1uebr$1@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<82b3b3b710652e607dac6cec2064c90b@news.novabbs.com>
<uisdmn$gd4s$2@dont-email.me> <uiu4t5$t4c2$2@dont-email.me>
<uj3c29$1t9an$1@dont-email.me> <uj3d0a$1tb8u$1@dont-email.me>
<jwv4jhm4pk7.fsf-monnier+comp.arch@gnu.org> <uj3ikm$1u83m$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 15 Nov 2023 23:15:11 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="3af96b48525dafea7d7b80955cb07c1c";
logging-data="2046331"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX190JPyT7aAXpiYCRjZSTUJDWS2LTA14rkM="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:OaY7BpDttX1X5K8g0tB9T8Bq9+0=
In-Reply-To: <uj3ikm$1u83m$1@dont-email.me>
Content-Language: en-US
 by: Chris M. Thomasson - Wed, 15 Nov 2023 23:15 UTC

On 11/15/2023 3:01 PM, Chris M. Thomasson wrote:
> On 11/15/2023 2:40 PM, Stefan Monnier wrote:
>>> Are you thinking about a magic arch that we cannot use right now?
>>
>> Yes, he is, obviously.
>> So when you say "it's bad", please tell us why.
>
> Oh, shit. This is the main point of my confusion. When I say its bad, I
> am referring to using std::memory_order_seq_cst all over the place on
> _existing_ architectures.
>
>
>>
>> We know it would run slow on existing CPUs, that's not the question.
>> The question is: why would it be impossible or very hard to
>> make a CPU that could execute such code efficiently.
>>
>> I suspect there can be a very valid reasons, maybe for the same kinds of
>> reasons why some systems allow nested transactions (e.g. when you have
>> a transaction with two calls to `gensym`: it doesn't matter whether the
>> two calls really return consecutive symbols (as would be guaranteed if
>> the code were truly run atomically), all that matters is that those
>> symbols are unique).
>>
>> So maybe with sequential consistency, there could be some forms of
>> parallelism which we'd completely disallow, whereas a weaker form of
>> consistency would allow it.  I'm having a hard time imagining what it
>> could be, tho.
>>
>> BTW, IIRC, SGI's MIPS was defined to offer sequential consistency on
>> their big supercomputers, no?
>
> Offering sequential consistency and mandating it are different things.
> We can all get sequential consistency just by using
> std::memory_order_seq_cst all of the time. If an arch can be created
> that is 100% sequential consistency and beat out existing finely tuned
> algorithms, like RCU based lock-free algorithms, then, I would love to
> see it.
>
> As of now, wrt are current arch's, say, iterating a lock-free stack in a
> RCU read side region using seq_cst is horrible.

Just because programming sync algorithms on a weakly ordered arch can be
difficult for some people does not mean we have to get rid of it
altogether...

Re: Memory dependency microbenchmark

<f4e88cde406540ff966241529df7853b@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=35033&group=comp.arch#35033
Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Wed, 15 Nov 2023 23:32:12 +0000
Organization: novaBBS
Message-ID: <f4e88cde406540ff966241529df7853b@news.novabbs.com>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com> <uiu3dp$svfh$1@dont-email.me> <u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com> <uj39sg$1stm2$1@dont-email.me> <394f2ed4be2eeb98d7d6c24735182b86@news.novabbs.com> <uj3bgm$1t58r$1@dont-email.me> <uj3djd$1tb8u$4@dont-email.me> <uj3dti$1tb8u$5@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="1046402"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Rslight-Site: $2y$10$OlmnpO5W7OESiX41SEN87.Atc5fQklSSmjXDRwEnWTdEN7zDiSplG
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
X-Spam-Level: *
 by: MitchAlsup - Wed, 15 Nov 2023 23:32 UTC

Chris M. Thomasson wrote:

> pseudo-code:

> std::atomic<node*> m_node = create_stack();

> // Read-Copy Update (RCU)
> // reader thread iteration...

> node* current = m_node.load(std::memory_order_relaxed);
> while (current)
> {
> node* next = current->m_next.load(std::memory_order_relaxed);
> compute(current);
> current = next;
> }

> Now, you are trying to tell me that I should use seq_cst instead of
> those relaxed memory barriers for the iteration? That would DESTROY
> performance, big time... Really BAD!
<
The scan of the concurrent data structure's linked list does not have
to be atomic, or even ordered. It is only once you have identified an
element you want exclusive access to that sequential consistency is
needed.

> Listen to all of this:

> https://youtu.be/DZJPqTTt7MA

Re: Memory dependency microbenchmark

<jwvjzqi38qc.fsf-monnier+comp.arch@gnu.org>

https://news.novabbs.org/devel/article-flat.php?id=35036&group=comp.arch#35036
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: monnier@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Wed, 15 Nov 2023 18:39:28 -0500
Organization: A noiseless patient Spider
Lines: 24
Message-ID: <jwvjzqi38qc.fsf-monnier+comp.arch@gnu.org>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<82b3b3b710652e607dac6cec2064c90b@news.novabbs.com>
<uisdmn$gd4s$2@dont-email.me> <uiu4t5$t4c2$2@dont-email.me>
<uj3c29$1t9an$1@dont-email.me> <uj3d0a$1tb8u$1@dont-email.me>
<jwv4jhm4pk7.fsf-monnier+comp.arch@gnu.org>
<uj3ikm$1u83m$1@dont-email.me> <uj3jdv$1uebr$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Info: dont-email.me; posting-host="829e36217108602686e69a7ccbc80039";
logging-data="2054277"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19V7+JvxnULosodgsIaYIOsaRpKCqq9DJ8="
User-Agent: Gnus/5.13 (Gnus v5.13)
Cancel-Lock: sha1:RWe8ORsRYWXPtNjeg4Y6IHLiqoU=
sha1:y7e2+1ITMpwZA+EcFBsiCYLnxXM=
 by: Stefan Monnier - Wed, 15 Nov 2023 23:39 UTC

> Just because programming sync algorithms on a weakly ordered arch can be
> difficult for some people does not mean we have to get rid of it
> altogether...

No, but it all depends on the hardware cost.
SGI's big multiprocessors (starting with their R10K processor) obeyed
sequential consistency and AFAIK that did not prevent them from providing
top-notch performance.

So if it's not terribly costly, it may end up being *faster* because the
memory barriers become noops (not to mention the fact that performance
engineers can spend time on other things).

I suspect Intel and others did look into it at some point and ended up
deciding it's not actually faster, but I agree with Kent that the
complexity cost for programmers might not really be warranted and that
maybe we'd all be better off if CPU manufacturers followed SGI's lead.

According to https://dl.acm.org/doi/abs/10.1145/2366231.2337220
the cost of supporting SC can be around 6% using tricks known in 2012.
I'd be surprised if it can't be brought further down.

Stefan

Re: Memory dependency microbenchmark

<uj3lmo$1unhg$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35038&group=comp.arch#35038
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Wed, 15 Nov 2023 15:54:00 -0800
Organization: A noiseless patient Spider
Lines: 50
Message-ID: <uj3lmo$1unhg$1@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com>
<uiu3dp$svfh$1@dont-email.me>
<u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>
<uj39sg$1stm2$1@dont-email.me>
<394f2ed4be2eeb98d7d6c24735182b86@news.novabbs.com>
<uj3bgm$1t58r$1@dont-email.me> <uj3djd$1tb8u$4@dont-email.me>
<uj3dti$1tb8u$5@dont-email.me>
<f4e88cde406540ff966241529df7853b@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 15 Nov 2023 23:54:00 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="3af96b48525dafea7d7b80955cb07c1c";
logging-data="2055728"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+dHtzqicD/MSHUKYEbi27GWnMEphYHLQs="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:znkl51Xxn/CLoH3QZWKVDECjIEE=
In-Reply-To: <f4e88cde406540ff966241529df7853b@news.novabbs.com>
Content-Language: en-US
 by: Chris M. Thomasson - Wed, 15 Nov 2023 23:54 UTC

On 11/15/2023 3:32 PM, MitchAlsup wrote:
> Chris M. Thomasson wrote:
>
>> pseudo-code:
>
>> std::atomic<node*> m_node = create_stack();
>
>
>
>> // Read-Copy Update (RCU)
>> // reader thread iteration...
>
>> node* current = m_node.load(std::memory_order_relaxed);
>> while (current)
>> {
>>     node* next = current->m_next.load(std::memory_order_relaxed);
>>     compute(current);
>>     current = next;
>> }
>
>
>> Now, you are trying to tell me that I should use seq_cst instead of
>> those relaxed memory barriers for the iteration? That would DESTROY
>> performance, big time... Really BAD!
> <
> The scan of the concurrent data structure's linked list does not have
> to be atomic, or even ordered.

It has to use atomic loads and stores. However, RCU makes it so we do
not need to use any memory barriers (dec alpha aside for a moment) while
iterating it.
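
As a rough illustration of the point above (a sketch, not code from this thread): the reader loop can take the memory order as a parameter, so the same traversal can be timed with relaxed, acquire, or seq_cst loads. The `node` layout, `push`, and `sum_list` are hypothetical names invented for this example, and a single writer is assumed. Under the C++ memory model the relaxed variant leans on the hardware honouring address dependencies, which the standard does not formally promise; acquire is the portable spelling.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical node type for illustration only.
struct node {
    int value;
    std::atomic<node*> m_next{nullptr};
    explicit node(int v) : value(v) {}
};

// Writer side (single writer assumed): link the new node in front of the
// current head, then publish it with a release store so that its fields
// are visible to any reader that observes the new head pointer.
inline void push(std::atomic<node*>& head, node* n) {
    n->m_next.store(head.load(std::memory_order_relaxed),
                    std::memory_order_relaxed);
    head.store(n, std::memory_order_release);
}

// Reader side: walk the list using the requested load ordering.
// `order` must be a valid load order (relaxed, consume, acquire, seq_cst).
inline long sum_list(const std::atomic<node*>& head, std::memory_order order) {
    long total = 0;
    node* current = head.load(order);
    while (current) {
        node* next = current->m_next.load(order);
        total += current->value;   // stands in for compute(current)
        current = next;
    }
    return total;
}
```

On x86 all three orders typically compile to the same plain load, so any difference in cost would have to be measured on a weaker machine such as ARM or POWER.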

> It is only once you have identified an
> element you want exclusive access to that sequential consistency is
> needed.

Even a mutex does not need sequential consistency.
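
To make that concrete, here is a minimal spinlock sketch that uses only acquire on lock and release on unlock, with no seq_cst anywhere. `spin_mutex` and `count_with_lock` are names made up for this example; it is a toy, not a production lock.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Toy test-and-test-and-set lock (hypothetical, for illustration).
class spin_mutex {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        // Acquire: accesses in the critical section cannot be hoisted
        // above the exchange that takes the lock.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            // Wait with plain relaxed reads until the lock looks free.
            while (locked_.load(std::memory_order_relaxed)) {
                std::this_thread::yield();
            }
        }
    }
    void unlock() {
        // Release: accesses in the critical section cannot sink below
        // the store that frees the lock.
        locked_.store(false, std::memory_order_release);
    }
};

// Increment a plain (non-atomic) counter from several threads under the lock.
inline long count_with_lock(int threads, int iterations) {
    spin_mutex m;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                m.lock();
                ++counter;
                m.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

The acquire/release pair is enough for the counter to come out exact, because each unlock synchronizes with the next successful lock.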

Dekker aside for a moment because it depends on a store followed by a
load to another location. TSO cannot even handle this without a membar.
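
The store-then-load pattern at the heart of Dekker can be written as a small litmus test. This is only a sketch; `store_buffer_trial` is a hypothetical helper, and a serious litmus harness would synchronise thread start-up far more tightly. With seq_cst on every access the outcome r1 == 0 && r2 == 0 is forbidden by the C++ memory model; with weaker orders it is allowed, and on TSO hardware it can occur because a store may still sit in the store buffer when the following load executes.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// One run of the store-buffering test.
// store_order must be relaxed, release, or seq_cst;
// load_order must be relaxed, consume, acquire, or seq_cst.
inline std::pair<int, int> store_buffer_trial(std::memory_order store_order,
                                              std::memory_order load_order) {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;
    std::thread a([&] {
        x.store(1, store_order);     // "I want in"
        r1 = y.load(load_order);     // "is the other side in?"
    });
    std::thread b([&] {
        y.store(1, store_order);
        r2 = x.load(load_order);
    });
    a.join();
    b.join();
    return {r1, r2};
}
```

Calling it with std::memory_order_seq_cst for both arguments should never return (0, 0); calling it with relaxed or release/acquire may, although thread start-up cost makes that outcome rare in such a simple harness.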

>
>
>> Listen to all of this:
>
>> https://youtu.be/DZJPqTTt7MA

Re: Memory dependency microbenchmark

<uj3ls2$1unhg$2@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35039&group=comp.arch#35039
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Wed, 15 Nov 2023 15:56:50 -0800
Organization: A noiseless patient Spider
Lines: 54
Message-ID: <uj3ls2$1unhg$2@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com>
<uiu3dp$svfh$1@dont-email.me>
<u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>
<uj39sg$1stm2$1@dont-email.me>
<394f2ed4be2eeb98d7d6c24735182b86@news.novabbs.com>
<uj3bgm$1t58r$1@dont-email.me> <uj3djd$1tb8u$4@dont-email.me>
<uj3dti$1tb8u$5@dont-email.me>
<f4e88cde406540ff966241529df7853b@news.novabbs.com>
<uj3lmo$1unhg$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 15 Nov 2023 23:56:50 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="3af96b48525dafea7d7b80955cb07c1c";
logging-data="2055728"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+Ay1pGwUTOuUNf9JIS4w2Y/he9Dpx3gZY="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:ectzLNNfuTFv9pvWIULzL4UN4/E=
Content-Language: en-US
In-Reply-To: <uj3lmo$1unhg$1@dont-email.me>
 by: Chris M. Thomasson - Wed, 15 Nov 2023 23:56 UTC

On 11/15/2023 3:54 PM, Chris M. Thomasson wrote:
> On 11/15/2023 3:32 PM, MitchAlsup wrote:
>> Chris M. Thomasson wrote:
>>
>>> pseudo-code:
>>
>>> std::atomic<node*> m_node = create_stack();
>>
>>
>>
>>> // Read-Copy Update (RCU)
>>> // reader thread iteration...
>>
>>> node* current = m_node.load(std::memory_order_relaxed);
>>> while (current)
>>> {
>>>     node* next = current->m_next.load(std::memory_order_relaxed);
>>>     compute(current);
>>>     current = next;
>>> }
>>
>>
>>> Now, you are trying to tell me that I should use seq_cst instead of
>>> those relaxed memory barriers for the iteration? That would DESTROY
>>> performance, big time... Really BAD!
>> <
>> The scan of the concurrent data structure's linked list does not have
>> to be atomic, or even ordered.
>
> It has to use atomic loads and stores. However, RCU makes it so we do
> not need to use any memory barriers (dec alpha aside for a moment) while
> iterating it.
>
>
>> It is only once you have identified an
>> element you want exclusive access to that sequential consistency is
>> needed.
>
> Even a mutex does not need sequential consistency.
>
> Dekker aside for a moment because it depends on a store followed by a
> load to another location. TSO cannot even handle this without a membar.
>
>
>
>>
>>
>>> Listen to all of this:
>>
>>> https://youtu.be/DZJPqTTt7MA
>

Afaict, std::memory_order_consume was added to support RCU compatible
algorithms.
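
A minimal sketch of the publish/read pattern that consume was aimed at, with made-up names (`config`, `g_current`, `publish`, `read_sum`). The reader only needs ordering for accesses that depend on the loaded pointer. Two caveats: current compilers simply treat consume as acquire, and the C++17 standard already notes that its use is temporarily discouraged while the specification is revised.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical payload published by a writer and read by many readers.
struct config {
    int a;
    int b;
};

std::atomic<config*> g_current{nullptr};

// Writer: fill in the object first, then publish the pointer with release
// so the initialisation is visible to whoever observes the new pointer.
inline void publish(config* fresh) {
    g_current.store(fresh, std::memory_order_release);
}

// Reader: consume orders only the accesses that carry a dependency on the
// loaded pointer (c->a, c->b). Returns -1 if nothing has been published.
inline int read_sum() {
    config* c = g_current.load(std::memory_order_consume);
    return c ? c->a + c->b : -1;
}
```

Reclaiming the old object safely (the grace-period half of RCU) is deliberately left out of this sketch.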

Re: Memory dependency microbenchmark

<uj3m8j$1unhg$3@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35040&group=comp.arch#35040
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Wed, 15 Nov 2023 16:03:31 -0800
Organization: A noiseless patient Spider
Lines: 37
Message-ID: <uj3m8j$1unhg$3@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<82b3b3b710652e607dac6cec2064c90b@news.novabbs.com>
<uisdmn$gd4s$2@dont-email.me> <uiu4t5$t4c2$2@dont-email.me>
<uj3c29$1t9an$1@dont-email.me> <uj3d0a$1tb8u$1@dont-email.me>
<jwv4jhm4pk7.fsf-monnier+comp.arch@gnu.org> <uj3ikm$1u83m$1@dont-email.me>
<uj3jdv$1uebr$1@dont-email.me> <jwvjzqi38qc.fsf-monnier+comp.arch@gnu.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 16 Nov 2023 00:03:31 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="3af96b48525dafea7d7b80955cb07c1c";
logging-data="2055728"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19r1ljsrVRpkYMaeZQszvyfk3DvADkAE6k="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:32MSpTE9JO8XaVgDbTov9vnx+h8=
Content-Language: en-US
In-Reply-To: <jwvjzqi38qc.fsf-monnier+comp.arch@gnu.org>
 by: Chris M. Thomasson - Thu, 16 Nov 2023 00:03 UTC

On 11/15/2023 3:39 PM, Stefan Monnier wrote:
>> Just because programming sync algorithms on a weakly ordered arch can be
>> difficult for some people does not mean we have to get rid of it
>> altogether...
>
> No, but it all depends on the hardware cost.
> SGI's big multiprocessors (starting with their R10K processor) obeyed
> sequential consistency and AFAIK that did not prevent them from providing
> top-notch performance.
>
> So if it's not terribly costly, it may end up being *faster* because the
> memory barriers become noops (not to mention the fact that performance
> engineers can spend time on other things).
>
> I suspect Intel and others did look into it at some point and ended up
> deciding it's not actually faster, but I agree with Kent that the
> complexity cost for programmers might not really be warranted and that
> maybe we'd all be better off if CPU manufacturers followed SGI's lead.
>
> According to https://dl.acm.org/doi/abs/10.1145/2366231.2337220
> the cost of supporting SC can be around 6% using tricks known in 2012.
> I'd be surprised if it can't be brought further down.

Actually, for some reason this line of thinking reminds me of the
following funny scene in a movie called Dirty Rotten Scoundrels, where
they had to put corks on the forks of Ruprecht (Steve Martin) to
prevent him from hurting himself. So, basically, Ruprecht would be the
programmer and the Michael Caine character would be the arch designer
thinking that relaxed models are too complex, and all programmers are
mainly morons:

https://youtu.be/SKDX-qJaJ08

https://youtu.be/SKDX-qJaJ08?t=33

Afaict, std::memory_order_consume was added to support RCU compatible
algorithms.

Re: Memory dependency microbenchmark

<jwvwmui1red.fsf-monnier+comp.arch@gnu.org>

https://news.novabbs.org/devel/article-flat.php?id=35042&group=comp.arch#35042
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: monnier@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Wed, 15 Nov 2023 19:37:01 -0500
Organization: A noiseless patient Spider
Lines: 23
Message-ID: <jwvwmui1red.fsf-monnier+comp.arch@gnu.org>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<82b3b3b710652e607dac6cec2064c90b@news.novabbs.com>
<uisdmn$gd4s$2@dont-email.me> <uiu4t5$t4c2$2@dont-email.me>
<uj3c29$1t9an$1@dont-email.me> <uj3d0a$1tb8u$1@dont-email.me>
<jwv4jhm4pk7.fsf-monnier+comp.arch@gnu.org>
<uj3ikm$1u83m$1@dont-email.me> <uj3jdv$1uebr$1@dont-email.me>
<jwvjzqi38qc.fsf-monnier+comp.arch@gnu.org>
<uj3m8j$1unhg$3@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Info: dont-email.me; posting-host="829e36217108602686e69a7ccbc80039";
logging-data="2069003"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/O+x4uAROKFDuCrK4snTPk+H9D0lR+fjE="
User-Agent: Gnus/5.13 (Gnus v5.13)
Cancel-Lock: sha1:/QpzPvCf/6TImSdToc8mlDhqsYI=
sha1:g/7/cywUVhKPMUToq3mNZlQnGvw=
 by: Stefan Monnier - Thu, 16 Nov 2023 00:37 UTC

> Actually, for some reason this line of thinking reminds me of the following
> funny scene in a movie called Dirty Rotten Scoundrels. Where they had to
> put corks on the forks of Ruprecht (Steve Martin), to prevent him from
> hurting himself. So, basically, Ruprecht would be the programmer and the
> Micheal Caine character would be the arch designer thinking that relaxed
> models are too complex, and all programmers are mainly morons:

Of course, cognitive dissonance would bite those people who have invested
efforts into learning about all the intricacies of non-sequential
memory models.

But if we can make SC's efficiency sufficiently close to that of TSO,
the benefit could be significant for all those people who have not
invested such efforts.

The evolution of computer programming is littered with steps that reduce
performance in exchange for a cleaner programming model.

You don't need to be a moron to be baffled by the complexity of relaxed
memory models.

Stefan

Re: Memory dependency microbenchmark

<uj3s14$1vjlh$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35044&group=comp.arch#35044
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: kegs@provalid.com (Kent Dickey)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Thu, 16 Nov 2023 01:41:56 -0000 (UTC)
Organization: provalid.com
Lines: 87
Message-ID: <uj3s14$1vjlh$1@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <uj3c29$1t9an$1@dont-email.me> <uj3d0a$1tb8u$1@dont-email.me> <jwv4jhm4pk7.fsf-monnier+comp.arch@gnu.org>
Injection-Date: Thu, 16 Nov 2023 01:41:56 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="0eb6232bf4b8fa835d4c1538e9906173";
logging-data="2084529"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/21R14lg724KUCJioU1o5z"
Cancel-Lock: sha1:3wqROAt+VponHfXVapcjeqkEXq8=
Originator: kegs@provalid.com (Kent Dickey)
X-Newsreader: trn 4.0-test76 (Apr 2, 2001)
 by: Kent Dickey - Thu, 16 Nov 2023 01:41 UTC

In article <jwv4jhm4pk7.fsf-monnier+comp.arch@gnu.org>,
Stefan Monnier <monnier@iro.umontreal.ca> wrote:
>> Are you thinking about a magic arch that we cannot use right now?
>
>Yes, he is, obviously.
>So when you say "it's bad", please tell us why.
>
>We know it would run slow on existing CPUs, that's not the question.
>The question is: why would it be impossible or very hard to
>make a CPU that could execute such code efficiently.
>
>I suspect there can be a very valid reasons, maybe for the same kinds of
>reasons why some systems allow nested transactions (e.g. when you have
>a transaction with two calls to `gensym`: it doesn't matter whether the
>two calls really return consecutive symbols (as would be guaranteed if
>the code were truly run atomically), all that matters is that those
>symbols are unique).
>
>So maybe with sequential consistency, there could be some forms of
>parallelism which we'd completely disallow, whereas a weaker form of
>consistency would allow it. I'm having a hard time imagining what it
>could be, tho.
>
>BTW, IIRC, SGI's MIPS was defined to offer sequential consistency on
>their big supercomputers, no?
>
>
> Stefan

Every HP PA-RISC system, from workstation to the largest server, was
sequentially consistent and needed no barriers. It was not a problem;
I never thought it was even considered any sort of performance issue.
Once you decide to support it, you just throw some hardware at it,
and you're done.

Since the original HP PA-RISC MP designs were sequentially consistent, all
the implementations afterward kept it up since no one wanted any existing
code to break. The architects defined the architecture to allow weak
ordering, but no implementation (by HP at least) did so. These
architects then went on to IA-64, where ordering really is weak. IA-64 was
in-order, so weak ordering has more of a payoff there, and IA-64 didn't
want to spend hardware on this (they had 50 other things to waste it on).
IA-64 is full of bad ideas and ideas just not implemented well.

Sun was TSO, which is weaker. Sun was never a performance champion,
other than by throwing the most cores at a parallel problem. So being
TSO relative to Sequential Consistency didn't seem to buy Sun much. DEC
Alpha was the poster child of weakly ordered (so weak they didn't maintain
single CPU consistent ordering with itself) and it was often a performance
champion on tech workloads, but that had more to do with their
much higher clock frequencies, and that edge went away once out-of-order
took off. DEC Alpha was never a big player in TPC-C workloads, where
everybody made their money in the 90s. Technical computing is nice and
fun, but there was not a lot of profit in it compared to business workloads
in the 90s. IA64's reason to exist was to run SGI/SUN/DEC out of business,
and it effectively did (while hurting HP about as much).

I don't know if SGI was sequentially consistent. It's possible, since
software developed for other systems might have pushed them to support it,
but the academic RISC folks were pretty big on weakly ordered.

A problem with weakly ordered is no implementation is THAT weakly
ordered. To maximize doing things in a bad order requires "unlucky" cache
misses, and these are just not common in practice. So "Store A; Store B"
often appears to be done in that order with no barriers on weakly ordered
systems. It's hard to feel confident you've written anything complex
right, so most algorithms are kept relatively simple to make it more likely
they are well tested.

HP PA-RISC had poor atomic support, so HP-UX used a simple spin-lock
using just load and store. I forget the details, but it was something
like this: Each of 4 CPUs got one byte in a 4-byte word. First, set
your byte to "iwant", then read the word. If everyone else is 0, then
set your byte to "iwon", then read the word. If everyone else is still
0, you've won, do what you want. And if you see other bytes getting
set, then you move to a backoff algorithm to determine the winner (you
have 256 states you can move through). Note that when you release the
lock, you can pick the next winner with a single store. What I
understood was this was actually faster than a simple compare-and-swap
since it let software immediately see if there was contention, and move
to a back-off algorithm right away (and you can see who's contending,
and deal with that as well). Spinlocks tend to lead to cache
invalidation storms, which are hard to tune away, but this scheme was much
more tuneable. It scaled to 64 CPUs by doing the lock in two steps, and
moving to a 64-bit word.
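
A rough C++ sketch of the two-phase claim described above, for illustration only. The actual HP-UX code is not reproduced in this post, so the names ByteLock, kWant and kWon, the use of four separate atomic bytes instead of byte stores into one 4-byte word, and the simple "withdraw and let the caller back off" path (in place of the real backoff algorithm) are all assumptions. Default (seq_cst) atomics stand in for PA-RISC's sequentially consistent loads and stores.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr int kCpus = 4;
constexpr std::uint8_t kFree = 0, kWant = 1, kWon = 2;  // hypothetical encodings

struct ByteLock {
    // One byte per CPU; the post packs these into a single 4-byte word.
    std::atomic<std::uint8_t> slot[kCpus] = {};

    // True if every byte other than ours is still zero.
    bool others_idle(int me) const {
        for (int i = 0; i < kCpus; ++i)
            if (i != me && slot[i].load() != kFree) return false;
        return true;
    }

    // Two-phase claim using only seq_cst loads and stores, no atomic RMW.
    bool try_acquire(int me) {
        slot[me].store(kWant);                 // "iwant"
        if (others_idle(me)) {
            slot[me].store(kWon);              // "iwon"
            if (others_idle(me)) return true;  // nobody else showed up: we hold it
        }
        slot[me].store(kFree);                 // contention: withdraw, caller backs off
        return false;
    }

    void release(int me) {
        assert(slot[me].load() == kWon);       // only the holder may release
        slot[me].store(kFree);
    }
};
```

The mutual-exclusion argument is the usual Dekker-style one: if a CPU reads another CPU's byte as zero after storing its own, the other CPU's later read must see the nonzero byte. That reasoning depends on sequential consistency, which is why the sketch would not be safe with relaxed orderings.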

Kent

Re: Memory dependency microbenchmark

<uj3tkp$1vpm1$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35045&group=comp.arch#35045
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Wed, 15 Nov 2023 18:09:28 -0800
Organization: A noiseless patient Spider
Lines: 50
Message-ID: <uj3tkp$1vpm1$1@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<uj3c29$1t9an$1@dont-email.me> <uj3d0a$1tb8u$1@dont-email.me>
<jwv4jhm4pk7.fsf-monnier+comp.arch@gnu.org> <uj3s14$1vjlh$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 16 Nov 2023 02:09:29 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="3af96b48525dafea7d7b80955cb07c1c";
logging-data="2090689"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+FlYi1KwhU4HC6u6zz0z/OJwk+pfFgVRo="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:ePTjSfK0UWkQ4ZqeqaB6VtTc+CU=
In-Reply-To: <uj3s14$1vjlh$1@dont-email.me>
Content-Language: en-US
 by: Chris M. Thomasson - Thu, 16 Nov 2023 02:09 UTC

On 11/15/2023 5:41 PM, Kent Dickey wrote:
> In article <jwv4jhm4pk7.fsf-monnier+comp.arch@gnu.org>,
> Stefan Monnier <monnier@iro.umontreal.ca> wrote:
>>> Are you thinking about a magic arch that we cannot use right now?
>>
>> Yes, he is, obviously.
>> So when you say "it's bad", please tell us why.
>>
>> We know it would run slow on existing CPUs, that's not the question.
>> The question is: why would it be impossible or very hard to
>> make a CPU that could execute such code efficiently.
>>
>> I suspect there can be very valid reasons, maybe for the same kinds of
>> reasons why some systems allow nested transactions (e.g. when you have
>> a transaction with two calls to `gensym`: it doesn't matter whether the
>> two calls really return consecutive symbols (as would be guaranteed if
>> the code were truly run atomically), all that matters is that those
>> symbols are unique).
>>
>> So maybe with sequential consistency, there could be some forms of
>> parallelism which we'd completely disallow, whereas a weaker form of
>> consistency would allow it. I'm having a hard time imagining what it
>> could be, tho.
>>
>> BTW, IIRC, SGI's MIPS was defined to offer sequential consistency on
>> their big supercomputers, no?
>>
>>
>> Stefan
>
> Every HP PA-RISC system, from workstation to the largest server, was
> sequentially consistent and needed no barriers. It was not a problem;
> I never thought it was even considered any sort of performance issue.
> Once you decide to support it, you just throw some hardware at it,
> and you're done.
>
> Since the original HP PA-RISC MP designs were sequentially consistent, all
> the implementations afterward kept it up since no one wanted any existing
> code to break. The architects defined the architecture to allow weak
> ordering, but no implementation (by HP at least) did so. These
> architects then went on to IA-64, where it really is weak, since IA64 was
> in order, so it has more of a payoff there since IA64 didn't want to spend
> hardware on this (they had 50 other things to waste it on), and IA-64 is
> full of bad ideas and ideas just not implemented well.
>
> Sun was TSO, which is weaker. [...]

Sun had RMO mode as well. Avoiding #StoreLoad membars was a must; only use
them when you absolutely have to.
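
For readers wondering where a store-to-load barrier cannot be avoided, the classic case is the store-buffering (Dekker-style) pattern. A minimal C++ sketch follows; the names flag0, flag1, side0 and side1 are made up for illustration, and std::atomic_thread_fence(std::memory_order_seq_cst) is used here as the portable stand-in for the role a #StoreLoad membar plays on SPARC.

```cpp
#include <atomic>

std::atomic<int> flag0{0}, flag1{0};

// Each side announces itself, then checks the other.
// Without the fence, the store may still sit in a store buffer while the
// later load executes, so both sides could read 0.
int side0() {
    flag0.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // order store before load
    return flag1.load(std::memory_order_relaxed);
}

int side1() {
    flag1.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // order store before load
    return flag0.load(std::memory_order_relaxed);
}
```

With both fences in place, the outcome where side0() and side1() both return 0 is ruled out when they run on different threads. Even TSO permits a later load to pass an earlier store, which is why this one barrier remains necessary there.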

Re: Memory dependency microbenchmark

<uj3uak$1vtea$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35046&group=comp.arch#35046
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Wed, 15 Nov 2023 18:21:04 -0800
Organization: A noiseless patient Spider
Lines: 36
Message-ID: <uj3uak$1vtea$1@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<82b3b3b710652e607dac6cec2064c90b@news.novabbs.com>
<uisdmn$gd4s$2@dont-email.me> <uiu4t5$t4c2$2@dont-email.me>
<uj3c29$1t9an$1@dont-email.me> <uj3d0a$1tb8u$1@dont-email.me>
<jwv4jhm4pk7.fsf-monnier+comp.arch@gnu.org> <uj3ikm$1u83m$1@dont-email.me>
<uj3jdv$1uebr$1@dont-email.me> <jwvjzqi38qc.fsf-monnier+comp.arch@gnu.org>
<uj3m8j$1unhg$3@dont-email.me> <jwvwmui1red.fsf-monnier+comp.arch@gnu.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 16 Nov 2023 02:21:08 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="3af96b48525dafea7d7b80955cb07c1c";
logging-data="2094538"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+NqU/TMImreIV2O6PTphPbc0JrqMgpfLc="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:usVDiV0p/EJGj3n9xIQ3Zut0THM=
In-Reply-To: <jwvwmui1red.fsf-monnier+comp.arch@gnu.org>
Content-Language: en-US
 by: Chris M. Thomasson - Thu, 16 Nov 2023 02:21 UTC

On 11/15/2023 4:37 PM, Stefan Monnier wrote:
>> Actually, for some reason this line of thinking reminds me of the following
>> funny scene in a movie called Dirty Rotten Scoundrels. Where they had to
>> put corks on the forks of Ruprecht (Steve Martin), to prevent him from
>> hurting himself. So, basically, Ruprecht would be the programmer and the
>> Michael Caine character would be the arch designer thinking that relaxed
>> models are too complex, and all programmers are mainly morons:
>
> Of course, cognitive dissonance would bite those people who have invested
> efforts into learning about all the intricacies of non-sequential
> memory models.
>
> But if we can make SC's efficiency sufficiently close to that of TSO,
> the benefit could be significant for all those people who have not
> invested such efforts.
>
> The evolution of computer programming is littered with steps that reduce
> performance in exchange for a cleaner programming model.
>
> You don't need to be a moron to be baffled by the complexity of relaxed
> memory models.

Mainly, the kernel guys have to know all about them in order to try to
squeeze as much performance as they can from a given arch. Lock-free and
wait-free algorithms are used quite a bit. Creating them is never easy,
relaxed model or not.

Fwiw, I like the SPARC model where one could put the processor in RMO or
TSO mode. I think there was another mode, but I cannot remember it right
now. A SEQ mode should be doable in SPARC. However, it would incur a
performance penalty.

The funny thing is that in lock/wait-free programming, we always look
for the weakest barriers that can be used, and still be correct. So, if
everything was suddenly seq_cst with much better performance on a new
magic arch, well, I would need to check it out for sure!
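
As a small illustration of "weakest barrier that is still correct", here is a sketch of a test-and-set spinlock that uses acquire on lock and release on unlock instead of seq_cst everywhere. The class name SpinLock and its layout are assumptions for the example, not code from any particular kernel or library.

```cpp
#include <atomic>

class SpinLock {
    std::atomic<bool> locked_{false};

public:
    bool try_lock() {
        // acquire: accesses inside the critical section cannot move above this
        return !locked_.exchange(true, std::memory_order_acquire);
    }

    void lock() {
        while (!try_lock()) {
            // spin until the current holder releases
        }
    }

    void unlock() {
        // release: accesses inside the critical section cannot move below this
        locked_.store(false, std::memory_order_release);
    }
};
```

On a machine where seq_cst cost the same as acquire/release, nothing would be gained by the weaker annotations, which is the scenario being speculated about above.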

Re: Memory dependency microbenchmark

<Z2q5N.25376$cAm7.3257@fx18.iad>

https://news.novabbs.org/devel/article-flat.php?id=35055&group=comp.arch#35055
Path: i2pn2.org!i2pn.org!news.neodome.net!feeder1.feed.usenet.farm!feed.usenet.farm!peer01.ams4!peer.am4.highwinds-media.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx18.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: scott@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Memory dependency microbenchmark
Newsgroups: comp.arch
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <82b3b3b710652e607dac6cec2064c90b@news.novabbs.com> <uisdmn$gd4s$2@dont-email.me> <uiu4t5$t4c2$2@dont-email.me> <uj3c29$1t9an$1@dont-email.me> <uj3d0a$1tb8u$1@dont-email.me> <jwv4jhm4pk7.fsf-monnier+comp.arch@gnu.org> <uj3ikm$1u83m$1@dont-email.me> <uj3jdv$1uebr$1@dont-email.me> <jwvjzqi38qc.fsf-monnier+comp.arch@gnu.org> <uj3m8j$1unhg$3@dont-email.me> <jwvwmui1red.fsf-monnier+comp.arch@gnu.org> <uj3uak$1vtea$1@dont-email.me>
Lines: 32
Message-ID: <Z2q5N.25376$cAm7.3257@fx18.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Thu, 16 Nov 2023 14:54:49 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Thu, 16 Nov 2023 14:54:49 GMT
X-Received-Bytes: 2727
 by: Scott Lurndal - Thu, 16 Nov 2023 14:54 UTC

"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
>On 11/15/2023 4:37 PM, Stefan Monnier wrote:
>>> Actually, for some reason this line of thinking reminds me of the following
>>> funny scene in a movie called Dirty Rotten Scoundrels. Where they had to
>>> put corks on the forks of Ruprecht (Steve Martin), to prevent him from
>>> hurting himself. So, basically, Ruprecht would be the programmer and the
>>> Michael Caine character would be the arch designer thinking that relaxed
>>> models are too complex, and all programmers are mainly morons:
>>
>> Of course, cognitive dissonance would bite those people who have invested
>> efforts into learning about all the intricacies of non-sequential
>> memory models.
>>
>> But if we can make SC's efficiency sufficiently close to that of TSO,
>> the benefit could be significant for all those people who have not
>> invested such efforts.
>>
>> The evolution of computer programming is littered with steps that reduce
>> performance in exchange for a cleaner programming model.
>>
>> You don't need to be a moron to be baffled by the complexity of relaxed
>> memory models.
>
>Mainly, the kernel guys have to know all about them in order to try to
>squeeze as much performance as they can from a given arch. Lock/Wait
>free algorithms are used quite a bit. Creating them is never easy,
>relaxed model or not.

We ran into a problem with Linux circa 2006 or thereabouts where
accessing an skb (network stack buffer) required adding a barrier
(AMD64 processor). Details are fuzzy two decades later....

Re: Memory dependency microbenchmark

<uj63qk$2ecqf$2@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35061&group=comp.arch#35061
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Thu, 16 Nov 2023 14:07:16 -0800
Organization: A noiseless patient Spider
Lines: 37
Message-ID: <uj63qk$2ecqf$2@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<82b3b3b710652e607dac6cec2064c90b@news.novabbs.com>
<uisdmn$gd4s$2@dont-email.me> <uiu4t5$t4c2$2@dont-email.me>
<uj3c29$1t9an$1@dont-email.me> <uj3d0a$1tb8u$1@dont-email.me>
<jwv4jhm4pk7.fsf-monnier+comp.arch@gnu.org> <uj3ikm$1u83m$1@dont-email.me>
<uj3jdv$1uebr$1@dont-email.me> <jwvjzqi38qc.fsf-monnier+comp.arch@gnu.org>
<uj3m8j$1unhg$3@dont-email.me> <jwvwmui1red.fsf-monnier+comp.arch@gnu.org>
<uj3uak$1vtea$1@dont-email.me> <Z2q5N.25376$cAm7.3257@fx18.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 16 Nov 2023 22:07:16 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="3af96b48525dafea7d7b80955cb07c1c";
logging-data="2569039"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+be7uiOMV8g2YkGFADGfXvl8PxuDH6Flk="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:iSRKxHVt0reyrywhUb6QHOCaSlg=
Content-Language: en-US
In-Reply-To: <Z2q5N.25376$cAm7.3257@fx18.iad>
 by: Chris M. Thomasson - Thu, 16 Nov 2023 22:07 UTC

On 11/16/2023 6:54 AM, Scott Lurndal wrote:
> "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
>> On 11/15/2023 4:37 PM, Stefan Monnier wrote:
>>>> Actually, for some reason this line of thinking reminds me of the following
>>>> funny scene in a movie called Dirty Rotten Scoundrels. Where they had to
>>>> put corks on the forks of Ruprecht (Steve Martin), to prevent him from
>>>> hurting himself. So, basically, Ruprecht would be the programmer and the
>>>> Michael Caine character would be the arch designer thinking that relaxed
>>>> models are too complex, and all programmers are mainly morons:
>>>
>>> Of course, cognitive dissonance would bite those people who have invested
>>> efforts into learning about all the intricacies of non-sequential
>>> memory models.
>>>
>>> But if we can make SC's efficiency sufficiently close to that of TSO,
>>> the benefit could be significant for all those people who have not
>>> invested such efforts.
>>>
>>> The evolution of computer programming is littered with steps that reduce
>>> performance in exchange for a cleaner programming model.
>>>
>>> You don't need to be a moron to be baffled by the complexity of relaxed
>>> memory models.
>>
>> Mainly, the kernel guys have to know all about them in order to try to
>> squeeze as much performance as they can from a given arch. Lock/Wait
>> free algorithms are used quite a bit. Creating them is never easy,
>> relaxed model or not.
>
> We ran into an issue with linux circa 2006 or thereabouts where
> there was an issue accessing an skb (networks stack buffer) that
> required adding a barrier (AMD64 processor). Details are fuzzy two decades later....
>

I know that RCU is being used for dynamic routing tables, and a lot more
in the Linux kernel. I cannot remember if the skb code used it; I bet it did.
For some damn reason, this makes me think of eieio on the PPC...

Re: Memory dependency microbenchmark

<ksSdnZUAF6TBscr4nZ2dnZfqnPqdnZ2d@supernews.com>

https://news.novabbs.org/devel/article-flat.php?id=35065&group=comp.arch#35065
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!1.us.feeder.erje.net!feeder.erje.net!border-1.nntp.ord.giganews.com!border-2.nntp.ord.giganews.com!nntp.giganews.com!Xl.tags.giganews.com!local-1.nntp.ord.giganews.com!nntp.supernews.com!news.supernews.com.POSTED!not-for-mail
NNTP-Posting-Date: Fri, 17 Nov 2023 09:03:23 +0000
Sender: Andrew Haley <aph@zarquon.pink>
From: aph@littlepinkcloud.invalid
Subject: Re: Memory dependency microbenchmark
Newsgroups: comp.arch
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com> <uiu3dp$svfh$1@dont-email.me> <u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com> <uj39sg$1stm2$1@dont-email.me>
User-Agent: tin/1.9.2-20070201 ("Dalaruan") (UNIX) (Linux/4.18.0-477.27.1.el8_8.x86_64 (x86_64))
Message-ID: <ksSdnZUAF6TBscr4nZ2dnZfqnPqdnZ2d@supernews.com>
Date: Fri, 17 Nov 2023 09:03:24 +0000
Lines: 59
X-Trace: sv3-BSnLfSqb5LncONo901zQhF5dqtOlSfxyNFEFnxi17gI0fq2ME31gn30JHMwrS5TVHUZvzMCTiDk0fyk!DgZr3DcScZKo/hrhv88mc/CtT2oRVYwMNKqZCwSdVlb6yRgML1Uoz1Cg40nM5NRI+a0+eYRdp9vG!2iWph0E4lkY=
X-Complaints-To: www.supernews.com/docs/abuse.html
X-DMCA-Complaints-To: www.supernews.com/docs/dmca.html
X-Abuse-and-DMCA-Info: Please be sure to forward a copy of ALL headers
X-Abuse-and-DMCA-Info: Otherwise we will be unable to process your complaint properly
X-Postfilter: 1.3.40
 by: aph@littlepinkcloud.invalid - Fri, 17 Nov 2023 09:03 UTC

Kent Dickey <kegs@provalid.com> wrote:
> In article <u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>,
> <aph@littlepinkcloud.invalid> wrote:
>>Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>> On 11/13/2023 3:49 AM, aph@littlepinkcloud.invalid wrote:
>>>>
>>>> I don't think that's really true. The reorderings we see in currently-
>>>> produced hardware are, more or less, a subset of the same reorderings
>>>> that C compilers perform. Therefore, if you see a confusing hardware
>>>> reordering in a multi-threaded C program it may well be (probably is!)
>>>> a bug (according to the C standard) *even on a TSO machine*. The only
>>>> common counter-example to this is for volatile accesses.
>>>
>>> Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
>>
>>Maybe I wasn't clear enough. If you use std::atomic and
>>std::memory_order_* in such a way that there are no data races, your
>>concurrent program will be fine on both TSO and relaxed memory
>>ordering. If you try to fix data races with volatile instead of
>>std::atomic and std::memory_order_*, that'll mostly fix things on a
>>TSO machine, but not on a machine with relaxed memory ordering.
>
> What you are saying is:
>
> As long as you fully analyze your program, ensure all multithreaded
> accesses are only through atomic variables, and you label every
> access to an atomic variable properly (although my point is: exactly
> what should that be??), then there is no problem.

Well, this is definitely true. But it's not exactly a plan: in
practice, people use careful synchronization boundaries and immutable
data structures.

> What I'm arguing is: the CPU should behave as if
> memory_order_seq_cst is set on all accesses with no special
> trickery.

What I'm saying is: that isn't sufficient if you are using an
optimizing compiler. And if you are programming for an optimizing
compiler you have to follow the rules anyway. And the optimizing
compiler can reorder stores and loads as much as, if not more than, the
hardware does.
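
To make the compiler side concrete, here is a minimal message-passing sketch. The names payload, ready, publish and try_consume are invented for the example. If ready were a plain bool, the compiler would be free to reorder or hoist the accesses even on sequentially consistent hardware; with the atomic flag, the same source is correct on TSO and on weakly ordered machines alike.

```cpp
#include <atomic>

int payload = 0;                  // plain data, written before publication
std::atomic<bool> ready{false};   // the only variable accessed concurrently

void publish(int value) {
    payload = value;
    // release: neither compiler nor CPU may sink the payload write below this
    ready.store(true, std::memory_order_release);
}

bool try_consume(int &out) {
    // acquire: neither compiler nor CPU may hoist the payload read above this
    if (!ready.load(std::memory_order_acquire)) return false;
    out = payload;
    return true;
}
```

The payload itself stays an ordinary int; only the flag that hands it over needs to be atomic.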

> This acquire/release nonsense is all weakly ordered brain
> damage. The problem is on weakly ordered CPUs, performance
> definitely does matter in terms of getting this stuff right, but
> that's their problem. Being weakly ordered makes them slower when
> they have to execute barriers for correctness, but it's the barriers
> themselves that are the slow down, not ordering the requests
> properly.

How is that? All the barriers do is enforce the ordering.

Take the Apple M1 as an example. It has a TSO mode bit. It also has TSO
stores and loads, intended for when TSO mode is turned off. Are you
saying that the TSO stores and loads use a different mechanism to
enforce ordering from the one used when TSO is on by default?

Andrew.

Re: Memory dependency microbenchmark

<jwvleaw1j9b.fsf-monnier+comp.arch@gnu.org>

https://news.novabbs.org/devel/article-flat.php?id=35066&group=comp.arch#35066
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: monnier@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Fri, 17 Nov 2023 10:44:45 -0500
Organization: A noiseless patient Spider
Lines: 15
Message-ID: <jwvleaw1j9b.fsf-monnier+comp.arch@gnu.org>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com>
<uiu3dp$svfh$1@dont-email.me>
<u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>
<uj39sg$1stm2$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Info: dont-email.me; posting-host="82ac6eb6107de4dd38d7b7993ee8f542";
logging-data="3008896"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+WcLeBE0nQ2PpzlNsLjevIG6rmUQaFyZ0="
User-Agent: Gnus/5.13 (Gnus v5.13)
Cancel-Lock: sha1:1K8TNGxaMgiASrOaHU0PlIbP/dY=
sha1:2r5XRCK/RpOm8s+b3/N5bI+Soao=
 by: Stefan Monnier - Fri, 17 Nov 2023 15:44 UTC

> As long as you fully analyze your program, ensure all multithreaded accesses
> are only through atomic variables, and you label every access to an
> atomic variable properly (although my point is: exactly what should that
> be??), then there is no problem.

BTW, the above sounds daunting when writing in C because you have to do
that analysis yourself, but there are programming languages out there
which will do that analysis for you as part of type checking.
I'm thinking here of languages like Rust or the STM library of
Haskell. This also solves the problem that memory accesses can be
reordered by the compiler, since in that case the compiler is fully
aware of which accesses can be reordered and which can't.

Stefan

Re: Memory dependency microbenchmark

<491d288a9d5a47236c979622b79db056@news.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=35068&group=comp.arch#35068
Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Fri, 17 Nov 2023 18:37:17 +0000
Organization: novaBBS
Message-ID: <491d288a9d5a47236c979622b79db056@news.novabbs.com>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com> <uiu3dp$svfh$1@dont-email.me> <u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com> <uj39sg$1stm2$1@dont-email.me> <jwvleaw1j9b.fsf-monnier+comp.arch@gnu.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="1243777"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
X-Rslight-Site: $2y$10$.Yqyt97Uj3Ofw7WFIM9v.eI.BOaxcrsoooJ9qrGxin1Wf6ycI2wWO
 by: MitchAlsup - Fri, 17 Nov 2023 18:37 UTC

Stefan Monnier wrote:

>> As long as you fully analyze your program, ensure all multithreaded accesses
>> are only through atomic variables, and you label every access to an
>> atomic variable properly (although my point is: exactly what should that
>> be??), then there is no problem.

> BTW, the above sounds daunting when writing in C because you have to do
> that analysis yourself, but there are programming languages out there
> which will do that analysis for you as part of type checking.
> I'm thinking here of languages like Rust or the STM library of
> Haskell. This also solves the problem that memory accesses can be
> reordered by the compiler, since in that case the compiler is fully
> aware of which accesses can be reordered and which can't.

> Stefan
<
I created the Exotic Synchronization Method such that you could just
write the code needed to do the work, and then decorate those accesses
which are participating in the ATOMIC event. So, let's say you want to
move an element from one doubly linked list to another place in some
other doubly linked list:: you would write::
<
BOOLEAN MoveElement( Element *fr, Element *to )
{
    fn = fr->next;
    fp = fr->prev;
    tn = to->next;

    if( TRUE )
    {
             fp->next = fn;
             fn->prev = fp;
             to->next = fr;
             tn->prev = fr;
             fr->prev = to;
             fr->next = tn;
             return TRUE;
    }
    return FALSE;
}

In order to change this into a fully qualified ATOMIC event, the code
is decorated as::

BOOLEAN MoveElement( Element *fr, Element *to )
{
    esmLOCK( fn = fr->next );         // get data
    esmLOCK( fp = fr->prev );
    esmLOCK( tn = to->next );
    esmLOCK( fn );                    // touch data
    esmLOCK( fp );
    esmLOCK( tn );
    if( !esmINTERFERENCE() )
    {
             fp->next = fn;           // move the bits around
             fn->prev = fp;
             to->next = fr;
             tn->prev = fr;
             fr->prev = to;
    esmLOCK( fr->next = tn );
             return TRUE;
    }
    return FALSE;
}

Having a multiplicity of containers participate in an ATOMIC event
is key to making ATOMIC stuff fast and needing fewer ATOMICs to
get the job(s) done.
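
For comparison, here is what the same move looks like on today's hardware without ESM, as a software-only stand-in. esmLOCK and esmINTERFERENCE are not available in any C++ toolchain, so this sketch assumes one global std::mutex (the hypothetical list_mutex) covering the whole six-store update, and it assumes circular lists so that every node has both neighbours.

```cpp
#include <cassert>
#include <mutex>

struct Element {
    Element *next = nullptr;
    Element *prev = nullptr;
};

std::mutex list_mutex;  // hypothetical: one lock standing in for the ATOMIC event

// Unlink fr from its current position and insert it right after to.
bool MoveElement(Element *fr, Element *to) {
    std::lock_guard<std::mutex> guard(list_mutex);
    assert(fr != to && to->next != fr);  // sketch does not handle these cases

    Element *fn = fr->next;
    Element *fp = fr->prev;
    Element *tn = to->next;

    fp->next = fn;   // close the gap where fr used to be
    fn->prev = fp;
    to->next = fr;   // splice fr in between to and tn
    tn->prev = fr;
    fr->prev = to;
    fr->next = tn;
    return true;
}
```

The single lock serializes every move, even between unrelated lists, which is the kind of cost a multi-location ATOMIC event is meant to avoid.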

Re: Memory dependency microbenchmark

<uj8ef9$2u1i9$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35070&group=comp.arch#35070
Path: i2pn2.org!rocksolid2!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Fri, 17 Nov 2023 11:21:13 -0800
Organization: A noiseless patient Spider
Lines: 76
Message-ID: <uj8ef9$2u1i9$1@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com>
<uiu3dp$svfh$1@dont-email.me>
<u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>
<uj39sg$1stm2$1@dont-email.me> <jwvleaw1j9b.fsf-monnier+comp.arch@gnu.org>
<491d288a9d5a47236c979622b79db056@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 17 Nov 2023 19:21:14 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="e3dfad7c4c93b3524f06016549f1a58c";
logging-data="3081801"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/fk0xzlGA8QIzNqXKdEJ8EfMgg+G2tJIk="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:mz7YltMuV6KeTbRLKV3VmRZ+n1w=
Content-Language: en-US
In-Reply-To: <491d288a9d5a47236c979622b79db056@news.novabbs.com>
 by: Chris M. Thomasson - Fri, 17 Nov 2023 19:21 UTC

On 11/17/2023 10:37 AM, MitchAlsup wrote:
> Stefan Monnier wrote:
>
>>> As long as you fully analyze your program, ensure all multithreaded
>>> accesses
>>> are only through atomic variables, and you label every access to an
>>> atomic variable properly (although my point is: exactly what should that
>>> be??), then there is no problem.
>
>> BTW, the above sounds daunting when writing in C because you have to do
>> that analysis yourself, but there are programming languages out there
>> which will do that analysis for you as part of type checking.
>> I'm thinking here of languages like Rust or the STM library of
>> Haskell.  This also solves the problem that memory accesses can be
>> reordered by the compiler, since in that case the compiler is fully
>> aware of which accesses can be reordered and which can't.
>
>
>>         Stefan
> <
> I created the Exotic Synchronization Method such that you could just
> write the code needed to do the work, and then decorate those accesses
> which are participating in the ATOMIC event. So, lets say you want to
> move an element from one doubly linked list to another place in some
> other doubly linked list:: you would write::
> <
> BOOLEAN MoveElement( Element *fr, Element *to )
> {
>     fn = fr->next;
>     fp = fr->prev;
>     tn = to->next;
>
>
>
>     if( TRUE )
>     {
>              fp->next = fn;
>              fn->prev = fp;
>              to->next = fr;
>              tn->prev = fr;
>              fr->prev = to;
>              fr->next = tn;
>              return TRUE;
>     }
>     return FALSE;
> }
>
> In order to change this into a fully qualified ATOMIC event, the code
> is decorated as::
>
> BOOLEAN MoveElement( Element *fr, Element *to )
> {
>     esmLOCK( fn = fr->next );         // get data
>     esmLOCK( fp = fr->prev );
>     esmLOCK( tn = to->next );
>     esmLOCK( fn );                    // touch data
>     esmLOCK( fp );
>     esmLOCK( tn );
>     if( !esmINTERFERENCE() )
>     {
>              fp->next = fn;           // move the bits around
>              fn->prev = fp;
>              to->next = fr;
>              tn->prev = fr;
>              fr->prev = to;
>     esmLOCK( fr->next = tn );
>              return TRUE;
>     }
>     return FALSE;
> }
>
> Having a multiplicity of containers participate in an ATOMIC event
> is key to making ATOMIC stuff fast and needing fewer ATOMICs to to get
> the job(s) done.

Interesting. Any chance of livelock wrt esmINTERFERENCE()?

Re: Memory dependency microbenchmark

<uj8eho$2u1ih$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=35071&group=comp.arch#35071
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Fri, 17 Nov 2023 11:22:31 -0800
Organization: A noiseless patient Spider
Lines: 81
Message-ID: <uj8eho$2u1ih$1@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com>
<uiu3dp$svfh$1@dont-email.me>
<u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>
<uj39sg$1stm2$1@dont-email.me> <jwvleaw1j9b.fsf-monnier+comp.arch@gnu.org>
<491d288a9d5a47236c979622b79db056@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 17 Nov 2023 19:22:32 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="e3dfad7c4c93b3524f06016549f1a58c";
logging-data="3081809"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/QEI6Ry0PutFypcWEJYtMfY4rsAqFdr7A="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:mZA1bmOjSEesvP4Fm2PrhOG4Kxs=
Content-Language: en-US
In-Reply-To: <491d288a9d5a47236c979622b79db056@news.novabbs.com>
 by: Chris M. Thomasson - Fri, 17 Nov 2023 19:22 UTC

On 11/17/2023 10:37 AM, MitchAlsup wrote:
> Stefan Monnier wrote:
>
>>> As long as you fully analyze your program, ensure all multithreaded
>>> accesses
>>> are only through atomic variables, and you label every access to an
>>> atomic variable properly (although my point is: exactly what should that
>>> be??), then there is no problem.
>
>> BTW, the above sounds daunting when writing in C because you have to do
>> that analysis yourself, but there are programming languages out there
>> which will do that analysis for you as part of type checking.
>> I'm thinking here of languages like Rust or the STM library of
>> Haskell.  This also solves the problem that memory accesses can be
>> reordered by the compiler, since in that case the compiler is fully
>> aware of which accesses can be reordered and which can't.
>
>
>>         Stefan
> <
> I created the Exotic Synchronization Method such that you could just
> write the code needed to do the work, and then decorate those accesses
> which are participating in the ATOMIC event. So, lets say you want to
> move an element from one doubly linked list to another place in some
> other doubly linked list:: you would write::
> <
> BOOLEAN MoveElement( Element *fr, Element *to )
> {
>     fn = fr->next;
>     fp = fr->prev;
>     tn = to->next;
>
>
>
>     if( TRUE )
>     {
>              fp->next = fn;
>              fn->prev = fp;
>              to->next = fr;
>              tn->prev = fr;
>              fr->prev = to;
>              fr->next = tn;
>              return TRUE;
>     }
>     return FALSE;
> }
>
> In order to change this into a fully qualified ATOMIC event, the code
> is decorated as::
>
> BOOLEAN MoveElement( Element *fr, Element *to )
> {
>     esmLOCK( fn = fr->next );         // get data
>     esmLOCK( fp = fr->prev );
>     esmLOCK( tn = to->next );
>     esmLOCK( fn );                    // touch data
>     esmLOCK( fp );
>     esmLOCK( tn );
>     if( !esmINTERFERENCE() )
>     {
>              fp->next = fn;           // move the bits around
>              fn->prev = fp;
>              to->next = fr;
>              tn->prev = fr;
>              fr->prev = to;

>     esmLOCK( fr->next = tn );
      ^^^^^^^^^^^^^^^^^^^^^^^^

Why perform an esmLOCK here? Please correct my confusion.

>              return TRUE;
>     }
>     return FALSE;
> }
>
> Having a multiplicity of containers participate in an ATOMIC event
> is key to making ATOMIC stuff fast and needing fewer ATOMICs to get
> the job(s) done.

Re: Memory dependency microbenchmark

<uj8f3a$2u5i4$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=35072&group=comp.arch#35072

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Fri, 17 Nov 2023 11:31:53 -0800
Organization: A noiseless patient Spider
Lines: 63
Message-ID: <uj8f3a$2u5i4$1@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<uirke6$8hef$3@dont-email.me>
<36011a9597060e08d46db0eddfed0976@news.novabbs.com>
<uirqj3$9q9q$1@dont-email.me> <uisd4b$gd4s$1@dont-email.me>
<sQt4N.2604$rx%7.497@fx47.iad>
<0f58091b40e44bd01e446dd8335e647e@news.novabbs.com>
<uiu4ji$t4c2$1@dont-email.me> <AwX4N.38006$yvY5.31401@fx10.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 17 Nov 2023 19:31:54 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="e3dfad7c4c93b3524f06016549f1a58c";
logging-data="3085892"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+UH7sr3Xq1mzNLSaLr43yOO1H1HzgTrYQ="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:aDgqxpVdcudkD7VDdKYHZWOHe1E=
Content-Language: en-US
In-Reply-To: <AwX4N.38006$yvY5.31401@fx10.iad>
 by: Chris M. Thomasson - Fri, 17 Nov 2023 19:31 UTC

On 11/14/2023 8:10 PM, Branimir Maksimovic wrote:
> On 2023-11-13, Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>> On 11/13/2023 11:29 AM, MitchAlsup wrote:
>>> EricP wrote:
>>>
>>>> Kent Dickey wrote:
>>>>> In article <uirqj3$9q9q$1@dont-email.me>,
>>>>> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>>>>> On 11/12/2023 4:20 PM, MitchAlsup wrote:
>>>>>>> Chris M. Thomasson wrote:
>>>>>>>
>>>>>>>> On 11/12/2023 2:33 PM, Kent Dickey wrote:
>>>>>>>>> In article <uiri0a$85mp$2@dont-email.me>,
>>>>>>>>> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> A highly relaxed memory model can be beneficial for certain
>>>>>>>>>> workloads.
>>>>>>>>> I know a lot of people believe that statement to be true.  In
>>>>>>>>> general, it
>>>>>>>>> is assumed to be true without proof.
>>>>>>>>>
>>>>>>>>> I believe that statement to be false.  Can you describe some of
>>>>>>>>> these
>>>>>>>>> workloads?
>>>>>>>> Also, think about converting any sound lock-free algorithm's
>>>>>>>> finely tuned memory barriers to _all_ sequential consistency...
>>>>>>>> That would ruin performance right off the bat... Think about it.
>>>>>>> <
>>>>>>> Assuming you are willing to accept the wrong answer fast, rather
>>>>>>> than the right answer later. There are very few algorithms with
>>>>>>> this property.
>>>>>> That does not make any sense to me. Think of a basic mutex. It
>>>>>> basically requires an acquire membar for the lock and a release
>>>>>> membar for the unlock. On SPARC that would be:
>>>>>>
>>>>>> acquire = MEMBAR #LoadStore | #LoadLoad
>>>>>>
>>>>>> release = MEMBAR #LoadStore | #StoreStore
>>>>>>
>>>>>> Okay, fine. However, if I made them sequentially consistent, it
>>>>>> would require a damn #StoreLoad barrier for both acquire and
>>>>>> release. This is not good and should be avoided when possible.
>>>>>>
>>>>>> Also, RCU prides itself with not having to use any memory barriers
>>>>>> for its read side. If RCU was forced to use a seq cst, basically
>>>>>> LOCKED RMW or MFENCE on Intel, it would completely ruin its
>>>>>> performance.
>> [...]
>>> < I suspect SUN lost significant performance by always running TSO and
>>> it still required barrier instructions.
>> [...]
>>
>> Intel still requires an explicit membar for hazard pointers as-is. Sparc
>> in TSO mode still requires a membar for this. Sparc needs a #StoreLoad
>> wrt the store followed by a load to another location relationship to
>> hold. Intel needs a LOCK'ed atomic or MFENCE to handle this.
> I think the Apple M1 requires one, too. I had problems without a membar.
>
>

Afaict, any TSO model needs an explicit barrier if a store followed by a
load from another location must stay in that order. If the program does
not rely on this ordering, then it does not need such barriers on TSO.
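
A minimal sketch of the store-then-load case, using C11 atomics and a single hazard slot. The names `Node`, `shared`, `hazard`, `hp_acquire` and `hp_release` are made up for illustration and come from no particular library. The point is only where the full fence sits: between the store that publishes the hazard pointer and the load that re-checks the shared pointer. On x86, compilers typically turn that fence into MFENCE or a LOCK'ed instruction.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical names for illustration: `shared` is the pointer being
   protected, `hazard` is this thread's single published hazard slot. */
typedef struct Node { int value; } Node;

static _Atomic(Node *) shared;
static _Atomic(Node *) hazard;

/* Publish a hazard pointer for the node currently reachable via `shared`. */
static Node *hp_acquire(void)
{
    Node *p = atomic_load_explicit(&shared, memory_order_relaxed);
    for (;;) {
        /* Store: announce the node we intend to use. */
        atomic_store_explicit(&hazard, p, memory_order_relaxed);

        /* Store->load barrier. Without it, TSO allows the load below to be
           satisfied while the store above still sits in the store buffer. */
        atomic_thread_fence(memory_order_seq_cst);

        /* Load from another location: check `shared` still points at p. */
        Node *q = atomic_load_explicit(&shared, memory_order_acquire);
        if (q == p)
            return p;   /* hazard was visible before the re-check */
        p = q;          /* pointer changed underneath us, retry */
    }
}

/* Drop the hazard pointer once the node is no longer in use. */
static void hp_release(void)
{
    atomic_store_explicit(&hazard, NULL, memory_order_release);
}
```

A reclaimer would scan `hazard` before freeing a node; that side is omitted here. Code that never depends on a store being ordered before a later load from a different location, such as a plain acquire/release mutex, does not need the fence on TSO.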

Re: Memory dependency microbenchmark

<6ad3395120a7dc743e0ed740e77d09a1@news.novabbs.com>


https://news.novabbs.org/devel/article-flat.php?id=35073&group=comp.arch#35073

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Fri, 17 Nov 2023 20:49:08 +0000
Organization: novaBBS
Message-ID: <6ad3395120a7dc743e0ed740e77d09a1@news.novabbs.com>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com> <uiu3dp$svfh$1@dont-email.me> <u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com> <uj39sg$1stm2$1@dont-email.me> <jwvleaw1j9b.fsf-monnier+comp.arch@gnu.org> <491d288a9d5a47236c979622b79db056@news.novabbs.com> <uj8eho$2u1ih$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="1255915"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
X-Rslight-Site: $2y$10$.gStxAlEqYpyT7.Fa/1aj.7lVVfM2OvX//Owr/mWOvjo9EvJEYuf.
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
X-Spam-Level: *
 by: MitchAlsup - Fri, 17 Nov 2023 20:49 UTC

Chris M. Thomasson wrote:

> On 11/17/2023 10:37 AM, MitchAlsup wrote:
>> Stefan Monnier wrote:
>>
>>>> As long as you fully analyze your program, ensure all multithreaded
>>>> accesses
>>>> are only through atomic variables, and you label every access to an
>>>> atomic variable properly (although my point is: exactly what should that
>>>> be??), then there is no problem.
>>
>>> BTW, the above sounds daunting when writing in C because you have to do
>>> that analysis yourself, but there are programming languages out there
>>> which will do that analysis for you as part of type checking.
>>> I'm thinking here of languages like Rust or the STM library of
>>> Haskell.  This also solves the problem that memory accesses can be
>>> reordered by the compiler, since in that case the compiler is fully
>>> aware of which accesses can be reordered and which can't.
>>
>>
>>>         Stefan
>> <
>> I created the Exotic Synchronization Method such that you could just
>> write the code needed to do the work, and then decorate those accesses
>> which are participating in the ATOMIC event. So, let's say you want to
>> move an element from one doubly linked list to another place in some
>> other doubly linked list:: you would write::
>> <
>> BOOLEAN MoveElement( Element *fr, Element *to )
>> {
>>     fn = fr->next;
>>     fp = fr->prev;
>>     tn = to->next;
>>
>>
>>
>>     if( TRUE )
>>     {
>>              fp->next = fn;
>>              fn->prev = fp;
>>              to->next = fr;
>>              tn->prev = fr;
>>              fr->prev = to;
>>              fr->next = tn;
>>              return TRUE;
>>     }
>>     return FALSE;
>> }
>>
>> In order to change this into a fully qualified ATOMIC event, the code
>> is decorated as::
>>
>> BOOLEAN MoveElement( Element *fr, Element *to )
>> {
>>     esmLOCK( fn = fr->next );         // get data
>>     esmLOCK( fp = fr->prev );
>>     esmLOCK( tn = to->next );
>>     esmLOCK( fn );                    // touch data
>>     esmLOCK( fp );
>>     esmLOCK( tn );
>>     if( !esmINTERFERENCE() )
>>     {
>>              fp->next = fn;           // move the bits around
>>              fn->prev = fp;
>>              to->next = fr;
>>              tn->prev = fr;
>>              fr->prev = to;

>>     esmLOCK( fr->next = tn );
> ^^^^^^^^^^^^^^^^^^^^^^^^

> Why perform an esmLOCK here? Please correct my confusion.

esmLOCK on the last participating Store is what tells the HW that the ATOMIC
event is finished. So, the first esmLOCK begins the ATOMIC event and the
only esmLOCK on an outgoing (ST or PostPush) memory reference ends the event.

>>              return TRUE;
>>     }
>>     return FALSE;
>> }
>>
>> Having a multiplicity of containers participate in an ATOMIC event
>> is key to making ATOMIC stuff fast and needing fewer ATOMICs to get
>> the job(s) done.
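
For anyone who wants to compile and test the pointer shuffle without ESM hardware, here is a rough stand-in in plain C11. It is not ESM: a single global spinlock (the hypothetical `move_lock`) replaces the ATOMIC event, so every move serializes and there is no interference path that returns FALSE. The `Element` layout and the locals `fn`, `fp`, `tn` are declared explicitly here, which the quoted code leaves implicit. The sketch assumes circular lists with sentinel heads (so no pointer is NULL) and that `fr` and `to` are not adjacent.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Element {
    struct Element *next;
    struct Element *prev;
} Element;

/* Assumption: one global spinlock stands in for the ESM ATOMIC event. */
static atomic_flag move_lock = ATOMIC_FLAG_INIT;

/* Unlink fr from its list and insert it directly after to.
   Assumes circular lists (no NULL links) and fr not adjacent to to. */
static bool MoveElement(Element *fr, Element *to)
{
    /* Acquire on lock: later accesses cannot move above this point. */
    while (atomic_flag_test_and_set_explicit(&move_lock,
                                             memory_order_acquire))
        ;   /* spin until the lock is free */

    Element *fn = fr->next;     /* get data */
    Element *fp = fr->prev;
    Element *tn = to->next;

    fp->next = fn;              /* unlink fr from its old list */
    fn->prev = fp;
    to->next = fr;              /* link fr in after to */
    tn->prev = fr;
    fr->prev = to;
    fr->next = tn;

    /* Release on unlock: all six stores are visible before the lock drops. */
    atomic_flag_clear_explicit(&move_lock, memory_order_release);
    return true;                /* the lock-based stand-in never fails */
}
```

The difference from the ESM version is the one the thread is about: here every caller contends on one lock even when the moves touch unrelated lists, whereas the ESM event only involves the cache lines named by the esmLOCK'ed accesses.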

