Rocksolid Light

devel / comp.arch / Re: Memory dependency microbenchmark

Subject (Author)
* Memory dependency microbenchmark (Anton Ertl)
+* Re: Memory dependency microbenchmark (EricP)
|`* Re: Memory dependency microbenchmark (Anton Ertl)
| `* Re: Memory dependency microbenchmark (EricP)
|  `* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|   `* Re: Memory dependency microbenchmark (EricP)
|    +* Re: Memory dependency microbenchmark (MitchAlsup)
|    |`* Re: Memory dependency microbenchmark (EricP)
|    | `- Re: Memory dependency microbenchmark (MitchAlsup)
|    `* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|     `* Re: Memory dependency microbenchmark (MitchAlsup)
|      `* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|       `* Re: Memory dependency microbenchmark (MitchAlsup)
|        `* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|         `* Re: Memory dependency microbenchmark (Kent Dickey)
|          +- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          +* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          |+* Re: Memory dependency microbenchmark (MitchAlsup)
|          ||`* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          || `* Re: Memory dependency microbenchmark (Kent Dickey)
|          ||  +* Re: Memory dependency microbenchmark (aph)
|          ||  |+- Re: Memory dependency microbenchmark (MitchAlsup)
|          ||  |`* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  | `* Re: Memory dependency microbenchmark (aph)
|          ||  |  +- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |  `* Re: Memory dependency microbenchmark (Kent Dickey)
|          ||  |   +- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |   +* Re: Memory dependency microbenchmark (MitchAlsup)
|          ||  |   |`* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |   | `* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |   |  `* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |   |   +- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |   |   `* Re: Memory dependency microbenchmark (MitchAlsup)
|          ||  |   |    `* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |   |     `- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |   +* Re: Memory dependency microbenchmark (aph)
|          ||  |   |`* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |   | `* Re: Memory dependency microbenchmark (aph)
|          ||  |   |  `- Re: Memory dependency microbenchmark (MitchAlsup)
|          ||  |   `* Re: Memory dependency microbenchmark (Stefan Monnier)
|          ||  |    `* Re: Memory dependency microbenchmark (MitchAlsup)
|          ||  |     +- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |     +* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |     |`* Re: Memory dependency microbenchmark (MitchAlsup)
|          ||  |     | `* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |     |  `* Re: Memory dependency microbenchmark (aph)
|          ||  |     |   +- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |     |   `- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |     +* Re: Memory dependency microbenchmark (Scott Lurndal)
|          ||  |     |`* Re: Memory dependency microbenchmark (MitchAlsup)
|          ||  |     | `* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |     |  `* Re: Memory dependency microbenchmark (MitchAlsup)
|          ||  |     |   `* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |     |    `* Re: Memory dependency microbenchmark (MitchAlsup)
|          ||  |     |     `* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |     |      `* Re: Memory dependency microbenchmark (MitchAlsup)
|          ||  |     |       `* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |     |        `* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |     |         `- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |     `- Re: Memory dependency microbenchmark (Stefan Monnier)
|          ||  `* Re: Memory dependency microbenchmark (EricP)
|          ||   +* Re: Memory dependency microbenchmark (MitchAlsup)
|          ||   |`* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||   | `* Re: Memory dependency microbenchmark (Branimir Maksimovic)
|          ||   |  `- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||   `* Re: Memory dependency microbenchmark (Paul A. Clayton)
|          ||    +* Re: Memory dependency microbenchmark (Scott Lurndal)
|          ||    |+* Re: Memory dependency microbenchmark (MitchAlsup)
|          ||    ||`* Re: Memory dependency microbenchmark (EricP)
|          ||    || `- Re: Memory dependency microbenchmark (MitchAlsup)
|          ||    |`* Re: Memory dependency microbenchmark (Paul A. Clayton)
|          ||    | `- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||    `* Re: Memory dependency microbenchmark (EricP)
|          ||     +* Re: Memory dependency microbenchmark (aph)
|          ||     |`* Re: Memory dependency microbenchmark (EricP)
|          ||     | +* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||     | |`- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||     | `* Re: Memory dependency microbenchmark (aph)
|          ||     |  +* Re: Memory dependency microbenchmark (MitchAlsup)
|          ||     |  |+- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||     |  |`* Re: Memory dependency microbenchmark (EricP)
|          ||     |  | +- Re: Memory dependency microbenchmark (MitchAlsup)
|          ||     |  | +- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||     |  | `* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||     |  |  `* Re: Memory dependency microbenchmark (MitchAlsup)
|          ||     |  |   `- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||     |  `* Re: Memory dependency microbenchmark (EricP)
|          ||     |   `* Re: Memory dependency microbenchmark (aph)
|          ||     |    +* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||     |    |`* Re: Memory dependency microbenchmark (aph)
|          ||     |    | +* Re: Memory dependency microbenchmark (Terje Mathisen)
|          ||     |    | |`- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||     |    | `* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||     |    |  `- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||     |    `- Re: Memory dependency microbenchmark (EricP)
|          ||     `* Re: Memory dependency microbenchmark (Paul A. Clayton)
|          ||      `- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          |`* weak consistency and the supercomputer attitude (was: Memory dependency microbenchmark) (Anton Ertl)
|          | +- Re: weak consistency and the supercomputer attitude (Stefan Monnier)
|          | +- Re: weak consistency and the supercomputer attitude (MitchAlsup)
|          | `* Re: weak consistency and the supercomputer attitude (Paul A. Clayton)
|          `* Re: Memory dependency microbenchmark (MitchAlsup)
+* Re: Memory dependency microbenchmark (Chris M. Thomasson)
+- Re: Memory dependency microbenchmark (MitchAlsup)
+* Re: Memory dependency microbenchmark (Anton Ertl)
`* Alder Lake results for the memory dependency microbenchmark (Anton Ertl)

Re: Memory dependency microbenchmark

<45714dc87b018f0d32bae94fdbe3eeac@news.novabbs.com>


https://news.novabbs.org/devel/article-flat.php?id=35143&group=comp.arch#35143

 by: MitchAlsup - Tue, 21 Nov 2023 18:18 UTC

EricP wrote:

> MitchAlsup wrote:
>> Scott Lurndal wrote:
>>
>>> "Paul A. Clayton" <paaronclayton@gmail.com> writes:
>>>> On 11/13/23 1:22 PM, EricP wrote:
>>>>> Kent Dickey wrote:
>>>> [snip]
>>>>>> Thus, the people trapped in Relaxed Ordering Hell then push weird
>>>>>> schemes
>>>>>> on everyone else to try to come up with algorithms which need fewer
>>>>>> barriers. It's crazy.
>>>>>>
>>>>>> Relaxed Ordering is a mistake.
>>>>>>
>>>>>> Kent
>>>>>
>>>>> I suggest something different: the ability to switch between TSO and
>>>>> relaxed with non-privileged user mode instructions.
>>
>>>> Even with a multithreaded program, stack and TLS would be "thread
>>>> private" and not require the same consistency guarantees.
>>
>>> Why do you think that 'stack' would be thread private? It's
>>> quite common to allocate long-lived data structures on the
>>> stack and pass the address of the object to code that may
>>> be executing in the context of other threads. So long as the
>>> lifetime of
>>> the object extends beyond the last reference, of course.
>>
>> Even Thread Local Store is not private to the thread if the thread
>> creates a pointer into it and allows others to see the pointer.
>>
>> The only thing the HW can validate as non-shared is that portion of
>> the stack containing callee save registers (and the return address)
>> but only 2 known architectures have these chunks of memory in an
>> address space where threads cannot read-write-or-execute that chunk.

> The callee save area may be R-W-E page protected against it own thread
> but it doesn't prevent a privileged thread from concurrently accessing
> that save area (say to edit the stack to deliver a signal)
> so the same coherence applies there too.

My 66000 monitors the PTE used to translate accesses through the Call
Stack Pointer and page-faults unless RWE == 000. This prevents malicious
code from damaging the contract between caller and callee. There is
an SP the application uses to create and destroy local data, and there
is a separate CSP the application cannot even see; only ENTER, EXIT, and
RET instructions can use the CSP, and these verify that RWE == 000.

It is certainly true that a different set of mapping tables can access
anything those tables map.

Re: Memory dependency microbenchmark

<ujj9i4$10fss$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=35144&group=comp.arch#35144

 by: Chris M. Thomasson - Tue, 21 Nov 2023 22:04 UTC

On 11/21/2023 9:11 AM, EricP wrote:
> aph@littlepinkcloud.invalid wrote:
>> EricP <ThatWouldBeTelling@thevillage.com> wrote:
>>> But this is my point: in many programs there is no memory that
>>> you can point to and say it is always private to a single thread.
>>
>> That's more about program design. In a multi-threaded program this is
>> something you really should know.
>>
>>> And this is independent of language, it's to do with program structure.
>>>
>>> You can say a certain memory range is shared and guarded by locks,
>>> or shared and managed by lock-free code.
>>> And we can point to this because the code is modularized this way.
>>>
>>> But the opposite of 'definitely shared' is not 'definitely private',
>>> it's 'don't know' or 'sometimes'.
>>>
>>> Eg: Your code pops up a dialog box, prompts the user for an integer
>>> value,
>>> and writes that value to a program global variable.
>>> Do you know whether that dialog box is a separate thread?
>>> Should you care? What if the dialog starts out in the same thread,
>>> then a new release changes it to a separate thread.
>>> What if the variable is on the thread stack or in a heap?
>>>
>>> On a weak ordered system I definitely need to know because
>>> I need barriers to ensure I can read the variable properly.
>>
>> In this case, no, I don't think you do. Barriers only control the
>> ordering between accesses, not when they become visible, and here
>> there's only one access. If there are at least two, and you really
>> need to see one before the other, then you need a barrier.
>
> The barriers also ensure the various local buffers, pipelines and
> inbound and outbound comms command and reply message queues are drained.
> It ensures that the operations that came before it have reached
> their coherency point - the cache controller - and that any
> outstanding asynchronous operations are complete.
> And that in turn controls when values become visible.
>
> On a weak order cpu with no store ordering, the cpu is not required
> to propagate any store into the cache within any period of time.
> It can stash it in a write combine buffer waiting to see if more
> updates to the same line appear.
>
> Weak order requires a membar after a store to force it into the cache,
> triggering the coherence handshake which invalidates other copies,
> so that when remote cores reread a line they see the updated value.
>
> In other words, to retire the membar instruction the core must force the
> prior store values into the coherent cache making them globally visible.

Fwiw, it depends on the barrier. Wrt x86, think along the lines of
LOCK'ed RMW, MFENCE, SFENCE, LFENCE. Iirc, the *FENCE instructions are
for WB memory; also, I think MFENCE can be used as a non-WB membar as
well... Cannot remember right now, damn it.

On SPARC: #StoreLoad, #LoadStore, #LoadLoad, #StoreStore

I think there is another one on the SPARC that I am forgetting.

> The difference for TSO is that a store has implied membars before it to
> prevent it bypassing (executing before) older loads and stores.
>
>> And even on
>> a TSO machine, you're going to have to do something on both the reader
>> and the writer sides if you need ordering to be protected from a
>> compiler.
>>
>> Andrew.
>
> Compilers are a different discussion.
>
>

C++ compilers get it right because of std::atomic; a compiler barrier is
different from a memory barrier.

Re: Memory dependency microbenchmark

<-zadnY8X-dbzVsD4nZ2dnZfqnPudnZ2d@supernews.com>


https://news.novabbs.org/devel/article-flat.php?id=35148&group=comp.arch#35148

 by: aph@littlepinkcloud.invalid - Wed, 22 Nov 2023 09:35 UTC

EricP <ThatWouldBeTelling@thevillage.com> wrote:
> aph@littlepinkcloud.invalid wrote:
>> Barriers only control the ordering between accesses, not when they
>> become visible, and here there's only one access. If there are at
>> least two, and you really need to see one before the other, then
>> you need a barrier.
>
> The barriers also ensure the various local buffers, pipelines and
> inbound and outbound comms command and reply message queues are
> drained.

I'm surprised you say that. All they have to do is make it appear as
if this has been done. How it actually happens is up to the hardware
designer. In modern GBOOO CPUs, most barriers don't require everything
to be pushed to cache. (As I understand it, GBOOO designs can treat
the set of pending accesses as a sort of memory transaction, detect
conflicts, and roll back to the last point of coherence and replay
in-order. But I am not a hardware designer.)

> Weak order requires a membar after a store to force it into the cache,
> triggering the coherence handshake which invalidates other copies,
> so that when remote cores reread a line they see the updated value.
>
> In other words, to retire the membar instruction the core must force the
> prior store values into the coherent cache making them globally visible.

That's usually true for a StoreLoad, not for any of the others.

> The difference for TSO is that a store has implied membars before it to
> prevent it bypassing (executing before) older loads and stores.

>> And even on a TSO machine, you're going to have to do something on
>> both the reader and the writer sides if you need ordering to be
>> protected from a compiler.
>
> Compilers are a different discussion.

Not for the user. It's the same problem, with the same solution. It
makes no difference to a C programmer where the reordering comes from.

Andrew.

Re: Memory dependency microbenchmark

<26b59f3350c10e56a76da4c7f19cdc3a@news.novabbs.com>


https://news.novabbs.org/devel/article-flat.php?id=35151&group=comp.arch#35151

 by: MitchAlsup - Wed, 22 Nov 2023 20:38 UTC

aph@littlepinkcloud.invalid wrote:

> EricP <ThatWouldBeTelling@thevillage.com> wrote:
>> aph@littlepinkcloud.invalid wrote:
>>> Barriers only control the ordering between accesses, not when they
>>> become visible, and here there's only one access. If there are at
>>> least two, and you really need to see one before the other, then
>>> you need a barrier.
>>
>> The barriers also ensure the various local buffers, pipelines and
>> inbound and outbound comms command and reply message queues are
>> drained.

> I'm surprised you say that. All they have to do is make it appear as
> if this has been done. How it actually happens is up to the hardware
> designer. In modern GBOOO CPUs, most barriers don't require everything
> to be pushed to cache. (As I understand it, GBOOO designs can treat
> the set of pending accesses as a sort of memory transaction, detect
> conflicts, and roll back to the last point of coherence and replay
> in-order. But I am not a hardware designer.)

Yes, exactly, a MemBar sets a boundary where all younger memory references
of one type (or both) must wait to become visible until all older memory
references of the other type have become visible. Only 1 or 2 bits of state
per queued memory ref are altered by an explicit MemBar or an implicit barrier
(from a change in operating mode).

>> Weak order requires a membar after a store to force it into the cache,
>> triggering the coherence handshake which invalidates other copies,
>> so that when remote cores reread a line they see the updated value.

What if it is non-cacheable? Or MMIO? Or configuration space?
>>
>> In other words, to retire the membar instruction the core must force the
>> prior store values into the coherent cache making them globally visible.

Just Visible, the rest is not under SW control.

> That's usually true for a StoreLoad, not for any of the others.

>> The difference for TSO is that a store has implied membars before it to
>> prevent it bypassing (executing before) older loads and stores.

Which is why TSO is slow(er).

>>> And even on a TSO machine, you're going to have to do something on
>>> both the reader and the writer sides if you need ordering to be
>>> protected from a compiler.
>>
>> Compilers are a different discussion.

> Not for the user. It's the same problem, with the same solution. It
> makes no difference to a C programmer where the reordering comes from.

Somebody has to create the rules by which HW and SW cooperatively manage
memory order in the presence of a multiplicity of CPUs and devices.

> Andrew.

Re: Memory dependency microbenchmark

<ujluuo$1gjej$2@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=35153&group=comp.arch#35153

 by: Chris M. Thomasson - Wed, 22 Nov 2023 22:22 UTC

On 11/21/2023 2:04 PM, Chris M. Thomasson wrote:
> On 11/21/2023 9:11 AM, EricP wrote:
>> aph@littlepinkcloud.invalid wrote:
>>> EricP <ThatWouldBeTelling@thevillage.com> wrote:
>>>> But this is my point: in many programs there is no memory that
>>>> you can point to and say it is always private to a single thread.
>>>
>>> That's more about program design. In a multi-threaded program this is
>>> something you really should know.
>>>
>>>> And this is independent of language, it's to do with program structure.
>>>>
>>>> You can say a certain memory range is shared and guarded by locks,
>>>> or shared and managed by lock-free code.
>>>> And we can point to this because the code is modularized this way.
>>>>
>>>> But the opposite of 'definitely shared' is not 'definitely private',
>>>> it's 'don't know' or 'sometimes'.
>>>>
>>>> Eg: Your code pops up a dialog box, prompts the user for an integer
>>>> value,
>>>> and writes that value to a program global variable.
>>>> Do you know whether that dialog box is a separate thread?
>>>> Should you care? What if the dialog starts out in the same thread,
>>>> then a new release changes it to a separate thread.
>>>> What if the variable is on the thread stack or in a heap?
>>>>
>>>> On a weak ordered system I definitely need to know because
>>>> I need barriers to ensure I can read the variable properly.
>>>
>>> In this case, no, I don't think you do. Barriers only control the
>>> ordering between accesses, not when they become visible, and here
>>> there's only one access. If there are at least two, and you really
>>> need to see one before the other, then you need a barrier.
>>
>> The barriers also ensure the various local buffers, pipelines and
>> inbound and outbound comms command and reply message queues are drained.
>> It ensures that the operations that came before it have reached
>> their coherency point - the cache controller - and that any
>> outstanding asynchronous operations are complete.
>> And that in turn controls when values become visible.
>>
>> On a weak order cpu with no store ordering, the cpu is not required
>> to propagate any store into the cache within any period of time.
>> It can stash it in a write combine buffer waiting to see if more
>> updates to the same line appear.
>>
>> Weak order requires a membar after a store to force it into the
>> cache,
>> triggering the coherence handshake which invalidates other copies,
>> so that when remote cores reread a line they see the updated value.
>>
>> In other words, to retire the membar instruction the core must force the
>> prior store values into the coherent cache making them globally visible.
>
> Fwiw, it depends on the barrier. Wrt x86, think along the lines of
> LOCK'ED RMW, MFENCE, SFENCE, LFENCE. Iirc, the *FENCE instructions are
> for WB memory, also I think MFENCE can be used for non-wb membar as
> well... Cannot remember right now, damn it.
>
> On SPARC: #StoreLoad, #LoadStore, #LoadLoad, #StoreStore
>
> I think there is another one on the SPARC that I am forgetting.
>
>
>> The difference for TSO is that a store has implied membars before it to
>> prevent it bypassing (executing before) older loads and stores.
>>
>>> And even on
>>> a TSO machine, you're going to have to do something on both the reader
>>> and the writer sides if you need ordering to be protected from a
>>> compiler.
>>>
>>> Andrew.
>>
>> Compilers are a different discussion.
>>
>>
>
> C++ compilers get it right because of std::atomic; a compiler barrier is
> different from a memory barrier.
>

Also, let's not forget EIEIO on the PPC. ;^)

Re: Memory dependency microbenchmark

<ujlv3p$1gjej$3@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=35154&group=comp.arch#35154

 by: Chris M. Thomasson - Wed, 22 Nov 2023 22:24 UTC

On 11/22/2023 12:38 PM, MitchAlsup wrote:
> aph@littlepinkcloud.invalid wrote:
>
>> EricP <ThatWouldBeTelling@thevillage.com> wrote:
>>> aph@littlepinkcloud.invalid wrote:
>>>> Barriers only control the ordering between accesses, not when they
>>>> become visible, and here there's only one access. If there are at
>>>> least two, and you really need to see one before the other, then
>>>> you need a barrier.
>>>
>>> The barriers also ensure the various local buffers, pipelines and
>>> inbound and outbound comms command and reply message queues are
>>> drained.
>
>> I'm surprised you say that. All they have to do is make it appear as
>> if this has been done. How it actually happens is up to the hardware
>> designer. In modern GBOOO CPUs, most barriers don't require everything
>> to be pushed to cache. (As I understand it, GBOOO designs can treat
>> the set of pending accesses as a sort of memory transaction, detect
>> conflicts, and roll back to the last point of coherence and replay
>> in-order. But I am not a hardware designer.)
>
> Yes, exactly, a MemBar sets a boundary where all younger memory references
> of one type (or both) must wait to become visible until all older memory
> references of the other type have become visible. Only 1 or 2 bits of
> state per queued memory ref are altered by an explicit MemBar or an
> implicit barrier (from a change in operating mode).
[...]

There are different kinds of memory barriers. For instance, a mutex can
get by with #LoadStore ordering, whereas Dekker's algorithm as written
requires #StoreLoad. Also, think of eieio... ;^) Btw, have you heard of
asymmetric mutexes before?

Re: Memory dependency microbenchmark

<ujmpcd$1nuki$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=35157&group=comp.arch#35157

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Wed, 22 Nov 2023 21:53:16 -0800
Organization: A noiseless patient Spider
Lines: 226
Message-ID: <ujmpcd$1nuki$1@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com>
<uiu3dp$svfh$1@dont-email.me>
<u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>
<uj39sg$1stm2$1@dont-email.me> <jwvleaw1j9b.fsf-monnier+comp.arch@gnu.org>
<491d288a9d5a47236c979622b79db056@news.novabbs.com>
<7bT5N.54414$BbXa.29985@fx16.iad>
<47787581fe3402eb5170fda771088ef7@news.novabbs.com>
<ujbbdj$3fmi7$1@dont-email.me>
<cc3d63d482758c3fbc4b884d77cbf608@news.novabbs.com>
<ujdonm$3u8mv$2@dont-email.me>
<d0d4b97887504ac3464fa158e8a4d45a@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 23 Nov 2023 05:53:18 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="cbc38f4a0bd3a53c236d0e587a7b3e98";
logging-data="1833618"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19R24pwpnjHI30gIqkYW8YaBAvj/YQyP8w="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:230MWiGu/EHpIHuieaB9jcN9kPI=
Content-Language: en-US
In-Reply-To: <d0d4b97887504ac3464fa158e8a4d45a@news.novabbs.com>
 by: Chris M. Thomasson - Thu, 23 Nov 2023 05:53 UTC

On 11/19/2023 2:23 PM, MitchAlsup wrote:
> Chris M. Thomasson wrote:
>
>> On 11/19/2023 11:32 AM, MitchAlsup wrote:
>>> Chris M. Thomasson wrote:
>>>
>>>> On 11/17/2023 5:03 PM, MitchAlsup wrote:
>>>>> Scott Lurndal wrote:
>>>>>
>>>>>> mitchalsup@aol.com (MitchAlsup) writes:
>>>>>>> Stefan Monnier wrote:
>>>>>>>
>>>>>>>>> As long as you fully analyze your program, ensure all
>>>>>>>>> multithreaded accesses
>>>>>>>>> are only through atomic variables, and you label every access
>>>>>>>>> to an
>>>>>>>>> atomic variable properly (although my point is: exactly what
>>>>>>>>> should that
>>>>>>>>> be??), then there is no problem.
>>>>>>>
>>>>>>>> BTW, the above sounds daunting when writing in C because you
>>>>>>>> have to do
>>>>>>>> that analysis yourself, but there are programming languages out
>>>>>>>> there
>>>>>>>> which will do that analysis for you as part of type checking.
>>>>>>>> I'm thinking here of languages like Rust or the STM library of
>>>>>>>> Haskell.  This also solves the problem that memory accesses can be
>>>>>>>> reordered by the compiler, since in that case the compiler is fully
>>>>>>>> aware of which accesses can be reordered and which can't.
>>>>>>>
>>>>>>>
>>>>>>>>         Stefan
>>>>>>> <
>>>>>>> I created the Exotic Synchronization Method such that you could just
>>>>>>> write the code needed to do the work, and then decorate those
>>>>>>> accesses
>>>>>>> which are participating in the ATOMIC event. So, lets say you
>>>>>>> want to move an element from one doubly linked list to another
>>>>>>> place in some
>>>>>>> other doubly linked list:: you would write::
>>>>>>> <
>>>>>>> BOOLEAN MoveElement( Element *fr, Element *to )
>>>>>>> {
>>>>>>>     fn = fr->next;
>>>>>>>     fp = fr->prev;
>>>>>>>     tn = to->next;
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>     if( TRUE )
>>>>>>>     {
>>>>>>>              fp->next = fn;
>>>>>>>              fn->prev = fp;
>>>>>>>              to->next = fr;
>>>>>>>              tn->prev = fr;
>>>>>>>              fr->prev = to;
>>>>>>>              fr->next = tn;
>>>>>>>              return TRUE;
>>>>>>>     }
>>>>>>>     return FALSE;
>>>>>>> }
>>>>>>>
>>>>>>> In order to change this into a fully qualified ATOMIC event, the
>>>>>>> code
>>>>>>> is decorated as::
>>>>>>>
>>>>>>> BOOLEAN MoveElement( Element *fr, Element *to )
>>>>>>> {
>>>>>>>     esmLOCK( fn = fr->next );         // get data
>>>>>>>     esmLOCK( fp = fr->prev );
>>>>>>>     esmLOCK( tn = to->next );
>>>>>>>     esmLOCK( fn );                    // touch data
>>>>>>>     esmLOCK( fp );
>>>>>>>     esmLOCK( tn );
>>>>>>>     if( !esmINTERFERENCE() )
>>>>>>>     {
>>>>>>>              fp->next = fn;           // move the bits around
>>>>>>>              fn->prev = fp;
>>>>>>>              to->next = fr;
>>>>>>>              tn->prev = fr;
>>>>>>>              fr->prev = to;
>>>>>>>     esmLOCK( fr->next = tn );
>>>>>>>              return TRUE;
>>>>>>>     }
>>>>>>>     return FALSE;
>>>>>>> }
>>>>>>>
>>>>>>> Having a multiplicity of containers participate in an ATOMIC event
>>>>>>> is key to making ATOMIC stuff fast and needing fewer ATOMICs to
>>>>>>> to get the job(s) done.
>>>>>
>>>>>> That looks suspiciously like transactional memory.
>>>
>>>> Indeed, it does. Worried about live lock wrt esmINTERFERENCE().
>>>
>>> esmINTERFERENCE() is an actual instruction in My 66000 ISA. It is a
>>> conditional branch where the condition is delivered from the miss
>>> buffer (where I detect interference wrt participating cache lines.)
>
>> So, can false sharing on a participating cache line make
>> esmINTERFERENCE() return true?
>
>
>>>
>>>>>
>>>>> It has some flavors of such, but::
>>>>> it has no nesting,
>>>>> it has a strict limit of 8 participating cache lines,
>>>>> it automagically transfers control when disruptive interference is
>>>>> detected,
>>>>> it is subject to timeouts;
>>>>>
>>>>> But does have the property that all interested 3rd parties see
>>>>> participating
>>>>> memory only in the before or only in the completely after states.
>>>
>>>> Are you familiar with KCSS? K-Compare Single Swap?
>>>
>>>> https://people.csail.mit.edu/shanir/publications/K-Compare.pdf
>>>
>>> Easily done:
>>>      esmLOCK( c1 = p1->condition1 );
>>>      esmLOCK( c2 = p2->condition2 );
>>>      ...
>>>      if( c1 == C1 && c2 == C2 && ... )
>>>          ...
>>>          esmLOCK( some data );
>>>
>>> Esm was designed to allow any known synchronization means (in 2013)
>>> to be directly implemented in esm either inline or via subroutine
>>> calls.
>
>> I can see how that would work. The problem is that I am not exactly
>> sure how esmINTERFERENCE works internally... Can it detect/prevent
>> live lock?
>
> esmINTERFERENCE is a branch on interference instruction. This is a
> conditional
> branch instruction that queries whether any of the participating cache
> lines
> has seen a read-with-intent or coherent-invalidate. In effect the branch
> logic reaches out to the miss buffer and asks if any of the participating
> cache lines has been snooped-for-write:: and you can't do this in 2
> instructions or you lose ATOMICity.
>
> If it has, control is transferred and the ATOMIC event is failed
> If it has not, and all participating cache lines are present, then this
> core is allowed to NAK all requests to those participating cache lines
> {and control is not transferred}.
>
> So, you gain control over where flow goes on failure, and essentially
> commit the whole event to finish.
>
> So, while it does not eliminate live/dead-lock situations, it allows SW
> to be constructed to avoid live/dead-lock situations:: esmWHY() provides a
> value when an ATOMIC event fails: 0 means success, negative values are
> spurious (buffer overflows,...), while positives represent the number of
> competing threads; so the following case skips elements on a linked list
> to decrease future interference.
>
> Element* getElement( unsigned  Key )
> {
>     int count = 0;
>     for( p = structure.head; p ; p = p->next )
>     {
>          if( p->Key == Key )
>          {
>               if( count-- < 0 )
>               {
>                    esmLOCK( p );
>                    prev = p->prev;
>                    esmLOCK( prev );
>                    next = p->next;
>                    esmLOCK( next );
>                    if( !esmINTERFERENCE() )
>                    {
>                         prev->next = next;
>                         next->prev = prev;
>                         p->prev = NULL;
>                esmLOCK( p->next = NULL );
>                         return p;
>                    }
>                    else
>                    {
>                         count = esmWHY();
>                         p = structure.head;
>                    }
>               }
>          }
>     }
>     return NULL;
> }
>
> Doing ATOMIC things like this means one can take the BigO( n^3 ) activity
> that happens when a timer goes off and n threads all want access to the
> work queue, down to BigO( 3 ) yes=constant, but in practice it is reduced
> to BigO( ln( n ) ) when requesters arrive in random order at random time.
>
>> I remember hearing from my friend Joe Seigh, who worked at IBM, that
>> they had some sort of logic that would prevent live lock in a compare
>> and swap wrt their free pool manipulation logic. Iirc, it was somewhat
>> related to ABA, hard to remember right now, sorry. I need to find that
>> old thread back in comp.programming.threads.
>
> Depending on system size: there can be several system function units
> that grant "order" for ATOMIC events. These are useful for 64+node systems
> and unnecessary for less than 8-node systems. Disjoint memory spaces
> can use independent ATOMIC arbiters and whether they are in use or not is
> invisible to SW.
>
>>>
>>>> ?
>>>
>>>> Check this out the old thread:
>>>
>>>> https://groups.google.com/g/comp.arch/c/shshLdF1uqs/m/VLmZSCBGDTkJ
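The esmLOCK/esmINTERFERENCE primitives quoted above exist only in My 66000. For comparison, here is a hedged conventional sketch (mine, with illustrative names) of the same MoveElement operation on stock hardware: it needs a lock covering both sentinel-headed circular lists, where esm performs the same six stores lock-free and interested 3rd parties see only the before or after state.

```c
/* Lock-based analogue of the esm MoveElement example: unlink fr from
 * its ring and insert it after to.  A single lock covers both lists,
 * so the version below always "succeeds" (no interference branch). */
#include <pthread.h>

typedef struct Element {
    struct Element *next, *prev;
} Element;

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

static int MoveElement( Element *fr, Element *to )
{
    pthread_mutex_lock(&list_lock);
    Element *fn = fr->next;           /* get data */
    Element *fp = fr->prev;
    Element *tn = to->next;
    fp->next = fn;                    /* move the bits around */
    fn->prev = fp;
    to->next = fr;
    tn->prev = fr;
    fr->prev = to;
    fr->next = tn;
    pthread_mutex_unlock(&list_lock);
    return 1;
}
```

A usage sketch: with ring `a <-> e` and empty ring `b`, `MoveElement(&e, &b)` leaves `a` pointing at itself and `e` linked into `b`'s ring. The lock serializes all movers, which is exactly the O(n^3)-style convoying the esm design is meant to avoid.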


Re: Memory dependency microbenchmark

<bd114daeb51a98aea245c1a842342caf@news.novabbs.com>


https://news.novabbs.org/devel/article-flat.php?id=35158&group=comp.arch#35158

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Thu, 23 Nov 2023 16:33:17 +0000
Organization: novaBBS
Message-ID: <bd114daeb51a98aea245c1a842342caf@news.novabbs.com>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com> <uiu3dp$svfh$1@dont-email.me> <u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com> <uj39sg$1stm2$1@dont-email.me> <jwvleaw1j9b.fsf-monnier+comp.arch@gnu.org> <491d288a9d5a47236c979622b79db056@news.novabbs.com> <7bT5N.54414$BbXa.29985@fx16.iad> <47787581fe3402eb5170fda771088ef7@news.novabbs.com> <ujbbdj$3fmi7$1@dont-email.me> <cc3d63d482758c3fbc4b884d77cbf608@news.novabbs.com> <ujdonm$3u8mv$2@dont-email.me> <d0d4b97887504ac3464fa158e8a4d45a@news.novabbs.com> <ujmpcd$1nuki$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="1873661"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
X-Spam-Level: **
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
X-Rslight-Site: $2y$10$62uNw2bc1Qf1rmc6T/BrJeSnZtccWsU/E0G3FxZc507W/BK4NBxjm
 by: MitchAlsup - Thu, 23 Nov 2023 16:33 UTC

Chris M. Thomasson wrote:

> On 11/19/2023 2:23 PM, MitchAlsup wrote:
>> Chris M. Thomasson wrote:
>>
>> So, while it does not eliminate live/dead-lock situations, it allows SW
>> to be constructed to avoid live/dead-lock situations:: esmWHY() provides a
>> value when an ATOMIC event fails: 0 means success, negative values are
>> spurious (buffer overflows,...), while positives represent the number of
>> competing threads; so the following case skips elements on a linked list
>> to decrease future interference.
>>
>> Element* getElement( unsigned  Key )
>> {
>>     int count = 0;
>>     for( p = structure.head; p ; p = p->next )
>>     {
>>          if( p->Key == Key )
>>          {
>>               if( count-- < 0 )
>>               {
>>                    esmLOCK( p );
>>                    prev = p->prev;
>>                    esmLOCK( prev );
>>                    next = p->next;
>>                    esmLOCK( next );
>>                    if( !esmINTERFERENCE() )
>>                    {
>>                         prev->next = next;
>>                         next->prev = prev;
>>                         p->prev = NULL;
>>                esmLOCK( p->next = NULL );
>>                         return p;
>>                    }
>>                    else
>>                    {
>>                         count = esmWHY();
>>                         p = structure.head;
>>                    }
>>               }
>>          }
>>     }
>>     return NULL;
>> }
>>
>> Doing ATOMIC things like this means one can take the BigO( n^3 ) activity
>> that happens when a timer goes off and n threads all want access to the
>> work queue, down to BigO( 3 ) yes=constant, but in practice it is reduced
>> to BigO( ln( n ) ) when requesters arrive in random order at random time.
>>
>>> I remember hearing from my friend Joe Seigh, who worked at IBM, that
>>> they had some sort of logic that would prevent live lock in a compare
>>> and swap wrt their free pool manipulation logic. Iirc, it was somewhat
>>> related to ABA, hard to remember right now, sorry. I need to find that
>>> old thread back in comp.programming.threads.
>>
>> Depending on system size: there can be several system function units
>> that grant "order" for ATOMIC events. These are useful for 64+node systems
>> and unnecessary for less than 8-node systems. Disjoint memory spaces
>> can use independent ATOMIC arbiters and whether they are in use or not is
>> invisible to SW.
>>
>>>>
>>>>> ?
>>>>
>>>>> Check this out the old thread:
>>>>
>>>>> https://groups.google.com/g/comp.arch/c/shshLdF1uqs/m/VLmZSCBGDTkJ

> Humm, your arch seems pretty neat/interesting to me. I need to learn more
> about it. Can it be abused by a rogue thread that keeps altering
> cacheline(s) that are participating in the atomic block, so to speak?
> Anyway, I am busy with family time. Will get back to you.

While possible, it is a lot less likely than on a similar architecture
without any of the bells and whistles.

> Fwiw, here is some of my work:

> https://youtu.be/HwIkk9zENcg

Octopi in a box playing ball.

Re: Memory dependency microbenchmark

<ptM7N.70206$_Oab.45835@fx15.iad>


https://news.novabbs.org/devel/article-flat.php?id=35160&group=comp.arch#35160

Path: i2pn2.org!rocksolid2!news.neodome.net!weretis.net!feeder6.news.weretis.net!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx15.iad.POSTED!not-for-mail
From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <uirke6$8hef$3@dont-email.me> <36011a9597060e08d46db0eddfed0976@news.novabbs.com> <uirqj3$9q9q$1@dont-email.me> <uisd4b$gd4s$1@dont-email.me> <sQt4N.2604$rx%7.497@fx47.iad> <ujfvr0$cda9$1@dont-email.me> <5yN6N.8007$DADd.5269@fx38.iad> <NbacnZYbq7jFKcH4nZ2dnZfqn_idnZ2d@supernews.com> <Fx57N.20583$BSkc.9831@fx06.iad> <-zadnY8X-dbzVsD4nZ2dnZfqnPudnZ2d@supernews.com>
In-Reply-To: <-zadnY8X-dbzVsD4nZ2dnZfqnPudnZ2d@supernews.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 60
Message-ID: <ptM7N.70206$_Oab.45835@fx15.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Thu, 23 Nov 2023 18:03:01 UTC
Date: Thu, 23 Nov 2023 13:02:50 -0500
X-Received-Bytes: 4167
 by: EricP - Thu, 23 Nov 2023 18:02 UTC

aph@littlepinkcloud.invalid wrote:
> EricP <ThatWouldBeTelling@thevillage.com> wrote:
>> aph@littlepinkcloud.invalid wrote:
>>> Barriers only control the ordering between accesses, not when they
>>> become visible, and here there's only one access. If there are at
>>> least two, and you really need to see one before the other, then
>>> you need a barrier.
>> The barriers also ensure the various local buffers, pipelines and
>> inbound and outbound comms command and reply message queues are
>> drained.
>
> I'm surprised you say that. All they have to do is make it appear as
> if this has been done. How it actually happens is up to the hardware
> designer. In modern GBOOO CPUs, most barriers don't require everything
> to be pushed to cache. (As I understand it, GBOOO designs can treat
> the set of pending accesses as a sort of memory transaction, detect
> conflicts, and roll back to the last point of coherence and replay
> in-order. But I am not a hardware designer.)

I say that because that is what the Intel manual says for MFENCE and SFENCE,
"The processor ensures that every store prior to SFENCE is globally visible
before any store after SFENCE becomes globally visible"
and Arm64 for DSB "ensures that memory accesses that occur before the DSB
instruction have completed before the completion of the DSB instruction".

Perhaps you are thinking of Intel LFENCE and Arm64 DMB which are weaker
and do not require the prior operations to complete but just that they
are performed or observed. These appear to only look at the local LSQ.

WRT the draining for synchronization, I'm not sure what you think the
difference is between making it appear to be done and actually doing so.
Sure there may be some optimizations possible, but the net effect on the
externally observable cache state must be the same.

For store values to be globally visible the local node must receive the
cache line in exclusive state AND have received ACK's from all the prior
sharers that they have removed their copies from cache.

If multiple miss buffers allow multiple pending cache misses then all of
these outstanding operations must complete before the membar can retire.
To do that the cache controller must send all pending outbound command
messages and process all inbound replies, which means drain all the
message queues until the misses are complete, then apply pending
operations to the now resident cache lines.

There are likely ways to optimize the above, overlap some sequences,
but I don't see any steps that can be removed.

>> Weak order requires a membar after a store to force it into the cache,
>> triggering the coherence handshake which invalidates other copies,
>> so that when remote cores reread a line they see the updated value.
>>
>> In other words, to retire the membar instruction the core must force the
>> prior store values into the coherent cache making them globally visible.
>
> That's usually true for a StoreLoad, not for any of the others.

I was thinking of full membars as an example as the others
are subsets of it and may allow some optimizations.

Re: Memory dependency microbenchmark

<jcN7N.64676$cAm7.42877@fx18.iad>


https://news.novabbs.org/devel/article-flat.php?id=35161&group=comp.arch#35161

Path: i2pn2.org!i2pn.org!paganini.bofh.team!2.eu.feeder.erje.net!feeder.erje.net!newsreader4.netcologne.de!news.netcologne.de!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx18.iad.POSTED!not-for-mail
From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <uirke6$8hef$3@dont-email.me> <36011a9597060e08d46db0eddfed0976@news.novabbs.com> <uirqj3$9q9q$1@dont-email.me> <uisd4b$gd4s$1@dont-email.me> <sQt4N.2604$rx%7.497@fx47.iad> <ujfvr0$cda9$1@dont-email.me> <5yN6N.8007$DADd.5269@fx38.iad> <NbacnZYbq7jFKcH4nZ2dnZfqn_idnZ2d@supernews.com> <Fx57N.20583$BSkc.9831@fx06.iad> <-zadnY8X-dbzVsD4nZ2dnZfqnPudnZ2d@supernews.com> <26b59f3350c10e56a76da4c7f19cdc3a@news.novabbs.com>
In-Reply-To: <26b59f3350c10e56a76da4c7f19cdc3a@news.novabbs.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 56
Message-ID: <jcN7N.64676$cAm7.42877@fx18.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Thu, 23 Nov 2023 18:53:03 UTC
Date: Thu, 23 Nov 2023 13:52:33 -0500
X-Received-Bytes: 3690
 by: EricP - Thu, 23 Nov 2023 18:52 UTC

MitchAlsup wrote:
> aph@littlepinkcloud.invalid wrote:
>
>> EricP <ThatWouldBeTelling@thevillage.com> wrote:
>>> aph@littlepinkcloud.invalid wrote:
>>>> Barriers only control the ordering between accesses, not when they
>>>> become visible, and here there's only one access. If there are at
>>>> least two, and you really need to see one before the other, then
>>>> you need a barrier.
>>>
>>> The barriers also ensure the various local buffers, pipelines and
>>> inbound and outbound comms command and reply message queues are
>>> drained.
>
>> I'm surprised you say that. All they have to do is make it appear as
>> if this has been done. How it actually happens is up to the hardware
>> designer. In modern GBOOO CPUs, most barriers don't require everything
>> to be pushed to cache. (As I understand it, GBOOO designs can treat
>> the set of pending accesses as a sort of memory transaction, detect
>> conflicts, and roll back to the last point of coherence and replay
>> in-order. But I am not a hardware designer.)
>
> Yes, exactly, a MemBar sets a boundary where all younger memory references
> of one type (or both) must wait to become visible until all older memory
> references of the other Types have become visible. Only 1 or 2 bits of state
> per queued memory ref is altered by an explicit MemBar or an implicit barrier
> (from change in operating mode).
>
>>> Weak order requires a membar after a store to force it into the cache,
>>> triggering the coherence handshake which invalidates other copies,
>>> so that when remote cores reread a line they see the updated value.
>
> What if it is not-cacheable ?? or MMI/O ?? or configuration space ??
>>>
>>> In other words, to retire the membar instruction the core must force the
>>> prior store values into the coherent cache making them globally visible.
>
> Just Visible, the rest is not under SW control.

There are two kinds of barriers/fences (I don't know if there are official
terms for them): local bypass barriers and global completion barriers.

Bypass barriers restrict which younger ops in the local load-store queue
may bypass and start execution before older ops have made a value locally
visible.

Completion barriers block younger ops from starting execution before
older ops have completed and read or written globally visible values.

You appear to be referring to bypass barriers whereas I'm referring to
completion barriers which require globally visible results.

Re: Memory dependency microbenchmark

<8355cf8543ea812a70795520bdd797ef@news.novabbs.com>


https://news.novabbs.org/devel/article-flat.php?id=35165&group=comp.arch#35165

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Thu, 23 Nov 2023 21:14:34 +0000
Organization: novaBBS
Message-ID: <8355cf8543ea812a70795520bdd797ef@news.novabbs.com>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <uirke6$8hef$3@dont-email.me> <36011a9597060e08d46db0eddfed0976@news.novabbs.com> <uirqj3$9q9q$1@dont-email.me> <uisd4b$gd4s$1@dont-email.me> <sQt4N.2604$rx%7.497@fx47.iad> <ujfvr0$cda9$1@dont-email.me> <5yN6N.8007$DADd.5269@fx38.iad> <NbacnZYbq7jFKcH4nZ2dnZfqn_idnZ2d@supernews.com> <Fx57N.20583$BSkc.9831@fx06.iad> <-zadnY8X-dbzVsD4nZ2dnZfqnPudnZ2d@supernews.com> <26b59f3350c10e56a76da4c7f19cdc3a@news.novabbs.com> <jcN7N.64676$cAm7.42877@fx18.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="1896710"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
X-Rslight-Site: $2y$10$9zOv6icwu55M4Yulc7blfeUZLGRVJS1Bo360FluNaF0CojkOgBX8S
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
 by: MitchAlsup - Thu, 23 Nov 2023 21:14 UTC

EricP wrote:

> MitchAlsup wrote:
>> aph@littlepinkcloud.invalid wrote:
>>
>>> EricP <ThatWouldBeTelling@thevillage.com> wrote:
>>>> aph@littlepinkcloud.invalid wrote:
>>>>> Barriers only control the ordering between accesses, not when they
>>>>> become visible, and here there's only one access. If there are at
>>>>> least two, and you really need to see one before the other, then
>>>>> you need a barrier.
>>>>
>>>> The barriers also ensure the various local buffers, pipelines and
>>>> inbound and outbound comms command and reply message queues are
>>>> drained.
>>
>>> I'm surprised you say that. All they have to do is make it appear as
>>> if this has been done. How it actually happens is up to the hardware
>>> designer. In modern GBOOO CPUs, most barriers don't require everything
>>> to be pushed to cache. (As I understand it, GBOOO designs can treat
>>> the set of pending accesses as a sort of memory transaction, detect
>>> conflicts, and roll back to the last point of coherence and replay
>>> in-order. But I am not a hardware designer.)
>>
>> Yes, exactly, a MemBar sets a boundary where all younger memory references
>> of one type (or both) must wait to become visible until all older memory
>> references of the other Types have become visible. Only 1 or 2 bits of state
>> per queued memory ref is altered by an explicit MemBar or an implicit barrier
>> (from change in operating mode).
>>
>>>> Weak order requires a membar after a store to force it into the cache,
>>>> triggering the coherence handshake which invalidates other copies,
>>>> so that when remote cores reread a line they see the updated value.
>>
>> What if it is not-cacheable ?? or MMI/O ?? or configuration space ??
>>>>
>>>> In other words, to retire the membar instruction the core must force the
>>>> prior store values into the coherent cache making them globally visible.
>>
>> Just Visible, the rest is not under SW control.

> There are two kinds of barriers/fences (I don't know if there are official
> terms for them), which are local bypass barriers, and global completion
> barriers.

Processor order and Global order. One cannot reason even about single-threaded
sequential instruction execution (von Neumann) without processor order; similarly,
one cannot reason about Global (externally visible to a 3rd party) order without
"the memory model".

> Bypass barriers restrict which younger ops in the local load-store queue
> may bypass and start execution before older ops have made a value locally
> visible.

Processor order.

> Completion barriers block younger ops from starting execution before
> older ops have completed and read or written globally visible values.

Also processor order.

> You appear to be referring to bypass barriers whereas I'm referring to
> completion barriers which require globally visible results.

I attempt to describe the reasoning capabilities as seen by both the CPU
and by the interested 3rd party.

Re: Memory dependency microbenchmark

<ujogkj$2056e$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=35166&group=comp.arch#35166

Path: i2pn2.org!i2pn.org!news.neodome.net!news.mixmin.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: paaronclayton@gmail.com (Paul A. Clayton)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Thu, 23 Nov 2023 16:36:17 -0500
Organization: A noiseless patient Spider
Lines: 62
Message-ID: <ujogkj$2056e$1@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<uirke6$8hef$3@dont-email.me>
<36011a9597060e08d46db0eddfed0976@news.novabbs.com>
<uirqj3$9q9q$1@dont-email.me> <uisd4b$gd4s$1@dont-email.me>
<sQt4N.2604$rx%7.497@fx47.iad> <ujfvr0$cda9$1@dont-email.me>
<nFM6N.14353$Ubzd.11432@fx36.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 23 Nov 2023 21:36:20 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="51eaafb1da784ded83345eeb05351d9d";
logging-data="2102478"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+U0BWDejq/JTOdyF0IB/qSIywr2KgaLOQ="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.0
Cancel-Lock: sha1:xSIdfbU3OfWXAMTBuV5/8krxtcU=
X-Mozilla-News-Host: news://news.eternal-september.org
In-Reply-To: <nFM6N.14353$Ubzd.11432@fx36.iad>
 by: Paul A. Clayton - Thu, 23 Nov 2023 21:36 UTC

On 11/20/23 12:26 PM, Scott Lurndal wrote:
> "Paul A. Clayton" <paaronclayton@gmail.com> writes:
>> On 11/13/23 1:22 PM, EricP wrote:
>>> Kent Dickey wrote:
>> [snip]
>>>> Thus, the people trapped in Relaxed Ordering Hell then push
>>>> weird schemes
>>>> on everyone else to try to come up with algorithms which need fewer
>>>> barriers.  It's crazy.
>>>>
>>>> Relaxed Ordering is a mistake.
>>>>
>>>> Kent
>>>
>>> I suggest something different: the ability to switch between TSO and
>>> relaxed with non-privileged user mode instructions.
>
>> Even with a multithreaded program, stack and TLS would be "thread
>> private" and not require the same consistency guarantees.
>
> Why do you think that 'stack' would be thread private? It's
> quite common to allocate long-lived data structures on the
> stack and pass the address of the object to code that may
> be executing in the context of other threads. So long as the lifetime of
> the object extends beyond the last reference, of course.
>
> Objects allocated on the stack in 'main', for instance.

I thought (in my ignorance) that because the lifetime of
stack-allocated data is controlled by the call depth of the thread
that allocated it, such data would not be safe to share. Clearly, if one
could guarantee that the allocating thread did not reduce its
call depth to below that frame, it would be safe.

(An allocation made in an initialization stage could generally be
easy to guarantee not to be deallocated early — only being
deallocated on program end by the OS — but at that point the early
stack is in some sense not the same stack as for later uses.)

I knew that pointers were allocated for arrays and objects on the
stack and passed to a called function. That feels a bit icky to me
in that I conceive of a stack frame as being just for that
function, i.e., my mental model does not reflect practice. (Even
allocating an array for internal use in a stack frame is
inconsistent with my mental model which exclusively references off
the stack pointer with immediate offsets.)

I do not know how difficult it would be to establish a compilation
system that did not use "the" stack for such allocations.
Allocations to the stack have the advantage of simple management
(just adjust the stack pointer) with the constraint of simple
timing of allocation and free and the advantage of never missing
cache for the pointer. Providing a modest prefetched-on-context-
switch cache (like a Knapsack Cache proposed by Todd Austin for
reduced latency) would allow multiple stacks/regions to have such
benefits by placing the next allocation pointers there.

Effectively extending the register set in that way could have
other advantages. (Context switch overhead would increase.)
Providing an "L2 register" storage seems to have some attraction.
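A tiny C sketch (mine, not from the thread) of Scott's point above: a stack-allocated object can be shared across threads safely as long as the owning frame provably outlives the last access, here enforced by joining before the frame dies. The names `worker` and `demo` are illustrative.

```c
#include <pthread.h>
#include <stdatomic.h>

/* The worker updates an object that lives in its creator's stack frame. */
static void *worker(void *arg)
{
    atomic_int *counter = arg;            /* points into demo()'s frame */
    for (int i = 0; i < 1000; i++)
        atomic_fetch_add(counter, 1);
    return NULL;
}

/* Safe only because the frame outlives the thread: we join before return. */
int demo(void)
{
    atomic_int counter;                   /* stack-allocated, yet shared */
    atomic_init(&counter, 0);
    pthread_t t;
    pthread_create(&t, NULL, worker, &counter);
    pthread_join(t, NULL);                /* last access precedes frame death */
    return atomic_load(&counter);
}
```

If `demo` returned before the join, the worker would scribble on a dead frame, which is exactly the hazard discussed above.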

Re: Memory dependency microbenchmark

<ujp2dq$22atn$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=35171&group=comp.arch#35171

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: paaronclayton@gmail.com (Paul A. Clayton)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Thu, 23 Nov 2023 17:57:40 -0500
Organization: A noiseless patient Spider
Lines: 61
Message-ID: <ujp2dq$22atn$1@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<uirke6$8hef$3@dont-email.me>
<36011a9597060e08d46db0eddfed0976@news.novabbs.com>
<uirqj3$9q9q$1@dont-email.me> <uisd4b$gd4s$1@dont-email.me>
<sQt4N.2604$rx%7.497@fx47.iad> <ujfvr0$cda9$1@dont-email.me>
<5yN6N.8007$DADd.5269@fx38.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 24 Nov 2023 02:39:54 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="750a9ea91031198e4d1c78c1618cb47a";
logging-data="2173879"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/k2RINeTNsEM1FNjLmEObyufrNemyaUq8="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.0
Cancel-Lock: sha1:xaP+ZF9ivFg/4uUykGiLCZ7Fsxw=
In-Reply-To: <5yN6N.8007$DADd.5269@fx38.iad>
 by: Paul A. Clayton - Thu, 23 Nov 2023 22:57 UTC

On 11/20/23 1:26 PM, EricP wrote:
> Paul A. Clayton wrote:
>> On 11/13/23 1:22 PM, EricP wrote:
[snip]
>>> Where this approach could fail is the kind of laissez-faire
>>> sharing done by many apps, libraries, and OS's behind the
>>> scenes in the real world.
>>
>> Another possibility is for non-shared memory to be handled
>> differently. (This is similar to My 66000's handling of memory
>> types and things mentioned by Mitch Alsup here.)
>>
>> Even with a multithreaded program, stack and TLS would be "thread
>> private" and not require the same consistency guarantees.
>>
>> Various memory partitioning schemes theoretically can provide
>> similar benefits for systems with shared memory controllers where
>> programs do not share most modifiable data with other programs.
>> Even something like a web hosting system might be able to benefit
>> from lack of coherence (much less consistency) between different
>> web hosts.
>
> But this is my point: in many programs there is no memory that
> you can point to and say it is always private to a single thread.
> And this is independent of language; it's to do with program
> structure.
>
> You can say a certain memory range is shared and guarded by locks,
> or shared and managed by lock-free code.
> And we can point to this because the code is modularized this way.
>
> But the opposite of 'definitely shared' is not 'definitely private',
> it's 'dont know' or 'sometimes'.

I see your point. I still think that a discipline could be
enforced (above the hardware level) to avoid "laissez-faire
sharing". However, not working until "software behaves properly"
is not a very useful design choice, except possibly in early
research efforts.

Even without such, it would still be possible for hardware to have
something like a coarse-grained snoop filter for the special cases
of localized use. (I think something like this was being proposed
here earlier.) Localization could include single thread/core and
larger groups. Larger groups would be more simply provided by
network topology locality, but one might want to spread threads to
maximize cache availability so supporting conceptual/logical
grouping and not just physical groupings might be desired.

(Side comment: at least one IBM POWER implementation had a
coherence state that indicated the cache block was only present in
a physically local set of caches. I think this implementation used
snooping, so this could significantly conserve interconnect
bandwidth.)

There might also be optimization opportunity for single-writer,
multiple reader memory. Such optimizations might have very limited
utility since such applications are not that common and most such
applications might have other memory locations that have
multiple writers. ("Single-writer" could also be applied to a
broader locality that still reduces synchronization delay.)
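As a hedged illustration of the single-writer/multiple-reader case, a seqlock is the classic software shape such hardware could favor: readers never store to shared memory at all. A minimal C11 sketch following Boehm's published recipe (names mine):

```c
#include <stdatomic.h>

/* One writer, many readers: readers never write shared memory. */
typedef struct {
    atomic_uint seq;     /* odd while an update is in flight */
    atomic_int  value;
} seqlock_t;

static void sl_write(seqlock_t *s, int v)
{
    unsigned q = atomic_load_explicit(&s->seq, memory_order_relaxed);
    atomic_store_explicit(&s->seq, q + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);          /* odd seq before data */
    atomic_store_explicit(&s->value, v, memory_order_relaxed);
    atomic_store_explicit(&s->seq, q + 2, memory_order_release);
}

static int sl_read(seqlock_t *s)
{
    for (;;) {
        unsigned q0 = atomic_load_explicit(&s->seq, memory_order_acquire);
        if (q0 & 1) continue;                            /* writer active */
        int v = atomic_load_explicit(&s->value, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);       /* data before recheck */
        if (atomic_load_explicit(&s->seq, memory_order_relaxed) == q0)
            return v;                                    /* consistent snapshot */
    }
}
```

Since readers only load, a single-writer-aware coherence protocol would never see reader-side invalidations on these lines.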

Re: Memory dependency microbenchmark

<ujpjkc$28487$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=35177&group=comp.arch#35177

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!news.nntp4.net!news.hispagatos.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Thu, 23 Nov 2023 23:33:31 -0800
Organization: A noiseless patient Spider
Lines: 67
Message-ID: <ujpjkc$28487$1@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<uirke6$8hef$3@dont-email.me>
<36011a9597060e08d46db0eddfed0976@news.novabbs.com>
<uirqj3$9q9q$1@dont-email.me> <uisd4b$gd4s$1@dont-email.me>
<sQt4N.2604$rx%7.497@fx47.iad> <ujfvr0$cda9$1@dont-email.me>
<5yN6N.8007$DADd.5269@fx38.iad>
<NbacnZYbq7jFKcH4nZ2dnZfqn_idnZ2d@supernews.com>
<Fx57N.20583$BSkc.9831@fx06.iad>
<-zadnY8X-dbzVsD4nZ2dnZfqnPudnZ2d@supernews.com>
<26b59f3350c10e56a76da4c7f19cdc3a@news.novabbs.com>
<jcN7N.64676$cAm7.42877@fx18.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 24 Nov 2023 07:33:32 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="7664f6b07543b36eac45e26755f90876";
logging-data="2363655"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19AQ55kwUs6fSAmiAtPP+BoFpXlXQt4m/c="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:hfFJaoVi7M8NrIsUgZksNTvlvFI=
In-Reply-To: <jcN7N.64676$cAm7.42877@fx18.iad>
Content-Language: en-US
 by: Chris M. Thomasson - Fri, 24 Nov 2023 07:33 UTC

On 11/23/2023 10:52 AM, EricP wrote:
> MitchAlsup wrote:
>> aph@littlepinkcloud.invalid wrote:
>>
>>> EricP <ThatWouldBeTelling@thevillage.com> wrote:
>>>> aph@littlepinkcloud.invalid wrote:
>>>>> Barriers only control the ordering between accesses, not when they
>>>>> become visible, and here there's only one access. If there are at
>>>>> least two, and you really need to see one before the other, then
>>>>> you need a barrier.
>>>>
>>>> The barriers also ensure the various local buffers, pipelines and
>>>> inbound and outbound comms command and reply message queues are
>>>> drained.
>>
>>> I'm surprised you say that. All they have to do is make it appear as
>>> if this has been done. How it actually happens is up to the hardware
>>> designer. In modern GBOOO CPUs, most barriers don't require everything
>>> to be pushed to cache. (As I understand it, GBOOO designs can treat
>>> the set of pending accesses as a sort of memory transaction, detect
>>> conflicts, and roll back to the last point of coherence and replay
>>> in-order. But I am not a hardware designer.)
>>
>> Yes, exactly, an MemBar sets a boundary where all younger memory
>> references
>> of one type (or both) must wait to become visible until all older memory
>> references of the other Types have become visible. Only 1 or 2 bits of
>> state
>> per queued memory ref is altered by an explicit MemBar or an implicit
>> barrier
>> (from change in operating mode).
>>
>>>> Weak order requires a membar after a store to force it into the
>>>> cache,
>>>> triggering the coherence handshake which invalidates other copies,
>>>> so that when remote cores reread a line they see the updated value.
>>
>> What if it is not-cacheable ?? or MMI/O ?? or configuration space ??
>>>>
>>>> In other words, to retire the membar instruction the core must force
>>>> the
>>>> prior store values into the coherent cache making them globally
>>>> visible.
>>
>> Just Visible, the rest is not under SW control.
>
> There are two kinds of barriers/fences (I don't know if there are official
> terms for them), which are local bypass barriers, and global completion
> barriers.

Wrt SPARC, #StoreLoad vs #LoadStore: well, there is a difference in
performance. So, how would a #StoreLoad behave in Mitch's nice system?
Also, how would a #LoadStore act? All the same? I need to learn more
about his arch. Interesting.
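For what it's worth, a rough and deliberately conservative C11 rendering of the SPARC membar flavors might look like the sketch below (my guess at the analogues, not an official equivalence table):

```c
#include <stdatomic.h>

/* Conservative C11 analogues of the SPARC membar flavors.
 * An acquire fence gives LoadLoad|LoadStore; a release fence gives
 * StoreStore|LoadStore; both are stronger than the single flavor named. */
static inline void membar_loadload(void)   { atomic_thread_fence(memory_order_acquire); }
static inline void membar_loadstore(void)  { atomic_thread_fence(memory_order_acquire); }
static inline void membar_storestore(void) { atomic_thread_fence(memory_order_release); }

/* #StoreLoad is the expensive one: only a full barrier orders an older
 * store before a younger load (it must drain the store buffer). */
static inline void membar_storeload(void)  { atomic_thread_fence(memory_order_seq_cst); }
```

The asymmetry in cost shows up here too: three of the four collapse to cheap acquire/release fences, while #StoreLoad alone needs the sequentially consistent fence.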

>
> Bypass barriers restrict which younger ops in the local load-store queue
> may bypass and start execution before older ops have made a value locally
> visible.
>
> Completion barriers block younger ops from starting execution before
> older ops have completed and read or written globally visible values.
>
> You appear to be referring to bypass barriers whereas I'm referring to
> completion barriers which require globally visible results.
>

Re: Memory dependency microbenchmark

<ujpjoj$28487$2@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=35178&group=comp.arch#35178

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Thu, 23 Nov 2023 23:35:46 -0800
Organization: A noiseless patient Spider
Lines: 65
Message-ID: <ujpjoj$28487$2@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<uirke6$8hef$3@dont-email.me>
<36011a9597060e08d46db0eddfed0976@news.novabbs.com>
<uirqj3$9q9q$1@dont-email.me> <uisd4b$gd4s$1@dont-email.me>
<sQt4N.2604$rx%7.497@fx47.iad> <ujfvr0$cda9$1@dont-email.me>
<5yN6N.8007$DADd.5269@fx38.iad>
<NbacnZYbq7jFKcH4nZ2dnZfqn_idnZ2d@supernews.com>
<Fx57N.20583$BSkc.9831@fx06.iad>
<-zadnY8X-dbzVsD4nZ2dnZfqnPudnZ2d@supernews.com>
<26b59f3350c10e56a76da4c7f19cdc3a@news.novabbs.com>
<jcN7N.64676$cAm7.42877@fx18.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 24 Nov 2023 07:35:47 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="7664f6b07543b36eac45e26755f90876";
logging-data="2363655"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19vJqE1dr2v4RSEWv7mqZZJA8S3bKB+icg="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:vqaNl/DV3Dc1vQCQE2Ki1q5chZ8=
Content-Language: en-US
In-Reply-To: <jcN7N.64676$cAm7.42877@fx18.iad>
 by: Chris M. Thomasson - Fri, 24 Nov 2023 07:35 UTC

On 11/23/2023 10:52 AM, EricP wrote:
> MitchAlsup wrote:
>> aph@littlepinkcloud.invalid wrote:
>>
>>> EricP <ThatWouldBeTelling@thevillage.com> wrote:
>>>> aph@littlepinkcloud.invalid wrote:
>>>>> Barriers only control the ordering between accesses, not when they
>>>>> become visible, and here there's only one access. If there are at
>>>>> least two, and you really need to see one before the other, then
>>>>> you need a barrier.
>>>>
>>>> The barriers also ensure the various local buffers, pipelines and
>>>> inbound and outbound comms command and reply message queues are
>>>> drained.
>>
>>> I'm surprised you say that. All they have to do is make it appear as
>>> if this has been done. How it actually happens is up to the hardware
>>> designer. In modern GBOOO CPUs, most barriers don't require everything
>>> to be pushed to cache. (As I understand it, GBOOO designs can treat
>>> the set of pending accesses as a sort of memory transaction, detect
>>> conflicts, and roll back to the last point of coherence and replay
>>> in-order. But I am not a hardware designer.)
>>
>> Yes, exactly, an MemBar sets a boundary where all younger memory
>> references
>> of one type (or both) must wait to become visible until all older memory
>> references of the other Types have become visible. Only 1 or 2 bits of
>> state
>> per queued memory ref is altered by an explicit MemBar or an implicit
>> barrier
>> (from change in operating mode).
>>
>>>> Weak order requires a membar after a store to force it into the
>>>> cache,
>>>> triggering the coherence handshake which invalidates other copies,
>>>> so that when remote cores reread a line they see the updated value.
>>
>> What if it is not-cacheable ?? or MMI/O ?? or configuration space ??
>>>>
>>>> In other words, to retire the membar instruction the core must force
>>>> the
>>>> prior store values into the coherent cache making them globally
>>>> visible.
>>
>> Just Visible, the rest is not under SW control.
>
> There are two kinds of barriers/fences (I don't know if there are official
> terms for them), which are local bypass barriers, and global completion
> barriers.
>
> Bypass barriers restrict which younger ops in the local load-store queue
> may bypass and start execution before older ops have made a value locally
> visible.
>
> Completion barriers block younger ops from starting execution before
> older ops have completed and read or written globally visible values.
>
> You appear to be referring to bypass barriers whereas I'm referring to
> completion barriers which require globally visible results.
>

Basically, how to map the various membar ops onto an arch that can be
RMO. Assume the programmers have no problem with it... ;^o SPARC did it,
but is it worth it now? Is my knowledge of dealing with relaxed
systems, threads/processes and membars obsolete? shit man... ;^o

Re: Memory dependency microbenchmark

<ujpjt8$28487$3@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=35179&group=comp.arch#35179

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Thu, 23 Nov 2023 23:38:15 -0800
Organization: A noiseless patient Spider
Lines: 98
Message-ID: <ujpjt8$28487$3@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com>
<uiu3dp$svfh$1@dont-email.me>
<u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>
<uj39sg$1stm2$1@dont-email.me> <jwvleaw1j9b.fsf-monnier+comp.arch@gnu.org>
<491d288a9d5a47236c979622b79db056@news.novabbs.com>
<7bT5N.54414$BbXa.29985@fx16.iad>
<47787581fe3402eb5170fda771088ef7@news.novabbs.com>
<ujbbdj$3fmi7$1@dont-email.me>
<cc3d63d482758c3fbc4b884d77cbf608@news.novabbs.com>
<ujdonm$3u8mv$2@dont-email.me>
<d0d4b97887504ac3464fa158e8a4d45a@news.novabbs.com>
<ujmpcd$1nuki$1@dont-email.me>
<bd114daeb51a98aea245c1a842342caf@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 24 Nov 2023 07:38:16 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="7664f6b07543b36eac45e26755f90876";
logging-data="2363655"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/fAWMxV/58loRicRLpN2bnTEYKnKXepY4="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:kWWVBmqnuvRSW1iviQugNCs1QTw=
Content-Language: en-US
In-Reply-To: <bd114daeb51a98aea245c1a842342caf@news.novabbs.com>
 by: Chris M. Thomasson - Fri, 24 Nov 2023 07:38 UTC

On 11/23/2023 8:33 AM, MitchAlsup wrote:
> Chris M. Thomasson wrote:
>
>> On 11/19/2023 2:23 PM, MitchAlsup wrote:
>>> Chris M. Thomasson wrote:
>>>
>>> So, while it does not eliminate live/dead-lock situations, it allows SW
>>> to be constructed to avoid live/dead lock situations:: Why is a value
>>> which is provided when an ATOMIC event fails. 0 means success, negative
>>> values are spurious (buffer overflows,...) while positives represent
>>> the number of competing threads, so the following case, skips elements
>>> on a linked list to decrease future interference.
>>>
>>> Element* getElement( unSigned  Key )
>>> {
>>>      int count = 0;
>>>      for( p = structure.head; p ; p = p->next )
>>>      {
>>>           if( p->Key == Key )
>>>           {
>>>                if( count-- < 0 )
>>>                {
>>>                     esmLOCK( p );
>>>                     prev = p->prev;
>>>                     esmLOCK( prev );
>>>                     next = p->next;
>>>                     esmLOCK( next );
>>>                     if( !esmINTERFERENCE() )
>>>                     {
>>>                          prev->next = next;
>>>                          next->prev = prev;
>>>                          p->prev = NULL;
>>>                          esmLOCK( p->next = NULL );
>>>                          return p;
>>>                     }
>>>                     else
>>>                     {
>>>                          count = esmWHY();
>>>                          p = structure.head;
>>>                     }
>>>                }
>>>           }
>>>      }
>>>      return NULL;
>>> }
>>>
>>> Doing ATOMIC things like this means one can take the BigO( n^3 )
>>> activity
>>> that happens when a timer goes off and n threads all want access to the
>>> work queue, down to BigO( 3 ) yes=constant, but in practice it is
>>> reduced
>>> to BigO( ln( n ) ) when requesters arrive in random order at random
>>> time.
>>>
>>>> I remember hearing from my friend Joe Seigh, who worked at IBM, that
>>>> they had some sort of logic that would prevent live lock in a
>>>> compare and swap wrt their free pool manipulation logic. Iirc, it
>>>> was somewhat related to ABA, hard to remember right now, sorry. I
>>>> need to find that old thread back in comp.programming.threads.
>>>
>>> Depending on system size: there can be several system function units
>>> that grant "order" for ATOMIC events. These are useful for 64+node
>>> systems
>>> and unnecessary for less than 8-node systems. Disjoint memory spaces
>>> can use independent ATOMIC arbiters and whether they are in use or
>>> not is
>>> invisible to SW.
>>>
>>>>>
>>>>>> ?
>>>>>
>>>>>> Check this out the old thread:
>>>>>
>>>>>> https://groups.google.com/g/comp.arch/c/shshLdF1uqs/m/VLmZSCBGDTkJ
>
>> Humm, your arch seems pretty neat/interesting to me. I need to learn
>> more about it. Can it be abused with a rogue thread that keeps
>> altering cachelines that are participating in the atomic block, so
>> to speak? Anyway, I am busy with family time. Will get back to you.
>
> While possible, it is a lot less likely than on a similar architecture
> without any of the bells and whistles.

Iirc, Joe Seigh mentioned how CS on IBM systems would prevent live lock
by locking the bus or asserting a signal, ensuring that a
compare-and-swap would never get into a death spiral of always failing.
Iirc, Microsoft has something like this in its lock-free stack (SList)
or something. Cannot remember exactly right now. Sorry.

Joe Seigh mentioned internal docs, candy striped.
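Lacking that kind of hardware anti-livelock guarantee, the usual software stand-in is a CAS loop with bounded exponential backoff so contending threads separate in time. A rough C11 sketch (my code, not IBM's mechanism):

```c
#include <stdatomic.h>

/* Software stand-in for hardware anti-livelock: retry the CAS with
 * bounded exponential backoff so contending threads spread out in time
 * instead of repeatedly colliding on the same line. */
static int fetch_inc_backoff(atomic_int *p)
{
    int delay = 1;
    for (;;) {
        int old = atomic_load_explicit(p, memory_order_relaxed);
        if (atomic_compare_exchange_weak_explicit(
                p, &old, old + 1,
                memory_order_acq_rel, memory_order_relaxed))
            return old;                     /* CAS won: return prior value */
        for (volatile int i = 0; i < delay; i++)
            ;                               /* crude pause; real code would
                                               use a pause/yield hint      */
        if (delay < 1024)
            delay <<= 1;                    /* back off harder each failure */
    }
}
```

Backoff only makes starvation statistically unlikely; it cannot give the hard forward-progress guarantee that bus locking or an arbiter provides.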

>
>> Fwiw, here is some of my work:
>
>> https://youtu.be/HwIkk9zENcg
>
> Octopi in a box playing ball.

Re: Memory dependency microbenchmark

<ujpk3p$28487$4@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=35180&group=comp.arch#35180

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!news.nntp4.net!news.hispagatos.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Thu, 23 Nov 2023 23:41:44 -0800
Organization: A noiseless patient Spider
Lines: 105
Message-ID: <ujpk3p$28487$4@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com>
<uiu3dp$svfh$1@dont-email.me>
<u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>
<uj39sg$1stm2$1@dont-email.me> <jwvleaw1j9b.fsf-monnier+comp.arch@gnu.org>
<491d288a9d5a47236c979622b79db056@news.novabbs.com>
<7bT5N.54414$BbXa.29985@fx16.iad>
<47787581fe3402eb5170fda771088ef7@news.novabbs.com>
<ujbbdj$3fmi7$1@dont-email.me>
<cc3d63d482758c3fbc4b884d77cbf608@news.novabbs.com>
<ujdonm$3u8mv$2@dont-email.me>
<d0d4b97887504ac3464fa158e8a4d45a@news.novabbs.com>
<ujmpcd$1nuki$1@dont-email.me>
<bd114daeb51a98aea245c1a842342caf@news.novabbs.com>
<ujpjt8$28487$3@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 24 Nov 2023 07:41:46 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="7664f6b07543b36eac45e26755f90876";
logging-data="2363655"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18GIWaBztguIfqk0PXiIoXMYIaHUXmVhq8="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:k/V9VaGDYlAdNk5A/33UwRSKsI0=
In-Reply-To: <ujpjt8$28487$3@dont-email.me>
Content-Language: en-US
 by: Chris M. Thomasson - Fri, 24 Nov 2023 07:41 UTC

On 11/23/2023 11:38 PM, Chris M. Thomasson wrote:
> On 11/23/2023 8:33 AM, MitchAlsup wrote:
>> Chris M. Thomasson wrote:
>>
>>> On 11/19/2023 2:23 PM, MitchAlsup wrote:
>>>> Chris M. Thomasson wrote:
>>>>
>>>> So, while it does not eliminate live/dead-lock situations, it allows SW
>>>> to be constructed to avoid live/dead lock situations:: Why is a value
>>>> which is provided when an ATOMIC event fails. 0 means success, negative
>>>> values are spurious (buffer overflows,...) while positives represent
>>>> the number of competing threads, so the following case, skips elements
>>>> on a linked list to decrease future interference.
>>>>
>>>> Element* getElement( unSigned  Key )
>>>> {
>>>>      int count = 0;
>>>>      for( p = structure.head; p ; p = p->next )
>>>>      {
>>>>           if( p->Key == Key )
>>>>           {
>>>>                if( count-- < 0 )
>>>>                {
>>>>                     esmLOCK( p );
>>>>                     prev = p->prev;
>>>>                     esmLOCK( prev );
>>>>                     next = p->next;
>>>>                     esmLOCK( next );
>>>>                     if( !esmINTERFERENCE() )
>>>>                     {
>>>>                          prev->next = next;
>>>>                          next->prev = prev;
>>>>                          p->prev = NULL;
>>>>                          esmLOCK( p->next = NULL );
>>>>                          return p;
>>>>                     }
>>>>                     else
>>>>                     {
>>>>                          count = esmWHY();
>>>>                          p = structure.head;
>>>>                     }
>>>>                }
>>>>           }
>>>>      }
>>>>      return NULL;
>>>> }
>>>>
>>>> Doing ATOMIC things like this means one can take the BigO( n^3 )
>>>> activity
>>>> that happens when a timer goes off and n threads all want access to the
>>>> work queue, down to BigO( 3 ) yes=constant, but in practice it is
>>>> reduced
>>>> to BigO( ln( n ) ) when requesters arrive in random order at random
>>>> time.
>>>>
>>>>> I remember hearing from my friend Joe Seigh, who worked at IBM,
>>>>> that they had some sort of logic that would prevent live lock in a
>>>>> compare and swap wrt their free pool manipulation logic. Iirc, it
>>>>> was somewhat related to ABA, hard to remember right now, sorry. I
>>>>> need to find that old thread back in comp.programming.threads.
>>>>
>>>> Depending on system size: there can be several system function units
>>>> that grant "order" for ATOMIC events. These are useful for 64+node
>>>> systems
>>>> and unnecessary for less than 8-node systems. Disjoint memory spaces
>>>> can use independent ATOMIC arbiters and whether they are in use or
>>>> not is
>>>> invisible to SW.
>>>>
>>>>>>
>>>>>>> ?
>>>>>>
>>>>>>> Check this out the old thread:
>>>>>>
>>>>>>> https://groups.google.com/g/comp.arch/c/shshLdF1uqs/m/VLmZSCBGDTkJ
>>
>>> Humm, your arch seems pretty neat/interesting to me. I need to learn
>>> more about it. Can it be abused with a rogue thread that keeps
>>> altering cachelines that are participating in the atomic block,
>>> so to speak? Anyway, I am busy with family time. Will get back to you.
>>
>> While possible, it is a lot less likely than on a similar architecture
>> without any of the bells and whistles.
>
> Iirc, Joe Seigh mentioned how CS on IBM systems would prevent live lock
> by locking the bus or asserting a signal, ensuring that a
> compare-and-swap would never get into a death spiral of always failing.
> Iirc, Microsoft has something like this in its lock-free stack (SList)
> or something. Cannot remember exactly right now. Sorry.

Wrt Microsoft's SList, I think it goes into the kernel to handle memory
reclamation issues, ABA, and such...

>
> Joe Seigh mentioned internal docs, candy striped.
>
>>
>>> Fwiw, here is some of my work:
>>
>>> https://youtu.be/HwIkk9zENcg
>>
>> Octopi in a box playing ball.
>

Re: Memory dependency microbenchmark

<ujpk8s$28487$5@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=35181&group=comp.arch#35181

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Thu, 23 Nov 2023 23:44:27 -0800
Organization: A noiseless patient Spider
Lines: 67
Message-ID: <ujpk8s$28487$5@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<uirke6$8hef$3@dont-email.me>
<36011a9597060e08d46db0eddfed0976@news.novabbs.com>
<uirqj3$9q9q$1@dont-email.me> <uisd4b$gd4s$1@dont-email.me>
<sQt4N.2604$rx%7.497@fx47.iad> <ujfvr0$cda9$1@dont-email.me>
<nFM6N.14353$Ubzd.11432@fx36.iad> <ujogkj$2056e$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 24 Nov 2023 07:44:29 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="7664f6b07543b36eac45e26755f90876";
logging-data="2363655"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19HDwXDpUsy2idf+HpKbSia4Hre/45n06E="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:sAMiBAVw+Ap5mO0RX2zPbo6wWjM=
Content-Language: en-US
In-Reply-To: <ujogkj$2056e$1@dont-email.me>
 by: Chris M. Thomasson - Fri, 24 Nov 2023 07:44 UTC

On 11/23/2023 1:36 PM, Paul A. Clayton wrote:
> On 11/20/23 12:26 PM, Scott Lurndal wrote:
>> "Paul A. Clayton" <paaronclayton@gmail.com> writes:
>>> On 11/13/23 1:22 PM, EricP wrote:
>>>> Kent Dickey wrote:
>>> [snip]
>>>>> Thus, the people trapped in Relaxed Ordering Hell then push
>>>>> weird schemes
>>>>> on everyone else to try to come up with algorithms which need fewer
>>>>> barriers.  It's crazy.
>>>>>
>>>>> Relaxed Ordering is a mistake.
>>>>>
>>>>> Kent
>>>>
>>>> I suggest something different: the ability to switch between TSO and
>>>> relaxed with non-privileged user mode instructions.
>>
>>> Even with a multithreaded program, stack and TLS would be "thread
>>> private" and not require the same consistency guarantees.
>>
>> Why do you think that 'stack' would be thread private?  It's
>> quite common to allocate long-lived data structures on the
>> stack and pass the address of the object to code that may
>> be executing in the context of other threads.   So long as the
>> lifetime of
>> the object extends beyond the last reference, of course.
>>
>> Objects allocated on the stack in 'main', for instance.
>
> I thought (in my ignorance) that because the lifetime of stack-allocated
> data is controlled by the thread (call depth) that allocated that data
> it would not be thread safe. Clearly if one could guarantee that the
> allocating thread did not reduce its
> call depth to below that frame, it would be safe.
>
> (An allocation made in an initialization stage could generally be
> easy to guarantee not to be deallocated early — only being
> deallocated on program end by the OS — but at that point the early
> stack is in some sense not the same stack as for later uses.)
>
> I knew that pointers to arrays and objects allocated on the
> stack were passed to called functions. That feels a bit icky to me
> in that I conceive of a stack frame as being just for that
> function, i.e., my mental model does not reflect practice. (Even
> allocating an array for internal use in a stack frame is
> inconsistent with my mental model which exclusively references off
> the stack pointer with immediate offsets.)
>
> I do not know how difficult it would be to establish a compilation
> system that did not use "the" stack for such allocations.
> Allocations to the stack have the advantage of simple management
> (just adjust the stack pointer) with the constraint of simple
> timing of allocation and free and the advantage of never missing
> cache for the pointer. Providing a modest prefetched-on-context-
> switch cache (like a Knapsack Cache proposed by Todd Austin for
> reduced latency) would allow multiple stacks/regions to have such
> benefits by placing the next allocation pointers there.
>
> Effectively extending the register set in that way could have
> other advantages. (Context switch overhead would increase.)
> Providing an "L2 register" storage seems to have some attraction.
>
>

I did a memory allocator that was completely based on thread stacks.
Iirc, it was on Quadros. A long time ago!
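A guess at the shape of such an allocator: a per-thread bump arena where allocation is just a pointer adjustment and frees are LIFO, mirroring stack-frame discipline. All names here are illustrative, not from the original Quadros code:

```c
#include <stddef.h>
#include <stdint.h>

/* Per-thread "stack" allocator: allocation bumps a pointer, and frees
 * must come in LIFO order, exactly like leaving stack frames. */
typedef struct {
    uint8_t *base, *top, *limit;
} tstack_t;

static void tstack_init(tstack_t *s, void *buf, size_t len)
{
    s->base = s->top = buf;
    s->limit = s->base + len;
}

static void *tstack_alloc(tstack_t *s, size_t n)
{
    n = (n + 15u) & ~(size_t)15u;           /* keep 16-byte alignment */
    if (s->limit - s->top < (ptrdiff_t)n)
        return NULL;                        /* arena exhausted */
    void *p = s->top;
    s->top += n;
    return p;
}

static void tstack_free_to(tstack_t *s, void *mark)
{
    s->top = mark;                          /* LIFO pop, like popping frames */
}
```

Because each arena is touched by one thread only, none of the consistency questions in this thread arise for its metadata; that is the attraction.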

Re: Memory dependency microbenchmark

<ujpkv4$28487$6@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=35182&group=comp.arch#35182

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Thu, 23 Nov 2023 23:56:19 -0800
Organization: A noiseless patient Spider
Lines: 77
Message-ID: <ujpkv4$28487$6@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<uirke6$8hef$3@dont-email.me>
<36011a9597060e08d46db0eddfed0976@news.novabbs.com>
<uirqj3$9q9q$1@dont-email.me> <uisd4b$gd4s$1@dont-email.me>
<sQt4N.2604$rx%7.497@fx47.iad> <ujfvr0$cda9$1@dont-email.me>
<5yN6N.8007$DADd.5269@fx38.iad> <ujp2dq$22atn$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 24 Nov 2023 07:56:20 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="7664f6b07543b36eac45e26755f90876";
logging-data="2363655"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19OCh2Luy0vgxpTUUkMZmA8bM0PlQ9Gdhk="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:L616ol/cVKvmc/DLqwOo9bel1k4=
Content-Language: en-US
In-Reply-To: <ujp2dq$22atn$1@dont-email.me>
 by: Chris M. Thomasson - Fri, 24 Nov 2023 07:56 UTC

On 11/23/2023 2:57 PM, Paul A. Clayton wrote:
> On 11/20/23 1:26 PM, EricP wrote:
>> Paul A. Clayton wrote:
>>> On 11/13/23 1:22 PM, EricP wrote:
> [snip]
>>>> Where this approach could fail is the kind of laissez-faire sharing
>>>> done by many apps, libraries, and OS's behind the scenes in the real
>>>> world.
>>>
>>> Another possibility is for non-shared memory to be handled
>>> differently. (This is similar to My 66000's handling of memory
>>> types and things mentioned by Mitch Alsup here.)
>>>
>>> Even with a multithreaded program, stack and TLS would be "thread
>>> private" and not require the same consistency guarantees.
>>>
>>> Various memory partitioning schemes theoretically can provide
>>> similar benefits for systems with shared memory controllers where
>>> programs do not share most modifiable data with other programs.
>>> Even something like a web hosting system might be able to benefit
>>> from lack of coherence (much less consistency) between different
>>> web hosts.
>>
>> But this is my point: in many programs there is no memory that
>> you can point to and say it is always private to a single thread.
>> And this is independent of language; it's to do with program structure.
>>
>> You can say a certain memory range is shared and guarded by locks,
>> or shared and managed by lock-free code.
>> And we can point to this because the code is modularized this way.
>>
>> But the opposite of 'definitely shared' is not 'definitely private',
>> it's 'dont know' or 'sometimes'.
>
> I see your point. I still think that a discipline could be
> enforced (above the hardware level) to avoid "laissez-faire
> sharing". However, not working until "software behaves properly"
> is not a very useful design choice, except possibly in early
> research efforts.
>
> Even without such, it would still be possible for hardware to have
> something like a coarse-grained snoop filter for the special cases
> of localized use. (I think something like this was being proposed
> here earlier.) Localization could include single thread/core and
> larger groups. Larger groups would be more simply provided by
> network topology locality, but one might want to spread threads to
> maximize cache availability so supporting conceptual/logical
> grouping and not just physical groupings might be desired.
>
> (Side comment: at least one IBM POWER implementation had a
> coherence state that indicated the cache block was only present in
> a physically local set of caches. I think this implementation used
> snooping, so this could significantly conserve interconnect
> bandwidth.)
>
> There might also be optimization opportunity for single-writer,
> multiple reader memory.

Side note... Actually, there is a reason to create specialized queues
for communications between threads. Basically, it goes like:

single-producer/single-consumer
multi-producer/single-consumer
single-producer/multi-consumer
multi-producer/multi-consumer

Each one has a specific algorithm, specialized for its requirements.
Single-producer/single-consumer is very fast and, iirc, does not even
need #StoreLoad or #LoadStore; #LoadLoad and #StoreStore suffice even
in RMO mode on a SPARC.

> Such optimizations might have very limited
> utility since such applications are not that common and most such
> applications might have other memory locations that have
> multiple writers. ("Single-writer" could also be applied to a
> broader locality that still reduces synchronization delay.)

Re: Memory dependency microbenchmark

<ujpnba$28jgr$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=35183&group=comp.arch#35183

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!news.swapon.de!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Fri, 24 Nov 2023 00:36:57 -0800
Organization: A noiseless patient Spider
Lines: 128
Message-ID: <ujpnba$28jgr$1@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<82b3b3b710652e607dac6cec2064c90b@news.novabbs.com>
<uisdmn$gd4s$2@dont-email.me> <uiu4t5$t4c2$2@dont-email.me>
<uj3c29$1t9an$1@dont-email.me> <uj3d0a$1tb8u$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 24 Nov 2023 08:36:58 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="7664f6b07543b36eac45e26755f90876";
logging-data="2379291"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+TAEYJi17T34zcXyExYUrlbwd2i8UvSwE="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:fJk4Bf1J7+yC4j7b4gDSQp0jBfI=
Content-Language: en-US
In-Reply-To: <uj3d0a$1tb8u$1@dont-email.me>
 by: Chris M. Thomasson - Fri, 24 Nov 2023 08:36 UTC

On 11/15/2023 1:25 PM, Chris M. Thomasson wrote:
> On 11/15/2023 1:09 PM, Kent Dickey wrote:
>> In article <uiu4t5$t4c2$2@dont-email.me>,
>> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>> On 11/12/2023 9:54 PM, Kent Dickey wrote:
>>>> In article <82b3b3b710652e607dac6cec2064c90b@news.novabbs.com>,
>>>> MitchAlsup <mitchalsup@aol.com> wrote:
>>>>> Kent Dickey wrote:
>>>>>
>>>>>> In article <uiri0a$85mp$2@dont-email.me>,
>>>>>> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>>>>
>>>>>>> A highly relaxed memory model can be beneficial for certain
>>>>>>> workloads.
>>>>>
>>>>>> I know a lot of people believe that statement to be true.  In
>>>>>> general, it
>>>>>> is assumed to be true without proof.
>>>>> <
>>>>> In its most general case, relaxed order only provides a performance
>>>>> advantage
>>>>> when the code is single threaded.
>>>>
>>>> I believe a Relaxed Memory model provides a small performance
>>>> improvement
>>>> ONLY to simple in-order CPUs in an MP system (if you're a single CPU,
>>>> there's nothing to order).
>>>>
>>>> Relaxed Memory ordering provides approximately zero performance
>>>> improvement
>>>> to an OoO CPU, and in fact, might actually lower performance
>>>> (depends on
>>>> how barriers are done--if done poorly, it could be a big negative).
>>>>
>>>> Yes, the system designers of the world have said: let's slow down our
>>>> fastest most expensive most profitable CPUs, so we can speed up our
>>>> cheapest
>>>> lowest profit CPUs a few percent, and push a ton of work onto software
>>>> developers.
>>>>
>>>> It's crazy.
>>>>
>>>>>> I believe that statement to be false.  Can you describe some of these
>>>>>> workloads?
>>>>> <
>>>>> Relaxed memory order fails spectacularly when multiple threads are
>>>>> accessing
>>>>> data.
>>>>
>>>> Probably need to clarify with "accessing modified data".
>>>>
>>>> Kent
>>>
>>> Huh? So, C++ is crazy for allowing for std::memory_order_relaxed to even
>>> exist? I must be misunderstanding your point here. Sorry if I am. ;^o
>>
>> You have internalized weakly ordered memory, and you're having trouble
>> seeing beyond it.
>
> Really? Don't project yourself on me. Altering all of the memory
> barriers of a finely tuned lock-free algorithm to seq_cst is VERY bad.
>
>
>>
>> CPUs with weakly ordered memory are the ones that need all those flags.
>> Yes, you need the flags if you want to use those CPUs.  I'm pointing out:
>> we could all just require better memory ordering and get rid of all this
>> cruft.  Give the flag, don't give the flag, the program is still correct
>> and works properly.
>
> Huh? Just cruft? wow. Just because it seems hard for you does not mean
> we should eliminate it. Believe it or not there are people out there
> that know how to use memory barriers. I suppose you would use seq_cst to
> load each node of a lock-free stack iteration in a RCU read-side region.
> This is terrible! Realy bad, bad, BAD! Afaicvt, it kind a, sort a, seems
> like you do not have all that much experience with them. Humm...
>
>
>>
>> It's like FP denorms--it's generally been decided the hardware cost
>> to implement it is small, so hardware needs to support it at full speed.
>> No need to write code in a careful way to avoid denorms, to use funky
>> CPU-
>> specific calls to turn on flush-to-0, etc., it just works, we move on to
>> other topics.  But we still have flush-to-0 calls available--but you
>> don't
>> need to bother to use them.  In my opinion, memory ordering is much more
>> complex for programmers to handle.  I maintain it's actually so
>> complex most people cannot get it right in software for non-trivial
>> interactions.  I've found many hardware designers have a very hard time
>> reasoning about this as well when I report bugs (since the rules are so
>> complex and poorly described).  There are over 100 pages describing
>> memory
>> ordering in the Arm Architectural Reference Manual, and it is very
>> complex (Dependency through registers and memory; Basic Dependency;
>> Address Dependency; Data Dependency; Control Dependency; Pick Basic
>> dependency; Pick Address Dependency; Pick Data Dependency; Pick
>> Control Dependency, Pick Dependency...and this is just from the
>> definition
>> of terms).  It's all very abstract and difficult to follow.  I'll be
>> honest: I do not understand all of these rules, and I don't care to.
>> I know how to implement a CPU, so I know what they've done, and that's
>> much simpler to understand.  But writing a threaded application is much
>> more complex than it should be for software.
>>
>> The cost to do TSO is some out-of-order tracking structures need to get
>> a little bigger, and some instructions have to stay in queues longer
>> (which is why they may need to get bigger), and allow re-issuing loads
>> which now have stale data.  The difference between TSO and Sequential
>> Consistency is to just disallow loads seeing stores queued before they
>> write to the data cache (well, you can speculatively let loads happen,
>> but you need to be able to walk it back, which is not difficult).  This
>> is why I say the performance cost is low--normal code missing caches and
>> not being pestered by other CPUs can run at the same speed.  But when
>> other CPUs begin pestering us, the interference can all be worked out as
>> efficiently as possible using hardware, and barriers just do not
>> compete.
>
> Having access to fine grain memory barriers is a very good thing. Of
> course we can use C++ right now and make everything seq_cst, but that is
> moronic. Why would you want to use seq_cst everywhere when you do not
> have to? There are rather massive performance implications.
>
> Are you thinking about a magic arch that we cannot use right now?

The problem with using seq_cst all over the place on an x86 is that it
would need to use LOCK'ed RMW, even dummy ones, or MFENCE to get the
ordering right.

Re: Memory dependency microbenchmark

<ujpni1$28jgr$2@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=35184&group=comp.arch#35184

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Fri, 24 Nov 2023 00:40:32 -0800
Organization: A noiseless patient Spider
Lines: 140
Message-ID: <ujpni1$28jgr$2@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<82b3b3b710652e607dac6cec2064c90b@news.novabbs.com>
<uisdmn$gd4s$2@dont-email.me> <uiu4t5$t4c2$2@dont-email.me>
<uj3c29$1t9an$1@dont-email.me> <uj3d0a$1tb8u$1@dont-email.me>
<ujpnba$28jgr$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 24 Nov 2023 08:40:34 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="7664f6b07543b36eac45e26755f90876";
logging-data="2379291"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/n0r7z8fyQ85A0oiIvS3RRIqfeEGJtEGs="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:Smt2S+6kg4KE8JHHqp5UkpKasFk=
In-Reply-To: <ujpnba$28jgr$1@dont-email.me>
Content-Language: en-US
 by: Chris M. Thomasson - Fri, 24 Nov 2023 08:40 UTC

On 11/24/2023 12:36 AM, Chris M. Thomasson wrote:
> On 11/15/2023 1:25 PM, Chris M. Thomasson wrote:
>> On 11/15/2023 1:09 PM, Kent Dickey wrote:
>>> In article <uiu4t5$t4c2$2@dont-email.me>,
>>> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>>> On 11/12/2023 9:54 PM, Kent Dickey wrote:
>>>>> In article <82b3b3b710652e607dac6cec2064c90b@news.novabbs.com>,
>>>>> MitchAlsup <mitchalsup@aol.com> wrote:
>>>>>> Kent Dickey wrote:
>>>>>>
>>>>>>> In article <uiri0a$85mp$2@dont-email.me>,
>>>>>>> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>>>>>
>>>>>>>> A highly relaxed memory model can be beneficial for certain
>>>>>>>> workloads.
>>>>>>
>>>>>>> I know a lot of people believe that statement to be true.  In
>>>>>>> general, it
>>>>>>> is assumed to be true without proof.
>>>>>> <
>>>>>> In its most general case, relaxed order only provides a
>>>>>> performance advantage
>>>>>> when the code is single threaded.
>>>>>
>>>>> I believe a Relaxed Memory model provides a small performance
>>>>> improvement
>>>>> ONLY to simple in-order CPUs in an MP system (if you're a single CPU,
>>>>> there's nothing to order).
>>>>>
>>>>> Relaxed Memory ordering provides approximately zero performance
>>>>> improvement
>>>>> to an OoO CPU, and in fact, might actually lower performance
>>>>> (depends on
>>>>> how barriers are done--if done poorly, it could be a big negative).
>>>>>
>>>>> Yes, the system designers of the world have said: let's slow down our
>>>>> fastest most expensive most profitable CPUs, so we can speed up our
>>>>> cheapest
>>>>> lowest profit CPUs a few percent, and push a ton of work onto software
>>>>> developers.
>>>>>
>>>>> It's crazy.
>>>>>
>>>>>>> I believe that statement to be false.  Can you describe some of
>>>>>>> these
>>>>>>> workloads?
>>>>>> <
>>>>>> Relaxed memory order fails spectacularly when multiple threads are
>>>>>> accessing
>>>>>> data.
>>>>>
>>>>> Probably need to clarify with "accessing modified data".
>>>>>
>>>>> Kent
>>>>
>>>> Huh? So, C++ is crazy for allowing for std::memory_order_relaxed to
>>>> even
>>>> exist? I must be misunderstanding your point here. Sorry if I am. ;^o
>>>
>>> You have internalized weakly ordered memory, and you're having trouble
>>> seeing beyond it.
>>
>> Really? Don't project yourself on me. Altering all of the memory
>> barriers of a finely tuned lock-free algorithm to seq_cst is VERY bad.
>>
>>
>>>
>>> CPUs with weakly ordered memory are the ones that need all those flags.
>>> Yes, you need the flags if you want to use those CPUs.  I'm pointing
>>> out:
>>> we could all just require better memory ordering and get rid of all this
>>> cruft.  Give the flag, don't give the flag, the program is still correct
>>> and works properly.
>>
>> Huh? Just cruft? wow. Just because it seems hard for you does not mean
>> we should eliminate it. Believe it or not there are people out there
>> that know how to use memory barriers. I suppose you would use seq_cst
>> to load each node of a lock-free stack iteration in a RCU read-side
>> region. This is terrible! Really bad, bad, BAD! Afaict, it kinda,
>> sorta, seems like you do not have all that much experience with them.
>> Humm...
>>
>>
>>>
>>> It's like FP denorms--it's generally been decided the hardware cost
>>> to implement it is small, so hardware needs to support it at full speed.
>>> No need to write code in a careful way to avoid denorms, to use funky
>>> CPU-
>>> specific calls to turn on flush-to-0, etc., it just works, we move on to
>>> other topics.  But we still have flush-to-0 calls available--but you
>>> don't
>>> need to bother to use them.  In my opinion, memory ordering is much more
>>> complex for programmers to handle.  I maintain it's actually so
>>> complex most people cannot get it right in software for non-trivial
>>> interactions.  I've found many hardware designers have a very hard time
>>> reasoning about this as well when I report bugs (since the rules are so
>>> complex and poorly described).  There are over 100 pages describing
>>> memory
>>> ordering in the Arm Architectural Reference Manual, and it is very
>>> complex (Dependency through registers and memory; Basic Dependency;
>>> Address Dependency; Data Dependency; Control Dependency; Pick Basic
>>> dependency; Pick Address Dependency; Pick Data Dependency; Pick
>>> Control Dependency, Pick Dependency...and this is just from the
>>> definition
>>> of terms).  It's all very abstract and difficult to follow.  I'll be
>>> honest: I do not understand all of these rules, and I don't care to.
>>> I know how to implement a CPU, so I know what they've done, and that's
>>> much simpler to understand.  But writing a threaded application is much
>>> more complex than it should be for software.
>>>
>>> The cost to do TSO is some out-of-order tracking structures need to get
>>> a little bigger, and some instructions have to stay in queues longer
>>> (which is why they may need to get bigger), and allow re-issuing loads
>>> which now have stale data.  The difference between TSO and Sequential
>>> Consistency is to just disallow loads seeing stores queued before they
>>> write to the data cache (well, you can speculatively let loads happen,
>>> but you need to be able to walk it back, which is not difficult).  This
>>> is why I say the performance cost is low--normal code missing caches and
>>> not being pestered by other CPUs can run at the same speed.  But when
>>> other CPUs begin pestering us, the interference can all be worked out as
>>> efficiently as possible using hardware, and barriers just do not
>>> compete.
>>
>> Having access to fine grain memory barriers is a very good thing. Of
>> course we can use C++ right now and make everything seq_cst, but that
>> is moronic. Why would you want to use seq_cst everywhere when you do
>> not have to? There are rather massive performance implications.
>>
>> Are you thinking about a magic arch that we cannot use right now?
>
> The problem with using seq_cst all over the place on an x86 is that it
> would need to use LOCK'ed RMW, even dummy ones, or MFENCE to get the
> ordering right.

Afaict MFENCE is basically a #StoreLoad. A LOCK'ed RMW is a membar too,
but in an interesting way, because there is a word (or words) being
updated. I say word(s) because of cmpxchg8b on a 32-bit system, or
cmpxchg16b on a 64-bit system. The latter is basically a DWCAS, a
double-width compare-and-swap that works on two contiguous words. This
is different from a DCAS, which can work with two non-contiguous words.

Re: Memory dependency microbenchmark

<GFydnR-3-5Vs6_34nZ2dnZfqn_adnZ2d@supernews.com>


https://news.novabbs.org/devel/article-flat.php?id=35185&group=comp.arch#35185

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!feeder.usenetexpress.com!tr2.iad1.usenetexpress.com!69.80.99.27.MISMATCH!Xl.tags.giganews.com!local-2.nntp.ord.giganews.com!nntp.supernews.com!news.supernews.com.POSTED!not-for-mail
NNTP-Posting-Date: Fri, 24 Nov 2023 10:12:01 +0000
Sender: Andrew Haley <aph@zarquon.pink>
From: aph@littlepinkcloud.invalid
Subject: Re: Memory dependency microbenchmark
Newsgroups: comp.arch
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <uirke6$8hef$3@dont-email.me> <36011a9597060e08d46db0eddfed0976@news.novabbs.com> <uirqj3$9q9q$1@dont-email.me> <uisd4b$gd4s$1@dont-email.me> <sQt4N.2604$rx%7.497@fx47.iad> <ujfvr0$cda9$1@dont-email.me> <5yN6N.8007$DADd.5269@fx38.iad> <NbacnZYbq7jFKcH4nZ2dnZfqn_idnZ2d@supernews.com> <Fx57N.20583$BSkc.9831@fx06.iad> <-zadnY8X-dbzVsD4nZ2dnZfqnPudnZ2d@supernews.com> <ptM7N.70206$_Oab.45835@fx15.iad>
User-Agent: tin/1.9.2-20070201 ("Dalaruan") (UNIX) (Linux/4.18.0-477.27.1.el8_8.x86_64 (x86_64))
Message-ID: <GFydnR-3-5Vs6_34nZ2dnZfqn_adnZ2d@supernews.com>
Date: Fri, 24 Nov 2023 10:12:01 +0000
Lines: 52
X-Trace: sv3-2nSLR4yJL7UZq+TPT1NfVzO2IktlZErIaiQm/MdK9aMA4NZuu0imUCnxSYzwzTQnea9z5QVSxvwh/gZ!jIgNGUVE88ZX2CtuSYfjopOwZp8MSqTdri2UhtF8rrcwEhXOBDuYGpiyeMsXJPmO2MH3ahqvCMPh!jOQIqN73xvk=
X-Complaints-To: www.supernews.com/docs/abuse.html
X-DMCA-Complaints-To: www.supernews.com/docs/dmca.html
X-Abuse-and-DMCA-Info: Please be sure to forward a copy of ALL headers
X-Abuse-and-DMCA-Info: Otherwise we will be unable to process your complaint properly
X-Postfilter: 1.3.40
 by: aph@littlepinkcloud.invalid - Fri, 24 Nov 2023 10:12 UTC

EricP <ThatWouldBeTelling@thevillage.com> wrote:
> aph@littlepinkcloud.invalid wrote:
>> EricP <ThatWouldBeTelling@thevillage.com> wrote:
>>> aph@littlepinkcloud.invalid wrote:
>>>> Barriers only control the ordering between accesses, not when they
>>>> become visible, and here there's only one access. If there are at
>>>> least two, and you really need to see one before the other, then
>>>> you need a barrier.
>>> The barriers also ensure the various local buffers, pipelines and
>>> inbound and outbound comms command and reply message queues are
>>> drained.
>>
>> I'm surprised you say that. All they have to do is make it appear as
>> if this has been done. How it actually happens is up to the hardware
>> designer. In modern GBOOO CPUs, most barriers don't require everything
>> to be pushed to cache. (As I understand it, GBOOO designs can treat
>> the set of pending accesses as a sort of memory transaction, detect
>> conflicts, and roll back to the last point of coherence and replay
>> in-order. But I am not a hardware designer.)
>
> I say that because that is what the Intel manual says for MFENCE and
> SFENCE, "The processor ensures that every store prior to SFENCE is
> globally visible before any store after SFENCE becomes globally
> visible"

Well, hold on. The strongest barrier anyone needs for synchronization
up to sequential consistency on Intel, at least in user mode, is a
LOCK XCHG.

> and Arm64 for DSB "ensures that memory accesses that occur before
> the DSB instruction have completed before the completion of the DSB
> instruction". Perhaps you are thinking of Intel LFENCE and Arm64 DMB
> which are weaker and do not require the prior operations to complete
> but just that they are performed or observed. These appear to only
> look at the local LSQ.

Of course. DMB is what programmers actually use, at least in user
space. DSB is used rarely. You do need a DSB to flush the cache to the
point of coherence, which you might need for persistent memory, and
you need it to flush dcache->icache on some Arm designs. But you don't
need anything more than DMB for lock-free algorithms.

> WRT the draining for synchronization, I'm not sure what you think
> the difference is between making it appear to be done and actually
> doing so. Sure there may be some optimizations possible, but the net
> effect on the externally observable cache state must be the same.

As I wrote, you don't have to actually do the synchronization dance if
it's not needed. You can speculate that there won't be a conflict,
detect any memory-ordering violation, and roll back and replay.

Andrew.

Re: Memory dependency microbenchmark

<904a40b805c70a593356ecaebdf473c3@news.novabbs.com>


https://news.novabbs.org/devel/article-flat.php?id=35189&group=comp.arch#35189

Newsgroups: comp.arch
Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Fri, 24 Nov 2023 18:32:34 +0000
Organization: novaBBS
Message-ID: <904a40b805c70a593356ecaebdf473c3@news.novabbs.com>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <uirke6$8hef$3@dont-email.me> <36011a9597060e08d46db0eddfed0976@news.novabbs.com> <uirqj3$9q9q$1@dont-email.me> <uisd4b$gd4s$1@dont-email.me> <sQt4N.2604$rx%7.497@fx47.iad> <ujfvr0$cda9$1@dont-email.me> <5yN6N.8007$DADd.5269@fx38.iad> <NbacnZYbq7jFKcH4nZ2dnZfqn_idnZ2d@supernews.com> <Fx57N.20583$BSkc.9831@fx06.iad> <-zadnY8X-dbzVsD4nZ2dnZfqnPudnZ2d@supernews.com> <26b59f3350c10e56a76da4c7f19cdc3a@news.novabbs.com> <jcN7N.64676$cAm7.42877@fx18.iad> <ujpjoj$28487$2@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="1988076"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
X-Rslight-Site: $2y$10$bZyoOpdpDsfKaL03MnOBf.E83rl7gpQJBWWvsofDMe6aZls9jhQbK
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
 by: MitchAlsup - Fri, 24 Nov 2023 18:32 UTC

Chris M. Thomasson wrote:

> On 11/23/2023 10:52 AM, EricP wrote:
>> MitchAlsup wrote:
>>>
>> There are two kinds of barriers/fences (I don't know if there are official
>> terms for them), which are local bypass barriers, and global completion
>> barriers.
>>
>> Bypass barriers restrict which younger ops in the local load-store queue
>> may bypass and start execution before older ops have made a value locally
>> visible.
>>
>> Completion barriers block younger ops from starting execution before
>> older ops have completed and read or written globally visible values.
>>
>> You appear to be referring to bypass barriers whereas I'm referring to
>> completion barriers which require globally visible results.
>>

> Basically, how to map the various membar ops into an arch that can be
> RMO. Assume the programmers have no problem with it... ;^o SPARC did it,
> but, is it worth it now? Is my knowledge of dealing with relaxed
> systems, threads/processes and membars obsoleted? shit man... ;^o

If the page being mapped is properly identified in the PTE, then there is
no reason to need any MemBars.

Also Note:: MemBars are the WRONG abstraction--a MemBar is like a wall
whereas what programmers want is a bridge. As long as you are on the
bridge (inside an ATOMIC event) you want one memory model and when you
leave the bridge you are free to use a more performant model. MemBars
only demark the edges of the bridge, they don't cover the whole bridge.

Re: Memory dependency microbenchmark

<6339b1f3d3f7e47364800bfe1dd96bf0@news.novabbs.com>


https://news.novabbs.org/devel/article-flat.php?id=35191&group=comp.arch#35191

Newsgroups: comp.arch
Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Fri, 24 Nov 2023 18:42:25 +0000
Organization: novaBBS
Message-ID: <6339b1f3d3f7e47364800bfe1dd96bf0@news.novabbs.com>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <82b3b3b710652e607dac6cec2064c90b@news.novabbs.com> <uisdmn$gd4s$2@dont-email.me> <uiu4t5$t4c2$2@dont-email.me> <uj3c29$1t9an$1@dont-email.me> <uj3d0a$1tb8u$1@dont-email.me> <ujpnba$28jgr$1@dont-email.me> <ujpni1$28jgr$2@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="1988967"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
X-Rslight-Site: $2y$10$CPk57prgbHxVV0WPU3zhcO7183FfWMmGfFX0jAm427ysD8fW0R7nS
 by: MitchAlsup - Fri, 24 Nov 2023 18:42 UTC

Chris M. Thomasson wrote:

> Afaict MFENCE is basically a #StoreLoad. A LOCK RMW is a membar but in
> an interesting way where there is a word(s) being updated. I say word's
> is because of cmpxchg8b on a 32 bit system. Or cmpxchg16b on a 64 bit
> system. Basically a DWCAS, or double-word compare-and-swap where it
> works with two contiguous words. This is different that a DCAS that can
> work with two non-contiguous words.

The latter is generally known as DCADS, double compare and double swap.
I did see some academic literature a decade ago wanting TCADS,
triple compare and double swap.

It is for stuff like this that I invented esm, so SW writers can
program up any number of compares and any width of swapping. This
means the ISA does not have to change in the future wrt synchronization
capabilities.

It ends up that one of the most important properties is also found
in LL/SC--and that is that the LL denotes the beginning of an ATOMIC
event and the SC denotes the end. LL/SC provide a bridge model to SW,
while ATOMIC events ending with xCAxS only provide the notion that one
is in an ATOMIC event at the last instruction of the event.

LL/SC can perform interference detection based on the address, while
xCAxS can only perform interference detection based on the data at
that address.

Re: Memory dependency microbenchmark

<ujrdhu$2gjhg$5@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=35199&group=comp.arch#35199

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Fri, 24 Nov 2023 16:02:05 -0800
Organization: A noiseless patient Spider
Lines: 58
Message-ID: <ujrdhu$2gjhg$5@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<uirke6$8hef$3@dont-email.me>
<36011a9597060e08d46db0eddfed0976@news.novabbs.com>
<uirqj3$9q9q$1@dont-email.me> <uisd4b$gd4s$1@dont-email.me>
<sQt4N.2604$rx%7.497@fx47.iad> <ujfvr0$cda9$1@dont-email.me>
<5yN6N.8007$DADd.5269@fx38.iad>
<NbacnZYbq7jFKcH4nZ2dnZfqn_idnZ2d@supernews.com>
<Fx57N.20583$BSkc.9831@fx06.iad>
<-zadnY8X-dbzVsD4nZ2dnZfqnPudnZ2d@supernews.com>
<ptM7N.70206$_Oab.45835@fx15.iad>
<GFydnR-3-5Vs6_34nZ2dnZfqn_adnZ2d@supernews.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 25 Nov 2023 00:02:06 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="a478786200890072e344b77cbd100045";
logging-data="2641456"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/7US9gvjZ7OkX4VpInIvgqW7B2OKx3W4E="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:DWqNNv2Z7m+6JddtXsw8CJviW5A=
In-Reply-To: <GFydnR-3-5Vs6_34nZ2dnZfqn_adnZ2d@supernews.com>
Content-Language: en-US
 by: Chris M. Thomasson - Sat, 25 Nov 2023 00:02 UTC

On 11/24/2023 2:12 AM, aph@littlepinkcloud.invalid wrote:
> EricP <ThatWouldBeTelling@thevillage.com> wrote:
>> aph@littlepinkcloud.invalid wrote:
>>> EricP <ThatWouldBeTelling@thevillage.com> wrote:
>>>> aph@littlepinkcloud.invalid wrote:
>>>>> Barriers only control the ordering between accesses, not when they
>>>>> become visible, and here there's only one access. If there are at
>>>>> least two, and you really need to see one before the other, then
>>>>> you need a barrier.
>>>> The barriers also ensure the various local buffers, pipelines and
>>>> inbound and outbound comms command and reply message queues are
>>>> drained.
>>>
>>> I'm surprised you say that. All they have to do is make it appear as
>>> if this has been done. How it actually happens is up to the hardware
>>> designer. In modern GBOOO CPUs, most barriers don't require everything
>>> to be pushed to cache. (As I understand it, GBOOO designs can treat
>>> the set of pending accesses as a sort of memory transaction, detect
>>> conflicts, and roll back to the last point of coherence and replay
>>> in-order. But I am not a hardware designer.)
>>
>> I say that because that is what the Intel manual says for MFENCE and
>> SFENCE, "The processor ensures that every store prior to SFENCE is
>> globally visible before any store after SFENCE becomes globally
>> visible"
>
> Well, hold on. The strongest barrier anyone needs for synchronization
> up to sequential consistency on Intel, at least in user mode, is a
> LOCK XCHG.

Side note, XCHG all by itself automatically implies a LOCK prefix... :^)

>
>> and Arm64 for DSB "ensures that memory accesses that occur before
>> the DSB instruction have completed before the completion of the DSB
>> instruction". Perhaps you are thinking of Intel LFENCE and Arm64 DMB
>> which are weaker and do not require the prior operations to complete
>> but just that they are performed or observed. These appear to only
>> look at the local LSQ.
>
> Of course. DMB is what programmers actually use, at least in user
> space. DSB is used rarely. You do need a DSB to flush the cache to the
> point of coherence, which you might need for persistent memory, and
> you need it to flush dcache->icache on some Arm designs. But you don't
> need anything more than DMB for lock-free algorithms.
>
>> WRT the draining for synchronization, I'm not sure what you think
>> the difference is between making it appear to be done and actually
>> doing so. Sure there may be some optimizations possible, but the net
>> effect on the externally observable cache state must be the same.
>
> As I wrote, you don't have to actually do the synchronization dance if
> it's not needed. You can speculate that there won't be a conflict,
> detect any memory-ordering violation, and roll back and replay.
>
> Andrew.

