Rocksolid Light


Re: Memory dependency microbenchmark

<82b3b3b710652e607dac6cec2064c90b@news.novabbs.com>


https://news.novabbs.org/devel/article-flat.php?id=34951&group=comp.arch#34951

From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Mon, 13 Nov 2023 00:19:21 +0000
Organization: novaBBS
Message-ID: <82b3b3b710652e607dac6cec2064c90b@news.novabbs.com>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <uimulp$38v2q$1@dont-email.me> <79bcf6d10b3c231ba7e3693bad8db3c1@news.novabbs.com> <uiri0a$85mp$2@dont-email.me> <uirjr7$8dlh$2@dont-email.me>
 by: MitchAlsup - Mon, 13 Nov 2023 00:19 UTC

Kent Dickey wrote:

> In article <uiri0a$85mp$2@dont-email.me>,
> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:

>>A highly relaxed memory model can be beneficial for certain workloads.

> I know a lot of people believe that statement to be true. In general, it
> is assumed to be true without proof.
<
In its most general case, relaxed order only provides a performance advantage
when the code is single threaded.
<
> I believe that statement to be false. Can you describe some of these
> workloads?
<
Relaxed memory order fails spectacularly when multiple threads are accessing
data.
<
> Kent

Re: Memory dependency microbenchmark

<uirqj3$9q9q$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=34952&group=comp.arch#34952

From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Sun, 12 Nov 2023 16:28:20 -0800
Organization: A noiseless patient Spider
Message-ID: <uirqj3$9q9q$1@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<uimulp$38v2q$1@dont-email.me>
<79bcf6d10b3c231ba7e3693bad8db3c1@news.novabbs.com>
<uiri0a$85mp$2@dont-email.me> <uirjr7$8dlh$2@dont-email.me>
<uirke6$8hef$3@dont-email.me>
<36011a9597060e08d46db0eddfed0976@news.novabbs.com>
 by: Chris M. Thomasson - Mon, 13 Nov 2023 00:28 UTC

On 11/12/2023 4:20 PM, MitchAlsup wrote:
> Chris M. Thomasson wrote:
>
>> On 11/12/2023 2:33 PM, Kent Dickey wrote:
>>> In article <uiri0a$85mp$2@dont-email.me>,
>>> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>>
>>>> A highly relaxed memory model can be beneficial for certain workloads.
>>>
>>> I know a lot of people believe that statement to be true.  In
>>> general, it
>>> is assumed to be true without proof.
>>>
>>> I believe that statement to be false.  Can you describe some of these
>>> workloads?
>
>> Also, think about converting any sound lock-free algorithm's finely
>> tuned memory barriers to _all_ sequential consistency... That would
>> ruin performance right off the bat... Think about it.
> <
> Assuming you are willing to accept the wrong answer fast, rather than
> the right answer later. There are very few algorithms with this property.

That does not make any sense to me. Think of a basic mutex. It basically
requires an acquire membar for the lock and a release membar for the
unlock. On SPARC that would be:

acquire = MEMBAR #LoadStore | #LoadLoad

release = MEMBAR #LoadStore | #StoreStore

Okay, fine. However, if I made them sequentially consistent, it would
require a damn #StoreLoad barrier for both acquire and release. This is
not good and should be avoided when possible.

Also, RCU prides itself with not having to use any memory barriers for
its read side. If RCU was forced to use a seq cst, basically LOCKED RMW
or MFENCE on Intel, it would completely ruin its performance.
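The acquire/release pairing described above can be sketched in portable C++11 atomics (a hypothetical illustration, not code from the thread): the lock needs only acquire ordering, the analogue of MEMBAR #LoadStore|#LoadLoad, and the unlock only release ordering, the analogue of MEMBAR #LoadStore|#StoreStore; no #StoreLoad-style seq_cst fence is required for mutual exclusion.

```cpp
#include <atomic>
#include <thread>

// Minimal spinlock sketch: lock() uses only acquire ordering and
// unlock() only release ordering, mirroring the SPARC MEMBAR pairs
// above. No #StoreLoad (seq_cst) fence is needed for mutual exclusion.
struct Spinlock {
    std::atomic_flag f = ATOMIC_FLAG_INIT;
    void lock()   { while (f.test_and_set(std::memory_order_acquire)) { } }
    void unlock() { f.clear(std::memory_order_release); }
};

// Two threads increment a plain int under the lock; the acquire/release
// pair alone is enough to make this race-free and always yield 200000.
inline int spinlock_demo() {
    Spinlock s;
    int counter = 0;
    auto work = [&] {
        for (int i = 0; i < 100000; ++i) { s.lock(); ++counter; s.unlock(); }
    };
    std::thread t1(work), t2(work);
    t1.join(); t2.join();
    return counter;
}
```

On x86 the release clear compiles to a plain store, while a seq_cst unlock would need a full fence, which is the cost being discussed above.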

Re: Memory dependency microbenchmark

<uisd4b$gd4s$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=34961&group=comp.arch#34961

From: kegs@provalid.com (Kent Dickey)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Mon, 13 Nov 2023 05:44:43 -0000 (UTC)
Organization: provalid.com
Message-ID: <uisd4b$gd4s$1@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <uirke6$8hef$3@dont-email.me> <36011a9597060e08d46db0eddfed0976@news.novabbs.com> <uirqj3$9q9q$1@dont-email.me>
 by: Kent Dickey - Mon, 13 Nov 2023 05:44 UTC

In article <uirqj3$9q9q$1@dont-email.me>,
Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>On 11/12/2023 4:20 PM, MitchAlsup wrote:
>> Chris M. Thomasson wrote:
>>
>>> On 11/12/2023 2:33 PM, Kent Dickey wrote:
>>>> In article <uiri0a$85mp$2@dont-email.me>,
>>>> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>>>
>>>>> A highly relaxed memory model can be beneficial for certain workloads.
>>>>
>>>> I know a lot of people believe that statement to be true.  In
>>>> general, it
>>>> is assumed to be true without proof.
>>>>
>>>> I believe that statement to be false.  Can you describe some of these
>>>> workloads?
>>
>>> Also, think about converting any sound lock-free algorithm's finely
>>> tuned memory barriers to _all_ sequential consistency... That would
>>> ruin performance right off the bat... Think about it.
>> <
>> Assuming you are willing to accept the wrong answer fast, rather than
>> the right answer later. There are very few algorithms with this property.
>
>That does not make any sense to me. Think of a basic mutex. It basically
>requires an acquire membar for the lock and a release membar for the
>unlock. On SPARC that would be:
>
>acquire = MEMBAR #LoadStore | #LoadLoad
>
>release = MEMBAR #LoadStore | #StoreStore
>
>Okay, fine. However, if I made them sequentially consistent, it would
>require a damn #StoreLoad barrier for both acquire and release. This is
>not good and should be avoided when possible.
>
>Also, RCU prides itself with not having to use any memory barriers for
>its read side. If RCU was forced to use a seq cst, basically LOCKED RMW
>or MFENCE on Intel, it would completely ruin its performance.

You are using the terms in the exact opposite meaning as I would understand
computer architecture.

We'll just assume 3 choices for CPU ordering:

- Sequential consistency (SC). Hardware does everything, there are no
barriers needed, ld/st instructions appear to happen in some
global order.
- Total Store Ordering (TSO) (x86). Stores appear to be done in program
order, but a CPU can peek and see its own local store results
before other CPUs can. Loads appear to be done in some total
program order (not counting hitting its own local stores).
TSO is like SC except there's effectively a store queue, and
stores can finish when placed in the queue, and the queue drains
in FIFO order. Needs no barriers, except for special cases like
Lamport's algorithm (it's easy to avoid barriers).
- Relaxed. Loads and stores are not ordered, users have to put in memory
barriers and hope they did it right.

So, a highly relaxed memory model is the ONLY model which needs barriers.
If you want to get rid of barriers, use a better memory model.
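The distinction can be made concrete with the classic message-passing pattern (a hypothetical C++11 sketch, not code from the post): under SC or TSO the two stores in the writer are ordered anyway, while on a relaxed machine it is the release store and acquire load that emit the barriers being described.

```cpp
#include <atomic>
#include <thread>

// Message-passing sketch: the writer publishes data, then sets a flag.
// SC/TSO hardware guarantees the store order by itself; on a relaxed
// machine the release/acquire annotations are what generate barriers,
// so the reader is guaranteed to observe data == 42, never 0.
inline int mp_demo() {
    int data = 0;
    std::atomic<bool> ready{false};
    std::thread writer([&] {
        data = 42;                                     // plain store
        ready.store(true, std::memory_order_release);  // orders the store above
    });
    while (!ready.load(std::memory_order_acquire)) { } // pairs with the release
    int r = data;  // the acquire/release pairing makes this 42
    writer.join();
    return r;
}
```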

A relaxed ordering SYSTEM says rather than spending a few thousand
gates getting ordering right by hardware in the CPU, instead we're going
to require software to put in some very difficult to understand barriers.
And we're going to have a 45 page academic paper using all the greek
alphabet to describe when you need to put in barriers. Literally no one
understands all the rules, so the best bet is put in too many barriers
and wait for someone to nitpick your code and fix it for you.

[I have a theorem: there is no correct non-trivial multithreaded program
on an architecture which requires barriers for correctness.]

A very simple thought exercise shows even if Sequential Consistency
and/or TSO were slower (and I maintain they are not), but even if you
believe that, a Relaxed Ordering system WILL be slower than TSO or
Sequential for workloads which often use barriers (instructions tagged
with acquire/release are barriers). In a Relaxed ordering system, the
barriers will not be as efficient as the automatic barriers of TSO/SC
(otherwise, why not just do that?), so if barriers are executed often,
performance will be lower than hardware TSO/SC, even if there are no
contentions or any work for the barriers to do. In fact, performance
could be significantly lower.

People know this, it's why they keep trying to get rid of barriers in
their code. So get rid of all them and demand TSO ordering.

Thus, the people trapped in Relaxed Ordering Hell then push weird schemes
on everyone else to try to come up with algorithms which need fewer
barriers. It's crazy.

Relaxed Ordering is a mistake.

Kent

Re: Memory dependency microbenchmark

<uisdmn$gd4s$2@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=34962&group=comp.arch#34962

From: kegs@provalid.com (Kent Dickey)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Mon, 13 Nov 2023 05:54:31 -0000 (UTC)
Organization: provalid.com
Message-ID: <uisdmn$gd4s$2@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <uiri0a$85mp$2@dont-email.me> <uirjr7$8dlh$2@dont-email.me> <82b3b3b710652e607dac6cec2064c90b@news.novabbs.com>
 by: Kent Dickey - Mon, 13 Nov 2023 05:54 UTC

In article <82b3b3b710652e607dac6cec2064c90b@news.novabbs.com>,
MitchAlsup <mitchalsup@aol.com> wrote:
>Kent Dickey wrote:
>
>> In article <uiri0a$85mp$2@dont-email.me>,
>> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>
>>>A highly relaxed memory model can be beneficial for certain workloads.
>
>> I know a lot of people believe that statement to be true. In general, it
>> is assumed to be true without proof.
><
>In its most general case, relaxed order only provides a performance advantage
>when the code is single threaded.

I believe a Relaxed Memory model provides a small performance improvement
ONLY to simple in-order CPUs in an MP system (if you're a single CPU,
there's nothing to order).

Relaxed memory ordering provides approximately zero performance improvement
to an OoO CPU, and in fact, might actually lower performance (depends on
how barriers are done--if done poorly, it could be a big negative).

Yes, the system designers of the world have said: let's slow down our
fastest most expensive most profitable CPUs, so we can speed up our cheapest
lowest profit CPUs a few percent, and push a ton of work onto software
developers.

It's crazy.

>> I believe that statement to be false. Can you describe some of these
>> workloads?
><
>Relaxed memory order fails spectacularly when multiple threads are accessing
>data.

Probably need to clarify with "accessing modified data".

Kent

weak consistency and the supercomputer attitude (was: Memory dependency microbenchmark)

<2023Nov13.084835@mips.complang.tuwien.ac.at>


https://news.novabbs.org/devel/article-flat.php?id=34963&group=comp.arch#34963

From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: weak consistency and the supercomputer attitude (was: Memory dependency microbenchmark)
Date: Mon, 13 Nov 2023 07:48:35 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2023Nov13.084835@mips.complang.tuwien.ac.at>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <uimulp$38v2q$1@dont-email.me> <79bcf6d10b3c231ba7e3693bad8db3c1@news.novabbs.com> <uiri0a$85mp$2@dont-email.me> <uirjr7$8dlh$2@dont-email.me> <uirke6$8hef$3@dont-email.me>
 by: Anton Ertl - Mon, 13 Nov 2023 07:48 UTC

"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
>Also, think about converting any sound lock-free algorithm's finely
>tuned memory barriers to _all_ sequential consistency... That would ruin
>performance right off the bat... Think about it.

Proof by claim?

I think about several similar instances, where people went for
simple-minded hardware designs and threw the complexity over the wall
to the software people, and claimed that it was for performance; I
call that the "supercomputing attitude", and it may work in areas
where the software crisis has not yet struck[1], but is a bad attitude
in areas like general-purpose computing where it has struck.

1) People thought that they could achieve faster hardware by throwing
the task of scheduling instructions for maximum instruction-level
parallelism over to the compiler people. Several companies (in
particular, Intel, HP, and Transmeta) invested a lot of money into
this dream (and the Mill project relives this dream), but it turned
out that doing the scheduling in hardware is faster.

2) A little earlier, the Alpha designers thought that they could gain
speed by not implementing denormal numbers and by implementing
imprecise exceptions for FP operations, so that it is not possible to
implement denormal numbers through a software fixup, either. For
dealing properly with denormal numbers, you had to insert a trap
barrier right after each FP instruction, and presumably this cost a
lot of performance on early Alpha implementations. However, when I
measured it on the 21264 (released six years after the first Alpha),
the cost was like that of a nop; I guess that the trap barrier was
actually a nop on the 21264, because, as an OoO processor, the 21264
performs precise exceptions without breaking a sweat. And the 21264
is faster than the models where the trap barrier actually does
something. Meanwhile, Mitch Alsup also has posted that he can
implement fast denormal numbers with IIRC 30 extra gates (which is
probably less than what is needed for implementing the trap barrier).

3) The Alpha is a rich source of examples of the supercomputer
attitude: It started out without instructions for accessing 8-bit and
16-bit data in memory. Instead, the idea was that for accessing
memory, you would use instruction sequences, and for accessing I/O
devices, the device was mapped three times or so: In one address range
you performed bytewise access, in another address range 16-bit
accesses, and in the third address range 32-bit and 64-bit accesses;
I/O driver writers had to write or modify their drivers for this
model. The rationale for that was that they required ECC for
permanent storage and that would supposedly require slow RMW accesses
for writing bytes to write-back caches. Now the 21064 and 21164 had a
write-through D-cache. That made it easy to add byte and word
accesses (BWX) in the 21164A (released 1996), but they could have done
it from the start. The 21164A is in no way slower than the 21164; it
has the same IPC and a higher clock rate.

Some people welcome and celebrate the challenges that the
supercomputer attitude poses for software, and justify it with
"performance", but as the examples above show, such claims often turn
out to be false when you actually invest effort into more capable
hardware.

Given that multi-processors come out of supercomputing, it's no
surprise that the supercomputing attitude is particularly strong
there.

But if you look at it from an architecture (i.e., hardware/software
interface) perspective, weak consistency is just bad architecture:
good architecture says what happens to the architectural state when
software performs some instruction. From that perspective sequential
consistency is architecturally best. Weaker consistency models
describe how the architecture does not provide the sequential
consistency guarantees that are so easy to describe; the weaker the
model, the more deviations it has to describe.

[1] The software crisis is that software costs are higher than
hardware costs, and supercomputing with its gigantic hardware costs
notices the software crisis much later than general-purpose computing.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Memory dependency microbenchmark

<84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com>


https://news.novabbs.org/devel/article-flat.php?id=34964&group=comp.arch#34964

Sender: Andrew Haley <aph@zarquon.pink>
From: aph@littlepinkcloud.invalid
Subject: Re: Memory dependency microbenchmark
Newsgroups: comp.arch
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <uirke6$8hef$3@dont-email.me> <36011a9597060e08d46db0eddfed0976@news.novabbs.com> <uirqj3$9q9q$1@dont-email.me> <uisd4b$gd4s$1@dont-email.me>
Message-ID: <84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com>
Date: Mon, 13 Nov 2023 11:49:31 +0000
 by: aph@littlepinkcloud.invalid - Mon, 13 Nov 2023 11:49 UTC

Kent Dickey <kegs@provalid.com> wrote:
> We'll just assume 3 choices for CPU ordering:
>
> - Sequential consistency (SC). Hardware does everything, there are no
> barriers needed, ld/st instructions appear to happen in some
> global order.
> - Total Store Ordering (TSO) (x86). Stores appear to be done in program
> order, but a CPU can peek and see its own local store results
> before other CPUs can. Loads appear to be done in some total
> program order (not counting hitting its own local stores).
> TSO is like SC except there's effectively a store queue, and
> stores can finish when placed in the queue, and the queue drains
> in FIFO order. Needs no barriers, except for special cases like
> Lamport's algorithm (it's easy to avoid barriers).
> - Relaxed. Loads and stores are not ordered, users have to put in memory
> barriers and hope they did it right.
>
> So, a highly relaxed memory model is the ONLY model which needs barriers.

But this isn't true. Real processors aren't anywhere near as wildly
chaotic as this.

The common form of relaxed memory we see today is causally consistent
and multi-copy atomic (and cache coherent). So, all other threads see
stores to a single location in the same order, and you don't get the
extraordinary going-backwards-in-time behaviour of DEC Alpha.

> A relaxed ordering SYSTEM says rather than spending a few thousand
> gates getting ordering right by hardware in the CPU, instead we're
> going to require software to put in some very difficult to
> understand barriers.

I don't think that's really true. The reorderings we see in currently-
produced hardware are, more or less, a subset of the same reorderings
that C compilers perform. Therefore, if you see a confusing hardware
reordering in a multi-threaded C program it may well be (probably is!)
a bug (according to the C standard) *even on a TSO machine*. The only
common counter-example to this is for volatile accesses.
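That point can be illustrated with a hedged sketch (hypothetical, not from the thread): the commented-out loop below is a data race under the C/C++ standard, so the compiler may hoist the load and spin forever even on an x86/TSO machine; the fix is an atomic flag at the language level, independent of the hardware memory model.

```cpp
#include <atomic>
#include <thread>

bool plain_flag = false;
// Data race: the compiler may cache plain_flag in a register and turn
// this into an infinite loop, even on TSO hardware. Undefined behaviour.
// void broken_wait() { while (!plain_flag) { } }

// The language-level fix: an atomic load may not be hoisted out of the
// loop, so the reader eventually observes the writer's store.
std::atomic<bool> ready_flag{false};

inline int fixed_wait_demo() {
    std::thread setter([] { ready_flag.store(true, std::memory_order_relaxed); });
    while (!ready_flag.load(std::memory_order_relaxed)) { }
    setter.join();
    return 1;
}
```

Note that even relaxed atomic accesses suffice here to forbid the compiler transformation; the hardware ordering question is separate from the data-race question.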

> A very simple thought exercise shows even if Sequential Consistency
> and/or TSO were slower (and I maintain they are not), but even if
> you believe that, a Relaxed Ordering system WILL be slower than TSO
> or Sequential for workloads which often use barriers (instructions
> tagged with acquire/release are barriers). In a Relaxed ordering
> system, the barriers will not be as efficient as the automatic
> barriers of TSO/SC (otherwise, why not just do that?),

Whyever not? They do the same thing.

Andrew.

Re: weak consistency and the supercomputer attitude

<jwvo7fxacy1.fsf-monnier+comp.arch@gnu.org>


https://news.novabbs.org/devel/article-flat.php?id=34967&group=comp.arch#34967

From: monnier@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: weak consistency and the supercomputer attitude
Date: Mon, 13 Nov 2023 10:36:52 -0500
Organization: A noiseless patient Spider
Message-ID: <jwvo7fxacy1.fsf-monnier+comp.arch@gnu.org>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<uimulp$38v2q$1@dont-email.me>
<79bcf6d10b3c231ba7e3693bad8db3c1@news.novabbs.com>
<uiri0a$85mp$2@dont-email.me> <uirjr7$8dlh$2@dont-email.me>
<uirke6$8hef$3@dont-email.me>
<2023Nov13.084835@mips.complang.tuwien.ac.at>
 by: Stefan Monnier - Mon, 13 Nov 2023 15:36 UTC

> 1) People thought that they could achieve faster hardware by throwing
> the task of scheduling instructions for maximum instruction-level
> parallelism over to the compiler people. Several companies (in
> particular, Intel, HP, and Transmeta) invested a lot of money into
> this dream (and the Mill project relives this dream), but it turned
> out that doing the scheduling in hardware is faster.

IIRC the main argument for the Mill wasn't that it was going to be
faster but that it would give a better performance per watt by avoiding
the administrative cost of managing those hundreds of reordered
in-flight instructions, without losing too much peak performance.

The fact that performance per watt of in-order ARM cores is not really
lower than that of OOO cores suggests that the Mill wouldn't deliver on
this "promise" either.
Still, I really would like to see how it plays out in practice, instead
of having to guess.

Stefan

Re: Memory dependency microbenchmark

<sQt4N.2604$rx%7.497@fx47.iad>


https://news.novabbs.org/devel/article-flat.php?id=34970&group=comp.arch#34970

Newsgroups: comp.arch
From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <uirke6$8hef$3@dont-email.me> <36011a9597060e08d46db0eddfed0976@news.novabbs.com> <uirqj3$9q9q$1@dont-email.me> <uisd4b$gd4s$1@dont-email.me>
In-Reply-To: <uisd4b$gd4s$1@dont-email.me>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 129
Message-ID: <sQt4N.2604$rx%7.497@fx47.iad>
NNTP-Posting-Date: Mon, 13 Nov 2023 18:23:20 UTC
Date: Mon, 13 Nov 2023 13:22:44 -0500
 by: EricP - Mon, 13 Nov 2023 18:22 UTC

Kent Dickey wrote:
> In article <uirqj3$9q9q$1@dont-email.me>,
> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>> On 11/12/2023 4:20 PM, MitchAlsup wrote:
>>> Chris M. Thomasson wrote:
>>>
>>>> On 11/12/2023 2:33 PM, Kent Dickey wrote:
>>>>> In article <uiri0a$85mp$2@dont-email.me>,
>>>>> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>>>>
>>>>>> A highly relaxed memory model can be beneficial for certain workloads.
>>>>> I know a lot of people believe that statement to be true. In
>>>>> general, it
>>>>> is assumed to be true without proof.
>>>>>
>>>>> I believe that statement to be false. Can you describe some of these
>>>>> workloads?
>>>> Also, think about converting any sound lock-free algorithm's finely
>>>> tuned memory barriers to _all_ sequential consistency... That would
>>>> ruin performance right off the bat... Think about it.
>>> <
>>> Assuming you are willing to accept the wrong answer fast, rather than
>>> the right answer later. There are very few algorithms with this property.
>> That does not make any sense to me. Think of a basic mutex. It basically
>> requires an acquire membar for the lock and a release membar for the
>> unlock. On SPARC that would be:
>>
>> acquire = MEMBAR #LoadStore | #LoadLoad
>>
>> release = MEMBAR #LoadStore | #StoreStore
>>
>> Okay, fine. However, if I made them sequentially consistent, it would
>> require a damn #StoreLoad barrier for both acquire and release. This is
>> not good and should be avoided when possible.
>>
>> Also, RCU prides itself with not having to use any memory barriers for
>> its read side. If RCU was forced to use a seq cst, basically LOCKED RMW
>> or MFENCE on Intel, it would completely ruin its performance.
>
> You are using the terms in the exact opposite meaning as I would understand
> computer architecture.
>
> We'll just assume 3 choices for CPU ordering:
>
> - Sequential consistency (SC). Hardware does everything, there are no
> barriers needed, ld/st instructions appear to happen in some
> global order.
> - Total Store Ordering (TSO) (x86). Stores appear to be done in program
> order, but a CPU can peek and see its own local store results
> before other CPUs can. Loads appear to be done in some total
> program order (not counting hitting its own local stores).
> TSO is like SC except there's effectively a store queue, and
> stores can finish when placed in the queue, and the queue drains
> in FIFO order. Needs no barriers, except for special cases like
> Lamport's algorithm (it's easy to avoid barriers).
> - Relaxed. Loads and stores are not ordered, users have to put in memory
> barriers and hope they did it right.
>
> So, a highly relaxed memory model is the ONLY model which needs barriers.
> If you want to get rid of barriers, use a better memory model.
>
> A relaxed ordering SYSTEM says rather than spending a few thousand
> gates getting ordering right by hardware in the CPU, instead we're going
> to require software to put in some very difficult to understand barriers.
> And we're going to have a 45 page academic paper using all the greek
> alphabet to describe when you need to put in barriers. Literally no one
> understands all the rules, so the best bet is put in too many barriers
> and wait for someone to nitpick your code and fix it for you.
>
> [I have a theorem: there is no correct non-trivial multithreaded program
> on an architecture which requires barriers for correctness.].
>
> A very simple thought exercise shows even if Sequential Consistency
> and/or TSO were slower (and I maintain they are not), but even if you
> believe that, a Relaxed Ordering system WILL be slower than TSO or
> Sequential for workloads which often use barriers (instructions tagged
> with acquire/release are barriers). In a Relaxed ordering system, the
> barriers will not be as efficient as the automatic barriers of TSO/SC
> (otherwise, why not just do that?), so if barriers are executed often,
> performance will be lower than hardware TSO/SC, even if there are no
> contentions or any work for the barriers to do. In fact, performance
> could be significantly lower.
>
> People know this, it's why they keep trying to get rid of barriers in
> their code. So get rid of all them and demand TSO ordering.
>
> Thus, the people trapped in Relaxed Ordering Hell then push weird schemes
> on everyone else to try to come up with algorithms which need fewer
> barriers. It's crazy.
>
> Relaxed Ordering is a mistake.
>
> Kent

I suggest something different: the ability to switch between TSO and
relaxed with non-privileged user mode instructions.

Non-concurrent code does not see the relaxed ordering, and should benefit
from extra concurrency in the Load Store Queue and cache that relaxed rules
allow, because the local core always sees its own memory as consistent.
For example, relaxed ordering allows multiple LD and ST to be in
multiple pipelines to multiple cache banks at once without regard
as to the exact order the operations are applied.

This is fine for non concurrently accessed data structures,
either non-shared data areas or shared but guarded by mutexes.

But relaxed is hard for people to reason about for concurrently accessed
lock-free data structures. Now these don't just appear out of thin air, so
it is reasonable for a program to emit TSO_START and TSO_END instructions.

On the other hand, almost no code is lock-free or ever will be.
So why have all the extra HW logic to support TSO if it's only really
needed for this rare kind of programming?

But there is also a category of memory area that is not covered by the
above rules, one where one core thinks its memory is local and not shared
but in fact it is being accessed concurrently.

If thread T1 (say an app) on core C1 says its memory is relaxed, and calls
a subroutine passing a pointer to a value on T1's stack, and that pointer
is passed to thread T2 (a driver) on core C2 which accesses that memory,
then even if T2 declared itself to be using TSO rules, it would not force
T1 on C1 to obey them.

Where this approach could fail is the kind of laissez-faire sharing done
by many apps, libraries, and OS's behind the scenes in the real world.

Re: weak consistency and the supercomputer attitude

<c63218b80d8a14b488e10fc81f23405b@news.novabbs.com>


https://news.novabbs.org/devel/article-flat.php?id=34971&group=comp.arch#34971

Newsgroups: comp.arch
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: weak consistency and the supercomputer attitude
Date: Mon, 13 Nov 2023 19:11:51 +0000
Organization: novaBBS
Message-ID: <c63218b80d8a14b488e10fc81f23405b@news.novabbs.com>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <uimulp$38v2q$1@dont-email.me> <79bcf6d10b3c231ba7e3693bad8db3c1@news.novabbs.com> <uiri0a$85mp$2@dont-email.me> <uirjr7$8dlh$2@dont-email.me> <uirke6$8hef$3@dont-email.me> <2023Nov13.084835@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
 by: MitchAlsup - Mon, 13 Nov 2023 19:11 UTC

Anton Ertl wrote:

> "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
>>Also, think about converting any sound lock-free algorithm's finely
>>tuned memory barriers to _all_ sequential consistency... That would ruin
>>performance right off the bat... Think about it.

> Proof by claim?

> I think about several similar instances, where people went for
> simple-minded hardware designs and threw the complexity over the wall
> to the software people, and claimed that it was for performance; I
> call that the "supercomputing attitude", and it may work in areas
> where the software crisis has not yet struck[1], but is a bad attitude
> in areas like general-purpose computing where it has struck.

> 1) People thought that they could achieve faster hardware by throwing
> the task of scheduling instructions for maximum instruction-level
> parallelism over to the compiler people. Several companies (in
> particular, Intel, HP, and Transmeta) invested a lot of money into
> this dream (and the Mill project relives this dream), but it turned
> out that doing the scheduling in hardware is faster.
<
Not faster, but easier to do with acceptable HW costs. The pipeline
is 1-3 stages longer, but HW has dynamic information that SW cannot have.
<
> 2) A little earlier, the Alpha designers thought that they could gain
> speed by not implementing denormal numbers and by implementing
> imprecise exceptions for FP operations, so that it is not possible to
> implement denormal numbers through a software fixup, either. For
<
So did I in Mc 88100--just as wrong then as it is now.
<
> dealing properly with denormal numbers, you had to insert a trap
> barrier right after each FP instruction, and presumably this cost a
> lot of performance on early Alpha implementations. However, when I
> measured it on the 21264 (released six years after the first Alpha),
> the cost was like that of a nop; I guess that the trap barrier was
> actually a nop on the 21264, because, as an OoO processor, the 21264
> performs precise exceptions without breaking a sweat. And the 21264
> is faster than the models where the trap barrier actually does
> something. Meanwhile, Mitch Alsup also has posted that he can
> implement fast denormal numbers with IIRC 30 extra gates (which is
> probably less than what is needed for implementing the trap barrier).
<
I recall saying it is about 2% of the gate count of an FMAC unit.
<
> 3) The Alpha is a rich source of examples of the supercomputer
> attitude: It started out without instructions for accessing 8-bit and
> 16-bit data in memory. Instead, the idea was that for accessing
> memory, you would use instruction sequences, and for accessing I/O
> devices, the device was mapped three times or so: In one address range
> you performed bytewise access, in another address range 16-bit
> accesses, and in the third address range 32-bit and 64-bit accesses;
> I/O driver writers had to write or modify their drivers for this
> model. The rationale for that was that they required ECC for
> permanent storage and that would supposedly require slow RMW accesses
> for writing bytes to write-back caches. Now the 21064 and 21164 had a
> write-through D-cache. That made it easy to add byte and word
> accesses (BWX) in the 21164A (released 1996), but they could have done
> it from the start. The 21164A is in no way slower than the 21164; it
> has the same IPC and a higher clock rate.

> Some people welcome and celebrate the challenges that the
> supercomputer attitude poses for software, and justify it with
> "performance", but as the examples above show, such claims often turn
> out to be false when you actually invest effort into more capable
> hardware.

> Given that multi-processors come out of supercomputing, it's no
> surprise that the supercomputing attitude is particularly strong
> there.

> But if you look at it from an architecture (i.e., hardware/software
> interface) perspective, weak consistency is just bad architecture:
> good architecture says what happens to the architectural state when
> software performs some instruction. From that perspective sequential
> consistency is architecturally best. Weaker consistency models
> describe how the architecture does not provide the sequential
> consistency guarantees that are so easy to describe; the weaker the
> model, the more deviations it has to describe.
<
The problem with the weak consistency models comes from the
fact that they were universal over all accesses. However, the TLB can be
used to solve that problem so each access has its own model, and the HW
has to perform with that model, often across a multiplicity of memory
references. For my part I have 4 memory models and the CPUs switch to
the appropriate model upon detection without needing instructions. So
when the first instruction of an ATOMIC event is detected (decode),
all weaker outstanding requests are allowed to complete, all of
the ATOMIC requests are performed in a sequentially consistent manner,
and afterwards the memory model is weakened again.
<
> [1] The software crisis is that software costs are higher than
> hardware costs, and supercomputing with its gigantic hardware costs
> notices the software crisis much later than general-purpose computing.

> - anton

Re: Memory dependency microbenchmark

<f844d292a8b07f85e553d2599e325cca@news.novabbs.com>


https://news.novabbs.org/devel/article-flat.php?id=34972&group=comp.arch#34972

Newsgroups: comp.arch
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Mon, 13 Nov 2023 19:16:41 +0000
Organization: novaBBS
Message-ID: <f844d292a8b07f85e553d2599e325cca@news.novabbs.com>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <uirke6$8hef$3@dont-email.me> <36011a9597060e08d46db0eddfed0976@news.novabbs.com> <uirqj3$9q9q$1@dont-email.me> <uisd4b$gd4s$1@dont-email.me> <84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
 by: MitchAlsup - Mon, 13 Nov 2023 19:16 UTC

aph@littlepinkcloud.invalid wrote:

> Kent Dickey <kegs@provalid.com> wrote:
>> We'll just assume 3 choices for CPU ordering:
>>
>> - Sequential consistency (SC). Hardware does everything, there are no
>> barriers needed, ld/st instructions appear to happen in some
>> global order.
>> - Total Store Ordering (TSO) (x86). Stores appear to be done in program
>> order, but a CPU can peek and see its own local store results
>> before other CPUs can. Loads appear to be done in some total
>> program order (not counting hitting its own local stores).
>> TSO is like SC except there's effectively a store queue, and
>> stores can finish when placed in the queue, and the queue drains
>> in FIFO order. Needs no barriers, except for special cases like
>> Lamport's algorithm (it's easy to avoid barriers).
>> - Relaxed. Loads and stores are not ordered, users have to put in memory
>> barriers and hope they did it right.
>>
>> So, a highly relaxed memory model is the ONLY model which needs barriers.

> But this isn't true. Real processors aren't anywhere near as wildly
> chaotic as this.

> The common form of relaxed memory we see today is causally consistent
> and multi-copy atomic (and cache coherent). So, all other threads see
> stores to a single location in the same order, and you don't get the
> extraordinary going-backwards-in-time behaviour of DEC Alpha.

>> A relaxed ordering SYSTEM says rather than spending a few thousand
>> gates getting ordering right by hardware in the CPU, instead we're
>> going to require software to put in some very difficult to
>> understand barriers.
<
The control logic to perform the ordering may be that small, but the state
to maintain it is at least (5×48^2)×1.2 gates.

> I don't think that's really true. The reorderings we see in currently-
> produced hardware are, more or less, a subset of the same reorderings
> that C compilers perform. Therefore, if you see a confusing hardware
> reordering in a multi-threaded C program it may well be (probably is!)
> a bug (according to the C standard) *even on a TSO machine*. The only
> common counter-example to this is for volatile accesses.
<
Andy Glew used to report that there were surprisingly few instructions
performed out-of-order on a GBOoO machine.

>> A very simple thought exercise shows even if Sequential Consistency
>> and/or TSO were slower (and I maintain they are not), but even if
>> you believe that, a Relaxed Ordering system WILL be slower than TSO
>> or Sequential for workloads which often use barriers (instructions
>> tagged with acquire/release are barriers). In a Relaxed ordering
>> system, the barriers will not be as efficient as the automatic
>> barriers of TSO/SC (otherwise, why not just do that?),

> Whyever not? They do the same thing.

> Andrew.

Re: Memory dependency microbenchmark

<0f58091b40e44bd01e446dd8335e647e@news.novabbs.com>


https://news.novabbs.org/devel/article-flat.php?id=34973&group=comp.arch#34973

Newsgroups: comp.arch
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Mon, 13 Nov 2023 19:29:30 +0000
Organization: novaBBS
Message-ID: <0f58091b40e44bd01e446dd8335e647e@news.novabbs.com>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <uirke6$8hef$3@dont-email.me> <36011a9597060e08d46db0eddfed0976@news.novabbs.com> <uirqj3$9q9q$1@dont-email.me> <uisd4b$gd4s$1@dont-email.me> <sQt4N.2604$rx%7.497@fx47.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
 by: MitchAlsup - Mon, 13 Nov 2023 19:29 UTC

EricP wrote:

> Kent Dickey wrote:
>> In article <uirqj3$9q9q$1@dont-email.me>,
>> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>> On 11/12/2023 4:20 PM, MitchAlsup wrote:
>>>> Chris M. Thomasson wrote:
>>>>
>>>>> On 11/12/2023 2:33 PM, Kent Dickey wrote:
>>>>>> In article <uiri0a$85mp$2@dont-email.me>,
>>>>>> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>>>>>
>>>>>>> A highly relaxed memory model can be beneficial for certain workloads.
>>>>>> I know a lot of people believe that statement to be true. In
>>>>>> general, it
>>>>>> is assumed to be true without proof.
>>>>>>
>>>>>> I believe that statement to be false. Can you describe some of these
>>>>>> workloads?
>>>>> Also, think about converting any sound lock-free algorithm's finely
>>>>> tuned memory barriers to _all_ sequential consistency... That would
>>>>> ruin performance right off the bat... Think about it.
>>>> <
>>>> Assuming you are willing to accept the wrong answer fast, rather than
>>>> the right answer later. There are very few algorithms with this property.
>>> That does not make any sense to me. Think of a basic mutex. It basically
>>> requires an acquire membar for the lock and a release membar for the
>>> unlock. On SPARC that would be:
>>>
>>> acquire = MEMBAR #LoadStore | #LoadLoad
>>>
>>> release = MEMBAR #LoadStore | #StoreStore
>>>
>>> Okay, fine. However, if I made them sequentially consistent, it would
>>> require a damn #StoreLoad barrier for both acquire and release. This is
>>> not good and should be avoided when possible.
>>>
>>> Also, RCU prides itself with not having to use any memory barriers for
>>> its read side. If RCU was forced to use a seq cst, basically LOCKED RMW
>>> or MFENCE on Intel, it would completely ruin its performance.
>>
>> You are using the terms in the exact opposite meaning as I would understand
>> computer architecture.
>>
>> We'll just assume 3 choices for CPU ordering:
>>
>> - Sequential consistency (SC). Hardware does everything, there are no
>> barriers needed, ld/st instructions appear to happen in some
>> global order.
>> - Total Store Ordering (TSO) (x86). Stores appear to be done in program
>> order, but a CPU can peek and see its own local store results
>> before other CPUs can. Loads appear to be done in some total
>> program order (not counting hitting its own local stores).
>> TSO is like SC except there's effectively a store queue, and
>> stores can finish when placed in the queue, and the queue drains
>> in FIFO order. Needs no barriers, except for special cases like
>> Lamport's algorithm (it's easy to avoid barriers).
>> - Relaxed. Loads and stores are not ordered, users have to put in memory
>> barriers and hope they did it right.
>>
>> So, a highly relaxed memory model is the ONLY model which needs barriers.
>> If you want to get rid of barriers, use a better memory model.
>>
>> A relaxed ordering SYSTEM says rather than spending a few thousand
>> gates getting ordering right by hardware in the CPU, instead we're going
>> to require software to put in some very difficult to understand barriers.
>> And we're going to have a 45 page academic paper using all the greek
>> alphabet to describe when you need to put in barriers. Literally no one
>> understands all the rules, so the best bet is put in too many barriers
>> and wait for someone to nitpick your code and fix it for you.
>>
>> [I have a theorem: there is no correct non-trivial multithreaded program
>> on an architecture which requires barriers for correctness.].
>>
>> A very simple thought exercise shows even if Sequential Consistency
>> and/or TSO were slower (and I maintain they are not), but even if you
>> believe that, a Relaxed Ordering system WILL be slower than TSO or
>> Sequential for workloads which often use barriers (instructions tagged
>> with acquire/release are barriers). In a Relaxed ordering system, the
>> barriers will not be as efficient as the automatic barriers of TSO/SC
>> (otherwise, why not just do that?), so if barriers are executed often,
>> performance will be lower than hardware TSO/SC, even if there are no
>> contentions or any work for the barriers to do. In fact, performance
>> could be significantly lower.
>>
>> People know this, it's why they keep trying to get rid of barriers in
>> their code. So get rid of all them and demand TSO ordering.
>>
>> Thus, the people trapped in Relaxed Ordering Hell then push weird schemes
>> on everyone else to try to come up with algorithms which need fewer
>> barriers. It's crazy.
>>
>> Relaxed Ordering is a mistake.
>>
>> Kent

> I suggest something different: the ability to switch between TSO and
> relaxed with non-privileged user mode instructions.
<
I suggest this switch between modes be done without executing any extra
instructions. I do the switches based on the address space of the access
{DRAM, MMI/O, config, ROM} and I also switch to SC when an ATOMIC event
begins.
<
> Non-concurrent code does not see the relaxed ordering, and should benefit
> from extra concurrency in the Load Store Queue and cache that relaxed rules
> allow, because the local core always sees its own memory as consistent.
> For example, relaxed ordering allows multiple LD and ST to be in
> multiple pipelines to multiple cache banks at once without regard
> as to the exact order the operations are applied.
<
I suspect SUN lost significant performance by always running TSO, and it
still required barrier instructions.
<
> This is fine for non concurrently accessed data structures,
> either non-shared data areas or shared but guarded by mutexes.

> But relaxed is hard for people to reason about for concurrently accessed
> lock free data structures.
<
It is hard for people who have a completely von Neumann thinking pattern
to reason about these things; it is not hard for someone whose entire
career was spent doing a multiplicity of things concurrently.
<
SW languages (and debuggers,...) teach people to think:: this happens, then
that happens, then something else happens. HW languages teach people to think
"crap, all of this is happening at once, how do I make sense out of it".
<
In CPU design (or chip design in general) one is NEVER given the illusion
that a single <vast> state describes the moment. It is a shame that SW
did not follow a similar route.
<
> Now these don't just appear out of thin air so
> it is reasonable for a program to emit TSO_START and TSO_END instructions.

> On the other hand, almost no code is lock-free or ever will be.
> So why have all the extra HW logic to support TSO if its only really
> needed for this rare kind of programming.
<
I made this exact argument to SUN circa 1993.....
<
> But there is also a category of memory area that is not covered by the
> above rules, one where one core thinks its memory is local and not shared
> but in fact it is being accessed concurrently.

> If thread T1 (say an app) on core C1 says its memory is relaxed, and calls
> a subroutine passing a pointer to a value on T1's stack, and that pointer
> is passed to thread T2 (a driver) on core C2 which accesses that memory,
> then even if T2 declared itself to be using TSO rules it would not force
> T1 on C1 obey them.

> Where this approach could fail is the kind of laissez-faire sharing done
> by many apps, libraries, and OS's behind the scenes in the real world.

So, anything written in JavaScript........

Re: Memory dependency microbenchmark

<uiu3dp$svfh$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=34980&group=comp.arch#34980

Newsgroups: comp.arch
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Mon, 13 Nov 2023 13:11:18 -0800
Organization: A noiseless patient Spider
Lines: 56
Message-ID: <uiu3dp$svfh$1@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<uirke6$8hef$3@dont-email.me>
<36011a9597060e08d46db0eddfed0976@news.novabbs.com>
<uirqj3$9q9q$1@dont-email.me> <uisd4b$gd4s$1@dont-email.me>
<84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 13 Nov 2023 21:11:21 -0000 (UTC)
User-Agent: Mozilla Thunderbird
Content-Language: en-US
In-Reply-To: <84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com>
 by: Chris M. Thomasson - Mon, 13 Nov 2023 21:11 UTC

On 11/13/2023 3:49 AM, aph@littlepinkcloud.invalid wrote:
> Kent Dickey <kegs@provalid.com> wrote:
>> We'll just assume 3 choices for CPU ordering:
>>
>> - Sequential consistency (SC). Hardware does everything, there are no
>> barriers needed, ld/st instructions appear to happen in some
>> global order.
>> - Total Store Ordering (TSO) (x86). Stores appear to be done in program
>> order, but a CPU can peek and see its own local store results
>> before other CPUs can. Loads appear to be done in some total
>> program order (not counting hitting its own local stores).
>> TSO is like SC except there's effectively a store queue, and
>> stores can finish when placed in the queue, and the queue drains
>> in FIFO order. Needs no barriers, except for special cases like
>> Lamport's algorithm (it's easy to avoid barriers).
>> - Relaxed. Loads and stores are not ordered, users have to put in memory
>> barriers and hope they did it right.
>>
>> So, a highly relaxed memory model is the ONLY model which needs barriers.
>
> But this isn't true. Real processors aren't anywhere near as wildly
> chaotic as this.
>
> The common form of relaxed memory we see today is causally consistent
> and multi-copy atomic (and cache coherent). So, all other threads see
> stores to a single location in the same order, and you don't get the
> extraordinary going-backwards-in-time behaviour of DEC Alpha.
>
>> A relaxed ordering SYSTEM says rather than spending a few thousand
>> gates getting ordering right by hardware in the CPU, instead we're
>> going to require software to put in some very difficult to
>> understand barriers.
>
> I don't think that's really true. The reorderings we see in currently-
> produced hardware are, more or less, a subset of the same reorderings
> that C compilers perform. Therefore, if you see a confusing hardware
> reordering in a multi-threaded C program it may well be (probably is!)
> a bug (according to the C standard) *even on a TSO machine*. The only
> common counter-example to this is for volatile accesses.

Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*

>
>> A very simple thought exercise shows even if Sequential Consistency
>> and/or TSO were slower (and I maintain they are not), but even if
>> you believe that, a Relaxed Ordering system WILL be slower than TSO
>> or Sequential for workloads which often use barriers (instructions
>> tagged with acquire/release are barriers). In a Relaxed ordering
>> system, the barriers will not be as efficient as the automatic
>> barriers of TSO/SC (otherwise, why not just do that?),
>
> Whyever not? They do the same thing.
>
> Andrew.

Re: Memory dependency microbenchmark

<uiu4ji$t4c2$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=34981&group=comp.arch#34981

Newsgroups: comp.arch
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Mon, 13 Nov 2023 13:31:27 -0800
Organization: A noiseless patient Spider
Lines: 54
Message-ID: <uiu4ji$t4c2$1@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<uirke6$8hef$3@dont-email.me>
<36011a9597060e08d46db0eddfed0976@news.novabbs.com>
<uirqj3$9q9q$1@dont-email.me> <uisd4b$gd4s$1@dont-email.me>
<sQt4N.2604$rx%7.497@fx47.iad>
<0f58091b40e44bd01e446dd8335e647e@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 13 Nov 2023 21:31:30 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="55c5ba4d49df2a7646321f8f0cd32f12";
logging-data="954754"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/HEFoPH3zK3HiGm9aoavXFzHuZNL0iqYA="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:G5t5J3KwSy0L3z7FeoNMooBHEOM=
Content-Language: en-US
In-Reply-To: <0f58091b40e44bd01e446dd8335e647e@news.novabbs.com>
 by: Chris M. Thomasson - Mon, 13 Nov 2023 21:31 UTC

On 11/13/2023 11:29 AM, MitchAlsup wrote:
> EricP wrote:
>
>> Kent Dickey wrote:
>>> In article <uirqj3$9q9q$1@dont-email.me>,
>>> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>>> On 11/12/2023 4:20 PM, MitchAlsup wrote:
>>>>> Chris M. Thomasson wrote:
>>>>>
>>>>>> On 11/12/2023 2:33 PM, Kent Dickey wrote:
>>>>>>> In article <uiri0a$85mp$2@dont-email.me>,
>>>>>>> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>>>>>>
>>>>>>>> A highly relaxed memory model can be beneficial for certain
>>>>>>>> workloads.
>>>>>>> I know a lot of people believe that statement to be true.  In
>>>>>>> general, it
>>>>>>> is assumed to be true without proof.
>>>>>>>
>>>>>>> I believe that statement to be false.  Can you describe some of
>>>>>>> these
>>>>>>> workloads?
>>>>>> Also, think about converting any sound lock-free algorithm's
>>>>>> finely tuned memory barriers to _all_ sequential consistency...
>>>>>> That would ruin performance right off the bat... Think about it.
>>>>> <
>>>>> Assuming you are willing to accept the wrong answer fast, rather
>>>>> than the right answer later. There are very few algorithms with
>>>>> this property.
>>>> That does not make any sense to me. Think of a basic mutex. It
>>>> basically requires an acquire membar for the lock and a release
>>>> membar for the unlock. On SPARC that would be:
>>>>
>>>> acquire = MEMBAR #LoadStore | #LoadLoad
>>>>
>>>> release = MEMBAR #LoadStore | #StoreStore
>>>>
>>>> Okay, fine. However, if I made them sequentially consistent, it
>>>> would require a damn #StoreLoad barrier for both acquire and
>>>> release. This is not good and should be avoided when possible.
>>>>
>>>> Also, RCU prides itself with not having to use any memory barriers
>>>> for its read side. If RCU was forced to use a seq cst, basically
>>>> LOCKED RMW or MFENCE on Intel, it would completely ruin its
>>>> performance.
[...]
> < I suspect SUN lost significant performance by always running TSO and
> it still required barrier instructions.
[...]

Intel still requires an explicit membar for hazard pointers as-is. SPARC
in TSO mode still requires a membar for this: SPARC needs a #StoreLoad
for the store-followed-by-a-load-to-another-location relationship to
hold, and Intel needs a LOCK'ed atomic or MFENCE to handle it.
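The store-then-load pattern at issue can be sketched with C++ std::atomic. This is only an illustration, not any particular library's hazard-pointer API; the names (`shared`, `hazard`, `acquire_hazard`) are hypothetical:

```cpp
#include <atomic>

// Single-slot hazard-pointer acquire. The seq_cst store must become
// globally visible before the validating re-load below; that is exactly
// the #StoreLoad ordering (MEMBAR #StoreLoad on SPARC, a LOCK'ed RMW or
// MFENCE on x86) that TSO alone does not provide.
std::atomic<int*> shared{nullptr};  // pointer being protected
std::atomic<int*> hazard{nullptr};  // this thread's hazard slot

int* acquire_hazard() {
    int* p;
    do {
        p = shared.load(std::memory_order_acquire);
        hazard.store(p, std::memory_order_seq_cst);  // store ...
    } while (shared.load(std::memory_order_acquire) != p);  // ... then load
    return p;
}
```

The loop retries until the hazard slot provably covers the pointer the reclaimer would see.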

Re: Memory dependency microbenchmark

<uiu4t5$t4c2$2@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=34982&group=comp.arch#34982

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Mon, 13 Nov 2023 13:36:37 -0800
Organization: A noiseless patient Spider
Lines: 43
Message-ID: <uiu4t5$t4c2$2@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<uiri0a$85mp$2@dont-email.me> <uirjr7$8dlh$2@dont-email.me>
<82b3b3b710652e607dac6cec2064c90b@news.novabbs.com>
<uisdmn$gd4s$2@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 13 Nov 2023 21:36:37 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="55c5ba4d49df2a7646321f8f0cd32f12";
logging-data="954754"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/LRLZhBb4kVNvMjTMtzh9e+1ZImx3IWeA="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:41cM1wyJp6bWVbY01qx0McYVSyk=
Content-Language: en-US
In-Reply-To: <uisdmn$gd4s$2@dont-email.me>
 by: Chris M. Thomasson - Mon, 13 Nov 2023 21:36 UTC

On 11/12/2023 9:54 PM, Kent Dickey wrote:
> In article <82b3b3b710652e607dac6cec2064c90b@news.novabbs.com>,
> MitchAlsup <mitchalsup@aol.com> wrote:
>> Kent Dickey wrote:
>>
>>> In article <uiri0a$85mp$2@dont-email.me>,
>>> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>
>>>> A highly relaxed memory model can be beneficial for certain workloads.
>>
>>> I know a lot of people believe that statement to be true. In general, it
>>> is assumed to be true without proof.
>> <
>> In its most general case, relaxed order only provides a performance advantage
>> when the code is single threaded.
>
> I believe a Relaxed Memory model provides a small performance improvement
> ONLY to simple in-order CPUs in an MP system (if you're a single CPU,
> there's nothing to order).
>
> Relazed Memory ordering provides approximately zero performance improvement
> to an OoO CPU, and in fact, might actually lower performance (depends on
> how barriers are done--if done poorly, it could be a big negative).
>
> Yes, the system designers of the world have said: let's slow down our
> fastest most expensive most profitable CPUs, so we can speed up our cheapest
> lowest profit CPUs a few percent, and push a ton of work onto software
> developers.
>
> It's crazy.
>
>>> I believe that statement to be false. Can you describe some of these
>>> workloads?
>> <
>> Relaxed memory order fails spectacularly when multiple threads are accessing
>> data.
>
> Probably need to clarify with "accessing modified data".
>
> Kent

Huh? So, C++ is crazy for allowing std::memory_order_relaxed to even
exist? I must be misunderstanding your point here. Sorry if I am. ;^o
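For what it's worth, the textbook justification for relaxed atomics is something like a plain event counter, where atomicity is needed but ordering is not (a hedged sketch; `hits` and `record` are made-up names):

```cpp
#include <atomic>

// An event counter read only after all threads have joined: the
// increments must be atomic, but no cross-thread ordering is implied,
// so relaxed suffices and no barrier is ever issued.
std::atomic<long> hits{0};

void record() {
    hits.fetch_add(1, std::memory_order_relaxed);
}
```

Forcing seq_cst here would turn every increment into a fully fenced operation for no semantic gain.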

Re: Memory dependency microbenchmark

<u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>


https://news.novabbs.org/devel/article-flat.php?id=34997&group=comp.arch#34997

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!border-2.nntp.ord.giganews.com!nntp.giganews.com!Xl.tags.giganews.com!local-1.nntp.ord.giganews.com!nntp.supernews.com!news.supernews.com.POSTED!not-for-mail
NNTP-Posting-Date: Tue, 14 Nov 2023 10:25:22 +0000
Sender: Andrew Haley <aph@zarquon.pink>
From: aph@littlepinkcloud.invalid
Subject: Re: Memory dependency microbenchmark
Newsgroups: comp.arch
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <uirke6$8hef$3@dont-email.me> <36011a9597060e08d46db0eddfed0976@news.novabbs.com> <uirqj3$9q9q$1@dont-email.me> <uisd4b$gd4s$1@dont-email.me> <84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com> <uiu3dp$svfh$1@dont-email.me>
User-Agent: tin/1.9.2-20070201 ("Dalaruan") (UNIX) (Linux/4.18.0-477.27.1.el8_8.x86_64 (x86_64))
Message-ID: <u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>
Date: Tue, 14 Nov 2023 10:25:22 +0000
Lines: 30
X-Trace: sv3-sDfhPJU2ldDvUDv/igFMcPs5Ew65kSX3TdDUsT+spNjESG1dbA7Vt7SxSFyHNLiHp+MWEmrk7FCNc91!FONaaxTMR59Ynx8I3pqRZHiIvk1UU137TgeYaz/u7XfYTtl5giQzIa3AvyGq/XcCU22JC3o2UypT!DOyJ2mNOHDs=
X-Complaints-To: www.supernews.com/docs/abuse.html
X-DMCA-Complaints-To: www.supernews.com/docs/dmca.html
X-Abuse-and-DMCA-Info: Please be sure to forward a copy of ALL headers
X-Abuse-and-DMCA-Info: Otherwise we will be unable to process your complaint properly
X-Postfilter: 1.3.40
 by: aph@littlepinkcloud.invalid - Tue, 14 Nov 2023 10:25 UTC

Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
> On 11/13/2023 3:49 AM, aph@littlepinkcloud.invalid wrote:
>>
>> I don't think that's really true. The reorderings we see in currently-
>> produced hardware are, more or less, a subset of the same reorderings
>> that C compilers perform. Therefore, if you see a confusing hardware
>> reordering in a multi-threaded C program it may well be (probably is!)
>> a bug (according to the C standard) *even on a TSO machine*. The only
>> common counter-example to this is for volatile accesses.
>
> Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*

Maybe I wasn't clear enough. If you use std::atomic and
std::memory_order_* in such a way that there are no data races, your
concurrent program will be fine on both TSO and relaxed memory
ordering. If you try to fix data races with volatile instead of
std::atomic and std::memory_order_*, that'll mostly fix things on a
TSO machine, but not on a machine with relaxed memory ordering.

(For pedants: Mostly, but not completely, even on TSO, e.g. Dekker's
Algorithm, which needs something stronger.)

Because of this, the assertion that programming a non-TSO machine is
"harder" doesn't IMO stand up, at least in C programs, because the
same data-race bugs can manifest themselves as either compiler
optimizations or hardware reorderings. And a compiler can, at least in
theory, can do things that are far weirder than any memory system
does.
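A minimal sketch of the race-free case described above, with assumed names: a release store paired with an acquire load publishes data correctly on both TSO and relaxed-memory machines, with no further barriers needed:

```cpp
#include <atomic>
#include <thread>

std::atomic<int> data{0};
std::atomic<bool> ready{false};

// The release store pairs with the acquire load, so once the consumer
// observes ready == true it is guaranteed to observe data == 42 --
// on x86/TSO and on weakly ordered machines alike.
void producer() {
    data.store(42, std::memory_order_relaxed);
    ready.store(true, std::memory_order_release);
}

int consumer() {
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    return data.load(std::memory_order_relaxed);
}
```

Replacing the atomics with `volatile` would leave the compiler ordering mostly intact but provide no hardware ordering on a relaxed machine.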

Andrew.

Re: Memory dependency microbenchmark

<uj0m7s$1ch3f$4@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=35002&group=comp.arch#35002

Path: i2pn2.org!rocksolid2!news.neodome.net!news.mixmin.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Tue, 14 Nov 2023 12:44:44 -0800
Organization: A noiseless patient Spider
Lines: 45
Message-ID: <uj0m7s$1ch3f$4@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<uirke6$8hef$3@dont-email.me>
<36011a9597060e08d46db0eddfed0976@news.novabbs.com>
<uirqj3$9q9q$1@dont-email.me> <uisd4b$gd4s$1@dont-email.me>
<84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com>
<uiu3dp$svfh$1@dont-email.me>
<u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 14 Nov 2023 20:44:44 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="6b2b7e169255d8eb4f38ce399d6fe008";
logging-data="1459311"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/y0S8bO7YohPV+SiU64xUV4Qr2w4ciOGI="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:Dhp1H491SB5e6brVLqGPAiJ5nqI=
In-Reply-To: <u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>
Content-Language: en-US
 by: Chris M. Thomasson - Tue, 14 Nov 2023 20:44 UTC

On 11/14/2023 2:25 AM, aph@littlepinkcloud.invalid wrote:
> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>> On 11/13/2023 3:49 AM, aph@littlepinkcloud.invalid wrote:
>>>
>>> I don't think that's really true. The reorderings we see in currently-
>>> produced hardware are, more or less, a subset of the same reorderings
>>> that C compilers perform. Therefore, if you see a confusing hardware
>>> reordering in a multi-threaded C program it may well be (probably is!)
>>> a bug (according to the C standard) *even on a TSO machine*. The only
>>> common counter-example to this is for volatile accesses.
>>
>> Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
>
> Maybe I wasn't clear enough.

Well, I think I was being a bit too dense here and missed your main
point. Sorry.

> If you use std::atomic and
> std::memory_order_* in such a way that there are no data races, your
> concurrent program will be fine on both TSO and relaxed memory
> ordering.

Agreed.

> If you try to fix data races with volatile instead of
> std::atomic and std::memory_order_*, that'll mostly fix things on a
> TSO machine, but not on a machine with relaxed memory ordering.
>
> (For pedants: Mostly, but not completely, even on TSO, e.g. Dekker's
> Algorithm, which needs something stronger.)

Indeed, it does.

> Because of this, the assertion that programming a non-TSO machine is
> "harder" doesn't IMO stand up, at least in C programs, because the
> same data-race bugs can manifest themselves as either compiler
> optimizations or hardware reorderings. And a compiler can, at least in
> theory, can do things that are far weirder than any memory system
> does.

Re: Memory dependency microbenchmark

<AwX4N.38006$yvY5.31401@fx10.iad>


https://news.novabbs.org/devel/article-flat.php?id=35005&group=comp.arch#35005

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!3.eu.feeder.erje.net!1.us.feeder.erje.net!feeder.erje.net!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx10.iad.POSTED!not-for-mail
Newsgroups: comp.arch
From: branimir.maksimovic@icloud.com (Branimir Maksimovic)
Subject: Re: Memory dependency microbenchmark
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<uirke6$8hef$3@dont-email.me>
<36011a9597060e08d46db0eddfed0976@news.novabbs.com>
<uirqj3$9q9q$1@dont-email.me> <uisd4b$gd4s$1@dont-email.me>
<sQt4N.2604$rx%7.497@fx47.iad>
<0f58091b40e44bd01e446dd8335e647e@news.novabbs.com>
<uiu4ji$t4c2$1@dont-email.me>
User-Agent: slrn/1.0.3 (Darwin)
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Lines: 62
Message-ID: <AwX4N.38006$yvY5.31401@fx10.iad>
X-Complaints-To: abuse@usenet-news.net
NNTP-Posting-Date: Wed, 15 Nov 2023 04:10:08 UTC
Organization: usenet-news.net
Date: Wed, 15 Nov 2023 04:10:08 GMT
X-Received-Bytes: 3521
 by: Branimir Maksimovic - Wed, 15 Nov 2023 04:10 UTC

On 2023-11-13, Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
> On 11/13/2023 11:29 AM, MitchAlsup wrote:
>> EricP wrote:
>>
>>> Kent Dickey wrote:
>>>> In article <uirqj3$9q9q$1@dont-email.me>,
>>>> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>>>> On 11/12/2023 4:20 PM, MitchAlsup wrote:
>>>>>> Chris M. Thomasson wrote:
>>>>>>
>>>>>>> On 11/12/2023 2:33 PM, Kent Dickey wrote:
>>>>>>>> In article <uiri0a$85mp$2@dont-email.me>,
>>>>>>>> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> A highly relaxed memory model can be beneficial for certain
>>>>>>>>> workloads.
>>>>>>>> I know a lot of people believe that statement to be true.  In
>>>>>>>> general, it
>>>>>>>> is assumed to be true without proof.
>>>>>>>>
>>>>>>>> I believe that statement to be false.  Can you describe some of
>>>>>>>> these
>>>>>>>> workloads?
>>>>>>> Also, think about converting any sound lock-free algorithm's
>>>>>>> finely tuned memory barriers to _all_ sequential consistency...
>>>>>>> That would ruin performance right off the bat... Think about it.
>>>>>> <
>>>>>> Assuming you are willing to accept the wrong answer fast, rather
>>>>>> than the right answer later. There are very few algorithms with
>>>>>> this property.
>>>>> That does not make any sense to me. Think of a basic mutex. It
>>>>> basically requires an acquire membar for the lock and a release
>>>>> membar for the unlock. On SPARC that would be:
>>>>>
>>>>> acquire = MEMBAR #LoadStore | #LoadLoad
>>>>>
>>>>> release = MEMBAR #LoadStore | #StoreStore
>>>>>
>>>>> Okay, fine. However, if I made them sequentially consistent, it
>>>>> would require a damn #StoreLoad barrier for both acquire and
>>>>> release. This is not good and should be avoided when possible.
>>>>>
>>>>> Also, RCU prides itself with not having to use any memory barriers
>>>>> for its read side. If RCU was forced to use a seq cst, basically
>>>>> LOCKED RMW or MFENCE on Intel, it would completely ruin its
>>>>> performance.
> [...]
>> < I suspect SUN lost significant performance by always running TSO and
>> it still required barrier instructions.
> [...]
>
> Intel still requires an explicit membar for hazard pointers as-is. Sparc
> in TSO mode still requires a membar for this. Spard needs a #StoreLoad
> wrt the store followed by a load to another location relationship to
> hold. Intel needs a LOCK'ed atomic or MFENCE to handle this.
I think that the Apple M1 requires one, too. I had problems without a membar.

--

7-77-777, Evil Sinner!
https://www.linkedin.com/in/branimir-maksimovic-6762bbaa/

Re: Memory dependency microbenchmark

<2023Nov15.193240@mips.complang.tuwien.ac.at>


https://news.novabbs.org/devel/article-flat.php?id=35009&group=comp.arch#35009

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Wed, 15 Nov 2023 18:32:40 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 38
Message-ID: <2023Nov15.193240@mips.complang.tuwien.ac.at>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
Injection-Info: dont-email.me; posting-host="8fb31677718b54c291a8a54dcef69b93";
logging-data="1963990"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX185pTm5aoRKhpzbIZwonx+h"
Cancel-Lock: sha1:d2qHPtontkErltkxI8UHpBSbHCw=
X-newsreader: xrn 10.11
 by: Anton Ertl - Wed, 15 Nov 2023 18:32 UTC

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>I have written a microbenchmark for measuring how memory dependencies
>affect the performance of various microarchitectures. You can find it
>along with a description and results on
><http://www.complang.tuwien.ac.at/anton/memdep/>.

I have now added parameter combinations for detecting something that I
currently call memory renaming: If the microarchitecture can overlap
independent accesses to the same memory location (analogous to
register renaming).

Does anybody know if there is a commonly-used term for this feature?
I would also be interested in a commonly-used term for bypassing the
store buffer in store-to-load forwarding (as we see in Zen3 and Tiger
Lake).

Concerning memory renaming, it turns out that Intel added it in
Nehalem, and AMD added it between K10 and Excavator (probably in the
Bulldozer). ARM has not implemented this in any microarchitecture I
have measured (up to Cortex-A76), and Apple has it in Firestorm (M1
P-core), but not in Icestorm (M1 E-Core).

One interesting benefit of memory renaming is that when you are bad at
register allocation, and keep local variables in memory (or virtual
machine registers, or virtual machine stack items), independent
computations that use the same local variable still can execute in
parallel. Say, stuff like

a = b[1]; c = a+1;
a = b[2]; d = a-1;

The hardware can execute them in parallel, even if a is located in
memory.
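A hedged sketch of that pattern (hypothetical code, not Anton's actual microbenchmark): `volatile` keeps `a` in memory, defeating register allocation, and a core with memory renaming can still overlap the two independent dependence chains:

```cpp
// Two independent computations reuse the same memory slot 'a'.
// With memory renaming the second chain need not wait for the first
// store to the slot; without it, the load of 'a' in the second chain
// serializes behind the preceding store.
long reuse_slot(const long* b) {
    volatile long a;            // force 'a' to live in memory
    a = b[1]; long c = a + 1;   // first use of the slot
    a = b[2]; long d = a - 1;   // independent second use
    return c + d;
}
```

Timing many iterations of such a loop and comparing against a version that uses two distinct slots would expose whether the core renames the location.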

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Memory dependency microbenchmark

<3c1a0bb6cdcb6359b59e207253273f6e@news.novabbs.com>


https://news.novabbs.org/devel/article-flat.php?id=35011&group=comp.arch#35011

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Wed, 15 Nov 2023 19:05:42 +0000
Organization: novaBBS
Message-ID: <3c1a0bb6cdcb6359b59e207253273f6e@news.novabbs.com>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <2023Nov15.193240@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="1024757"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
X-Rslight-Site: $2y$10$RjrckGBpgdkvFt9Uv22w6.j0w4.veSXuvpyPYEDVB3zk3CEo2U6Vi
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
 by: MitchAlsup - Wed, 15 Nov 2023 19:05 UTC

Anton Ertl wrote:

> anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>>I have written a microbenchmark for measuring how memory dependencies
>>affect the performance of various microarchitectures. You can find it
>>along with a description and results on
>><http://www.complang.tuwien.ac.at/anton/memdep/>.

> I have now added parameter combinations for detecting something that I
> currently call memory renaming: If the microarchitecture can overlap
> independent accesses to the same memory location (analogous to
> register renaming).

> Does anybody know if there is a commonly-used term for this feature?
<
I use the term Conditional Cache, I have heard others call it a Memory
Reorder Buffer.
<
> I would also be interested in a commonly-used term for bypassing the
> store buffer in store-to-load forwarding (as we see in Zen3 and Tiger
> Lake).
<
Store-to-Load-Forwarding is the only term I know here.
<
> Concerning memory renaming, it turns out that Intel added it in
> Nehalem, and AMD added it between K10 and Excavator (probably in the
> Bulldozer). ARM has not implemented this in any microarchitecture I
> have measured (up to Cortex-A76), and Apple has it in Firestorm (M1
> P-core), but not in Icestorm (M1 E-Core).

> One interesting benefit of memory renaming is that when you are bad at
> register allocation,
<
Or bad at alias analysis...
<
> and keep local variables in memory (or virtual
> machine registers, or virtual machine stack items), independent
> computations that use the same local variable still can execute in
> parallel. Say, stuff like

> a = b[1]; c = a+1;
> a = b[2]; d = a-1;

> The hardware can execute them in parallel, even if a is located in
> memory.
<
And why not, they are independent containers.

> - anton

Re: Memory dependency microbenchmark

<uj39sg$1stm2$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=35016&group=comp.arch#35016

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: kegs@provalid.com (Kent Dickey)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Wed, 15 Nov 2023 20:32:16 -0000 (UTC)
Organization: provalid.com
Lines: 50
Message-ID: <uj39sg$1stm2$1@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com> <uiu3dp$svfh$1@dont-email.me> <u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>
Injection-Date: Wed, 15 Nov 2023 20:32:16 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="f47d7e43317492e428cf8650bfdda4fb";
logging-data="1996482"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/C5aSptmahiWnEYaSzbwWB"
Cancel-Lock: sha1:XdWwkf/eGDXn+JvKaTOa+CotxjM=
X-Newsreader: trn 4.0-test76 (Apr 2, 2001)
Originator: kegs@provalid.com (Kent Dickey)
 by: Kent Dickey - Wed, 15 Nov 2023 20:32 UTC

In article <u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>,
<aph@littlepinkcloud.invalid> wrote:
>Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>> On 11/13/2023 3:49 AM, aph@littlepinkcloud.invalid wrote:
>>>
>>> I don't think that's really true. The reorderings we see in currently-
>>> produced hardware are, more or less, a subset of the same reorderings
>>> that C compilers perform. Therefore, if you see a confusing hardware
>>> reordering in a multi-threaded C program it may well be (probably is!)
>>> a bug (according to the C standard) *even on a TSO machine*. The only
>>> common counter-example to this is for volatile accesses.
>>
>> Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
>
>Maybe I wasn't clear enough. If you use std::atomic and
>std::memory_order_* in such a way that there are no data races, your
>concurrent program will be fine on both TSO and relaxed memory
>ordering. If you try to fix data races with volatile instead of
>std::atomic and std::memory_order_*, that'll mostly fix things on a
>TSO machine, but not on a machine with relaxed memory ordering.

What you are saying is:

As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.

If you assume you've already solved the problem, then you find the
problem is solved! Magic!

What I'm arguing is: the CPU should behave as if memory_order_seq_cst
is set on all accesses with no special trickery. This acquire/release
nonsense is all weakly ordered brain damage. The problem is on weakly
ordered CPUs, performance definitely does matter in terms of getting this
stuff right, but that's their problem. Being weakly ordered makes them
slower when they have to execute barriers for correctness, but it's the
barriers themselves that are the slowdown, not ordering the requests
properly.

If the CPU takes ownership of ordering, then the only rule is: you just
have to use atomic properly (even then, you can often get away with
volatile for most producer/consumer cases), and these subtypes for all
accesses don't matter for correctness or performance.

It also would be nice if multithreaded programs could be written in C99,
or pre-C++11. It's kinda surprising we've only been able to write threaded
programs for about 10 years.

Kent

Re: Memory dependency microbenchmark

<uj3af5$1skku$9@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=35017&group=comp.arch#35017

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Wed, 15 Nov 2023 12:42:13 -0800
Organization: A noiseless patient Spider
Lines: 44
Message-ID: <uj3af5$1skku$9@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com>
<uiu3dp$svfh$1@dont-email.me>
<u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>
<uj39sg$1stm2$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 15 Nov 2023 20:42:13 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="c1eb1e7d2816baa503f549a694bca0e3";
logging-data="1987230"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/XKVaJ3RNS2UtziqEPeR7nEVZFJxBoMxk="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:PQ9uPPUX4PRWw/SigtYZygjeneM=
In-Reply-To: <uj39sg$1stm2$1@dont-email.me>
Content-Language: en-US
 by: Chris M. Thomasson - Wed, 15 Nov 2023 20:42 UTC

On 11/15/2023 12:32 PM, Kent Dickey wrote:
> In article <u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>,
> <aph@littlepinkcloud.invalid> wrote:
>> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>> On 11/13/2023 3:49 AM, aph@littlepinkcloud.invalid wrote:
>>>>
>>>> I don't think that's really true. The reorderings we see in currently-
>>>> produced hardware are, more or less, a subset of the same reorderings
>>>> that C compilers perform. Therefore, if you see a confusing hardware
>>>> reordering in a multi-threaded C program it may well be (probably is!)
>>>> a bug (according to the C standard) *even on a TSO machine*. The only
>>>> common counter-example to this is for volatile accesses.
>>>
>>> Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
>>
>> Maybe I wasn't clear enough. If you use std::atomic and
>> std::memory_order_* in such a way that there are no data races, your
>> concurrent program will be fine on both TSO and relaxed memory
>> ordering. If you try to fix data races with volatile instead of
>> std::atomic and std::memory_order_*, that'll mostly fix things on a
>> TSO machine, but not on a machine with relaxed memory ordering.
>
> What you are saying is:
>
> As long as you fully analyze your program, ensure all multithreaded accesses
> are only through atomic variables, and you label every access to an
> atomic variable properly (although my point is: exactly what should that
> be??), then there is no problem.
>
> If you assume you've already solved the problem, then you find the
> problem is solved! Magic!
>
> What I'm arguing is: the CPU should behave as if memory_order_seq_cst
> is set on all accesses with no special trickery. This acquire/release
> nonsense is all weakly ordered brain damage.

Huh? Just because you have a problem with memory ordering does not mean
that all std::memory_order_* should default to seq_cst! You are being
radically obtuse here. It is as if you have no idea what you are
writing about.

[...]

Re: Memory dependency microbenchmark

<394f2ed4be2eeb98d7d6c24735182b86@news.novabbs.com>


https://news.novabbs.org/devel/article-flat.php?id=35018&group=comp.arch#35018

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Wed, 15 Nov 2023 20:56:39 +0000
Organization: novaBBS
Message-ID: <394f2ed4be2eeb98d7d6c24735182b86@news.novabbs.com>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com> <uiu3dp$svfh$1@dont-email.me> <u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com> <uj39sg$1stm2$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="1034612"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Rslight-Site: $2y$10$oy/LBzZ12LLEt52VVn/bwehvOux/u5kQ1UwkXG8FqDQY/kko1hYv2
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
 by: MitchAlsup - Wed, 15 Nov 2023 20:56 UTC

Kent Dickey wrote:

> In article <u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>,
> <aph@littlepinkcloud.invalid> wrote:
>>Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>> On 11/13/2023 3:49 AM, aph@littlepinkcloud.invalid wrote:
>>>>
>>>> I don't think that's really true. The reorderings we see in currently-
>>>> produced hardware are, more or less, a subset of the same reorderings
>>>> that C compilers perform. Therefore, if you see a confusing hardware
>>>> reordering in a multi-threaded C program it may well be (probably is!)
>>>> a bug (according to the C standard) *even on a TSO machine*. The only
>>>> common counter-example to this is for volatile accesses.
>>>
>>> Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
>>
>>Maybe I wasn't clear enough. If you use std::atomic and
>>std::memory_order_* in such a way that there are no data races, your
>>concurrent program will be fine on both TSO and relaxed memory
>>ordering. If you try to fix data races with volatile instead of
>>std::atomic and std::memory_order_*, that'll mostly fix things on a
>>TSO machine, but not on a machine with relaxed memory ordering.

> What you are saying is:

> As long as you fully analyze your program, ensure all multithreaded accesses
> are only through atomic variables, and you label every access to an
> atomic variable properly (although my point is: exactly what should that
> be??), then there is no problem.

> If you assume you've already solved the problem, then you find the
> problem is solved! Magic!

> What I'm arguing is: the CPU should behave as if memory_order_seq_cst
> is set on all accesses with no special trickery.
<
Should appear as if.....not behave as if.
<
> This acquire/release
> nonsense is all weakly ordered brain damage.
<
Agreed ...
<
> The problem is on weakly
> ordered CPUs, performance definitely does matter in terms of getting this
> stuff right, but that's their problem.
<
Not necessarily !! If you have a weakly ordered machine and start doing
something ATOMIC, the processor can switch to sequential consistency upon
detecting the start of the ATOMIC event (an LL, for example), stay SC until
the ATOMIC event is done, then revert back to weakly ordered as it pleases.
The single-threaded code gets its performance while the multithreaded code
gets its SC (as Lamport demonstrated was necessary).
<
> Being weakly ordered makes them
> slower when they have to execute barriers for correctness, but it's the
> barriers themselves that are the slow down, not ordering the requests
> properly.
<
But you see, doing it my way gets rid of the MemBar instructions but not
their necessary effects. In addition, in my model, every access within
an ATOMIC event is SC, not just a MemBar at the front and end.
<
> If the CPU takes ownership of ordering, then the only rule is: you just
> have to use atomic properly (even then, you can often get away with
> volatile for most producer/consumer cases), and these subtypes for all
> accesses don't matter for correctness or performance.
<
But a CPU is not in a position to determine Memory Order; the (multiplicity
of) memory controllers do. The CPUs just have to figure out how to put up with
the imposed order based on the kind of things the CPU is doing at that instant.
<
> It also would be nice if multithreaded programs could be written in C99,
> or pre-C++11. It's kinda surprising we've only been able to write threaded
> programs for about 10 years.
<
I wrote multithreaded programs on an 8085.....
<
> Kent

Re: Memory dependency microbenchmark

<uj3bgm$1t58r$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=35019&group=comp.arch#35019

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Wed, 15 Nov 2023 13:00:06 -0800
Organization: A noiseless patient Spider
Lines: 91
Message-ID: <uj3bgm$1t58r$1@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<84WdncuP3P3WkM_4nZ2dnZfqn_udnZ2d@supernews.com>
<uiu3dp$svfh$1@dont-email.me>
<u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>
<uj39sg$1stm2$1@dont-email.me>
<394f2ed4be2eeb98d7d6c24735182b86@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 15 Nov 2023 21:00:07 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="c1eb1e7d2816baa503f549a694bca0e3";
logging-data="2004251"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18/Q5k3an5+tKoJjYzXi095571lxjtWFi4="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:LZRCRUUX6x8kwBt7PggaCHTNZR4=
Content-Language: en-US
In-Reply-To: <394f2ed4be2eeb98d7d6c24735182b86@news.novabbs.com>
 by: Chris M. Thomasson - Wed, 15 Nov 2023 21:00 UTC

On 11/15/2023 12:56 PM, MitchAlsup wrote:
> Kent Dickey wrote:
>
>> In article <u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>,
>>  <aph@littlepinkcloud.invalid> wrote:
>>> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>>> On 11/13/2023 3:49 AM, aph@littlepinkcloud.invalid wrote:
>>>>>
>>>>> I don't think that's really true. The reorderings we see in currently-
>>>>> produced hardware are, more or less, a subset of the same reorderings
>>>>> that C compilers perform. Therefore, if you see a confusing hardware
>>>>> reordering in a multi-threaded C program it may well be (probably is!)
>>>>> a bug (according to the C standard) *even on a TSO machine*. The only
>>>>> common counter-example to this is for volatile accesses.
>>>>
>>>> Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
>>>
>>> Maybe I wasn't clear enough. If you use std::atomic and
>>> std::memory_order_* in such a way that there are no data races, your
>>> concurrent program will be fine on both TSO and relaxed memory
>>> ordering. If you try to fix data races with volatile instead of
>>> std::atomic and std::memory_order_*, that'll mostly fix things on a
>>> TSO machine, but not on a machine with relaxed memory ordering.
>
>> What you are saying is:
>
>> As long as you fully analyze your program, ensure all multithreaded
>> accesses
>> are only through atomic variables, and you label every access to an
>> atomic variable properly (although my point is: exactly what should that
>> be??), then there is no problem.
>
>> If you assume you've already solved the problem, then you find the
>> problem is solved!  Magic!
>
>> What I'm arguing is: the CPU should behave as if memory_order_seq_cst
>> is set on all accesses with no special trickery.
> <
> Should appear as if.....not behave as if.
> <
>>                                                  This acquire/release
>> nonsense is all weakly ordered brain damage.
> <
> Agreed ...
> <

Why do you "seem" to think it's brain damage? Knowing how to use them
properly is a good thing.

>>                                               The problem is on weakly
>> ordered CPUs, performance definitely does matter in terms of getting this
>> stuff right, but that's their problem.
> <
> Not necessarily !! If you have a weakly ordered machine and start doing
> something ATOMIC, the processor can switch to sequential consistency upon
> detecting the start of the ATOMIC event (an LL, for example), stay SC until
> the ATOMIC event is done, then revert back to weakly ordered as it pleases.
> The single-threaded code gets its performance while the multithreaded code
> gets its SC (as Lamport demonstrated was necessary).
> <
>>                                         Being weakly ordered makes them
>> slower when they have to execute barriers for correctness, but it's the
>> barriers themselves that are the slow down, not ordering the requests
>> properly.
> <
> But you see, doing it my way gets rid of the MemBar instructions but not
> their necessary effects. In addition, in my model, every access within
> an ATOMIC event is SC, not just a MemBar at the front and end.
> <
>> If the CPU takes ownership of ordering, then the only rule is: you just
>> have to use atomic properly (even then, you can often get away with
>> volatile for most producer/consumer cases), and these subtypes for all
>> accesses don't matter for correctness or performance.
> <
> But a CPU is not in a position to determine Memory Order; the (multiplicity
> of) memory controllers do. The CPUs just have to figure out how to put up
> with the imposed order based on the kind of things the CPU is doing at that
> instant.
> <
>> It also would be nice if multithreaded programs could be written in C99,
>> or pre-C++11.  It's kinda surprising we've only been able to write
>> threaded
>> programs for about 10 years.
> <
> I wrote multithreaded programs on an 8085.....
> <
>> Kent

Re: Memory dependency microbenchmark

<uj3c29$1t9an$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=35021&group=comp.arch#35021

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: kegs@provalid.com (Kent Dickey)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Wed, 15 Nov 2023 21:09:30 -0000 (UTC)
Organization: provalid.com
Lines: 91
Message-ID: <uj3c29$1t9an$1@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at> <82b3b3b710652e607dac6cec2064c90b@news.novabbs.com> <uisdmn$gd4s$2@dont-email.me> <uiu4t5$t4c2$2@dont-email.me>
Injection-Date: Wed, 15 Nov 2023 21:09:30 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="f47d7e43317492e428cf8650bfdda4fb";
logging-data="2008407"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19LhwJH/W54WDcmhounQDs3"
Cancel-Lock: sha1:sJN0ud+cL7vL5wbg6AG7vy0Q5AQ=
Originator: kegs@provalid.com (Kent Dickey)
X-Newsreader: trn 4.0-test76 (Apr 2, 2001)
 by: Kent Dickey - Wed, 15 Nov 2023 21:09 UTC

In article <uiu4t5$t4c2$2@dont-email.me>,
Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>On 11/12/2023 9:54 PM, Kent Dickey wrote:
>> In article <82b3b3b710652e607dac6cec2064c90b@news.novabbs.com>,
>> MitchAlsup <mitchalsup@aol.com> wrote:
>>> Kent Dickey wrote:
>>>
>>>> In article <uiri0a$85mp$2@dont-email.me>,
>>>> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>>
>>>>> A highly relaxed memory model can be beneficial for certain workloads.
>>>
>>>> I know a lot of people believe that statement to be true. In general, it
>>>> is assumed to be true without proof.
>>> <
>>> In its most general case, relaxed order only provides a performance advantage
>>> when the code is single threaded.
>>
>> I believe a Relaxed Memory model provides a small performance improvement
>> ONLY to simple in-order CPUs in an MP system (if you're a single CPU,
>> there's nothing to order).
>>
>> Relaxed Memory ordering provides approximately zero performance improvement
>> to an OoO CPU, and in fact, might actually lower performance (depends on
>> how barriers are done--if done poorly, it could be a big negative).
>>
>> Yes, the system designers of the world have said: let's slow down our
>> fastest most expensive most profitable CPUs, so we can speed up our cheapest
>> lowest profit CPUs a few percent, and push a ton of work onto software
>> developers.
>>
>> It's crazy.
>>
>>>> I believe that statement to be false. Can you describe some of these
>>>> workloads?
>>> <
>>> Relaxed memory order fails spectacularly when multiple threads are accessing
>>> data.
>>
>> Probably need to clarify with "accessing modified data".
>>
>> Kent
>
>Huh? So, C++ is crazy for allowing std::memory_order_relaxed to even
>exist? I must be misunderstanding your point here. Sorry if I am. ;^o

You have internalized weakly ordered memory, and you're having trouble
seeing beyond it.

CPUs with weakly ordered memory are the ones that need all those flags.
Yes, you need the flags if you want to use those CPUs. I'm pointing out:
we could all just require better memory ordering and get rid of all this
cruft. Give the flag, don't give the flag, the program is still correct
and works properly.

It's like FP denorms--it's generally been decided the hardware cost
to implement it is small, so hardware needs to support it at full speed.
No need to write code in a careful way to avoid denorms, to use funky CPU-
specific calls to turn on flush-to-0, etc., it just works, we move on to
other topics. But we still have flush-to-0 calls available--but you don't
need to bother to use them. In my opinion, memory ordering is much more
complex for programmers to handle. I maintain it's actually so
complex most people cannot get it right in software for non-trivial
interactions. I've found many hardware designers have a very hard time
reasoning about this as well when I report bugs (since the rules are so
complex and poorly described). There are over 100 pages describing memory
ordering in the Arm Architectural Reference Manual, and it is very
complex (Dependency through registers and memory; Basic Dependency;
Address Dependency; Data Dependency; Control Dependency; Pick Basic
dependency; Pick Address Dependency; Pick Data Dependency; Pick
Control Dependency, Pick Dependency...and this is just from the definition
of terms). It's all very abstract and difficult to follow. I'll be
honest: I do not understand all of these rules, and I don't care to.
I know how to implement a CPU, so I know what they've done, and that's
much simpler to understand. But writing a threaded application is much
more complex than it should be for software.

The cost to do TSO is some out-of-order tracking structures need to get
a little bigger, and some instructions have to stay in queues longer
(which is why they may need to get bigger), and allow re-issuing loads
which now have stale data. The difference between TSO and Sequential
Consistency is to just disallow loads seeing stores queued before they
write to the data cache (well, you can speculatively let loads happen,
but you need to be able to walk it back, which is not difficult). This
is why I say the performance cost is low--normal code missing caches and
not being pestered by other CPUs can run at the same speed. But when
other CPUs begin pestering us, the interference can all be worked out as
efficiently as possible using hardware, and barriers just do not
compete.

Kent

Re: Memory dependency microbenchmark

<uj3d0a$1tb8u$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=35023&group=comp.arch#35023

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chris.m.thomasson.1@gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Subject: Re: Memory dependency microbenchmark
Date: Wed, 15 Nov 2023 13:25:30 -0800
Organization: A noiseless patient Spider
Lines: 111
Message-ID: <uj3d0a$1tb8u$1@dont-email.me>
References: <2023Nov3.101558@mips.complang.tuwien.ac.at>
<82b3b3b710652e607dac6cec2064c90b@news.novabbs.com>
<uisdmn$gd4s$2@dont-email.me> <uiu4t5$t4c2$2@dont-email.me>
<uj3c29$1t9an$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 15 Nov 2023 21:25:30 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="c1eb1e7d2816baa503f549a694bca0e3";
logging-data="2010398"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+3OZyhhzMU978uRZCok+zKnYwj9dlNbCQ="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:j84QynO73Y8OxNxHZBpN8DmZWHI=
Content-Language: en-US
In-Reply-To: <uj3c29$1t9an$1@dont-email.me>
 by: Chris M. Thomasson - Wed, 15 Nov 2023 21:25 UTC

On 11/15/2023 1:09 PM, Kent Dickey wrote:
> In article <uiu4t5$t4c2$2@dont-email.me>,
> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>> On 11/12/2023 9:54 PM, Kent Dickey wrote:
>>> In article <82b3b3b710652e607dac6cec2064c90b@news.novabbs.com>,
>>> MitchAlsup <mitchalsup@aol.com> wrote:
>>>> Kent Dickey wrote:
>>>>
>>>>> In article <uiri0a$85mp$2@dont-email.me>,
>>>>> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>>>
>>>>>> A highly relaxed memory model can be beneficial for certain workloads.
>>>>
>>>>> I know a lot of people believe that statement to be true. In general, it
>>>>> is assumed to be true without proof.
>>>> <
>>>> In its most general case, relaxed order only provides a performance advantage
>>>> when the code is single threaded.
>>>
>>> I believe a Relaxed Memory model provides a small performance improvement
>>> ONLY to simple in-order CPUs in an MP system (if you're a single CPU,
>>> there's nothing to order).
>>>
>>> Relaxed Memory ordering provides approximately zero performance improvement
>>> to an OoO CPU, and in fact, might actually lower performance (depends on
>>> how barriers are done--if done poorly, it could be a big negative).
>>>
>>> Yes, the system designers of the world have said: let's slow down our
>>> fastest most expensive most profitable CPUs, so we can speed up our cheapest
>>> lowest profit CPUs a few percent, and push a ton of work onto software
>>> developers.
>>>
>>> It's crazy.
>>>
>>>>> I believe that statement to be false. Can you describe some of these
>>>>> workloads?
>>>> <
>>>> Relaxed memory order fails spectacularly when multiple threads are accessing
>>>> data.
>>>
>>> Probably need to clarify with "accessing modified data".
>>>
>>> Kent
>>
>> Huh? So, C++ is crazy for allowing std::memory_order_relaxed to even
>> exist? I must be misunderstanding your point here. Sorry if I am. ;^o
>
> You have internalized weakly ordered memory, and you're having trouble
> seeing beyond it.

Really? Don't project yourself onto me. Altering all of the memory
barriers of a finely tuned lock-free algorithm to seq_cst is VERY bad.

>
> CPUs with weakly ordered memory are the ones that need all those flags.
> Yes, you need the flags if you want to use those CPUs. I'm pointing out:
> we could all just require better memory ordering and get rid of all this
> cruft. Give the flag, don't give the flag, the program is still correct
> and works properly.

Huh? Just cruft? Wow. Just because it seems hard for you does not mean
we should eliminate it. Believe it or not, there are people out there
who know how to use memory barriers. I suppose you would use seq_cst to
load each node of a lock-free stack iteration in an RCU read-side region.
This is terrible! Really bad, bad, BAD! Afaict, it kinda, sorta seems
like you do not have all that much experience with them. Hmm...

>
> It's like FP denorms--it's generally been decided the hardware cost
> to implement it is small, so hardware needs to support it at full speed.
> No need to write code in a careful way to avoid denorms, to use funky CPU-
> specific calls to turn on flush-to-0, etc., it just works, we move on to
> other topics. But we still have flush-to-0 calls available--but you don't
> need to bother to use them. In my opinion, memory ordering is much more
> complex for programmers to handle. I maintain it's actually so
> complex most people cannot get it right in software for non-trivial
> interactions. I've found many hardware designers have a very hard time
> reasoning about this as well when I report bugs (since the rules are so
> complex and poorly described). There are over 100 pages describing memory
> ordering in the Arm Architectural Reference Manual, and it is very
> complex (Dependency through registers and memory; Basic Dependency;
> Address Dependency; Data Dependency; Control Dependency; Pick Basic
> dependency; Pick Address Dependency; Pick Data Dependency; Pick
> Control Dependency, Pick Dependency...and this is just from the definition
> of terms). It's all very abstract and difficult to follow. I'll be
> honest: I do not understand all of these rules, and I don't care to.
> I know how to implement a CPU, so I know what they've done, and that's
> much simpler to understand. But writing a threaded application is much
> more complex than it should be for software.
>
> The cost to do TSO is some out-of-order tracking structures need to get
> a little bigger, and some instructions have to stay in queues longer
> (which is why they may need to get bigger), and allow re-issuing loads
> which now have stale data. The difference between TSO and Sequential
> Consistency is to just disallow loads seeing stores queued before they
> write to the data cache (well, you can speculatively let loads happen,
> but you need to be able to walk it back, which is not difficult). This
> is why I say the performance cost is low--normal code missing caches and
> not being pestered by other CPUs can run at the same speed. But when
> other CPUs begin pestering us, the interference can all be worked out as
> efficiently as possible using hardware, and barriers just do not
> compete.

Having access to fine grain memory barriers is a very good thing. Of
course we can use C++ right now and make everything seq_cst, but that is
moronic. Why would you want to use seq_cst everywhere when you do not
have to? There are rather massive performance implications.

Are you thinking about a magic arch that we cannot use right now?

