Rocksolid Light - comp.lang.forth

Re: Shared memory

<2ec6da768657cb7e0838af11eb2d209e@www.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=26183&group=comp.lang.forth#26183

From: mhx@iae.nl (mhx)
Newsgroups: comp.lang.forth
Subject: Re: Shared memory
Date: Tue, 27 Feb 2024 23:41:10 +0000
Organization: novaBBS
Message-ID: <2ec6da768657cb7e0838af11eb2d209e@www.novabbs.com>
References: <73c2da86-b581-4519-bdb0-0c17df4d646en@googlegroups.com> <nnd$7cbc3a63$6f83f13a@6f30e38e1afa5ef4> <2e9e6cc3-b12a-4202-95ac-cecd9ab4f391n@googlegroups.com> <4c78905d-c2de-4342-b704-70fa022a857bn@googlegroups.com> <2023Jan21.161446@mips.complang.tuwien.ac.at> <7589123a-c72c-4499-bc8c-2026d9e6776dn@googlegroups.com>

by: mhx - Tue, 27 Feb 2024 23:41 UTC

I have been polishing my shared memory application (iSPICE) a bit more.
The benchmark I previously showed compared running a circuit simulation
with a variable number of communicating CPUs. Only a minimum amount of data
is shared (a page with published parameters and achieved results, plus the
ready! flags). With this setup I got about a factor of 3 improvement for
8 CPUs. I hoped to improve this factor a bit with better hardware and maybe
some software tweaking.

What I didn't try until today was checking how fast the circuit simulation
ran on a single CPU, *not* using the shared memory framework. And indeed,
that is a problem, in that without shared memory the runtime is *3 times
shorter* than with shared memory. In other words, there is no net gain in
having 8 mem-shared CPUs. As an additional check I started the circuit run
in 3 separate windows. They all achieved the same speed as the single run
non-shared version, proving that the hardware (cpu/memory/disk) is amply
sufficient to provide an 8 times speed-up.

I will now start working on Anton's suggestion of a shared file. Or maybe
I should try this on Linux first; perhaps shared memory works better there.

-marcel

Re: Shared memory

<e618341b72fce0fc25688b0e4f2b866f@www.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=26184&group=comp.lang.forth#26184

From: minforth@gmx.net (minforth)
Newsgroups: comp.lang.forth
Subject: Re: Shared memory
Date: Wed, 28 Feb 2024 01:46:14 +0000
Organization: novaBBS
Message-ID: <e618341b72fce0fc25688b0e4f2b866f@www.novabbs.com>
References: <73c2da86-b581-4519-bdb0-0c17df4d646en@googlegroups.com> <nnd$7cbc3a63$6f83f13a@6f30e38e1afa5ef4> <2e9e6cc3-b12a-4202-95ac-cecd9ab4f391n@googlegroups.com> <4c78905d-c2de-4342-b704-70fa022a857bn@googlegroups.com> <2023Jan21.161446@mips.complang.tuwien.ac.at> <7589123a-c72c-4499-bc8c-2026d9e6776dn@googlegroups.com> <2ec6da768657cb7e0838af11eb2d209e@www.novabbs.com>

by: minforth - Wed, 28 Feb 2024 01:46 UTC

Perhaps this is the reason why:

Windows shared memory is not the same as on Linux; only some things are similar.

The Unix mmap() API is practically equivalent to the CreateFileMapping/
MapViewOfFile Windows API. Both can map files and/or can create shared
(anonymous) maps that are backed by the swap device (if any). As a matter of
fact, glibc uses anonymous mmap() to implement malloc() when the requested
memory size is sufficiently large.

The biggest difference is the memory allocation granularity. On Linux it is
typically 4K, on Windows 64K. If it is important to have, say, arbitrary 8K
pages mapped into specific 8K destinations, you are stuck on Windows: it just
can't be done.

Another difference is that on Linux you can mmap a new page over the top of an
existing page, effectively replacing the first page mapping. In Windows you
can't do this; instead you must destroy the entire view and rebuild it in
whatever new layout is required. So if the "view" contains 1024 pages and
1 page changes, then in Linux you can just change that one page. In Windows
you must drop all 1024 pages and re-view the same 1023 pages + the one new page.

IOW with only minimal data to share, Linux should be faster. A normal file
will probably do the job already, since most probably it is buffered in memory
anyway.
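
To see the granularity difference on a given machine, something like the
following Python sketch should do (Python is used here only because its mmap
module wraps both mmap() and CreateFileMapping/MapViewOfFile; the helper
granule_aligned is a made-up name, not part of any API):

```python
import mmap

# Page size and allocation granularity as reported by the running OS.
# On Linux the two are normally equal; on Windows the granularity is
# typically 64 KiB while the page size is 4 KiB.
print("page size:             ", mmap.PAGESIZE)
print("allocation granularity:", mmap.ALLOCATIONGRANULARITY)

def granule_aligned(offset: int) -> int:
    """Round a file offset down to the nearest legal mapping offset."""
    return offset - (offset % mmap.ALLOCATIONGRANULARITY)

# An anonymous map (fileno = -1) has no named file behind it.
shared = mmap.mmap(-1, mmap.PAGESIZE)
shared[0:5] = b"ready"
first_bytes = bytes(shared[0:5])
shared.close()
```

Offsets passed to a file mapping have to be multiples of the granularity,
which is what granule_aligned illustrates.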

Re: Shared memory

<c828aa044a0e5666fe620004d6244c91@www.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=26185&group=comp.lang.forth#26185

From: mhx@iae.nl (mhx)
Newsgroups: comp.lang.forth
Subject: Re: Shared memory
Date: Wed, 28 Feb 2024 08:29:28 +0000
Organization: novaBBS
Message-ID: <c828aa044a0e5666fe620004d6244c91@www.novabbs.com>
References: <73c2da86-b581-4519-bdb0-0c17df4d646en@googlegroups.com> <nnd$7cbc3a63$6f83f13a@6f30e38e1afa5ef4> <2e9e6cc3-b12a-4202-95ac-cecd9ab4f391n@googlegroups.com> <4c78905d-c2de-4342-b704-70fa022a857bn@googlegroups.com> <2023Jan21.161446@mips.complang.tuwien.ac.at> <7589123a-c72c-4499-bc8c-2026d9e6776dn@googlegroups.com> <2ec6da768657cb7e0838af11eb2d209e@www.novabbs.com> <e618341b72fce0fc25688b0e4f2b866f@www.novabbs.com>

by: mhx - Wed, 28 Feb 2024 08:29 UTC

This is certainly interesting. Previously I wrote:

> Only a minimum amount of data is shared (a page with published
> parameters and achieved results, plus the ready! flags).

However, I see now that I asked for 'arbitrary size' in the system
call. Combined with a locked address, this could cause Windows to
swap a huge amount of memory on accesses, explaining the slow
execution.

I will have to spend more time reading the documentation after all.

Thanks a lot everybody, for the helpful comments!

-marcel

Re: Shared memory

<nnd$581d87cf$2fc35a6f@a97fc5da776b1602>

https://news.novabbs.org/devel/article-flat.php?id=26186&group=comp.lang.forth#26186

Newsgroups: comp.lang.forth
References: <73c2da86-b581-4519-bdb0-0c17df4d646en@googlegroups.com> <2023Jan21.161446@mips.complang.tuwien.ac.at> <7589123a-c72c-4499-bc8c-2026d9e6776dn@googlegroups.com> <2ec6da768657cb7e0838af11eb2d209e@www.novabbs.com>
From: albert@spenarnc.xs4all.nl
Subject: Re: Shared memory
Message-ID: <nnd$581d87cf$2fc35a6f@a97fc5da776b1602>
Organization: KPN B.V.
Date: Wed, 28 Feb 2024 11:40:00 +0100

by: albert@spenarnc.xs4all.nl - Wed, 28 Feb 2024 10:40 UTC

In article <2ec6da768657cb7e0838af11eb2d209e@www.novabbs.com>,
mhx <mhx@iae.nl> wrote:
>I have been polishing my shared memory application (iSPICE) a bit more.
>The benchmark I previously showed compared running a circuit simulation
>with a variable number of communicating CPUs. Only a minimum amount of data
>is shared (a page with published parameters and achieved results, plus the
>ready! flags). With this setup I got about a factor of 3 improvement for
>8 CPUs. I hoped to improve this factor a bit with better hardware and maybe
>some software tweaking.
>
>What I didn't try until today was checking how fast the circuit simulation
>ran on a single CPU, *not* using the shared memory framework. And indeed,
>that is a problem, in that without shared memory the runtime is *3 times
>shorter* than with shared memory. In other words, there is no net gain in
>having 8 mem-shared CPUs. As an additional check I started the circuit run
>in 3 separate windows. They all achieved the same speed as the single run
>non-shared version, proving that the hardware (cpu/memory/disk) is amply
>sufficient to provide an 8 times speed-up.
>
>I will now start working on Anton's suggestion of a shared file. Or maybe
>I should try this on Linux first; perhaps shared memory works better there.

I simply use the clone system call on Linux (its NR number is 56 on 64-bit x86).

( THREAD-PET KILL-PET PAUSE-PET ) CF: ?LI \ B5dec2
"CTA" WANTED "-syscalls-" WANTED HEX
\ Exit a thread. Indeed this is exit().
: EXIT-PET 0 _ _ __NR_exit XOS ;
\ Do a preemptive pause. ( abuse MS )
: PAUSE-PET 1 MS ;
\ Create a thread with dictionary SPACE. Execute XT in thread.
: THREAD-PET ALLOT CTA CREATE RSP@ SWAP RSP! R0 @ S0 @
ROT RSP! 2 CELLS - ( DSP) , ( TASK) , ( pid) 0 ,
DOES> DUP @ >R SWAP OVER CELL+ @ R@ 2! ( clone S: tp,xt)
100 R> _ __NR_clone XOS DUP IF
( Mother) DUP ?ERRUR SWAP 2 CELLS + ! ELSE
( Child) DROP RSP! CATCH DUP IF ERROR THEN EXIT-PET THEN ;
\ Kill a THREAD-PET , preemptively. Throw errors.
: KILL-PET >BODY 2 CELLS + @ 9 _ __NR_kill XOS ?ERRUR ;
DECIMAL

The idea is
1000 ( dictionary space ) CREATE extra

Now you run an xt as follows :
xt extra
The xt runs until it does an EXIT-PET, or is killed by a KILL-PET.

In r10par.frt it ran in 41 s on one processor and 27 s on two, for 10^12.
This was more a demonstration of parallel processing; the
communication and workload balancing kill the advantages for
more processors.

(This was prime counting)

Maybe try something simple before jumping into sockets and mapped
files.

The CTA word carves out a small dictionary space for the new
process to use, plus stacks and user space.
This is utterly system dependent, but in the ciforth model it
is just one screen, and portable over 32/64 arm/86 linux/windows.
It helps if you have a simple Forth to begin with ;-)
(CTA is used in cooperative multi tasking as well.)
>
>-marcel

Groetjes Albert
--
Don't praise the day before the evening. One swallow doesn't make spring.
You must not say "hey" before you have crossed the bridge. Don't sell the
hide of the bear until you shot it. Better one bird in the hand than ten in
the air. First gain is a cat purring. - the Wise from Antrim -

Re: Shared memory

<c53be72b665c3d10796bfe67a7f02dcf@www.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=26187&group=comp.lang.forth#26187

From: mhx@iae.nl (mhx)
Newsgroups: comp.lang.forth
Subject: Re: Shared memory
Date: Wed, 28 Feb 2024 11:11:12 +0000
Organization: novaBBS
Message-ID: <c53be72b665c3d10796bfe67a7f02dcf@www.novabbs.com>
References: <73c2da86-b581-4519-bdb0-0c17df4d646en@googlegroups.com> <2023Jan21.161446@mips.complang.tuwien.ac.at> <7589123a-c72c-4499-bc8c-2026d9e6776dn@googlegroups.com> <2ec6da768657cb7e0838af11eb2d209e@www.novabbs.com> <nnd$581d87cf$2fc35a6f@a97fc5da776b1602>

by: mhx - Wed, 28 Feb 2024 11:11 UTC

> Maybe try something simple before jumping into sockets and mapped
> files.

I have tried that way for the past 20 years already, and indeed it
works fine. However, my simple example shown above needs 24 threads
/processes/cores (whatever) each having about 2 to 4 GB of memory.

-marcel

Re: Shared memory

<nnd$346dc707$39b68eb8@f0aeef389c7accd8>

https://news.novabbs.org/devel/article-flat.php?id=26188&group=comp.lang.forth#26188

Newsgroups: comp.lang.forth
References: <73c2da86-b581-4519-bdb0-0c17df4d646en@googlegroups.com> <2ec6da768657cb7e0838af11eb2d209e@www.novabbs.com> <nnd$581d87cf$2fc35a6f@a97fc5da776b1602> <c53be72b665c3d10796bfe67a7f02dcf@www.novabbs.com>
From: albert@spenarnc.xs4all.nl
Subject: Re: Shared memory
Message-ID: <nnd$346dc707$39b68eb8@f0aeef389c7accd8>
Organization: KPN B.V.
Date: Wed, 28 Feb 2024 13:25:26 +0100

by: albert@spenarnc.xs4all.nl - Wed, 28 Feb 2024 12:25 UTC

In article <c53be72b665c3d10796bfe67a7f02dcf@www.novabbs.com>,
mhx <mhx@iae.nl> wrote:
>> Maybe try something simple before jumping into sockets and mapped
>> files.
>
>I have tried that way for the past 20 years already, and indeed it
>works fine. However, my simple example shown above needs 24 threads
>/processes/cores (whatever) each having about 2 to 4 GB of memory.

I have lost context, can you tell more about the simple example?
(My provider purges old messages swiftly)

And what about
lina -g 96000 lina96G

lina96G -e
...
WANT UNUSED
S[ ] OK UNUSED S>D DEC.

0,000,000,000,000,000,000,000,000,000,100,730,247,992
S[ ] OK
I'm sure most Forths can do something similar.
(Overcommitting, but not on my HP workstation with 256 Gbyte RAM.)

>
>-marcel

Re: Shared memory

<c2fb7eb58b7ae773f632a15c1abac917@www.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=26217&group=comp.lang.forth#26217

From: mhx@iae.nl (mhx)
Newsgroups: comp.lang.forth
Subject: Re: Shared memory
Date: Sat, 2 Mar 2024 18:14:18 +0000
Organization: novaBBS
Message-ID: <c2fb7eb58b7ae773f632a15c1abac917@www.novabbs.com>
References: <73c2da86-b581-4519-bdb0-0c17df4d646en@googlegroups.com> <2ec6da768657cb7e0838af11eb2d209e@www.novabbs.com> <nnd$581d87cf$2fc35a6f@a97fc5da776b1602> <c53be72b665c3d10796bfe67a7f02dcf@www.novabbs.com> <nnd$346dc707$39b68eb8@f0aeef389c7accd8>

by: mhx - Sat, 2 Mar 2024 18:14 UTC

> I have lost context, can you tell more about the simple example?
> (My provider purges old messages swiftly)

I was in the exploring/debugging phase and have only very recently
completed the experiments.

The final results are that with shared memory, on Windows
11, it is possible to get an almost linear speedup with the
number of cores in use. The way shared memory is implemented
on Windows is with a memory-mapped file that uses the OS
pagefile as backing store. The file is guaranteed not to be
swapped out under reasonable conditions, and Windows keeps its
management invisible to users.

I tried to make the file as small as possible. For this
iForth benchmark it was 11 int64's (11 * 8 bytes) and 24
extended floats (24 * 16 bytes), about 1/2 Kbyte. The file
is touched very infrequently: just 24 result writes and
then a loop over the 11 words to see if all CPUs have finished
(checked at 10 ms intervals). At the moment I have no idea
what happens with very frequent read/writes (it is not
the intended type of use).
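
For readers who want to picture the layout, here is a rough Python sketch of
such a block: 11 int64 ready flags followed by 24 result slots of 16 bytes.
It is not the iForth code. The function names are invented, a plain double
stands in for the extended float (Python's struct has no 80-bit format), and
an anonymous map in one process stands in for the named mapping shared by
the workers.

```python
import mmap
import struct
import time

N_CPUS = 11        # one int64 "ready" flag per process
N_RESULTS = 24     # one result slot per circuit variation
FLAG_SIZE = 8
SLOT_SIZE = 16     # slot width of an extended float; only 8 bytes used here
RESULTS_BASE = N_CPUS * FLAG_SIZE
TOTAL_SIZE = RESULTS_BASE + N_RESULTS * SLOT_SIZE   # 88 + 384 = 472 bytes

def publish_result(buf, slot, value):
    """Worker side: store one result (as a double) in its own slot."""
    struct.pack_into("<d", buf, RESULTS_BASE + slot * SLOT_SIZE, value)

def read_result(buf, slot):
    return struct.unpack_from("<d", buf, RESULTS_BASE + slot * SLOT_SIZE)[0]

def mark_ready(buf, cpu):
    """Worker side: raise this process's ready flag."""
    struct.pack_into("<q", buf, cpu * FLAG_SIZE, 1)

def all_ready(buf, n_cpus=N_CPUS):
    flags = struct.unpack_from(f"<{n_cpus}q", buf, 0)
    return all(flag != 0 for flag in flags)

def wait_until_done(buf, poll_s=0.010):
    """Coordinator side: poll the flags at 10 ms intervals."""
    while not all_ready(buf):
        time.sleep(poll_s)

buf = mmap.mmap(-1, TOTAL_SIZE)      # zero-filled, so no flag is set yet
publish_result(buf, 3, 1.25)
for cpu in range(N_CPUS):
    mark_ready(buf, cpu)
wait_until_done(buf)
```

Each worker writes only its own slot and flag, so this access pattern needs
no locking as long as the coordinator reads results after the flags are set.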

[During debugging I was lucky. When setting the number of
working CPUs interactively, completely wrong results
were obtained. This happened because #|cpus was defined
as a VALUE in a configuration file. When changing #|cpus
from the console, the value in sconfig.frt stayed the
same (of course) while all the dynamically started cores
used the on-disk value, not the value I typed in on
CPU #0. Easy to understand in hindsight, but this type
of 'black-hole' mistake can take hours to find in a 7000+
line program. For some reason I just knew that it had to
be #|cpus that was causing the problem.]

The benchmark is a circuit file that defines a voltage
source and a 2-resistor divider, all parameterized.
These values were swept for a total of 24 different
circuits. To calculate the result for one of the
combinations takes 2.277s on a single core with iSPICE,
or 24 x that value, 54.648s, for all 24 combinations.
In the benchmark the 24 simulations are spread out over
11 processes on an 8-core CPU :

iSPICE> .ticker-info
AMD Ryzen 7 5800X 8-Core Processor
TICKS-GET uses os time & PROCESSOR-CLOCK 4192MHz
Do: < n TO PROCESSOR-CLOCK RECALIBRATE >

The aim is to get an 8 times speedup, or more if
hyperthreads bring something, and do all combinations
in less than 6.831 seconds. The best I managed is
7.694s or about 7.67 "cores", which I consider not
that bad. Here are the details (run 4 times):

% cpus   time [s]   perf. ratio
    1     49.874       1.46
    2     25.314       2.39
    3     17.391       3.23
    4     13.335       4.11
    5     10.565       5.17
    6      9.468       5.71
    7      8.712       6.22
    8      7.694       7.67
    9      7.260       7.37
   10      7.874       6.72
   11      7.856       6.73  ok

For your information: Running the same 24 variations
with LTspice 17.1.15, one of the fastest SPICE
implementations currently available, takes 382.265
seconds, almost exactly 7 times slower than the iSPICE
single-core run. Using 8 cores (LTspice pretends to
use 16 threads), that ratio becomes 62 times.

In the above table the performance ratio for a single
CPU is 1.46 (1.46 times faster than doing the 24
simulations on a single core *without* shared memory),
which might seem strange. I think the phenomenon is
caused by the fact that a single combination takes
only 2.277s and this may be too short for the processor
(or Windows) to ramp up the clock frequency. If the
performance factor is normalized by the timing for
1 CPU, the maximum speedup decreases to 5.25 (7.67 / 1.46).
We'll see what happens on an HPZ840.
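
The arithmetic behind the figures quoted in this post can be replayed in a
few lines of Python. This is only a cross-check of numbers already given
above; the variable names are made up for illustration.

```python
single_run_s = 2.277      # one combination on a single core
n_variations = 24
n_cores = 8

serial_s = n_variations * single_run_s     # all 24 runs back to back
target_s = serial_s / n_cores              # ideal time with an 8x speedup
normalized_max = 7.67 / 1.46               # best ratio over the 1-cpu ratio
ltspice_ratio = 382.265 / serial_s         # LTspice vs. iSPICE single core

print(f"serial time      : {serial_s:.3f} s")
print(f"8x target        : {target_s:.3f} s")
print(f"normalized max   : {normalized_max:.2f}")
print(f"LTspice / iSPICE : {ltspice_ratio:.2f}")
```
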
-marcel

Re: Shared memory

<nnd$0ddc7cdc$118d16a9@0c11a6a8e2e845fe>

https://news.novabbs.org/devel/article-flat.php?id=26228&group=comp.lang.forth#26228

Newsgroups: comp.lang.forth
Subject: Re: Shared memory
References: <73c2da86-b581-4519-bdb0-0c17df4d646en@googlegroups.com> <c53be72b665c3d10796bfe67a7f02dcf@www.novabbs.com> <nnd$346dc707$39b68eb8@f0aeef389c7accd8> <c2fb7eb58b7ae773f632a15c1abac917@www.novabbs.com>
From: albert@spenarnc.xs4all.nl
Message-ID: <nnd$0ddc7cdc$118d16a9@0c11a6a8e2e845fe>
Organization: KPN B.V.
Date: Sun, 03 Mar 2024 12:19:13 +0100

by: albert@spenarnc.xs4all.nl - Sun, 3 Mar 2024 11:19 UTC

In article <c2fb7eb58b7ae773f632a15c1abac917@www.novabbs.com>,
mhx <mhx@iae.nl> wrote:
>> I have lost context, can you tell more about the simple example?
>> (My provider purges old messages swiftly)
>
>I was in the exploring/debugging phase and have only very recently
>completed the experiments.

>
>The final results are that with shared memory, on Windows
>11, it is possible to get an almost linear speedup with the
>number of cores in use. The way shared memory is implemented
>on Windows is with a memory-mapped file that uses the OS
>pagefile as backup. The file is guaranteed to not be swapped
>out under reasonable conditions, and Windows keeps its
>management invisible for users.

Linear speedup? That must depend on the program.
Can I surmise that the context is that you're comparing your
version/clone iSPICE with LTspice?
>
>I tried to make the file as small as possible. For this
>iForth benchmark it was 11 int64's (11 * 8 bytes) and 24
>extended floats (24 * 16 bytes), about 1/2 Kbyte. The file
>is touched very infrequently, just 24 result writes and
>then a loop over the 11 words to see if all cpu's finished
>(check at 10ms intervals). At the moment I have no idea
>what happens with very frequent read/writes (it is not
>the intended type of use).

>
>[During debugging I was lucky. When setting the number of
> working cpu's interactively, completely wrong results
> were obtained. This happened because #|cpus was defined
> as a VALUE in a configuration file. When changing #|cpus
> from the console, the value in sconfig.frt stayed the
> same (of course) while all the dynamically started cores
> used the on-disk value, not the value I typed in on
> CPU #0. Easy to understand in hindsight, but this type
> of 'black-hole' mistake can take hours to find in a 7000+
> line program. For some reason I just knew that it had to
> be #|cpus that was causing the problem.]

>
>The benchmark is a circuit file that defines a voltage
>source and a 2-resistor divider, all parameterized.
>These values were swept for a total of 24 different
>circuits. To calculate the result for one of the
>combinations takes 2.277s on a single core with iSPICE,
>or 24 x that value, 54.648s, for all 24 combinations.
>In the benchmark the 24 simulations are spread out over
>11 processes on an 8-core CPU :
>
>iSPICE> .ticker-info
>AMD Ryzen 7 5800X 8-Core Processor
> TICKS-GET uses os time & PROCESSOR-CLOCK 4192MHz
> Do: < n TO PROCESSOR-CLOCK RECALIBRATE >
>
>The aim is to get an 8 times speedup, or more if
>hyperthreads bring something, and do all combinations
>in less than 6.831 seconds. The best I managed is
>7.694s or about 7.67 "cores", which I consider not
>that bad. Here are the details (run 4 times):
>
>% cpus time [s] perf. ratio
> 1 49.874 1.46
> 2 25.314 2.39
> 3 17.391 3.23
> 4 13.335 4.11
> 5 10.565 5.17
> 6 9.468 5.71
> 7 8.712 6.22
> 8 7.694 7.67
> 9 7.260 7.37
> 10 7.874 6.72
> 11 7.856 6.73 ok
>
>For your information: Running the same 24 variations
>with LTspice 17.1.15, one of the fastest SPICE
>implementations currently available, takes 382.265
>seconds, almost exactly 7 times slower than the iSPICE
>single-core run. Using 8 cores (LTspice pretends to
>use 16 threads), that ratio becomes 62 times.

So LTspice becomes slower by using 8 cores,
going from 7 times slower to 62 times slower than iSPICE?
There must be a mistake here.
>
>In the above table the performance ratio for a single
>cpu is 1.46 (1.46 times faster than doing the 24
>simulations on a single core *without* shared memory),
>which might seem strange. I think the phenomenon is
>caused by the fact that a single combination takes
>only 2.277s and this may be too short for the processor
>(or Windows) to ramp up the clock frequency. If the
>performance factor is normalized by the timing for
>1 cpu, the maximum speedup decreases to 5.25.
>We'll see what happens on an HPZ840.

You are going to run Windows 11 on the HP work station?
I'm going to install a Linux version, for I want to
experiment with CUDA.

>
>-marcel

Re: Shared memory

<0678be0fc5470e4edb09427823d40717@www.novabbs.com>

https://news.novabbs.org/devel/article-flat.php?id=26284&group=comp.lang.forth#26284

From: mhx@iae.nl (mhx)
Newsgroups: comp.lang.forth
Subject: Re: Shared memory
Date: Wed, 6 Mar 2024 10:54:55 +0000
Organization: novaBBS
Message-ID: <0678be0fc5470e4edb09427823d40717@www.novabbs.com>
References: <73c2da86-b581-4519-bdb0-0c17df4d646en@googlegroups.com> <c53be72b665c3d10796bfe67a7f02dcf@www.novabbs.com> <nnd$346dc707$39b68eb8@f0aeef389c7accd8> <c2fb7eb58b7ae773f632a15c1abac917@www.novabbs.com> <nnd$0ddc7cdc$118d16a9@0c11a6a8e2e845fe>

by: mhx - Wed, 6 Mar 2024 10:54 UTC

albert@spenarnc.xs4all.nl wrote:

> In article <c2fb7eb58b7ae773f632a15c1abac917@www.novabbs.com>,
> mhx <mhx@iae.nl> wrote:
>>> I have lost context, can you tell more about the simple example?
[..]
>>The final results are that with shared memory, on Windows
>>11, it is possible to get an almost linear speedup with the
>>number of cores in use.
[..]
> Linear speedup? That must depend on the program.
> Can I surmise that the context is that you're comparing your
> version/clone iSpice with LTSpice.

The example is *not* about trying to speed up programs
by adding threads to work on parts that can be parallelized.
A circuit simulator is used as the example here. Circuits
contain on average about 30% of operations that can be done
in parallel, so a fine-grained threaded approach, even with an
unlimited number of threads, can cut the runtime by at most 30%.
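
This is Amdahl's law at work: with a parallelizable share p, the speedup on
n threads is 1 / ((1 - p) + p / n), which for p = 0.3 tops out at about
1.43 times no matter how many threads are added. A small Python sketch
(the function names are just illustrative):

```python
def amdahl_speedup(parallel_fraction: float, n_threads: int) -> float:
    """Speedup when only `parallel_fraction` of the work scales with threads."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_threads)

def amdahl_limit(parallel_fraction: float) -> float:
    """Upper bound on the speedup for an unlimited number of threads."""
    return 1.0 / (1.0 - parallel_fraction)

for n in (1, 2, 8, 64):
    print(f"{n:3d} threads: {amdahl_speedup(0.3, n):.3f}x")
print(f"limit      : {amdahl_limit(0.3):.3f}x")
```
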

Most circuit simulation problems cannot be solved with a
single simulation. In almost every case one wants to re-run
a job with small variations on the original specification.
The variations can be on the circuit components themselves,
on variations in environmental conditions like temperature,
humidity, noise, variations on input sources or output loads,
or even parameters of their (digital) control algorithms.
Between 10 and many thousands of simulations could be
necessary. At the top level, this problem is trivial to solve:
edit the input netlist with the necessary changes,
re-run the simulation, and store the results in a database.
When all runs are done, the data is evaluated by querying.

In practice, it is difficult to keep the administration
straight if the above is done by hand. What I am looking for
is a simple way to specify variations, create a list
of all the simulations needed, then distribute the tasks
to as many cpu cores as are available (locally, on the network,
or in the Cloud), combine the results, and generate reports.
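
As a rough illustration of the bookkeeping (not the iSPICE implementation),
the Python sketch below expands a sweep specification into a job list and
deals the jobs out round-robin to a number of workers. The parameter names
and values are invented; they are chosen only so that the sweep yields 24
combinations.

```python
from itertools import product

# Hypothetical sweep: a source and two resistors, 2 * 3 * 4 = 24 circuits.
sweep = {
    "V1": [5.0, 12.0],
    "R1": [1e3, 2.2e3, 4.7e3],
    "R2": [1e3, 2.2e3, 4.7e3, 10e3],
}

def enumerate_jobs(sweep):
    """One dict of parameter values per simulation to run."""
    names = list(sweep)
    return [dict(zip(names, values)) for values in product(*sweep.values())]

def distribute(jobs, n_workers):
    """Round-robin: worker k gets jobs k, k + n, k + 2n, ..."""
    return [jobs[k::n_workers] for k in range(n_workers)]

jobs = enumerate_jobs(sweep)
queues = distribute(jobs, 11)
for k, queue in enumerate(queues):
    print(f"worker {k:2d}: {len(queue)} jobs")
```

With 24 jobs, the longest queue holds 3 jobs for any worker count from 8 to
11, so on paper nothing is gained beyond 8 workers for this job count.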

To do this in Forth, I found it useful to use either shared
memory, or a shared file. The post is about experiments with
shared memory (useful when the number of cores is less than
256 and the main memory requirement is less than 1 TByte.)

The concrete example is to run N variations of a circuit on
an 8 core system with 32GB of memory, with the features I
describe above. The question was: is it possible to get
a speedup of 8 when the benchmark runs on an 8 core CPU.

>>iSPICE> .ticker-info
>>AMD Ryzen 7 5800X 8-Core Processor
>> TICKS-GET uses os time & PROCESSOR-CLOCK 4192MHz
>> Do: < n TO PROCESSOR-CLOCK RECALIBRATE >
>>
>>The aim is to get an 8 times speedup, or more if
>>hyperthreads bring something, and do all combinations
>>in less than 6.831 seconds. The best I managed is
>>7.694s or about 7.67 "cores", which I consider not
>>that bad. Here are the details (run 4 times):
>>
>>% cpus time [s] perf. ratio
>> 1 49.874 1.46
>> 2 25.314 2.39
>> 3 17.391 3.23
>> 4 13.335 4.11
>> 5 10.565 5.17
>> 6 9.468 5.71
>> 7 8.712 6.22
>> 8 7.694 7.67
>> 9 7.260 7.37
>> 10 7.874 6.72
>> 11 7.856 6.73 ok
>>

>>For your information: Running the same 24 variations
>>with LTspice 17.1.15, one of the fastest SPICE
>>implementations currently available, takes 382.265
>>seconds, almost exactly 7 times slower than the iSPICE
>>single-core run. Using 8 cores (LTspice pretends to
>>use 16 threads), that ratio becomes 62 times.

I realize now that this comparison of iSPICE with LTspice
can confuse the reader. It does not matter at all for this
benchmark which SPICE simulator is used.

> So LTspice becomes slower by using 8 cores,
> going from 7 times slower to 62 times slower than iSPICE?
> There must be a mistake here.

There is no mistake. LTspice is 7 times slower than iSPICE for
the specific type of task used here. Although LTspice has
a mechanism to run multiple variations, and claims to use
8 cores / 16 threads, it does not appear to use them as
efficiently as iSPICE does using shared memory.

[..]
>>We'll see what happens on an HPZ840.

> You are going to run Windows 11 on the HP work station?
> I'm going to install a Linux version, for I want to
> experiment with CUDA.

I certainly want to see what happens if I run iSPICE on
my 44-core HPZ840 :--) The fastest way to implement that
should be to install Windows 10 or 11 on the HP. However,
if that proves problematic I have no problem using Linux.
I did not try iSPICE on Linux/WSL2 yet and I probably will
do that first.

I also want to experiment with CUDA (BTW, why not OpenCL?
Did you already find arguments against that route?).
However, that would be to investigate a new way of circuit
simulation that does not use the standard SPICE algorithms.

-marcel
