Subject: New performance features in gforth-fast

gforth-fast has acquired two performance features this summer:

1) Many of the ip updates are now optimized away (all architectures).

2) On AMD64 gforth-fast can now use stack caching with up to 3
registers (previously 1).

For the word

: cubed dup dup * * ;

this results in the following differences in the resulting code:

Before                without ip updates    and with 3 regs
$7F75EC8FB240 dup
add $0x8,%rbx
mov %r8,0x0(%r13)     mov %r8,0x0(%r13)     mov %r8,%r15
sub $0x8,%r13         sub $0x8,%r13
$7F75EC8FB248 dup
add $0x8,%rbx
mov %r8,0x0(%r13)     mov %r8,0x0(%r13)
sub $0x8,%r13         sub $0x8,%r13         mov %r15,%r9
$7F75EC8FB250 *
add $0x8,%rbx
imul 0x8(%r13),%r8    imul 0x8(%r13),%r8    imul %r9,%r15
add $0x8,%r13         add $0x8,%r13
$7F75EC8FB258 *
add $0x8,%rbx
imul 0x8(%r13),%r8    imul 0x8(%r13),%r8    imul %r15,%r8
add $0x8,%r13         add $0x8,%r13
$7F75EC8FB260 ;s
mov (%r14),%rbx       mov (%r14),%rbx       mov (%r14),%rbx
add $0x8,%r14         add $0x8,%r14         add $0x8,%r14
mov (%rbx),%rax       mov (%rbx),%rax       mov (%rbx),%rax
jmp *%rax             jmp *%rax             jmp *%rax

(Actually, the real Before variant used a different register
allocation, but the same number of instructions. The shown version is
the engine with optimization, but

Here's a comparison with some other Forth systems on AMD64:

gforth-fast       iforth          SwiftForth x64     VFX Forth 64
mov %r8,%r15      pop rbx         -8 [RBP] RBP LEA   MOV RDX, RBX
mov %r15,%r9      mov rdi, rbx    RBX 0 [RBP] MOV    IMUL RBX, RDX
imul %r9,%r15     imul rdi, rbx   -8 [RBP] RBP LEA   IMUL RBX, RDX
imul %r15,%r8     imul rbx, rdi   RBX 0 [RBP] MOV    RET/NEXT
mov (%r14),%rbx   push rbx        0 [RBP] RAX MOV
add $0x8,%r14     ;               RBX MUL
mov (%rbx),%rax                   RAX RBX MOV
jmp *%rax                         8 [RBP] RBP LEA
                                  0 [RBP] RAX MOV
                                  RBX MUL
                                  RAX RBX MOV
                                  8 [RBP] RBP LEA
                                  RET

1) Optimize ip updates:

At its heart, gforth (including gforth-fast) is still a threaded-code
system and falls back to threaded code when needed; in particular,
its control flow works through the threaded-code mechanism; e.g., the
;S in the example above loads the threaded-code address of the next
(primitive) word in the caller, and performs a direct-threaded
dispatch. Also, immediate arguments (e.g., for literals) are accessed
through the threaded-code instruction pointer. Therefore Gforth
maintains a threaded-code instruction pointer (ip).

But it does not need to maintain the ip everywhere. In the CUBED
example, no primitive uses the ip of the threaded code cell of the
primitive, so no ip updates are necessary except for restoring the
caller's ip at the end.

And this is what this optimization does, in a nutshell.
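
As a rough illustration of the idea (a toy model in Python, not
Gforth's actual code generator), the sketch below walks over the
primitives of a definition and reports where an ip update would still
have to be emitted. The names READS_IP, SETS_IP and ip_updates are
made up for this sketch, and which primitives really need the ip is an
assumption based on the description above.

```python
# Toy model of ip-update elision; not Gforth's real implementation.
# Assumption: "lit" reads its inline argument through ip, and ";s"
# loads a fresh ip from the return stack.
READS_IP = {"lit"}
SETS_IP = {";s"}

def ip_updates(prims, optimize=True):
    """Return (index, cells) pairs: before finishing primitive `index`,
    ip is advanced by `cells` threaded-code cells."""
    emitted = []
    pending = 0  # ip increments that have been deferred so far
    for i, prim in enumerate(prims):
        if prim in SETS_IP:
            # ip is overwritten anyway, so deferred increments are dead.
            pending = 0
            continue
        pending += 1  # ip would have to advance past this primitive's cell
        if not optimize or prim in READS_IP:
            # Materialize all deferred increments as one combined update.
            emitted.append((i, pending))
            pending = 0
    return emitted

cubed = ["dup", "dup", "*", "*", ";s"]
print(ip_updates(cubed, optimize=False))
print(ip_updates(cubed, optimize=True))
```

For CUBED, the unoptimized plan contains four updates, corresponding
to the four "add $0x8,%rbx" instructions in the Before listing, and
the optimized plan contains none.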

This optimization is controlled with --opt-ip-updates=n, where n=0
means no ip-update optimization, and higher n means more optimization;
currently the highest level is n=4 IIRC, and the highest level is the
default.

2) 3 registers for stack caching:

Up until this summer I believed that we would not convince gcc to use
caller-saved registers as additional stack cache registers, and the
dearth of callee-saved registers on AMD64 meant that we were limited
to using 1 register as a stack cache (we have been using 3 registers
on ARM A64 and RISC-V for quite some time). This summer I got an idea
on how to do it, and, with the help of Bernd Paysan, did it; if you
want to read more about it, posting <23-10-001@comp.compilers> in
comp.compilers (on the web:
<https://compilers.iecc.com/comparch/article/23-10-001> or
<http://al.howardknight.net/?ID=169728532800>) discusses the topic in
more depth.
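
To illustrate why more stack-cache registers help for a word like
CUBED, here is a toy model in Python that counts data-stack memory
accesses for a sequence of primitives. It is only a sketch: the
function memory_accesses, the EFFECTS table, and the assumption that a
word starts with one stack item cached in a register are illustrative
and do not describe how Gforth's stack caching is implemented.

```python
# Toy model of stack caching; not Gforth's real implementation.
# (items consumed, items produced) for the primitives used in CUBED.
EFFECTS = {"dup": (1, 2), "*": (2, 1)}

def memory_accesses(prims, max_regs, start_cached=1):
    """Count loads/stores to the in-memory data stack when up to
    max_regs top-of-stack items can be kept in registers."""
    cached = start_cached  # assumption: one item is cached on entry
    accesses = 0
    for prim in prims:
        consumed, produced = EFFECTS[prim]
        if cached < consumed:
            # Operands that are not in registers must be loaded.
            accesses += consumed - cached
            cached = consumed
        cached = cached - consumed + produced
        if cached > max_regs:
            # Too many items for the registers: store the deepest ones.
            accesses += cached - max_regs
            cached = max_regs
    return accesses

CUBED_BODY = ["dup", "dup", "*", "*"]
print(memory_accesses(CUBED_BODY, max_regs=1))
print(memory_accesses(CUBED_BODY, max_regs=3))
```

In this model, one register gives four memory accesses for the body of
CUBED (two stores for the DUPs, two loads for the multiplications),
which matches the memory operands in the one-register code shown
earlier, while three registers give none.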

Here are results for small benchmarks on a Xeon W-1370P (5.2GHz Rocket
Lake):

 sieve bubble matrix   fib   fft
 0.089  0.131  0.048 0.084 0.031 gforth
 0.058  0.066  0.033 0.043 0.014 gforth-fast --ss-states=2 --opt-ip-updates=0
 0.052  0.053  0.018 0.036 0.014 gforth-fast --ss-states=2
 0.038  0.042  0.014 0.032 0.014 gforth-fast

The new optimizations provide good speedups on Rocket Lake. Sometimes
the ip-update optimization alone helps a lot, sometimes the
combination of both optimizations helps a lot more (for now
--opt-ip-updates=0 does not work with --ss-states=4, so I cannot
completely isolate the effects of the optimizations).

gforth (the debugging engine) does not benefit from either
optimization, because stack caching is disabled for better stack
underflow reporting, and ip updates are disabled in order to get
proper backtraces in case of exceptions.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.lang.forth
Subject: Re: New performance features in gforth-fast
Date: Sun, 15 Oct 2023 10:18:51 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 41
Message-ID: <2023Oct15.121851@mips.complang.tuwien.ac.at>
References: <2023Oct14.130859@mips.complang.tuwien.ac.at>
Injection-Info: dont-email.me; posting-host="df5c662c313b363265708271958f5174";
logging-data="494970"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/O/fY1dniy419Tm1mR0vS4"
Cancel-Lock: sha1:IPmyZ6xHSrY58V/Eb5Y3V+HnSXE=
X-newsreader: xrn 10.11

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>gforth-fast has acquired two performance features this summer:
>
>1) Many of the ip updates are now optimized away (all architectures).
>
>2) On AMD64 gforth-fast can now use stack caching with up to 3
> registers (previously 1).
....
>Here are results for small benchmarks on a Xeon W-1370P (5.2GHz Rocket
>Lake):

I fixed the bug that prevented "gforth-fast --opt-ip-updates=0" from
working, resulting in:

 sieve bubble matrix   fib   fft
 0.089  0.131  0.048 0.084 0.031 gforth
 0.058  0.066  0.033 0.043 0.014 gforth-fast --ss-states=2 --opt-ip-updates=0
 0.057  0.062  0.032 0.042 0.014 gforth-fast --opt-ip-updates=0
 0.052  0.053  0.018 0.036 0.014 gforth-fast --ss-states=2
 0.038  0.042  0.014 0.032 0.014 gforth-fast

So using 3 registers without ip-update optimization had a small
effect, and the ip-update optimization alone had a larger effect
(especially on matrix and fib), but for sieve, the combination of both
had a much larger effect than one might have expected from looking at
the individual effects.

Actually, apart from fft, which does not benefit from these
optimizations at all, every benchmark ran faster with both
optimizations on than one would predict from the individual results,
whether one multiplies the speedup factors of the two individual
optimizations (each relative to having both off) or subtracts both
individual time savings from the time with both optimizations off.
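
The two predictions can be checked against the table above with a
little arithmetic. The following Python sketch uses only the times
from that table; the function name expected_times and the variable
names are made up for this illustration.

```python
# Times in seconds, copied from the benchmark table above.
benchmarks = ["sieve", "bubble", "matrix", "fib", "fft"]
both_off = [0.058, 0.066, 0.033, 0.043, 0.014]   # --ss-states=2 --opt-ip-updates=0
regs_only = [0.057, 0.062, 0.032, 0.042, 0.014]  # --opt-ip-updates=0
ip_only = [0.052, 0.053, 0.018, 0.036, 0.014]    # --ss-states=2
both_on = [0.038, 0.042, 0.014, 0.032, 0.014]    # default gforth-fast

def expected_times(base, a, b):
    """Predict the time with both optimizations from the individual ones."""
    multiplicative = a * b / base  # base * (a/base) * (b/base)
    additive = a + b - base        # base - (base-a) - (base-b)
    return multiplicative, additive

results = {}
for name, base, a, b, actual in zip(benchmarks, both_off, regs_only,
                                    ip_only, both_on):
    mult, add = expected_times(base, a, b)
    results[name] = (mult, add, actual)
    print(f"{name:7s} mult {mult:.3f}  add {add:.3f}  measured {actual:.3f}")
```

For sieve, for example, both predictions come out at about 0.051s,
while the measured time with both optimizations is 0.038s.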
