Message-ID:

I have never seen anything fill up a vacuum so fast and still suck. -- Rob Pike, on X.

devel / comp.lang.forth / Re: How stack caching and ip-update optimization influence the primitives

How stack caching and ip-update optimization influence the primitives

<2024Feb5.134000@mips.complang.tuwien.ac.at>

https://news.novabbs.org/devel/article-flat.php?id=26092&group=comp.lang.forth#26092

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.lang.forth
Subject: How stack caching and ip-update optimization influence the primitives
Date: Mon, 05 Feb 2024 12:40:00 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 108
Message-ID: <2024Feb5.134000@mips.complang.tuwien.ac.at>
Injection-Info: dont-email.me; posting-host="21e8e08af84274a3b4770214f7c13a8a";
logging-data="339336"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19txenTaCTGMtUJEfMcW4Tk"
Cancel-Lock: sha1:2tMebmevf6sEtyJjuQKgcCeh1mQ=
X-newsreader: xrn 10.11

by: Anton Ertl - Mon, 5 Feb 2024 12:40 UTC

In June 2023 gforth-fast on AMD64 imlemented only stack caching with
one register, and every primitive or superinstruction also had an
instruction pointer (ip) update. Since then, we managed to have up to
three stack items in a register and eliminated most ip-updates. As an
example, let's look at DECIMAL. It's source code is (clarified):

: decimal ( -- )
#10 base ! ;

Here's what SEE-CODE DECIMAL shows for two versions:

June 2023 November 2023
lit 1->1 lit 1->2
#10 #10
mov [r10],r8 mov r14,$08[r13]
add rbx,$10
mov r8,-$10[rbx]
sub r10,$08
useraddr ! 1->1 useraddr 2->3
#112 #112
! add r13,$20
mov rax,[rbx] mov r15,$08[rsp]
mov rdi,$08[rsp] add r15,-$08[r13]
add r10,$08 ! 3->1
add rbx,$18 mov [r15],r14
mov [rdi][rax],r8
mov r8,[r10]
;s 1->1 ;s 1->1
mov rbx,$00[r13] mov rax,[r10]
add r13,$08 add r10,$08
add rbx,$08 mov r13,rax
mov rcx,-$08[rbx] mov rax,$00[r13]
jmp ecx jmp eax

We still used the same primitives. In November Gforth's code
generator chose not to use the "useraddr !" static superinstruction,
because static superinstructions as currently implemented only work
for 1->1 stack state transitions. So this static superinstruction is
less valuable than it used to be. In the meantime, with these
enhanced and new optimizations, we eliminated 13 superinstructions,
leaving 36.

In recent days we discovered that these optimizations also reduce the
need for primitives with immediate arguments like USERADDR above. The
November USERADDR starts with an ip update ("add r13, ..."), because
the immediate argument requires the ip to be up-to-date. In the
current version, USERADDR with immediate argument 112 is replaced with
"UP@ 112 +", and here you see the two pieces of code:

November 2023 February 2024
lit 1->2 lit 1->2
#10 #10
mov r14,$08[r13] mov r15,$08[rbx]
useraddr 2->3 up@ 2->3
#112 mov r9,$18[rsp]
add r13,$20 lit+ 3->3
mov r15,$08[rsp] #112
add r15,-$08[r13] add r9,$20[rbx]
! 3->1 ! 3->1
mov [r15],r14 mov [r9],r15
;s 1->1 ;s 1->1
mov rax,[r10] mov rbx,[r14]
add r10,$08 add r14,$08
mov r13,rax mov rax,[rbx]
mov rax,$00[r13] jmp eax
jmp eax

If you delete the ip update from the start of USERADDR, you get the
same code as for "UP@ LIT+" (if you ignore the different register
allocation and the different offset of UP from rsp in memory.

One other difference is that useraddr accessed its immediate argument
with offset -8 (because ip points right after the immediate at this
point), while LIT+ accesses its immediate argument with offset $20.
This is because for a few frequent primitives (like LIT+) we have
several versions for different offsets from the actual ip relative to
the instruction, and we therefore just choose the right version of
LIT+ rather than updating the ip in many cases, including the one in
DECIMAL.

With this extra optimization, UP@ LIT+ comes out ahead of USERADDR in
this case. But even without it, they would both produce three
instructions.

Given that UP@ 112 + is proper Forth code that can be interpreted,
while USERADDR with immediate argument 112 is a special-purpose
primitive that is only useful in compiled code, UP@ 112 + looks
preferable to me.

So in the last few days we replaced a number of primitives having to
do with the user pointer, the locals pointer, or the current object by
sequences that usually involve UP@, LP@, or O.

So having these optimizations also results in a cleaner set of
primitives.

You may also wonder why the code for ;S in the November version
contains the register-register move that could be optimized away (and
that is not present in the February 2024 version
shown). Unfortunately, I have no idea. The gcc versions used were
12.2 for the November version and 10.2 for the February version.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023

Re: How stack caching and ip-update optimization influence the primitives

<992bbef7075dc3072a5cc3c575992648@www.novabbs.com>

copy mid

https://news.novabbs.org/devel/article-flat.php?id=26093&group=comp.lang.forth#26093

copy link Newsgroups: comp.lang.forth

Path: i2pn2.org!.POSTED!not-for-mail
From: minforth@gmx.net (minforth)
Newsgroups: comp.lang.forth
Subject: Re: How stack caching and ip-update optimization influence the primitives
Date: Mon, 5 Feb 2024 15:50:30 +0000
Organization: novaBBS
Message-ID: <992bbef7075dc3072a5cc3c575992648@www.novabbs.com>
References: <2024Feb5.134000@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="1788243"; mail-complaints-to="usenet@i2pn2.org";
posting-account="t+lO0yBNO1zGxasPvGSZV1BRu71QKx+JE37DnW+83jQ";
User-Agent: Rocksolid Light
X-Spam-Checker-Version: SpamAssassin 4.0.0
X-Rslight-Site: $2y$10$.lNk5V1k8hA28R3FxlP9x.Bgmlgb0oupdJ7aT6aTHO2h.9Uitalou
X-Rslight-Posting-User: d2a19558f194e2f1f8393b8d9be9ef51734a4da3

by: minforth - Mon, 5 Feb 2024 15:50 UTC

With all due respect, is this really a suitable benchmark for the optimizer ??

My Forth does:

# see decimal
C DECIMAL <137>
0000000000401ee3 <mf_DECIMAL>:
401ee3: 48 c7 05 82 64 02 00 0a 00 00 00 movq $0xa,0x26482(%rip) # 428370 <MFBASE>
401eee: c3 retq

or
0000000000401eef <mf_HEX>:
401eef: 48 c7 05 76 64 02 00 10 00 00 00 movq $0x10,0x26476(%rip) # 428370 <MFBASE>
401efa: c3 retq

I had expected gforth to compile similarly compact.

Re: How stack caching and ip-update optimization influence the primitives

<2024Feb5.173757@mips.complang.tuwien.ac.at>

copy mid

https://news.novabbs.org/devel/article-flat.php?id=26094&group=comp.lang.forth#26094

copy link Newsgroups: comp.lang.forth

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.lang.forth
Subject: Re: How stack caching and ip-update optimization influence the primitives
Date: Mon, 05 Feb 2024 16:37:57 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 58
Message-ID: <2024Feb5.173757@mips.complang.tuwien.ac.at>
References: <2024Feb5.134000@mips.complang.tuwien.ac.at> <992bbef7075dc3072a5cc3c575992648@www.novabbs.com>
Injection-Info: dont-email.me; posting-host="21e8e08af84274a3b4770214f7c13a8a";
logging-data="408892"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19xM/icL2lFVUw6zBLS/oP8"
Cancel-Lock: sha1:XKPNaPYi6f2b2YV0EIHITFYbgJ8=
X-newsreader: xrn 10.11

by: Anton Ertl - Mon, 5 Feb 2024 16:37 UTC

minforth@gmx.net (minforth) writes:
>With all due respect, is this really a suitable benchmark for the optimizer ??

Benchmark? My article is about a specific point, and DECIMAL is a
good example for demonstrating this point.

For benchmarking I run various programs, e.g., those from
<http://www.complang.tuwien.ac.at/forth/appbench.zip>.

>My Forth does:
>
># see decimal
>C DECIMAL <137>
>0000000000401ee3 <mf_DECIMAL>:
> 401ee3: 48 c7 05 82 64 02 00 0a 00 00 00 movq $0xa,0x26482(%rip) # 428370 <MFBASE>
> 401eee: c3 retq

Congratulations.

>I had expected gforth to compile similarly compact.

Why? Gforth is still based on threaded code, and it shows in,
e.g., needing threaded-code instruction pointer to access any immediate
arguments.

Looking at DECIMAL in proper native-code compilers is interesting:

iforth:
$10138348 : DECIMAL 488BC04883ED088F4500 H.@H.m..E.
$10138352 mov rbx, #10 d# 48C7C30A000000 HGC....
$10138359 push rbx 53 S
$1013835A lea rbx, [r15 #520 +] qword
498D9F08020000 I......
$10138361 pop [rbx] qword 8F03 ..
$10138363 ; 488B45004883C508FFE0 H.E.H.E..` ok

lxf:
80498D2 8D4620 lea eax , [esi+20h]
80498D5 B90A000000 mov ecx , # Ah
80498DA 8908 mov [eax] , ecx
80498DC C3 ret near

SwiftForth x64:
40A392 A # 88 [RSI] MOV 48C786880000000A000000
40A39D RET C3 ok

VFX64:
( 0041CEA0 48C746300A000000 ) MOV QWord [RSI+30], # 0000000A
( 0041CEA8 C3 ) RET/NEXT
( 9 bytes, 2 instructions )

Subject	Author
How stack caching and ip-update optimization influence the primitives	Anton Ertl
Re: How stack caching and ip-update optimization influence the primitives	minforth
Re: How stack caching and ip-update optimization influence the primitives	Anton Ertl