Rocksolid Light

Welcome to Rocksolid Light



devel / comp.arch / Faster Sort Algorithms

Subject (Author)
* Faster Sort Algorithms (MitchAlsup)
+- Re: Faster Sort Algorithms (BGB)
+* Re: Faster Sort Algorithms (Terje Mathisen)
|+- Re: Faster Sort Algorithms (Thomas Koenig)
|`* Re: Faster Sort Algorithms (Stephen Fuld)
| `* Re: Faster Sort Algorithms (Terje Mathisen)
|  `* Re: Faster Sort Algorithms (Terje Mathisen)
|   `- Re: Faster Sort Algorithms (Michael S)
+* Re: Faster Sort Algorithms (EricP)
|`* Re: Faster Sort Algorithms (EricP)
| `* Re: Faster Sort Algorithms (MitchAlsup)
|  +- Re: Faster Sort Algorithms (Terje Mathisen)
|  `* Re: Faster Sort Algorithms (EricP)
|   +* Re: Faster Sort Algorithms (Anton Ertl)
|   |+* Re: Faster Sort Algorithms (robf...@gmail.com)
|   ||+* Re: Faster Sort Algorithms (BGB)
|   |||`* Re: Faster Sort Algorithms (MitchAlsup)
|   ||| `- Re: Faster Sort Algorithms (BGB)
|   ||+* Re: Faster Sort Algorithms (MitchAlsup)
|   |||+- Re: Faster Sort Algorithms (EricP)
|   |||`* Re: Faster Sort Algorithms (robf...@gmail.com)
|   ||| `* Re: Faster Sort Algorithms (BGB)
|   |||  `* Re: Faster Sort Algorithms (robf...@gmail.com)
|   |||   +- Re: Faster Sort Algorithms (robf...@gmail.com)
|   |||   `- Re: Faster Sort Algorithms (BGB)
|   ||`* Re: Faster Sort Algorithms (EricP)
|   || +- Re: Faster Sort Algorithms (MitchAlsup)
|   || `* Re: Faster Sort Algorithms (robf...@gmail.com)
|   ||  `* Re: Faster Sort Algorithms (Thomas Koenig)
|   ||   `- Re: Faster Sort Algorithms (robf...@gmail.com)
|   |`* Re: Faster Sort Algorithms (EricP)
|   | +- Re: Faster Sort Algorithms (Scott Lurndal)
|   | +- Re: Faster Sort Algorithms (MitchAlsup)
|   | +* Re: Faster Sort Algorithms (Thomas Koenig)
|   | |+* Re: Faster Sort Algorithms (EricP)
|   | ||+* Re: Faster Sort Algorithms (MitchAlsup)
|   | |||`* Re: Faster Sort Algorithms (Thomas Koenig)
|   | ||| `* Re: Faster Sort Algorithms (MitchAlsup)
|   | |||  +- Re: Faster Sort Algorithms (Terje Mathisen)
|   | |||  `* Re: Faster Sort Algorithms (John Levine)
|   | |||   `* Re: Faster Sort Algorithms (Scott Lurndal)
|   | |||    `* Re: Faster Sort Algorithms (BGB)
|   | |||     `* Re: Faster Sort Algorithms (John Levine)
|   | |||      +* Re: Faster Sort Algorithms (BGB)
|   | |||      |`* Re: Faster Sort Algorithms (MitchAlsup)
|   | |||      | +- Re: Faster Sort Algorithms (BGB)
|   | |||      | +* Re: Faster Sort Algorithms (John Levine)
|   | |||      | |`- Re: Faster Sort Algorithms (BGB)
|   | |||      | `* Re: Faster Sort Algorithms (Terje Mathisen)
|   | |||      |  +* Re: Faster Sort Algorithms (BGB)
|   | |||      |  |`* Re: Faster Sort Algorithms (Terje Mathisen)
|   | |||      |  | `- Re: Faster Sort Algorithms (BGB)
|   | |||      |  `* Re: off topic - burying electrical lines (George Neuner)
|   | |||      |   +- Re: off topic - burying electrical lines (BGB)
|   | |||      |   `- Re: off topic - burying electrical lines (Scott Lurndal)
|   | |||      `- Re: Faster Sort Algorithms (Terje Mathisen)
|   | ||`- Re: Faster Sort Algorithms (Terje Mathisen)
|   | |+* Re: Faster Sort Algorithms (MitchAlsup)
|   | ||+* Re: Faster Sort Algorithms (Thomas Koenig)
|   | |||+* Re: Faster Sort Algorithms (Michael S)
|   | ||||`* Re: Faster Sort Algorithms (robf...@gmail.com)
|   | |||| `- Re: Faster Sort Algorithms (Michael S)
|   | |||`* Re: Faster Sort Algorithms (MitchAlsup)
|   | ||| `* Re: Faster Sort Algorithms (Thomas Koenig)
|   | |||  `- Re: Faster Sort Algorithms (MitchAlsup)
|   | ||+* Re: Faster Sort Algorithms (Terje Mathisen)
|   | |||`- Re: Faster Sort Algorithms (Terje Mathisen)
|   | ||`- Re: Faster Sort Algorithms (Scott Lurndal)
|   | |`- Re: Faster Sort Algorithms (Terje Mathisen)
|   | `- Re: Faster Sort Algorithms (Anton Ertl)
|   `- Re: Faster Sort Algorithms (Terje Mathisen)
+- Re: Faster Sort Algorithms (Marcus)
`* Re: Faster Sort Algorithms (Michael S)
 `* Re: Faster Sort Algorithms (MitchAlsup)
  +- Re: Faster Sort Algorithms (Bill Findlay)
  `* Re: Faster Sort Algorithms (Scott Lurndal)
   `- Re: Faster Sort Algorithms (MitchAlsup)

Faster Sort Algorithms

<4abb287e-88eb-485a-916e-50a0a3b0d68cn@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=32694&group=comp.arch#32694

 by: MitchAlsup - Wed, 7 Jun 2023 21:40 UTC

An AI bot has figured out some faster sorting algorithms.

https://www.nature.com/articles/s41586-023-06004-9

Looking at the C++* code output from the paper::

void variable_sort_2( int length, int *a )
{
    switch( length )
    {
     case 0:
     case 1:
        return;
     case 2:
        int temp = a[0];
        a[0] = (a[1] < a[0] ) ? a[1] : a[0];
        a[1] = (a[1] < temp ) ? temp : a[1];
    }
}

There is x86 asm code in the article, consisting of 11 instructions.

My 66000 ASM

variable_sort_2:
CMP R3,R1,#2
PLT R3,TTTTTT
LD R4,[R2,0]
LD R5,[R2,4]
MAX R6,R4,R5
MIN R7,R4,R5
ST R7,[R2,0]
ST R6,[R2,4]
RET

For a count of 9 instructions.

(*) it looks like C code to me instead of C++ code........
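For reference, the same two-element sort can be written in portable C with explicit min/max helpers (a sketch of mine, not code from the paper); on a machine with MIN/MAX instructions this maps directly onto the load/min/max/store sequence above:

```c
#include <assert.h>

/* Two-element sort via explicit min/max, mirroring the My 66000
   sequence: load both, compute min and max, store both back. */
static int imin(int a, int b) { return a < b ? a : b; }
static int imax(int a, int b) { return a < b ? b : a; }

static void variable_sort_2(int length, int *a)
{
    if (length < 2)
        return;
    int lo = imin(a[0], a[1]);
    int hi = imax(a[0], a[1]);
    a[0] = lo;
    a[1] = hi;
}
```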

Re: Faster Sort Algorithms

<u5rb30$19fuu$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=32697&group=comp.arch#32697

 by: BGB - Thu, 8 Jun 2023 01:32 UTC

On 6/7/2023 4:40 PM, MitchAlsup wrote:
> An AI bot has figured out some faster sorting algorithms.
>
> https://www.nature.com/articles/s41586-023-06004-9
>
> Looking at the C++* code output from the paper::
>
> void variable_sort_2( int length, int *a )
> {
> switch( length )
> {
> case 0:
> case 1:
> return;
> case 2:
> int temp = a[0];
> a[0] = (a[1] < a[0] ) ? a[1] : a[0];
> a[1] = (a[1] < temp ) ? temp : a[1];
> }
> }
>
> There is x86 asm code in the article, consisting of 11 instructions.
>
> My 66000 ASM
>
> variable_sort_2:
> CMP R3,R1,#2
> PLT R3,TTTTTT
> LD R4,[R2,0]
> LD R5,[R2,4]
> MAX R6,R4,R5
> MIN R7,R4,R5
> ST R7,[R2,0]
> ST R5,[R2,4]
> RET
>
> For a count of 9 instructions.
>

Hmm, if "hand compiling" (very loose translation):
CMPGT 1, R4 //1c
BF .L0 //2c (*)
MOV.L (R5, 0), R6 //2c
MOV.L (R5, 4), R7 //3c
CMPGT R7, R6 //1c
MOV.L?T R7, (R5, 0) //1c
MOV.L?T R6, (R5, 4) //1c
.L0: RTS //2c

Length: 8 ops.
Latency: around 13 clock cycles (best case).

*: Depending on branch predictor (2c | 8c).
"RTS?F" could have been used, but this would have ended up with a higher
total latency.

> (*) it looks like C code to me instead of C++ code........

Yeah.

If I were to write similar in C:
void variable_sort_2(int length, int *arr)
{ int t0, t1;
if(length>1)
{
t0=arr[0]; t1=arr[1];
if(t0>t1)
{ arr[0]=t1; arr[1]=t0; }
}
}

Could generalize it further (to larger N), but decided to leave it out
(starts getting pretty hairy pretty quick).

In other news, generalized the:
long long x, y;
y=x*10;
Transform to handle all patterns up to:
y=(x<<S0)+(x<<S1)+(x<<S2)+(x<<S3);
Which can cover "most" of the values between 3 and 64 (but drops off
pretty quickly for larger values).

Despite its limitations, it seems to have an unexpectedly high hit rate
for long-long multiply.
Granted, the most common numbers in this case seem to be:
35, 10, 3, 5, 2097, ...

Excluding powers of 2 which are turned into shifts.

I guess leaving an open question of how many more "obvious cases" I may
have missed...

....
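A minimal sketch of the transform's effect (my code, not the actual compiler pass): a constant multiplier whose binary representation has at most four set bits becomes a sum of up to four shifts, where S0..S3 are simply the positions of the set bits.

```c
#include <assert.h>

/* Hypothetical illustration of the shift-add strength reduction:
   x * c rewritten as (x<<S0) + (x<<S1) + ... for the set bits of c.
   Returns 0 if c needs more than four terms (transform not applied). */
static long long mul_by_shifts(long long x, unsigned long long c)
{
    long long sum = 0;
    int terms = 0;
    for (int s = 0; s < 64; s++) {
        if ((c >> s) & 1) {
            if (++terms > 4)
                return 0;       /* more than four set bits */
            sum += x << s;
        }
    }
    return sum;
}
```

For example, 10 = 0b1010 needs two terms, 35 = 0b100011 three, and 2097 = 0b100000110001 exactly four, which fits with the common multipliers listed above.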

Re: Faster Sort Algorithms

<u5rvqt$1evb3$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=32698&group=comp.arch#32698

 by: Terje Mathisen - Thu, 8 Jun 2023 07:26 UTC

MitchAlsup wrote:
> An AI bot has figured out some faster sorting algorithms.
>
> https://www.nature.com/articles/s41586-023-06004-9
>
> Looking at the C++* code output from the paper::
>
> void variable_sort_2( int length, int *a )
> {
> switch( length )
> {
> case 0:
> case 1:
> return;
> case 2:
> int temp = a[0];
> a[0] = (a[1] < a[0] ) ? a[1] : a[0];
> a[1] = (a[1] < temp ) ? temp : a[1];
> }
> }

This is branchless sorting, something I started to investigate decades
ago: Every time I've done so, I've found that since I force write
operations to store back the sorted pair, even when it was already in
order, the cost of those stores outweighs the advantage I got from being
branchless.

Things might be different now, but I did start with the assumption that
stores back to an already $L1 resident cache line would be very close to
free.
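The trade-off being described can be made concrete with two C variants (illustrative names mine): the branchless form always issues both stores, while the branchy form stores only on a misordered pair and bets on the predictor.

```c
#include <assert.h>

/* Branchless: both stores always happen, even when the pair is
   already in order. */
static void sort2_branchless(int *a)
{
    int lo = a[0] < a[1] ? a[0] : a[1];
    int hi = a[0] < a[1] ? a[1] : a[0];
    a[0] = lo;
    a[1] = hi;
}

/* Branchy: stores occur only when a swap is actually needed, so
   well-predicted already-sorted input does no writes at all. */
static void sort2_branchy(int *a)
{
    if (a[0] > a[1]) {
        int t = a[0];
        a[0] = a[1];
        a[1] = t;
    }
}
```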

I have seen similar effects from branchless compressed token extraction,
where I keep reloading the source words even when the source
byte/word/dword index is the same, and here I also have big problems
beating naive code based on branch prediction working well.

> There is x86 asm code in the article, consisting of 11 instructions.
>
> My 66000 ASM
>
> variable_sort_2:
> CMP R3,R1,#2
> PLT R3,TTTTTT
> LD R4,[R2,0]
> LD R5,[R2,4]
> MAX R6,R4,R5
> MIN R7,R4,R5
> ST R7,[R2,0]
> ST R5,[R2,4]
> RET
>
> For a count of 9 instructions.

That's nice, but see my caveat about the actual cost of "free" stores.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Faster Sort Algorithms

<u5sg0c$2nl45$1@newsreader4.netcologne.de>


https://news.novabbs.org/devel/article-flat.php?id=32699&group=comp.arch#32699

 by: Thomas Koenig - Thu, 8 Jun 2023 12:02 UTC

Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
> MitchAlsup wrote:
>> An AI bot has figured out some faster sorting algorithms.
>>
>> https://www.nature.com/articles/s41586-023-06004-9
>>
>> Looking at the C++* code output from the paper::
>>
>> void variable_sort_2( int length, int *a )
>> {
>> switch( length )
>> {
>> case 0:
>> case 1:
>> return;
>> case 2:
>> int temp = a[0];
>> a[0] = (a[1] < a[0] ) ? a[1] : a[0];
>> a[1] = (a[1] < temp ) ? temp : a[1];
>> }
>> }
>
> This is branchless sorting, something I started to investigate decades
> ago: Every time I've done so I've found that since I force write
> operations to store back the sorted pair, even when it was already in
> order, the cost of those stores outweigh the advantage I got from being
> branchless.
>
> Things might be different now, but I did start with the assumption that
> stores back to an already $L1 resident cache line would be very close to
> free.
>
> I have seen similar effects from branchless compressed token extraction,
> where I keep reloading the source words even when the source
> byte/word/dword index is the same, and here I also have big problems
> beating naive code based on branch prediction working well.
>
>
>> There is x86 asm code in the article, consisting of 11 instructions.
>>
>> My 66000 ASM
>>
>> variable_sort_2:
>> CMP R3,R1,#2
>> PLT R3,TTTTTT
>> LD R4,[R2,0]
>> LD R5,[R2,4]
>> MAX R6,R4,R5
>> MIN R7,R4,R5
>> ST R7,[R2,0]
>> ST R5,[R2,4]
>> RET
>>
>> For a count of 9 instructions.
>
> That's nice, but see my caveat about the actual cost of "free" stores.

Hmm... so better code could then be

cmp r1,r1,#2
pne r1,0,FFFFFF
ldsw r1,[r2]
ldsw r3,[r2,4]
cmp r4,r1,r3
ple r4,0,FF
stw r3,[r2]
stw r1,[r2,4]
ret

? Also branchless, and does not do the store if the condition is
not fulfilled.

Re: Faster Sort Algorithms

<FVlgM.4315$fZx2.2545@fx14.iad>


https://news.novabbs.org/devel/article-flat.php?id=32700&group=comp.arch#32700

 by: EricP - Thu, 8 Jun 2023 14:50 UTC

MitchAlsup wrote:
> An AI bot has figured out some faster sorting algorithms.
>
> https://www.nature.com/articles/s41586-023-06004-9

Ummm... in fig 3.b they show the original sort3 result as

A = mem[0]
B = mem[1]
C = mem[2]

mem[0] = min(A,B,C)
mem[1] = max(min(A,C),B)
mem[2] = max(A,C)

and in fig 3.c they show the AI optimized sort3 result as

mem[0] = min(A,B)
mem[1] = max(min(A,C),B)
mem[2] = max(A,C)

If initially A=2, B=3, C=1 then the original produces

mem[0] = min(1,2,3) = 1
mem[1] = max(min(2,1),3) = 3
mem[2] = max(2,1) = 2

and the AI optimized produces

mem[0] = min(2,3) = 2
mem[1] = max(min(2,1),3) = 3
mem[2] = max(2,1) = 2

Neither of them is in the correct ascending sort order.
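The arithmetic above is easy to machine-check; a small C sketch (helper and function names mine) encoding the two figures' expressions as transcribed reproduces the (1,3,2) and (2,3,2) results for A=2, B=3, C=1:

```c
#include <assert.h>

static int min2(int a, int b) { return a < b ? a : b; }
static int max2(int a, int b) { return a > b ? a : b; }

/* fig 3.b as transcribed above: the "original" sort3 network. */
static void sort3_fig3b(int A, int B, int C, int m[3])
{
    m[0] = min2(min2(A, B), C);      /* min(A,B,C) */
    m[1] = max2(min2(A, C), B);
    m[2] = max2(A, C);
}

/* fig 3.c as transcribed above: the "AI optimized" sort3 network. */
static void sort3_fig3c(int A, int B, int C, int m[3])
{
    m[0] = min2(A, B);
    m[1] = max2(min2(A, C), B);
    m[2] = max2(A, C);
}
```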

Re: Faster Sort Algorithms

<u5t7rh$1jc9r$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=32706&group=comp.arch#32706

 by: Marcus - Thu, 8 Jun 2023 18:49 UTC

On 2023-06-07 kl. 23:40, MitchAlsup wrote:
> An AI bot has figured out some faster sorting algorithms.
>
> https://www.nature.com/articles/s41586-023-06004-9
>
> Looking at the C++* code output from the paper::
>
> void variable_sort_2( int length, int *a )
> {
> switch( length )
> {
> case 0:
> case 1:
> return;
> case 2:
> int temp = a[0];
> a[0] = (a[1] < a[0] ) ? a[1] : a[0];
> a[1] = (a[1] < temp ) ? temp : a[1];
> }
> }
>
> There is x86 asm code in the article, consisting of 11 instructions.
>
> My 66000 ASM
>
> variable_sort_2:
> CMP R3,R1,#2
> PLT R3,TTTTTT
> LD R4,[R2,0]
> LD R5,[R2,4]
> MAX R6,R4,R5
> MIN R7,R4,R5
> ST R7,[R2,0]
> ST R5,[R2,4]
> RET
>
> For a count of 9 instructions.
>
> (*) it looks like C code to me instead of C++ code........

For MRISC32 I get one branch:

variable_sort_2:
sne r1, r1, #2
bs r1, 1f
ldw r3, [r2]
ldw r1, [r2, #4]
min r4, r1, r3
max r1, r1, r3
stw r4, [r2]
stw r1, [r2, #4]
1: ret

/Marcus

Re: Faster Sort Algorithms

<u5ta2b$1jqs0$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=32707&group=comp.arch#32707

 by: Stephen Fuld - Thu, 8 Jun 2023 19:27 UTC

On 6/8/2023 12:26 AM, Terje Mathisen wrote:
> MitchAlsup wrote:
>> An AI bot has figured out some faster sorting algorithms.
>>
>> https://www.nature.com/articles/s41586-023-06004-9
>>
>> Looking at the C++* code output from the paper::
>>
>> void variable_sort_2( int length, int *a )
>> {
>>       switch( length )
>>       {
>>        case 0:
>>        case 1:
>>                      return;
>>        case 2:
>>                      int temp = a[0];
>>                      a[0] = (a[1] < a[0] ) ? a[1] : a[0];
>>                      a[1] = (a[1] < temp ) ? temp : a[1];
>>       }
>> }
>
> This is branchless sorting, something I started to investigate decades
> ago: Every time I've done so I've found that since I force write
> operations to store back the sorted pair, even when it was already in
> order, the cost of those stores outweigh the advantage I got from being
> branchless.
>
> Things might be different now, but I did start with the assumption that
> stores back to an already $L1 resident cache line would be very close to
> free.

Note that the paper did give actual, experimental results. They showed
a not necessarily huge, but statistically significant improvement.

https://www.nature.com/articles/s41586-023-06004-9#Sec6

and the data given in table 1 linked to in that section.

There is some more discussion of the results in the news release related
to the article at

https://www.nature.com/articles/d41586-023-01883-4

together with a comment in the last paragraph that they would like to
apply the same methods to hardware design!

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Faster Sort Algorithms

<u5ufij$1r60h$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=32710&group=comp.arch#32710

 by: Terje Mathisen - Fri, 9 Jun 2023 06:07 UTC

Stephen Fuld wrote:
> On 6/8/2023 12:26 AM, Terje Mathisen wrote:
>> MitchAlsup wrote:
>>> An AI bot has figured out some faster sorting algorithms.
>>>
>>> https://www.nature.com/articles/s41586-023-06004-9
>>>
>>> Looking at the C++* code output from the paper::
>>>
>>> void variable_sort_2( int length, int *a )
>>> {
>>>       switch( length )
>>>       {
>>>        case 0:
>>>        case 1:
>>>                      return;
>>>        case 2:
>>>                      int temp = a[0];
>>>                      a[0] = (a[1] < a[0] ) ? a[1]
>>> : a[0];
>>>                      a[1] = (a[1] < temp ) ? temp
>>> : a[1];
>>>       }
>>> }
>>
>> This is branchless sorting, something I started to investigate decades
>> ago: Every time I've done so I've found that since I force write
>> operations to store back the sorted pair, even when it was already in
>> order, the cost of those stores outweigh the advantage I got from
>> being branchless.
>>
>> Things might be different now, but I did start with the assumption
>> that stores back to an already $L1 resident cache line would be very
>> close to free.
>
> Note that the paper did give actual, experimental results.  They showed
> a not necessarily huge, but statistically significant improvement.

That's great, I'm reading the article now!
>
> https://www.nature.com/articles/s41586-023-06004-9#Sec6
>
> and the data given in table 1 linked to in that section.
>
> There is some more discussion of the results in the news release related
> to the article at
>
> https://www.nature.com/articles/d41586-023-01883-4

I'll have to check that more closely, but my immediate reaction to the
first part (better latency for VarSortN than human code) is that (a)
their assumption that human code doesn't take advantage of previous
compares is at least flawed, but more importantly, branchless x86 code
that uses CMOVs to get rid of the branches is almost always beaten by
less sophisticated code that does use branching to skip already-sorted
parts.

I.e. I have had to work hard to generate random or even pessimal inputs
to get consistent speedups.

More importantly, their comparisons are always against human branchless
code, not also against hybrid/branchy code, where the actual input data
strongly impacts the results. If they can use this approach to beat the
big-iron sorting benchmarks, then they have something really worthwhile!

They emphasize how VarSort4 works by sorting the first 3 elements, then
adding in the last element in the length=4 case, but this is afaik just
a final insertion sort. Not at all surprisingly, this saves a lot of
instructions vs having to write and branch to three totally separate
Sort2, Sort3, Sort4 sections.

I'm in fact more optimistic about sorting lengths that fit inside a SIMD
register, using MIN/MAX operations or parallel compares and shuffles.
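As a sketch of that idea (my code, a scalar stand-in for the SIMD version): a five-comparator network sorts four elements using only min/max pairs, each of which maps naturally onto packed-min/packed-max instructions.

```c
#include <assert.h>

/* One compare-exchange: a min/max pair with no data-dependent branch. */
static void cmpswap(int *x, int *y)
{
    int lo = *x < *y ? *x : *y;
    int hi = *x < *y ? *y : *x;
    *x = lo;
    *y = hi;
}

/* Five-comparator sorting network for four elements. */
static void sort4_network(int a[4])
{
    cmpswap(&a[0], &a[1]);
    cmpswap(&a[2], &a[3]);
    cmpswap(&a[0], &a[2]);
    cmpswap(&a[1], &a[3]);
    cmpswap(&a[1], &a[2]);
}
```

The first two comparators are independent, as are the next two, so a SIMD rendering needs only three min/max rounds plus shuffles.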

>
> together with a comment in the last paragraph that they would like to
> apply the same methods to hardware design!

This is probably more relevant: When you (i.e. Mitch) can apply hw
resources in parallel, finding shorter paths to the final results looks
to me a lot like the network optimizations that VLSI tools have employed
for decades.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Faster Sort Algorithms

<u5ugf7$1r9kl$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=32711&group=comp.arch#32711

 by: Terje Mathisen - Fri, 9 Jun 2023 06:22 UTC

I finally got to the part where they got a measured 1.7% speedup for
large sorts done with LLVM libc++ sort, with their patch accepted.

Well done, even if less amazing than the huge initial claims.

Re the final part about VarInt coding/decoding, this was recently done
with some quite clever SIMD code, so that large parts could be handled
without branches or additional loads.

Terje

Terje Mathisen wrote:
> Stephen Fuld wrote:
>> On 6/8/2023 12:26 AM, Terje Mathisen wrote:
>>> MitchAlsup wrote:
>>>> An AI bot has figured out some faster sorting algorithms.
>>>>
>>>> https://www.nature.com/articles/s41586-023-06004-9
>>>>
>>>> Looking at the C++* code output from the paper::
>>>>
>>>> void variable_sort_2( int length, int *a )
>>>> {
>>>>       switch( length )
>>>>       {
>>>>        case 0:
>>>>        case 1:
>>>>                      return;
>>>>        case 2:
>>>>                      int temp = a[0];
>>>>                      a[0] = (a[1] < a[0] ) ?
>>>> a[1] : a[0];
>>>>                      a[1] = (a[1] < temp ) ?
>>>> temp : a[1];
>>>>       }
>>>> }
>>>
>>> This is branchless sorting, something I started to investigate
>>> decades ago: Every time I've done so I've found that since I force
>>> write operations to store back the sorted pair, even when it was
>>> already in order, the cost of those stores outweigh the advantage I
>>> got from being branchless.
>>>
>>> Things might be different now, but I did start with the assumption
>>> that stores back to an already $L1 resident cache line would be very
>>> close to free.
>>
>> Note that the paper did give actual, experimental results.  They
>> showed a not necessarily huge, but statistically significant improvement.
>
> That's great, I'm reading the article now!
>>
>> https://www.nature.com/articles/s41586-023-06004-9#Sec6
>>
>> and the data given in table 1 linked to in that section.
>>
>> There is some more discussion of the results in the news release
>> related to the article at
>>
>> https://www.nature.com/articles/d41586-023-01883-4
>
> I'll have to check that more closely, but my immediate reaction for the
> first part (better latency for VarSortN than human code) is that (a)
> their assumption that human code doesn't take advantage of previous
> compares is at least flawed, but more importantly, branchless x86 code
> that uses CMOVs to get rid of the branches are almost always beaten by
> less sophisticated code which does use branching to skip alread-sorted
> parts.
>
> I.e. I have had to work hard to generate random or even pessimal inputs
> to get consistent speedups.
>
> More importantly, their comparisons are always against human branchless
> code, not also againsts hybrid/branchy code where the actual input data
> strongly impacts the results. If they can use this approach to beat the
> big iron sorting benchmarks, then they have something really worthwhile!
>
> They emphasize how VarSort4 works by sorting the first 3 elements, then
> adding in the last element in the length=4 case, but this is afaik just
> a final insertion sort. Not at all surprisingly, this saves a lot of
> instructions vs having to write and branch to three totally separate
> Sort2, Sort3, Sort4 sections.
>
> I'm in fact more optimistic about sorting lengths that fit inside a SIMD
> register, using MIN/MAX operations or parallel compares and shuffles.
>
>>
>> together with a comment in the last paragraph that they would like to
>> apply the same methods to hardware design!
>
> This is probably more relevant: When you (i.e Mitch) can apply hw
> resources in parallel, finding shorter paths to the final results looks
> to me a lot like network optimizations that VLSI tools have employed for
> decades.
>
> Terje
>

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Faster Sort Algorithms

<323486e7-8473-4c23-958b-3b9028ccb3ban@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=32712&group=comp.arch#32712

 by: Michael S - Fri, 9 Jun 2023 10:15 UTC

On Thursday, June 8, 2023 at 12:40:22 AM UTC+3, MitchAlsup wrote:
> An AI bot has figured out some faster sorting algorithms.
>
> https://www.nature.com/articles/s41586-023-06004-9
>
> Looking at the C++* code output from the paper::
>
> void variable_sort_2( int length, int *a )
> {
> switch( length )
> {
> case 0:
> case 1:
> return;
> case 2:
> int temp = a[0];
> a[0] = (a[1] < a[0] ) ? a[1] : a[0];
> a[1] = (a[1] < temp ) ? temp : a[1];
> }
> }
>
> There is x86 asm code in the article, consisting of 11 instructions.
>
> My 66000 ASM
>
> variable_sort_2:
> CMP R3,R1,#2
> PLT R3,TTTTTT
> LD R4,[R2,0]
> LD R5,[R2,4]
> MAX R6,R4,R5
> MIN R7,R4,R5
> ST R7,[R2,0]
> ST R5,[R2,4]
> RET
>
> For a count of 9 instructions.
>

Your asm can be literally translated into 9 x86 (AVX) instructions.
https://godbolt.org/z/Gbz1Kh56f
However I suspect that on the majority of real-world x86 cores the original
11 instructions are faster.

> (*) it looks like C code to me instead of C++ code........

To me it looks legal in both languages.

Re: Faster Sort Algorithms

<d94a6eab-66bf-4c05-a15f-381788a1f416n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=32713&group=comp.arch#32713
 by: Michael S - Fri, 9 Jun 2023 10:30 UTC

On Friday, June 9, 2023 at 9:22:34 AM UTC+3, Terje Mathisen wrote:
> I finally got to the part where they got a measured 1.7% speedup for
> large sorts done with LLVM libc++ sort, with their patch accepted.
>

I would think that for large sorts you can find 1-2 or maybe even 3%
of speed by playing with the "small segment" threshold that decides where
to switch from quicksort to insertion sort. But this gain will be very target
and size dependent.

> Well done, even if less amazing than the huge initial claims.

Still, to me the acceptance of the patch looks like a mistake on the part
of the libc++ maintainers. The gain by itself is not worth the hassle of the change.
Maybe there were other factors, like the new code being less messy
than the code it replaced?

>
> Re the final part about VarInt coding/decoding, this was recently done
> with some quite clever SIMD code, so that large parts could be handled
> without branches or additional loads.
>

IMHO, for practical purposes, sorting short segments is too small a part of
bigger sorts to bother with heavy optimization.
Which does not mean that it couldn't be done for enjoyment.

Re: Faster Sort Algorithms

<fd0b5dc1-6328-4a88-b951-98ed59bd3207n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=32719&group=comp.arch#32719
 by: MitchAlsup - Fri, 9 Jun 2023 15:24 UTC

On Friday, June 9, 2023 at 5:15:31 AM UTC-5, Michael S wrote:
> On Thursday, June 8, 2023 at 12:40:22 AM UTC+3, MitchAlsup wrote:
> > An AI bot has figured out some faster sorting algorithms.
> >
> > https://www.nature.com/articles/s41586-023-06004-9
> >
> > Looking at the C++* code output from the paper::
> >
> > void variable_sort_2( int length, int *a )
> > {
> > switch( length )
> > {
> > case 0:
> > case 1:
> > return;
> > case 2:
> > int temp = a[0];
> > a[0] = (a[1] < a[0] ) ? a[1] : a[0];
> > a[1] = (a[1] < temp ) ? temp : a[1];
> > }
> > }
> >
> > There is x86 asm code in the article, consisting of 11 instructions.
> >
> > My 66000 ASM
> >
> > variable_sort_2:
> > CMP R3,R1,#2
> > PLT R3,TTTTTT
> > LD R4,[R2,0]
> > LD R5,[R2,4]
> > MAX R6,R4,R5
> > MIN R7,R4,R5
> > ST R7,[R2,0]
> > ST R5,[R2,4]
> > RET
> >
> > For a count of 9 instructions.
> >
> Your asm can be literally translated into 9 x86 (AVX) instructions.
> https://godbolt.org/z/Gbz1Kh56f
> However I suspect that on the majority of real-world x86 cores the original
> 11 instructions are faster.
> > (*) it looks like C code to me instead of C++ code........
> To me it looks legal in both languages.
<
Which is why it is not C++ -- it uses none of the C++ features that go beyond C.

Re: Faster Sort Algorithms

<0001HW.2A337EBF0122700F7000024EF38F@news.individual.net>

https://news.novabbs.org/devel/article-flat.php?id=32720&group=comp.arch#32720
 by: Bill Findlay - Fri, 9 Jun 2023 15:37 UTC

On 9 Jun 2023, MitchAlsup wrote
(in article<fd0b5dc1-6328-4a88-b951-98ed59bd3207n@googlegroups.com>):

> On Friday, June 9, 2023 at 5:15:31???AM UTC-5, Michael S wrote:
> > On Thursday, June 8, 2023 at 12:40:22???AM UTC+3, MitchAlsup wrote:
> > > An AI bot has figured out some faster sorting algorithms.
> > >
> > > https://www.nature.com/articles/s41586-023-06004-9
> > >
> > > Looking at the C++* code output from the paper::
> > >
> > > void variable_sort_2( int length, int *a )
> > > {
> > > switch( length )
> > > {
> > > case 0:
> > > case 1:
> > > return;
> > > case 2:
> > > int temp = a[0];
> > > a[0] = (a[1] < a[0] ) ? a[1] : a[0];
> > > a[1] = (a[1] < temp ) ? temp : a[1];
> > > }
> > > }
> > >
> > > There is x86 asm code in the article, consisting of 11 instructions.
> > >
> > > My 66000 ASM
> > >
> > > variable_sort_2:
> > > CMP R3,R1,#2
> > > PLT R3,TTTTTT
> > > LD R4,[R2,0]
> > > LD R5,[R2,4]
> > > MAX R6,R4,R5
> > > MIN R7,R4,R5
> > > ST R7,[R2,0]
> > > ST R5,[R2,4]
> > > RET
> > >
> > > For a count of 9 instructions.
> > Your asm can be literally translated into 9 x86 (AVX) instructions.
> > https://godbolt.org/z/Gbz1Kh56f
> > However I suspect that on the majority of real-world x86 cores the original
> > 11 instructions are faster.

KDF9, designed ca. 1960:

YA0; YA1; MAX; =YA1; =YA0; {VR;}

It sets the V flipflop if the values had to be reversed in the NEST,
"VR;" clears V if you don't care about that.
--
Bill Findlay

Re: Faster Sort Algorithms

<1OHgM.11457$fZx2.4099@fx14.iad>

https://news.novabbs.org/devel/article-flat.php?id=32721&group=comp.arch#32721
 by: Scott Lurndal - Fri, 9 Jun 2023 15:45 UTC

MitchAlsup <MitchAlsup@aol.com> writes:
>On Friday, June 9, 2023 at 5:15:31 AM UTC-5, Michael S wrote:
>> On Thursday, June 8, 2023 at 12:40:22 AM UTC+3, MitchAlsup wrote:
>> > An AI bot has figured out some faster sorting algorithms.
>> >
>> > https://www.nature.com/articles/s41586-023-06004-9
>> >
>> > Looking at the C++* code output from the paper::
>> >
>> > void variable_sort_2( int length, int *a )
>> > {
>> > switch( length )
>> > {
>> > case 0:
>> > case 1:
>> > return;
>> > case 2:
>> > int temp = a[0];
>> > a[0] = (a[1] < a[0] ) ? a[1] : a[0];
>> > a[1] = (a[1] < temp ) ? temp : a[1];
>> > }
>> > }
>> >
>> > There is x86 asm code in the article, consisting of 11 instructions.
>> >
>> > My 66000 ASM
>> >
>> > variable_sort_2:
>> > CMP R3,R1,#2
>> > PLT R3,TTTTTT
>> > LD R4,[R2,0]
>> > LD R5,[R2,4]
>> > MAX R6,R4,R5
>> > MIN R7,R4,R5
>> > ST R7,[R2,0]
>> > ST R5,[R2,4]
>> > RET
>> >
>> > For a count of 9 instructions.
>> >
>> Your asm can be literally translated into 9 x86 (AVX) instructions.
>> https://godbolt.org/z/Gbz1Kh56f
>> However I suspect that on the majority of real-world x86 cores the original
>> 11 instructions are faster.
>> > (*) it looks like C code to me instead of C++ code........
>> To me it looks legal in both languages.
><
>Which is why it is not C++ -- it uses none of the C++ features that go beyond C.

That doesn't mean that it isn't C++. Many of the soi disant "C++ features"
are syntactic sugar. The good parts (object orientation and data encapsulation)
can be used without using all the cruft.

Re: Faster Sort Algorithms

<fbc08953-3b9b-4d77-be87-9fd920d39dben@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=32723&group=comp.arch#32723
 by: MitchAlsup - Fri, 9 Jun 2023 16:29 UTC

On Friday, June 9, 2023 at 10:45:05 AM UTC-5, Scott Lurndal wrote:
> MitchAlsup <Mitch...@aol.com> writes:
> >On Friday, June 9, 2023 at 5:15:31 AM UTC-5, Michael S wrote:
> >> On Thursday, June 8, 2023 at 12:40:22 AM UTC+3, MitchAlsup wrote:
> >> > An AI bot has figured out some faster sorting algorithms.
> >> >
> >> > https://www.nature.com/articles/s41586-023-06004-9
> >> >
> >> > Looking at the C++* code output from the paper::
> >> >
> >> > void variable_sort_2( int length, int *a )
> >> > {
> >> > switch( length )
> >> > {
> >> > case 0:
> >> > case 1:
> >> > return;
> >> > case 2:
> >> > int temp = a[0];
> >> > a[0] = (a[1] < a[0] ) ? a[1] : a[0];
> >> > a[1] = (a[1] < temp ) ? temp : a[1];
> >> > }
> >> > }
> >> >
> >> > There is x86 asm code in the article, consisting of 11 instructions.
> >> >
> >> > My 66000 ASM
> >> >
> >> > variable_sort_2:
> >> > CMP R3,R1,#2
> >> > PLT R3,TTTTTT
> >> > LD R4,[R2,0]
> >> > LD R5,[R2,4]
> >> > MAX R6,R4,R5
> >> > MIN R7,R4,R5
> >> > ST R7,[R2,0]
> >> > ST R5,[R2,4]
> >> > RET
> >> >
> >> > For a count of 9 instructions.
> >> >
> >> Your asm can be literally translated into 9 x86 (AVX) instructions.
> >> https://godbolt.org/z/Gbz1Kh56f
> >> However I suspect that on the majority of real-world x86 cores the original
> >> 11 instructions are faster.
> >> > (*) it looks like C code to me instead of C++ code........
> >> To me it looks legal in both languages.
> ><
> >Which is why it is not C++ -- it uses none of the C++ features that go beyond C.
>
> That doesn't mean that it isn't C++. Many of the soi disant "C++ features"
> are syntactic sugar. The good parts (object orientation and data encapsulation)
> can be used without using all the cruft.
<
I remember a program written for the C obfuscation contest many years ago
which was compilable in C, compilable in Fortran, and compilable in Pascal.
<
Does that make that program any of them ??

Re: Faster Sort Algorithms

<rPIgM.11831$fZx2.2801@fx14.iad>

https://news.novabbs.org/devel/article-flat.php?id=32727&group=comp.arch#32727
 by: EricP - Fri, 9 Jun 2023 16:53 UTC

EricP wrote:
> MitchAlsup wrote:
>> An AI bot has figured out some faster sorting algorithms.
>>
>> https://www.nature.com/articles/s41586-023-06004-9
>
> Ummm... in fig 3.b they show the original sort3 result as
> and in fig 3.c they show the AI optimized sort3 result as
>
> neither of them is the correct ascending sort order.

I think I see what they were talking about. Those figures were examples of
arbitrary sort networks of 3 and 4 variables, and have nothing to do with
the sort3 and sort4 routines.

Anyway, the code AlphaDev produced is here:

https://github.com/deepmind/alphadev/blob/main/sort_functions_test.cc

and is almost the same code I would have written as it results from
doing conditional swaps. For Sort3AlphaDev my only difference is I would
put the "mov (%0), %%r8d" at the front to get the load going ASAP.
Same for the other SortN's.

So I don't see what optimization AlphaDev could have done.

Remember that CMOV is often implemented as two uOps, one to move the
source if condition True, the other to copy the original dest if False.
That creates a dependency between the original dest register and
subsequent users of that same register that is not apparent in the code.
While register MOV can sometimes be optimized away by HW,
CMOV for both the True and False paths cannot.

The actual performance issues are obscured by the assembler.
Using CSWAP as a conditional register swap macro highlights the differences.

Sort3 is uninteresting because it is just
Sort3 (a, b, c)
CSWAP_GT (a,b)
CSWAP_GT (b,c)
CSWAP_GT (a,b)

Sort4 and others are interesting because some operations can be done
in parallel if a cpu can launch multiple ALU instructions per clock.

Sort4 (a, b, c, d)
CSWAP_GT (a,b)
CSWAP_GT (c,d)
CSWAP_GT (b,c)
CSWAP_GT (a,b)
CSWAP_GT (c,d)

For ISA design, compare and swap CSWAP_cc rd,rs could probably be an
instruction along with SWAP rd,rs.

I also have Load Immediate Conditional LIC , Load Conditional LDC and
Store Conditional STC. LDC and STC have nothing to do with atomics.

LIC and LDC allow you to conditionally overwrite a register value.
STC allows you to optimize away the store if you didn't change a value.

For LDC and STC, one idea I had, rather than adding a condition register
(which would give STC 4 source registers), was to test whether
the base address register is zero/non-zero as the disable/enable.

Applied to the above register sort routines, using CSWAP replaces a MOV
and 2 CMOVs. One could set a register bit flag to 1 if any swap occurs,
then later test that flag and have an STC skip the write-back if unchanged.
In this SortN case that's probably more trouble than it's worth.

Re: Faster Sort Algorithms

<90ea8cc3-2ae8-4401-97a1-f9285e7588c5n@googlegroups.com>

https://news.novabbs.org/devel/article-flat.php?id=32736&group=comp.arch#32736
 by: MitchAlsup - Fri, 9 Jun 2023 19:43 UTC

On Friday, June 9, 2023 at 11:54:51 AM UTC-5, EricP wrote:
> EricP wrote:
> > MitchAlsup wrote:
> >> An AI bot has figured out some faster sorting algorithms.
> >>
> >> https://www.nature.com/articles/s41586-023-06004-9
> >
> > Ummm... in fig 3.b they show the original sort3 result as
> > and in fig 3.c they show the AI optimized sort3 result as
> >
> > neither of them is the correct ascending sort order.
> I think see what they were talking about. Those figures were examples of
> arbitrary sort networks of 3 and 4 variables, and nothing to do with
> the sort3 and sort4 routines.
>
> Anyway, the code AlphaDev produced is here:
>
> https://github.com/deepmind/alphadev/blob/main/sort_functions_test.cc
>
> and is almost the same code I would have written as it results from
> doing conditional swaps. For Sort3AlphaDev my only difference is I would
> put the "mov (%0), %%r8d" at the front to get the load going ASAP.
> Same for the other SortN's.
>
> So I don't see what optimization AlphaDev could have done.
>
> Remember that CMOV is often implemented as two uOps, one to move the
> source if condition True, the other to copy the original dest if False.
<
This is why MAX and MIN instructions are simply better: fewer operands,
and just as easy a choice.
<
> That creates a dependency between the original dest register and
> subsequent users of that same register that is not apparent in the code.
> While register MOV can sometimes be optimized away by HW,
> CMOV for both the True and False paths cannot.
>
> The actual performance issues are obscured by the assembler.
<
Errrrrrr........ISA is the problem, ASM is simply how ISA is spelled using ASCII.
<
> Using CSWAP as a conditional register swap macro highlights the differences.
>
> Sort3 is uninteresting because it is just
> Sort3 (a, b, c)
> CSWAP_GT (a,b)
> CSWAP_GT (b,c)
> CSWAP_GT (a,b)
<
// All single result instructions
MAX TA,A,B
MIN TB,A,B
MAX TB,TB,C
MIN C,TB,C
MAX A,TA,TB
MIN B,TA,TB
<
I see no particular reason that MAX and MIN cannot be issued together
(in parallel), so we have 3 cycles of 2 instructions each.
>
> Sort4 and others are interesting because some operations can be done
> in parallel if a cpu can launch multiple ALU instructions per clock.
>
> Sort4 (a, b, c, d)
> CSWAP_GT (a,b)
> CSWAP_GT (c,d)
> CSWAP_GT (b,c)
> CSWAP_GT (a,b)
> CSWAP_GT (c,d)
>
> For ISA design, compare and swap CSWAP_cc rd,rs could probably be an
> instruction along with SWAP rd,rs.
<
2 results per instruction is contrary to the tenets of RISC.
And most/many of us think condition codes are contrary to the tenets of RISC.
>
> I also have Load Immediate Conditional LIC , Load Conditional LDC and
> Store Conditional STC. LDC and STC have nothing to do with atomics.
>
> LIC and LDC allow you to conditionally overwrite a register value.
> STC allows you to optimize away the store if you didn't change a value.
>
> For LDC and STC one idea I had was that rather than add a condition register
> because that would mean STC had 4 source register, was to test whether
> the base address register was zero/non-zero as the disable/enable.
>
> Applied to the above register sort routines, using CSWAP replaces a MOV
> and 2 CMOV. One could set register bit flags to 1 if any swap occurs.
> Then later test that flag and a STC skip the write back if unchanged.
> In this SortN case that's probably more trouble than its worth.

Re: Faster Sort Algorithms

<u62lg3$2dfv8$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=32759&group=comp.arch#32759
 by: Terje Mathisen - Sat, 10 Jun 2023 20:12 UTC

Re our recent sort thread:

https://github.com/Voultapher/sort-research-rs/blob/main/writeup/intel_avx512/text.md

cpp_vqsort_new() seems like the current winner in the generic sort
library race, with gains almost certainly well above the quoted 1.7% for
the AI-derived micro-optimizations that got into LLVM libc++.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Faster Sort Algorithms

<CckhM.10390$NuA9.3884@fx03.iad>

https://news.novabbs.org/devel/article-flat.php?id=32764&group=comp.arch#32764
 by: EricP - Sun, 11 Jun 2023 13:42 UTC

MitchAlsup wrote:
> On Friday, June 9, 2023 at 11:54:51 AM UTC-5, EricP wrote:
>>
>> For ISA design, compare and swap CSWAP_cc rd,rs could probably be an
>> instruction along with SWAP rd,rs.
> <
> 2 results per instruction is contrary to the tenets of RISC.
> And most/many of us thing condition codes are contrary to tenets of RISC.

I was driven to instructions with 2-dest results by my desire to
(a) not have condition codes and (b) handle add-with-carry in a
straightforward way. That leads to many new wide-result instructions.

I see the RISC tenets as guidelines advising designers to
consider the benefits and costs rather than hidebound dogma.
And that cost/benefit consideration is made in the context of the
logic budgets of the current day, not locked to a date mid 1980's.

Variable-length instructions open the ISA to much useful functionality,
like full-size immediate operands, but require a more complicated fetch
unit than fixed-size ones. This cost is generally accepted today.

I found your arguments for the utility of store-immediate instructions
convincing, but that requires allowing for up to two immediates,
one for the addressing mode and one for the data. The internal cost is
extra decode logic, extra flip-flops in the front end to hold the second
immediate, and extra muxes to route it, and that puts pressure on
budget designs.

Having up to two dest registers opens an ISA to a whole host of new and
useful instructions. Mostly these are double-wide results for ADD, MUL,
various shifts and bit fields. It also allows integer DIV to return
both the quotient and remainder, and SWAP or CSWAP, plus many others.

As to the cost of 2 dest registers, at the low end for an in-order,
single issue pipeline with 1 register write port, taking an extra clock
to write a second result does not add any complexity.

For a multi-lane OoO uArch my concern was for Rename to NOT require
two register allocation units for each lane, which would add greatly to
the cost and complexity and mostly sit unused, burning power.

My solution is for Decode to spit out 2-dest instructions as two 1-dest uOps
split across two consecutive lanes. Rename allocates the 1-dest registers
to uOps as normal, then fuses the two pieces and outputs a single uOp to
one output lane to Dispatch.

The OoO writeback of multiple dests is no more expensive because it
already requires multiple result and forwarding buses and write ports
so that it can catch up after bubbles.

With plausible solutions for each problem area, and the costs for 2-dest
registers under control, I green-lighted it for general use by new
instructions.

So "CSWAP rdX, rdY, rsA, rsB, rsCC" slides right into the ISA:
IF (rsCC) rdX = rsA, rdY = rsB ELSE rdX = rsB, rdY = rsA END IF
It needs 3 source registers, but FMA and ADD-wide already need that,
and 2 dest registers, and could be implemented as a single ALU uOp.

Re: Faster Sort Algorithms

<2023Jun11.163044@mips.complang.tuwien.ac.at>

  copy mid

https://news.novabbs.org/devel/article-flat.php?id=32765&group=comp.arch#32765

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Faster Sort Algorithms
Date: Sun, 11 Jun 2023 14:30:44 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 52
Message-ID: <2023Jun11.163044@mips.complang.tuwien.ac.at>
References: <4abb287e-88eb-485a-916e-50a0a3b0d68cn@googlegroups.com> <FVlgM.4315$fZx2.2545@fx14.iad> <rPIgM.11831$fZx2.2801@fx14.iad> <90ea8cc3-2ae8-4401-97a1-f9285e7588c5n@googlegroups.com> <CckhM.10390$NuA9.3884@fx03.iad>
 by: Anton Ertl - Sun, 11 Jun 2023 14:30 UTC

EricP <ThatWouldBeTelling@thevillage.com> writes:
>I was driven to instructions with 2-dest results by my desire to
>(a) not have condition codes and (b) wanting to handle add with carry
>in a straight forward way.

I have expanded my answer to that problem into a paper:
<http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>.

Of course, the question is whether you consider this answer
straightforward.

>Variable length instructions opens the ISA to much useful functionality
>like full size immediate operands but requires a more complicated fetch
>unit than fixed size. This cost is generally accepted today.

Counterexample: ARM A64 (and ARM has ample experience with
variable-size instructions).

>As to the cost of 2 dest registers, at the low end for an in-order,
>single issue pipeline with 1 register write port, taking an extra clock
>to write a second result does not add any complexity.

At least not much.

But what's the difference from

add s0 = a, b
sltu carry0 = s0,a

which also takes two cycles? Or do you have a three-input
add-with-carry? That would certainly solve the major problem that
MIPS, Alpha, and RISC-V have for this stuff, which is dealing with the
carry-in.
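
Written out in C, the two-instruction idiom above works because
unsigned addition wraps: the sum being smaller than an operand is
exactly the carry-out.

```c
#include <stdint.h>

/* C rendering of the add/sltu pair above: since unsigned addition is
   modular, s0 < a holds exactly when a + b overflowed, i.e. it
   recovers the carry-out of the addition. */
static uint64_t add_carry_out(uint64_t a, uint64_t b, uint64_t *carry)
{
    uint64_t s0 = a + b;      /* add  s0 = a, b      */
    *carry = (s0 < a);        /* sltu carry0 = s0, a */
    return s0;
}
```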

>For a multi-lane OoO uArch my concern was for Rename to NOT require
>two register allocation units for each lane, adding greatly to the cost
>and complexity, and which would mostly sit unused burning power.
>
>My solution is for Decode to spit out 2-dest instructions as two 1-dest uOps
>split across two consecutive lanes. Rename allocates the 1-dest registers
>to uOps as normal, then fuses the two pieces and outputs a single uOp to
>one output lane to Dispatch.

I wonder if that's what ARM are doing for their multi-result
instructions (load-pair with pre/post-indexed addressing mode consumes
three write ports).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Faster Sort Algorithms

<fce79f80-b4c9-434c-8375-2efead1a637dn@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=32767&group=comp.arch#32767

Newsgroups: comp.arch
Date: Sun, 11 Jun 2023 09:59:31 -0700 (PDT)
In-Reply-To: <2023Jun11.163044@mips.complang.tuwien.ac.at>
References: <4abb287e-88eb-485a-916e-50a0a3b0d68cn@googlegroups.com>
<FVlgM.4315$fZx2.2545@fx14.iad> <rPIgM.11831$fZx2.2801@fx14.iad>
<90ea8cc3-2ae8-4401-97a1-f9285e7588c5n@googlegroups.com> <CckhM.10390$NuA9.3884@fx03.iad>
<2023Jun11.163044@mips.complang.tuwien.ac.at>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <fce79f80-b4c9-434c-8375-2efead1a637dn@googlegroups.com>
Subject: Re: Faster Sort Algorithms
From: robfi680@gmail.com (robf...@gmail.com)
Injection-Date: Sun, 11 Jun 2023 16:59:31 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: robf...@gmail.com - Sun, 11 Jun 2023 16:59 UTC

On Sunday, June 11, 2023 at 10:57:39 AM UTC-4, Anton Ertl wrote:
> EricP <ThatWould...@thevillage.com> writes:
> >I was driven to instructions with 2-dest results by my desire to
> >(a) not have condition codes and (b) wanting to handle add with carry
> >in a straight forward way.
> I have expanded my answer to that problem into a paper:
> <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>.
>
> Of course, the question is whether you consider this answer
> straightforward.
> >Variable length instructions opens the ISA to much useful functionality
> >like full size immediate operands but requires a more complicated fetch
> >unit than fixed size. This cost is generally accepted today.
> Counterexample: ARM A64 (and ARM has ample experience with
> variable-size instructions).

Not too long ago I decided to go with a fixed-size instruction because
the variable-length calculations ended up on the critical timing path in
an FPGA design. Postfixes can still effectively extend the instruction
with constants.

> >As to the cost of 2 dest registers, at the low end for an in-order,
> >single issue pipeline with 1 register write port, taking an extra clock
> >to write a second result does not add any complexity.
> At least not much.
>
> But what's the difference from
>
>
> add s0 = a, b
> sltu carry0 = s0,a
>
> which also takes two cycles? Or do you have a three-input
> add-with-carry? That would certainly solve the major problem that
> MIPS, Alpha, and RISC-V have for this stuff, which is dealing with the
> carry-in.
> >For a multi-lane OoO uArch my concern was for Rename to NOT require
> >two register allocation units for each lane, adding greatly to the cost
> >and complexity, and which would mostly sit unused burning power.
> >
> >My solution is for Decode to spit out 2-dest instructions as two 1-dest uOps
> >split across two consecutive lanes. Rename allocates the 1-dest registers
> >to uOps as normal, then fuses the two pieces and outputs a single uOp to
> >one output lane to Dispatch.
> I wonder if that's what ARM are doing for their multi-result
> instructions (load-pair with pre/post-indexed addressing mode consumes
> three write ports).
>
An idea I have been toying with is caching the operands and results of
some operations. So, if a divide has been done, the remainder is
available immediately without having to do a full divide. This works for
multiply-high too. I think it should also work well with reciprocal
calculations, where for instance the same reciprocal is used throughout
a loop.
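
A minimal sketch of the idea, as a hypothetical one-entry cache keyed on
the divide operands (purely illustrative; in hardware this would sit on
the divider's operand latches rather than in software):

```c
#include <stdint.h>

/* Hypothetical one-entry operand/result cache for divide: one full
   divide fills both quotient and remainder, so a following remainder
   request with the same operands hits without redividing. */
struct div_cache { uint64_t a, b, quot, rem; int valid; };

static uint64_t cached_rem(struct div_cache *c, uint64_t a, uint64_t b)
{
    if (!(c->valid && c->a == a && c->b == b)) {
        c->a = a; c->b = b;
        c->quot = a / b;      /* one divide produces both results */
        c->rem  = a % b;
        c->valid = 1;
    }
    return c->rem;            /* later REM with same operands hits */
}
```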

To deal with extended precision ADD in the past I have used a three operand
ADD, combined with a carry generate instruction. The carry generate
instruction takes the same operands as the add instruction but returns the
carry out instead of the sum. It can then be added in the next instruction. It
is a bit ugly but if one really wants to do multi-word arithmetic it is possible.
I think the ADD and the carry-generate can be done in the same clock cycle
in a superscalar CPU.
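
In C the pairing looks like this, sketched on 64-bit limbs with cgen()
standing in for the carry-generate instruction:

```c
#include <stdint.h>

/* Sketch of the three-operand ADD plus carry-generate pattern above:
   cgen() returns the carry-out of a + b + cin instead of the sum, and
   the next limb's add folds it in as its third operand. */
static uint64_t cgen(uint64_t a, uint64_t b, uint64_t cin)
{
    uint64_t t = a + b;
    return (t < a) | ((t + cin) < t);   /* carry-out of a + b + cin */
}

/* Two-limb (128-bit) add built from the pattern: */
static void add128(uint64_t lo1, uint64_t hi1,
                   uint64_t lo2, uint64_t hi2,
                   uint64_t *lo, uint64_t *hi)
{
    *lo = lo1 + lo2;                    /* three-operand ADD, cin = 0 */
    *hi = hi1 + hi2 + cgen(lo1, lo2, 0);
}
```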

> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Re: Faster Sort Algorithms

<u658d3$2pstp$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=32770&group=comp.arch#32770

From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Faster Sort Algorithms
Date: Sun, 11 Jun 2023 14:47:44 -0500
Organization: A noiseless patient Spider
Lines: 193
Message-ID: <u658d3$2pstp$1@dont-email.me>
References: <4abb287e-88eb-485a-916e-50a0a3b0d68cn@googlegroups.com>
<FVlgM.4315$fZx2.2545@fx14.iad> <rPIgM.11831$fZx2.2801@fx14.iad>
<90ea8cc3-2ae8-4401-97a1-f9285e7588c5n@googlegroups.com>
<CckhM.10390$NuA9.3884@fx03.iad>
<2023Jun11.163044@mips.complang.tuwien.ac.at>
<fce79f80-b4c9-434c-8375-2efead1a637dn@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 11 Jun 2023 19:47:47 -0000 (UTC)
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.11.2
In-Reply-To: <fce79f80-b4c9-434c-8375-2efead1a637dn@googlegroups.com>
Content-Language: en-US
 by: BGB - Sun, 11 Jun 2023 19:47 UTC

On 6/11/2023 11:59 AM, robf...@gmail.com wrote:
> On Sunday, June 11, 2023 at 10:57:39 AM UTC-4, Anton Ertl wrote:
>> EricP <ThatWould...@thevillage.com> writes:
>>> I was driven to instructions with 2-dest results by my desire to
>>> (a) not have condition codes and (b) wanting to handle add with carry
>>> in a straight forward way.
>> I have expanded my answer to that problem into a paper:
>> <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>.
>>
>> Of course, the question is whether you consider this answer
>> straightforward.
>>> Variable length instructions opens the ISA to much useful functionality
>>> like full size immediate operands but requires a more complicated fetch
>>> unit than fixed size. This cost is generally accepted today.
>> Counterexample: ARM A64 (and ARM has ample experience with
>> variable-size instructions).
>
> I just decided not to long ago to go with a fixed sized instruction because
> variable length calculations became on the critical timing path for an FPGA
> design. Postfixes can effectively extend the instruction with constants.
>

I went with prefixes.

If I designed another (new) variable-length ISA, it is probable I would
still go with prefixes.

Though, I would likely start (this time) with the assumption of the
32-bit instructions as baseline, and the 16-bit as a shorthand. My
encoding is currently a little messy in that (in its original form) the
32-bit instructions were themselves a prefix encoding (of two 16-bit parts).

Granted, prefixes have the downside of turning the immediate values into
confetti... But, then again, there is RISC-V with fixed-length
instructions and the immediate fields are still confetti...

It doesn't really affect the CPU too much, but it is admittedly kind of
a pain for the assembler and linker (the linker needing a larger number
of reloc types). Though one can reduce the annoyance, say, by making all
PC-rel offsets either Byte or Word (depending on the op), and all
GBR-rel ops Byte-based, at least cutting down on the needed number of
reloc types.

Well, except when the encoding rules change from one instruction to
another, which is admittedly thus far why I haven't bothered with
RISC-V's 'C' extension.

Like, RVC is sort of like someone looked at Thumb and was then like
"Hold my beer!".

One other option would be, say, to have variable length ops with a
size/form bucket, say:
000: 32-bit op, 3R Forms
001: 32-bit op, 3RI Forms
010: 64-bit op, 3R/4R/etc forms
011: 64-bit op, 3RI Forms, Imm32
100: 96-bit op, -
101: 96-bit op, 3RI Forms, Imm64

Where, say:
000z-zzzz-zzdd-dddd ssss-sstt-tttt-zzzz //3R
001z-zzzz-zzdd-dddd ssss-ssii-iiii-iiii //3RI (Imm10)

011z-zzzz-zzdd-dddd ssss-sszz-zzzz-zzzz //3RI (Imm32)
iiii-iiii-iiii-iiii iiii-iiii-iiii-iiii //Imm

101z-zzzz-zzdd-dddd ssss-sszz-zzzz-zzzz //3RI (Imm64)
iiii-iiii-iiii-iiii iiii-iiii-iiii-iiii //Imm
iiii-iiii-iiii-iiii iiii-iiii-iiii-iiii //Imm

Where, say, the Imm32 and Imm64 encodings operate in (mostly) the same
space as the 3R encodings, but will normally suppress the Rt field in
favor of an immediate. The Imm10 encodings would represent a slightly
different encoding space (likely exclusively Load/Store ops and ALU
Immediate instructions).

This would be nicer for an assembler and linker, but would implicitly
mean that the instruction stream would no longer be "self-aligning".

Say, with a self-aligning stream, if one starts decoding/disassembling
instructions from an arbitrary location in the "text" sections, then the
decoding will realign with the instruction stream within a few
instruction words (this property is useful for tools like debuggers).
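
For what it's worth, decoding the 3-bit size/form bucket above is
trivial; a C sketch, assuming the bucket sits in the top 3 bits of the
first 16-bit word as in the bit layouts shown:

```c
#include <stdint.h>

/* Length decode for the hypothetical size/form bucket above: the top
   3 bits of the first 16-bit word give the total length directly. */
static int insn_length_bits(uint16_t first_word)
{
    switch (first_word >> 13) {
    case 0: case 1: return 32;   /* 32-bit 3R / 3RI (Imm10) forms */
    case 2: case 3: return 64;   /* 64-bit forms, up to Imm32     */
    case 4: case 5: return 96;   /* 96-bit forms, Imm64           */
    default:        return -1;   /* 110/111: unassigned here      */
    }
}
```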

>>> As to the cost of 2 dest registers, at the low end for an in-order,
>>> single issue pipeline with 1 register write port, taking an extra clock
>>> to write a second result does not add any complexity.
>> At least not much.
>>
>> But what's the difference from
>>
>>
>> add s0 = a, b
>> sltu carry0 = s0,a
>>
>> which also takes two cycles? Or do you have a three-input
>> add-with-carry? That would certainly solve the major problem that
>> MIPS, Alpha, and RISC-V have for this stuff, which is dealing with the
>> carry-in.
>>> For a multi-lane OoO uArch my concern was for Rename to NOT require
>>> two register allocation units for each lane, adding greatly to the cost
>>> and complexity, and which would mostly sit unused burning power.
>>>
>>> My solution is for Decode to spit out 2-dest instructions as two 1-dest uOps
>>> split across two consecutive lanes. Rename allocates the 1-dest registers
>>> to uOps as normal, then fuses the two pieces and outputs a single uOp to
>>> one output lane to Dispatch.
>> I wonder if that's what ARM are doing for their multi-result
>> instructions (load-pair with pre/post-indexed addressing mode consumes
>> three write ports).
>>
> An idea I have been toying with is for some operations caching the operands
> and results. So, if a divide is done the remainder is available immediately
> without having to do a full divide. Works for multiply high too. I think it should
> work well with reciprocal calculations too, where the same reciprocal is used
> in a loop for instance.
>
> To deal with extended precision ADD in the past I have used a three operand
> ADD, combined with a carry generate instruction. The carry generate
> instruction takes the same operands as the add instruction but returns the
> carry out instead of the sum. It can then be added in the next instruction. It
> is a bit ugly but if one really wants to do multi-word arithmetic it is possible.
> I think the ADD and the carry-generate can be done in the same clock cycle
> in a superscalar CPU.
>

Possible, but likely to get rather annoying, rather fast.

One may almost as well have some designated architectural scratchpad
registers in this case.

Say:
DMULU R4, R5 //MACH:MACL = R4*R5
STS MACL, R2 // R2 = MACL

But, yeah, I know of an ISA that did the multiplier this way (cough, SH).

Though, to be fair, it is less bad than doing the multiplier via MMIO
(cough, MSP430).

In my case, the 32-bit widening multiply ops generate a 64-bit result:
MULU.L R4, R5, R2 //Zero-Extended
DMULU.L R4, R5, R2 //Widening

The 64-bit ops are only available in low/high variants, and they are
slow. I had considered a widening variant (with a 128-bit result), but
given these were only really added for the sake of the RISC-V 'M'
extension, and 'M' doesn't have widening ops, I didn't bother in this
case.

Though a widening version would at least (in theory) make it more
useful for implementing 128-bit multiply; never mind that it would still
currently be faster to build 128-bit multiply using the 32-bit multiply
ops (but this part could be streamlined slightly if I added helper ops
to do the high/low and high-high multiplies without needing a bunch of
shift instructions...).
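
The composition in question (a wide product built from 32-bit widening
multiplies) looks like this in C, shown for 64x64 -> 128 with four
partial products:

```c
#include <stdint.h>

/* Building a 64x64 -> 128-bit multiply from 32-bit widening
   multiplies: four partial products (low*low, low*high, high*low,
   high*high) summed with the carries folded through the middle. */
static void mul64x64_128(uint64_t a, uint64_t b,
                         uint64_t *lo, uint64_t *hi)
{
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;

    uint64_t p00 = a0 * b0, p01 = a0 * b1;   /* 32x32 -> 64 each */
    uint64_t p10 = a1 * b0, p11 = a1 * b1;

    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;
    *lo = (mid << 32) | (uint32_t)p00;
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}
```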

Well, and the (slow) 64-bit multiplier was still left out to try to
shoe-horn a 3-wide version of the BJX2 core into an XC7S50 (along with a
bunch of other features).

But, I still need to find a way to shave off some more LUTs, as I want
to be able to have a CPI-based camera-interface module, and I don't
currently have the LUT budget for it.

I decided to leave out the specifics of the CPI camera interface and how
the module would work (and my annoyances trying to fit all this into the
XC7S50 on the Arty S7...).

But, also annoyingly, I can't use my Nexys A7 board either, as it
doesn't have enough unassigned IO (all those
LEDs/switches/7-segment-display/VGA-port/... eat up much of the FPGA's
total IO pins). Its "sibling" (the Arty A7) mostly just leaves a lot
more of this to PMODs and similar.

>> - anton
>> --
>> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
>> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Re: Faster Sort Algorithms

<e9d29bd6-fcac-4928-9296-57e262f0d64cn@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=32771&group=comp.arch#32771

Newsgroups: comp.arch
Date: Sun, 11 Jun 2023 13:48:48 -0700 (PDT)
In-Reply-To: <fce79f80-b4c9-434c-8375-2efead1a637dn@googlegroups.com>
References: <4abb287e-88eb-485a-916e-50a0a3b0d68cn@googlegroups.com>
<FVlgM.4315$fZx2.2545@fx14.iad> <rPIgM.11831$fZx2.2801@fx14.iad>
<90ea8cc3-2ae8-4401-97a1-f9285e7588c5n@googlegroups.com> <CckhM.10390$NuA9.3884@fx03.iad>
<2023Jun11.163044@mips.complang.tuwien.ac.at> <fce79f80-b4c9-434c-8375-2efead1a637dn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <e9d29bd6-fcac-4928-9296-57e262f0d64cn@googlegroups.com>
Subject: Re: Faster Sort Algorithms
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Sun, 11 Jun 2023 20:48:49 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: MitchAlsup - Sun, 11 Jun 2023 20:48 UTC

On Sunday, June 11, 2023 at 11:59:33 AM UTC-5, robf...@gmail.com wrote:
> On Sunday, June 11, 2023 at 10:57:39 AM UTC-4, Anton Ertl wrote:

> > >Variable length instructions opens the ISA to much useful functionality
> > >like full size immediate operands but requires a more complicated fetch
> > >unit than fixed size. This cost is generally accepted today.
> > Counterexample: ARM A64 (and ARM has ample experience with
> > variable-size instructions).
<
> I just decided not to long ago to go with a fixed sized instruction because
> variable length calculations became on the critical timing path for an FPGA
> design. Postfixes can effectively extend the instruction with constants.
<
I found an encoding scheme whereby determining the length of an
instruction takes 4 gates of delay, covering 1-5 word instructions. It is
entirely parallel and can be treeified at 1 gate per doubling of the
number of instructions decoded. Thus, in 1 stage of logic one can
parse 16 instructions out of 32 words.
<
Now, of course, all this takes place in unary form, which is the natural
form of hardware parsing of bit strings into useful containers. Why unary ?
Unary becomes the inputs to multiplexers--obviating the need for binary
to unary decoding.
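
A software analogue of the parsing problem (sequential here; the point
of the unary/tree scheme is that hardware resolves this same chain in a
logarithmic number of gate levels):

```c
/* Toy model of instruction parsing: len_at[i] is the length in words
   of the instruction that would start at word i, derivable from that
   word alone (lengths assumed 1-5, never 0). mark_starts() finds
   which words begin instructions; this serial chain is what the
   parallel unary scheme collapses into a few gate delays. */
static void mark_starts(const int len_at[], int n, int is_start[])
{
    for (int i = 0; i < n; i++) is_start[i] = 0;
    for (int pos = 0; pos < n; pos += len_at[pos])
        is_start[pos] = 1;    /* skip over each instruction's words */
}
```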
<
Perhaps your variable length decoding mechanism was not very refined.
<
> > >As to the cost of 2 dest registers, at the low end for an in-order,
> > >single issue pipeline with 1 register write port, taking an extra clock
> > >to write a second result does not add any complexity.
<
For My 66000 CARRY, I get rid of the complexity of 2 result registers
by having a recirculating bus, aptly called the Carry-Bus, that recirculates
results to operands until the end of the CARRY shadow. And if the last
calculation does not produce a carry-out, the result does not get written.
Much of the time the carry result is never "written" into the register file.
<
> > At least not much.
> >
> > But what's the difference from
> >
> >
> > add s0 = a, b
> > sltu carry0 = s0,a
> >
> > which also takes two cycles? Or do you have a three-input
> > add-with-carry? That would certainly solve the major problem that
> > MIPS, Alpha, and RISC-V have for this stuff, which is dealing with the
> > carry-in.
> > >For a multi-lane OoO uArch my concern was for Rename to NOT require
> > >two register allocation units for each lane, adding greatly to the cost
> > >and complexity, and which would mostly sit unused burning power.
> > >
> > >My solution is for Decode to spit out 2-dest instructions as two 1-dest uOps
> > >split across two consecutive lanes. Rename allocates the 1-dest registers
> > >to uOps as normal, then fuses the two pieces and outputs a single uOp to
> > >one output lane to Dispatch.
> > I wonder if that's what ARM are doing for their multi-result
> > instructions (load-pair with pre/post-indexed addressing mode consumes
> > three write ports).
> >
> An idea I have been toying with is for some operations caching the operands
> and results. So, if a divide is done the remainder is available immediately
> without having to do a full divide. Works for multiply high too. I think it should
> work well with reciprocal calculations too, where the same reciprocal is used
> in a loop for instance.
<
My VEC instruction informs the hardware which registers hold live results
when the LOOP terminates. This means a majority of results produced inside
a loop never actually need to be written into the register file; the result
just has to be passed on as an operand. This saves power.
<
>
> To deal with extended precision ADD in the past I have used a three operand
> ADD, combined with a carry generate instruction. The carry generate
> instruction takes the same operands as the add instruction but returns the
> carry out instead of the sum. It can then be added in the next instruction. It
> is a bit ugly but if one really wants to do multi-word arithmetic it is possible.
> I think the ADD and the carry-generate can be done in the same clock cycle
> in a superscalar CPU.
<
My 66000 - 256-bit integer add::
<
CARRY R16,{{I}{IO}{IO}{O}}
ADD R12,R4,R8 // carry Out only
ADD R13,R5,R9 // Carry In and Out
ADD R14,R6,R10 // Carry In and Out
ADD R15,R7,R11 // Carry In only
<
Here, in typical use, the carry register does not have to be read from the
register file nor does it have to get written at the end, either.
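
A C rendering of what the four-instruction CARRY shadow above computes
(the {I}{IO}{IO}{O} specifier marks which adds consume and/or produce
the recirculated carry):

```c
#include <stdint.h>

/* What the CARRY-shadowed 256-bit add above computes: four 64-bit
   limb adds, limb 0 low-order, with the carry passed limb to limb.
   In My 66000 that carry rides the Carry-Bus; here it is a local. */
static void add256(const uint64_t a[4], const uint64_t b[4], uint64_t r[4])
{
    unsigned carry = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t t = a[i] + b[i];
        uint64_t s = t + carry;
        carry = (t < a[i]) | (s < t);  /* carry-out to the next limb */
        r[i] = s;
    }
}
```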
<
> > - anton
> > --
> > 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> > Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Re: Faster Sort Algorithms

<eab37ebc-3115-4c16-8583-b87483ef7ff7n@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=32772&group=comp.arch#32772

Newsgroups: comp.arch
Date: Sun, 11 Jun 2023 13:58:13 -0700 (PDT)
In-Reply-To: <u658d3$2pstp$1@dont-email.me>
References: <4abb287e-88eb-485a-916e-50a0a3b0d68cn@googlegroups.com>
<FVlgM.4315$fZx2.2545@fx14.iad> <rPIgM.11831$fZx2.2801@fx14.iad>
<90ea8cc3-2ae8-4401-97a1-f9285e7588c5n@googlegroups.com> <CckhM.10390$NuA9.3884@fx03.iad>
<2023Jun11.163044@mips.complang.tuwien.ac.at> <fce79f80-b4c9-434c-8375-2efead1a637dn@googlegroups.com>
<u658d3$2pstp$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <eab37ebc-3115-4c16-8583-b87483ef7ff7n@googlegroups.com>
Subject: Re: Faster Sort Algorithms
From: MitchAlsup@aol.com (MitchAlsup)
Injection-Date: Sun, 11 Jun 2023 20:58:14 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: MitchAlsup - Sun, 11 Jun 2023 20:58 UTC

On Sunday, June 11, 2023 at 2:47:51 PM UTC-5, BGB wrote:
> On 6/11/2023 11:59 AM, robf...@gmail.com wrote:
> > On Sunday, June 11, 2023 at 10:57:39 AM UTC-4, Anton Ertl wrote:
> >> EricP <ThatWould...@thevillage.com> writes:
> >>> I was driven to instructions with 2-dest results by my desire to
> >>> (a) not have condition codes and (b) wanting to handle add with carry
> >>> in a straight forward way.
> >> I have expanded my answer to that problem into a paper:
> >> <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>.
> >>
> >> Of course, the question is whether you consider this answer
> >> straightforward.
> >>> Variable length instructions opens the ISA to much useful functionality
> >>> like full size immediate operands but requires a more complicated fetch
> >>> unit than fixed size. This cost is generally accepted today.
> >> Counterexample: ARM A64 (and ARM has ample experience with
> >> variable-size instructions).
> >
> > I just decided not to long ago to go with a fixed sized instruction because
> > variable length calculations became on the critical timing path for an FPGA
> > design. Postfixes can effectively extend the instruction with constants..
> >
> I went with prefixes.
>
> If I designed another (new) variable-length ISA, it is probable I would
> still go with prefixes.
>
> Though, I would likely start (this time) with the assumption of the
> 32-bit instructions as baseline, and the 16-bit as a shorthand. My
> encoding is currently a little messy in that (in its original form) the
> 32-bit instructions were themselves a prefix encoding (of two 16-bit parts).
>
>
>
> Granted, prefixes have the downside of turning the immediate values into
> confetti... But, then again, there is RISC-V with fixed-length
> instructions and the immediate fields are still confetti...
>
There is no particular reason to use the RISC-V ISA as anything but a
stick in the sand.
>
> Doesn't really effect the CPU too much, but admittedly is kind of a pain
> for the assembler and linker (linker needing a larger number of reloc
> types). Though, one can reduce the annoyance, say, by making all PC-rel
> offsets either Byte or Word (depending on the op), and all GBR-Rel ops
> Byte based, ... At least cutting down on the needed number of reloc types..
>
I did not end up needing any of that.
>
> Well, except when the encoding rules change from one instruction to
> another, which is admittedly thus far why I haven't bothered with
> RISC-V's 'C' extension.
>
> Like, RVC is sort of like someone looked at Thumb and was then like
> "Hold my beer!".
>
>
> One other option would be, say, to have variable length ops with a
> size/form bucket, say:
> 000: 32-bit op, 3R Forms
> 001: 32-bit op, 3RI Forms
> 010: 64-bit op, 3R/4R/etc forms
> 011: 64-bit op, 3RI Forms, Imm32
> 100: 96-bit op, -
> 101: 96-bit op, 3RI Forms, Imm64
<
For my part, every instruction gets 1 immediate and/or 1 displacement.
Calculation instructions get 1 immediate
Loads get 1 displacement
Stores get 1 displacement and 1 immediate
<
Also note: everything needed to determine the length of an instruction
is available in the first 32-bits of the instruction.
<
My rationale was that the compiler would be "pretty good" at simplifying
the instructions, such that when it ran into a DAG where a node needed
2 constants, the vast majority of the time it could perform the calculation
at compile time. Evidence has borne this out.
>
>
> Where, say:
> 000z-zzzz-zzdd-dddd ssss-sstt-tttt-zzzz //3R
> 001z-zzzz-zzdd-dddd ssss-ssii-iiii-iiii //3RI (Imm10)
>
> 011z-zzzz-zzdd-dddd ssss-sszz-zzzz-zzzz //3RI (Imm32)
> iiii-iiii-iiii-iiii iiii-iiii-iiii-iiii //Imm
>
> 101z-zzzz-zzdd-dddd ssss-sszz-zzzz-zzzz //3RI (Imm64)
> iiii-iiii-iiii-iiii iiii-iiii-iiii-iiii //Imm
> iiii-iiii-iiii-iiii iiii-iiii-iiii-iiii //Imm
>
> Where, say, the Imm32 and Imm64 encodings operate in (mostly) the same
> space as the 3R encodings, but will normally suppress the Rt field in
> favor of an immediate. The Imm10 encodings would represent a slightly
> different encoding space (likely exclusively Load/Store ops and ALU
> Immediate instructions).
>
>
>
> This would be nicer for an assembler and linker, but would implicitly
> mean that the instruction stream would no longer be "self-aligning".
>
> Say, with a self-aligning stream, if one starts decoding/disassembling
> instructions from an arbitrary location in the "text" sections, then the
> decoding will realign with the instruction stream within a few
> instruction words (this property is useful for tools like debuggers).
> >>> As to the cost of 2 dest registers, at the low end for an in-order,
> >>> single issue pipeline with 1 register write port, taking an extra clock
> >>> to write a second result does not add any complexity.
> >> At least not much.
> >>
> >> But what's the difference from
> >>
> >>
> >> add s0 = a, b
> >> sltu carry0 = s0,a
> >>
> >> which also takes two cycles? Or do you have a three-input
> >> add-with-carry? That would certainly solve the major problem that
> >> MIPS, Alpha, and RISC-V have for this stuff, which is dealing with the
> >> carry-in.
> >>> For a multi-lane OoO uArch my concern was for Rename to NOT require
> >>> two register allocation units for each lane, adding greatly to the cost
> >>> and complexity, and which would mostly sit unused burning power.
> >>>
> >>> My solution is for Decode to spit out 2-dest instructions as two 1-dest uOps
> >>> split across two consecutive lanes. Rename allocates the 1-dest registers
> >>> to uOps as normal, then fuses the two pieces and outputs a single uOp to
> >>> one output lane to Dispatch.
> >> I wonder if that's what ARM are doing for their multi-result
> >> instructions (load-pair with pre/post-indexed addressing mode consumes
> >> three write ports).
> >>
> > An idea I have been toying with is for some operations caching the operands
> > and results. So, if a divide is done the remainder is available immediately
> > without having to do a full divide. Works for multiply high too. I think it should
> > work well with reciprocal calculations too, where the same reciprocal is used
> > in a loop for instance.
> >
> > To deal with extended precision ADD in the past I have used a three operand
> > ADD, combined with a carry generate instruction. The carry generate
> > instruction takes the same operands as the add instruction but returns the
> > carry out instead of the sum. It can then be added in the next instruction. It
> > is a bit ugly but if one really wants to do multi-word arithmetic it is possible.
> > I think the ADD and the carry-generate can be done in the same clock cycle
> > in a superscalar CPU.
> >
> Possible, but likely to get rather annoying, rather fast.
>
> One almost may as well have some designated architectural scratchpad
> registers in this case.
>
> Say:
> DMULU R4, R5 //MACH:MACL = R4*R5
> STS MACL, R2 // R2 = MACL
>
> But, yeah, I know of an ISA that did the multiplier this way (cough, SH).
>
> Though, to be fair, it is less bad than doing the multiplier via MMIO
> (cough, MSP430).
>
Once you bite off on the fact that you need to support (double-precision) FMACs,
all of the calculation resources are present to do integer multiply
fast... and integer multiply is used often enough to justify making it smell
like an integer ADD in format and operand routing.

Re: Faster Sort Algorithms

<tothM.6673$mshf.4974@fx46.iad>


https://news.novabbs.org/devel/article-flat.php?id=32776&group=comp.arch#32776

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx46.iad.POSTED!not-for-mail
From: ThatWouldBeTelling@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Faster Sort Algorithms
References: <4abb287e-88eb-485a-916e-50a0a3b0d68cn@googlegroups.com> <FVlgM.4315$fZx2.2545@fx14.iad> <rPIgM.11831$fZx2.2801@fx14.iad> <90ea8cc3-2ae8-4401-97a1-f9285e7588c5n@googlegroups.com> <CckhM.10390$NuA9.3884@fx03.iad> <2023Jun11.163044@mips.complang.tuwien.ac.at> <fce79f80-b4c9-434c-8375-2efead1a637dn@googlegroups.com> <e9d29bd6-fcac-4928-9296-57e262f0d64cn@googlegroups.com>
In-Reply-To: <e9d29bd6-fcac-4928-9296-57e262f0d64cn@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 38
Message-ID: <tothM.6673$mshf.4974@fx46.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Mon, 12 Jun 2023 00:11:05 UTC
Date: Sun, 11 Jun 2023 20:10:33 -0400
X-Received-Bytes: 2650
 by: EricP - Mon, 12 Jun 2023 00:10 UTC

MitchAlsup wrote:
> On Sunday, June 11, 2023 at 11:59:33 AM UTC-5, robf...@gmail.com wrote:
> <
>> To deal with extended precision ADD in the past I have used a three operand
>> ADD, combined with a carry generate instruction. The carry generate
>> instruction takes the same operands as the add instruction but returns the
>> carry out instead of the sum. It can then be added in the next instruction. It
>> is a bit ugly but if one really wants to do multi-word arithmetic it is possible.
>> I think the ADD and the carry-generate can be done in the same clock cycle
>> in a superscalar CPU.
> <
> My 66000 - 256-bit integer add::
> <
> CARRY R16,{{I}{IO}{IO}{O}}
> ADD R12,R4,R8 // carry Out only
> ADD R13,R5,R9 // Carry In and Out
> ADD R14,R6,R10 // Carry In and Out
> ADD R15,R7,R11 // Carry In only
> <
> Here, in typical use, the carry register does not have to be read from the
> register file nor does it have to get written at the end, either.

The various integer ADD formats I came up with
("single" and "wide" meaning one or two integer register specifiers):

#1 ADD single = single + single (3 registers)
ADDTS trap signed overflow
ADDTU trap unsigned carry
#2 ADDW2 wide = single + single (4 registers)
#3 ADDW3 wide = single + single + single (5 registers)
#4 ADDWW wide = wide + single (5 registers)

#2 and #3 are for propagating a carry horizontally along a chain.
#4 is for the equivalent of a carry-save add on a column.
The My 66000 CARRY prefix can do an arbitrarily long horizontal
carry propagate but didn't seem to have a way to specify #4.
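A hedged C rendering of formats #3 and #4 (illustrative names only; `wide_t` stands in for a pair of integer registers):

```c
#include <stdint.h>

/* "wide" = a 128-bit value held in two 64-bit registers */
typedef struct { uint64_t lo, hi; } wide_t;

/* #3 ADDW3: wide = single + single + single -- the sum limb and its
   carry come out together, so hi can feed the next limb's add
   (horizontal carry propagation). */
static wide_t addw3(uint64_t a, uint64_t b, uint64_t c) {
    unsigned __int128 t = (unsigned __int128)a + b + c;
    return (wide_t){ (uint64_t)t, (uint64_t)(t >> 64) };
}

/* #4 ADDWW: wide = wide + single -- the carry-save-style column add;
   any carry out beyond 128 bits is dropped in this sketch. */
static wide_t addww(wide_t w, uint64_t s) {
    unsigned __int128 t = (((unsigned __int128)w.hi << 64) | w.lo) + s;
    return (wide_t){ (uint64_t)t, (uint64_t)(t >> 64) };
}
```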
