Rocksolid Light

devel / comp.arch / Re: Faster Sort Algorithms

Subject -- Author
* Faster Sort Algorithms -- MitchAlsup
+- Re: Faster Sort Algorithms -- BGB
+* Re: Faster Sort Algorithms -- Terje Mathisen
|+- Re: Faster Sort Algorithms -- Thomas Koenig
|`* Re: Faster Sort Algorithms -- Stephen Fuld
| `* Re: Faster Sort Algorithms -- Terje Mathisen
|  `* Re: Faster Sort Algorithms -- Terje Mathisen
|   `- Re: Faster Sort Algorithms -- Michael S
+* Re: Faster Sort Algorithms -- EricP
|`* Re: Faster Sort Algorithms -- EricP
| `* Re: Faster Sort Algorithms -- MitchAlsup
|  +- Re: Faster Sort Algorithms -- Terje Mathisen
|  `* Re: Faster Sort Algorithms -- EricP
|   +* Re: Faster Sort Algorithms -- Anton Ertl
|   |+* Re: Faster Sort Algorithms -- robf...@gmail.com
|   ||+* Re: Faster Sort Algorithms -- BGB
|   |||`* Re: Faster Sort Algorithms -- MitchAlsup
|   ||| `- Re: Faster Sort Algorithms -- BGB
|   ||+* Re: Faster Sort Algorithms -- MitchAlsup
|   |||+- Re: Faster Sort Algorithms -- EricP
|   |||`* Re: Faster Sort Algorithms -- robf...@gmail.com
|   ||| `* Re: Faster Sort Algorithms -- BGB
|   |||  `* Re: Faster Sort Algorithms -- robf...@gmail.com
|   |||   +- Re: Faster Sort Algorithms -- robf...@gmail.com
|   |||   `- Re: Faster Sort Algorithms -- BGB
|   ||`* Re: Faster Sort Algorithms -- EricP
|   || +- Re: Faster Sort Algorithms -- MitchAlsup
|   || `* Re: Faster Sort Algorithms -- robf...@gmail.com
|   ||  `* Re: Faster Sort Algorithms -- Thomas Koenig
|   ||   `- Re: Faster Sort Algorithms -- robf...@gmail.com
|   |`* Re: Faster Sort Algorithms -- EricP
|   | +- Re: Faster Sort Algorithms -- Scott Lurndal
|   | +- Re: Faster Sort Algorithms -- MitchAlsup
|   | +* Re: Faster Sort Algorithms -- Thomas Koenig
|   | |+* Re: Faster Sort Algorithms -- EricP
|   | ||+* Re: Faster Sort Algorithms -- MitchAlsup
|   | |||`* Re: Faster Sort Algorithms -- Thomas Koenig
|   | ||| `* Re: Faster Sort Algorithms -- MitchAlsup
|   | |||  +- Re: Faster Sort Algorithms -- Terje Mathisen
|   | |||  `* Re: Faster Sort Algorithms -- John Levine
|   | |||   `* Re: Faster Sort Algorithms -- Scott Lurndal
|   | |||    `* Re: Faster Sort Algorithms -- BGB
|   | |||     `* Re: Faster Sort Algorithms -- John Levine
|   | |||      +* Re: Faster Sort Algorithms -- BGB
|   | |||      |`* Re: Faster Sort Algorithms -- MitchAlsup
|   | |||      | +- Re: Faster Sort Algorithms -- BGB
|   | |||      | +* Re: Faster Sort Algorithms -- John Levine
|   | |||      | |`- Re: Faster Sort Algorithms -- BGB
|   | |||      | `* Re: Faster Sort Algorithms -- Terje Mathisen
|   | |||      |  +* Re: Faster Sort Algorithms -- BGB
|   | |||      |  |`* Re: Faster Sort Algorithms -- Terje Mathisen
|   | |||      |  | `- Re: Faster Sort Algorithms -- BGB
|   | |||      |  `* Re: off topic - burying electrical lines -- George Neuner
|   | |||      |   +- Re: off topic - burying electrical lines -- BGB
|   | |||      |   `- Re: off topic - burying electrical lines -- Scott Lurndal
|   | |||      `- Re: Faster Sort Algorithms -- Terje Mathisen
|   | ||`- Re: Faster Sort Algorithms -- Terje Mathisen
|   | |+* Re: Faster Sort Algorithms -- MitchAlsup
|   | ||+* Re: Faster Sort Algorithms -- Thomas Koenig
|   | |||+* Re: Faster Sort Algorithms -- Michael S
|   | ||||`* Re: Faster Sort Algorithms -- robf...@gmail.com
|   | |||| `- Re: Faster Sort Algorithms -- Michael S
|   | |||`* Re: Faster Sort Algorithms -- MitchAlsup
|   | ||| `* Re: Faster Sort Algorithms -- Thomas Koenig
|   | |||  `- Re: Faster Sort Algorithms -- MitchAlsup
|   | ||+* Re: Faster Sort Algorithms -- Terje Mathisen
|   | |||`- Re: Faster Sort Algorithms -- Terje Mathisen
|   | ||`- Re: Faster Sort Algorithms -- Scott Lurndal
|   | |`- Re: Faster Sort Algorithms -- Terje Mathisen
|   | `- Re: Faster Sort Algorithms -- Anton Ertl
|   `- Re: Faster Sort Algorithms -- Terje Mathisen
+- Re: Faster Sort Algorithms -- Marcus
`* Re: Faster Sort Algorithms -- Michael S
 `* Re: Faster Sort Algorithms -- MitchAlsup
  +- Re: Faster Sort Algorithms -- Bill Findlay
  `* Re: Faster Sort Algorithms -- Scott Lurndal
   `- Re: Faster Sort Algorithms -- MitchAlsup

Re: Faster Sort Algorithms

<u65o30$2reef$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=32777&group=comp.arch#32777

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Faster Sort Algorithms
Date: Sun, 11 Jun 2023 19:15:25 -0500
Organization: A noiseless patient Spider
Lines: 241
Message-ID: <u65o30$2reef$1@dont-email.me>
References: <4abb287e-88eb-485a-916e-50a0a3b0d68cn@googlegroups.com>
<FVlgM.4315$fZx2.2545@fx14.iad> <rPIgM.11831$fZx2.2801@fx14.iad>
<90ea8cc3-2ae8-4401-97a1-f9285e7588c5n@googlegroups.com>
<CckhM.10390$NuA9.3884@fx03.iad>
<2023Jun11.163044@mips.complang.tuwien.ac.at>
<fce79f80-b4c9-434c-8375-2efead1a637dn@googlegroups.com>
<u658d3$2pstp$1@dont-email.me>
<eab37ebc-3115-4c16-8583-b87483ef7ff7n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 12 Jun 2023 00:15:28 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="8b32067dde87924b4d4d03d2b4de6538";
logging-data="2996687"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18aunZN/mNUZs4Kvu2Bof5o"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.11.2
Cancel-Lock: sha1:72FkCurZiieToRM3jO0KC9EWnj0=
In-Reply-To: <eab37ebc-3115-4c16-8583-b87483ef7ff7n@googlegroups.com>
Content-Language: en-US
 by: BGB - Mon, 12 Jun 2023 00:15 UTC

On 6/11/2023 3:58 PM, MitchAlsup wrote:
> On Sunday, June 11, 2023 at 2:47:51 PM UTC-5, BGB wrote:
>> On 6/11/2023 11:59 AM, robf...@gmail.com wrote:
>>> On Sunday, June 11, 2023 at 10:57:39 AM UTC-4, Anton Ertl wrote:
>>>> EricP <ThatWould...@thevillage.com> writes:
>>>>> I was driven to instructions with 2-dest results by my desire to
>>>>> (a) not have condition codes and (b) wanting to handle add with carry
>>>>> in a straightforward way.
>>>> I have expanded my answer to that problem into a paper:
>>>> <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>.
>>>>
>>>> Of course, the question is whether you consider this answer
>>>> straightforward.
>>>>> Variable length instructions open the ISA to much useful functionality
>>>>> like full size immediate operands, but require a more complicated fetch
>>>>> unit than fixed size. This cost is generally accepted today.
>>>> Counterexample: ARM A64 (and ARM has ample experience with
>>>> variable-size instructions).
>>>
>>> I just decided not too long ago to go with a fixed-size instruction because
>>> variable-length calculations ended up on the critical timing path for an FPGA
>>> design. Postfixes can effectively extend the instruction with constants.
>>>
>> I went with prefixes.
>>
>> If I designed another (new) variable-length ISA, it is probable I would
>> still go with prefixes.
>>
>> Though, I would likely start (this time) with the assumption of the
>> 32-bit instructions as baseline, and the 16-bit as a shorthand. My
>> encoding is currently a little messy in that (in its original form) the
>> 32-bit instructions were themselves a prefix encoding (of two 16-bit parts).
>>
>>
>>
>> Granted, prefixes have the downside of turning the immediate values into
>> confetti... But, then again, there is RISC-V with fixed-length
>> instructions and the immediate fields are still confetti...
>>
> There is no particular reason to use RISC-V ISA as anything but a
> stick in the sand......

RISC-V is effectively the "ISA du jour" at this moment in time, whether or
not it is actually the best possible ISA in an engineering sense.

But, yeah, some of us would prefer that, if 20 bits are going to be put
next to each other and used as a 20-bit value, then the bits would also
form a contiguous 20-bit field...

Or, that there is some semblance of "actually has a good reason" for the
encodings of some instructions to differ from other instructions, ...

OTOH:
FEii_jjjj-F2nm_Zekk
Ends up with a 33 bit immediate as:
{ Ei, ii, jjjj, kk }

Which, while not quite as nice as a contiguous value, is still better
IMO than a bunch of bits with no particular order relative to each other.

Well, there is:
F8Zn_iiii
Which does have a contiguous immediate at the expense of being
inconsistent with the encodings for the rest of the ISA, but then again:
F8ni_Zeii
Is a bit more chewed up, so it is a tradeoff.
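
For a concrete picture of "confetti" immediates, here is roughly what
reassembling RISC-V's J-type (JAL) immediate looks like in C. The bit
layout is the one in the RISC-V spec; the function itself is just an
illustrative sketch:

#include <stdint.h>

/* Reassemble the RISC-V J-type (JAL) immediate, whose bits sit in the
   instruction word as inst[31]=imm[20], inst[30:21]=imm[10:1],
   inst[20]=imm[11], inst[19:12]=imm[19:12]; imm[0] is always 0. */
static int32_t rv_jal_imm(uint32_t inst)
{
    uint32_t imm = (((inst >> 31) & 0x1)   << 20)
                 | (((inst >> 21) & 0x3FF) <<  1)
                 | (((inst >> 20) & 0x1)   << 11)
                 | (((inst >> 12) & 0xFF)  << 12);
    return ((int32_t)(imm << 11)) >> 11;  /* sign-extend from bit 20 */
}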

>>
>> Doesn't really affect the CPU too much, but admittedly is kind of a pain
>> for the assembler and linker (linker needing a larger number of reloc
>> types). Though, one can reduce the annoyance, say, by making all PC-rel
>> offsets either Byte or Word (depending on the op), and all GBR-Rel ops
>> Byte based, ... At least cutting down on the needed number of reloc types.
>>
> I did not end up having to need any of that.

In my case, Load/Store displacements were normally element-size scaled,
but for the PC-Rel and GBR-Rel cases, I ended up special casing them to
always use a 1-byte element scale.

This also sorta works given that PC and GBR aren't normally accessible from
GPR space, though it comes at the expense of two other registers not being
allowed as base registers for memory Load/Store ops.

The alternative would have been, instead, to have fewer GPRs and cut off
more of the register space for SPRs.
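
As a rough sketch of that special-casing (the register IDs here are
invented for illustration, not the actual encodings):

#include <stdint.h>

/* Hypothetical base-register IDs; the real encodings differ. */
enum { REG_GBR = 14, REG_PC = 15 };

/* Displacements are element-size scaled, except that PC-rel and
   GBR-rel forms always use a 1-byte scale. */
static uint64_t effective_addr(uint64_t base_val, int base_reg,
                               int64_t disp, unsigned elem_size)
{
    unsigned scale = (base_reg == REG_PC || base_reg == REG_GBR)
                   ? 1 : elem_size;
    return base_val + (uint64_t)disp * scale;
}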

>>
>> Well, except when the encoding rules change from one instruction to
>> another, which is admittedly thus far why I haven't bothered with
>> RISC-V's 'C' extension.
>>
>> Like, RVC is sort of like someone looked at Thumb and was then like
>> "Hold my beer!".
>>
>>
>> One other option would be, say, to have variable length ops with a
>> size/form bucket, say:
>> 000: 32-bit op, 3R Forms
>> 001: 32-bit op, 3RI Forms
>> 010: 64-bit op, 3R/4R/etc forms
>> 011: 64-bit op, 3RI Forms, Imm32
>> 100: 96-bit op, -
>> 101: 96-bit op, 3RI Forms, Imm64
> <
> For my part, every instruction gets 1 immediate and/or 1 displacement.
> Calculation instructions get 1 immediate
> Loads get 1 displacement
> Stores get 1 displacement and 1 immediate
> <
> Also note: everything needed to determine the length of an instruction
> is available in the first 32-bits of the instruction.
> <
> My rationale was that the compiler would be "pretty good" at simplifying
> the instructions such that when it ran into a DAG where a node needed
> 2 constants, a vast majority of the time it could perform the calculation
> at compile time. Evidence has borne this out.

OK.

As noted, I don't really have any 2-immediate cases.

I didn't really see many cases where dealing with 2 immediates seemed
"worth it".

The rare partial exceptions being, say:
Store a small constant to memory;
Shift a small value by a constant.
Say: Imm4<<Imm6

But, these cases aren't quite "common enough" to justify the cost.

>>
>>
>> Where, say:
>> 000z-zzzz-zzdd-dddd ssss-sstt-tttt-zzzz //3R
>> 001z-zzzz-zzdd-dddd ssss-ssii-iiii-iiii //3RI (Imm10)
>>
>> 011z-zzzz-zzdd-dddd ssss-sszz-zzzz-zzzz //3RI (Imm32)
>> iiii-iiii-iiii-iiii iiii-iiii-iiii-iiii //Imm
>>
>> 101z-zzzz-zzdd-dddd ssss-sszz-zzzz-zzzz //3RI (Imm64)
>> iiii-iiii-iiii-iiii iiii-iiii-iiii-iiii //Imm
>> iiii-iiii-iiii-iiii iiii-iiii-iiii-iiii //Imm
>>
>> Where, say, the Imm32 and Imm64 encodings operate in (mostly) the same
>> space as the 3R encodings, but will normally suppress the Rt field in
>> favor of an immediate. The Imm10 encodings would represent a slightly
>> different encoding space (likely exclusively Load/Store ops and ALU
>> Immediate instructions).
>>
>>
>>
>> This would be nicer for an assembler and linker, but would implicitly
>> mean that the instruction stream would no longer be "self-aligning".
>>
>> Say, with a self-aligning stream, if one starts decoding/disassembling
>> instructions from an arbitrary location in the "text" sections, then the
>> decoding will realign with the instruction stream within a few
>> instruction words (this property is useful for tools like debuggers).
>>>>> As to the cost of 2 dest registers, at the low end for an in-order,
>>>>> single issue pipeline with 1 register write port, taking an extra clock
>>>>> to write a second result does not add any complexity.
>>>> At least not much.
>>>>
>>>> But what's the difference from
>>>>
>>>>
>>>> add s0 = a, b
>>>> sltu carry0 = s0,a
>>>>
>>>> which also takes two cycles? Or do you have a three-input
>>>> add-with-carry? That would certainly solve the major problem that
>>>> MIPS, Alpha, and RISC-V have for this stuff, which is dealing with the
>>>> carry-in.
>>>>> For a multi-lane OoO uArch my concern was for Rename to NOT require
>>>>> two register allocation units for each lane, adding greatly to the cost
>>>>> and complexity, and which would mostly sit unused burning power.
>>>>>
>>>>> My solution is for Decode to spit out 2-dest instructions as two 1-dest uOps
>>>>> split across two consecutive lanes. Rename allocates the 1-dest registers
>>>>> to uOps as normal, then fuses the two pieces and outputs a single uOp to
>>>>> one output lane to Dispatch.
>>>> I wonder if that's what ARM are doing for their multi-result
>>>> instructions (load-pair with pre/post-indexed addressing mode consumes
>>>> three write ports).
>>>>
>>> An idea I have been toying with is for some operations caching the operands
>>> and results. So, if a divide is done the remainder is available immediately
>>> without having to do a full divide. Works for multiply high too. I think it should
>>> work well with reciprocal calculations too, where the same reciprocal is used
>>> in a loop for instance.
>>>
>>> To deal with extended precision ADD in the past I have used a three operand
>>> ADD, combined with a carry generate instruction. The carry generate
>>> instruction takes the same operands as the add instruction but returns the
>>> carry out instead of the sum. It can then be added in the next instruction. It
>>> is a bit ugly but if one really wants to do multi-word arithmetic it is possible.
>>> I think the ADD and the carry-generate can be done in the same clock cycle
>>> in a superscalar CPU.
>>>
>> Possible, but likely to get rather annoying, rather fast.
>>
>> One almost may as well have some designated architectural scratchpad
>> registers in this case.
>>
>> Say:
>> DMULU R4, R5 //MACH:MACL = R4*R5
>> STS MACL, R2 // R2 = MACL
>>
>> But, yeah, I know of an ISA that did the multiplier this way (cough, SH).
>>
>> Though, to be fair, it is less bad than doing the multiplier via MMIO
>> (cough, MSP430).
>>
> Once you bite off on the fact you need to support (double) FMACs
> all of the calculation resources are present to do integer multiply
> fast........and it is used often enough to justify smelling like an integer
> ADD in format and operand routing.
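
For reference, the add/sltu idiom quoted above corresponds to a standard
portable-C pattern for multi-word addition; a minimal sketch, with widths
and names chosen purely for illustration:

#include <stdint.h>

/* 256-bit add from 64-bit limbs. After an unsigned add, the carry-out
   is exactly (sum < addend), which is what the sltu computes above. */
void add256(uint64_t r[4], const uint64_t a[4], const uint64_t b[4])
{
    uint64_t carry = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t s  = a[i] + b[i];
        uint64_t c1 = (s < a[i]);     /* carry out of a + b           */
        r[i] = s + carry;
        uint64_t c2 = (r[i] < s);     /* carry out of adding carry-in */
        carry = c1 | c2;              /* at most one of these is set  */
    }
}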


Re: Faster Sort Algorithms

<d0abe42e-a038-4cb4-bb86-f4fe98a4cecen@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=32778&group=comp.arch#32778

X-Received: by 2002:a05:622a:134a:b0:3f6:c2a9:5fb1 with SMTP id w10-20020a05622a134a00b003f6c2a95fb1mr2600619qtk.2.1686529262105;
Sun, 11 Jun 2023 17:21:02 -0700 (PDT)
X-Received: by 2002:aca:c043:0:b0:39c:a4da:5de1 with SMTP id
q64-20020acac043000000b0039ca4da5de1mr676235oif.11.1686529261810; Sun, 11 Jun
2023 17:21:01 -0700 (PDT)
Path: i2pn2.org!i2pn.org!news.1d4.us!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 11 Jun 2023 17:21:01 -0700 (PDT)
In-Reply-To: <e9d29bd6-fcac-4928-9296-57e262f0d64cn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2607:fea8:1dde:6a00:a994:82f0:4c36:de2c;
posting-account=QId4bgoAAABV4s50talpu-qMcPp519Eb
NNTP-Posting-Host: 2607:fea8:1dde:6a00:a994:82f0:4c36:de2c
References: <4abb287e-88eb-485a-916e-50a0a3b0d68cn@googlegroups.com>
<FVlgM.4315$fZx2.2545@fx14.iad> <rPIgM.11831$fZx2.2801@fx14.iad>
<90ea8cc3-2ae8-4401-97a1-f9285e7588c5n@googlegroups.com> <CckhM.10390$NuA9.3884@fx03.iad>
<2023Jun11.163044@mips.complang.tuwien.ac.at> <fce79f80-b4c9-434c-8375-2efead1a637dn@googlegroups.com>
<e9d29bd6-fcac-4928-9296-57e262f0d64cn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <d0abe42e-a038-4cb4-bb86-f4fe98a4cecen@googlegroups.com>
Subject: Re: Faster Sort Algorithms
From: robfi680@gmail.com (robf...@gmail.com)
Injection-Date: Mon, 12 Jun 2023 00:21:02 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 9103
 by: robf...@gmail.com - Mon, 12 Jun 2023 00:21 UTC

On Sunday, June 11, 2023 at 4:48:51 PM UTC-4, MitchAlsup wrote:
> On Sunday, June 11, 2023 at 11:59:33 AM UTC-5, robf...@gmail.com wrote:
> > On Sunday, June 11, 2023 at 10:57:39 AM UTC-4, Anton Ertl wrote:
>
> > > >Variable length instructions open the ISA to much useful functionality
> > > >like full size immediate operands, but require a more complicated fetch
> > > >unit than fixed size. This cost is generally accepted today.
> > > Counterexample: ARM A64 (and ARM has ample experience with
> > > variable-size instructions).
> <
> > I just decided not too long ago to go with a fixed-size instruction because
> > variable-length calculations ended up on the critical timing path for an FPGA
> > design. Postfixes can effectively extend the instruction with constants.
> <
> I found an encoding scheme whereby determining the length of an
> instruction is 4 gates of delay covering 1-5 word instructions. It is
> entirely parallel and can be treeified at 1 gate per doubling of the
> number of instructions decoded. Thus, in 1 stage of logic one can
> parse 16-instructions out of 32-words.
> <
> Now, of course, all this takes place in unary form, which is the natural
> form of hardware parsing of bit strings into useful containers. Why unary ?
> Unary becomes the inputs to multiplexers--obviating the need for binary
> to unary decoding.
> <
> Perhaps your variable length decoding mechanism was not very refined.

I think it was probably okay; just a lack of pipelining at the fetch stage slowed
things down. The length was determined by a lookup of the first eight bits of
the instruction. I wanted to compact the length bits down to fractions of a bit,
as I thought it somewhat wasteful to have them for every instruction. The lookup
table was only 1 LUT deep, but a second layer of LUTs was needed for
decoding. Part of the issue was doing too much at the fetch stage without
pipelining it. The instruction alignment, using a 512+ bit wide shift, was also
being done in the same clock cycle. It was fed from the output of an odd/even
mux, a consequence of using 40-bit wide instructions. So, that took a couple
of layers of LUTs. On top of that, LUTs were being read combinationally from
the LUT RAM. And on top of that, NOP instructions were being inserted for cache
misses, and IRQ instructions for interrupts. So, another 3:1 mux in the fetch
stage. It needed more pipelining. I found it would be fast enough, though, by
removing the length lookup. Other pieces of the design then showed up on the
critical path. It is scary how much logic takes place in one cycle. I did not
want to add extra pipeline stages at fetch as that creates more flushing of
instructions during branches.
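
A toy C model of that first-byte length lookup, just to make the structure
concrete; the opcode-to-length mapping here is completely made up:

#include <stdint.h>

/* The first eight bits of an instruction index a 256-entry table
   giving the length in 40-bit parcels (mapping invented). */
static uint8_t len_lut[256];

static void init_len_lut(void)
{
    for (int op = 0; op < 256; op++)
        len_lut[op] = (op < 0xC0) ? 1 : (op < 0xF0) ? 2 : 3;
}

static inline unsigned insn_len_parcels(uint8_t first_byte)
{
    return len_lut[first_byte];
}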

I hit a kind of minimum useful cycle when the result forwarding from the
output of the ALU back to the input appeared on the timing path. Removing
the forwarding muxes would cut the performance in half, so it was better to
go with a longer clock cycle. That gave me some leeway with the rest of the
design.

I had to slow the whole design down though as it got too large for the FPGA.
Reduced the size of the register file by using block RAMs instead of LUTs and
they needed to be clocked at double the CPU rate. Removed an ALU so an
FPU could be supported. A double frequency clock is fed to some parts of the
design like the multiplier / divider, square root.

Saving up for a larger FPGA.

A lot more could be fit into the FPGA at the cost of some performance. The
core is only about 62k LUTs with an FPU. The rest of the system including
MMU takes about 35k LUTs. I have yet to add the neural network accelerator,
NNA, and graphics accelerator.

> <
> > > >As to the cost of 2 dest registers, at the low end for an in-order,
> > > >single issue pipeline with 1 register write port, taking an extra clock
> > > >to write a second result does not add any complexity.
> <
> For My 66000 CARRY, I get rid of the complexity of 2 result registers
> by having a recirculating bus, aptly called the Carry-Bus, that recirculates
> results to operands until the end of the CARRY shadow. And if the last
> calculation does not produce a carry-out, the result does not get written.
> Much of the time the carry result is never "written" into the register file.
> <
> > > At least not much.
> > >
> > > But what's the difference from
> > >
> > >
> > > add s0 = a, b
> > > sltu carry0 = s0,a
> > >
> > > which also takes two cycles? Or do you have a three-input
> > > add-with-carry? That would certainly solve the major problem that
> > > MIPS, Alpha, and RISC-V have for this stuff, which is dealing with the
> > > carry-in.
> > > >For a multi-lane OoO uArch my concern was for Rename to NOT require
> > > >two register allocation units for each lane, adding greatly to the cost
> > > >and complexity, and which would mostly sit unused burning power.
> > > >
> > > >My solution is for Decode to spit out 2-dest instructions as two 1-dest uOps
> > > >split across two consecutive lanes. Rename allocates the 1-dest registers
> > > >to uOps as normal, then fuses the two pieces and outputs a single uOp to
> > > >one output lane to Dispatch.
> > > I wonder if that's what ARM are doing for their multi-result
> > > instructions (load-pair with pre/post-indexed addressing mode consumes
> > > three write ports).
> > >
> > An idea I have been toying with is for some operations caching the operands
> > and results. So, if a divide is done the remainder is available immediately
> > without having to do a full divide. Works for multiply high too. I think it should
> > work well with reciprocal calculations too, where the same reciprocal is used
> > in a loop for instance.
> <
> My VEC instruction informs the hardware which registers hold live results when
> the LOOP terminates. This means a majority of results produced inside a loop
> never need to actually get written into the register file, the result just has to be
> passed on as an operand. This saves power.
> <
> >
> > To deal with extended precision ADD in the past I have used a three operand
> > ADD, combined with a carry generate instruction. The carry generate
> > instruction takes the same operands as the add instruction but returns the
> > carry out instead of the sum. It can then be added in the next instruction. It
> > is a bit ugly but if one really wants to do multi-word arithmetic it is possible.
> > I think the ADD and the carry-generate can be done in the same clock cycle
> > in a superscalar CPU.
> <
> My 66000 - 256-bit integer add::
> <
> CARRY R16,{{I}{IO}{IO}{O}}
> ADD R12,R4,R8 // carry Out only
> ADD R13,R5,R9 // Carry In and Out
> ADD R14,R6,R10 // Carry In and Out
> ADD R15,R7,R11 // Carry In only
> <
> Here, in typical use, the carry register does not have to be read from the
> register file nor does it have to get written at the end, either.

> <
> > > - anton
> > > --
> > > 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> > > Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Re: Faster Sort Algorithms

<u663ff$302pp$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=32779&group=comp.arch#32779

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88192@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Faster Sort Algorithms
Date: Sun, 11 Jun 2023 22:29:48 -0500
Organization: A noiseless patient Spider
Lines: 207
Message-ID: <u663ff$302pp$1@dont-email.me>
References: <4abb287e-88eb-485a-916e-50a0a3b0d68cn@googlegroups.com>
<FVlgM.4315$fZx2.2545@fx14.iad> <rPIgM.11831$fZx2.2801@fx14.iad>
<90ea8cc3-2ae8-4401-97a1-f9285e7588c5n@googlegroups.com>
<CckhM.10390$NuA9.3884@fx03.iad>
<2023Jun11.163044@mips.complang.tuwien.ac.at>
<fce79f80-b4c9-434c-8375-2efead1a637dn@googlegroups.com>
<e9d29bd6-fcac-4928-9296-57e262f0d64cn@googlegroups.com>
<d0abe42e-a038-4cb4-bb86-f4fe98a4cecen@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 12 Jun 2023 03:29:51 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="8b32067dde87924b4d4d03d2b4de6538";
logging-data="3148601"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX187bnQk2PcPrQ/Xvm1F5oiz"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.11.2
Cancel-Lock: sha1:ltgguNU+ORcVKp5XL2QJbhDYJPU=
Content-Language: en-US
In-Reply-To: <d0abe42e-a038-4cb4-bb86-f4fe98a4cecen@googlegroups.com>
 by: BGB - Mon, 12 Jun 2023 03:29 UTC

On 6/11/2023 7:21 PM, robf...@gmail.com wrote:
> On Sunday, June 11, 2023 at 4:48:51 PM UTC-4, MitchAlsup wrote:
>> On Sunday, June 11, 2023 at 11:59:33 AM UTC-5, robf...@gmail.com wrote:
>>> On Sunday, June 11, 2023 at 10:57:39 AM UTC-4, Anton Ertl wrote:
>>
>>>>> Variable length instructions open the ISA to much useful functionality
>>>>> like full size immediate operands, but require a more complicated fetch
>>>>> unit than fixed size. This cost is generally accepted today.
>>>> Counterexample: ARM A64 (and ARM has ample experience with
>>>> variable-size instructions).
>> <
>>> I just decided not too long ago to go with a fixed-size instruction because
>>> variable-length calculations ended up on the critical timing path for an FPGA
>>> design. Postfixes can effectively extend the instruction with constants.
>> <
>> I found an encoding scheme whereby determining the length of an
>> instruction is 4 gates of delay covering 1-5 word instructions. It is
>> entirely parallel and can be treeified at 1 gate per doubling of the
>> number of instructions decoded. Thus, in 1 stage of logic one can
>> parse 16-instructions out of 32-words.
>> <
>> Now, of course, all this takes place in unary form, which is the natural
>> form of hardware parsing of bit strings into useful containers. Why unary ?
>> Unary becomes the inputs to multiplexers--obviating the need for binary
>> to unary decoding.
>> <
>> Perhaps your variable length decoding mechanism was not very refined.
>
> I think it was probably okay, just a lack of pipelining at the fetch stage slowed
> things down. The length was determined by a lookup of the first eight bits of
> the instruction. I wanted to compact the length bits down to fractions of bits,
> I thought it somewhat wasteful to have them for every instruction. The lookup
> table was only 1 LUT deep, but a second layer of LUTs was needed for
> decoding. Part of the issue was doing too much at the fetch stage without
> pipelining it. The instruction alignment using a 512+ wide bit shift was also
> being done in the same clock cycle. It was fed from the output of an odd, even
> mux. A consequence of using 40-bit wide instructions. So, that took a couple
> of layers of LUTs. On top of that LUTs were being read combo style from the
> LUT ram. And on top of that NOP instructions were being inserted for cache
> misses, and IRQ instructions for interrupts. So, another 3:1 mux in the fetch
> stage. It needed more pipelining. I found it would be fast enough though by
> removing the length lookup. Other pieces of the design showed up on the
> critical path. It is scary how much logic takes place in one cycle. I did not
> want to add extra pipeline stages at fetch as that creates more flushing of
> instructions during branches.
>
> I hit a kind of minimum useful cycle when the result forwarding from the
> output of the ALU back to the input appeared on the timing path. Removing
> the forwarding muxes would cut the performance in half, so it was better to
> go with a longer clock cycle. That gave me some leeway with the rest of the
> design.
>
> I had to slow the whole design down though as it got too large for the FPGA.
> Reduced the size of the register file by using block RAMs instead of LUTs and
> they needed to be clocked at double the CPU rate. Removed an ALU so an
> FPU could be supported. A double frequency clock is fed to some parts of the
> design like the multiplier / divider, square root.
>
> Saving up for a larger FPGA.
>
> A lot more could be fit into the FPGA at the cost of some performance. The
> core is only about 62k LUTs with an FPU. The rest of the system including
> MMU takes about 35k LUTs. I have yet to add the neural network accelerator,
> NNA, and graphics accelerator.
>

Hmm...

My case:
BJX2 Core, with all of the current "fancy stuff" enabled:
3-wide, multiple ISA modes, 96-bit VAs, "all the stuff", ...
~ 40k LUT.
Current version which fits into the XC7S50:
3-wide (6R3W), FPU and SIMD, 48-bit VAs
~ 28k LUT
RISC-like subset:
1-wide, 32-bit address space, no FPU or MMU
3R1W register file.
~ 10k LUT

There was, at one point, a variant with a 4R2W register file, which was
unattractive as it had nearly the cost of the 3-wide case, while being
barely more capable than the 1-wide case.

Smallest cores I have done:
~ 5k LUTs
These being roughly:
16 GPRs (32-bit), 16-bit instructions, aligned-only memory
No FPU or MMU, no multiply, fixed-shift only
2R1W register file.

The 5k LUT cores would be fairly minimalist even vs RISC-V.
The only real way I know of to get down to this size is to omit such
niceties as unaligned memory access and a shift unit.

Also these had "L1 caches" which would only hold a single cache line
(having an array of cache-lines significantly increases the cost of the
L1 cache; but also leads to a fairly drastic performance improvement).

....

As-is, I can do:
2 cores on XC7A200T (device: 134k LUT)
1 core on XC7A100T (device: 63k LUT)
1 core (reduced features) on XC7S50 (device: 32k LUT)
1 core (RISC-like only) on XC7S25 (device: 14k LUT)

I could maybe fit 3 cores or do a dedicated GPU core or similar on the
XC7A200T.

I technically now have a board with an XC7K325T (gotten for surprisingly
cheap off AliExpress), but haven't had much luck with getting the
open-source toolchain working, and the free version of Vivado doesn't
support this FPGA, ...

Also the open-source tools are a little sketchy, as they apparently
depend on having a specific version of the Linux version of Vivado that
gets invoked in the background (and a proper FOSS toolchain should have
no dependence on the proprietary toolchain it is meant to replace).

Biggest in free version seems to be XC7K160T (but, QMTECH wasn't selling
any boards with this chip, nor is anyone else selling boards with these
for "affordable" prices).

The XC7A200T has more LUTs than the XC7K160T, but the latter could run
the core at higher clock speeds.

>> <
>>>>> As to the cost of 2 dest registers, at the low end for an in-order,
>>>>> single issue pipeline with 1 register write port, taking an extra clock
>>>>> to write a second result does not add any complexity.
>> <
>> For My 66000 CARRY, I get rid of the complexity of 2 result registers
>> by having a recirculating bus, aptly called the Carry-Bus, that recirculates
>> results to operands until the end of the CARRY shadow. And if the last
>> calculation does not produce a carry-out, the result does not get written.
>> Much of the time the carry result is never "written" into the register file.
>> <
>>>> At least not much.
>>>>
>>>> But what's the difference from
>>>>
>>>>
>>>> add s0 = a, b
>>>> sltu carry0 = s0,a
>>>>
>>>> which also takes two cycles? Or do you have a three-input
>>>> add-with-carry? That would certainly solve the major problem that
>>>> MIPS, Alpha, and RISC-V have for this stuff, which is dealing with the
>>>> carry-in.
>>>>> For a multi-lane OoO uArch my concern was for Rename to NOT require
>>>>> two register allocation units for each lane, adding greatly to the cost
>>>>> and complexity, and which would mostly sit unused burning power.
>>>>>
>>>>> My solution is for Decode to spit out 2-dest instructions as two 1-dest uOps
>>>>> split across two consecutive lanes. Rename allocates the 1-dest registers
>>>>> to uOps as normal, then fuses the two pieces and outputs a single uOp to
>>>>> one output lane to Dispatch.
>>>> I wonder if that's what ARM are doing for their multi-result
>>>> instructions (load-pair with pre/post-indexed addressing mode consumes
>>>> three write ports).
>>>>
>>> An idea I have been toying with is for some operations caching the operands
>>> and results. So, if a divide is done the remainder is available immediately
>>> without having to do a full divide. Works for multiply high too. I think it should
>>> work well with reciprocal calculations too, where the same reciprocal is used
>>> in a loop for instance.
>> <
>> My VEC instruction informs the hardware which registers hold live results when
>> the LOOP terminates. This means a majority of results produced inside a loop
>> never need to actually get written into the register file, the result just has to be
>> passed on as an operand. This saves power.
>> <
>>>
>>> To deal with extended precision ADD in the past I have used a three operand
>>> ADD, combined with a carry generate instruction. The carry generate
>>> instruction takes the same operands as the add instruction but returns the
>>> carry out instead of the sum. It can then be added in the next instruction. It
>>> is a bit ugly but if one really wants to do multi-word arithmetic it is possible.
>>> I think the ADD and the carry-generate can be done in the same clock cycle
>>> in a superscalar CPU.
>> <
>> My 66000 - 256-bit integer add::
>> <
>> CARRY R16,{{I}{IO}{IO}{O}}
>> ADD R12,R4,R8 // carry Out only
>> ADD R13,R5,R9 // Carry In and Out
>> ADD R14,R6,R10 // Carry In and Out
>> ADD R15,R7,R11 // Carry In only
>> <
>> Here, in typical use, the carry register does not have to be read from the
>> register file nor does it have to get written at the end, either.
>
>
>> <
>>>> - anton
>>>> --
>>>> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
>>>> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>


Re: Faster Sort Algorithms

<b4557e9e-51cd-465a-b3f7-595fd97259d6n@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=32781&group=comp.arch#32781

X-Received: by 2002:a37:4607:0:b0:760:8f5a:e927 with SMTP id t7-20020a374607000000b007608f5ae927mr71676qka.6.1686549722354;
Sun, 11 Jun 2023 23:02:02 -0700 (PDT)
X-Received: by 2002:a05:6870:b7a8:b0:19e:e72d:8576 with SMTP id
ed40-20020a056870b7a800b0019ee72d8576mr2697118oab.8.1686549722043; Sun, 11
Jun 2023 23:02:02 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 11 Jun 2023 23:02:01 -0700 (PDT)
In-Reply-To: <u663ff$302pp$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2607:fea8:1dde:6a00:a994:82f0:4c36:de2c;
posting-account=QId4bgoAAABV4s50talpu-qMcPp519Eb
NNTP-Posting-Host: 2607:fea8:1dde:6a00:a994:82f0:4c36:de2c
References: <4abb287e-88eb-485a-916e-50a0a3b0d68cn@googlegroups.com>
<FVlgM.4315$fZx2.2545@fx14.iad> <rPIgM.11831$fZx2.2801@fx14.iad>
<90ea8cc3-2ae8-4401-97a1-f9285e7588c5n@googlegroups.com> <CckhM.10390$NuA9.3884@fx03.iad>
<2023Jun11.163044@mips.complang.tuwien.ac.at> <fce79f80-b4c9-434c-8375-2efead1a637dn@googlegroups.com>
<e9d29bd6-fcac-4928-9296-57e262f0d64cn@googlegroups.com> <d0abe42e-a038-4cb4-bb86-f4fe98a4cecen@googlegroups.com>
<u663ff$302pp$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <b4557e9e-51cd-465a-b3f7-595fd97259d6n@googlegroups.com>
Subject: Re: Faster Sort Algorithms
From: robfi680@gmail.com (robf...@gmail.com)
Injection-Date: Mon, 12 Jun 2023 06:02:02 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: robf...@gmail.com - Mon, 12 Jun 2023 06:02 UTC

On Sunday, June 11, 2023 at 11:29:55 PM UTC-4, BGB wrote:
> On 6/11/2023 7:21 PM, robf...@gmail.com wrote:
> > On Sunday, June 11, 2023 at 4:48:51 PM UTC-4, MitchAlsup wrote:
> >> On Sunday, June 11, 2023 at 11:59:33 AM UTC-5, robf...@gmail.com wrote:
> >>> On Sunday, June 11, 2023 at 10:57:39 AM UTC-4, Anton Ertl wrote:
> >>
> >>>>> Variable length instructions open the ISA to much useful functionality
> >>>>> like full size immediate operands, but require a more complicated fetch
> >>>>> unit than fixed size. This cost is generally accepted today.
> >>>> Counterexample: ARM A64 (and ARM has ample experience with
> >>>> variable-size instructions).
> >> <
> >>> I just decided not too long ago to go with a fixed-size instruction because
> >>> variable-length calculations ended up on the critical timing path for an FPGA
> >>> design. Postfixes can effectively extend the instruction with constants.
> >> <
> >> I found an encoding scheme whereby determining the length of an
> >> instruction is 4 gates of delay covering 1-5 word instructions. It is
> >> entirely parallel and can be treeified at 1 gate per doubling of the
> >> number of instructions decoded. Thus, in 1 stage of logic one can
> >> parse 16-instructions out of 32-words.
> >> <
> >> Now, of course, all this takes place in unary form, which is the natural
> >> form of hardware parsing of bit strings into useful containers. Why unary ?
> >> Unary becomes the inputs to multiplexers--obviating the need for binary
> >> to unary decoding.
> >> <
> >> Perhaps your variable length decoding mechanism was not very refined.
> >
> > I think it was probably okay, just a lack of pipelining at the fetch stage slowed
> > things down. The length was determined by a lookup of the first eight bits of
> > the instruction. I wanted to compact the length bits down to fractions of bits,
> > I thought it somewhat wasteful to have them for every instruction. The lookup
> > table was only 1 LUT deep, but a second layer of LUTs was needed for
> > decoding. Part of the issue was doing too much at the fetch stage without
> > pipelining it. The instruction alignment using a 512+ wide bit shift was also
> > being done in the same clock cycle. It was fed from the output of an odd, even
> > mux. A consequence of using 40-bit wide instructions. So, that took a couple
> > of layers of LUTs. On top of that LUTs were being read combo style from the
> > LUT ram. And on top of that NOP instructions were being inserted for cache
> > misses, and IRQ instructions for interrupts. So, another 3:1 mux in the fetch
> > stage. It needed more pipelining. I found it would be fast enough though by
> > removing the length lookup. Other pieces of the design showed up on the
> > critical path. It is scary how much logic takes place in one cycle. I did not
> > want to add extra pipeline stages at fetch as that creates more flushing of
> > instructions during branches.
> >
> > I hit a kind of minimum useful cycle when the result forwarding from the
> > output of the ALU back to the input appeared on the timing path. Removing
> > the forwarding muxes would cut the performance in half, so it was better to
> > go with a longer clock cycle. That gave me some leeway with the rest of the
> > design.
> >
> > I had to slow the whole design down though as it got too large for the FPGA.
> > Reduced the size of the register file by using block RAMs instead of LUTs and
> > they needed to be clocked at double the CPU rate. Removed an ALU so an
> > FPU could be supported. A double frequency clock is fed to some parts of the
> > design like the multiplier / divider, square root.
> >
> > Saving up for a larger FPGA.
> >
> > A lot more could be fit into the FPGA at the cost of some performance. The
> > core is only about 62k LUTs with an FPU. The rest of the system including
> > MMU takes about 35k LUTs. I have yet to add the neural network accelerator,
> > NNA, and graphics accelerator.
> >
> Hmm...
>
> My case:
> BJX2 Core, with all of the current "fancy stuff" enabled:
> 3-wide, multiple ISA modes, 96-bit VAs, "all the stuff", ...
> ~ 40k LUT.
> Current version which fits into the XC7S50:
> 3-wide (6R3W), FPU and SIMD, 48-bit VAs
> ~ 28k LUT
> RISC-like subset:
> 1-wide, 32-bit address space, no FPU or MMU
> 3R1W register file.
> ~ 10k LUT
>
> There was, at one point, a variant with a 4R2W register file, which was
> unattractive as it had nearly the cost of the 3-wide case, while being
> barely more capable than the 1-wide case.

A 10r2w register file is being used for the Thor core. It is quite large. It
was originally made from LUTs and could run at 50 MHz, but it is now
block RAM: 3 source operand registers, plus the target register and the
predicate register, for two lanes of execution. It now also supports
multiple register sets, should I get around to an SMT version of the core.
>
> Smallest cores I have done:
> ~ 5k LUTs
> These being roughly:
> 16 GPRs (32-bit), 16-bit instructions, aligned-only memory
> No FPU or MMU, no multiply, fixed-shift only
> 2R1W register file.
>
> The 5k LUT cores would be fairly minimalist even vs RISC-V.
> The only real way I know of to get down to this size being to omit such
> niceties as unaligned memory access and a shift unit.
>
> Also these had "L1 caches" which would only hold a single cache line
> (having an array of cache-lines significantly increases the cost of the
> L1 cache; but also leads to a fairly drastic performance improvement).
>

I started small, limited by an XC4035? FPGA. Around 600 LUTs. I had wire-
wrapped my own board with 64Kx8 DRAM attached and composite video
output. I have gradually moved to larger and larger cores. I still have a
couple of the original chips somewhere. Maybe collector's items now.

> ...
>
>
> As-is, I can do:
> 2 cores on XC7A200T (device: 134k LUT)
> 1 core on XC7A100T (device: 63k LUT)
> 1 core (reduced features) on XC7S50 (device: 32k LUT)
> 1 core (RISC-like only) on XC7S25 (device: 14k LUT)
>
> I could maybe fit 3 cores or do a dedicated GPU core or similar on the
> XC7A200T.
>
>
>
> I technically now have a board with an XC7K325T (gotten for surprisingly
> cheap off AliExpress), but haven't had much luck with getting the
> open-source toolchain working, and the free version of Vivado doesn't
> support this FPGA, ...

I was looking at an XC7K410T on eBay, but the 410 was not listed as
supported by the open-source tools; the 420 is. I could not find a 420
for sale.

>
> Also the open-source tools are a little sketchy, as they apparently
> depend on having a specific version of the Linux version of Vivado that
> gets invoked in the background (and a proper FOSS toolchain should have
> no dependence on the proprietary toolchain it is meant to replace).

That is good to know. It sounds like the open-source tools have a ways to
go yet.

>
>
> Biggest in free version seems to be XC7K160T (but, QMTECH wasn't selling
> any boards with this chip, nor is anyone else selling boards with these
> for "affordable" prices).
>
> The XC7A200T has more LUTs than the XC7K160T, but the latter could run
> the core at higher clock speeds.
> >> <
> >>>>> As to the cost of 2 dest registers, at the low end for an in-order,
> >>>>> single issue pipeline with 1 register write port, taking an extra clock
> >>>>> to write a second result does not add any complexity.
> >> <
> >> For My 66000 CARRY, I get rid of the complexity of 2 result registers
> >> by having a recirculating bus, aptly called the Carry-Bus, that recirculates
> >> results to operands until the end of the CARRY shadow. And if the last
> >> calculation does not produce a carry-out, the result does not get written.
> >> Much of the time the carry result is never "written" into the register file.
> >> <
> >>>> At least not much.
> >>>>
> >>>> But what's the difference from
> >>>>
> >>>>
> >>>> add s0 = a, b
> >>>> sltu carry0 = s0,a
> >>>>
> >>>> which also takes two cycles? Or do you have a three-input
> >>>> add-with-carry? That would certainly solve the major problem that
> >>>> MIPS, Alpha, and RISC-V have for this stuff, which is dealing with the
> >>>> carry-in.
> >>>>> For a multi-lane OoO uArch my concern was for Rename to NOT require
> >>>>> two register allocation units for each lane, adding greatly to the cost
> >>>>> and complexity, and which would mostly sit unused burning power.
> >>>>>
> >>>>> My solution is for Decode to spit out 2-dest instructions as two 1-dest uOps
> >>>>> split across two consecutive lanes. Rename allocates the 1-dest registers
> >>>>> to uOps as normal, then fuses the two pieces and outputs a single uOp to
> >>>>> one output lane to Dispatch.
> >>>> I wonder if that's what ARM are doing for their multi-result
> >>>> instructions (load-pair with pre/post-indexed addressing mode consumes
> >>>> three write ports).
> >>>>
> >>> An idea I have been toying with is for some operations caching the operands
> >>> and results. So, if a divide is done the remainder is available immediately
> >>> without having to do a full divide. Works for multiply high too. I think it should
> >>> work well with reciprocal calculations too, where the same reciprocal is used
> >>> in a loop for instance.
> >> <
> >> My VEC instruction informs the hardware which registers hold live results when
> >> the LOOP terminates. This means a majority of results produced inside a loop
> >> never need to actually get written into the register file, the result just has to be
> >> passed on as an operand. This saves power.
> >> <
> >>>
> >>> To deal with extended precision ADD in the past I have used a three operand
> >>> ADD, combined with a carry generate instruction. The carry generate
> >>> instruction takes the same operands as the add instruction but returns the
> >>> carry out instead of the sum. It can then be added in the next instruction. It
> >>> is a bit ugly but if one really wants to do multi-word arithmetic it is possible.
> >>> I think the ADD and the carry-generate can be done in the same clock cycle
> >>> in a superscalar CPU.
> >> <
> >> My 66000 - 256-bit integer add::
> >> <
> >> CARRY R16,{{I}{IO}{IO}{O}}
> >> ADD R12,R4,R8 // carry Out only
> >> ADD R13,R5,R9 // Carry In and Out
> >> ADD R14,R6,R10 // Carry In and Out
> >> ADD R15,R7,R11 // Carry In only
> >> <
> >> Here, in typical use, the carry register does not have to be read from the
> >> register file nor does it have to get written at the end, either.
> >
> >
> >> <
> >>>> - anton
> >>>> --
> >>>> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> >>>> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>


Re: Faster Sort Algorithms

<u66f9h$319b8$1@dont-email.me>


https://news.novabbs.org/devel/article-flat.php?id=32784&group=comp.arch#32784

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: terje.mathisen@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Faster Sort Algorithms
Date: Mon, 12 Jun 2023 08:51:29 +0200
Organization: A noiseless patient Spider
Lines: 71
Message-ID: <u66f9h$319b8$1@dont-email.me>
References: <4abb287e-88eb-485a-916e-50a0a3b0d68cn@googlegroups.com>
<FVlgM.4315$fZx2.2545@fx14.iad> <rPIgM.11831$fZx2.2801@fx14.iad>
<90ea8cc3-2ae8-4401-97a1-f9285e7588c5n@googlegroups.com>
<CckhM.10390$NuA9.3884@fx03.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 12 Jun 2023 06:51:29 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="55367020f904f2ee7dd151c09fde27ae";
logging-data="3188072"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19w1tuDuZ/HCMrLz+d9tXwxxyN4IgFHKWfLew4GJ4Luiw=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Firefox/91.0 SeaMonkey/2.53.16
Cancel-Lock: sha1:jLkfyF3f370Nk7H3U0OQOPkW8wQ=
In-Reply-To: <CckhM.10390$NuA9.3884@fx03.iad>
 by: Terje Mathisen - Mon, 12 Jun 2023 06:51 UTC

EricP wrote:
> MitchAlsup wrote:
>> On Friday, June 9, 2023 at 11:54:51 AM UTC-5, EricP wrote:
>>>
>>> For ISA design, compare and swap CSWAP_cc rd,rs could probably be an
>>> instruction along with SWAP rd,rs.
>> <
>> 2 results per instruction is contrary to the tenets of RISC.
>> And most/many of us think condition codes are contrary to tenets of RISC.
>
> I was driven to instructions with 2-dest results by my desire to
> (a) not have condition codes and (b) wanting to handle add with carry
> in a straightforward way. That leads to many new wide result instructions.
>
> I see the RISC tenets as guidelines advising designers to
> consider the benefits and costs rather than hidebound dogma.
> And that cost/benefit consideration is made in the context of the
> logic budgets of the current day, not locked to a date mid 1980's.
>
> Variable length instructions open the ISA to much useful functionality
> like full size immediate operands, but require a more complicated fetch
> unit than fixed size. This cost is generally accepted today.
>
> I found your arguments for the utility of store immediate instructions
> convincing but that requires allowing for up to two immediates,
> one for the addressing mode and one for data. The cost of that internally
> is it requires extra decode logic, extra flip-flops in the front end to
> hold the second immediate and extra muxes to route it, and that puts
> pressure on the budget designs.
>
> Having up to two dest registers opens an ISA to a whole host of new and
> useful instructions. Mostly these are double-wide results for ADD, MUL,
> various shifts and bit fields. It also allows integer DIV to return
> both the result and remainder, and SWAP or CSWAP, plus many others.
>
> As to the cost of 2 dest registers, at the low end for an in-order,
> single issue pipeline with 1 register write port, taking an extra clock
> to write a second result does not add any complexity.
>
> For a multi-lane OoO uArch my concern was for Rename to NOT require
> two register allocation units for each lane, adding greatly to the cost
> and complexity, and which would mostly sit unused burning power.
>
> My solution is for Decode to spit out 2-dest instructions as two 1-dest
> uOps
> split across two consecutive lanes. Rename allocates the 1-dest registers
> to uOps as normal, then fuses the two pieces and outputs a single uOp to
> one output lane to Dispatch.
>
> The OoO writeback of multiple dests is no more expensive because it
> already requires multiple result and forwarding buses and write ports
> so that it can catch up after bubbles.
>
> With plausible solutions for each problem area, and the costs for 2-dest
> registers under control, I green lighted it for general use by new
> instructions.
>
> So "CSWAP rdX, rdY, rsA, rsB, rsCC" slides right into the ISA:
> IF (rsCC) rdX = rsA, rdY = rsB ELSE rdX = rsB, rdY = rsA END IF
> It needs 3 source registers, but FMA and ADD-wide already need that,
> and 2 dest registers, and could be implemented as a single ALU uOp.

Presumably, the new exact AugmentedAddition/Multiplication FP ops are a
part of that new set. :-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Faster Sort Algorithms

<03e6707d-a328-4481-ba00-e44335bd01e8n@googlegroups.com>


https://news.novabbs.org/devel/article-flat.php?id=32785&group=comp.arch#32785

X-Received: by 2002:a37:44c:0:b0:75e:b8b2:8639 with SMTP id 73-20020a37044c000000b0075eb8b28639mr649318qke.15.1686553597733;
Mon, 12 Jun 2023 00:06:37 -0700 (PDT)
X-Received: by 2002:a05:6870:954f:b0:192:551f:2533 with SMTP id
v15-20020a056870954f00b00192551f2533mr2778290oal.1.1686553597445; Mon, 12 Jun
2023 00:06:37 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!1.us.feeder.erje.net!feeder.erje.net!border-1.nntp.ord.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 12 Jun 2023 00:06:37 -0700 (PDT)
In-Reply-To: <b4557e9e-51cd-465a-b3f7-595fd97259d6n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2607:fea8:1dde:6a00:a994:82f0:4c36:de2c;
posting-account=QId4bgoAAABV4s50talpu-qMcPp519Eb
NNTP-Posting-Host: 2607:fea8:1dde:6a00:a994:82f0:4c36:de2c
References: <4abb287e-88eb-485a-916e-50a0a3b0d68cn@googlegroups.com>
<FVlgM.4315$fZx2.2545@fx14.iad> <rPIgM.11831$fZx2.2801@fx14.iad>
<90ea8cc3-2ae8-4401-97a1-f9285e7588c5n@googlegroups.com> <CckhM.10390$NuA9.3884@fx03.iad>
<2023Jun11.163044@mips.complang.tuwien.ac.at> <fce79f80-b4c9-434c-8375-2efead1a637dn@googlegroups.com>
<e9d29bd6-fcac-4928-9296-57e262f0d64cn@googlegroups.com> <d0abe42e-a038-4cb4-bb86-f4fe98a4cecen@googlegroups.com>
<u663ff$302pp$1@dont-email.me> <b4557e9e-51cd-465a-b3f7-595fd97259d6n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <03e6707d-a328-4481-ba00-e44335bd01e8n@googlegroups.com>
Subject: Re: Faster Sort Algorithms
From: robfi680@gmail.com (robf...@gmail.com)
Injection-Date: Mon, 12 Jun 2023 07:06:37 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 310
 by: robf...@gmail.com - Mon, 12 Jun 2023 07:06 UTC

On Monday, June 12, 2023 at 2:02:04 AM UTC-4, robf...@gmail.com wrote:
> On Sunday, June 11, 2023 at 11:29:55 PM UTC-4, BGB wrote:
> > On 6/11/2023 7:21 PM, robf...@gmail.com wrote:
> > > On Sunday, June 11, 2023 at 4:48:51 PM UTC-4, MitchAlsup wrote:
> > >>> On Sunday, June 11, 2023 at 11:59:33 AM UTC-5, robf...@gmail.com wrote:
> > >>> On Sunday, June 11, 2023 at 10:57:39 AM UTC-4, Anton Ertl wrote:
> > >>
> > >>>>> Variable length instructions open the ISA to much useful functionality
> > >>>>> like full size immediate operands, but require a more complicated fetch
> > >>>>> unit than fixed size. This cost is generally accepted today.
> > >>>> Counterexample: ARM A64 (and ARM has ample experience with
> > >>>> variable-size instructions).
> > >> <
> > >>> I just decided not too long ago to go with a fixed-size instruction because
> > >>> variable-length calculations ended up on the critical timing path for an FPGA
> > >>> design. Postfixes can effectively extend the instruction with constants.
> > >> <
> > >> I found an encoding scheme whereby determining the length of an
> > >> instruction is 4 gates of delay covering 1-5 word instructions. It is
> > >> entirely parallel and can be treeified at 1 gate per doubling of the
> > >> number of instructions decoded. Thus, in 1 stage of logic one can
> > >> parse 16-instructions out of 32-words.
> > >> <
> > >> Now, of course, all this takes place in unary form, which is the natural
> > >> form of hardware parsing of bit strings into useful containers. Why unary ?
> > >> Unary becomes the inputs to multiplexers--obviating the need for binary
> > >> to unary decoding.
> > >> <
> > >> Perhaps your variable length decoding mechanism was not very refined.
> > >
> > > I think it was probably okay, just a lack of pipelining at the fetch stage slowed
> > > things down. The length was determined by a lookup of the first eight bits of
> > > the instruction. I wanted to compact the length bits down to fractions of bits,
> > > I thought it somewhat wasteful to have them for every instruction. The lookup
> > > table was only 1 LUT deep, but a second layer of LUTs was needed for
> > > decoding. Part of the issue was doing too much at the fetch stage without
> > > pipelining it. The instruction alignment using a 512+ wide bit shift was also
> > > being done in the same clock cycle. It was fed from the output of an odd, even
> > > mux. A consequence of using 40-bit wide instructions. So, that took a couple
> > > of layers of LUTs. On top of that LUTs were being read combo style from the
> > > LUT ram. And on top of that NOP instructions were being inserted for cache
> > > misses, and IRQ instructions for interrupts. So, another 3:1 mux in the fetch
> > > stage. It needed more pipelining. I found it would be fast enough though by
> > > removing the length lookup. Other pieces of the design showed up on the
> > > critical path. It is scary how much logic takes place in one cycle. I did not
> > > want to add extra pipeline stages at fetch as that creates more flushing of
> > > instructions during branches.
> > >
> > > I hit a kind of minimum useful cycle when the results forwarding from the
> > > output of the ALU back to the input appeared on the timing path. Removing
> > > the forwarding muxes would cut the performance in half, so it was better to
> > > go with a longer clock cycle. That gave me some leeway with the rest of the
> > > design.
> > >
> > > I had to slow the whole design down though as it got too large for the FPGA.
> > > Reduced the size of the register file by using block RAMs instead of LUTs and
> > > they needed to be clocked at double the CPU rate. Removed an ALU so an
> > > FPU could be supported. A double frequency clock is fed to some parts of the
> > > design like the multiplier / divider, square root.
> > >
> > > Saving up for a larger FPGA.
> > >
> > > A lot more could be fit into the FPGA at the cost of some performance. The
> > > core is only about 62k LUTs with an FPU. The rest of the system including
> > > MMU takes about 35k LUTs. I have yet to add the neural network accelerator,
> > > NNA, and graphics accelerator.
> > >
> > Hmm...
> >
> > My case:
> > BJX2 Core, with all of the current "fancy stuff" enabled:
> > 3-wide, multiple ISA modes, 96-bit VAs, "all the stuff", ...
> > ~ 40k LUT.
> > Current version which fits into the XC7S50:
> > 3-wide (6R3W), FPU and SIMD, 48-bit VAs
> > ~ 28k LUT
> > RISC-like subset:
> > 1-wide, 32-bit address space, no FPU or MMU
> > 3R1W register file.
> > ~ 10k LUT
> >
> > There was, at one point, a variant with a 4R2W register file, which was
> > unattractive as it had nearly the cost of the 3-wide case, while being
> > barely more capable than the 1-wide case.
> A 10r2w register file is being used for the Thor core. It is quite large. It
> was originally made from LUTs and could run at 50 MHz. But it is now
> block RAM. 3 source operand registers, plus the target register and
> predicate register, for two lanes of execution. But it now also supports
> multiple register sets, should I get around to a SMT version of the core.
> >
> > Smallest cores I have done:
> > ~ 5k LUTs
> > These being roughly:
> > 16 GPRs (32-bit), 16-bit instructions, aligned-only memory
> > No FPU or MMU, no multiply, fixed-shift only
> > 2R1W register file.
> >
> > The 5k LUT cores would be fairly minimalist even vs RISC-V.
> > The only real way I know of to get down to this size being to omit such
> > niceties as unaligned memory access and a shift unit.
> >
> > Also these had "L1 caches" which would only hold a single cache line
> > (having an array of cache-lines significantly increases the cost of the
> > L1 cache; but also leads to a fairly drastic performance improvement).
> >
> I started small, limited by an XC4035? FPGA. Around 600 LUTs. I had wire-
> wrapped my own board with 64kx8 dram attached and composite video
> output. I have constantly gotten to larger and larger cores. I still have a
> couple of the original chips somewhere. Maybe collector's items now.

Ooops. I think that was a XC4005/4010.

> > ...
> >
> >
> > As-is, I can do:
> > 2 cores on XC7A200T (device: 134k LUT)
> > 1 core on XC7A100T (device: 63k LUT)
> > 1 core (reduced features) on XC7S50 (device: 32k LUT)
> > 1 core (RISC-like only) on XC7S25 (device: 14k LUT)
> >
> > I could maybe fit 3 cores or do a dedicated GPU core or similar on the
> > XC7A200T.
> >
> >
> >
> > I technically now have a board with an XC7K325T (gotten for surprisingly
> > cheap off AliExpress), but haven't had much luck with getting the
> > open-source toolchain working, and the free version of Vivado doesn't
> > support this FPGA, ...
> I was looking at an XC7K410T on ebay, but the 410 was not listed as
> supported by the open-source tools; the 420 is. I could not find a 420
> for sale.
> >
> > Also the open-source tools are a little sketchy, as they apparently
> > depend on having a specific version of the Linux version of Vivado that
> > gets invoked in the background (and a proper FOSS toolchain should have
> > no dependence on the proprietary toolchain it is meant to replace).
> That is good to know. It sounds like the open-source tools have a ways to
> go yet.
> >
> >
> > Biggest in free version seems to be XC7K160T (but, QMTECH wasn't selling
> > any boards with this chip, nor is anyone else selling boards with these
> > for "affordable" prices).
> >
> > The XC7A200T has more LUTs than the XC7K160T, but the latter could run
> > the core at higher clock speeds.
> > >> <
> > >>>>> As to the cost of 2 dest registers, at the low end for an in-order,
> > >>>>> single issue pipeline with 1 register write port, taking an extra clock
> > >>>>> to write a second result does not add any complexity.
> > >> <
> > >> For My 66000 CARRY, I get rid of the complexity of 2 result registers
> > >> by having a recirculating bus, aptly called the Carry-Bus, that recirculates
> > >> results to operands until the end of the CARRY shadow. And if the last
> > >> calculation does not produce a carry-out, the result does not get written.
> > >> Much of the time the carry result is never "written" into the register file.
> > >> <
> > >>>> At least not much.
> > >>>>
> > >>>> But what's the difference from
> > >>>>
> > >>>>
> > >>>> add s0 = a, b
> > >>>> sltu carry0 = s0,a
> > >>>>
> > >>>> which also takes two cycles? Or do you have a three-input
> > >>>> add-with-carry? That would certainly solve the major problem that
> > >>>> MIPS, Alpha, and RISC-V have for this stuff, which is dealing with the
> > >>>> carry-in.
> > >>>>> For a multi-lane OoO uArch my concern was for Rename to NOT require
> > >>>>> two register allocation units for each lane, adding greatly to the cost
> > >>>>> and complexity, and which would mostly sit unused burning power.
> > >>>>>
> > >>>>> My solution is for Decode to spit out 2-dest instructions as two 1-dest uOps
> > >>>>> split across two consecutive lanes. Rename allocates the 1-dest registers
> > >>>>> to uOps as normal, then fuses the two pieces and outputs a single uOp to
> > >>>>> one output lane to Dispatch.
> > >>>> I wonder if that's what ARM are doing for their multi-result
> > >>>> instructions (load-pair with pre/post-indexed addressing mode consumes
> > >>>> three write ports).
> > >>>>
> > >>> An idea I have been toying with is for some operations caching the operands
> > >>> and results. So, if a divide is done the remainder is available immediately
> > >>> without having to do a full divide. Works for multiply high too. I think it should
> > >>> work well with reciprocal calculations too, where the same reciprocal is used
> > >>> in a loop for instance.
> > >> <
> > >> My VEC instruction informs the hardware which registers hold live results when
> > >> the LOOP terminates. This means a majority of results produced inside a loop
> > >> never need to actually get written into the register file, the result just has to be
> > >> passed on as an operand. This saves power.
> > >> <
> > >>>
> > >>> To deal with extended precision ADD in the past I have used a three operand
> > >>> ADD, combined with a carry generate instruction. The carry generate
> > >>> instruction takes the same operands as the add instruction but returns the
> > >>> carry out instead of the sum. It can then be added in the next instruction. It
> > >>> is a bit ugly but if one really wants to do multi-word arithmetic it is possible.
> > >>> I think the ADD and the carry-generate can be done in the same clock cycle
> > >>> in a superscalar CPU.
> > >> <
> > >> My 66000 - 256-bit integer add::
> > >> <
> > >> CARRY R16,{{I}{IO}{IO}{O}}
> > >> ADD R12,R4,R8 // carry Out only
> > >> ADD R13,R5,R9 // Carry In and Out
> > >> ADD R14,R6,R10 // Carry In and Out
> > >> ADD R15,R7,R11 // Carry In only
> > >> <
> > >> Here, in typical use, the carry register does not have to be read from the
> > >> register file nor does it have to get written at the end, either.
> > >
> > >
> > >> <
> > >>>> - anton
> > >>>> --
> > >>>> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> > >>>> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>


Re: Faster Sort Algorithms

<u66nt1$326n5$1@dont-email.me>

 by: BGB - Mon, 12 Jun 2023 09:18 UTC

On 6/12/2023 1:02 AM, robf...@gmail.com wrote:
> On Sunday, June 11, 2023 at 11:29:55 PM UTC-4, BGB wrote:
>> On 6/11/2023 7:21 PM, robf...@gmail.com wrote:
>>> On Sunday, June 11, 2023 at 4:48:51 PM UTC-4, MitchAlsup wrote:
>>>> On Sunday, June 11, 2023 at 11:59:33 AM UTC-5, robf...@gmail.com wrote:
>>>>> On Sunday, June 11, 2023 at 10:57:39 AM UTC-4, Anton Ertl wrote:
>>>>
>>>>>>> Variable length instructions opens the ISA to much useful functionality
>>>>>>> like full size immediate operands but requires a more complicated fetch
>>>>>>> unit than fixed size. This cost is generally accepted today.
>>>>>> Counterexample: ARM A64 (and ARM has ample experience with
>>>>>> variable-size instructions).
>>>> <
>>>>> I just decided not too long ago to go with a fixed-size instruction because
>>>>> variable length calculations ended up on the critical timing path for an FPGA
>>>>> design. Postfixes can effectively extend the instruction with constants.
>>>> <
>>>> I found an encoding scheme whereby determining the length of an
>>>> instruction is 4 gates of delay covering 1-5 word instructions. It is
>>>> entirely parallel and can be treeified at 1 gate per doubling of the
>>>> number of instructions decoded. Thus, in 1 stage of logic one can
>>>> parse 16-instructions out of 32-words.
>>>> <
>>>> Now, of course, all this takes place in unary form, which is the natural
>>>> form of hardware parsing of bit strings into useful containers. Why unary ?
>>>> Unary becomes the inputs to multiplexers--obviating the need for binary
>>>> to unary decoding.
>>>> <
>>>> Perhaps your variable length decoding mechanism was not very refined.
>>>
>>> I think it was probably okay, just a lack of pipelining at the fetch stage slowed
>>> things down. The length was determined by a lookup of the first eight bits of
>>> the instruction. I wanted to compact the length bits down to fractions of bits,
>>> I thought it somewhat wasteful to have them for every instruction. The lookup
>>> table was only 1 LUT deep, but a second layer of LUTs was needed for
>>> decoding. Part of the issue was doing too much at the fetch stage without
>>> pipelining it. The instruction alignment using a 512+ wide bit shift was also
>>> being done in the same clock cycle. It was fed from the output of an odd, even
>>> mux. A consequence of using 40-bit wide instructions. So, that took a couple
>>> of layers of LUTs. On top of that LUTs were being read combo style from the
>>> LUT ram. And on top of that NOP instructions were being inserted for cache
>>> misses, and IRQ instructions for interrupts. So, another 3:1 mux in the fetch
>>> stage. It needed more pipelining. I found it would be fast enough though by
>>> removing the length lookup. Other pieces of the design showed up on the
>>> critical path. It is scary how much logic takes place in one cycle. I did not
>>> want to add extra pipeline stages at fetch as that creates more flushing of
>>> instructions during branches.
>>>
>>> I hit a kind of minimum useful cycle when the results forwarding from the
>>> output of the ALU back to the input appeared on the timing path. Removing
>>> the forwarding muxes would cut the performance in half, so it was better to
>>> go with a longer clock cycle. That gave me some leeway with the rest of the
>>> design.
>>>
>>> I had to slow the whole design down though as it got too large for the FPGA.
>>> Reduced the size of the register file by using block RAMs instead of LUTs and
>>> they needed to be clocked at double the CPU rate. Removed an ALU so an
>>> FPU could be supported. A double frequency clock is fed to some parts of the
>>> design like the multiplier / divider, square root.
>>>
>>> Saving up for a larger FPGA.
>>>
>>> A lot more could be fit into the FPGA at the cost of some performance. The
>>> core is only about 62k LUTs with an FPU. The rest of the system including
>>> MMU takes about 35k LUTs. I have yet to add the neural network accelerator,
>>> NNA, and graphics accelerator.
>>>
>> Hmm...
>>
>> My case:
>> BJX2 Core, with all of the current "fancy stuff" enabled:
>> 3-wide, multiple ISA modes, 96-bit VAs, "all the stuff", ...
>> ~ 40k LUT.
>> Current version which fits into the XC7S50:
>> 3-wide (6R3W), FPU and SIMD, 48-bit VAs
>> ~ 28k LUT
>> RISC-like subset:
>> 1-wide, 32-bit address space, no FPU or MMU
>> 3R1W register file.
>> ~ 10k LUT
>>
>> There was, at one point, a variant with a 4R2W register file, which was
>> unattractive as it had nearly the cost of the 3-wide case, while being
>> barely more capable than the 1-wide case.
>
> A 10r2w register file is being used for the Thor core. It is quite large. It
> was originally made from LUTs and could run at 50 MHz. But it is now
> block RAM. 3 source operand registers, plus the target register and
> predicate register, for two lanes of execution. But it now also supports
> multiple register sets, should I get around to a SMT version of the core.
>

Yeah...

I had looked at a 12R6W regfile at one point, but quick estimates were
that this was not going to be friendly to resource cost...

My current 6R3W regfile is implemented in LUTRAM.
Each normal instruction reserves only 2 register ports;
instructions that need more than 2 ports use multiple lanes.

This mostly works with 64 GPRs and my current strategy of only having a
single set of registers.

Having multiple bank-swapped register sets would be a problem though...

>>
>> Smallest cores I have done:
>> ~ 5k LUTs
>> These being roughly:
>> 16 GPRs (32-bit), 16-bit instructions, aligned-only memory
>> No FPU or MMU, no multiply, fixed-shift only
>> 2R1W register file.
>>
>> The 5k LUT cores would be fairly minimalist even vs RISC-V.
>> The only real way I know of to get down to this size being to omit such
>> niceties as unaligned memory access and a shift unit.
>>
>> Also these had "L1 caches" which would only hold a single cache line
>> (having an array of cache-lines significantly increases the cost of the
>> L1 cache; but also leads to a fairly drastic performance improvement).
>>
>
> I started small, limited by an XC4035? FPGA. Around 600 LUTs. I had wire-
> wrapped my own board with 64kx8 dram attached and composite video
> output. I have constantly gotten to larger and larger cores. I still have a
> couple of the original chips somewhere. Maybe collector's items now.
>

OK.

I got into things more recently, when the 7-series chips were already in
common use.

First board I got had an XC7S50.

Trying to keep size reasonable, but always more stuff to burn LUTs on.

For example, just went and added an experimental hack so that the
integer multiplier can communicate with the Shift-Add unit, and if the
multiplier sees a division it can handle by turning it into a multiply
by reciprocal, it can signal the divide unit to let the multiplier
handle it instead.

This tweak can boost Dhrystone score to a little over 80k (0.91 DMIPS/MHz).
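
The underlying math is the standard divide-by-invariant-multiply trick
(Granlund/Montgomery style). A rough C sketch, with made-up names -- the
hardware does the equivalent with its own logic, and unsigned __int128
is a GCC/Clang extension:

#include <stdint.h>

typedef struct { uint64_t m; unsigned s; } recip32_t;

/* precompute m = ceil(2^s / d) with s = 32 + ceil(log2 d); d >= 1 */
recip32_t recip_make(uint32_t d)
{
    unsigned s = 32;
    while (((unsigned __int128)1 << s) < ((unsigned __int128)d << 32))
        s++;
    uint64_t m = (uint64_t)((((unsigned __int128)1 << s) + d - 1) / d);
    return (recip32_t){ m, s };
}

/* n / d then becomes one multiply and one shift */
uint32_t recip_div(uint32_t n, recip32_t r)
{
    return (uint32_t)(((unsigned __int128)n * r.m) >> r.s);
}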

>> ...
>>
>>
>> As-is, I can do:
>> 2 cores on XC7A200T (device: 134k LUT)
>> 1 core on XC7A100T (device: 63k LUT)
>> 1 core (reduced features) on XC7S50 (device: 32k LUT)
>> 1 core (RISC-like only) on XC7S25 (device: 14k LUT)
>>
>> I could maybe fit 3 cores or do a dedicated GPU core or similar on the
>> XC7A200T.
>>
>>
>>
>> I technically now have a board with an XC7K325T (gotten for surprisingly
>> cheap off AliExpress), but haven't had much luck with getting the
>> open-source toolchain working, and the free version of Vivado doesn't
>> support this FPGA, ...
>
> I was looking at an XC7K410T on ebay, but the 410 was not listed as
> supported by the open-source tools; the 420 is. I could not find a 420
> for sale.
>

Well, and not supported by the free Vivado WebPack either...

Annoying that Vivado goes from free to "very expensive".
Would have been nice if one could be like, "Hey, I have this FPGA board
I got for $100 off AliExpress, can I just give you guys, say, $35 or
something so I can generate bitsteams for it?..."

>>
>> Also the open-source tools are a little sketchy, as they apparently
>> depend on having a specific version of the Linux version of Vivado that
>> gets invoked in the background (and a proper FOSS toolchain should have
>> no dependence on the proprietary toolchain it is meant to replace).
>
> That is good to know. It sounds like the open-source tools have a ways to
> go yet.
>


Re: Faster Sort Algorithms

<6mIhM.58859$d1y5.51140@fx17.iad>

 by: EricP - Mon, 12 Jun 2023 17:11 UTC

Anton Ertl wrote:
> EricP <ThatWouldBeTelling@thevillage.com> writes:
>> I was driven to instructions with 2-dest results by my desire to
>> (a) not have condition codes and (b) wanting to handle add with carry
>> in a straight forward way.
>
> I have expanded my answer to that problem into a paper:
> <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>.
>
> Of course, the question is whether you consider this answer
> straightforward.
>
>> Variable length instructions opens the ISA to much useful functionality
>> like full size immediate operands but requires a more complicated fetch
>> unit than fixed size. This cost is generally accepted today.
>
> Counterexample: ARM A64 (and ARM has ample experience with
> variable-size instructions).

Yes, it currently uses aligned 4-byte instruction words.
It also has condition codes, address modes like pre/post index,
load/store pair. So the risc tenets are already abandoned.

If they want to expand their ISA for, say, instructions requiring more
registers or longer immediates then they would pretty much have to do so
in 4-byte granules. That could result in lots of pad 0's.

>> As to the cost of 2 dest registers, at the low end for an in-order,
>> single issue pipeline with 1 register write port, taking an extra clock
>> to write a second result does not add any complexity.
>
> At least not much.
>
> But what's the difference from
>
>
> add s0 = a, b
> sltu carry0 = s0,a
>
> which also takes two cycles? Or do you have a three-input
> add-with-carry? That would certainly solve the major problem that
> MIPS, Alpha, and RISC-V have for this stuff, which is dealing with the
> carry-in.

Yeah, I crossed the Rubicon and went with 2 and 3 input, wide output.
It's not just add. There are so many other instructions that compute
most or all of a wide result, then throw the high part away.
Then a second instruction repeats the first but throws the low part away.

Better to have a single instruction that does what the customer wants.
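
For example, the 64x64->128 multiply in C (GCC/Clang __int128 extension):
one conceptual operation, but a 1-dest ISA splits it into two instructions
that each redo most of the same partial products:

#include <stdint.h>

void mul_64x64_128(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi)
{
    unsigned __int128 p = (unsigned __int128)a * b;  /* one full product */
    *lo = (uint64_t)p;           /* MUL   lo, a, b  -- keep the low half   */
    *hi = (uint64_t)(p >> 64);   /* MULHU hi, a, b  -- repeat for the high */
}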

>> For a multi-lane OoO uArch my concern was for Rename to NOT require
>> two register allocation units for each lane, adding greatly to the cost
>> and complexity, and which would mostly sit unused burning power.
>>
>> My solution is for Decode to spit out 2-dest instructions as two 1-dest uOps
>> split across two consecutive lanes. Rename allocates the 1-dest registers
>> to uOps as normal, then fuses the two pieces and outputs a single uOp to
>> one output lane to Dispatch.
>
> I wonder if that's what ARM are doing for their multi-result
> instructions (load-pair with pre/post-indexed addressing mode consumes
> three write ports).
>
> - anton

Hard to guess without knowing more about the uArch as different design
approaches, separate future register file (FRF) (aka ROB) and
architecture register file (ARF), or merged physical register file (PRF),
have different allocation, free, and renaming mechanisms.
And this all interacts with branch checkpoint/rollback,
how uOp scheduling and launch works, etc.

Re: Faster Sort Algorithms

<tLIhM.33348$fZx2.29907@fx14.iad>

 by: Scott Lurndal - Mon, 12 Jun 2023 17:39 UTC

EricP <ThatWouldBeTelling@thevillage.com> writes:
>Anton Ertl wrote:
>> EricP <ThatWouldBeTelling@thevillage.com> writes:
>>> I was driven to instructions with 2-dest results by my desire to
>>> (a) not have condition codes and (b) wanting to handle add with carry
>>> in a straight forward way.
>>
>> I have expanded my answer to that problem into a paper:
>> <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>.
>>
>> Of course, the question is whether you consider this answer
>> straightforward.
>>
>>> Variable length instructions opens the ISA to much useful functionality
>>> like full size immediate operands but requires a more complicated fetch
>>> unit than fixed size. This cost is generally accepted today.
>>
>> Counterexample: ARM A64 (and ARM has ample experience with
>> variable-size instructions).
>
>Yes, it currently uses aligned 4-byte instruction words.
>It also has condition codes, address modes like pre/post index,
>load/store pair. So the risc tenets are already abandoned.
>
>If they want to expand their ISA for, say, instructions requiring more
>registers or longer immediates then they would pretty much have to do so
>in 4-byte granules. That could result in lots of pad 0's.

They already have instructions that require eight registers. They're
just required to be numbered consecutively.

Re: Faster Sort Algorithms

<ce03b6dd-4bb6-4de5-b98f-42b17987a010n@googlegroups.com>

 by: MitchAlsup - Mon, 12 Jun 2023 18:03 UTC

On Monday, June 12, 2023 at 12:12:37 PM UTC-5, EricP wrote:
> Anton Ertl wrote:
> > EricP <ThatWould...@thevillage.com> writes:
> >> I was driven to instructions with 2-dest results by my desire to
> >> (a) not have condition codes and (b) wanting to handle add with carry
> >> in a straight forward way.
> >
> > I have expanded my answer to that problem into a paper:
> > <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>.
> >
> > Of course, the question is whether you consider this answer
> > straightforward.
> >
> >> Variable length instructions opens the ISA to much useful functionality
> >> like full size immediate operands but requires a more complicated fetch
> >> unit than fixed size. This cost is generally accepted today.
> >
> > Counterexample: ARM A64 (and ARM has ample experience with
> > variable-size instructions).
> Yes, it currently uses aligned 4-byte instruction words.
> It also has condition codes, address modes like pre/post index,
> load/store pair. So the risc tenets are already abandoned.
>
> If they want to expand their ISA for, say, instructions requiring more
> registers or longer immediates then they would pretty much have to do so
> in 4-byte granules. That could result in lots of pad 0's.
> >> As to the cost of 2 dest registers, at the low end for an in-order,
> >> single issue pipeline with 1 register write port, taking an extra clock
> >> to write a second result does not add any complexity.
> >
> > At least not much.
> >
> > But what's the difference from
> >
> >
> > add s0 = a, b
> > sltu carry0 = s0,a
> >
> > which also takes two cycles? Or do you have a three-input
> > add-with-carry? That would certainly solve the major problem that
> > MIPS, Alpha, and RISC-V have for this stuff, which is dealing with the
> > carry-in.
> Yeah, I crossed the Rubicon and went with 2 and 3 input, wide output.
> It's not just add. There are so many other instructions that compute
> most or all of a wide result, then throw the high part away.
> Then a second instruction repeats the first but throws the low part away.
>
> Better to have a single instruction that does what the customer wants.
> >> For a multi-lane OoO uArch my concern was for Rename to NOT require
> >> two register allocation units for each lane, adding greatly to the cost
> >> and complexity, and which would mostly sit unused burning power.
> >>
> >> My solution is for Decode to spit out 2-dest instructions as two 1-dest uOps
> >> split across two consecutive lanes. Rename allocates the 1-dest registers
> >> to uOps as normal, then fuses the two pieces and outputs a single uOp to
> >> one output lane to Dispatch.
> >
> > I wonder if that's what ARM are doing for their multi-result
> > instructions (load-pair with pre/post-indexed addressing mode consumes
> > three write ports).
> >
> > - anton
<
> Hard to guess without knowing more about the uArch as different design
> approaches, separate future register file (FRF) (aka ROB) and
> architecture register file (ARF), or merged physical register file (PRF),
> have different allocation, free, and renaming mechanisms.
> And this all interacts with branch checkpoint/rollback,
> how uOp scheduling and launch works, etc.
<
Mc 88120 used a PRF with logical CAMs for the architectural registers,
regular decoders for the physical registers, and a history table of CAM
valid bits to backup the architectural "names" and for allocation of
registers. We could back the mispredicted branch up and be inserting
instructions at the actual target back into the window in the same clock.
....................

Re: Faster Sort Algorithms

<rkJhM.5473$x7Q8.4240@fx06.iad>

 by: EricP - Mon, 12 Jun 2023 18:18 UTC

robf...@gmail.com wrote:
> On Sunday, June 11, 2023 at 10:57:39 AM UTC-4, Anton Ertl wrote:
>> EricP <ThatWould...@thevillage.com> writes:
>>> I was driven to instructions with 2-dest results by my desire to
>>> (a) not have condition codes and (b) wanting to handle add with carry
>>> in a straight forward way.
>> I have expanded my answer to that problem into a paper:
>> <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>.
>>
>> Of course, the question is whether you consider this answer
>> straightforward.
>>> Variable length instructions opens the ISA to much useful functionality
>>> like full size immediate operands but requires a more complicated fetch
>>> unit than fixed size. This cost is generally accepted today.
>> Counterexample: ARM A64 (and ARM has ample experience with
>> variable-size instructions).
>
> I just decided not too long ago to go with a fixed-size instruction because
> variable length calculations ended up on the critical timing path for an FPGA
> design. Postfixes can effectively extend the instruction with constants.

You say elsewhere that yours is a 40 bit fixed instruction,
presumably byte aligned. That byte alignment kinda nullifies
most of the advantages of fixed size.

The savings for fixed size come from being aligned on a nice boundary,
these days usually 4 bytes.

Savings come from not having to deal with instruction straddles:
fetch buffer straddles, line straddles, page straddles.
Straddles require extra sequencing logic, machine states, clocks,
stalls because it only has half of something, dealing with stall aborts.

Then there is the instruction alignment shifter. For 4-byte aligned
instructions and 16-byte fetch buffers that's 32 * 4:1 muxes while
byte aligned is 8 * 16:1 muxes plus 8 * 5:1 muxes to deal with
partial fetched straddles.

And since handling the 4-byte aligned instructions is so simple
they might be run directly into the decoder from the fetch buffer,
combining fetch and decode into a single stage for a shorter pipeline.
Whereas the byte aligned likely requires a separate fetch stage.

All in all, it's just another brick in the wall.

Variable length needs all of the above too, but gets flexibility
to add more kinds of instructions in return, although that does
lay more complexity pressure onto the decoder too.

>>> For a multi-lane OoO uArch my concern was for Rename to NOT require
>>> two register allocation units for each lane, adding greatly to the cost
>>> and complexity, and which would mostly sit unused burning power.
>>>
>>> My solution is for Decode to spit out 2-dest instructions as two 1-dest uOps
>>> split across two consecutive lanes. Rename allocates the 1-dest registers
>>> to uOps as normal, then fuses the two pieces and outputs a single uOp to
>>> one output lane to Dispatch.
>> I wonder if that's what ARM are doing for their multi-result
>> instructions (load-pair with pre/post-indexed addressing mode consumes
>> three write ports).
>>
> An idea I have been toying with is for some operations caching the operands
> and results. So, if a divide is done the remainder is available immediately
> without having to do a full divide. Works for multiply high too. I think it should
> work well with reciprocal calculations too, where the same reciprocal is used
> in a loop for instance.
>
> To deal with extended precision ADD in the past I have used a three operand
> ADD, combined with a carry generate instruction. The carry generate
> instruction takes the same operands as the add instruction but returns the
> carry out instead of the sum. It can then be added in the next instruction. It
> is a bit ugly but if one really wants to do multi-word arithmetic it is possible.
> I think the ADD and the carry-generate can be done in the same clock cycle
> in a superscalar CPU.

That's what I did at first too. Sticking with the risc tenet of 2 source,
1 dest registers, I had ADD for the low part and ADDH for the high carry out.

The 2 sources means each of those is really a half adder so it takes
4 instructions to do a full add with carry-in and carry-out.
Breaking with purist risc and allowing 3 source, 1 dest registers
keeps it to the two ADD and ADDH instructions.
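
Spelled out in plain C for one limb of a multiword add (a sketch; each
statement maps to roughly one RISC instruction):

#include <stdint.h>

/* returns a + b + cin with carry out; cin is 0 or 1 */
uint64_t add_limb(uint64_t a, uint64_t b, uint64_t cin, uint64_t *cout)
{
    uint64_t s1 = a + b;      /* ADD                          */
    uint64_t c1 = s1 < a;     /* carry out of a + b           */
    uint64_t s  = s1 + cin;   /* ADD                          */
    uint64_t c2 = s < s1;     /* carry out of adding cin      */
    *cout = c1 | c2;          /* at most one of c1, c2 is set */
    return s;
}
/* with 3-source ops: ADD s,a,b,cin and ADDH cout,a,b,cin -- two instructions */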

Those xxxH instructions show up all over the place. MULHU, MULHS.
The 3 source allows some wide shift and bit field instructions,
but the 1 dest restriction still means you have both xxx and xxxH variants.

And in many of them you are doing most or all the full work
then throwing away either the top or bottom half.
Then repeating it immediately for the other half.

All of that goes back to the 1-dest restriction and the assumption
that doing otherwise adds too much complexity.

Re: Faster Sort Algorithms

<df4c26ca-3788-447f-98d3-fb291b2ac233n@googlegroups.com>

 by: MitchAlsup - Mon, 12 Jun 2023 20:24 UTC

On Monday, June 12, 2023 at 1:19:07 PM UTC-5, EricP wrote:
> robf...@gmail.com wrote:
> > On Sunday, June 11, 2023 at 10:57:39 AM UTC-4, Anton Ertl wrote:
> >> EricP <ThatWould...@thevillage.com> writes:
> >>> I was driven to instructions with 2-dest results by my desire to
> >>> (a) not have condition codes and (b) wanting to handle add with carry
> >>> in a straight forward way.
> >> I have expanded my answer to that problem into a paper:
> >> <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>.
> >>
> >> Of course, the question is whether you consider this answer
> >> straightforward.
> >>> Variable length instructions opens the ISA to much useful functionality
> >>> like full size immediate operands but requires a more complicated fetch
> >>> unit than fixed size. This cost is generally accepted today.
> >> Counterexample: ARM A64 (and ARM has ample experience with
> >> variable-size instructions).
> >
> > I just decided not too long ago to go with a fixed-size instruction because
> > variable length calculations ended up on the critical timing path for an FPGA
> > design. Postfixes can effectively extend the instruction with constants.
> You say elsewhere that yours is a 40 bit fixed instruction,
> presumably byte aligned. That byte alignment kinda nullifies
> most of the advantages of fixed size.
>
> The savings for fixed size come from being aligned on a nice boundary,
> these days usually 4 bytes.
>
> Savings come from not having to deal with instruction straddles:
> fetch buffer straddles, line straddles, page straddles.
> Straddles require extra sequencing logic, machine states, clocks,
> stalls because it only has half of something, dealing with stall aborts.
>
> Then there is the instruction alignment shifter. For 4-byte aligned
> instructions and 16-byte fetch buffers that's 32 * 4:1 muxes while
> byte aligned is 8 * 16:1 muxes plus 8 * 5:1 muxes to deal with
> partial fetched straddles.
>
> And since handling the 4-byte aligned instructions is so simple
> they might be run directly into the decoder from the fetch buffer,
> combining fetch and decode into a single stage for a shorter pipeline.
> Whereas the byte aligned likely requires a separate fetch stage.
>
> All in all, it's just another brick in the wall.
>
> Variable length needs all of the above too, but gets flexibility
> to add more kinds of instructions in return, although that does
> lay more complexity pressure onto the decoder too.
> >>> For a multi-lane OoO uArch my concern was for Rename to NOT require
> >>> two register allocation units for each lane, adding greatly to the cost
> >>> and complexity, and which would mostly sit unused burning power.
> >>>
> >>> My solution is for Decode to spit out 2-dest instructions as two 1-dest uOps
> >>> split across two consecutive lanes. Rename allocates the 1-dest registers
> >>> to uOps as normal, then fuses the two pieces and outputs a single uOp to
> >>> one output lane to Dispatch.
> >> I wonder if that's what ARM are doing for their multi-result
> >> instructions (load-pair with pre/post-indexed addressing mode consumes
> >> three write ports).
> >>
> > An idea I have been toying with is for some operations caching the operands
> > and results. So, if a divide is done the remainder is available immediately
> > without having to do a full divide. Works for multiply high too. I think it should
> > work well with reciprocal calculations too, where the same reciprocal is used
> > in a loop for instance.
> >
> > To deal with extended precision ADD in the past I have used a three operand
> > ADD, combined with a carry generate instruction. The carry generate
> > instruction takes the same operands as the add instruction but returns the
> > carry out instead of the sum. It can then be added in the next instruction. It
> > is a bit ugly but if one really wants to do multi-word arithmetic it is possible.
> > I think the ADD and the carry-generate can be done in the same clock cycle
> > in a superscalar CPU.
> That's what I did at first too. Sticking with the risc tenet of 2 source,
> 1 dest registers, I had ADD for the low part and ADDH for the high carry out.
>
> The 2 sources means each of those is really a half adder so it takes
> 4 instructions to do a full add with carry-in and carry-out.
> Breaking with purist risc and allowing 3 source, 1 dest registers
> keeps it to the two ADD and ADDH instructions.
<
CARRY solves these problems, the carry-bus solves the routing problem.
Most of the time, the extra register does not need to be read at the
beginning or written at the end...............
>
> Those xxxH instructions show up all over the place. MULHU, MULHS.
> The 3 source allows some wide shift and bit field instructions,
> but the 1 dest restriction still means you have both xxx and xxxH variants.
>
> And in many of them you are doing most or all the full work
> then throwing away either the top or bottom half.
> Then repeating it immediately for the other half.
>
> All of that goes back to the 1-dest restriction and the assumption
> that doing otherwise adds too much complexity.

Re: Faster Sort Algorithms

<u681a9$2v4pe$1@newsreader4.netcologne.de>

 by: Thomas Koenig - Mon, 12 Jun 2023 21:05 UTC

Hmm.. a couple of thoughts.

Assume we have the three-operand instructions

MIN Rt,Ra,Rb,Rc
MID Rt,Ra,Rb,Rc
MAX Rt,Ra,Rb,Rc

which would, as the name suggests, select the minimum, the middle value,
and the maximum of three values. Are these instructions which
could reasonably be done in a single cycle, and issued in parallel
on an OoO machine? That should certainly speed up three-way
sorting, if this is reasonable.
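
In C, the semantics I have in mind would be something like this (sketch
only; the xor line works because {min, mid, max} is a permutation of
{a, b, c}, so the other four terms cancel):

#include <stdint.h>

static int64_t min3(int64_t a, int64_t b, int64_t c)
{ int64_t m = a < b ? a : b; return m < c ? m : c; }

static int64_t max3(int64_t a, int64_t b, int64_t c)
{ int64_t m = a > b ? a : b; return m > c ? m : c; }

static int64_t mid3(int64_t a, int64_t b, int64_t c)
{ return a ^ b ^ c ^ min3(a, b, c) ^ max3(a, b, c); }

Each one is only a couple of compares and selects deep, for what
that is worth.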

Another idea, potentially even wilder (and probably worse :-)
A "load double and sort" instruction, so

LS Rmin,Rmax,[Rs]

would load the maximum of [Rs] and [Rs+offs] into Rmax and the
minimum of [Rs] and [Rs+offs], with offs being the size of the
data loaded.

Comments? Too horrible even to think about? Has been tried and
does not work because...?

Re: Faster Sort Algorithms

<5UMhM.36287$fZx2.13431@fx14.iad>

 by: EricP - Mon, 12 Jun 2023 22:21 UTC

Thomas Koenig wrote:
> Hmm.. a couple of thoughts.
>
> Assume we have the three-operand instructions
>
> MIN Rt,Ra,Rb,Rc
> MID Rt,Ra,Rb,Rc
> MAX Rt,Ra,Rb,Rc
>
> which would, as the name suggests, select the minimum, the middle value,
> and the maximum of three values. Are these instructions which
> could reasonably be done in a single cycle, and issued in parallel
> on an OoO machine? That should certainly speed up three-way
> sorting, if this is reasonable.
>
> Another idea, potentially even wilder (and probably worse :-)
> A "load double and sort" instruction, so
>
> LS Rmin,Rmax,[Rs]
>
> would load the maximum of [Rs] and [Rs+offs] into Rmax and the
> minimum of [Rs] and [Rs+offs], with offs being the size of the
> data loaded.
>
> Comments? Too horrible even to think about? Has been tried and
> does not work because...?
>

Many sorts operate not just on an integer vector but also on associated tag values.
That is,

Sort (a, b, taga, tagb)
  if a > b then
    swap(a,b)
    swap(taga,tagb)
  end if
end Sort

The CMP, CMOV, CMOV way can handle tags, as can CMP CSWAP.
But the MIN and MAX approach can't.
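
For contrast, a branchless compare-and-swap that drags the tags along,
as a C sketch (xor/mask tricks standing in for CMP + CMOV selects):

#include <stdint.h>

/* order (a,b) ascending and keep each tag paired with its key */
static void cswap_tagged(uint64_t *a, uint64_t *b,
                         uint64_t *ta, uint64_t *tb)
{
    uint64_t mask = 0 - (uint64_t)(*a > *b);  /* all ones iff we swap */
    uint64_t d  = (*a ^ *b) & mask;           /* xor-swap, no branch  */
    *a ^= d;   *b ^= d;
    uint64_t dt = (*ta ^ *tb) & mask;
    *ta ^= dt;  *tb ^= dt;
}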

Re: Faster Sort Algorithms

<d8406fa3-b88e-4ba5-885e-ed03b3cdd1b6n@googlegroups.com>

 by: MitchAlsup - Mon, 12 Jun 2023 22:30 UTC

On Monday, June 12, 2023 at 4:05:16 PM UTC-5, Thomas Koenig wrote:
> Hmm.. a couple of thoughts.
>
> Assume we have the three-operand instructions
>
> MIN Rt,Ra,Rb,Rc
> MID Rt,Ra,Rb,Rc
> MAX Rt,Ra,Rb,Rc
<
In the 1980s and very early '90s, some designers thought about using
CAM sort arrays* in hardware. Software would load up a number of
values and write them into the "sorter", after loading all values, soft-
ware would read out all the value in order. No instruction was needed
to cause the sorter to sort, the writing of values in performed sorting.
<
These never went anywhere--draw your own conclusions.
<
(*) CAMs can be made to do things other than exact match detection.
>
> which would, as the name suggests, select the minimum, the middle value,
> and the maximum of three values. Are these instructions which
> could reasonably be done in a single cycle, and issued in parallel
> on an OoO machine? That should certainly speed up three-way
> sorting, if this is reasonable.
<
Yours is only the 3-operand version, the array I mentioned above
was 32 entry or 64 entry (it's been too long).
>
> Another idea, potentially even wilder (and probably worse :-)
> A "load double and sort" instruction, so
>
> LS Rmin,Rmax,[Rs]
>
> would load the maximum of [Rs] and [Rs+offs] into Rmax and the
> minimum of [Rs] and [Rs+offs], with offs being the size of the
> data loaded.
>
> Comments? Too horrible even to think about? Has been tried and
> does not work because...?
<
I, personally, do not think it a good idea to add a comparison stage
to the already difficult AGEN-CACHE-ALIGN portion of the memory
pipeline.

Re: Faster Sort Algorithms

<92f105ce-b9d9-4779-b0f3-9400509af7adn@googlegroups.com>

 by: MitchAlsup - Mon, 12 Jun 2023 22:36 UTC

On Monday, June 12, 2023 at 5:21:57 PM UTC-5, EricP wrote:
> Thomas Koenig wrote:
> > Hmm.. a couple of thoughts.
> >
> > Assume we have the three-operand instructions
> >
> > MIN Rt,Ra,Rb,Rc
> > MID Rt,Ra,Rb,Rc
> > MAX Rt,Ra,Rb,Rc
> >
> > which would, as the name suggests, select the minimum, the middle value,
> > and the maximum of three values. Are these instructions which
> > could reasonably be done in a single cycle, and issued in parallel
> > on an OoO machine? That should certainly speed up three-way
> > sorting, if this is reasonable.
> >
> > Another idea, potentially even wilder (and probably worse :-)
> > A "load double and sort" instruction, so
> >
> > LS Rmin,Rmax,[Rs]
> >
> > would load the maximum of [Rs] and [Rs+offs] into Rmax and the
> > minimum of [Rs] and [Rs+offs], with offs being the size of the
> > data loaded.
> >
> > Comments? Too horrible even to think about? Has been tried and
> > does not work because...?
> >
> Many sorts operate not just on an integer vector but also on associated tag values.
> That is,
> Sort (a, b, taga, tagb)
>   if a > b then
>     swap(a,b)
>     swap(taga,tagb)
>   end if
> end Sort
<
Many sort algorithms allow the user to supply the subroutine that determines
the sort order::
<
SORT( A, length, comparison )
{
for( j = 0; j < length-1; j++ )
    for( i = j+1; i < length; i++ )
        if( comparison( A[i], A[j] ) )
            SWAP( A[i], A[j] );
}
<
Yes, I know this is not a good sort routine.
>
> The CMP, CMOV, CMOV way can handle tags, as can CMP CSWAP.
> But the MIN and MAX approach can't.

Re: Faster Sort Algorithms

<fb1ec155-b47b-4b47-a569-901418d6d6bcn@googlegroups.com>

 by: robf...@gmail.com - Tue, 13 Jun 2023 06:09 UTC

On Monday, June 12, 2023 at 2:19:07 PM UTC-4, EricP wrote:
> robf...@gmail.com wrote:
> > On Sunday, June 11, 2023 at 10:57:39 AM UTC-4, Anton Ertl wrote:
> >> EricP <ThatWould...@thevillage.com> writes:
> >>> I was driven to instructions with 2-dest results by my desire to
> >>> (a) not have condition codes and (b) wanting to handle add with carry
> >>> in a straight forward way.
> >> I have expanded my answer to that problem into a paper:
> >> <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>.
> >>
> >> Of course, the question is whether you consider this answer
> >> straightforward.
> >>> Variable length instructions opens the ISA to much useful functionality
> >>> like full size immediate operands but requires a more complicated fetch
> >>> unit than fixed size. This cost is generally accepted today.
> >> Counterexample: ARM A64 (and ARM has ample experience with
> >> variable-size instructions).
> >
> > I just decided not too long ago to go with a fixed-size instruction because
> > variable-length calculations ended up on the critical timing path for an FPGA
> > design. Postfixes can effectively extend the instruction with constants.
> You say elsewhere that yours is a 40 bit fixed instruction,
> presumably byte aligned. That byte alignment kinda nullifies
> most of the advantages of fixed size.

Yes.
>
> The savings for fixed size come from being aligned on a nice boundary,
> these days usually 4 bytes.

I think this may make it easier to hack things. There is less entropy.

> Savings come from not having to deal with instruction straddles:
> fetch buffer straddles, line straddles, page straddles.
> Straddles require extra sequencing logic, machine states, clocks,
> stalls because it only has half of something, dealing with stall aborts.

Straddles can be handled by using even/odd pairs of cache lines and fetching
both at the same time: no sequencing logic, state machines, or stalls.

> Then there is instruction alignment shifter. For 4-byte aligned
> instructions and 16-byte fetch buffers that's a 32 * 4:1 muxes while
> byte aligned is 8 * 16:1 muxes plus 8 * 5:1 muxes to deal with
> partial fetched straddles.
>
> And since handling the 4-byte aligned instructions is so simple
> they might be run directly into the decoder from the fetch buffer,
> combining fetch and decode into a single stage for a shorter pipeline.
> Whereas the byte aligned likely requires a separate fetch stage.
>
> All in all, it's just another brick in the wall.

I have been considering using 128-bit bundles containing three
40-bit instructions in fixed locations. That would get back some of the
benefits of the 4-byte-aligned instructions, but it wastes about 6% of memory.
The I$ could be made 120 bits wide, so there would be no wastage there.
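For concreteness, a C sketch of pulling the three 40-bit slots out of such
a 128-bit bundle (the layout is an assumption for illustration: slot 0 in
bits 0-39, slot 1 in bits 40-79, slot 2 in bits 80-119, 8 bits spare):

#include <stdint.h>

/* b[0] holds bundle bits 0..63, b[1] holds bits 64..127. */
static void unbundle(const uint64_t b[2], uint64_t insn[3])
{
    const uint64_t mask40 = (1ULL << 40) - 1;
    insn[0] =   b[0]                        & mask40;
    insn[1] = ((b[0] >> 40) | (b[1] << 24)) & mask40;
    insn[2] =  (b[1] >> 16)                 & mask40;
}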

>
> Variable length needs all of the above too, but gets flexibility
> to add more kinds of instructions in return, although that does
> lay more complexity pressure onto the decoder too.
> >>> For a multi-lane OoO uArch my concern was for Rename to NOT require
> >>> two register allocation units for each lane, adding greatly to the cost
> >>> and complexity, and which would mostly sit unused burning power.
> >>>
> >>> My solution is for Decode to spit out 2-dest instructions as two 1-dest uOps
> >>> split across two consecutive lanes. Rename allocates the 1-dest registers
> >>> to uOps as normal, then fuses the two pieces and outputs a single uOp to
> >>> one output lane to Dispatch.
> >> I wonder if that's what ARM are doing for their multi-result
> >> instructions (load-pair with pre/post-indexed addressing mode consumes
> >> three write ports).
> >>
> > An idea I have been toying with is caching the operands and results of some
> > operations. Then, if a divide has just been done, the remainder is available
> > immediately without having to do a full divide. It works for multiply-high
> > too, and I think it should work well with reciprocal calculations, where the
> > same reciprocal is used in a loop, for instance.
> >
> > To deal with extended precision ADD in the past I have used a three operand
> > ADD, combined with a carry generate instruction. The carry generate
> > instruction takes the same operands as the add instruction but returns the
> > carry out instead of the sum. It can then be added in the next instruction. It
> > is a bit ugly but if one really wants to do multi-word arithmetic it is possible.
> > I think the ADD and the carry-generate can be done in the same clock cycle
> > in a superscalar CPU.
> That's what I did at first too. Sticking with the risc tenet of 2 source,
> 1 dest registers, I had ADD for the low part and ADDH for the high carry out.
>
> The 2 sources means each of those is really a half adder so it takes
> 4 instructions to do a full add with carry-in and carry-out.
> Breaking with purist risc and allowing 3 source, 1 dest registers
> keeps it to the two ADD and ADDH instructions.
>
> Those xxxH instructions show up all over the place. MULHU, MULHS.
> The 3 source allows some wide shift and bit field instructions,
> but the 1 dest restriction still means you have both xxx and xxxH variants.

Yes, it is ugly and takes a lot of instructions, but it is a simple approach.
There are about 300 instructions in the Thor ISA; I am trying to keep it under
400. I read somewhere that, as with linguistic elements, most people can
remember about 400 different things, for example 400 DOS calls. Go beyond that
and it becomes a lookup nightmare.
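A minimal C sketch of the two-instruction ADD/ADDH idiom discussed above,
assuming 64-bit registers (function and variable names are illustrative):

#include <stdint.h>

/* One limb of a multi-word add: ADD yields the low 64 bits,
   ADDH yields the carry out of the same three-input add. */
static uint64_t add3(uint64_t a, uint64_t b, uint64_t cin,
                     uint64_t *cout)
{
    uint64_t s = a + b;
    uint64_t t = s + cin;           /* cin is 0 or 1 */
    *cout = (s < a) | (t < s);      /* ADDH: at most one carry fires */
    return t;                       /* ADD: low part of the sum */
}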

>
> And in many of them you are doing most or all of the full work
> then throwing away either the top or bottom half.
> Then repeating it immediately for the other half.

The work is not wasted if the results are cached.
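As a toy software model of that caching idea (everything here is
hypothetical, purely to illustrate the intent): remember the last divide's
operands, and the matching remainder then costs nothing:

#include <stdint.h>

/* Last divide's operands and both results. */
static struct { uint64_t n, d, q, r; int valid; } dc;

static uint64_t div_cached(uint64_t n, uint64_t d)
{
    dc.n = n; dc.d = d;
    dc.q = n / d; dc.r = n % d;   /* one full divide fills both */
    dc.valid = 1;
    return dc.q;
}

static uint64_t rem_cached(uint64_t n, uint64_t d)
{
    if (dc.valid && dc.n == n && dc.d == d)
        return dc.r;              /* hit: no second divide */
    return n % d;
}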
>
> All of that goes back to the 1-dest restriction and the assumption
> that doing otherwise adds too much complexity.

Re: Faster Sort Algorithms

<u693iu$2vqpg$1@newsreader4.netcologne.de>

https://news.novabbs.org/devel/article-flat.php?id=32810&group=comp.arch#32810

 by: Thomas Koenig - Tue, 13 Jun 2023 06:50 UTC

MitchAlsup <MitchAlsup@aol.com> schrieb:
> On Monday, June 12, 2023 at 4:05:16 PM UTC-5, Thomas Koenig wrote:
>> Hmm.. a couple of thoughts.
>>
>> Assume we have the three-operand instructions
>>
>> MIN Rt,Ra,Rb,Rc
>> MID Rt,Ra,Rb,Rc
>> MAX Rt,Ra,Rb,Rc
><
> In the 1980s and very early '90s, some designers thought about using
> CAM sort arrays* in hardware. Software would load up a number of
> values and write them into the "sorter"; after loading all values,
> software would read out all the values in order. No instruction was needed
> to cause the sorter to sort: writing the values in performed the sorting.
><
> These never went anywhere--draw your own conclusions.

The pendulum of reintegration swung the other way, quickly.

Although there appears to be a SORTL instruction on zSystem, according to
https://blog.share.org/Article/peeking-under-the-hood-of-sort-acceleration-on-z15
which seems to bring some improvement, so...

><
> (*) CAMs can be made to do things other than exact match detection.

Interesting.

MIN and MAX of three registers are straightforward at two instructions
each, but MID is more challenging. It could be something like

MIN Rtmp1,Ra,Rb
MIN Rtmp2,Ra,Rc
MIN Rtmp3,Rb,Rc
MAX Rtmp1,Rtmp1,Rtmp2
MAX Rtmp1,Rtmp1,Rtmp3

with (assuming single-cycle latency) anything between three
(for a sufficiently wide OoO machine) and five cycles of latency.

The last two instructions would be a three-way MAX, so this
could be shortened to

MIN Rtmp1,Ra,Rb
MIN Rtmp2,Ra,Rc
MIN Rtmp3,Rb,Rc
MAX Rtmp1,Rtmp1,Rtmp2,Rtmp3

in that case, for between two and four cycles (assuming that
three-way MAX is a single cycle). For OoO, MID is then probably
not worth it, especially if it cannot be done in a single
cycle.

Three-way sort could then be

MIN Rmin,Ra,Rb,Rc
MAX Rmax,Ra,Rb,Rc
MIN Rtmp1,Ra,Rb
MIN Rtmp2,Ra,Rc
MIN Rtmp3,Rb,Rc
MAX Rmid,Rtmp1,Rtmp2,Rtmp3

so between two and six cycles.
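The same decomposition as a C sketch (scalar code, with min/max written
out as ?: so the mapping onto MIN/MAX instructions stays visible):

#include <stdint.h>

/* MID of three values = MAX of the three pairwise MINs. */
static int64_t mid3(int64_t a, int64_t b, int64_t c)
{
    int64_t ab = a < b ? a : b;     /* MIN Rtmp1,Ra,Rb */
    int64_t ac = a < c ? a : c;     /* MIN Rtmp2,Ra,Rc */
    int64_t bc = b < c ? b : c;     /* MIN Rtmp3,Rb,Rc */
    int64_t m  = ab > ac ? ab : ac; /* three-way MAX,  */
    return m > bc ? m : bc;         /* here as two     */
}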

Re: Faster Sort Algorithms

<2023Jun13.100433@mips.complang.tuwien.ac.at>

https://news.novabbs.org/devel/article-flat.php?id=32812&group=comp.arch#32812

 by: Anton Ertl - Tue, 13 Jun 2023 08:04 UTC

EricP <ThatWouldBeTelling@thevillage.com> writes:
[ARM A64]
>It also has condition codes, address modes like pre/post index,
>load/store pair. So the risc tenets are already abandoned.

You mean the MIPS tenets. Condition codes are in SPARC (and, I guess,
Berkeley RISC), PA-RISC, and Power(PC). Autoincrement addressing
modes (however they are called) are in PA-RISC and Power(PC).
Load/Store multiple are in Power(PC).

The biggest deviation from classical RISC principles here seems to be
load/store pair (or multiple), because it means that several cache
lines and several pages can be accessed by one load or store. But
given that unaligned accesses have won (even RV32I, the bottom tier of
RISC-V, supports misaligned accesses), any load can access two
cache lines and two pages, so load/store pair is no additional burden
apart from the register port issue, which is what we are discussing
here. The load/store pair instructions certainly are a good way to
make the most of the expensive load/store units without incurring the
problems of load/store multiple.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Faster Sort Algorithms

<u699ps$3hl0p$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=32813&group=comp.arch#32813

 by: Terje Mathisen - Tue, 13 Jun 2023 08:36 UTC

Thomas Koenig wrote:
> Hmm.. a couple of thoughts.
>
> Assume we have the three-operand instructions
>
> MIN Rt,Ra,Rb,Rc
> MID Rt,Ra,Rb,Rc
> MAX Rt,Ra,Rb,Rc
>
> which would, as the name suggests, select the minimum, the midpoint
> and the maximum of three values. Are these instructions which
> could reasonably be done in a single cycle, and issued in parallel
> on an OoO machine? That should certainly speed up three-way
> sorting, if this is reasonable.
>
> Another idea, potentially even wilder (and probably worse :-)
> A "load double and sort" instruction, so
>
> LS Rmin,Rmax,[Rs]
>
> would load the maximum of [Rs] and [Rs+offs] into Rmax and the
> minimum into Rmin, with offs being the size of the
> data loaded.
>
> Comments? Too horrible even to think about? Has been tried and
> does not work because...?
>
When writing SW emulation code for FADD/FSUB it is really nice to have
an instruction which takes two FP variables and compares/sorts them by
magnitude...

I.e. ignore the top bit, then do a compare on exp+mantissa and return
with (A,B) where |A| >= |B|

At this point it becomes easy to handle the rest of the algorithm.
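A C sketch of that helper, under the usual IEEE-754 assumption that with
the sign bit masked off, the remaining exponent+mantissa bits order
(non-NaN) doubles by magnitude:

#include <stdint.h>
#include <string.h>

/* Order two doubles so that |*a| >= |*b|. */
static void sort_by_magnitude(double *a, double *b)
{
    uint64_t ua, ub;
    memcpy(&ua, a, sizeof ua);    /* grab the raw bit patterns */
    memcpy(&ub, b, sizeof ub);
    if ((ua & ~(1ULL << 63)) < (ub & ~(1ULL << 63))) {
        double t = *a; *a = *b; *b = t;
    }
}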

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Faster Sort Algorithms

<u69a6e$3hmh5$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=32814&group=comp.arch#32814

 by: Terje Mathisen - Tue, 13 Jun 2023 08:42 UTC

EricP wrote:
> Thomas Koenig wrote:
>> Hmm.. a couple of thoughts.
>>
>> Assume we have the three-operand instructions
>>
>>     MIN    Rt,Ra,Rb,Rc
>>     MID    Rt,Ra,Rb,Rc
>>     MAX    Rt,Ra,Rb,Rc
>>
>> which would, as the name suggests, select the minimum, the midpoint
>> and the maximum of three values.  Are these instructions which
>> could reasonably be done in a single cycle, and issued in parallel
>> on an OoO machine?  That should certainly speed up three-way
>> sorting, if this is reasonable.
>>
>> Another idea, potentially even wilder (and probably worse :-)
>> A "load double and sort" instruction, so
>>
>>     LS    Rmin,Rmax,[Rs]
>>
>> would load the maximum of [Rs] and [Rs+offs] into Rmax and the
>> minimum into Rmin, with offs being the size of the
>> data loaded.
>>
>> Comments?  Too horrible even to think about?  Has been tried and
>> does not work because...?
>>
>
> Many sorts are not just for an integer vector but also associated tag
> values.
> That is,
>   Sort (a, b, taga, tagb)
>     if a > b then
>       swap(a,b)
>       swap(taga,tagb)
>     end if
>   end Sort
>
> The CMP, CMOV, CMOV way can handle tags, as can CMP CSWAP.
> But MIN and MAX approach can't.
>
I've previously looked at doing an approximate sort with the (possibly
truncated) key in the top part of a 64-bit container and the index of
the full record in the lower part. (The splitting point could vary
depending upon how many items you need to sort and/or the key size.)

This would be a stable sort when the full keys fit; for larger keys I
would need a final mini-sort for each stretch of identical extracted
keys. The key here being that MIN/MAX would work very well.
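A one-line C sketch of that packing (a 32/32 split is assumed here purely
for illustration; the split point can move as described):

#include <stdint.h>

/* Truncated key in the top 32 bits, record index in the bottom 32,
   so plain unsigned compares (and MIN/MAX) order by key first. */
static uint64_t pack_key_index(uint32_t key, uint32_t index)
{
    return ((uint64_t)key << 32) | index;
}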

For very wide keys, like text strings with potentially very long
identical prefixes, I would need a better algorithm!

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Faster Sort Algorithms

<u69b9b$3hpje$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=32816&group=comp.arch#32816

 by: Terje Mathisen - Tue, 13 Jun 2023 09:01 UTC

MitchAlsup wrote:
> On Monday, June 12, 2023 at 4:05:16 PM UTC-5, Thomas Koenig wrote:
>> Hmm.. a couple of thoughts.
>>
>> Assume we have the three-operand instructions
>>
>> MIN Rt,Ra,Rb,Rc
>> MID Rt,Ra,Rb,Rc
>> MAX Rt,Ra,Rb,Rc
> <
> In the 1980s and very early '90s, some designers thought about using
> CAM sort arrays* in hardware. Software would load up a number of
> values and write them into the "sorter"; after loading all values,
> software would read out all the values in order. No instruction was needed
> to cause the sorter to sort: writing the values in performed the sorting.
> <
> These never went anywhere--draw your own conclusions.

Oh, but the sorting chip was actually produced, and was supposed to be
the "secret sauce" behind FAST (which Microsoft acquired for $2B, and is
still the core of Bing search).

You are absolutely correct that it never actually went anywhere, because
FAST had some very smart people who quickly figured out that this was a
SW and not a HW problem.

> <
> (*) CAMs can be made to do things other than exact match detection.
>>
>> which would, as the name suggests, select the minimum, the midpoint
>> and the maximum of three values. Are these instructions which
>> could reasonably be done in a single cycle, and issued in parallel
>> on an OoO machine? That should certainly speed up three-way
>> sorting, if this is reasonable.
> <
> Yours is only the 3-operand version; the array I mentioned above
> was 32 entries or 64 entries (it's been too long).

IBM had an earlier patent for a ladder layout where each item would
compare itself with its horizontal neighbor, then (during load) the
loser would drop down to the level below, making room for the incoming
loser from the layer above. On the next (bus) cycle after loading had
finished with the last item, the process would reverse and the winner
would be produced from the top, with each layer winner bubbling up on
each cycle.

I.e. N bus cycles to load N items, then N more cycles to return the
sorted data.

IBM also decided to not produce this chip.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Faster Sort Algorithms

<u69eml$3i5rr$1@dont-email.me>

https://news.novabbs.org/devel/article-flat.php?id=32817&group=comp.arch#32817

 by: Terje Mathisen - Tue, 13 Jun 2023 09:59 UTC

Mea Culpa!

I misremembered the FAST chip: It was of course intended for Search, not
Sort, returning the N best matches, but still a dead end.

Terje Mathisen wrote:
> MitchAlsup wrote:
>> On Monday, June 12, 2023 at 4:05:16 PM UTC-5, Thomas Koenig wrote:
>>> Hmm.. a couple of thoughts.
>>>
>>> Assume we have the three-operand instructions
>>>
>>> MIN Rt,Ra,Rb,Rc
>>> MID Rt,Ra,Rb,Rc
>>> MAX Rt,Ra,Rb,Rc
>> <
>> In the 1980s and very early '90s, some designers thought about using
>> CAM sort arrays* in hardware. Software would load up a number of
>> values and write them into the "sorter"; after loading all values,
>> software would read out all the values in order. No instruction was needed
>> to cause the sorter to sort: writing the values in performed the sorting.
>> <
>> These never went anywhere--draw your own conclusions.
>
> Oh, but the sorting chip was actually produced, and was supposed to be
> the "secret sauce" behind FAST (which Microsoft acquired for $2B, and is
> still the core of Bing search).
>
> You are absolutely correct that it never actually went anywhere, because
> FAST had some very smart people who quickly figured out that this was a
> SW and not a HW problem.
>
>> <
>> (*) CAMs can be made to do things other than exact match detection.
>>>
>>> which would, as the name suggests, select the minimum, the midpoint
>>> and the maximum of three values. Are these instructions which
>>> could reasonably be done in a single cycle, and issued in parallel
>>> on an OoO machine? That should certainly speed up three-way
>>> sorting, if this is reasonable.
>> <
>> Yours is only the 3-operand version; the array I mentioned above
>> was 32 entries or 64 entries (it's been too long).
>
> IBM had an earlier patent for a ladder layout where each item would
> compare itself with its horizontal neighbor, then (during load) the
> loser would drop down to the level below, making room for the incoming
> loser from the layer above. On the next (bus) cycle after loading had
> finished with the last item, the process would reverse and the winner
> would be produced from the top, with each layer winner bubbling up on
> each cycle.
>
> I.e. N bus cycles to load N items, then N more cycles to return the
> sorted data.
>
> IBM also decided to not produce this chip.
>
> Terje
>
>

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Faster Sort Algorithms

<u69fjl$300kh$1@newsreader4.netcologne.de>

https://news.novabbs.org/devel/article-flat.php?id=32818&group=comp.arch#32818

 by: Thomas Koenig - Tue, 13 Jun 2023 10:15 UTC

MitchAlsup <MitchAlsup@aol.com> schrieb:

> Many sort algorithms allow the user to supply the subroutine that determines
> the sort order::
><
> SORT( A, length, comparison )
> {
>     for( j = 0; j < length-1; j++ )
>         for( i = 0; i < length; i++ )
>             if( comparison( A[i], A[j] ) )
>                 SWAP( A[i], A[j] );
> }

An argument for either templates or LTO if there ever was one.
Making a function call to compare two integers or floats sounds
wrong, however low the overhead may be.
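For illustration, this is exactly the cost the C qsort interface bakes in:
every comparison goes through an indirect call like the one below, whereas
a template (or LTO across the callback) can inline it down to a single
compare instruction:

#include <stdlib.h>

/* qsort-style comparator: one indirect call per comparison. */
static int cmp_int(const void *pa, const void *pb)
{
    int a = *(const int *)pa, b = *(const int *)pb;
    return (a > b) - (a < b);
}

/* usage: qsort(array, n, sizeof(int), cmp_int); */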

Re: Faster Sort Algorithms

<u69g9c$300kh$2@newsreader4.netcologne.de>

https://news.novabbs.org/devel/article-flat.php?id=32819&group=comp.arch#32819

 by: Thomas Koenig - Tue, 13 Jun 2023 10:26 UTC

robf...@gmail.com <robfi680@gmail.com> schrieb:

> I have been considering using 128-bit bundles containing three
> 40-bit instructions in fixed locations. That would get back some of the
> benefits of the 4-byte aligned instructions, but it wastes 6% of memory.
> The I$ could be made 120-bits wide so no wastage there.

We previously discussed 128-bit bundles with instructions being
a multiple of 20 bits, with stop bits in the header.

Frequent instructions like "add ra,rb,rb" or "add ra,rb,#1" could
then fall in the 20-bit category.

If you manage 40% of your instructions to use 20 bits and 60% to use
40 bits, you break even in code density with a 32-bit encoding.
But 40 bits are a better fit if you have more than 32 registers.
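(Checking the arithmetic: 0.4 * 20 + 0.6 * 40 = 8 + 24 = 32 bits average
per instruction, exactly the 32-bit break-even point.)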

Putting 32-bit or 64-bit constants into the stream is less
convenient, and will lead to losses in code density vs.,
for example, Mitch's constant encoding.

You'll need to run a frequency analysis on instructions to see
if it makes sense for you or not.

